How does the bounding box regressor work in Fast R-CNN?












5












$begingroup$


In the fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding box parameters are continuous variables. These values are predicted using regression method. Unlike other neural network outputs, these values do not represent the probability of output classes. Rather, they are physical values representing position and size of a bounding box.



The exact method of how this regression learning happens is not clear to me. Linear regression and image classification by deep learning are well explained separately earlier. But how the linear regression algorithm works in the CNN settings is not explained so clearly.



Can you explain the basic concept for easy understanding?










share|improve this question









$endgroup$

















    5












    $begingroup$


    In the fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding box parameters are continuous variables. These values are predicted using regression method. Unlike other neural network outputs, these values do not represent the probability of output classes. Rather, they are physical values representing position and size of a bounding box.



    The exact method of how this regression learning happens is not clear to me. Linear regression and image classification by deep learning are well explained separately earlier. But how the linear regression algorithm works in the CNN settings is not explained so clearly.



    Can you explain the basic concept for easy understanding?










    share|improve this question









    $endgroup$















      5












      5








      5


      2



      $begingroup$


      In the fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding box parameters are continuous variables. These values are predicted using regression method. Unlike other neural network outputs, these values do not represent the probability of output classes. Rather, they are physical values representing position and size of a bounding box.



      The exact method of how this regression learning happens is not clear to me. Linear regression and image classification by deep learning are well explained separately earlier. But how the linear regression algorithm works in the CNN settings is not explained so clearly.



      Can you explain the basic concept for easy understanding?










      share|improve this question









      $endgroup$




      In the fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding box parameters are continuous variables. These values are predicted using regression method. Unlike other neural network outputs, these values do not represent the probability of output classes. Rather, they are physical values representing position and size of a bounding box.



      The exact method of how this regression learning happens is not clear to me. Linear regression and image classification by deep learning are well explained separately earlier. But how the linear regression algorithm works in the CNN settings is not explained so clearly.



      Can you explain the basic concept for easy understanding?







      image-recognition object-recognition yolo faster-rcnn






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Apr 20 '18 at 7:25









      Saptarshi RoySaptarshi Roy

      12718




      12718






















          2 Answers
          2






          active

          oldest

          votes


















          3












          $begingroup$

          The paper cited does not mention linear regression at all. What it does is using a neural network to predict continuous variables, and refers to that as regression.



          The regression that is defined (which is not linear at all), is just a CNN with convolutional layers, and fully connected layers, but in the last fully connected layer, it does not apply sigmoid or softmax, which is what is typically used in classification, as the values correspond to probabilities. Instead, what this CNN outputs are four values $(r, c, h, w)$, where $(r, c)$ specify the values of the position of the left corner and $(h, w)$ the height and width of the window. In order to train this NN, the loss function will penalize when the outputs of the NN are very different from the labelled $(r, c, h, w)$ in the training set.






          share|improve this answer









          $endgroup$













          • $begingroup$
            Yes. It was my mistake to mention the regressor as linear.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:12










          • $begingroup$
            Did I answer your question though?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:42










          • $begingroup$
            After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:49








          • 2




            $begingroup$
            In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:58










          • $begingroup$
            It is somewhat clearer than before.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 9:27





















          0












          $begingroup$


          A very clear and in-depth explanation is provided by the slow R-CNN paper by Author(Girshick et. al) on page 12: C. Bounding-box regression and I simply paste here for quick reading:




          enter image description hereenter image description here




          Moreover, the author took inspiration from an earlier paper and talked about the difference in the two techniques is below:




          enter image description here




          After which in Fast-RCNN paper which you referenced to, the author changed the loss function for BB regression task from regularized least squares(ridge regression) to smooth L1 which is less sensitive to outliers!. Also, you embed this smooth L1 loss in the multi-task loss function so that we can jointly train for classification and bounding-box regression that wasn't done before in R-CNN or SPP-net!




          enter image description here




          However, the same author has changed the loss function again in the upcoming paper faster-RCNN
          Later, in FCN
          Many a time, in order to learn about a topic, you need to do backtracking through research papers! :) Hope it helps!







          share|improve this answer









          $endgroup$














            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "557"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f30557%2fhow-does-the-bounding-box-regressor-work-in-fast-r-cnn%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            3












            $begingroup$

            The paper cited does not mention linear regression at all. What it does is using a neural network to predict continuous variables, and refers to that as regression.



            The regression that is defined (which is not linear at all), is just a CNN with convolutional layers, and fully connected layers, but in the last fully connected layer, it does not apply sigmoid or softmax, which is what is typically used in classification, as the values correspond to probabilities. Instead, what this CNN outputs are four values $(r, c, h, w)$, where $(r, c)$ specify the values of the position of the left corner and $(h, w)$ the height and width of the window. In order to train this NN, the loss function will penalize when the outputs of the NN are very different from the labelled $(r, c, h, w)$ in the training set.






            share|improve this answer









            $endgroup$













            • $begingroup$
              Yes. It was my mistake to mention the regressor as linear.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:12










            • $begingroup$
              Did I answer your question though?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:42










            • $begingroup$
              After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:49








            • 2




              $begingroup$
              In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:58










            • $begingroup$
              It is somewhat clearer than before.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 9:27


















            3












            $begingroup$

            The paper cited does not mention linear regression at all. What it does is using a neural network to predict continuous variables, and refers to that as regression.



            The regression that is defined (which is not linear at all), is just a CNN with convolutional layers, and fully connected layers, but in the last fully connected layer, it does not apply sigmoid or softmax, which is what is typically used in classification, as the values correspond to probabilities. Instead, what this CNN outputs are four values $(r, c, h, w)$, where $(r, c)$ specify the values of the position of the left corner and $(h, w)$ the height and width of the window. In order to train this NN, the loss function will penalize when the outputs of the NN are very different from the labelled $(r, c, h, w)$ in the training set.






            share|improve this answer









            $endgroup$













            • $begingroup$
              Yes. It was my mistake to mention the regressor as linear.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:12










            • $begingroup$
              Did I answer your question though?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:42










            • $begingroup$
              After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:49








            • 2




              $begingroup$
              In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:58










            • $begingroup$
              It is somewhat clearer than before.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 9:27
















            3












            3








            3





            $begingroup$

            The paper cited does not mention linear regression at all. What it does is using a neural network to predict continuous variables, and refers to that as regression.



            The regression that is defined (which is not linear at all), is just a CNN with convolutional layers, and fully connected layers, but in the last fully connected layer, it does not apply sigmoid or softmax, which is what is typically used in classification, as the values correspond to probabilities. Instead, what this CNN outputs are four values $(r, c, h, w)$, where $(r, c)$ specify the values of the position of the left corner and $(h, w)$ the height and width of the window. In order to train this NN, the loss function will penalize when the outputs of the NN are very different from the labelled $(r, c, h, w)$ in the training set.






            share|improve this answer









            $endgroup$



            The paper cited does not mention linear regression at all. What it does is using a neural network to predict continuous variables, and refers to that as regression.



            The regression that is defined (which is not linear at all), is just a CNN with convolutional layers, and fully connected layers, but in the last fully connected layer, it does not apply sigmoid or softmax, which is what is typically used in classification, as the values correspond to probabilities. Instead, what this CNN outputs are four values $(r, c, h, w)$, where $(r, c)$ specify the values of the position of the left corner and $(h, w)$ the height and width of the window. In order to train this NN, the loss function will penalize when the outputs of the NN are very different from the labelled $(r, c, h, w)$ in the training set.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Apr 20 '18 at 7:58









            David MasipDavid Masip

            2,5511428




            2,5511428












            • $begingroup$
              Yes. It was my mistake to mention the regressor as linear.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:12










            • $begingroup$
              Did I answer your question though?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:42










            • $begingroup$
              After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:49








            • 2




              $begingroup$
              In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:58










            • $begingroup$
              It is somewhat clearer than before.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 9:27




















            • $begingroup$
              Yes. It was my mistake to mention the regressor as linear.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:12










            • $begingroup$
              Did I answer your question though?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:42










            • $begingroup$
              After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 8:49








            • 2




              $begingroup$
              In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
              $endgroup$
              – David Masip
              Apr 20 '18 at 8:58










            • $begingroup$
              It is somewhat clearer than before.
              $endgroup$
              – Saptarshi Roy
              Apr 20 '18 at 9:27


















            $begingroup$
            Yes. It was my mistake to mention the regressor as linear.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:12




            $begingroup$
            Yes. It was my mistake to mention the regressor as linear.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:12












            $begingroup$
            Did I answer your question though?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:42




            $begingroup$
            Did I answer your question though?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:42












            $begingroup$
            After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:49






            $begingroup$
            After your comment (and few subsequent google search), I have understood that NN can very well solve regression problems by replacing the last layer. But the intuitive understanding of how the exact value of lengths coming is still not there. For example, the layers of CNN indicate different features of an image (features like edges, color etc.). The training method finds the correct filters (weights) to extract only the relevant features to discriminate the positive examples from the negative ones. I was looking for a similar explanation for the regression part.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 8:49






            2




            2




            $begingroup$
            In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:58




            $begingroup$
            In the regression setting, the training method finds the correct filters (weights) to extract the relevant features to find the position of the top left edge, as well as the height and the width. In the end, what you have is a cost function that measures how good you are doing on predicting these features. And that is what deep learning is all about: give me a differentiable cost function, some labelled images and I'll find you a way to predict the labels. Is this more clear?
            $endgroup$
            – David Masip
            Apr 20 '18 at 8:58












            $begingroup$
            It is somewhat clearer than before.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 9:27






            $begingroup$
            It is somewhat clearer than before.
            $endgroup$
            – Saptarshi Roy
            Apr 20 '18 at 9:27













            0












            $begingroup$


            A very clear and in-depth explanation is provided by the slow R-CNN paper by Author(Girshick et. al) on page 12: C. Bounding-box regression and I simply paste here for quick reading:




            enter image description hereenter image description here




            Moreover, the author took inspiration from an earlier paper and talked about the difference in the two techniques is below:




            enter image description here




            After which in Fast-RCNN paper which you referenced to, the author changed the loss function for BB regression task from regularized least squares(ridge regression) to smooth L1 which is less sensitive to outliers!. Also, you embed this smooth L1 loss in the multi-task loss function so that we can jointly train for classification and bounding-box regression that wasn't done before in R-CNN or SPP-net!




            enter image description here




            However, the same author has changed the loss function again in the upcoming paper faster-RCNN
            Later, in FCN
            Many a time, in order to learn about a topic, you need to do backtracking through research papers! :) Hope it helps!







            share|improve this answer









            $endgroup$


















              0












              $begingroup$


              A very clear and in-depth explanation is provided by the slow R-CNN paper by Author(Girshick et. al) on page 12: C. Bounding-box regression and I simply paste here for quick reading:




              enter image description hereenter image description here




              Moreover, the author took inspiration from an earlier paper and talked about the difference in the two techniques is below:




              enter image description here




              After which in Fast-RCNN paper which you referenced to, the author changed the loss function for BB regression task from regularized least squares(ridge regression) to smooth L1 which is less sensitive to outliers!. Also, you embed this smooth L1 loss in the multi-task loss function so that we can jointly train for classification and bounding-box regression that wasn't done before in R-CNN or SPP-net!




              enter image description here




              However, the same author has changed the loss function again in the upcoming paper faster-RCNN
              Later, in FCN
              Many a time, in order to learn about a topic, you need to do backtracking through research papers! :) Hope it helps!







              share|improve this answer









              $endgroup$
















                0












                0








                0





                $begingroup$


                A very clear and in-depth explanation is provided by the slow R-CNN paper by Author(Girshick et. al) on page 12: C. Bounding-box regression and I simply paste here for quick reading:




                enter image description hereenter image description here




                Moreover, the author took inspiration from an earlier paper and talked about the difference in the two techniques is below:




                enter image description here




                After which in Fast-RCNN paper which you referenced to, the author changed the loss function for BB regression task from regularized least squares(ridge regression) to smooth L1 which is less sensitive to outliers!. Also, you embed this smooth L1 loss in the multi-task loss function so that we can jointly train for classification and bounding-box regression that wasn't done before in R-CNN or SPP-net!




                enter image description here




                However, the same author has changed the loss function again in the upcoming paper faster-RCNN
                Later, in FCN
                Many a time, in order to learn about a topic, you need to do backtracking through research papers! :) Hope it helps!







                share|improve this answer









                $endgroup$




                A very clear and in-depth explanation is provided by the slow R-CNN paper by Author(Girshick et. al) on page 12: C. Bounding-box regression and I simply paste here for quick reading:




                enter image description hereenter image description here




                Moreover, the author took inspiration from an earlier paper and talked about the difference in the two techniques is below:




                enter image description here




                After which in Fast-RCNN paper which you referenced to, the author changed the loss function for BB regression task from regularized least squares(ridge regression) to smooth L1 which is less sensitive to outliers!. Also, you embed this smooth L1 loss in the multi-task loss function so that we can jointly train for classification and bounding-box regression that wasn't done before in R-CNN or SPP-net!




                enter image description here




                However, the same author has changed the loss function again in the upcoming paper faster-RCNN
                Later, in FCN
                Many a time, in order to learn about a topic, you need to do backtracking through research papers! :) Hope it helps!








                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered yesterday









                anuanu

                1688




                1688






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Data Science Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f30557%2fhow-does-the-bounding-box-regressor-work-in-fast-r-cnn%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    How to label and detect the document text images

                    Tabula Rosettana

                    Aureus (color)