How to get out of local minima with stochastic gradient descent?














I'm not programming a neural network; I'm looking at this from a non-hands-on, theoretical point of view, and I'm currently wondering how to escape a local minimum and reach the global minimum.



Suppose you start at a point, for instance the red one in the figure below:
[figure: error surface with a steep local minimum near the starting point and the global minimum behind it]



When you compute the gradient of the error function and step in the direction of steepest descent, you end up in that immediate local minimum and, as far as I know, you get stuck there. How do people training neural networks deal with this? Do they start from a new random configuration of weights at each batch and check whether the cost is smaller, or is there some way of getting straight to the global minimum?



I've heard of a method that resets the learning rate to 'pop' out of local minima, but I'm not sure how that works when the gradient is 0. I've also heard that stochastic gradient descent is more reliable than gradient descent at finding global minima, but I don't know how using the training data in batches rather than all at once allows it to steer around the local minimum in the example, which is clearly steeper than the path to the global minimum behind it.
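To make the scenario concrete, here is a toy 1-D sketch of the situation I'm picturing (a made-up error surface, not an actual network): there is a shallow local minimum on the side where you start and a deeper global minimum behind it, and plain gradient descent simply rolls into the nearby local minimum and stays there once the gradient is 0.

# Toy illustration only: f has a local minimum near w = +1 and a deeper
# global minimum near w = -1. Gradient descent started from the "red point"
# on the right converges to the nearby local minimum and never leaves it.
def f(w):            # double-well error surface, tilted so the left well is deeper
    return (w**2 - 1)**2 + 0.3 * w

def grad(w):         # derivative of f
    return 4 * w * (w**2 - 1) + 0.3

w = 1.5              # starting point on the right-hand slope
for _ in range(1000):
    w -= 0.01 * grad(w)   # step in the direction of steepest descent

print(w, f(w))       # ends near w ≈ 0.96 (local min), not near w ≈ -1.04 (global min)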










machine-learning neural-network gradient-descent






asked by Lewis Kelsey

1 Answer







Stochastic gradient descent loss landscape vs. gradient descent loss landscape




"I don't know how using the training data in batches rather than all at once allows it to steer around the local minimum in the example, which is clearly steeper than the path to the global minimum behind it."




Stochastic gradient descent is better able to avoid local minima because the loss landscape of a single batch is different from the loss landscape of the whole dataset (the case where you compute the loss on all the data and then update the parameters). That means the gradient over the whole dataset can be 0 at some point while, at that same point, the gradient of a batch is different from 0, so we can hope to move in a direction other than toward the local minimum.
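For example, here is a quick NumPy sketch of that point. It uses a toy linear model, chosen only because its full-dataset gradient is exactly zero at the least-squares solution; at a local minimum of a non-convex network loss the same argument applies. At the stationary point of the whole-dataset loss, the gradient computed on a single mini-batch is clearly non-zero, so a mini-batch update would still move the parameters.

# Illustration (toy linear model, not from the original answer): at a point where
# the full-dataset gradient is ~0, a mini-batch gradient is generally NOT 0.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # 1000 samples, 5 features
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=1000)   # noisy targets

# Point where the gradient of the full-dataset MSE is (numerically) zero:
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

def grad_mse(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad_mse(X, y, w_star)
batch = slice(0, 32)                           # one mini-batch of 32 samples
batch_grad = grad_mse(X[batch], y[batch], w_star)

print("norm of full-dataset gradient:", np.linalg.norm(full_grad))   # numerically ~0
print("norm of mini-batch gradient: ", np.linalg.norm(batch_grad))   # clearly non-zero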



Neural network architecture and loss landscape



Your neural network architecture can also help you escape local minima. For example, see this work: Visualizing the Loss Landscape of Neural Nets. It shows that skip connections can smooth the loss landscape and hence help optimizers find the global minimum more easily.
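For reference, a skip connection is nothing more than adding a block's input back to its output. A minimal sketch, assuming PyTorch (the paper's actual architectures are more involved):

# Minimal residual (skip-connection) block, sketched in PyTorch for illustration.
# The output is x + f(x), so even when the learned transformation f is small the
# block stays close to the identity, which is part of why such architectures tend
# to have smoother loss landscapes.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)      # skip connection: add the input back to the output

x = torch.randn(8, 16)            # a batch of 8 vectors of size 16
print(ResidualBlock(16)(x).shape) # torch.Size([8, 16])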



Local minima vs. the global optimum



Finally, there is some work suggesting that local minima have almost the same function value as the global optimum. See this question and answer.






answered by Antonio Jurić





















