Optimization methods used in machine learning


























I don't have much knowledge in the field of ML, but from my naive point of view it always seems that some variant of gradient descent is used when training neural networks. As such, I was wondering why more advanced methods don't seem to be used, such as SQP algorithms or interior-point methods. Is it because training a neural net is always a simple unconstrained optimization problem, and the above-mentioned methods would be unnecessary? Any insight would be great, thanks.
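The premise of the question is correct: training a neural network is typically posed as unconstrained minimization of a loss over the weights, which is why a bare gradient loop suffices. A minimal sketch (editorial illustration, not from the original post; the toy data, architecture, and hyperparameters are made up):

```python
import numpy as np

# Toy setup (made up for illustration): 2-D inputs, linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# One hidden layer of 8 tanh units, sigmoid output, cross-entropy loss.
W1, b1 = 0.5 * rng.normal(size=(2, 8)), np.zeros(8)
w2, b2 = 0.5 * rng.normal(size=8), 0.0
lr = 0.5

for step in range(2000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # backward pass (hand-derived gradients of the loss w.r.t. each weight)
    dz2 = (p - y) / len(y)
    dw2, db2 = h.T @ dz2, dz2.sum()
    dz1 = np.outer(dz2, w2) * (1 - h**2)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # plain gradient step: no constraints, hence no SQP/interior-point machinery
    W1 -= lr * dW1; b1 -= lr * db1
    w2 -= lr * dw2; b2 -= lr * db2
```

There are no constraints on the weights, so the KKT systems and feasibility handling that SQP and interior-point methods are built around have nothing to do here.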


































  • Because the more expensive methods don't offer enough advantage over simple gradient descent. Or maybe we do not know how to harness them well enough. Why gradient descent works as well as it does is still debated; cf. e.g. The Marginal Value of Adaptive Gradient Methods in Machine Learning. Welcome to the site! – Emre, Feb 22 '18 at 17:24












  • @Emre Thanks for your answer. Don't you think GD approaches using momentum perform much better? – Vaalizaadeh, Feb 22 '18 at 18:12






  • It has for me; momentum acts as a damper, letting the optimizer power through rough patches of the loss surface, but here we have a paper that questions this folk wisdom. I'll keep using it until the dust settles. – Emre, Feb 22 '18 at 18:15












  • Excuse me, @Emre: based on what you have referred to, if you wanted to train a network from scratch, would you prefer GD over Adam? – Vaalizaadeh, Feb 23 '18 at 13:24










  • I would not, because GD needs tuning, and Adam will beat untuned GD. When I hear "advanced methods" I think of (quasi-)second-order or natural-gradient methods. – Emre, Feb 23 '18 at 17:29
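The momentum discussion above can be made concrete. A toy sketch (editorial illustration, not from the thread; the quadratic, step size, and momentum coefficient are arbitrary choices): on an ill-conditioned quadratic, heavy-ball momentum typically reaches a lower loss than plain GD in the same number of steps, because it damps the oscillations along the steep axis while accumulating speed along the shallow one.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 100) x; minimum 0 at x = 0.
A = np.diag([1.0, 100.0])

def loss(x):
    return 0.5 * x @ A @ x

def run(use_momentum, steps=200, lr=0.015, beta=0.9):
    x = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        g = A @ x                    # gradient of the quadratic
        if use_momentum:
            v = beta * v + g         # heavy-ball velocity accumulation
            x = x - lr * v
        else:
            x = x - lr * g
    return loss(x)

plain, with_momentum = run(False), run(True)
```

With these settings, plain GD is still crawling along the shallow direction after 200 steps, while the momentum run is several orders of magnitude closer to the optimum.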


















machine-learning neural-network training






asked Feb 22 '18 at 16:49 – InquisitiveInquirer (1061)





1 Answer































In my reply to

Does gradient descent always converge to an optimum?

I explain that standard gradient descent works well because backtracking gradient descent works well (proven in our recent paper mentioned in that post), and in the long run backtracking gradient descent behaves like standard gradient descent.
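For readers unfamiliar with the term, backtracking gradient descent chooses each step size by shrinking a trial step until a sufficient-decrease test passes. A minimal sketch (assuming the standard Armijo backtracking rule; the referenced paper's exact scheme and constants may differ):

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha0=1.0, c=1e-4, rho=0.5, tol=1e-8, max_iter=1000):
    """Gradient descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        # shrink the step until f decreases "enough" (Armijo condition)
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= rho
        x = x - alpha * g
    return x

# usage on the Rosenbrock function, a classic non-convex test problem
def rosen(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosen_grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                     200 * (x[1] - x[0]**2)])

x_star = backtracking_gd(rosen, rosen_grad, [-1.0, 1.0], max_iter=20000)
```

Each accepted step is guaranteed to decrease f, which is the property that makes the method robust without manual learning-rate tuning.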





























answered Nov 23 '18 at 13:40 – Tuyen (313)





























