Is Gradient Descent central to every optimizer?












Score: 4

I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several other optimizers.

Tags: machine-learning neural-network deep-learning optimization gradient-descent

Asked 13 hours ago by InAFlash; edited 9 hours ago by doppelgreener












  • I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent": en.wikipedia.org/wiki/Coordinate_descent
    – frank, 10 hours ago

  • I suggest making the title more specific, since the answer for EVERY optimizer is obviously no.
    – Esmailian, 9 hours ago


























4 Answers


















Score: 9

No. Gradient descent is used in optimization algorithms that use the gradient as the basis of their step movement. Adam, Adagrad, and RMSProp all use some form of gradient descent; however, they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena and do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.



Check out this visualization of Bayesian Optimization in action.



There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.



Non-derivative-based optimization algorithms can be especially useful for irregular non-convex cost functions, non-differentiable cost functions, or cost functions whose left and right derivatives differ at some points.



To understand why one might choose a non-derivative-based optimization algorithm, take a look at the Rastrigin benchmark function. Gradient-based optimization is not well suited to functions with so many local minima; a quick sketch of this follows.



(Image: the Rastrigin benchmark function)
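
To make this concrete, here is a minimal Python/NumPy sketch (mine, not part of the original answer): plain gradient descent on the 2-D Rastrigin function settles into whichever local minimum is nearest to its starting point, rather than the global minimum at the origin.

    import numpy as np

    A = 10.0

    def rastrigin(x):
        # f(x) = A*n + sum(x_i^2 - A*cos(2*pi*x_i)); the global minimum is f(0) = 0
        return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

    def rastrigin_grad(x):
        # analytic gradient: 2*x_i + 2*pi*A*sin(2*pi*x_i)
        return 2 * x + 2 * np.pi * A * np.sin(2 * np.pi * x)

    x = np.array([3.3, -2.2])            # arbitrary starting point
    lr = 0.002                           # small fixed learning rate
    for _ in range(2000):
        x = x - lr * rastrigin_grad(x)   # vanilla gradient descent step

    print(x, rastrigin(x))  # ends near the local minimum around (3, -2), far from (0, 0)

A population-based or Bayesian method, which samples the space instead of following the local slope, does not get trapped this way as easily.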






– answered 9 hours ago by jeb02 (new contributor)













  • Thank you very much. Loved your answer.
    – InAFlash, 5 hours ago



















Score: 5

According to the title:

No. Only specific types of optimizers are based on gradient descent. A straightforward counterexample is optimization over a discrete space, where the gradient is undefined.



According to the body:

Yes. Adam, Adagrad, RMSProp, and other similar optimizers (Nesterov, Nadam, etc.) all try to propose an adaptive step size (learning rate) for gradient descent, to improve convergence speed without sacrificing performance (i.e., without ending up in a worse local minimum/maximum). A minimal sketch of this idea follows.
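
As a hedged illustration (my own sketch, not from the answer), here are single RMSProp and Adam updates in NumPy: both still move against the gradient g, and differ only in how the per-parameter step size is adapted over time.

    import numpy as np

    def rmsprop_step(w, g, state, lr=1e-3, rho=0.9, eps=1e-8):
        state["v"] = rho * state["v"] + (1 - rho) * g**2        # running average of squared gradients
        return w - lr * g / (np.sqrt(state["v"]) + eps), state  # gradient step, rescaled per weight

    def adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        state["t"] += 1
        state["m"] = b1 * state["m"] + (1 - b1) * g              # first moment (momentum)
        state["v"] = b2 * state["v"] + (1 - b2) * g**2           # second moment
        m_hat = state["m"] / (1 - b1**state["t"])                # bias correction
        v_hat = state["v"] / (1 - b2**state["t"])
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), state    # still a (rescaled) gradient step

    # usage: state = {"v": np.zeros_like(w)} for RMSProp,
    #        state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0} for Adam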



It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with the second-order derivatives of the loss function (gradient descent works with the first-order derivative). These methods have lost out to gradient descent on the speed-scalability trade-off, due to the large number of model parameters in practical problems.



Some extra notes




  1. The shape of the loss function depends on both the model parameters and the data, so choosing the best method is always task-dependent and needs trial and error.


  2. The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data set. This technique is orthogonal to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data set).


  3. Projected gradient descent is used when some regions of parameter space are infeasible (invalid, not allowed), so we bring a parameter back (project it) to a feasible region whenever it goes into an infeasible one. For example, suppose we only allow $\left\| w \right\|_2 \leq 1$; when the parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible point, depending on the trajectory and the specific method. This technique is also orthogonal to the mentioned methods: we could have projected stochastic Adam. A small sketch of the projection step is given after this list.
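
A small sketch (mine) of point 3, assuming the feasible set $\left\| w \right\|_2 \leq 1$ from the example: an ordinary gradient step followed by a projection back onto the unit L2 ball.

    import numpy as np

    def project_l2_ball(w, radius=1.0):
        norm = np.linalg.norm(w)
        return w if norm <= radius else w * (radius / norm)   # rescale infeasible points onto the ball

    def projected_gd_step(w, grad_fn, lr=0.1):
        w = w - lr * grad_fn(w)        # ordinary gradient step
        return project_l2_ball(w)      # bring an infeasible point back to the feasible region

    print(project_l2_ball(np.array([0.0, 1.1])))  # the infeasible (0, 1.1) is projected to (0, 1.0)

The same projection could be applied after a stochastic Adam update, giving the "projected stochastic Adam" mentioned above.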







– answered 12 hours ago (edited 5 hours ago) by Esmailian





















Score: 2

Well, you picked optimizers that are used in neural networks, and those optimizers do use gradient-based algorithms. Most of the time, gradient-based algorithms are used in neural networks. Why is that? Well, would you prefer to try to find the minimum knowing the slope of the curve, or without knowing it?
When you can't compute the gradient, you fall back to derivative-free optimization (a small sketch follows below).
That being said, there are cases where, even though you have information about the gradient, it is better to use a gradient-free method. This is usually the case for functions that have a lot of local minima. Population-based algorithms such as evolution strategies and genetic algorithms have the upper hand here.
And there is also the branch of combinatorial optimization, where a whole different set of tools is used.
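
For illustration, here is a hedged sketch (mine, not from the answer) of one derivative-free optimizer, a (1+1) evolution strategy: it only ever evaluates the objective f, never its gradient.

    import numpy as np

    def one_plus_one_es(f, x0, sigma=0.5, iters=2000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        fx = f(x)
        for _ in range(iters):
            cand = x + sigma * rng.standard_normal(x.shape)   # random mutation, no gradient needed
            fc = f(cand)
            if fc < fx:                                       # keep the candidate only if it improves
                x, fx = cand, fc
        return x, fx

    # quick check on a bumpy 1-D function with many local minima:
    bumpy = lambda z: float(z[0]**2 + 10 * np.sin(3 * z[0])**2)
    print(one_plus_one_es(bumpy, [4.0]))   # often lands near the global minimum at 0, which a
                                           # pure gradient descent run started at 4.0 would not reach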






– answered 6 hours ago by Christian (new contributor)





















Score: 1

The answer to the question may be no. The reason is simply the number of optimisation algorithms that are available; choosing one depends highly on the context and the time you have for optimising. For instance, the genetic algorithm is a well-known optimisation approach that does not have any gradient descent inside it. There are also other approaches, like backtracking, in some contexts. All of these can be used without leveraging gradient descent at every step.



On the other hand, for tasks like regression you can find a closed form for solving the problem and locating the extrema, but the point is that, depending on the feature space and the number of inputs, you may choose either the closed-form equation or gradient descent to decrease the amount of calculation.



While there are so many optimisation algorithms, gradient-descent-based approaches are used more in neural networks, for multiple reasons. First of all, they are very fast. In deep learning you have to provide so much data that it cannot all be loaded into memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. This is a bit of a statistical point, but you can consider that each batch you feed to your network has a roughly similar distribution to the real data, and can be representative enough to yield a gradient close to the real gradient of the cost function, which would have to be constructed using all the data at hand. A tiny sketch of this is shown below.
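
A tiny sketch (mine, with synthetic data) of that statistical point: the gradient computed on a random mini-batch is a noisy but usually well-aligned estimate of the full-data gradient.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10000)
    w = np.zeros(5)

    def mse_grad(Xb, yb, w):
        return Xb.T @ (Xb @ w - yb) / len(yb)             # mean-squared-error gradient on a batch

    full_grad = mse_grad(X, y, w)
    idx = rng.choice(len(y), size=64, replace=False)      # a random mini-batch of 64 samples
    batch_grad = mse_grad(X[idx], y[idx], w)

    cos = full_grad @ batch_grad / (np.linalg.norm(full_grad) * np.linalg.norm(batch_grad))
    print(cos)   # typically close to 1: the mini-batch gradient points in nearly the same direction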



Second, the complexity of finding extrema using matrices and their inverses is $O(n^3)$ for a simple regression task, where the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case you have to bring all the data into memory at once, which is not possible when you deal with big-data tasks. A rough comparison of the two routes is sketched below.
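
Here is a minimal comparison (my own sketch, with made-up data) of the closed-form solution $w = (X^tX)^{-1}(X^ty)$ against batch gradient descent on the same least-squares problem.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=200)

    # closed form: one cubic-cost solve in the number of features, but X must fit in memory
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # gradient descent: repeated cheap passes over the data, no matrix inversion required
    w_gd = np.zeros(3)
    lr = 0.01
    for _ in range(2000):
        grad = X.T @ (X @ w_gd - y) / len(y)   # gradient of the mean squared error
        w_gd -= lr * grad

    print(w_closed, w_gd)   # both end up close to the true weights [2.0, -1.0, 0.5]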



Third, there are optimisation problems that do not necessarily have a closed-form solution. Logistic regression is one of them.






– answered 11 hours ago (edited 10 hours ago) by Media












