Is Gradient Descent central to every optimizer?
I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several others.
Tags: machine-learning, neural-network, deep-learning, optimization, gradient-descent
I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent": en.wikipedia.org/wiki/Coordinate_descent
– frank

I suggest making the title more specific, since the answer for EVERY optimizer is obviously no.
– Esmailian
4 Answers
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of their step updates. Adam, Adagrad, and RMSProp all use some form of gradient descent; however, they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena and do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.

(Figure: animated visualization of Bayesian Optimization in action.)

There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.

Non-derivative-based optimization algorithms can be especially useful for irregular non-convex cost functions, non-differentiable cost functions, or cost functions whose left and right derivatives differ.

To understand why one might choose a non-derivative-based optimization algorithm, take a look at the Rastrigin benchmark function. Gradient-based optimization is not well suited to functions with so many local minima.
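To make that concrete, here is a small illustrative sketch (not from the original answer) that runs plain gradient descent and a naive random search on the 2-D Rastrigin function; the learning rate, iteration counts, and search range are arbitrary choices for demonstration:

```python
import numpy as np

def rastrigin(x):
    # 2-D Rastrigin: highly multimodal, global minimum f(0, 0) = 0.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def rastrigin_grad(x):
    return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x0 = rng.uniform(-5.12, 5.12, size=2)

# Plain gradient descent: follows the local slope from the starting point.
x = x0.copy()
for _ in range(2000):
    x = x - 0.001 * rastrigin_grad(x)

# Naive gradient-free baseline: keep the best of 2000 uniform random samples.
best = x0.copy()
for _ in range(2000):
    cand = rng.uniform(-5.12, 5.12, size=2)
    if rastrigin(cand) < rastrigin(best):
        best = cand

print("gradient descent:", rastrigin(x))
print("random search:   ", rastrigin(best))
```

On most runs, gradient descent settles into the local minimum nearest its starting point, while even blind random sampling often stumbles on a better basin — which is the point of the Rastrigin example.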
– jeb02
Thank you very much. Loved your answer.
– InAFlash
According to the title:

No. Only specific types of optimizers are based on gradient descent. A straightforward counterexample is when optimization is over a discrete space, where the gradient is undefined.

According to the body:

Yes. Adam, Adagrad, RMSProp, and other similar optimizers (Nesterov, Nadam, etc.) all try to propose an adaptive step size (learning rate) for gradient descent in order to improve convergence speed without sacrificing performance (i.e., without ending up in a worse local minimum/maximum).
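As a sketch of what "adaptive step size" means in practice, here is the Adam update rule in its standard formulation; the quadratic toy objective and the larger-than-default learning rate are my own choices for illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a gradient-descent step with a per-parameter adaptive scale."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # effective step size differs per parameter
    return w, m, v

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # approaches the minimum at the origin
```

The inner loop is still a gradient step; what Adam adds is the per-parameter rescaling of that step.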
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, which work with the second-order derivatives of the loss function (gradient descent works with the first-order derivative). These methods have lost the speed–scalability trade-off to gradient descent because of the large number of model parameters in practical problems.
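For contrast, here is a minimal Newton step on a toy quadratic loss (my own example). The Hessian is an n × n matrix for n parameters, which is exactly the scalability problem mentioned above when n is in the millions:

```python
import numpy as np

# Toy quadratic loss: L(w) = 0.5 * w^T A w - b^T w, gradient A w - b, Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

w = np.zeros(2)
grad = A @ w - b
hess = A                              # for a quadratic loss the Hessian is constant
w = w - np.linalg.solve(hess, grad)   # Newton step: w <- w - H^{-1} * gradient
print(w, A @ w - b)                   # one step lands on the exact minimum (gradient ~ 0)
```

For a quadratic this converges in a single step; gradient descent would need many small steps, but it never has to form or invert an n × n matrix.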
Some extra notes
The shape of the loss function depends on both the model parameters and the data, so choosing the best method is always task-dependent and needs trial and error.

The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is orthogonal to all of the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).

Projected gradient descent is used when some regions of the parameter space are infeasible (invalid, not allowed), so we bring back (project) a parameter to the feasible region when it steps into an infeasible one. For example, suppose we only allow $\left\| w \right\|_2 \leq 1$: when the parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible point, depending on the trajectory and the specific method. This technique is also orthogonal to the mentioned methods; we could have a projected stochastic Adam.
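As an illustration of that projection step (my own sketch, matching the $\|w\|_2 \leq 1$ example; the gradient values are made up):

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Project w onto the L2 ball of the given radius (nearest feasible point)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

# The example from above: (0, 1.1) is infeasible and is projected back to (0, 1).
print(project_l2_ball(np.array([0.0, 1.1])))   # -> [0. 1.]

# One projected gradient step: ordinary gradient step, then projection.
w = np.array([0.8, 0.7])
g = np.array([-0.5, -0.5])          # a made-up gradient at w, just for illustration
w = project_l2_ball(w - 0.1 * g)    # the step leaves the ball, the projection brings it back
print(w)
```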
– Esmailian
Well, you picked optimizers that are used in neural networks, and those optimizers do use gradient-based algorithms. Most of the time, gradient-based algorithms are used in neural networks. Why is that? Well, would you prefer to search for the minimum knowing the slope of the curve, or without knowing it?

When you can't compute the gradient, you fall back to derivative-free optimization.

That being said, there are cases where, even though you have information about the gradient, it is better to use a gradient-free method. This is usually the case with functions that have a lot of local minima. Population-based algorithms such as evolution strategies and genetic algorithms have the upper hand here.

And there is also the branch of combinatorial optimization, where a whole different set of tools is used.
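As a hedged illustration of a population-flavoured, gradient-free method, here is a minimal (1+1) evolution strategy; the Rastrigin benchmark and the fixed mutation step size are my own choices, picked only because the function has many local minima:

```python
import numpy as np

def rastrigin(x):
    # Multimodal benchmark: many local minima, global minimum at the origin.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(42)
x = rng.uniform(-5.12, 5.12, size=2)   # current parent solution
sigma = 0.5                            # fixed mutation step size (kept simple on purpose)

for _ in range(5000):
    child = x + sigma * rng.normal(size=2)   # mutate the parent
    if rastrigin(child) <= rastrigin(x):     # select: keep the child only if it is no worse
        x = child

print(x, rastrigin(x))   # usually far better than the random starting point
```

The loop only ever evaluates the objective, never its derivative, which is why such methods also apply to non-differentiable or combinatorial problems (with a suitable mutation operator).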
– Christian
The answer to the question may be no. The reason is simply that there are numerous optimisation algorithms available, and choosing one depends heavily on the context and the time you have for optimising. For instance, the genetic algorithm is a well-known optimisation approach that does not involve gradient descent at all. There are also other approaches, like backtracking, in some contexts. They can all be used without taking gradient descent steps.

On the other hand, for tasks like regression you can find a closed form for the extremum, but the point is that, depending on the feature space and the number of inputs, you may choose either the closed-form equation or gradient descent to decrease the number of calculations.

While there are many optimisation algorithms, gradient-descent-based approaches are used more in neural networks, for multiple reasons. First of all, they are very fast. In deep learning you have so much data that it cannot all be loaded into memory simultaneously, so you have to apply mini-batch gradient methods for optimisation. This is a slightly statistical point, but each batch you feed to your network has roughly the same distribution as the real data and is representative enough to yield a gradient close to the true gradient of the cost function, which would be constructed using all the data at hand.

Second, the complexity of finding the extremum using matrices and their inverses is $O(n^3)$ for a simple regression task, where the parameters can be found using $w = (X^TX)^{-1}X^Ty$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case you have to bring all the data into memory at once, which is not possible when you deal with big-data tasks.

Third, there are optimisation problems that do not necessarily have a closed-form solution; logistic regression is one of them.
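To make the regression comparison concrete, here is a small sketch (my own, on synthetic data) that solves the same least-squares problem with the normal equation $w = (X^TX)^{-1}X^Ty$ and with batch gradient descent; the learning rate and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Closed form (normal equation): one O(d^3) solve, but needs all of X in memory.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error: only needs gradients.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2 / n) * X.T @ (X @ w - y)   # gradient of mean((Xw - y)^2)
    w = w - lr * grad

print(w_closed)   # both should be close to [2.0, -1.0, 0.5]
print(w)
```

With a handful of features the closed form is the easy choice; as the number of features and rows grows, the cubic solve and the need to hold all of $X$ in memory are what push practice toward (mini-batch) gradient methods.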
– Media
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47142%2fis-gradient-descent-central-to-every-optimizer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
New contributor
answered 9 hours ago
jeb02jeb02
1061
1061
New contributor
New contributor
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
edited 5 hours ago
answered 12 hours ago
EsmailianEsmailian
7728
7728
add a comment |
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
New contributor
answered 6 hours ago
ChristianChristian
211
211
New contributor
New contributor
add a comment |
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
edited 10 hours ago
answered 11 hours ago
MediaMedia
7,23562161
7,23562161
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47142%2fis-gradient-descent-central-to-every-optimizer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent." en.wikipedia.org/wiki/Coordinate_descent .
$endgroup$
– frank
10 hours ago
4
$begingroup$
I suggest making the title more specific since the answer to EVERY optimizer is obviously no.
$endgroup$
– Esmailian
9 hours ago