Is Gradient Descent central to every optimizer?
I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several others.
Tags: machine-learning, neural-network, deep-learning, optimization, gradient-descent
I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent": en.wikipedia.org/wiki/Coordinate_descent
– frank

I suggest making the title more specific, since the answer for EVERY optimizer is obviously no.
– Esmailian
4 Answers
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of their step updates. Adam, Adagrad, and RMSProp all use some form of gradient descent; however, they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena and do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.

(Figure: animated visualization of Bayesian Optimization in action.)

There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.

Non-derivative-based optimization algorithms can be especially useful for irregular non-convex cost functions, non-differentiable cost functions, or cost functions whose left and right derivatives differ.

To understand why one might choose a non-derivative-based optimization algorithm, take a look at the Rastrigin benchmark function. Gradient-based optimization is not well suited to functions with so many local minima.
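To make that concrete, here is a small illustrative sketch (not from the original answer) that runs plain gradient descent and a naive random search on the 2-D Rastrigin function; the learning rate, iteration counts, and search range are arbitrary choices for demonstration:

```python
import numpy as np

def rastrigin(x):
    # 2-D Rastrigin: highly multimodal, global minimum f(0, 0) = 0.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def rastrigin_grad(x):
    return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x0 = rng.uniform(-5.12, 5.12, size=2)

# Plain gradient descent: follows the local slope from the starting point.
x = x0.copy()
for _ in range(2000):
    x = x - 0.001 * rastrigin_grad(x)

# Naive gradient-free baseline: keep the best of 2000 uniform random samples.
best = x0.copy()
for _ in range(2000):
    cand = rng.uniform(-5.12, 5.12, size=2)
    if rastrigin(cand) < rastrigin(best):
        best = cand

print("gradient descent:", rastrigin(x))
print("random search:   ", rastrigin(best))
```

On most runs, gradient descent settles into the local minimum nearest its starting point, while even blind random sampling often stumbles on a better basin — which is the point of the Rastrigin example.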
– jeb02
Thank you very much. Loved your answer.
– InAFlash
According to the title:

No. Only specific types of optimizers are based on gradient descent. A straightforward counterexample is when optimization is over a discrete space, where the gradient is undefined.

According to the body:

Yes. Adam, Adagrad, RMSProp, and other similar optimizers (Nesterov, Nadam, etc.) all try to propose an adaptive step size (learning rate) for gradient descent in order to improve convergence speed without sacrificing performance (i.e., without ending up in a worse local minimum/maximum).
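As a sketch of what "adaptive step size" means in practice, here is the Adam update rule in its standard formulation; the quadratic toy objective and the larger-than-default learning rate are my own choices for illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a gradient-descent step with a per-parameter adaptive scale."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # effective step size differs per parameter
    return w, m, v

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # approaches the minimum at the origin
```

The inner loop is still a gradient step; what Adam adds is the per-parameter rescaling of that step.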
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, which work with the second-order derivatives of the loss function (gradient descent works with the first-order derivative). These methods have lost the speed–scalability trade-off to gradient descent because of the large number of model parameters in practical problems.
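For contrast, here is a minimal Newton step on a toy quadratic loss (my own example). The Hessian is an n × n matrix for n parameters, which is exactly the scalability problem mentioned above when n is in the millions:

```python
import numpy as np

# Toy quadratic loss: L(w) = 0.5 * w^T A w - b^T w, gradient A w - b, Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

w = np.zeros(2)
grad = A @ w - b
hess = A                              # for a quadratic loss the Hessian is constant
w = w - np.linalg.solve(hess, grad)   # Newton step: w <- w - H^{-1} * gradient
print(w, A @ w - b)                   # one step lands on the exact minimum (gradient ~ 0)
```

For a quadratic this converges in a single step; gradient descent would need many small steps, but it never has to form or invert an n × n matrix.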
Some extra notes
The shape of the loss function depends on both the model parameters and the data, so choosing the best method is always task-dependent and needs trial and error.

The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is orthogonal to all of the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).

Projected gradient descent is used when some regions of the parameter space are infeasible (invalid, not allowed), so we bring back (project) a parameter to the feasible region when it steps into an infeasible one. For example, suppose we only allow $\left\| w \right\|_2 \leq 1$: when the parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible point, depending on the trajectory and the specific method. This technique is also orthogonal to the mentioned methods; we could have a projected stochastic Adam.
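As an illustration of that projection step (my own sketch, matching the $\|w\|_2 \leq 1$ example; the gradient values are made up):

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Project w onto the L2 ball of the given radius (nearest feasible point)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

# The example from above: (0, 1.1) is infeasible and is projected back to (0, 1).
print(project_l2_ball(np.array([0.0, 1.1])))   # -> [0. 1.]

# One projected gradient step: ordinary gradient step, then projection.
w = np.array([0.8, 0.7])
g = np.array([-0.5, -0.5])          # a made-up gradient at w, just for illustration
w = project_l2_ball(w - 0.1 * g)    # the step leaves the ball, the projection brings it back
print(w)
```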
– Esmailian
Well, you picked optimizers that are used in neural networks, and those optimizers do use gradient-based algorithms. Most of the time, gradient-based algorithms are used in neural networks. Why is that? Well, would you prefer to search for the minimum knowing the slope of the curve, or without knowing it?

When you can't compute the gradient, you fall back to derivative-free optimization.

That being said, there are cases where, even though you have information about the gradient, it is better to use a gradient-free method. This is usually the case with functions that have a lot of local minima. Population-based algorithms such as evolution strategies and genetic algorithms have the upper hand here.

And there is also the branch of combinatorial optimization, where a whole different set of tools is used.
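As a hedged illustration of a population-flavoured, gradient-free method, here is a minimal (1+1) evolution strategy; the Rastrigin benchmark and the fixed mutation step size are my own choices, picked only because the function has many local minima:

```python
import numpy as np

def rastrigin(x):
    # Multimodal benchmark: many local minima, global minimum at the origin.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(42)
x = rng.uniform(-5.12, 5.12, size=2)   # current parent solution
sigma = 0.5                            # fixed mutation step size (kept simple on purpose)

for _ in range(5000):
    child = x + sigma * rng.normal(size=2)   # mutate the parent
    if rastrigin(child) <= rastrigin(x):     # select: keep the child only if it is no worse
        x = child

print(x, rastrigin(x))   # usually far better than the random starting point
```

The loop only ever evaluates the objective, never its derivative, which is why such methods also apply to non-differentiable or combinatorial problems (with a suitable mutation operator).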
– Christian
The answer to the question may be no. The reason is simply that there are numerous optimisation algorithms available, and choosing one depends heavily on the context and the time you have for optimising. For instance, the genetic algorithm is a well-known optimisation approach that does not involve gradient descent at all. There are also other approaches, like backtracking, in some contexts. They can all be used without taking gradient descent steps.

On the other hand, for tasks like regression you can find a closed form for the extremum, but the point is that, depending on the feature space and the number of inputs, you may choose either the closed-form equation or gradient descent to decrease the number of calculations.

While there are many optimisation algorithms, gradient-descent-based approaches are used more in neural networks, for multiple reasons. First of all, they are very fast. In deep learning you have so much data that it cannot all be loaded into memory simultaneously, so you have to apply mini-batch gradient methods for optimisation. This is a slightly statistical point, but each batch you feed to your network has roughly the same distribution as the real data and is representative enough to yield a gradient close to the true gradient of the cost function, which would be constructed using all the data at hand.

Second, the complexity of finding the extremum using matrices and their inverses is $O(n^3)$ for a simple regression task, where the parameters can be found using $w = (X^TX)^{-1}X^Ty$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case you have to bring all the data into memory at once, which is not possible when you deal with big-data tasks.

Third, there are optimisation problems that do not necessarily have a closed-form solution; logistic regression is one of them.
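To make the regression comparison concrete, here is a small sketch (my own, on synthetic data) that solves the same least-squares problem with the normal equation $w = (X^TX)^{-1}X^Ty$ and with batch gradient descent; the learning rate and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Closed form (normal equation): one O(d^3) solve, but needs all of X in memory.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error: only needs gradients.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2 / n) * X.T @ (X @ w - y)   # gradient of mean((Xw - y)^2)
    w = w - lr * grad

print(w_closed)   # both should be close to [2.0, -1.0, 0.5]
print(w)
```

With a handful of features the closed form is the easy choice; as the number of features and rows grows, the cubic solve and the need to hold all of $X$ in memory are what push practice toward (mini-batch) gradient methods.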
– Media
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47142%2fis-gradient-descent-central-to-every-optimizer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
$endgroup$
No. Gradient descent is used in optimization algorithms that use the gradient as the basis of its step movement. Adam
, Adagrad
, and RMSProp
all use some form of gradient descent, however they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.
Check out this visualization of Bayesian Optimization in action:
There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.
Non-derivative based optimization algorithms can be especially useful in irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have a different left or right derivative.
To understand why one may choose a non-derivative based optimization algorithm. Take a look at the Rastrigin benchmark function. Gradient based optimization is not well suited for optimizing functions with so many local minima.
New contributor
New contributor
answered 9 hours ago
jeb02jeb02
1061
1061
New contributor
New contributor
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
$begingroup$
thank you very much. Loved your answer
$endgroup$
– InAFlash
5 hours ago
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
add a comment |
$begingroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
$endgroup$
According to the title:
No. Only specific types of optimizers are based on Gradient Descent. A straightforward counterexample is when optimization is over a discrete space where gradient is undefined.
According to the body:
Yes. Adam, Adagrad, RMSProp and other similar optimizers (Nesterov, Nadam, etc.) are all trying to propose an adaptive step size (learning rate) for gradient descent to improve the convergence speed without sacrificing the performance (i.e. leading to worse local minimum/maximum).
It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with second-order derivative of loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent due to the large number of model parameters in practical problems.
Some extra notes
The shape of loss function depends on both the model parameters and data, so choosing the best method is always task dependent and needs trial and error.
The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is in parallel to all the mentioned methods, meaning all of them can be stochastic (using a batch of data) or deterministic (using the whole data).
Projected gradient descent is used when some regions of parameters are infeasible (invalid, not allowed), so we bring back (project) a parameter to a feasible region when it goes into an infeasible one. For example, suppose we only allow $left | w right |_2 leq 1$, when parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible points depending on the trajectory and the specific method. This technique is also in parallel to the mentioned methods, we could have projected stochastic Adam.
edited 5 hours ago
answered 12 hours ago
EsmailianEsmailian
7728
7728
add a comment |
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
add a comment |
$begingroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
$endgroup$
Well, you picked optimizers which are used in neural networks, those optimizers do use gradient based algorithms. Most of the times gradient based algorithms are used in neural networks. Why is that? Well, would you prefer trying to find the minimum knowing the slope of a curve or without knowing it?
When you can't compute the gradient then you'll fall back to Derivative-free optimization.
That being said, there are cases when even though you have info about the gradient, it's better to use a gradient-free method. This is usually the case with functions which have a lot of local minima. Population based algorithms such as evolutionary strategies and genetic algorithms have the upper hand here.
And there's also the branch of combinatorial optimization where a whole different set of tools is used.
New contributor
New contributor
answered 6 hours ago
ChristianChristian
211
211
New contributor
New contributor
add a comment |
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
add a comment |
$begingroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
$endgroup$
The answer to the question may be no. The reason is simply due to numerous optimisation algorithms that are available, but choosing one highly depends on the context and the time you have for optimising. For instance, Genetic algorithm is a well-known optimisation approach which does not have any gradient descent inside it. There are also other approaches like backtracking in some contexts. They all can be used which do not leverage gradient descent step by step.
On the other hand, for tasks like regression, you can find close-form for solving the problem to find extremums, but the point is that depending on the feature space and the number of inputs you may choose the close-form equation or the gradient descent to decrease the number of calculations.
While there are so many optimisation algorithms, in neural networks gradient descent based approaches are used more due to multiple reasons. First of all, they are very fast. In deep learning, you have to provide so many data that they cannot be loaded to the memory simultaneously. Consequently, you have to apply batch gradient methods for optimisation. It's a bit statistic stuff but you can consider that each sample you bring to your network can have a roughly similar distribution to the real data and can be representative enough to find a gradient which can be close to the real gradient of the cost function which should be constructed using all data in hand.
Second, The complexity of finding extremums using matrices and their inverse is $O(n^3)$ for a simple regression task which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can have better performance. It should also be mentioned that in the former case, you have to bring the data simultaneously to the memory which is not possible for occasions where you deal with big data tasks.
Third, there are optimisation problems that do not necessarily have a close-form solution. Logistic regression is one of them.
edited 10 hours ago
answered 11 hours ago
MediaMedia
7,23562161
7,23562161
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47142%2fis-gradient-descent-central-to-every-optimizer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent." en.wikipedia.org/wiki/Coordinate_descent .
$endgroup$
– frank
10 hours ago
4
$begingroup$
I suggest making the title more specific since the answer to EVERY optimizer is obviously no.
$endgroup$
– Esmailian
9 hours ago