Optimization methods used in machine learning
I don't have much knowledge in the field of ML, but from my naive point of view it always seems that some variant of gradient descent is used when training neural networks. As such, I was wondering why more advanced methods don't seem to be used, such as SQP algorithms or interior-point methods. Is it because training a neural net is always a simple unconstrained optimization problem, and the above-mentioned methods would be unnecessary? Any insight would be great, thanks.

machine-learning neural-network training

asked Feb 22 '18 at 16:49 – InquisitiveInquirer
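To make the framing concrete: training is usually posed as unconstrained minimization of an empirical loss, so the basic algorithm is just a repeated (minibatch) gradient step. A minimal sketch, assuming a NumPy array of examples and a user-supplied `grad_fn(params, batch)` that returns the loss gradient (both are illustrative, not part of the question):

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, epochs=10, batch_size=32):
    """Plain minibatch SGD: repeatedly step params against the loss gradient.
    There are no constraints, so no SQP or interior-point machinery is needed,
    just the gradient of the empirical loss."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)  # unconstrained step
    return params
```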
Because the more expensive methods don't offer enough advantage over simple gradient descent. Or maybe we do not know how to harness them well enough. Why gradient descent works as well as it does is still debated; cf. e.g. "The Marginal Value of Adaptive Gradient Methods in Machine Learning". Welcome to the site!
– Emre, Feb 22 '18 at 17:24
@Emre Thanks for your answer. Don't you think GD approaches that use momentum perform much better?
– Vaalizaadeh, Feb 22 '18 at 18:12
It has for me; momentum functions as a dampener, enabling the optimizer to power through rough patches of the loss surface, but here we have a paper that questions this folk wisdom. I'll keep using it until the dust settles.
– Emre, Feb 22 '18 at 18:15
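For reference, the momentum variant the comments refer to keeps a running velocity alongside the parameters; a minimal sketch of one heavy-ball update (names and defaults here are illustrative):

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, beta=0.9):
    """One classical (heavy-ball) momentum update.
    The velocity is an exponentially weighted sum of past gradients,
    which damps oscillations on rough patches of the loss surface."""
    velocity = beta * velocity - lr * grad
    return params + velocity, velocity
```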
Excuse me, @Emre: based on what you have referred to, if you wanted to train a network from scratch, would you prefer GD over Adam?
– Vaalizaadeh, Feb 23 '18 at 13:24
I would not, because GD needs tuning, and Adam will beat untuned GD. When I hear "advanced methods" I think of (quasi-)second-order methods or natural gradients.
– Emre, Feb 23 '18 at 17:29
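To see why untuned GD loses to Adam, compare the update rules: plain GD applies one global learning rate, while Adam rescales every coordinate by running estimates of the gradient's first and second moments. A sketch of the standard Adam update (not tied to any particular framework's API):

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count).
    Per-coordinate step sizes come from running moment estimates, which is
    why Adam is less sensitive to the raw learning rate than plain GD."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```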
1 Answer
In my reply to "Does gradient descent always converge to an optimum?", I explain that standard gradient descent works well because backtracking gradient descent works well (this is proven in our recent paper mentioned in that post), and that in the long run backtracking gradient descent behaves like standard gradient descent.

answered Nov 23 '18 at 13:40 – Tuyen
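For readers unfamiliar with the term: backtracking gradient descent picks the step size at every iteration by shrinking it until an Armijo sufficient-decrease condition holds, instead of fixing a learning rate in advance. A minimal sketch (constants and names are illustrative, not taken from the paper the answer cites):

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha0=1.0, rho=0.5, c=1e-4, tol=1e-6, max_iter=1000):
    """Gradient descent with Armijo backtracking: at each iterate, shrink the
    step size until the loss drops by at least c * alpha * ||grad||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
            alpha *= rho                    # backtrack until sufficient decrease
        x = x - alpha * g
    return x

# e.g. minimizing a simple quadratic:
# backtracking_gd(lambda x: x @ x, lambda x: 2 * x, [3.0, -4.0])
```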