How to get out of local minimums on stochastic gradient descent?
I'm not programming a neural network; I'm looking at it from a theoretical, non-hands-on point of view, and I'm wondering how training escapes a local minimum and reaches the global minimum.
Suppose you start at a point, for instance: (red)
When you compute the gradient of the error function and step in the direction of steepest descent, you end up in the nearest local minimum and, as far as I know, get stuck there. How do neural network trainers deal with this? Do they start from a new random configuration of weights at each batch and check whether the cost is smaller, or is there some way of reaching the global minimum directly?
I've heard of a method that resets the learning rate to 'pop' out of local minima, but I'm not sure how that works when the gradient is 0. I've also heard that stochastic gradient descent is more reliable than gradient descent at finding global minima, but I don't know how using the training data in batches rather than all at once lets it steer around the local minimum in the example, whose walls are clearly steeper than the path to the global minimum behind it.
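For intuition, here is a minimal sketch of the situation being asked about. The quartic loss and every number in it are invented purely for illustration; plain gradient descent started on the left side of the barrier settles into the shallow local minimum and never reaches the deeper global minimum at w = 2:

```python
# Hypothetical 1-D loss with a shallow local minimum near w = -0.8
# and a deeper global minimum at w = 2 (chosen for illustration only).
def loss(w):
    return (w + 1)**2 * (w - 2)**2 + 0.5 * (w - 2)**2

def grad(w):
    # derivative of the loss above
    return 2*(w + 1)*(w - 2)**2 + 2*(w + 1)**2*(w - 2) + (w - 2)

w = -1.5   # start on the "red" side, left of the barrier
lr = 0.01
for _ in range(1000):
    w -= lr * grad(w)

print(w)   # settles near -0.8: the local minimum, not the global one at w = 2
```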
machine-learning neural-network gradient-descent
asked 3 hours ago, edited 2 hours ago
Lewis Kelsey 63
1 Answer
Stochastic gradient descent loss landscape vs. gradient descent loss landscape
"I don't know how using the training data in batches rather than all at once allows it to steer around the local minimum in the example, which is clearly steeper than the path to the global minimum behind it."
Stochastic gradient descent is better at avoiding local minima because the loss landscape of a single batch differs from the loss landscape of the whole dataset (the case where you compute the loss on all the data before updating the parameters). The gradient over the whole dataset can be 0 at some point while, at that same point, the gradient of an individual batch is not, so the batch update can push the parameters in a direction that leads away from the local minimum.
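To make that concrete, here is a toy sketch (two data points with squared-error per-sample losses; all numbers invented): at the minimizer of the full-data loss the full gradient is exactly zero, yet each single-sample "batch" gradient is not, so an SGD step would still move the parameter.

```python
import numpy as np

# Two per-sample losses f_i(w) = (w - x_i)^2; the full-data loss is
# their average, which is minimized at the mean of x.
x = np.array([-1.0, 3.0])

def full_grad(w):
    return np.mean(2 * (w - x))       # gradient of the average loss

def sample_grad(w, i):
    return 2 * (w - x[i])             # gradient of one "batch" of size 1

w = x.mean()                          # stationary point of the full loss
print(full_grad(w))                   # 0.0 -> full-batch GD stops here
print(sample_grad(w, 0), sample_grad(w, 1))  # 4.0 and -4.0 -> SGD keeps moving
```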
Neural network architecture and loss landscape
Your network architecture can also help you escape local minima. For example, see this work: Visualizing the Loss Landscape of Neural Nets. It shows that skip connections smooth the loss landscape and hence help the optimizer find the global minimum more easily.
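For reference, a skip connection simply adds a block's input to its output, so the block only has to learn a residual. A minimal NumPy sketch (the weights, sizes, and function names are arbitrary, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

def residual_block(x):
    h = np.maximum(0, x @ W1)   # ReLU layer: the residual branch F(x)
    return x + h @ W2           # skip connection adds the input back

x = rng.normal(size=4)
y = residual_block(x)           # output has the same shape as the input
```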
Local minima vs. the global optimum
Finally, some works suggest that local minima have almost the same function value as the global optimum. See this question and answer.
answered 1 hour ago
Antonio Jurić 3007