How to get out of local minimums on stochastic gradient descent?
I'm not programming a neural network; I'm looking at it from a theoretical, non-hands-on point of view, and I'm wondering how training escapes a local minimum and reaches the global minimum.
Suppose you start at a point, for instance: (red)
When you compute the gradient of the error function and step in the direction of steepest descent, you end up in the nearest local minimum and, as far as I know, get stuck there. How do neural network trainers deal with this? Do they start from a new random configuration of weights at each batch and check whether the cost is smaller, or is there some way of reaching the global minimum directly?
I've heard of a method that resets the learning rate to 'pop' out of local minima, but I'm not sure how that works when the gradient is 0. I've also heard that stochastic gradient descent is more reliable than gradient descent at finding global minima, but I don't know how using the training data in batches rather than all at once lets it steer around the local minimum in the example, whose walls are clearly steeper than the path to the global minimum behind it.
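For intuition, here is a minimal sketch of the situation being asked about. The quartic loss and every number in it are invented purely for illustration; plain gradient descent started on the left side of the barrier settles into the shallow local minimum and never reaches the deeper global minimum at w = 2:

```python
# Hypothetical 1-D loss with a shallow local minimum near w = -0.8
# and a deeper global minimum at w = 2 (chosen for illustration only).
def loss(w):
    return (w + 1)**2 * (w - 2)**2 + 0.5 * (w - 2)**2

def grad(w):
    # derivative of the loss above
    return 2*(w + 1)*(w - 2)**2 + 2*(w + 1)**2*(w - 2) + (w - 2)

w = -1.5   # start on the "red" side, left of the barrier
lr = 0.01
for _ in range(1000):
    w -= lr * grad(w)

print(w)   # settles near -0.8: the local minimum, not the global one at w = 2
```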
machine-learning neural-network gradient-descent
asked 3 hours ago, edited 2 hours ago
Lewis Kelsey 63
1 Answer
Stochastic gradient descent loss landscape vs. gradient descent loss landscape
"I don't know how using the training data in batches rather than all at once allows it to steer around the local minimum in the example, which is clearly steeper than the path to the global minimum behind it."
Stochastic gradient descent is better at avoiding local minima because the loss landscape of a single batch differs from the loss landscape of the whole dataset (the case where you compute the loss on all the data before updating the parameters). The gradient over the whole dataset can be 0 at some point while, at that same point, the gradient of an individual batch is not, so the batch update can push the parameters in a direction that leads away from the local minimum.
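To make that concrete, here is a toy sketch (two data points with squared-error per-sample losses; all numbers invented): at the minimizer of the full-data loss the full gradient is exactly zero, yet each single-sample "batch" gradient is not, so an SGD step would still move the parameter.

```python
import numpy as np

# Two per-sample losses f_i(w) = (w - x_i)^2; the full-data loss is
# their average, which is minimized at the mean of x.
x = np.array([-1.0, 3.0])

def full_grad(w):
    return np.mean(2 * (w - x))       # gradient of the average loss

def sample_grad(w, i):
    return 2 * (w - x[i])             # gradient of one "batch" of size 1

w = x.mean()                          # stationary point of the full loss
print(full_grad(w))                   # 0.0 -> full-batch GD stops here
print(sample_grad(w, 0), sample_grad(w, 1))  # 4.0 and -4.0 -> SGD keeps moving
```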
Neural network architecture and loss landscape
Your network architecture can also help you escape local minima. For example, see this work: Visualizing the Loss Landscape of Neural Nets. It shows that skip connections smooth the loss landscape and hence help the optimizer find the global minimum more easily.
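For reference, a skip connection simply adds a block's input to its output, so the block only has to learn a residual. A minimal NumPy sketch (the weights, sizes, and function names are arbitrary, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

def residual_block(x):
    h = np.maximum(0, x @ W1)   # ReLU layer: the residual branch F(x)
    return x + h @ W2           # skip connection adds the input back

x = rng.normal(size=4)
y = residual_block(x)           # output has the same shape as the input
```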
Local minima vs. the global optimum
Finally, some works suggest that local minima have almost the same function value as the global optimum. See this question and answer.
answered 1 hour ago
Antonio Jurić 3007