What is the advantage of keeping batch size a power of 2?
When training machine learning models, why is it sometimes advantageous to keep the batch size a power of 2? I thought it would be best to use the largest size that fits in your GPU memory / RAM.
This answer claims that for some packages, a power of 2 is better as a batch size. Can someone provide a detailed explanation, or a link to one? Is this true for all optimisation algorithms (gradient descent, backpropagation, etc.) or only some of them?
machine-learning training
asked Jul 5 '17 at 5:43
James Bond
2 Answers
This is a problem of aligning the virtual processors (VP) onto the physical processors (PP) of the GPU. Since the number of PP is often a power of 2, using a number of VP that is not a power of 2 can lead to poor performance.
You can picture the mapping of the VP onto the PP as a stack of slices, each slice as wide as the number of PP.
Say you have 16 PP.
You can map 16 VP onto them: each PP handles exactly 1 VP.
You can map 32 VP onto them: that is 2 slices of 16 VP, so each PP is responsible for 2 VP.
And so on.
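As a rough illustration of the slice picture above, here is a minimal Python sketch. The function names `map_vp_to_pp` and `num_slices` are made up for this example, and the round-robin assignment is only an assumption about how such a mapping could look; real GPU schedulers are more involved.

```python
def map_vp_to_pp(num_vp: int, num_pp: int) -> dict[int, list[int]]:
    """Assign virtual processors to physical processors, slice by slice.

    Hypothetical helper: VP number v lands on PP number v % num_pp.
    """
    assignment: dict[int, list[int]] = {pp: [] for pp in range(num_pp)}
    for vp in range(num_vp):
        assignment[vp % num_pp].append(vp)
    return assignment


def num_slices(num_vp: int, num_pp: int) -> int:
    """Number of slices of width num_pp needed to hold num_vp (ceiling division)."""
    return -(-num_vp // num_pp)


if __name__ == "__main__":
    for vp_count in (16, 32):
        mapping = map_vp_to_pp(vp_count, 16)
        print(vp_count, "VP:", num_slices(vp_count, 16), "slice(s), PP 0 handles VP", mapping[0])
```

With 16 PP, each PP ends up with one VP per slice, which is the "1 PP is responsible for 2 VP" situation when there are 32 VP.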
During execution, each PP runs the job of the 1st VP it is responsible for, then the job of its 2nd VP, and so on.
If you use 17 VP, each PP runs the job of its 1st VP, then a single PP runs the job of the 17th VP while the other 15 do nothing (details below).
This is due to the SIMD paradigm (called vector processing in the 70s) used by GPUs. It is often called data parallelism: all the PP do the same thing at the same time, but on different data. See https://en.wikipedia.org/wiki/SIMD.
More precisely, in the example with 17 VP, once the 1st slice is done (every PP having run the job of its 1st VP), all the PP execute the same job again (for their 2nd VP), but only one of them has any data to work on.
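To put a number on the 17 VP case, the sketch below counts how many PP have real data in each lockstep pass. It is a toy model of the description above, not a measurement of any actual GPU, and the names `lockstep_schedule` and `utilization` are invented for illustration.

```python
def lockstep_schedule(num_vp: int, num_pp: int) -> list[int]:
    """For each lockstep pass (slice), return how many PP have data to work on."""
    if num_vp <= 0 or num_pp <= 0:
        raise ValueError("num_vp and num_pp must be positive")
    busy_per_pass = []
    remaining = num_vp
    while remaining > 0:
        busy = min(remaining, num_pp)  # the last pass may be only partly filled
        busy_per_pass.append(busy)
        remaining -= busy
    return busy_per_pass


def utilization(num_vp: int, num_pp: int) -> float:
    """Fraction of PP time slots doing useful work under this toy model."""
    passes = lockstep_schedule(num_vp, num_pp)
    return num_vp / (len(passes) * num_pp)


if __name__ == "__main__":
    for vp_count in (16, 17, 32):
        print(vp_count, lockstep_schedule(vp_count, 16), utilization(vp_count, 16))
```

Under this model, 17 VP on 16 PP needs two passes, the second with a single busy PP, so only 17 of the 32 available slots do useful work.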
This has nothing to do with learning itself. It is purely a programming matter.
answered Jul 5 '17 at 18:31
jcm69
@jcm69 Would it then be more accurate to say that the batch size should be a multiple of the number of PP? That is, in your example we could map 16 × 3 = 48 VP onto 16 PP?
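The arithmetic behind this suggestion can be checked with a small Python sketch. It assumes the same simplified slice model as the answer above (16 PP, one VP per PP per pass), and the helper names are hypothetical.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def fills_all_slices(batch_size: int, num_pp: int) -> bool:
    """True if every slice is completely filled, i.e. no PP sits idle."""
    return batch_size > 0 and batch_size % num_pp == 0


def round_up_to_multiple(batch_size: int, num_pp: int) -> int:
    """Smallest multiple of num_pp that is >= batch_size."""
    return -(-batch_size // num_pp) * num_pp


if __name__ == "__main__":
    for size in (17, 32, 48):
        print(size, is_power_of_two(size), fills_all_slices(size, 16), round_up_to_multiple(size, 16))
```

In this model, 48 is not a power of 2 but still fills exactly 3 slices of 16, which is the point of the question.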
answered 1 hour ago
1west