What is the advantage of keeping batch size a power of 2?

While training machine learning models, why is it sometimes advantageous to keep the batch size at a power of 2? I thought it would be best to use the largest size that fits in your GPU memory / RAM.

This answer claims that for some packages, a power of 2 is better as a batch size. Can someone provide a detailed explanation / link to a detailed explanation for this? Is this true for all optimisation algorithms (gradient descent, backpropagation, etc) or only some of them?

asked Jul 5 '17 at 5:43

James Bond

177139

machine-learning training

2 Answers

This is a matter of how the virtual processors (VPs) are mapped onto the physical processors (PPs) of the GPU. Since the number of PPs is often a power of 2, using a number of VPs that is not a power of 2 tends to leave some PPs idle, which hurts performance.

You can picture the mapping of VPs onto PPs as a stack of slices, each slice as wide as the number of PPs.

Say you've got 16 PPs.

You can map 16 VPs onto them: each PP handles 1 VP.

You can map 32 VPs onto them: 2 slices of 16 VPs, so each PP is responsible for 2 VPs.

And so on.

During execution, each PP runs the job of the 1st VP it is responsible for, then the job of the 2nd VP, and so on.
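The slice picture can be sketched in a few lines of Python. This is only an illustration of the simplified model described here, not of how any real GPU scheduler works; the function name `map_vps` and the round-robin assignment (VP number modulo the PP count) are assumptions made for the example.

```python
def map_vps(num_vp, num_pp):
    """Return, for each PP, the ordered list of VP indices it would run.

    Assumes the simplified slice model: VP i belongs to slice i // num_pp
    and is handled by PP i % num_pp (a hypothetical round-robin mapping).
    """
    schedule = [[] for _ in range(num_pp)]
    for vp in range(num_vp):
        schedule[vp % num_pp].append(vp)
    return schedule


# 32 VPs on 16 PPs: every PP gets exactly two VPs, one per slice.
schedule = map_vps(32, 16)
print(schedule[0])  # [0, 16]
```

With 17 VPs and 16 PPs, the same function gives two VPs to the first PP and one VP to each of the others.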

If you use 17 VPs, each PP executes the job of its 1st VP, then a single PP executes the job of the 17th VP while the other 15 do nothing useful (details below).

This is due to the SIMD paradigm (called vector processing in the 70s) used by GPUs. It is often called data parallelism: all the PPs do the same thing at the same time, but on different data. See https://en.wikipedia.org/wiki/SIMD.

More precisely, in the example with 17 VPs, once the 1st slice is done (all the PPs having done the job of their 1st VP), all the PPs execute the same job again (for their 2nd VP), but only one of them has data to work on.
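To put numbers on the wasted work, here is a rough sketch of the arithmetic under the same simplified model: every slice costs one full pass over all PPs, whether or not every PP has data. The helper `slice_stats` is a hypothetical name, and real hardware (warps, multiple cores, memory effects) is more complicated than this.

```python
import math


def slice_stats(num_vp, num_pp):
    """Return (passes, idle_slots, utilization) for the slice model.

    Assumes all PPs step through a pass together, so a partially
    filled last slice still costs a full pass.
    """
    passes = math.ceil(num_vp / num_pp)  # number of slices to execute
    slots = passes * num_pp              # PP time slots consumed
    idle = slots - num_vp                # slots with no data to work on
    utilization = num_vp / slots
    return passes, idle, utilization


for n in (16, 17, 32):
    passes, idle, util = slice_stats(n, 16)
    print(f"{n} VPs: {passes} pass(es), {idle} idle slots, utilization {util:.0%}")
```

For 17 VPs on 16 PPs this gives 2 passes and 15 idle slots, so roughly half of the available slots do useful work, whereas 16 or 32 VPs leave no slot idle.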

This has nothing to do with learning itself; it is purely a programming matter.

answered Jul 5 '17 at 18:31

jcm69

21623

@jcm69 would it be more accurate to say that batch sizes should then be a multiple of the number of PP? That is, in your example we could map 16x3=48 VP to 16 PP?
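Taking the simplified slice model from the answer above at face value, the question can be checked with a little arithmetic. The sketch below is only that: `idle_slots` is a hypothetical helper, and it assumes a partially filled last slice still occupies every PP for one pass.

```python
def idle_slots(num_vp, num_pp):
    """Number of PP slots left without data in the last slice (slice model)."""
    return (-num_vp) % num_pp


num_pp = 16
for num_vp in (8, 17, 48, 64):
    print(num_vp, "VPs ->", idle_slots(num_vp, num_pp), "idle slots")
```

Under this model, 48 VPs (3 slices of 16) leave no idle slot even though 48 is not a power of 2, while 8 VPs, a power of 2 smaller than the PP count, leave 8 slots idle. Whether real GPUs behave this simply is a separate question.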

answered 1 hour ago

1west

New contributor
