What is the advantage of keeping batch size a power of 2?












While training models in machine learning, why is it sometimes advantageous to keep the batch size a power of 2? I thought it would be best to use the largest size that fits in GPU memory / RAM.

This answer claims that for some packages, a power of 2 works better as a batch size. Can someone provide a detailed explanation, or a link to one? Is this true for all optimisation algorithms (gradient descent, backpropagation, etc.) or only some of them?







Tags: machine-learning, training

asked Jul 5 '17 at 5:43 by James Bond

2 Answers

This is a matter of aligning the virtual processors (VP) with the physical processors (PP) of the GPU. Since the number of PP is often a power of 2, using a number of VP that is not a power of 2 leads to poor performance.

You can picture the mapping of VP onto PP as a stack of slices, each slice as wide as the number of PP.

Say you have 16 PP.

You can map 16 VP onto them: each PP handles 1 VP.

You can map 32 VP onto them: 2 slices of 16 VP, so each PP is responsible for 2 VP.

And so on.
During execution, each PP runs the job of the 1st VP it is responsible for, then the job of the 2nd VP, and so on.

If you use 17 VP, each PP runs the job of its 1st VP, then a single PP runs the job of the 17th VP while the other 15 do nothing (details below).

This follows from the SIMD paradigm (called vector processing in the 1970s) used by GPUs. It is often called data parallelism: all PP do the same thing at the same time, but on different data. See https://en.wikipedia.org/wiki/SIMD.

More precisely, in the example with 17 VP, once the 1st slice is done (every PP having run the job of its 1st VP), all PP execute the same job again for the 2nd slice, but only one of them has data to work on.

This has nothing to do with learning; it is purely a programming matter.
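
The slice picture above can be sketched in a few lines of Python. This is only an illustration of the simplified model described here, not a measurement of any real GPU; the function name `simulate` and the default of 16 PP are assumptions chosen to match the example.

```python
from typing import Tuple


def simulate(num_vp: int, num_pp: int = 16) -> Tuple[int, float]:
    """Return (passes, utilization) for num_vp jobs on num_pp lock-step lanes.

    Simplified model: the lanes process one slice of num_pp jobs per pass,
    and a partly filled final slice still costs a full pass.
    """
    passes = (num_vp + num_pp - 1) // num_pp   # ceiling division
    utilization = num_vp / (passes * num_pp)   # fraction of lane-slots doing work
    return passes, utilization


for n in (16, 17, 32, 48):
    passes, util = simulate(n)
    print(f"{n} VP -> {passes} pass(es), {util:.0%} of lane-slots busy")
```

Under this model, 17 VP need two passes but fill only 17 of the 32 available slots, whereas 16, 32 and 48 VP fill every slot.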







answered Jul 5 '17 at 18:31 by jcm69

@jcm69 Would it be more accurate to say that the batch size should be a multiple of the number of PP? That is, in your example we could map 16 x 3 = 48 VP onto 16 PP.
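
Under the same simplified 16-PP model, the point can be checked with a short sketch. The helper names `idle_slots` and `round_up_to_multiple` are hypothetical, and real GPUs schedule work in more complicated ways, so treat this as arithmetic on the toy model only.

```python
def idle_slots(num_vp: int, num_pp: int = 16) -> int:
    """Lane-slots left empty in the final pass when num_vp jobs run on num_pp lanes."""
    return (-num_vp) % num_pp


def round_up_to_multiple(num_vp: int, num_pp: int = 16) -> int:
    """Smallest multiple of num_pp that is >= num_vp."""
    return num_vp + idle_slots(num_vp, num_pp)


for n in (32, 48, 50, 64):
    print(n, idle_slots(n), round_up_to_multiple(n))
```

In this model 48 wastes no slots even though it is not a power of 2, while 50 leaves 14 slots idle and would have to grow to 64 to fill the last pass.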







answered 1 hour ago by 1west