Unbalanced multiclass data with XGBoost












I have three classes with the following distribution:

Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163

I am using XGBoost for classification, and I know there is a parameter called scale_pos_weight.

How is it handled in the multiclass case, and how should I set it?

Tags: classification, xgboost, multiclass-classification, unbalanced-classes

asked Jan 16 '17 at 12:53 by shdashda
edited Dec 30 '18 at 14:29 by рüффп

3 Answers

scale_pos_weight is meant for binary classification, where it rescales the weight of the positive class. It is a general-purpose way to handle an imbalance between the two classes. A common starting value is:

sum(negative instances) / sum(positive instances)
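
As a rough sketch of that ratio, the value can be computed by counting the two classes. The function name `suggested_scale_pos_weight` and the label list are made up for illustration; they are not part of XGBoost.

```python
from collections import Counter

def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive instances in labels")
    return counts[0] / counts[1]

# Hypothetical binary labels: 9 negatives, 1 positive.
labels = [0] * 9 + [1]
print(suggested_scale_pos_weight(labels))  # 9.0
```

The result would then be passed as the scale_pos_weight parameter of a binary objective; it is a starting point to tune, not a guaranteed optimum.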


For your multiclass case there is another option: weight the individual data points. The booster takes these weights into account during optimization, so you can choose them such that each class is represented equally. Pass them when building the training matrix:

xgboost.DMatrix(..., weight = *array with one weight per instance*)

Because you define the weights yourself, you can correct for imbalances within a class as well as imbalances across classes.
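
A minimal Python sketch of this approach, under some assumptions: the helper names `balanced_instance_weights` and `build_weighted_dmatrix` are hypothetical, the rule n_samples / (n_classes * class_count) is one common heuristic rather than something XGBoost prescribes, and the label list only approximates the class shares from the question.

```python
from collections import Counter

def balanced_instance_weights(labels):
    """One weight per instance: n_samples / (n_classes * count of its class).

    With this heuristic every class ends up with the same total weight.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]

def build_weighted_dmatrix(features, labels):
    """Wrap the data (e.g. a NumPy array) in a DMatrix with per-instance weights."""
    import xgboost as xgb  # imported lazily so the weight helper works without it
    return xgb.DMatrix(features, label=labels,
                       weight=balanced_instance_weights(labels))

# Made-up labels roughly matching the question's shares (11.7% / 76.7% / 11.6%).
labels = [0] * 117 + [1] * 767 + [2] * 116
weights = balanced_instance_weights(labels)

# Total weight per class is now the same for all three classes.
totals = {c: sum(w for w, y in zip(weights, labels) if y == c) for c in (0, 1, 2)}
print(totals)
```

The resulting DMatrix can be passed to xgboost.train with a multiclass objective such as multi:softprob and num_class set to 3. Other weighting schemes are equally valid; the only requirement is one weight per training row.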







answered May 8 '17 at 9:42 by Kerem T
edited Aug 26 '17 at 8:40 by bstockton

Many people land on this question when dealing with an unbalanced multiclass classification problem using XGBoost in R. I did too!

I was looking for an example to better understand how to apply the weights, and it took me almost an hour to find the link below. For anyone else looking for an example, here it is:

https://datascience.stackexchange.com/a/9493/37156

Thanks, wacax.

answered Feb 25 '18 at 13:27 by Krithi07

The answer by @KeremT is correct. Here is an example for those who are still unsure about the exact implementation.

The weight parameter in XGBoost is set per instance, not per class. We therefore assign each instance the weight of its class, which has the same effect as a per-class weight.

For example, suppose we have three imbalanced classes with these ratios:

class A = 10%
class B = 30%
class C = 60%

Their weights, obtained by dividing the smallest class ratio by each class ratio, would be:

class A = 1.000
class B = 0.333
class C = 0.167

Then, if the training data is

index   class
0       A
1       A
2       B
3       C
4       B

we build the weight vector as follows:

index   class   weight
0       A       1.000
1       A       1.000
2       B       0.333
3       C       0.167
4       B       0.333
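
The weight vector above can be reproduced with a few lines of Python. This is only a sketch: the function name `class_weights_from_ratios` is hypothetical, and the ratios and the five training labels are the toy values from this answer.

```python
def class_weights_from_ratios(ratios):
    """Divide the smallest class ratio by each class ratio."""
    smallest = min(ratios.values())
    return {cls: smallest / ratio for cls, ratio in ratios.items()}

ratios = {"A": 0.10, "B": 0.30, "C": 0.60}
class_weights = class_weights_from_ratios(ratios)

# Expand the per-class weights into one weight per training instance.
train_labels = ["A", "A", "B", "C", "B"]
instance_weights = [class_weights[label] for label in train_labels]

print([round(w, 3) for w in instance_weights])
# [1.0, 1.0, 0.333, 0.167, 0.333]
```

The `instance_weights` list has one entry per training row, in row order, which is the shape the weight argument of xgboost.DMatrix expects.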






answered yesterday by Esmailian
edited yesterday