Classifier for large number of labels












I have a merchants dataset with 800,000 samples and 18,000 labels. Each sample has exactly one label, and the labels are independent.

An example sample looks like this:

desc: int'l 0028240525 amazon uk retail amazon.co.uk => label: Amazon

New retailers will also be added to the dataset over time, and a new retailer may well have only a single sample.

To summarise, I need a classifier that

  1. handles a large number of labels (~18,000, independent, single label per sample)
  2. is able to classify undersampled labels (e.g. a retailer with only a single sample)

Is there an approach that will handle both? Or would two separate classifiers make more sense?

















Tags: machine-learning, logistic-regression, naive-bayes-classifier





asked 17 hours ago by Oliver Searle-Barnes




          1 Answer
Many algorithms handle multiclass classification natively, for example kNN, naive Bayes and decision trees.

To make the classifier accurate across all labels and reduce its bias towards the frequent ones, you can rebalance the training data: oversample the minority classes or undersample the majority classes so that every label has the same number of samples.

This question has some interesting answers on dealing with class imbalance in decision tree classification: https://stats.stackexchange.com/questions/28029/training-a-decision-tree-against-unbalanced-data
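As a rough illustration of the two ideas above, here is a minimal sketch in plain Python. It is not taken from the answer and makes several assumptions: merchant descriptions are compared by character trigram counts with cosine similarity, kNN is used with k = 1 via brute-force search, and oversampling is simple random duplication. The names `char_ngrams`, `NearestNeighbourClassifier` and `oversample` are hypothetical, and every training sample except the Amazon one from the question is made up. A brute-force search like this would probably be too slow for 800,000 samples without an index, so treat it as a demonstration of the idea rather than a production design.

```python
import math
import random
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class NearestNeighbourClassifier:
    """kNN with k = 1 over character n-grams (brute force, illustration only)."""

    def __init__(self):
        self.examples = []  # list of (ngram_counts, label) pairs

    def add(self, text, label):
        # A new retailer needs no retraining: its single sample is just stored.
        self.examples.append((char_ngrams(text), label))

    def predict(self, text):
        query = char_ngrams(text)
        best = max(self.examples, key=lambda example: cosine(query, example[0]))
        return best[1]


def oversample(texts, labels, seed=0):
    """Randomly duplicate samples until every label matches the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for text in group + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


train = [
    ("int'l 0028240525 amazon uk retail amazon.co.uk", "Amazon"),  # from the question
    ("amazon prime amzn.co.uk/pm", "Amazon"),                      # hypothetical
    ("tesco stores 3297 london", "Tesco"),                         # hypothetical
    ("card payment acme coffee ltd", "Acme Coffee"),               # hypothetical, single sample
]

clf = NearestNeighbourClassifier()
for text, label in train:
    clf.add(text, label)

print(clf.predict("amazon uk marketplace amazon.co.uk"))
print(clf.predict("acme coffee ltd contactless"))

# Balanced copy of the training data, for models that are sensitive to class imbalance.
balanced_texts, balanced_labels = oversample(
    [text for text, _ in train], [label for _, label in train]
)
print(Counter(balanced_labels))
```

A nearest-neighbour lookup can return a label that has only one stored sample, which is why it may suit the new-retailer case. Oversampling matters more for models that estimate per-class statistics, such as naive Bayes or decision trees.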













answered 10 hours ago by Fábio Colaço



