Can this be a case of multi-class skewness?












0












$begingroup$


I have been working on an email data set, and trying to predict the owner team for it. But my prediction accuracy is just 58%. I have implemented data cleansing, null value removals, duplicate removal, excluding stopwords, and then calculating tf-idf value to get my final document term matrix.



Tried various classifiers (but with default settings) but max I achieved is 58% accuracy. Might give a try to hyperparameter tuning.



Could you please give me some pointers, based on your previous experience what possibly could be going wrong? I'm simultaneously doing my own research and going through various blogs and articles.



One thought - Could it be a case of data skewness?



My output classes are distributed like this -
enter image description here



Your thoughts please?










share|improve this question









$endgroup$

















    0












    $begingroup$


    I have been working on an email data set, and trying to predict the owner team for it. But my prediction accuracy is just 58%. I have implemented data cleansing, null value removals, duplicate removal, excluding stopwords, and then calculating tf-idf value to get my final document term matrix.



    Tried various classifiers (but with default settings) but max I achieved is 58% accuracy. Might give a try to hyperparameter tuning.



    Could you please give me some pointers, based on your previous experience what possibly could be going wrong? I'm simultaneously doing my own research and going through various blogs and articles.



    One thought - Could it be a case of data skewness?



    My output classes are distributed like this -
    enter image description here



    Your thoughts please?










    share|improve this question









    $endgroup$















      0












      0








      0





      $begingroup$


      I have been working on an email data set, and trying to predict the owner team for it. But my prediction accuracy is just 58%. I have implemented data cleansing, null value removals, duplicate removal, excluding stopwords, and then calculating tf-idf value to get my final document term matrix.



      Tried various classifiers (but with default settings) but max I achieved is 58% accuracy. Might give a try to hyperparameter tuning.



      Could you please give me some pointers, based on your previous experience what possibly could be going wrong? I'm simultaneously doing my own research and going through various blogs and articles.



      One thought - Could it be a case of data skewness?



      My output classes are distributed like this -
      enter image description here



      Your thoughts please?










      share|improve this question









      $endgroup$




      I have been working on an email data set, and trying to predict the owner team for it. But my prediction accuracy is just 58%. I have implemented data cleansing, null value removals, duplicate removal, excluding stopwords, and then calculating tf-idf value to get my final document term matrix.



      Tried various classifiers (but with default settings) but max I achieved is 58% accuracy. Might give a try to hyperparameter tuning.



      Could you please give me some pointers, based on your previous experience what possibly could be going wrong? I'm simultaneously doing my own research and going through various blogs and articles.



      One thought - Could it be a case of data skewness?



      My output classes are distributed like this -
      enter image description here



      Your thoughts please?







      machine-learning python multiclass-classification accuracy






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked 2 days ago









      ranit.branit.b

      427




      427






















          1 Answer
          1






          active

          oldest

          votes


















          1












          $begingroup$

          Just telling the accuracy does not mean anything in classification problems. The first thing you must do is to calculate your baseline, that is, what is the percentage of your majority class? For the bar plot above it is difficult to measure, can you tell us in percentage, instead of counts? In this way we can assess better your results.



          Also, have you plot a confusion matrix? In this way you can see where your model are getting more wrong and try to infer why this is happening.



          And yes, since you have too many classes to predict and most of them are with low representativeness this will be difficult to overcome. Maybe you can try things such as Oversampling, Undersampling techniques considering one-vs-all approach. This is just an idea, I haven't yet encountering a problem with so many classes to predict.






          share|improve this answer









          $endgroup$













          • $begingroup$
            Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
            $endgroup$
            – ranit.b
            2 days ago











          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "557"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46802%2fcan-this-be-a-case-of-multi-class-skewness%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1












          $begingroup$

          Just telling the accuracy does not mean anything in classification problems. The first thing you must do is to calculate your baseline, that is, what is the percentage of your majority class? For the bar plot above it is difficult to measure, can you tell us in percentage, instead of counts? In this way we can assess better your results.



          Also, have you plot a confusion matrix? In this way you can see where your model are getting more wrong and try to infer why this is happening.



          And yes, since you have too many classes to predict and most of them are with low representativeness this will be difficult to overcome. Maybe you can try things such as Oversampling, Undersampling techniques considering one-vs-all approach. This is just an idea, I haven't yet encountering a problem with so many classes to predict.






          share|improve this answer









          $endgroup$













          • $begingroup$
            Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
            $endgroup$
            – ranit.b
            2 days ago
















          1












          $begingroup$

          Just telling the accuracy does not mean anything in classification problems. The first thing you must do is to calculate your baseline, that is, what is the percentage of your majority class? For the bar plot above it is difficult to measure, can you tell us in percentage, instead of counts? In this way we can assess better your results.



          Also, have you plot a confusion matrix? In this way you can see where your model are getting more wrong and try to infer why this is happening.



          And yes, since you have too many classes to predict and most of them are with low representativeness this will be difficult to overcome. Maybe you can try things such as Oversampling, Undersampling techniques considering one-vs-all approach. This is just an idea, I haven't yet encountering a problem with so many classes to predict.






          share|improve this answer









          $endgroup$













          • $begingroup$
            Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
            $endgroup$
            – ranit.b
            2 days ago














          1












          1








          1





          $begingroup$

          Just telling the accuracy does not mean anything in classification problems. The first thing you must do is to calculate your baseline, that is, what is the percentage of your majority class? For the bar plot above it is difficult to measure, can you tell us in percentage, instead of counts? In this way we can assess better your results.



          Also, have you plot a confusion matrix? In this way you can see where your model are getting more wrong and try to infer why this is happening.



          And yes, since you have too many classes to predict and most of them are with low representativeness this will be difficult to overcome. Maybe you can try things such as Oversampling, Undersampling techniques considering one-vs-all approach. This is just an idea, I haven't yet encountering a problem with so many classes to predict.






          share|improve this answer









          $endgroup$



          Just telling the accuracy does not mean anything in classification problems. The first thing you must do is to calculate your baseline, that is, what is the percentage of your majority class? For the bar plot above it is difficult to measure, can you tell us in percentage, instead of counts? In this way we can assess better your results.



          Also, have you plot a confusion matrix? In this way you can see where your model are getting more wrong and try to infer why this is happening.



          And yes, since you have too many classes to predict and most of them are with low representativeness this will be difficult to overcome. Maybe you can try things such as Oversampling, Undersampling techniques considering one-vs-all approach. This is just an idea, I haven't yet encountering a problem with so many classes to predict.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered 2 days ago









          Victor OliveiraVictor Oliveira

          1365




          1365












          • $begingroup$
            Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
            $endgroup$
            – ranit.b
            2 days ago


















          • $begingroup$
            Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
            $endgroup$
            – ranit.b
            2 days ago
















          $begingroup$
          Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
          $endgroup$
          – ranit.b
          2 days ago




          $begingroup$
          Thanks Victor. I read few articles which introduced me to Oversampling and Undersampling, SMOTE, etc., but that's entirely new for me. Will do more research and get back with specifics. I'll amend my question by adding the Confusion matrix in sometime.
          $endgroup$
          – ranit.b
          2 days ago


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Data Science Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          Use MathJax to format equations. MathJax reference.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46802%2fcan-this-be-a-case-of-multi-class-skewness%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          How to label and detect the document text images

          Tabula Rosettana

          Aureus (color)