Imbalanced dataset in text classification












I have a dataset collected from Facebook that consists of 10 classes, each with 2,500 posts. However, when I count the number of unique words in each class, the counts differ, as shown in the figure (word count in each class).



Is this an imbalanced problem because of the word counts, or is it balanced according to the number of posts? And what is the best solution if it is imbalanced?







python nlp class-imbalance imbalanced-learn






asked 13 hours ago by mtesta010

  • Could you please post your approach/code here?
    – Sunil, 11 hours ago

  • Which code? I asked a general question based on the number of samples.
    – mtesta010, 10 hours ago




























2 Answers
I don't know whether I understood your question correctly. If you count all words within a class, a word such as "the" is counted every time it appears. If you count only the unique words, "the" is counted once. This is why your counts differ from your plot: each class can have a different number of unique words.
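To make the difference between the two counts concrete, here is a minimal sketch. It assumes simple lowercase whitespace tokenization, and the tiny `posts_by_class` dictionary and the `word_stats` helper are hypothetical stand-ins for the real data and pipeline:

```python
# Hypothetical toy data: class label -> list of posts (a stand-in for the real Facebook posts).
posts_by_class = {
    "sports": ["the match was great", "the team won the match"],
    "politics": ["the vote is tomorrow", "parliament debates the new tax law"],
}


def word_stats(posts):
    """Return (total_tokens, unique_words) for a list of posts."""
    # Assumption: lowercase whitespace splitting is a good enough tokenizer for illustration.
    tokens = [word for post in posts for word in post.lower().split()]
    return len(tokens), len(set(tokens))


stats = {label: word_stats(posts) for label, posts in posts_by_class.items()}

for label, (total, unique) in stats.items():
    n_posts = len(posts_by_class[label])
    print(f"{label}: {n_posts} posts, {total} tokens, {unique} unique words")
```

Both toy classes contain the same number of posts, yet their token counts and unique-word counts differ, which is the situation described in the question.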







answered 10 hours ago by matze

  • These are counts of unique words after removing stop words; the counts differ because the post lengths differ.
    – mtesta010, 10 hours ago


















Thank you for your message, Ahmed. A few things to point out:



  1. Is this an imbalanced problem? Which problem? What you describe is not a problem in itself; it is data.

  2. What analysis are you going to do? Some analyses need the posts and others need these keywords.

  3. Which method will you use for that analysis? Some methods take keywords as input and others take posts.


As for the numbers themselves: not necessarily. The smallest class has 20% of the population of the largest, and the scale is fairly large (20,000 samples), so this is not necessarily an imbalanced class distribution. Again, consider what you want to do with this data; that determines the answer much more accurately.
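The ratio of the smallest class to the largest can be computed for whichever unit the chosen method consumes. As a minimal sketch, assuming the method takes posts as input and using the 10 classes of 2,500 posts from the question (the `labels` list and the `imbalance_ratio` helper are illustrative names, not from any library):

```python
from collections import Counter

# Hypothetical labels: one entry per post, 10 classes with 2,500 posts each (as in the question).
labels = [f"class_{i}" for i in range(10) for _ in range(2500)]


def imbalance_ratio(labels):
    """Smallest class size divided by largest class size (1.0 means perfectly balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


ratio = imbalance_ratio(labels)
print(f"posts per class: {dict(Counter(labels))}")
print(f"min/max ratio: {ratio:.2f}")
```

Counted by posts, the ratio here is 1.0, so the classes are balanced in that sense. The same helper could be applied to a list with one label per unique word if the method takes keywords as input instead.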



I hope this helps. If you describe the task you want to do, I can post a solution here.



Cheers,







answered 4 hours ago by Kasra Manshaei
