imbalanced dataset in text classififaction
$begingroup$
I have a data set collected from Facebook consists of 10 class, each class have 2500 posts, but when count number of unique words in each class, they has different count as shown in the figure
Is this an imbalanced problem due to word count , or balanced according number of posts. and what is the best solution if it imbalanced?
python nlp class-imbalance imbalanced-learn
New contributor
$endgroup$
add a comment |
$begingroup$
I have a data set collected from Facebook consists of 10 class, each class have 2500 posts, but when count number of unique words in each class, they has different count as shown in the figure
Is this an imbalanced problem due to word count , or balanced according number of posts. and what is the best solution if it imbalanced?
python nlp class-imbalance imbalanced-learn
New contributor
$endgroup$
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
I have a data set collected from Facebook consists of 10 class, each class have 2500 posts, but when count number of unique words in each class, they has different count as shown in the figure
Is this an imbalanced problem due to word count , or balanced according number of posts. and what is the best solution if it imbalanced?
python nlp class-imbalance imbalanced-learn
New contributor
$endgroup$
I have a data set collected from Facebook consists of 10 class, each class have 2500 posts, but when count number of unique words in each class, they has different count as shown in the figure
Is this an imbalanced problem due to word count , or balanced according number of posts. and what is the best solution if it imbalanced?
python nlp class-imbalance imbalanced-learn
python nlp class-imbalance imbalanced-learn
New contributor
New contributor
New contributor
asked 13 hours ago
mtesta010mtesta010
11
11
New contributor
New contributor
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
I don't now wether I got your question right. But if you count all words within a class, for example, the word "the" is counted everytime it appears. However, if you count the unique words the word "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
New contributor
$endgroup$
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
Thank you for your message Ahmed. There are things to point out:
Is this an imbalanced problem? Which problem? THIS is not a problem. This is data.- What analysis is going to be done? In some cases you need posts and in some you need these keywords.
- What method is going to be done for that analysis? Some methods get keywords as input and some get posts.
But about the numbers themselves; Not necessarily. The smallest class has 20% of the largest population and moreover, the scale is pretty high (20000 samples). So it is not necessarily an imbalanced class distribution. Again, see what you want to do with this data. That determines the answer much more accurate.
Hope it helped. If you write about the task you want to do I can post the solution here.
Cheers,
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
mtesta010 is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f45163%2fimbalanced-dataset-in-text-classififaction%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
I don't now wether I got your question right. But if you count all words within a class, for example, the word "the" is counted everytime it appears. However, if you count the unique words the word "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
New contributor
$endgroup$
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
I don't now wether I got your question right. But if you count all words within a class, for example, the word "the" is counted everytime it appears. However, if you count the unique words the word "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
New contributor
$endgroup$
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
I don't now wether I got your question right. But if you count all words within a class, for example, the word "the" is counted everytime it appears. However, if you count the unique words the word "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
New contributor
$endgroup$
I don't now wether I got your question right. But if you count all words within a class, for example, the word "the" is counted everytime it appears. However, if you count the unique words the word "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
New contributor
New contributor
answered 10 hours ago
matzematze
112
112
New contributor
New contributor
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
$begingroup$
count of unique words after remove stop words,the count differ because posts lengths are different
$endgroup$
– mtesta010
10 hours ago
add a comment |
$begingroup$
Thank you for your message Ahmed. There are things to point out:
Is this an imbalanced problem? Which problem? THIS is not a problem. This is data.- What analysis is going to be done? In some cases you need posts and in some you need these keywords.
- What method is going to be done for that analysis? Some methods get keywords as input and some get posts.
But about the numbers themselves; Not necessarily. The smallest class has 20% of the largest population and moreover, the scale is pretty high (20000 samples). So it is not necessarily an imbalanced class distribution. Again, see what you want to do with this data. That determines the answer much more accurate.
Hope it helped. If you write about the task you want to do I can post the solution here.
Cheers,
$endgroup$
add a comment |
$begingroup$
Thank you for your message Ahmed. There are things to point out:
Is this an imbalanced problem? Which problem? THIS is not a problem. This is data.- What analysis is going to be done? In some cases you need posts and in some you need these keywords.
- What method is going to be done for that analysis? Some methods get keywords as input and some get posts.
But about the numbers themselves; Not necessarily. The smallest class has 20% of the largest population and moreover, the scale is pretty high (20000 samples). So it is not necessarily an imbalanced class distribution. Again, see what you want to do with this data. That determines the answer much more accurate.
Hope it helped. If you write about the task you want to do I can post the solution here.
Cheers,
$endgroup$
add a comment |
$begingroup$
Thank you for your message Ahmed. There are things to point out:
Is this an imbalanced problem? Which problem? THIS is not a problem. This is data.- What analysis is going to be done? In some cases you need posts and in some you need these keywords.
- What method is going to be done for that analysis? Some methods get keywords as input and some get posts.
But about the numbers themselves; Not necessarily. The smallest class has 20% of the largest population and moreover, the scale is pretty high (20000 samples). So it is not necessarily an imbalanced class distribution. Again, see what you want to do with this data. That determines the answer much more accurate.
Hope it helped. If you write about the task you want to do I can post the solution here.
Cheers,
$endgroup$
Thank you for your message Ahmed. There are things to point out:
Is this an imbalanced problem? Which problem? THIS is not a problem. This is data.- What analysis is going to be done? In some cases you need posts and in some you need these keywords.
- What method is going to be done for that analysis? Some methods get keywords as input and some get posts.
But about the numbers themselves; Not necessarily. The smallest class has 20% of the largest population and moreover, the scale is pretty high (20000 samples). So it is not necessarily an imbalanced class distribution. Again, see what you want to do with this data. That determines the answer much more accurate.
Hope it helped. If you write about the task you want to do I can post the solution here.
Cheers,
answered 4 hours ago
Kasra ManshaeiKasra Manshaei
3,7041035
3,7041035
add a comment |
add a comment |
mtesta010 is a new contributor. Be nice, and check out our Code of Conduct.
mtesta010 is a new contributor. Be nice, and check out our Code of Conduct.
mtesta010 is a new contributor. Be nice, and check out our Code of Conduct.
mtesta010 is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f45163%2fimbalanced-dataset-in-text-classififaction%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Could you please post your approach/code here?
$endgroup$
– Sunil
11 hours ago
$begingroup$
which code??I ask a general question based on number of samples??
$endgroup$
– mtesta010
10 hours ago