Using SMOTE for synthetic data generation to improve performance on unbalanced data
I have a dataset with 21392 samples, of which 16948 belong to the majority class (class A) and the remaining 4444 belong to the minority class (class B). I am using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data, but I am unsure what proportion of synthetic samples should ideally be generated to ensure good classification performance of machine learning/deep learning models.

I have two options in mind:

1. Generate 21392 new samples, with 16904 majority samples of class A and the remaining 4488 minority samples of class B, then merge the original and synthetically generated samples. The key drawback, I believe, is that the percentage of minority samples in the overall dataset (original + new) would remain more or less the same, which seems to defeat the purpose of oversampling the minority class.

2. Generate the same 21392 new samples (16904 majority and 4488 minority), but merge the original data with only the newly generated minority samples. This way, the percentage of minority (class B) samples in the overall data would increase from 4444/21392 = 20.774% to (4444+4488)/(21392+4488) = 34.513%, which I believe is the purpose of SMOTE: to increase the number of minority samples and reduce the imbalance in the overall dataset. (A quick arithmetic check of these fractions is sketched below.)

I am fairly new to SMOTE and would highly appreciate suggestions or comments on which of these two options you find better, or any other option I should consider.
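For reference, here is a quick sanity check of the class fractions the two options would produce, using only the counts stated above (plain Python arithmetic):

# Counts taken from the question.
n_major, n_minor = 16948, 4444        # original class A / class B counts
new_major, new_minor = 16904, 4488    # synthetic samples proposed per option

# Option 1: merge all synthetic samples -> the minority share barely moves.
opt1 = (n_minor + new_minor) / (n_major + n_minor + new_major + new_minor)

# Option 2: merge only the synthetic minority samples -> the minority share rises.
opt2 = (n_minor + new_minor) / (n_major + n_minor + new_minor)

print(f"original minority share: {n_minor / (n_major + n_minor):.3%}")  # 20.774%
print(f"option 1 minority share: {opt1:.3%}")                           # 20.877%
print(f"option 2 minority share: {opt2:.3%}")                           # 34.513%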
Tags: bigdata, training, sampling, smote, ai
asked 2 days ago by JChat
1 Answer
First of all, you have to split your data set into train and test sets before doing any over- or under-sampling. If you apply either of your strategies first and only split the data afterwards, you will bias your model, simply because you would be introducing points into your future test set that do not actually exist, and your score estimates would be unreliable.

After splitting your data, apply SMOTE only to the training set. If you use SMOTE from imblearn, it will automatically balance the classes for you. You can also change that with a parameter if you do not want perfect balancing, or try different strategies (a sketch of this is shown after the code below).
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#smote-adasyn
So, basically, you would have something like this:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first (stratify keeps the class ratio similar in both splits),
# then oversample only the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
Then you continue by fitting your model on X_resampled, y_resampled. Above, X is your feature matrix and y is your target labels.
answered 2 days ago by Victor Oliveira
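Related to the note above about not wanting perfect balancing: here is a minimal sketch of one way to target a specific minority share instead, assuming imbalanced-learn's sampling_strategy parameter (a float means the desired minority-to-majority ratio after resampling) and continuing from the X_train/y_train produced in the snippet above:

from imblearn.over_sampling import SMOTE

# A float sampling_strategy is the target (minority count) / (majority count)
# after resampling. For roughly a 34.5% minority share, as in option 2 of the
# question, the ratio is 0.345 / 0.655, i.e. about 0.53.
smote = SMOTE(sampling_strategy=0.53, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

The default (sampling_strategy='auto') oversamples the minority class until both classes are the same size, which is what the answer's snippet does.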
Thanks for your answer. If I understand correctly, the resampled X and y are a resampled version of the entire training data, with better balance between the classes of y? Also, do I then train my machine learning model on the resampled training data, or do I need to merge the original training data with the resampled training data (to get a larger training set) and train on that? Could you kindly clarify? I am confused about which approach is correct.
– JChat, 2 days ago
For the first question: yes, it would be the resampled data with better class balancing. No, you do not need to merge the training and resampled data; the resampled output already contains all of the original training data plus the newly generated samples. It is as simple as that, these frameworks do all the work for us.
– Victor Oliveira, 2 days ago
Great. So I wonder whether it is also possible to generate more data with this technique (I mean, to increase the size of the training set)?
– JChat, 2 days ago
But that is what is happening: you are creating more data points for the minority class. If you check the shape attribute before and after resampling, you will see that the data has grown. Adding even more data beyond that is something I would not recommend, as you would be introducing noise into your data set. Also, look at the imblearn docs to see what other options you can test.
– Victor Oliveira, 2 days ago
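To see the effect described in the last comment, one can compare shapes and class counts before and after resampling (an illustrative check, reusing the variables from the answer's snippet):

from collections import Counter

print("X_train shape before resampling:", X_train.shape)
print("class counts before resampling:", Counter(y_train))

print("X_resampled shape after resampling:", X_resampled.shape)
print("class counts after resampling:", Counter(y_resampled))
# With default SMOTE settings the minority class is oversampled until both
# classes have the same count, so X_resampled has more rows than X_train.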