Using SMOTE for synthetic data generation to improve performance on unbalanced data


























I presently have a dataset with 21392 samples, of which 16948 belong to the majority class (class A) and the remaining 4444 belong to the minority class (class B). I am using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data, but I am unsure what proportion of synthetic samples should ideally be generated to ensure good classification performance of machine learning/deep learning models.



I have a few options in mind:

1. Generate 21392 new samples, with 16904 majority samples of class A and the remaining 4488 minority samples of class B, then merge the original and synthetically generated samples. The key drawback, I believe, is that the percentage of minority samples in the overall dataset (original + new) would remain more or less the same, which defeats the purpose of oversampling the minority class.
2. Generate the same 21392 new samples (16904 majority, 4488 minority), but merge only the newly generated minority samples with the original data. This way, the percentage of minority (class B) samples in the overall data would increase from 4444/21392 = 20.774 % to (4444+4488)/(21392+4488) = 34.513 % (the arithmetic is checked in the sketch after this list). This, I believe, is the purpose of SMOTE: to increase the number of minority samples and reduce the imbalance in the overall dataset.
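As a quick sanity check of the percentages quoted in option 2 (a purely illustrative sketch; the variable names are mine, not from the question):

    # Sanity check of the class proportions quoted in option 2.
    n_majority, n_minority = 16948, 4444   # original counts (class A, class B)
    n_new_minority = 4488                  # synthetic minority samples kept in option 2

    before = n_minority / (n_majority + n_minority)
    after = (n_minority + n_new_minority) / (n_majority + n_minority + n_new_minority)

    print(f"minority share before: {before:.3%}")  # ~20.774%
    print(f"minority share after:  {after:.3%}")   # ~34.513%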



I am fairly new to SMOTE and would appreciate any suggestions or comments on which of these two options you find better, or any other option I should consider.










bigdata training sampling smote ai

asked 2 days ago by JChat
          1 Answer

First of all, you have to split your dataset into train and test sets before doing any over- or under-sampling. If you apply either of your strategies and then split the data, you will bias your model: you would be introducing synthetic points into your future test set that do not exist in reality, so your score estimates would be unreliable.
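A common way to enforce this split-before-resample rule, especially under cross-validation, is imblearn's pipeline, which re-fits SMOTE on each training fold and leaves the validation folds untouched (a minimal sketch; the classifier and scoring choices are arbitrary illustrations, not from the answer):

    from imblearn.pipeline import make_pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # SMOTE is applied only to the training portion of each fold,
    # so no synthetic points leak into the validation folds.
    model = make_pipeline(SMOTE(random_state=42), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")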



After splitting your data, apply SMOTE only to the training set. If you use SMOTE from imblearn, it will automatically balance the classes for you. You can also use the sampling_strategy parameter if you do not want perfect balancing, or try different strategies.



          https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#smote-adasyn



          So, basically, you would have something like this:



    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Hold out a test set first; stratify=y keeps the class ratio
    # the same in both splits. Then oversample only the training data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
    X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)


Then you continue fitting your model on X_resampled, y_resampled. Above, X is your feature matrix and y is your target labels.
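If perfect balancing is not what you want (e.g. to approximate your option 2 rather than a 50/50 split), sampling_strategy accepts a float giving the desired minority-to-majority ratio after resampling. A sketch, with 0.5 as an arbitrary illustrative value:

    from imblearn.over_sampling import SMOTE

    # sampling_strategy=0.5: after resampling, the minority class holds
    # half as many samples as the majority class (minority/majority = 0.5).
    smote = SMOTE(sampling_strategy=0.5, random_state=42)
    X_partial, y_partial = smote.fit_resample(X_train, y_train)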






answered 2 days ago by Victor Oliveira













• Thanks for your answer. If I understand correctly, the resampled X and y are the resampled version of the entire training data, with a better class balance in y? Also, do I then train my machine learning model on the resampled training data, or do I need to merge the original training data with the resampled training data (to get a larger training set) and train on that? Could you clarify which approach is correct? I am confused about which of these is right. – JChat, 2 days ago










• For the first question: yes, it is the resampled data with better class balancing. No, you do not need to merge the training and resampled data; the resampled data already contains all the original training data plus the newly generated samples. It is as simple as that; these frameworks do all the work for us. – Victor Oliveira, 2 days ago










• Great. So I wonder whether this technique can also be used to generate more data overall (I mean, to increase the size of the training set)? – JChat, 2 days ago










• That is exactly what is happening: you are creating more data points for the minority class. If you check the shape attribute before and after resampling, you will see that the data has changed. Adding even more data beyond that is something I would not recommend, as you would be introducing noise into your dataset. Also, look at the imblearn docs to see what options you have to test. – Victor Oliveira, 2 days ago
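To see the growth the last comment describes, one can compare class counts and shapes before and after resampling (an illustrative check, assuming the X_train/y_train and X_resampled/y_resampled from the answer's snippet):

    from collections import Counter

    # The minority class grows while the majority class is untouched,
    # so the total number of training rows increases.
    print("before:", Counter(y_train), "shape:", X_train.shape)
    print("after: ", Counter(y_resampled), "shape:", X_resampled.shape)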










