Using SMOTE for synthetic data generation to improve performance on unbalanced data


























I presently have a dataset with 21392 samples, of which 16948 belong to the majority class (class A) and the remaining 4444 belong to the minority class (class B). I am using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data, but I am unsure what proportion of synthetic samples should ideally be generated to ensure good classification performance of machine learning/deep learning models.



I have a few options in mind:

1. Generate 21392 new samples, with 16904 majority samples of class A and the remaining 4488 minority samples of class B, then merge the original and synthetically generated samples. The key drawback, I believe, is that the percentage of minority samples in the overall dataset (original + new) would remain more or less the same, which defeats the purpose of oversampling the minority class.
2. Generate the same 21392 new samples (16904 majority, 4488 minority), but merge only the newly generated minority samples with the original data. This way, the percentage of minority (class B) samples in the overall data would increase from 4444/21392 = 20.774 % to (4444+4488)/(21392+4488) = 34.513 % (the arithmetic is checked in the sketch after this list). This, I believe, is the purpose of SMOTE: to increase the number of minority samples and reduce the imbalance in the overall dataset.
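As a quick sanity check of the percentages quoted in option 2 (a purely illustrative sketch; the variable names are mine, not from the question):

    # Sanity check of the class proportions quoted in option 2.
    n_majority, n_minority = 16948, 4444   # original counts (class A, class B)
    n_new_minority = 4488                  # synthetic minority samples kept in option 2

    before = n_minority / (n_majority + n_minority)
    after = (n_minority + n_new_minority) / (n_majority + n_minority + n_new_minority)

    print(f"minority share before: {before:.3%}")  # ~20.774%
    print(f"minority share after:  {after:.3%}")   # ~34.513%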



I am fairly new to SMOTE and would appreciate any suggestions or comments on which of these two options you find better, or any other option I should consider.










bigdata training sampling smote ai

asked 2 days ago by JChat
          1 Answer

First of all, you have to split your dataset into train and test sets before doing any over- or under-sampling. If you apply either of your strategies and then split the data, you will bias your model: you would be introducing synthetic points into your future test set that do not exist in reality, so your score estimates would be unreliable.
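A common way to enforce this split-before-resample rule, especially under cross-validation, is imblearn's pipeline, which re-fits SMOTE on each training fold and leaves the validation folds untouched (a minimal sketch; the classifier and scoring choices are arbitrary illustrations, not from the answer):

    from imblearn.pipeline import make_pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # SMOTE is applied only to the training portion of each fold,
    # so no synthetic points leak into the validation folds.
    model = make_pipeline(SMOTE(random_state=42), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")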



After splitting your data, apply SMOTE only to the training set. If you use SMOTE from imblearn, it will automatically balance the classes for you. You can also use the sampling_strategy parameter if you do not want perfect balancing, or try different strategies.



          https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#smote-adasyn



          So, basically, you would have something like this:



    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Hold out a test set first; stratify=y keeps the class ratio
    # the same in both splits. Then oversample only the training data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
    X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)


Then you continue fitting your model on X_resampled, y_resampled. Above, X is your feature matrix and y is your target labels.
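If perfect balancing is not what you want (e.g. to approximate your option 2 rather than a 50/50 split), sampling_strategy accepts a float giving the desired minority-to-majority ratio after resampling. A sketch, with 0.5 as an arbitrary illustrative value:

    from imblearn.over_sampling import SMOTE

    # sampling_strategy=0.5: after resampling, the minority class holds
    # half as many samples as the majority class (minority/majority = 0.5).
    smote = SMOTE(sampling_strategy=0.5, random_state=42)
    X_partial, y_partial = smote.fit_resample(X_train, y_train)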






answered 2 days ago by Victor Oliveira













• Thanks for your answer. If I understand correctly, the resampled X and y are the resampled version of the entire training data, with a better class balance in y? Also, do I then train my machine learning model on the resampled training data, or do I need to merge the original training data with the resampled training data (to get a larger training set) and train on that? Could you clarify which approach is correct? I am confused about which of these is right. – JChat, 2 days ago










• For the first question: yes, it is the resampled data with better class balancing. No, you do not need to merge the training and resampled data; the resampled data already contains all the original training data plus the newly generated samples. It is as simple as that; these frameworks do all the work for us. – Victor Oliveira, 2 days ago










• Great. So I wonder whether this technique can also be used to generate more data overall (I mean, to increase the size of the training set)? – JChat, 2 days ago










• That is exactly what is happening: you are creating more data points for the minority class. If you check the shape attribute before and after resampling, you will see that the data has changed. Adding even more data beyond that is something I would not recommend, as you would be introducing noise into your dataset. Also, look at the imblearn docs to see what options you have to test. – Victor Oliveira, 2 days ago
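To see the growth the last comment describes, one can compare class counts and shapes before and after resampling (an illustrative check, assuming the X_train/y_train and X_resampled/y_resampled from the answer's snippet):

    from collections import Counter

    # The minority class grows while the majority class is untouched,
    # so the total number of training rows increases.
    print("before:", Counter(y_train), "shape:", X_train.shape)
    print("after: ", Counter(y_resampled), "shape:", X_resampled.shape)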










