Binary classification on a small dataset (< 200 samples)
I have a dataset consisting of 181 samples with 10 features and one target variable. The classes are imbalanced: 41 data points have label 1 and the remaining 140 have label 0. All 10 features are numeric and continuous. I have to perform binary classification, and I have done the following work:

I performed 3-fold cross-validation and got the following accuracy results with various models:

LinearSVC: 0.873
DecisionTreeClassifier: 0.840
Gaussian Naive Bayes: 0.845
Logistic Regression: 0.867
Gradient Boosting Classifier: 0.867
Support vector classifier (RBF): 0.818
Random forest: 0.867
K-nearest neighbors: 0.823

Please guide me: how can I choose the best model for a dataset of this size, and how can I make sure my model is not overfitting? I am thinking of applying random undersampling to handle the imbalanced data.
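In code, the comparison looks roughly like this (a sketch: `make_classification` is a synthetic stand-in for the real data, and all names are illustrative):

    # Sketch of the 3-fold comparison; the synthetic data mirrors the real
    # shape (181 samples, 10 numeric features, ~41 positives).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC, LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=181, n_features=10,
                               weights=[140 / 181], random_state=0)

    models = {
        "LinearSVC": LinearSVC(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
        "Gaussian Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(),
        "Gradient Boosting Classifier": GradientBoostingClassifier(),
        "SVC (RBF)": SVC(kernel="rbf"),
        "Random forest": RandomForestClassifier(),
        "K-nearest neighbors": KNeighborsClassifier(),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=3)  # accuracy by default
        print(f"{name}: {scores.mean():.3f}")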










Tags: machine-learning, python, classification, predictive-modeling, scikit-learn






asked Jan 12 '17 at 1:02 by Archit Garg; edited Jan 13 '17 at 2:43

  • Hey Archit, did you create a test set out of the data you have? If not, please do, and update the accuracies you achieve on the training and test sets. Also calculate precision and recall: with an imbalanced dataset you might be getting decent accuracy while your model really fails on the test set. Please update the question with these metrics. Thanks. – Himanshu Rai, Jan 12 '17 at 6:40

  • Could you give some more context as to what was sampled and which concept you are trying to label? – S van Balen, Jan 12 '17 at 13:52

  • @HimanshuRai I have updated the question; the data is imbalanced. I am thinking of random undersampling, but it would result in the loss of some data points: there would be only 82 observations left. What would you suggest? – Archit Garg, Jan 13 '17 at 2:51

  • Adding an answer. – Himanshu Rai, Jan 13 '17 at 4:11

2 Answers

2

This post might be of interest. Basically, by selecting the model with the best cross-validation score, you already account for overfitting.

Also, you should separate your dataset into two parts. On the first part (validation) you run cross-validation to select a model, in your case LinearSVC. On the second part (testing) you run cross-validation again, but this time only with LinearSVC, to get unbiased estimates of the accuracy.
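A minimal sketch of that two-part protocol, with synthetic stand-in data (all names are illustrative):

    # Split off a test set; select a model by cross-validation on the rest,
    # then score only the winner on the untouched test set.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=181, n_features=10,
                               weights=[140 / 181], random_state=0)

    X_val, X_test, y_val, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    candidates = {"LinearSVC": LinearSVC(),
                  "LogisticRegression": LogisticRegression()}

    # Part 1 (validation): pick the model with the best cross-validation score.
    best_name = max(candidates, key=lambda name: cross_val_score(
        candidates[name], X_val, y_val, cv=3).mean())

    # Part 2 (testing): refit the winner and get an unbiased accuracy estimate.
    best = candidates[best_name].fit(X_val, y_val)
    print(best_name, best.score(X_test, y_test))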






answered Jan 12 '17 at 21:17 by Constantin Weisser; edited Apr 13 '17 at 12:44 by Community

1

Firstly, your dataset is very small for any kind of analysis, so if it is possible to get more data, that would be better. Secondly, as you mentioned, your data is imbalanced, so the accuracy numbers you posted lose much of their meaning: a classifier that simply predicts the majority class for every sample already achieves about 140/181 ≈ 0.77 accuracy. For a better evaluation, calculate precision, recall, and F-score. Thirdly, since you already have less data than you need, don't undersample; instead, oversample the minority class using SMOTE (Synthetic Minority Over-sampling Technique). A stratified k-fold split and a random forest will mostly be your best bet here. But remember that with this little data, it will be very hard to end up with a model that neither underfits nor overfits.
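A minimal sketch of this recipe; it assumes the separate imbalanced-learn package (`pip install imbalanced-learn`) for SMOTE, and uses its pipeline so that oversampling is applied only inside each training fold:

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # resamples training folds only
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # Synthetic stand-in for the real data (181 samples, ~41 positives).
    X, y = make_classification(n_samples=181, n_features=10,
                               weights=[140 / 181], random_state=0)

    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),              # oversample minority class
        ("rf", RandomForestClassifier(random_state=0)),
    ])

    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    scores = cross_validate(pipe, X, y, cv=cv,
                            scoring=["precision", "recall", "f1"])
    for metric in ("precision", "recall", "f1"):
        print(metric, scores[f"test_{metric}"].mean())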






answered Jan 13 '17 at 4:17 by Himanshu Rai; edited by Blenzus

