Binary classification on a small dataset (< 200 samples)
I have a dataset consisting of 181 samples (the classes are not balanced: 41 data points have label 1 and the remaining 140 have label 0), with 10 features and one target variable. The 10 features are numeric and continuous. I have to perform binary classification. I have done the following work:
I performed 3-fold cross-validation and got the following accuracy results with various models (a rough sketch of the setup is shown below the results):
LinearSVC: 0.873
DecisionTreeClassifier: 0.840
Gaussian Naive Bayes: 0.845
Logistic Regression: 0.867
Gradient Boosting Classifier: 0.867
Support Vector Classifier (RBF kernel): 0.818
Random Forest: 0.867
K-Nearest Neighbors: 0.823
Please guide me: how should I choose the best model for a dataset of this size, and how can I make sure my model is not overfitting? I am thinking of applying random undersampling to handle the imbalanced data.
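For reference, a minimal sketch of how such a comparison can be set up with scikit-learn; the estimators shown, their parameters, and the arrays X and y (the 181-by-10 feature matrix and the 0/1 labels) are illustrative assumptions, not the code actually used.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

models = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    # 3-fold cross-validated accuracy, as reported in the list above
    scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")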
machine-learning python classification predictive-modeling scikit-learn
Hey Archit, did you create a test set out of the data you have? If not, please do, and report the accuracies you achieve on the training set and the test set. Also calculate precision and recall, because with an imbalanced dataset you might be getting decent accuracy while your model really fails on the test set. Please update the question with these metrics. Thanks.
– Himanshu Rai
Jan 12 '17 at 6:40
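A small sketch of what this comment suggests, assuming scikit-learn, with X and y again standing in for the 181 samples and their labels; the choice of LinearSVC and the split size are only examples.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import LinearSVC

# Hold out a stratified test set so both classes are represented in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LinearSVC(max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", accuracy_score(y_test, y_pred))
print("precision:     ", precision_score(y_test, y_pred))
print("recall:        ", recall_score(y_test, y_pred))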
Could you give some more context as to what was sampled and which concept you are trying to label?
– S van Balen
Jan 12 '17 at 13:52
@HimanshuRai I have updated the question; the data is imbalanced. I am thinking of random undersampling, but that would discard data points and leave only 82 observations. What would you suggest?
– Archit Garg
Jan 13 '17 at 2:51
Adding an answer.
– Himanshu Rai
Jan 13 '17 at 4:11
asked Jan 12 '17 at 1:02 by Archit Garg, edited Jan 13 '17 at 2:43
2 Answers
This post might be of interest. Basically, by selecting the model with the best cross-validation score, you already account for overfitting to some extent.
Also, you should separate your dataset into two parts. On the first part (validation) you run cross-validation to select a model, in your case LinearSVC. On the second part (testing) you run cross-validation again, but this time only with LinearSVC, to get an unbiased estimate of its accuracy.
answered Jan 12 '17 at 21:17 by Constantin Weisser
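A minimal sketch of this two-stage scheme, assuming scikit-learn, with X and y as the 181 samples and their labels; the candidate models, the 70/30 split, and the parameters are illustrative assumptions.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Part 1 is used only to pick a model, part 2 only to estimate its accuracy.
X_sel, X_eval, y_sel, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

candidates = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Model selection: cross-validate every candidate on part 1.
cv_means = {name: cross_val_score(est, X_sel, y_sel, cv=3).mean()
            for name, est in candidates.items()}
best = max(cv_means, key=cv_means.get)

# Unbiased estimate: cross-validate only the chosen model on part 2.
final = cross_val_score(candidates[best], X_eval, y_eval, cv=3)
print(best, cv_means[best], final.mean())

With only 181 samples the held-out part is small, so the final estimate will have high variance; repeating the whole procedure over several random splits gives a more stable picture.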
Firstly, your amount of data is very small for any kind of analysis, so if it is possible to get more data, that would be better. Secondly, since you mention that your data is imbalanced, the accuracy figures you have posted lose most of their meaning: 140 of the 181 samples belong to one class, so a model can score around 0.77 simply by predicting that class for every sample. For better evaluation, calculate precision, recall and F-score. Thirdly, since your data is already scarce, don't undersample; instead oversample the minority class, for example with a SMOTE (Synthetic Minority Over-sampling Technique) implementation. A stratified k-fold together with a Random Forest will most likely be your best bet here. But remember: with this little data it will be very hard to obtain a model that neither underfits nor overfits.
answered Jan 13 '17 at 4:17 by Himanshu Rai
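A sketch of that recipe, assuming scikit-learn plus the imbalanced-learn package (which provides SMOTE and a pipeline that applies it only inside each training fold); X and y are again assumed to hold the 181 samples and labels, and the hyperparameters are illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# SMOTE sits inside the pipeline, so synthetic samples are created only
# from the training fold and never leak into the validation fold.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv,
                        scoring=["precision", "recall", "f1"])

print("precision:", scores["test_precision"].mean())
print("recall:   ", scores["test_recall"].mean())
print("f1:       ", scores["test_f1"].mean())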