Bad classification performance of logistic regression on imbalanced data in testing as compared to training

I am trying to fit a logistic regression model to an imbalanced dataset (0.5%/99.5%) with high dimensionality (about 15K features). I used a random forest to select the 200 most important features. There are around 120K observations.

When I fit a logistic regression model on the balanced dataset (using SMOTE for oversampling), F1, recall and precision are good on the training set. On the test set, however, precision and F1 are bad. I assume this makes sense, because the training set contained many more minority cases, while in reality (and in the test set) they make up only a very small percentage. So the algorithm is still looking for more minority cases, which causes the high number of false positives.

What kinds of methods could I try to improve the performance?

I am currently trying different sampling methods for imbalanced datasets, and I also plan to try PCA.

Thanks!!

edited Apr 4 '17 at 6:55

asked Mar 27 '17 at 18:48

Alice

31110

Did you center and scale your features?
– stmax
Mar 27 '17 at 19:05

You would probably have to share some of your results to get useful answers.
– oW_
Mar 27 '17 at 19:57

@stmax Most of my features are dummies, so I didn't center and scale them.
– Alice
Mar 27 '17 at 20:25

I had almost the same problem with imbalanced data and binary classification. F1, recall and precision were good on the training set, but bad on the test set. (I also used SMOTE to over-sample the training set.) Then I tried everything that @D.W. suggested, but didn't succeed in improving my test results. Did you manage to improve performance, and how?
– vitez koja
May 18 '18 at 13:15

Try stratified sampling; it is usually useful for imbalanced classification.
– Moon
Jan 14 at 21:41


classification logistic-regression unbalanced-classes


3 Answers

I suspect the reason is that the class balance in your test set is different from the class balance in your training set. That will throw everything off. The fundamental assumption made by statistical machine learning methods (including logistic regression) is that the distribution of data in the test set matches the distribution of data in the training set. SMOTE can throw that off.

It sounds like you have used SMOTE to augment the training set by adding additional synthetic positive instances (i.e., oversampling the minority class) -- but you haven't added any negative instances. So, the class balance in the training set might have shifted from 0.5%/99.5% to something like (say) 10%/90%, while the class balance in the test set remains 0.5%/99.5%. That's bad; it will cause the classifier to over-predict positive instances. For some classifiers, it's not a major problem, but I expect that logistic regression might be more sensitive to this mismatch between training distribution and test distribution.
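To make the effect of the class-balance shift concrete, here is a small sketch that computes precision for a classifier whose true-positive and false-positive rates stay fixed while the share of positives changes. The rates 0.9 and 0.05 are invented for illustration; they are not results from the question.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR/FPR at a given share of positives."""
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 90% recall, 5% false-positive rate.
tpr, fpr = 0.9, 0.05

for prevalence in (0.5, 0.10, 0.005):
    p = precision_at_prevalence(tpr, fpr, prevalence)
    print(f"positives = {prevalence:6.1%}  ->  precision = {p:.3f}")
```

With these made-up rates, precision is high when half of the data is positive and falls below 10% at a 0.5% positive rate, even though the classifier itself has not changed. Metrics computed on an oversampled training set can therefore look much better than the same metrics on an untouched test set.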

Here are two candidate solutions for the problem that you can try:

Stop using SMOTE. Ensure the training set has the same distribution as the test set. SMOTE might actually be unnecessary in your situation.

Continue to augment the training set using SMOTE as you're currently doing, and compensate for the train/test mismatch by shifting the classification threshold. Logistic regression produces an estimated probability that a particular instance is from the positive class. Typically, you compare that probability to the threshold $0.5$ to classify the instance as positive or negative. Because the augmented training set over-represents positives, those estimated probabilities are inflated, so the threshold has to move up to compensate: replace $0.5$ with approximately $r/(1+r)$, where $r$ is the ratio of the positive-to-negative odds in your training set after augmentation to the odds before (e.g., if augmentation shifted the training set from 0.5%/99.5% to 10%/90%, then $r=(10/90)/(0.5/99.5)\approx 22$, which gives a threshold of about $0.96$). Alternatively, you can use cross-validation to find a suitable threshold that maximizes the F1 score (or some other metric).
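The cross-validation option can be sketched roughly as follows. The sketch assumes `y_true` and `scores` hold labels and predicted probabilities for data the model was not trained on (for example, out-of-fold predictions on the original, non-augmented data); the arrays below are invented for illustration, and the function names are hypothetical.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the best F1."""
    best_threshold, best_f1 = 0.5, -1.0
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(y_true, scores, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Made-up held-out labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9]

threshold, f1 = best_f1_threshold(y_true, scores)
print("F1 at 0.5:", round(f1_at_threshold(y_true, scores, 0.5), 3))
print("best threshold:", threshold, "F1:", round(f1, 3))
```

The threshold should be chosen on held-out data with the real class balance and then checked once more on a separate test set, since a threshold tuned and evaluated on the same data will look better than it is.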

Incidentally, I recommend you make sure to use regularization with your logistic regression model, and use cross-validation to select the regularization hyper-parameter. There's nothing wrong with 15K features if you have 120K instances in your training set, but you might want to regularize it strongly (choose a large regularization parameter) to avoid overfitting.
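As a rough illustration of selecting the regularization strength by cross-validation, here is a sketch using scikit-learn's `LogisticRegressionCV`. The synthetic dataset, the grid of values, and the choice of average precision as the selection metric are all assumptions made for the example, not details from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data: roughly 2% positives, 50 features.
X, y = make_classification(
    n_samples=4000,
    n_features=50,
    n_informative=8,
    weights=[0.98, 0.02],
    random_state=0,
)

# Stratified folds keep the rare class represented in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Note: in scikit-learn, C is the INVERSE of the regularization strength,
# so a small C means strong regularization.
grid = np.logspace(-4, 2, 10)
model = LogisticRegressionCV(
    Cs=grid,
    cv=cv,
    scoring="average_precision",  # assumed metric; less sensitive to imbalance than accuracy
    max_iter=2000,
)
model.fit(X, y)

selected_C = float(np.ravel(model.C_)[0])
print("selected C:", selected_C)
```

Because scikit-learn parameterizes the penalty through `C`, "regularize strongly" corresponds to a small `C` here rather than a large one.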

Finally, understand that dealing with severe class imbalance such as you have is just hard. Fortunately, there are many techniques available. Do some reading and research (including on Stats.SE) and you should be able to find other methods you could try, if these don't work well enough.

answered Apr 4 '17 at 16:10

D.W.

2,103628


The dimensionality of your data is an important consideration here. Having 15K features will likely lead to very poor results. The higher the dimensionality of your features, the more training examples you will need. For a shallow method such as logistic regression, a general rule of thumb is to have at least $10 \times \#\text{features}$ training examples. So unless you have over 150K examples, using 15K features is not recommended. Think about what kinds of questions need to be answered with your data and how you can remodel your data to better answer those questions.

Furthermore, logistic regression is not recommended for skewed datasets. There are many algorithms that are well suited to problems with skewed datasets. Specifically, anomaly detection algorithms can learn the distribution of a single class (event not occurring) and then flag an anomaly (event occurring) when an instance falls sufficiently far outside the learned distribution. You can use this to get the probability of an event occurring, based on a statistical test (p-value) that compares the instance's features with the learned distribution.

The simplest method would be a generalized likelihood ratio test (GLRT). However, I think you will most likely have more luck using a k-NN based method for skewed datasets.
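One possible reading of a k-NN based approach is sketched below: index only the majority ("event not occurring") class, score each new point by its distance to the k-th nearest indexed point, and flag points whose score exceeds a cutoff calibrated on held-out majority data. The data, the value of `k`, and the 99% cutoff are all made up for illustration; this is a sketch of the general idea, not necessarily the exact method the answer has in mind.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical stand-in data: "normal" points near the origin and a few
# "event" points shifted far away from them.
X_normal = rng.normal(0.0, 1.0, size=(2000, 5))
X_event = rng.normal(6.0, 1.0, size=(10, 5))

# Fit the index on the majority class only. A ball tree avoids
# brute-force distance computation at query time.
k = 10
index = NearestNeighbors(n_neighbors=k, algorithm="ball_tree").fit(X_normal)


def knn_score(X):
    """Anomaly score: distance to the k-th nearest indexed 'normal' point."""
    distances, _ = index.kneighbors(X)
    return distances[:, -1]


# Calibrate a cutoff on held-out normal data: flag the top 1% as anomalous.
X_holdout = rng.normal(0.0, 1.0, size=(500, 5))
cutoff = np.quantile(knn_score(X_holdout), 0.99)

flags = knn_score(X_event) > cutoff
print("events flagged:", int(flags.sum()), "of", len(flags))
```

On real data the separation will be far less clean than in this toy setup, so the cutoff is best chosen against whatever false-positive rate is acceptable for the application.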

answered Mar 27 '17 at 20:28

JahKnows

4,787525

Thanks! I am only using 200 features. Wouldn't KNN take too much time?
– Alice
Mar 27 '17 at 22:19

@Alice Not if it caches neighbors in a tree structure such as a K-D tree or ball tree.
– K3---rnc
Mar 28 '17 at 1:02

Also check out the idea of using graph-like models to retain the k-NN structure of your data. papers.nips.cc/paper/2851-learning-minimum-volume-sets.pdf
– JahKnows
Mar 28 '17 at 2:22

Why do you say logistic regression isn't recommended for imbalanced data sets? Everything I've read suggests logistic regression is perfectly reasonable for imbalanced data sets.
– D.W.
Apr 4 '17 at 16:02


I use the same risky approach.

The danger is that feature selection is done with a non-linear model (random forest), while the model that is then fitted is linear (logistic regression).

Alternatives:
- Try a tree-based algorithm, or
- Use PCA, which is linear, and test logistic regression again.
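The second alternative could look roughly like the sketch below, which chains scaling, PCA and logistic regression in one scikit-learn pipeline and scores it with stratified cross-validation. The synthetic data, the 20 components, and the average-precision metric are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: roughly 5% positives, 100 features.
X, y = make_classification(
    n_samples=3000,
    n_features=100,
    n_informative=10,
    weights=[0.95, 0.05],
    random_state=0,
)

# Scaling and PCA live inside the pipeline, so they are re-fit on each
# training fold and the held-out fold never influences the projection.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),  # hypothetical choice; tune on your own data
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print("mean average precision:", scores.mean())
```

Comparing this score with the one obtained from the random-forest-selected features, under the same folds and metric, is one way to check whether the feature selection step is actually the problem.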

answered yesterday

FrancoSwiss

7115
