Bad classification performance of logistic regression on imbalanced data in testing as compared to training












2












$begingroup$


I am trying to fit a logistic regression model to an imbalanced dataset (0.5%/99.5%) with high dimensionality (about 15K features). I used a random forest to select the top 200 most important features. There are around 120K observations.

When I fit a logistic regression model on the balanced dataset (using SMOTE for oversampling), F1, recall, and precision are good on the training set. But on the test set, precision and F1 are bad. I assume this makes sense: in training there were many more minority cases, while in reality/testing there is only a very small percentage. So the algorithm is still looking for more minority cases, which causes the high number of false positives.

What kinds of methods could I try to improve the performance?

I am currently trying different sampling methods for imbalanced datasets, and I also plan to try PCA.



Thanks!!





















$endgroup$












  • $begingroup$
    Did you center and scale your features?
    $endgroup$
    – stmax
    Mar 27 '17 at 19:05










  • $begingroup$
    You would probably have to share some of your results to get useful answers.
    $endgroup$
    – oW_
    Mar 27 '17 at 19:57










  • $begingroup$
    @stmax Most of my features are dummies, so I didn't center and scale them.
    $endgroup$
    – Alice
    Mar 27 '17 at 20:25










  • $begingroup$
    I had almost the same problem with imbalanced data and binary classification. F1, recall, and precision were good on the training set, but bad on the test set. (I also used SMOTE to oversample the training set.) Then I tried everything that @D.W. suggested, but didn't manage to improve my test results. Did you manage to improve performance, and how?
    $endgroup$
    – vitez koja
    May 18 '18 at 13:15










  • $begingroup$
    Try stratified sampling; it is usually useful for imbalanced classification.
    $endgroup$
    – Moon
    Jan 14 at 21:41























classification logistic-regression unbalanced-classes














edited Apr 4 '17 at 6:55

asked Mar 27 '17 at 18:48
– Alice






















3 Answers


















3












$begingroup$

I suspect the reason is that the class balance in your test set is different from the class balance in your training set. That will throw everything off. The fundamental assumption made by statistical machine learning methods (including logistic regression) is that the distribution of data in the test set matches the distribution of data in the training set. SMOTE can throw that off.



It sounds like you have used SMOTE to augment the training set by adding additional synthetic positive instances (i.e., oversampling the minority class) -- but you haven't added any negative instances. So, the class balance in the training set might have shifted from 0.5%/99.5% to something like (say) 10%/90%, while the class balance in the test set remains 0.5%/99.5%. That's bad; it will cause the classifier to over-predict positive instances. For some classifiers, it's not a major problem, but I expect that logistic regression might be more sensitive to this mismatch between training distribution and test distribution.
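To illustrate why good precision on a rebalanced training set says little about precision at the real class balance, here is a small worked calculation. The true-positive and false-positive rates are hypothetical numbers chosen for illustration (they are not from the question), and the sketch assumes the classifier behaves identically on both sets, which is optimistic.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed TPR/FPR at a given positive-class prevalence."""
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 90% recall, 5% false-positive rate.
tpr, fpr = 0.90, 0.05

balanced = precision_at_prevalence(tpr, fpr, 0.50)    # SMOTE-balanced set
realistic = precision_at_prevalence(tpr, fpr, 0.005)  # 0.5% positives, as in the test set

print(f"precision at 50% positives:  {balanced:.3f}")   # 0.947
print(f"precision at 0.5% positives: {realistic:.3f}")  # 0.083
```

Even with an unchanged classifier, the same 5% false-positive rate produces far more false positives than true positives once negatives outnumber positives 199 to 1.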



Here are two candidate solutions for the problem that you can try:




  1. Stop using SMOTE. Ensure the training set has the same distribution as the test set. SMOTE might actually be unnecessary in your situation.


  2. Continue to augment the training set using SMOTE as you're currently doing, and compensate for the train/test mismatch by shifting the classification threshold. Logistic regression produces an estimated probability that a particular instance is from the positive class. Typically, you compare that probability to the threshold 0.5 to classify the instance as positive or negative. Because a model trained on the oversampled data over-estimates the probability of the positive class, the threshold has to move up, not down: let $k$ be the ratio of the positive-to-negative odds in your training set after augmentation to the odds before (e.g., if augmentation shifted the training set from 0.5%/99.5% to 10%/90%, then $k = (10/90)/(0.5/99.5) \approx 22$), and replace $0.5$ with $k/(1+k)$ (about $0.96$ in this example). This correction is only approximate when the added positives are synthetic. Alternatively, you can use cross-validation to find a threshold that maximizes the F1 score (or some other metric).
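The two threshold strategies in option 2 can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: it uses the standard odds-based prior-shift adjustment, under which the threshold on the oversampled model's probabilities moves above 0.5, and all function names and validation numbers are hypothetical. The F1 search is assumed to run on held-out data that keeps the original 0.5%/99.5% balance (no SMOTE applied to it).

```python
def odds(p: float) -> float:
    """Convert a probability in (0, 1) to odds."""
    return p / (1.0 - p)


def prior_shift_factor(train_pos_rate: float, true_pos_rate: float) -> float:
    """Odds ratio between the resampled training prior and the real prior."""
    return odds(train_pos_rate) / odds(true_pos_rate)


def corrected_probability(p_model: float, train_pos_rate: float,
                          true_pos_rate: float) -> float:
    """Map a probability from the oversampled model back to the real prior.

    Assumes 0 < p_model < 1.
    """
    o = odds(p_model) / prior_shift_factor(train_pos_rate, true_pos_rate)
    return o / (1.0 + o)


def corrected_threshold(train_pos_rate: float, true_pos_rate: float,
                        target: float = 0.5) -> float:
    """Threshold on the oversampled model's output that corresponds to
    `target` on the prior-corrected probability."""
    t = odds(target) * prior_shift_factor(train_pos_rate, true_pos_rate)
    return t / (1.0 + t)


def f1_at(scores, labels, threshold: float) -> float:
    """F1 score when predicting positive for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(scores, labels) -> float:
    """Try every observed score as a threshold and keep the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(scores, labels, t))


# Training set shifted from 0.5% to 10% positives by oversampling.
threshold = corrected_threshold(0.10, 0.005)
print(round(threshold, 3))  # 0.957

# Hypothetical held-out scores from the oversampled model, with true labels.
val_scores = [0.10, 0.40, 0.60, 0.70, 0.92, 0.97, 0.99]
val_labels = [0, 0, 0, 0, 1, 0, 1]
print(best_f1_threshold(val_scores, val_labels))  # 0.92
```

The data-driven search is usually the safer choice when the oversampling is synthetic, since it does not rely on the prior-shift assumption holding exactly.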



Incidentally, I recommend you make sure to use regularization with your logistic regression model, and use cross-validation to select the regularization hyper-parameter. There's nothing wrong with 15K features if you have 120K instances in your training set, but you might want to regularize it strongly (choose a large regularization parameter) to avoid overfitting.



Finally, understand that dealing with severe class imbalance such as you have is just hard. Fortunately, there are many techniques available. Do some reading and research (including on Stats.SE) and you should be able to find other methods you could try, if these don't work well enough.















$endgroup$





















    1












    $begingroup$

    The dimensionality of your data is an important consideration here. Having 15K features will likely lead to very poor results. The higher the dimensionality of your features, the more training examples you will need. For a shallow method such as logistic regression, a general rule of thumb is to use at least $10 \times \#\text{features}$ training examples. So unless you have over 150K examples, using 15K features is not recommended. Think about what kinds of questions need to be answered with your data and how you can remodel your data to better answer those questions.

    Furthermore, logistic regression is not recommended for skewed datasets. There are many algorithms that are well suited to skewed datasets. Specifically, anomaly detection algorithms can learn the distribution of a single class (event not occurring) and then flag an anomaly (event occurring) when an instance lies sufficiently far outside the learned distribution. You can use this to get the probability of an event occurring, based on a significance test (p-value) that compares the instance's features against the learned distribution.

    The simplest method would be a generalized likelihood ratio test (GLRT). But I think you will most likely have more luck using a k-NN based method for skewed datasets.
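One possible reading of a k-NN based anomaly detector is sketched below; the answer does not specify a particular method, so this is an assumption. It scores a point by its distance to the k-th nearest majority-class example and flags it when that distance exceeds a quantile of the scores seen within the majority class. The function names, the 0.95 quantile, and the toy grid data are all hypothetical, and the brute-force search would need a K-D tree or ball tree at 120K observations.

```python
import math


def kth_neighbor_distance(point, reference, k: int) -> float:
    """Distance from `point` to its k-th nearest neighbor in `reference`."""
    dists = sorted(math.dist(point, r) for r in reference)
    return dists[k - 1]


def fit_threshold(normal_points, k: int, quantile: float = 0.95) -> float:
    """Leave-one-out k-NN distances on the normal class, then take a quantile."""
    scores = []
    for i, p in enumerate(normal_points):
        others = normal_points[:i] + normal_points[i + 1:]
        scores.append(kth_neighbor_distance(p, others, k))
    scores.sort()
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]


def is_anomaly(point, normal_points, k: int, threshold: float) -> bool:
    """Flag points that sit farther from the normal class than the threshold."""
    return kth_neighbor_distance(point, normal_points, k) > threshold


# Toy majority-class data: a 5x5 grid of points.
normal = [(float(x), float(y)) for x in range(5) for y in range(5)]
k = 3
threshold = fit_threshold(normal, k)

print(is_anomaly((10.0, 10.0), normal, k, threshold))  # far from the grid
print(is_anomaly((2.1, 2.0), normal, k, threshold))    # inside the grid
```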















    $endgroup$













    • $begingroup$
      Thanks! I am only using 200 features. Wouldn't KNN take too much time?
      $endgroup$
      – Alice
      Mar 27 '17 at 22:19






    • 1




      $begingroup$
      @Alice Not if it caches neighbors in a tree structure such as K-D tree or ball tree.
      $endgroup$
      – K3---rnc
      Mar 28 '17 at 1:02










    • $begingroup$
      Also check out the idea of using graph-like models to retain the k-NN structure of your data. papers.nips.cc/paper/2851-learning-minimum-volume-sets.pdf
      $endgroup$
      – JahKnows
      Mar 28 '17 at 2:22










    • $begingroup$
      Why do you say logistic regression isn't recommended for imbalanced data sets? Everything I've read suggests logistic regression is perfectly reasonable for imbalanced data sets.
      $endgroup$
      – D.W.
      Apr 4 '17 at 16:02



















    0












    $begingroup$

    I use the same dangerous approach.



    The DANGER is that we do Feature Selection with a non-linear model (Random Forest) and apply a linear model (Logistic Regression).



    Alternatives:
    - Try a tree-based algorithm OR
    - Use PCA which is linear and test Logistic Regression again.















    $endgroup$
















































          answered Apr 4 '17 at 16:10
          – D.W. (author of the first answer, on train/test class balance)























              1












              $begingroup$

              The dimensionality of your data is an important consideration here. Having 15K features will likely lead to very poor results. The higher dimensionality your features the more training examples you will need. For a shallow method such as logistic regression a general rule of thumb is to use $10times #features$. So unless you have over 150K examples, using 15K features is not recommended. Think to yourself what kinds of questions need to be answered in your data and how you can remodel your data to better answer those questions.



              Furhtermore, logistic regression is not recommended for skewed datasets. There are many algorithms that are well suited to dealing with skewed dataset types of problems. Specifically, anomaly detection algorithms are capable of learning the distribution of a single set of labels (event not occurring) and then it will be able to flag when an anomaly occurs (event occurs). This is when an instance is sufficiently beyond the learned distribution. You can use this to get the probability of an event occurring based on a p-statistic test using the feature-space you have set up in contrast with those from your learned distribution.



              The simplest method would be doing a generalized likelihood ratio test (GLRT). But, I think you will most likely find more luck using a K-NN based method for skewed datasets.






              share|improve this answer









              $endgroup$













              • $begingroup$
                Thanks! I am only using 200 features. Wouldn't KNN takes too much time?
                $endgroup$
                – Alice
                Mar 27 '17 at 22:19






              • 1




                $begingroup$
                @Alice Not if it caches neighbors in a tree structure such as K-D tree or ball tree.
                $endgroup$
                – K3---rnc
                Mar 28 '17 at 1:02










              • $begingroup$
                Also check out the idea using graph like models to retain the k-NN structure of your data. papers.nips.cc/paper/2851-learning-minimum-volume-sets.pdf
                $endgroup$
                – JahKnows
                Mar 28 '17 at 2:22










              • $begingroup$
                Why do you say logistic regression isn't recommended for imbalanced data sets? Everything I've read suggests logistic regression is perfectly reasonable for imbalanced data sets.
                $endgroup$
                – D.W.
                Apr 4 '17 at 16:02


























              answered Mar 27 '17 at 20:28









              JahKnows























              0












              $begingroup$

              I use the same dangerous approach.



              The danger is that we select features with a non-linear model (Random Forest) and then fit a linear model (Logistic Regression) on them.



              Alternatives:
              - Try a tree-based algorithm OR
              - Use PCA, which is linear, and test Logistic Regression again.
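A toy illustration of the mismatch (not taken from the question's data): when the label depends only on an interaction between two features, a tree can separate the classes by splitting on one feature and then the other, yet neither feature has any linear association with the label. The XOR dataset and the hand-written `pearson` helper below are assumptions chosen to keep the example self-contained.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]  # label is the XOR of the two features

# Taken one at a time, each feature is linearly uninformative.
print(pearson(x1, y))  # 0.0
print(pearson(x2, y))  # 0.0

# An explicit interaction term recovers the signal for a linear model.
interaction = [a ^ b for a, b in zip(x1, x2)]
print(pearson(interaction, y))  # 1.0
```

So features that a random forest ranks highly may need explicit interaction terms before logistic regression can use them.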















              $endgroup$




























                  answered yesterday









                  FrancoSwiss





























