Not sure if over-fitting
I trained a model this way. There are four classes, and the data are distributed evenly (the same number of samples per label).

  1. Used MinMaxScaler on the features.

  2. Used train_test_split(X, y, test_size=0.3, random_state=42, stratify=y).

  3. Ran GradientBoostingClassifier on the training data, once with n_estimators=32 and once with n_estimators=500.

  4. Used predict on the test data.

  5. Got accuracy = 0.94 with n_estimators=32 and accuracy = 1.0 with n_estimators=500. Precision and recall in the classification report are also 1.0 for every class.

Seems fishy, but I can't figure out why. What am I doing wrong?
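
In code, the setup above is roughly the following (a minimal sketch: the actual dataset isn't shown, so synthetic data from make_classification stands in for it):

    # Minimal sketch of the setup described above. The real dataset isn't
    # shown in the question, so make_classification stands in for it.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    # Four balanced classes, as in the question.
    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=6, n_classes=4,
                               random_state=0)

    # 1. Scale features to [0, 1]. (Note: fitting the scaler on the full
    #    dataset before splitting leaks the test set's min/max into training.)
    X = MinMaxScaler().fit_transform(X)

    # 2. Stratified 70/30 split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # 3-5. Fit with both estimator counts and score on the held-out test set.
    for n in (32, 500):
        clf = GradientBoostingClassifier(n_estimators=n).fit(X_train, y_train)
        print(n, clf.score(X_test, y_test))
        print(classification_report(y_test, clf.predict(X_test)))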

machine-learning classification scikit-learn overfitting

asked Dec 9 '18 at 14:41 by M.F

  • Are the observations independent? – Michael M, Dec 9 '18 at 15:43

  • Have you tried cross-validation? Maybe your seed creates an unusually perfect split. – Skiddles, Dec 9 '18 at 17:11

  • Sorry to ask the obvious, but is your label being used in the inputs? – Skiddles, Dec 9 '18 at 17:13

  • @MichaelM Yes, each example is independent of the others. – M.F, Dec 10 '18 at 7:05

  • Since you have split your data into a training set and a test set, it would be helpful to report both the training accuracy and the test accuracy. You may also want to repeat step 2 with different train_test_split seeds (vary random_state) to see whether your observations are consistent across splits. – user12075, Dec 23 '18 at 8:17
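
Following the suggestions above, a quick sanity check is to repeat the experiment across several split seeds and compare training vs. test accuracy (a sketch, assuming X and y from the question's setup):

    # Sketch: repeat the experiment with different split seeds and report
    # train vs. test accuracy, as the comments suggest. Assumes X and y
    # from the question's setup.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        clf = GradientBoostingClassifier(n_estimators=500).fit(X_tr, y_tr)
        print(f"seed={seed}  train={clf.score(X_tr, y_tr):.3f}  "
              f"test={clf.score(X_te, y_te):.3f}")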
1 Answer
Depending on your data, you may be overfitting; however, that isn't necessarily the definitive answer.



Gradient-boosted trees are a powerful algorithm and were, for a while, state of the art. If your data happen to encode the target in a systematic way that you haven't uncovered yet, it's plausible that with 500 trees the algorithm found a perfect fit. It's not unheard of.



On the other hand, I don't know much about your data. How many samples do you have? 100? 100,000? The former is much easier to model perfectly. The latter may also be predictable (albeit less likely) if the variance between classes is predictable. The number of features may also play a role, as may the significance of each feature.



As suggested in the comments, cross-validation may help you discover what's going on here. I highly suggest reading the paper I linked above for an example of rigorous CV; read carefully what they did and model your own CV setup on it.

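For example, a minimal cross-validated check (a sketch; assumes X and y from the question):

    # Sketch: 5-fold stratified cross-validation instead of a single split,
    # so one lucky random_state can't produce a deceptively perfect score.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(GradientBoostingClassifier(n_estimators=500),
                             X, y, cv=cv, scoring="accuracy")
    print(scores, scores.mean())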


You might also check the feature importances returned by your classifier. If one feature is disproportionately important, that can indicate a close correlation between it and the target variable (which means you should take a close look at that feature).
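
Something along these lines (a sketch; clf is the fitted classifier from the question):

    # Sketch: rank the model's impurity-based feature importances. One
    # feature dominating the ranking is worth a close look (for example,
    # possible target leakage). Assumes clf is the fitted
    # GradientBoostingClassifier.
    import numpy as np

    importances = clf.feature_importances_
    for idx in np.argsort(importances)[::-1][:10]:
        print(f"feature {idx}: importance {importances[idx]:.3f}")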

answered Dec 23 '18 at 0:41 by Alex L