Evaluating the performance of a random forest classifier












$begingroup$


I'm using a random forest classifier (in R) to impute missing data in a dataset. Basically, I have a bunch of objects (companies) and I want to guess an attribute (size) from other attributes (capital, owning_group and state). The dependent attribute is a categorical variable (size) with 3 possible values (small|medium|large). A random forest (R package randomForest) on these 3 variables provides this output:



    ff = size ~ capital + owning_group + state

    Call:
     randomForest(formula = ff, data = df, importance = T, ntree = ntree, na.action = na.omit)
                   Type of random forest: classification
                         Number of trees: 2000
    No. of variables tried at each split: 1

            OOB estimate of error rate: 32.41%
    Confusion matrix:
           large medium small class.error
    large    238     17   237  0.51626016
    medium    80     25   322  0.94145199
    small     73     30  1320  0.07238229

    Overall Statistics

                   Accuracy : 0.7297
                     95% CI : (0.7112, 0.7476)
        No Information Rate : 0.8049
        P-Value [Acc > NIR] : 1

                      Kappa : 0.426
     Mcnemar's Test P-Value : <2e-16

    Statistics by Class:

                         Class: large Class: medium Class: small
    Sensitivity                0.7087       0.84211       0.7294
    Specificity                0.8868       0.83981       0.8950
    Pos Pred Value             0.5488       0.14988       0.9663
    Neg Pred Value             0.9400       0.99373       0.4450
    Prevalence                 0.1627       0.03245       0.8049
    Detection Rate             0.1153       0.02733       0.5871
    Detection Prevalence       0.2101       0.18232       0.6076
    Balanced Accuracy          0.7977       0.84096       0.8122


I interpret this output as saying that the model has 73% accuracy and that the classifier makes many mistakes for medium and large but gets small mostly right. Does the P-value indicate that the model is not significant?



Assuming that this precision is OK for my context, how can I validate this model beyond these simple observations?










$endgroup$




Tags: r, random-forest, cross-validation






asked Jul 28 '18 at 14:32 by Strabonio, edited Jul 28 '18 at 14:49





1 Answer
          $begingroup$

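First of all, if you are trying to impute missing values with a random forest model, take a look at the rfImpute() function in the randomForest package. Note that it fills in missing predictor values and expects a complete response.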
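As a rough illustration of what calling rfImpute() looks like, here is a minimal sketch. The data frame is synthetic and only borrows the column names from the question; the missing values, `iter` and `ntree` settings are arbitrary choices for the example, not a recommendation.

```r
library(randomForest)
set.seed(1)

# Synthetic stand-in for the asker's data (hypothetical values).
n <- 200
df <- data.frame(
  size         = factor(sample(c("small", "medium", "large"), n, replace = TRUE)),
  capital      = rlnorm(n),
  owning_group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  state        = factor(sample(c("x", "y"), n, replace = TRUE))
)

# Knock out some predictor values so that there is something to impute.
df$capital[sample(n, 20)] <- NA

# rfImpute() needs a response without NAs; it fills the NAs in the predictors
# using proximities from a random forest, refined over `iter` iterations.
imputed <- rfImpute(size ~ capital + owning_group + state, data = df,
                    iter = 5, ntree = 300)

anyNA(imputed)  # the returned data frame has the response in its first column
```

Because the response must be complete, this helps when the gaps are in capital, owning_group or state. Guessing a missing size itself is an ordinary prediction task with predict() on a fitted forest.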



Second, your data are unbalanced, which is why the classification is poor: the model is biased towards the majority class (small), so it assigns many of your cases to that class. The imbalance needs to be addressed.
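The bias towards the majority class can be read directly off the OOB confusion matrix in the question. The sketch below retypes that matrix by hand (rows are the actual classes, columns the predicted ones, as randomForest prints it) and computes per-class recall, overall accuracy, and the accuracy of always guessing the largest class. The variable names are mine.

```r
# OOB confusion matrix copied from the question (rows = actual, cols = predicted).
cm <- matrix(c(238, 17,  237,
                80, 25,  322,
                73, 30, 1320),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("large", "medium", "small"),
                             predicted = c("large", "medium", "small")))

recall   <- diag(cm) / rowSums(cm)      # share of each true class recovered
accuracy <- sum(diag(cm)) / sum(cm)     # overall OOB accuracy
nir      <- max(rowSums(cm)) / sum(cm)  # "no information rate": always guess the largest class

round(recall, 3)
round(c(accuracy = accuracy, no_information_rate = nir), 3)

# One-sided exact binomial test of "accuracy > no information rate".
p_value <- binom.test(sum(diag(cm)), sum(cm), p = nir,
                      alternative = "greater")$p.value
```

Recall is 1 minus the class.error column, so medium comes out at roughly 6% while small is above 90%. On this table the accuracy is about 0.676 against a no-information rate of about 0.608. The caret summary quoted in the question reports different figures (0.7297 and 0.8049), so it appears to have been computed from a different table than the OOB one.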



Validation is done with a held-out test set. The error rate and confusion matrix you have obtained are already out-of-bag (OOB) estimates, which play a role similar to cross-validation.
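A hold-out check could look roughly like the following. This is only a sketch: the data are simulated, the 70/30 split and `ntree = 500` are arbitrary, and only the formula and column names come from the question.

```r
library(randomForest)
set.seed(42)

# Simulated stand-in for the asker's data (hypothetical values).
n <- 600
df <- data.frame(
  capital      = rlnorm(n, meanlog = 10),
  owning_group = factor(sample(c("g1", "g2", "g3"), n, replace = TRUE)),
  state        = factor(sample(c("s1", "s2", "s3", "s4"), n, replace = TRUE))
)
# Let size depend loosely on capital so that there is something to learn.
df$size <- cut(log(df$capital) + rnorm(n, sd = 0.7),
               breaks = c(-Inf, 9.5, 10.5, Inf),
               labels = c("small", "medium", "large"))

# Set aside 30% of the rows before fitting anything.
test_idx <- sample(n, size = round(0.3 * n))
train <- df[-test_idx, ]
test  <- df[test_idx, ]

fit  <- randomForest(size ~ capital + owning_group + state,
                     data = train, ntree = 500)
pred <- predict(fit, newdata = test)

# Confusion matrix and accuracy on rows the forest never saw.
test_cm  <- table(actual = test$size, predicted = pred)
test_acc <- sum(diag(test_cm)) / sum(test_cm)

# OOB accuracy of the same forest, for comparison.
oob_acc <- 1 - fit$err.rate[fit$ntree, "OOB"]

test_cm
c(test = test_acc, oob = oob_acc)
```

If the test-set accuracy and the OOB accuracy are close, the OOB figures in the original output can be taken as a reasonable estimate of out-of-sample performance.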






          $endgroup$














answered Sep 14 '18 at 8:24 by user2974951





























