Evaluating the performance of a random forest classifier

I'm using a random forest classifier (in R) to impute missing data in a dataset. Basically, I have a bunch of objects (companies) and I want to guess one attribute (size) from other attributes (capital, owning_group and state). The dependent attribute (size) is a categorical variable with 3 possible values (small|medium|large). A random forest (R package randomForest) fitted on these 3 predictors gives this output:

ff = size ~ capital + owning_group + state

Call:
 randomForest(formula = ff, data = df, importance = T, ntree = ntree, na.action = na.omit)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 1

        OOB estimate of  error rate: 32.41%
Confusion matrix:
       large medium small class.error
large    238     17   237  0.51626016
medium    80     25   322  0.94145199
small     73     30  1320  0.07238229

  Overall Statistics

               Accuracy : 0.7297
                 95% CI : (0.7112, 0.7476)
    No Information Rate : 0.8049
    P-Value [Acc > NIR] : 1

                  Kappa : 0.426
 Mcnemar's Test P-Value : <2e-16

Statistics by Class:

                     Class: large Class: medium Class: small
Sensitivity                0.7087       0.84211       0.7294
Specificity                0.8868       0.83981       0.8950
Pos Pred Value             0.5488       0.14988       0.9663
Neg Pred Value             0.9400       0.99373       0.4450
Prevalence                 0.1627       0.03245       0.8049
Detection Rate             0.1153       0.02733       0.5871
Detection Prevalence       0.2101       0.18232       0.6076
Balanced Accuracy          0.7977       0.84096       0.8122

I interpret this output as saying that the model has a 73% accuracy, and that the classifier makes a lot of mistakes for medium and large, but gets small mostly right. Does the P-value indicate that the model is not significant?

Assuming that this precision is OK for my context, how can I validate this model beyond these simple observations?
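The figures in the randomForest printout can be recomputed by hand from the OOB confusion matrix, which helps when comparing them with the second summary. A minimal base-R sketch, assuming rows are the actual classes and columns the predicted ones (the randomForest convention); the variable names are only illustrative:

```r
# OOB confusion matrix copied from the randomForest printout
# (rows = actual class, columns = predicted class).
cm <- matrix(
  c(238, 17,  237,
     80, 25,  322,
     73, 30, 1320),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual    = c("large", "medium", "small"),
                  predicted = c("large", "medium", "small"))
)

n            <- sum(cm)              # number of cases used by the forest
oob_accuracy <- sum(diag(cm)) / n    # equals 1 - OOB error rate
class_share  <- rowSums(cm) / n      # observed share of each class
baseline     <- max(class_share)     # accuracy of always predicting the largest class
class_recall <- diag(cm) / rowSums(cm)  # equals 1 - class.error

print(round(oob_accuracy, 4))
print(round(class_share, 4))
print(round(class_recall, 4))
```

On these numbers the OOB accuracy is about 0.676 and the largest class makes up about 0.608 of the cases. The class shares match the Detection Prevalence row of the second summary rather than its Prevalence row, and the accuracy differs (0.7297 vs 0.6759), so that summary seems to come from a different table, possibly with predictions and reference swapped. That is a guess from the arithmetic, not something the output confirms.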

edited Jul 28 '18 at 14:49

asked Jul 28 '18 at 14:32

Strabonio

r random-forest cross-validation

1 Answer

First of all, if you are trying to impute missing values with a RF model then take a look at the rfImpute() function.
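As a rough illustration of rfImpute(), here is a minimal sketch on made-up data. The data frame is a hypothetical stand-in that only borrows the column names from the question, and iter and ntree are arbitrary choices. Note that rfImpute() fills in missing predictor values and requires a complete response, so rows where size itself is missing would still have to be filled with predict() on a fitted forest.

```r
library(randomForest)

set.seed(1)

# Hypothetical stand-in for the company data (not the asker's data).
n  <- 300
df <- data.frame(
  capital      = rlnorm(n, meanlog = 10),
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("X", "Y"), n, replace = TRUE))
)
# Derive a 3-level response from capital so the predictors carry some signal.
df$size <- cut(df$capital,
               breaks = quantile(df$capital, c(0, 0.6, 0.8, 1)),
               labels = c("small", "medium", "large"),
               include.lowest = TRUE)

# Knock out some predictor values; the response stays complete.
df$capital[sample(n, 30)] <- NA

# Impute missing predictors using random forest proximities.
imputed <- rfImpute(size ~ capital + owning_group + state,
                    data = df, iter = 5, ntree = 200)

# The result has the response in the first column and no missing values.
str(imputed)
```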

Second, your data is imbalanced, which is why the classification is poor: the model is biased towards the majority class (small), so it assigns many of your cases to that class. The imbalance needs to be addressed.
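One common way to address the imbalance within randomForest is to draw the same number of cases from each class for every tree, using the strata and sampsize arguments. The sketch below uses made-up data (only the column names come from the question), and balanced down-sampling is just one option among several, not necessarily the best one for your case.

```r
library(randomForest)

set.seed(42)

# Hypothetical imbalanced data: roughly 70% small, 15% medium, 15% large.
n    <- 600
size <- factor(sample(c("small", "medium", "large"), n, replace = TRUE,
                      prob = c(0.70, 0.15, 0.15)),
               levels = c("large", "medium", "small"))
means   <- c(large = 3, medium = 2, small = 1)
capital <- rnorm(n, mean = unname(means[as.character(size)]), sd = 0.7)
df <- data.frame(
  size         = size,
  capital      = capital,
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("X", "Y"), n, replace = TRUE))
)

# Each tree samples n_min cases from every class, so no class dominates.
n_min  <- min(table(df$size))
rf_bal <- randomForest(size ~ capital + owning_group + state,
                       data     = df,
                       ntree    = 500,
                       strata   = df$size,
                       sampsize = rep(n_min, nlevels(df$size)))

# OOB confusion matrix with per-class error in the last column.
print(rf_bal$confusion)
```

Comparing the class.error column with that of an unbalanced fit on the same data shows whether the minority classes are recovered better, usually at some cost for the majority class.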

Validation should be done on a separate test set, because the results you have so far are out-of-bag (OOB) estimates, which already play a role similar to cross-validation.
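A hold-out check could look like the sketch below. The data are made up (only the column names follow the question), and the 70/30 split and ntree value are arbitrary assumptions.

```r
library(randomForest)

set.seed(7)

# Hypothetical data with the same column names as in the question.
n    <- 600
size <- factor(sample(c("small", "medium", "large"), n, replace = TRUE,
                      prob = c(0.60, 0.20, 0.20)),
               levels = c("large", "medium", "small"))
means   <- c(large = 3, medium = 2, small = 1)
capital <- rnorm(n, mean = unname(means[as.character(size)]), sd = 0.7)
df <- data.frame(
  size         = size,
  capital      = capital,
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("X", "Y"), n, replace = TRUE))
)

# Hold out 30% of the rows; the forest never sees them during fitting.
test_idx <- sample(nrow(df), size = round(0.3 * nrow(df)))
train    <- df[-test_idx, ]
test     <- df[test_idx, ]

rf   <- randomForest(size ~ capital + owning_group + state,
                     data = train, ntree = 500)
pred <- predict(rf, newdata = test)

# Rows = actual class, columns = predicted class.
cm       <- table(actual = test$size, predicted = pred)
accuracy <- sum(diag(cm)) / sum(cm)
baseline <- max(table(test$size)) / nrow(test)  # always predict the largest class

print(cm)
print(c(accuracy = accuracy, baseline = baseline))
```

If the test-set accuracy is close to the OOB accuracy and clearly above the baseline, the OOB estimate can be trusted for this data.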

answered Sep 14 '18 at 8:24

user2974951
