Evaluating the performance of a random forest classifier
$begingroup$
I'm using a random forest classifier (in R) to impute missing data in a dataset. I have a set of objects (companies) and want to predict one attribute (size) from other attributes (capital, owning_group and state). The dependent variable, size, is categorical with 3 possible values (small|medium|large). A random forest (R package randomForest) fitted on these 3 predictors gives this output:
    ff = size ~ capital + owning_group + state

    Call:
     randomForest(formula = ff, data = df, importance = T, ntree = ntree, na.action = na.omit)
                   Type of random forest: classification
                         Number of trees: 2000
    No. of variables tried at each split: 1

            OOB estimate of error rate: 32.41%
    Confusion matrix:
           large medium small class.error
    large    238     17   237  0.51626016
    medium    80     25   322  0.94145199
    small     73     30  1320  0.07238229

    Overall Statistics

                   Accuracy : 0.7297
                     95% CI : (0.7112, 0.7476)
        No Information Rate : 0.8049
        P-Value [Acc > NIR] : 1

                      Kappa : 0.426
     Mcnemar's Test P-Value : <2e-16

    Statistics by Class:

                         Class: large Class: medium Class: small
    Sensitivity                0.7087       0.84211       0.7294
    Specificity                0.8868       0.83981       0.8950
    Pos Pred Value             0.5488       0.14988       0.9663
    Neg Pred Value             0.9400       0.99373       0.4450
    Prevalence                 0.1627       0.03245       0.8049
    Detection Rate             0.1153       0.02733       0.5871
    Detection Prevalence       0.2101       0.18232       0.6076
    Balanced Accuracy          0.7977       0.84096       0.8122
I interpret this output as saying that the model has a 73% accuracy, and that the classifier makes a lot of mistakes for medium and large, but gets small mostly right. Does the P-value indicate that the model is not significant?
Assuming that this accuracy is OK for my context, how can I validate this model beyond these simple observations?
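As a check on the reading above, the headline figures can be recomputed from the OOB confusion matrix alone. The sketch below is only an illustration: it assumes rows are actual classes and columns are predicted classes (the randomForest convention), and it assumes that "P-Value [Acc > NIR]" is a one-sided binomial test of accuracy against the no-information rate, which is how that statistic is usually defined. The variable names are made up for this example.

```r
# OOB confusion matrix copied from the randomForest output above
# (assumed layout: rows = actual class, columns = predicted class).
cm <- matrix(
  c(238, 17,  237,
     80, 25,  322,
     73, 30, 1320),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual    = c("large", "medium", "small"),
                  predicted = c("large", "medium", "small"))
)

n            <- sum(cm)
class_error  <- 1 - diag(cm) / rowSums(cm)  # reproduces the class.error column
oob_accuracy <- sum(diag(cm)) / n           # equals 1 - OOB error rate
nir          <- max(rowSums(cm)) / n        # share of the largest actual class

# One-sided test of H0: accuracy <= NIR (assumed meaning of "P-Value [Acc > NIR]").
p_value <- binom.test(sum(diag(cm)), n, p = nir,
                      alternative = "greater")$p.value

print(round(class_error, 4))
print(round(c(accuracy = oob_accuracy, nir = nir), 4))
print(p_value)
```

The statistics block reports a different accuracy (0.7297) and no-information rate (0.8049) than the OOB matrix implies (about 0.676 and 0.608), so the two outputs were evidently not computed from the same table and the p-value here need not match the reported value of 1. A reported p-value of 1 simply reflects that the reported accuracy is below the reported no-information rate. One thing worth checking: the Detection Prevalence row (0.2101, 0.18232, 0.6076) equals the actual-class shares of the OOB matrix (492, 427 and 1423 of 2342), which may mean that predicted and reference labels were swapped when that block was produced. The post does not show the call, so this is a guess.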
r random-forest cross-validation
$endgroup$
asked Jul 28 '18 at 14:32 by Strabonio (edited Jul 28 '18 at 14:49)
1 Answer
$begingroup$
First of all, if you are trying to impute missing values with an RF model, take a look at the rfImpute() function in the randomForest package.
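A minimal sketch of rfImpute() follows. The data frame is synthetic and only mimics the column names from the question; the sizes, breaks and seed are arbitrary assumptions. Note that rfImpute() fills missing values in the predictors and needs a complete response, so it helps when capital, owning_group or state have gaps, not when size itself is the missing attribute.

```r
library(randomForest)

set.seed(1)
n <- 300

# Hypothetical stand-in for the companies data (not the asker's data).
df <- data.frame(
  capital      = rlnorm(n, meanlog = 10, sdlog = 1),
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("NY", "CA", "TX"), n, replace = TRUE))
)
df$size <- cut(df$capital,
               breaks = quantile(df$capital, c(0, 0.7, 0.9, 1)),
               labels = c("small", "medium", "large"),
               include.lowest = TRUE)

# Knock out some predictor values; the response stays complete.
df$capital[sample(n, 30)] <- NA

# Proximity-based imputation of the predictors; the response is returned
# as the first column of the result.
imputed <- rfImpute(size ~ capital + owning_group + state, data = df,
                    iter = 3, ntree = 100)

print(sum(is.na(df$capital)))       # missing values before imputation
print(sum(is.na(imputed$capital)))  # missing values after imputation
```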
Second, your data is unbalanced, which is why your classification is poor: the model is biased towards the majority class (small), so it assigns many of your cases to that class (for example, 322 of the 427 medium cases are classified as small). The imbalance needs to be addressed.
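One possible way to address the imbalance, sketched here under assumptions, is to draw the same number of cases from each class for every tree using the strata and sampsize arguments of randomForest(). This is one option among several (class weights or resampling the data beforehand are others), and the data below are synthetic with arbitrary class proportions, not the asker's data.

```r
library(randomForest)

set.seed(42)
n <- 600

# Hypothetical imbalanced data: roughly 70% small, 20% medium, 10% large.
capital <- rlnorm(n, meanlog = 10, sdlog = 1)
noisy   <- capital * exp(rnorm(n, sd = 0.3))
size    <- cut(noisy,
               breaks = quantile(noisy, c(0, 0.7, 0.9, 1)),
               labels = c("small", "medium", "large"),
               include.lowest = TRUE)
x <- data.frame(
  capital      = capital,
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("NY", "CA", "TX"), n, replace = TRUE))
)

# Each tree is grown on n_min cases per class, so the majority class
# no longer dominates the bootstrap samples.
n_min  <- min(table(size))
rf_bal <- randomForest(x = x, y = size, ntree = 200,
                       strata = size,
                       sampsize = rep(n_min, nlevels(size)))

print(table(size))
print(rf_bal$confusion)
```

Comparing the per-class error of this fit with that of an unstratified fit on the same data shows whether the minority classes actually benefit; balancing usually trades some majority-class accuracy for better minority-class recall.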
For further validation, evaluate the model on a held-out test set: the error rate reported by the model is already an out-of-bag (OOB) estimate, which plays a role similar to cross-validation.
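A hold-out check could look like the sketch below. The 70/30 split, the seed and the synthetic data frame are assumptions made for illustration; only the formula is taken from the question.

```r
library(randomForest)

set.seed(7)
n <- 500

# Hypothetical stand-in for the companies data.
df <- data.frame(
  capital      = rlnorm(n, meanlog = 10, sdlog = 1),
  owning_group = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  state        = factor(sample(c("NY", "CA", "TX"), n, replace = TRUE))
)
noisy   <- df$capital * exp(rnorm(n, sd = 0.3))
df$size <- cut(noisy,
               breaks = quantile(noisy, c(0, 0.7, 0.9, 1)),
               labels = c("small", "medium", "large"),
               include.lowest = TRUE)

# Hold out 30% of the rows; the model never sees them during fitting.
test_idx <- sample(nrow(df), size = round(0.3 * nrow(df)))
train    <- df[-test_idx, ]
test     <- df[test_idx, ]

rf   <- randomForest(size ~ capital + owning_group + state,
                     data = train, ntree = 200)
pred <- predict(rf, newdata = test)

# Confusion table and accuracy on the held-out rows only.
cm_test       <- table(actual = test$size, predicted = pred)
test_accuracy <- sum(diag(cm_test)) / sum(cm_test)

print(cm_test)
print(test_accuracy)
```

If the test-set accuracy and per-class errors are close to the OOB figures, the OOB estimate can be trusted for this model; a large gap would point to leakage or a split that is not representative.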
$endgroup$
answered Sep 14 '18 at 8:24 by user2974951