What's the difference between feature importance from Random Forest and the Pearson correlation coefficient?
I have the following business domain. I have a product with three outputs/labels. The outputs are affected by 1000 procedures, each of which is digitized and measured. The customer wants to know which procedures have the most influence on the outputs.
1. The Pearson correlation coefficient tells us how two variables are related: 1 means a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship. So I could look for the largest Pearson correlation coefficients (in absolute value) to find the most influential procedures.
2. From the Random Forest algorithm, I can get the top feature importances. So I could also identify the most influential procedures that way.
Which one is better?
random-forest
asked 14 hours ago by user84592
2 Answers
Pearson correlations capture linear relationships between the input and target variables. They therefore make most sense for continuous inputs with a continuous target variable, and not for continuous inputs with a binary or categorical output. A correlation essentially measures the positive or negative change in one variable as you increase or decrease the other.
So it doesn't make much sense to assess the relationship between your input features and the categorical outputs this way. You may as well calculate the mean of each feature for each label and compare the differences between those means. I found this answer on Cross-Validated which explains this much better than I can.
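The per-label mean comparison suggested above could look roughly like the sketch below. The toy rows, the label names ("ok", "defect") and the feature names are hypothetical stand-ins, not data from the question, and only the standard library is used.

```python
from statistics import mean

# Hypothetical toy data (an assumption): each row holds the measurements of
# two procedures, and `labels` holds the categorical output for that row.
rows = [
    (1.0, 5.0),
    (1.2, 5.1),
    (0.9, 4.9),
    (3.0, 5.0),
    (3.1, 5.2),
    (2.9, 4.8),
]
labels = ["ok", "ok", "ok", "defect", "defect", "defect"]
feature_names = ["procedure_a", "procedure_b"]


def class_means(rows, labels, column):
    """Return {label: mean of the given column over the rows with that label}."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row[column])
    return {label: mean(values) for label, values in grouped.items()}


# Absolute gap between the two class means, per feature.
mean_gap = {}
for column, name in enumerate(feature_names):
    means = class_means(rows, labels, column)
    mean_gap[name] = abs(means["defect"] - means["ok"])

# Features whose mean shifts most between the labels come first.
for name, gap in sorted(mean_gap.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: difference in class means = {gap:.2f}")
```

Raw differences depend on each feature's units, so with 1000 differently scaled procedures one would probably standardize each feature first; that step is left out here to keep the sketch short.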
Feature importance in tree-based models is more likely to identify which features are most influential in differentiating your classes, provided that the model performs well. How this feature importance is calculated depends on the implementation; this article gives a good overview of how different tree-based models calculate feature importance.
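As a rough illustration of the point about model quality, the sketch below checks cross-validated accuracy before reading the importances. It assumes scikit-learn and NumPy are available and uses synthetic data in place of the real procedures: six made-up features, of which only columns 0 and 1 drive the label. With three outputs, fitting one such model per output would be one option, but that is an assumption rather than something stated in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real measurements (an assumption):
# 6 "procedures", and only columns 0 and 1 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = (X[:, 0] + np.abs(X[:, 1]) > 1).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Check predictive quality first: importances from a poor model mean little.
cv_accuracy = cross_val_score(rf, X, y, cv=5).mean()

# Impurity-based importances of the forest fitted on all data.
rf.fit(X, y)
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

print(f"cross-validated accuracy: {cv_accuracy:.2f}")
for idx in ranking:
    print(f"procedure_{idx}: importance {importances[idx]:.3f}")
```

`feature_importances_` here is scikit-learn's impurity-based measure; other libraries, and permutation-based measures, can rank features differently, which is the implementation dependence mentioned above.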
answered 6 hours ago by Dan Carter (edited 5 hours ago)
This beautiful picture is for continuous-continuous variables. Continuous-categorical (feature-label) case is different, since "linear" relation has no meaning. – Esmailian, 6 hours ago
Ah well noticed, I hadn't spotted this question was asking about categorical labels, I'll edit my answer :) – Dan Carter, 6 hours ago
I would say it depends a bit on what you want to achieve.
A few things to keep in mind:
Pearson gives you a linear correlation, but what if the information is in the absolute value? A random forest has a much better chance of recognizing this.
Example data where b clearly depends on a, but only through the absolute value:
    a = [1, 1, 1, 0, 0, 0, -1, -1, -1]
    b = [abs(x) for x in a]
On the other hand, RF importance is only relevant when the prediction is good, whatever "good" means for you. Pearson's r, by contrast, has a very specific meaning that holds regardless of any model: it measures the linear correlation between the two variables.
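To make the absolute-value example concrete, here is a small sketch that computes Pearson's r for the a and b lists above using only the standard library. The helper name `pearson_r` is introduced purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


a = [1, 1, 1, 0, 0, 0, -1, -1, -1]
b = [abs(x) for x in a]

# b is fully determined by a, yet the linear correlation is (numerically) zero.
r_raw = pearson_r(a, b)

# After applying the right transformation, the correlation is perfect.
r_abs = pearson_r([abs(x) for x in a], b)

print(f"r(a, b)      = {r_raw:.3f}")
print(f"r(|a|, b)    = {r_abs:.3f}")
```

The first value comes out at approximately 0 and the second at approximately 1, so ranking procedures by Pearson's r alone would miss a feature like a unless the right transformation were known in advance.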
answered 10 hours ago by El Burro