How can I check the correlation between features and target variable?
$begingroup$
I am trying to build a Regression model and I am looking for a way to check whether there's any correlation between features and target variables?
This is my sample dataset
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
0 LP001002 Male No 0 Graduate No 5849
1 LP001003 Male Yes 1 Graduate No 4583
2 LP001005 Male Yes 0 Graduate Yes 3000
3 LP001006 Male Yes 0 Not Graduate No 2583
4 LP001008 Male No 0 Graduate No 6000
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Area Loan_Status
0.0 123 360.0 1.0 Urban Y
1508.0 128.0 360.0 1.0 Rural N
0.0 66.0 360.0 1.0 Urban Y
2358.0 120.0 360.0 1.0 Urban Y
0.0 141.0 360.0 1.0 Urban Y
I am trying to predict LoanAmount column based on the features available above.
I just want to see if there's a correlation between the features and target variable. I tried LinearRegression, GradientBoostingRegressor and I'm hardly getting a accuracy of around 0.30 - 0.40%.
Any suggestions on algorithms, params etc that I should use for better prediction?
machine-learning scikit-learn regression linear-regression
$endgroup$
add a comment |
$begingroup$
I am trying to build a Regression model and I am looking for a way to check whether there's any correlation between features and target variables?
This is my sample dataset
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
0 LP001002 Male No 0 Graduate No 5849
1 LP001003 Male Yes 1 Graduate No 4583
2 LP001005 Male Yes 0 Graduate Yes 3000
3 LP001006 Male Yes 0 Not Graduate No 2583
4 LP001008 Male No 0 Graduate No 6000
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Area Loan_Status
0.0 123 360.0 1.0 Urban Y
1508.0 128.0 360.0 1.0 Rural N
0.0 66.0 360.0 1.0 Urban Y
2358.0 120.0 360.0 1.0 Urban Y
0.0 141.0 360.0 1.0 Urban Y
I am trying to predict LoanAmount column based on the features available above.
I just want to see if there's a correlation between the features and target variable. I tried LinearRegression, GradientBoostingRegressor and I'm hardly getting a accuracy of around 0.30 - 0.40%.
Any suggestions on algorithms, params etc that I should use for better prediction?
machine-learning scikit-learn regression linear-regression
$endgroup$
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago
add a comment |
$begingroup$
I am trying to build a Regression model and I am looking for a way to check whether there's any correlation between features and target variables?
This is my sample dataset
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
0 LP001002 Male No 0 Graduate No 5849
1 LP001003 Male Yes 1 Graduate No 4583
2 LP001005 Male Yes 0 Graduate Yes 3000
3 LP001006 Male Yes 0 Not Graduate No 2583
4 LP001008 Male No 0 Graduate No 6000
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Area Loan_Status
0.0 123 360.0 1.0 Urban Y
1508.0 128.0 360.0 1.0 Rural N
0.0 66.0 360.0 1.0 Urban Y
2358.0 120.0 360.0 1.0 Urban Y
0.0 141.0 360.0 1.0 Urban Y
I am trying to predict LoanAmount column based on the features available above.
I just want to see if there's a correlation between the features and target variable. I tried LinearRegression, GradientBoostingRegressor and I'm hardly getting a accuracy of around 0.30 - 0.40%.
Any suggestions on algorithms, params etc that I should use for better prediction?
machine-learning scikit-learn regression linear-regression
$endgroup$
I am trying to build a Regression model and I am looking for a way to check whether there's any correlation between features and target variables?
This is my sample dataset
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
0 LP001002 Male No 0 Graduate No 5849
1 LP001003 Male Yes 1 Graduate No 4583
2 LP001005 Male Yes 0 Graduate Yes 3000
3 LP001006 Male Yes 0 Not Graduate No 2583
4 LP001008 Male No 0 Graduate No 6000
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Area Loan_Status
0.0 123 360.0 1.0 Urban Y
1508.0 128.0 360.0 1.0 Rural N
0.0 66.0 360.0 1.0 Urban Y
2358.0 120.0 360.0 1.0 Urban Y
0.0 141.0 360.0 1.0 Urban Y
I am trying to predict LoanAmount column based on the features available above.
I just want to see if there's a correlation between the features and target variable. I tried LinearRegression, GradientBoostingRegressor and I'm hardly getting a accuracy of around 0.30 - 0.40%.
Any suggestions on algorithms, params etc that I should use for better prediction?
machine-learning scikit-learn regression linear-regression
machine-learning scikit-learn regression linear-regression
edited Oct 3 '18 at 18:55
user214
asked Oct 3 '18 at 18:43
user214user214
1485
1485
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago
add a comment |
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
Your data can be put into a pandas DataFrame using
import pandas as pd
data = {'Loan ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No'],
'Dependents': [0, 1, 0, 0, 0],
'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141],
'Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
'Loan Status': ['Y', 'N', 'Y', 'Y', 'Y']}
df = pd.DataFrame(data)
Now to get a correlation we need to convert our categorical features to numerical ones. Of course the choice of order will affect the correlation but luckily all of our categories seem to be binary. If this is not the case you will need to devise a custom ordering.
df = pd.DataFrame(data)
df['Married'] =df['Married'].astype('category').cat.codes
df['Education'] =df['Education'].astype('category').cat.codes
df['Self_Employed'] =df['Self_Employed'].astype('category').cat.codes
df['Area'] =df['Area'].astype('category').cat.codes
df['Loan Status'] =df['Loan Status'].astype('category').cat.codes
Now we can get the correlation between the 'LoanAmount' and all the other features.
df[df.columns[1:]].corr()['LoanAmount'][:]

Now using some machine learning on this data is not likely to work. There just is not sufficient data to extract some relevant information between your large number of features and the loan amount.
You need at at least 10 times more instances than features in order to expect to get some good results.
To only obtain the correlation between a feature and a subset of the features you can do
df[['Income', 'Education', 'LoanAmount']].corr()['LoanAmount'][:]
This will take a subset of the DataFrame and then apply the same corr() function as above. Make sure that the subset of columns selected includes the column with which you want to calculate the correlation, in this example that's 'LoanAmount'.
$endgroup$
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I usestandardscalerormin_max_scalar. I want to do a classification task.
$endgroup$
– user214
Oct 5 '18 at 5:55
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
add a comment |
$begingroup$
Method in Python
One way to check the correlation of every feature against the target variable is to run the code:
# Your data should be a pandas dataframe for this example
import pandas
yourdata = ...
corr_matrix = yourdata.corr()
print(corr_matrix["your_target_variable"].sort_values(ascending=False))
The following correlation output should list all the variables and their correlations to the target variable. The negative correlations mean that as the target variable decreases in value, the feature variable increases in value. (Linearly)
To plot the correlations on plots instead, run the code:
# make sure to specify some features that you might want to focus on or the plots might be too big
from pandas.tools.plotting import scatter_matrix
attributes = [list of whatever features you want to plot against the target variable]
scatter_matrix(yourdata[attributes], figsize=(12, 8))
For the figsize argument for the scatter_matrix function, input whatever size fits best.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f39137%2fhow-can-i-check-the-correlation-between-features-and-target-variable%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Your data can be put into a pandas DataFrame using
import pandas as pd
data = {'Loan ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No'],
'Dependents': [0, 1, 0, 0, 0],
'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141],
'Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
'Loan Status': ['Y', 'N', 'Y', 'Y', 'Y']}
df = pd.DataFrame(data)
Now to get a correlation we need to convert our categorical features to numerical ones. Of course the choice of order will affect the correlation but luckily all of our categories seem to be binary. If this is not the case you will need to devise a custom ordering.
df = pd.DataFrame(data)
df['Married'] =df['Married'].astype('category').cat.codes
df['Education'] =df['Education'].astype('category').cat.codes
df['Self_Employed'] =df['Self_Employed'].astype('category').cat.codes
df['Area'] =df['Area'].astype('category').cat.codes
df['Loan Status'] =df['Loan Status'].astype('category').cat.codes
Now we can get the correlation between the 'LoanAmount' and all the other features.
df[df.columns[1:]].corr()['LoanAmount'][:]

Now using some machine learning on this data is not likely to work. There just is not sufficient data to extract some relevant information between your large number of features and the loan amount.
You need at at least 10 times more instances than features in order to expect to get some good results.
To only obtain the correlation between a feature and a subset of the features you can do
df[['Income', 'Education', 'LoanAmount']].corr()['LoanAmount'][:]
This will take a subset of the DataFrame and then apply the same corr() function as above. Make sure that the subset of columns selected includes the column with which you want to calculate the correlation, in this example that's 'LoanAmount'.
$endgroup$
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I usestandardscalerormin_max_scalar. I want to do a classification task.
$endgroup$
– user214
Oct 5 '18 at 5:55
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
add a comment |
$begingroup$
Your data can be put into a pandas DataFrame using
import pandas as pd
data = {'Loan ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No'],
'Dependents': [0, 1, 0, 0, 0],
'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141],
'Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
'Loan Status': ['Y', 'N', 'Y', 'Y', 'Y']}
df = pd.DataFrame(data)
Now to get a correlation we need to convert our categorical features to numerical ones. Of course the choice of order will affect the correlation but luckily all of our categories seem to be binary. If this is not the case you will need to devise a custom ordering.
df = pd.DataFrame(data)
df['Married'] =df['Married'].astype('category').cat.codes
df['Education'] =df['Education'].astype('category').cat.codes
df['Self_Employed'] =df['Self_Employed'].astype('category').cat.codes
df['Area'] =df['Area'].astype('category').cat.codes
df['Loan Status'] =df['Loan Status'].astype('category').cat.codes
Now we can get the correlation between the 'LoanAmount' and all the other features.
df[df.columns[1:]].corr()['LoanAmount'][:]

Now using some machine learning on this data is not likely to work. There just is not sufficient data to extract some relevant information between your large number of features and the loan amount.
You need at at least 10 times more instances than features in order to expect to get some good results.
To only obtain the correlation between a feature and a subset of the features you can do
df[['Income', 'Education', 'LoanAmount']].corr()['LoanAmount'][:]
This will take a subset of the DataFrame and then apply the same corr() function as above. Make sure that the subset of columns selected includes the column with which you want to calculate the correlation, in this example that's 'LoanAmount'.
$endgroup$
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I usestandardscalerormin_max_scalar. I want to do a classification task.
$endgroup$
– user214
Oct 5 '18 at 5:55
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
add a comment |
$begingroup$
Your data can be put into a pandas DataFrame using
import pandas as pd
data = {'Loan ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No'],
'Dependents': [0, 1, 0, 0, 0],
'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141],
'Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
'Loan Status': ['Y', 'N', 'Y', 'Y', 'Y']}
df = pd.DataFrame(data)
Now to get a correlation we need to convert our categorical features to numerical ones. Of course the choice of order will affect the correlation but luckily all of our categories seem to be binary. If this is not the case you will need to devise a custom ordering.
df = pd.DataFrame(data)
df['Married'] =df['Married'].astype('category').cat.codes
df['Education'] =df['Education'].astype('category').cat.codes
df['Self_Employed'] =df['Self_Employed'].astype('category').cat.codes
df['Area'] =df['Area'].astype('category').cat.codes
df['Loan Status'] =df['Loan Status'].astype('category').cat.codes
Now we can get the correlation between the 'LoanAmount' and all the other features.
df[df.columns[1:]].corr()['LoanAmount'][:]

Now using some machine learning on this data is not likely to work. There just is not sufficient data to extract some relevant information between your large number of features and the loan amount.
You need at at least 10 times more instances than features in order to expect to get some good results.
To only obtain the correlation between a feature and a subset of the features you can do
df[['Income', 'Education', 'LoanAmount']].corr()['LoanAmount'][:]
This will take a subset of the DataFrame and then apply the same corr() function as above. Make sure that the subset of columns selected includes the column with which you want to calculate the correlation, in this example that's 'LoanAmount'.
$endgroup$
Your data can be put into a pandas DataFrame using
import pandas as pd
data = {'Loan ID': ['LP001002', 'LP001003', 'LP001005', 'LP001006', 'LP001008'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No'],
'Dependents': [0, 1, 0, 0, 0],
'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141],
'Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
'Loan Status': ['Y', 'N', 'Y', 'Y', 'Y']}
df = pd.DataFrame(data)
Now to get a correlation we need to convert our categorical features to numerical ones. Of course the choice of order will affect the correlation but luckily all of our categories seem to be binary. If this is not the case you will need to devise a custom ordering.
df = pd.DataFrame(data)
df['Married'] =df['Married'].astype('category').cat.codes
df['Education'] =df['Education'].astype('category').cat.codes
df['Self_Employed'] =df['Self_Employed'].astype('category').cat.codes
df['Area'] =df['Area'].astype('category').cat.codes
df['Loan Status'] =df['Loan Status'].astype('category').cat.codes
Now we can get the correlation between the 'LoanAmount' and all the other features.
df[df.columns[1:]].corr()['LoanAmount'][:]

Now using some machine learning on this data is not likely to work. There just is not sufficient data to extract some relevant information between your large number of features and the loan amount.
You need at at least 10 times more instances than features in order to expect to get some good results.
To only obtain the correlation between a feature and a subset of the features you can do
df[['Income', 'Education', 'LoanAmount']].corr()['LoanAmount'][:]
This will take a subset of the DataFrame and then apply the same corr() function as above. Make sure that the subset of columns selected includes the column with which you want to calculate the correlation, in this example that's 'LoanAmount'.
edited Oct 5 '18 at 4:33
answered Oct 5 '18 at 4:07
JahKnowsJahKnows
4,627524
4,627524
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I usestandardscalerormin_max_scalar. I want to do a classification task.
$endgroup$
– user214
Oct 5 '18 at 5:55
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
add a comment |
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I usestandardscalerormin_max_scalar. I want to do a classification task.
$endgroup$
– user214
Oct 5 '18 at 5:55
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
Is it possible to choose what features to keep when looking at correlation of the features?
$endgroup$
– user214
Oct 5 '18 at 4:16
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
@user2475, I added to the answer regarding your question.
$endgroup$
– JahKnows
Oct 5 '18 at 4:33
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
my pleasure! let us know if other questions come up.
$endgroup$
– JahKnows
Oct 5 '18 at 4:39
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I use
standardscaler or min_max_scalar. I want to do a classification task.$endgroup$
– user214
Oct 5 '18 at 5:55
$begingroup$
@JabKnows Just one doubt. Do I need to scale my feature values. I mean should I use
standardscaler or min_max_scalar. I want to do a classification task.$endgroup$
– user214
Oct 5 '18 at 5:55
1
1
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
$begingroup$
It depends on the algorithm you select. Some require it others don't, you'd need to check the algorithm. It is good however to keep the scale of the different features similar.
$endgroup$
– JahKnows
Oct 6 '18 at 22:48
add a comment |
$begingroup$
Method in Python
One way to check the correlation of every feature against the target variable is to run the code:
# Your data should be a pandas dataframe for this example
import pandas
yourdata = ...
corr_matrix = yourdata.corr()
print(corr_matrix["your_target_variable"].sort_values(ascending=False))
The following correlation output should list all the variables and their correlations to the target variable. The negative correlations mean that as the target variable decreases in value, the feature variable increases in value. (Linearly)
To plot the correlations on plots instead, run the code:
# make sure to specify some features that you might want to focus on or the plots might be too big
from pandas.tools.plotting import scatter_matrix
attributes = [list of whatever features you want to plot against the target variable]
scatter_matrix(yourdata[attributes], figsize=(12, 8))
For the figsize argument for the scatter_matrix function, input whatever size fits best.
$endgroup$
add a comment |
$begingroup$
Method in Python
One way to check the correlation of every feature against the target variable is to run the code:
# Your data should be a pandas dataframe for this example
import pandas
yourdata = ...
corr_matrix = yourdata.corr()
print(corr_matrix["your_target_variable"].sort_values(ascending=False))
The following correlation output should list all the variables and their correlations to the target variable. The negative correlations mean that as the target variable decreases in value, the feature variable increases in value. (Linearly)
To plot the correlations on plots instead, run the code:
# make sure to specify some features that you might want to focus on or the plots might be too big
from pandas.tools.plotting import scatter_matrix
attributes = [list of whatever features you want to plot against the target variable]
scatter_matrix(yourdata[attributes], figsize=(12, 8))
For the figsize argument for the scatter_matrix function, input whatever size fits best.
$endgroup$
add a comment |
$begingroup$
Method in Python
One way to check the correlation of every feature against the target variable is to run the code:
# Your data should be a pandas dataframe for this example
import pandas
yourdata = ...
corr_matrix = yourdata.corr()
print(corr_matrix["your_target_variable"].sort_values(ascending=False))
The following correlation output should list all the variables and their correlations to the target variable. The negative correlations mean that as the target variable decreases in value, the feature variable increases in value. (Linearly)
To plot the correlations on plots instead, run the code:
# make sure to specify some features that you might want to focus on or the plots might be too big
from pandas.tools.plotting import scatter_matrix
attributes = [list of whatever features you want to plot against the target variable]
scatter_matrix(yourdata[attributes], figsize=(12, 8))
For the figsize argument for the scatter_matrix function, input whatever size fits best.
$endgroup$
Method in Python
One way to check the correlation of every feature against the target variable is to run the code:
# Your data should be a pandas dataframe for this example
import pandas
yourdata = ...
corr_matrix = yourdata.corr()
print(corr_matrix["your_target_variable"].sort_values(ascending=False))
The following correlation output should list all the variables and their correlations to the target variable. The negative correlations mean that as the target variable decreases in value, the feature variable increases in value. (Linearly)
To plot the correlations on plots instead, run the code:
# make sure to specify some features that you might want to focus on or the plots might be too big
from pandas.tools.plotting import scatter_matrix
attributes = [list of whatever features you want to plot against the target variable]
scatter_matrix(yourdata[attributes], figsize=(12, 8))
For the figsize argument for the scatter_matrix function, input whatever size fits best.
edited Oct 7 '18 at 22:11
Stephen Rauch
1,53751129
1,53751129
answered Oct 5 '18 at 1:23
Ethan YunEthan Yun
846
846
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f39137%2fhow-can-i-check-the-correlation-between-features-and-target-variable%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Is there a special function for this in R?
$endgroup$
– alkanschtein
13 hours ago