In which cases shouldn't we drop the first level of categorical variables?
As a beginner in machine learning, I'm looking into the one-hot encoding concept.
Unlike in statistics, where you always want to drop the first level to have k-1 dummies (as discussed here on SE), it seems that some models need to keep it and have k dummies.
I know that having k levels can lead to collinearity problems, but I'm not aware of any problem caused by having k-1 levels.
But since pandas.get_dummies() has its drop_first argument set to False by default, keeping all k dummies must be useful sometimes.
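For concreteness, here is a minimal sketch of the two behaviours on made-up toy data:

```python
# Minimal sketch on toy data: the default encoding (k dummies)
# versus drop_first=True (k - 1 dummies).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

print(pd.get_dummies(df, columns=["color"]))                   # 3 dummy columns
print(pd.get_dummies(df, columns=["color"], drop_first=True))  # 2 dummy columns
```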
In which cases (algorithms, parameters...) would I want to keep the first level and fit with k levels for each categorical variable?
EDIT: @EliasStrehle's comment on the above-mentioned link states that this is only true if the model has an intercept. Is this rule generalizable? What about algorithms like KNN or trees, which are not models in the strict statistical sense?
machine-learning algorithms encoding dummy-variables
asked yesterday, edited 9 hours ago by Dan Chaltiel (new contributor)
Comments:

– Alex L (yesterday): Do you have any specific algorithms you're interested in? I can see this question being answered differently depending on the algorithm (e.g. regression vs. decision tree).

– Dan Chaltiel (yesterday): Actually, this is my point. How could I know whether a given algorithm needs to drop the first level of its categorical variables or not?
1 Answer
First, if your data has missing values, get_dummies by default will produce all zeros for those rows, so perfect multicollinearity doesn't actually hold. Also, from a data-manipulation standpoint (without regard for modeling), it makes some sense to keep the symmetry of having a dummy for every value of the categorical variable.
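A minimal sketch of that behaviour (toy data assumed; dummy_na left at its default of False):

```python
# With a NaN, get_dummies emits an all-zero row, so the k dummy columns
# no longer sum to 1 and exact collinearity with the intercept breaks down.
import numpy as np
import pandas as pd

dummies = pd.get_dummies(pd.Series(["red", "green", np.nan], name="color"))
print(dummies)
print(dummies.sum(axis=1))  # 1, 1, 0 -- the NaN row is all zeros
```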
In a decision tree (and various ensembles thereof), keeping all the dummies is beneficial: if you remove the first dummy, then the model can only select that level by selecting (through several steps in the tree, rather unlikely!) "not any of the other dummies."
Then again, it's probably better not to one-hot encode at all for decision trees, but for now some packages don't handle categorical variables natively.
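As a rough illustration (hypothetical toy data, scikit-learn assumed): with all k dummies a tree can isolate a level in one split, while after drop_first it needs a chain of "not the other levels" splits:

```python
# Hedged sketch: the target is 1 exactly when cat == "a".
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({"cat": ["a", "b", "c"] * 4, "y": [1, 0, 0] * 4})

for drop in (False, True):
    X = pd.get_dummies(df[["cat"]], drop_first=drop)
    tree = DecisionTreeClassifier(random_state=0).fit(X, df["y"])
    # drop_first=False: a single split on cat_a suffices.
    # drop_first=True: two splits (cat_b <= 0.5, then cat_c <= 0.5) are needed.
    print(export_text(tree, feature_names=list(X.columns)))
```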
K-nearest neighbors seems like it would also benefit from keeping all levels. The taxicab distance, restricted to the dummies of one feature, between two points with different values is 1 if one of their values was the removed dummy, and 2 otherwise.
But again, it seems like KNN would be better off without one-hot encoding, using instead some more informed measure of distance between the category's values, if you can come up with one.
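A minimal sketch of those distances (category levels are hypothetical):

```python
# Taxicab (L1) distance between one-hot rows of a single categorical feature.
from scipy.spatial.distance import cityblock

# All k dummies kept: any two distinct levels are exactly 2 apart.
a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(cityblock(a, b), cityblock(b, c))      # 2 2

# First level "a" dropped: pairs involving "a" shrink to 1, other pairs stay at 2.
a_, b_, c_ = [0, 0], [1, 0], [0, 1]
print(cityblock(a_, b_), cityblock(b_, c_))  # 1 2
```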
See also https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding
(In particular, when using regularization in a linear model, it may be worth keeping all dummies.)
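As a minimal sketch of that point (toy data and Ridge chosen arbitrarily): with drop_first the reference level is absorbed into the unpenalized intercept and only the remaining levels are shrunk, so the two encodings give different fits:

```python
# Hedged sketch: compare a ridge fit on all k dummies versus k - 1 dummies.
import pandas as pd
from sklearn.linear_model import Ridge

df = pd.DataFrame({"cat": ["a", "b", "c"] * 4, "y": [1.0, 2.0, 10.0] * 4})

for drop in (False, True):
    X = pd.get_dummies(df[["cat"]], drop_first=drop)
    model = Ridge(alpha=1.0).fit(X, df["y"])
    print("drop_first =", drop,
          dict(zip(X.columns, model.coef_.round(2))),
          "intercept =", round(float(model.intercept_), 2))
```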
– Ben Reiniger (answered 7 hours ago, edited 4 hours ago)
Comments:

– Dan Chaltiel (7 hours ago): Very interesting, but you only answered my examples. If there is no general rule on this matter, what concepts should I learn to be able to tell?

– Dan Chaltiel (7 hours ago): Additionally, I'm using Python's scikit, which apparently needs one-hot encoding beforehand.

– Ben Reiniger (4 hours ago): I'm not sure of a general rule. I suspect keeping all dummies is generally better except when the model assumes that there is no multicollinearity. As another example, neural networks are linear before activations, so they can use the multicollinearity to recover the removed dummy internally; but I don't think leaving the dummy there will hurt the model.

– Dan Chaltiel (4 hours ago): +1 Thanks, your answer definitely helped.