In which cases shouldn't we drop the first level of categorical variables?












$begingroup$


I'm a beginner in machine learning, looking into one-hot encoding.

In statistics, the usual advice is to drop the first level and keep k-1 dummies (as discussed here on SE), but it seems that some models need to keep it and use all k dummies.

I know that having k dummies can lead to collinearity problems, but I'm not aware of any problem caused by having k-1.

But since pandas.get_dummies() has its drop_first argument set to False by default, keeping all k dummies must be useful sometimes.

In which cases (algorithms, parameters...) would I want to keep the first level and fit with k dummies for each categorical variable?

EDIT: @EliasStrehle's comment on the above-mentioned link states that this is only true if the model has an intercept. Is this rule generalizable? What about algorithms like KNN or trees, which are not exactly models in the statistical sense?

















$endgroup$








  • 1




    $begingroup$
    Do you have any specific algorithms you're interested in? I can see this question being answered differently depending on the algorithm (e.g. regression vs. decision tree).
    $endgroup$
    – Alex L
    yesterday








  • 1




    $begingroup$
    Actually, this is my point. How can I know whether a given algorithm needs the first level of its categorical variables dropped or not?
    $endgroup$
    – Dan Chaltiel
    yesterday
















machine-learning algorithms encoding dummy-variables






Question asked yesterday by Dan Chaltiel, edited 9 hours ago.





1 Answer












$begingroup$

First, if your data has missing values, get_dummies by default encodes them as a row of all zeros, so the dummy columns no longer sum to 1 on every row and perfect multicollinearity doesn't actually hold. Also, from a data manipulation standpoint (without regard for modeling), it makes some sense to keep the symmetry of having a dummy for every value of the categorical variable.
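
To make the missing-value point concrete, here is a minimal sketch. The `colour` column and its values are made up for illustration; only `pandas.get_dummies` and its `dummy_na` parameter come from pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
colour = pd.Series(["red", "green", np.nan, "blue"], name="colour")

# Defaults: dummy_na=False, drop_first=False.
dummies = pd.get_dummies(colour)
row_sums = dummies.sum(axis=1).tolist()
print(dummies.columns.tolist())
print(row_sums)  # the row with the missing value sums to 0, the others to 1

# dummy_na=True adds an explicit indicator column for missing values.
with_na = pd.get_dummies(colour, dummy_na=True)
with_na_sums = with_na.sum(axis=1).tolist()
print(with_na_sums)
```

Because one row sums to 0, the dummy columns are not an exact copy of a constant column, which is what perfect collinearity with an intercept would require.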



In a decision tree (and various ensembles thereof), keeping all the dummies is beneficial: if you remove the first dummy, the model can only isolate that level by splitting on "not this other dummy" for each of the remaining dummies in turn, which takes several steps in the tree and is rather unlikely to happen.

Then again, it's probably better not to one-hot encode at all for decision trees, but for now some packages don't handle categorical variables natively.
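
The "several steps" claim can be checked by brute force. This is only a sketch under simple assumptions: a tree path is modelled as a set of dummy columns that it tests, and the helper names `encode` and `splits_needed` and the levels are hypothetical. It does not fit an actual tree.

```python
from itertools import combinations

def encode(levels, drop_first):
    """Map each level to its dummy vector as a dict {column_name: 0 or 1}."""
    columns = levels[1:] if drop_first else levels
    return {lvl: {col: int(lvl == col) for col in columns} for lvl in levels}

def splits_needed(target, table):
    """Smallest number of dummy columns a tree path must test to isolate `target`."""
    columns = list(next(iter(table.values())))
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            pattern = [table[target][c] for c in subset]
            others = [[row[c] for c in subset]
                      for lvl, row in table.items() if lvl != target]
            if pattern not in others:
                return size
    return None

levels = ["a", "b", "c", "d"]  # hypothetical levels; "a" is the one drop_first removes

results = {}
for drop_first in (False, True):
    table = encode(levels, drop_first)
    results[drop_first] = {lvl: splits_needed(lvl, table) for lvl in levels}
    print("drop_first =", drop_first, results[drop_first])
```

With all dummies kept, every level can be isolated with a single split. With the first one dropped, the removed level needs one split per remaining dummy.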



K-nearest neighbors seems like it would also benefit from keeping all levels. With one dummy removed, the taxicab distance (restricted to the dummies of one feature) between two points with different values is 1 if one of those values is the removed level and 2 otherwise, so the removed level ends up artificially closer to every other level.

But again, KNN would probably be better off without one-hot encoding, using instead some more informed measure of distance between the category's values, if you can come up with one.
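
The distances described above can be reproduced with a few lines of plain Python. The helper names `one_hot`, `taxicab`, and `pairwise` and the three levels are assumptions made up for this sketch.

```python
from itertools import combinations

def one_hot(level, levels, drop_first=False):
    """Encode `level` as a 0/1 tuple; optionally omit the first level's column."""
    kept = levels[1:] if drop_first else levels
    return tuple(int(level == name) for name in kept)

def taxicab(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

levels = ["a", "b", "c"]  # hypothetical category values; "a" is the first level

def pairwise(drop_first):
    """Taxicab distance between every pair of distinct levels."""
    return {
        (p, q): taxicab(one_hot(p, levels, drop_first),
                        one_hot(q, levels, drop_first))
        for p, q in combinations(levels, 2)
    }

full = pairwise(drop_first=False)
dropped = pairwise(drop_first=True)
print(full)     # every pair of distinct levels is equally far apart
print(dropped)  # pairs involving the removed level "a" are closer
```

Keeping all dummies makes the encoding symmetric: no level is treated as closer to the others just because it happened to be listed first.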



See also https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding

(In particular, when using regularization in a linear model, it may be worth keeping all dummies.)
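
For the linear-model case, the collinearity with an intercept and the effect of a penalty can be checked numerically. This is a rough sketch with NumPy on made-up data; the ridge-style penalty that leaves the intercept unpenalized is an assumption of the example, not something taken from the linked thread.

```python
import numpy as np

# Hypothetical toy data: one categorical feature with k = 3 levels.
levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
k = 3

full = np.eye(k)[levels]       # k dummies, one column per level
dropped = full[:, 1:]          # k-1 dummies, first level dropped
intercept = np.ones((len(levels), 1))

X_full = np.hstack([intercept, full])        # 1 + k columns
X_dropped = np.hstack([intercept, dropped])  # 1 + (k-1) columns

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(rank_full, X_full.shape[1])        # rank is below the column count
print(rank_dropped, X_dropped.shape[1])  # full column rank

# Ridge-style penalty on the dummy coefficients only (intercept unpenalized).
lam = 1.0
penalty = lam * np.diag([0.0] + [1.0] * k)
rank_ridge = np.linalg.matrix_rank(X_full.T @ X_full + penalty)
print(rank_ridge)  # the penalized normal equations are nonsingular
```

Without a penalty, intercept plus k dummies gives a rank-deficient design, so the least-squares coefficients are not unique. With the penalty, the system has a unique solution even with all k dummies, which is consistent with the suggestion to keep them in regularized models.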

















$endgroup$













  • $begingroup$
    Very interesting, but you only addressed my examples. If there is no general rule on this matter, what concepts should I learn to be able to tell?
    $endgroup$
    – Dan Chaltiel
    7 hours ago










  • $begingroup$
    Additionally, I'm using Python's scikit-learn, which apparently needs one-hot encoding beforehand.
    $endgroup$
    – Dan Chaltiel
    7 hours ago










  • $begingroup$
    I'm not sure of a general rule. I suspect keeping all dummies is generally better except when the model assumes that there is no multicollinearity. As another example, neural networks are linear before activations, so they can use the multicollinearity to recover the removed dummy internally; but I don't think leaving the dummy there will hurt the model.
    $endgroup$
    – Ben Reiniger
    4 hours ago










  • $begingroup$
    +1 Thanks, your answer was definitely helpful.
    $endgroup$
    – Dan Chaltiel
    4 hours ago











Answer by Ben Reiniger, answered 7 hours ago, edited 4 hours ago.











