In which cases shouldn't we drop the first level of categorical variables?
As a beginner in machine learning, I'm looking into the concept of one-hot encoding.

Unlike in statistics, where you always want to drop the first level so as to get k-1 dummies (as discussed here on SE), it seems that some models need to keep it and use k dummies.

I know that having k levels can lead to collinearity problems, but I'm not aware of any problem caused by having k-1 levels.

Since pandas.get_dummies() has its drop_first argument set to False by default, keeping all k levels must be useful sometimes.

In which cases (algorithms, parameters...) would I want to keep the first level and fit with k dummies for each categorical variable?

EDIT: @EliasStrehle's comment on the link above states that dropping the first level only matters if the model has an intercept. Can this rule be generalized? What about algorithms like KNN or trees, which are not models in the strict statistical sense?
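For concreteness, here is a minimal sketch of what I mean (the column name and values are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # k = 3 dummies: blue, green, red
    print(pd.get_dummies(df["color"]).astype(int))

    # k - 1 = 2 dummies: the first level ("blue") is dropped
    print(pd.get_dummies(df["color"], drop_first=True).astype(int))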










Tags: machine-learning, algorithms, encoding, dummy-variables

Asked yesterday by Dan Chaltiel (edited 9 hours ago)
Comments:

  • Alex L (yesterday): Do you have any specific algorithms in mind? I can see this question being answered differently depending on the algorithm (e.g. regression vs. decision tree).

  • Dan Chaltiel (yesterday): Actually, that is my point: how can I tell whether a given algorithm needs the first level of its categorical variables dropped or not?
1 Answer
First, if your data has missing values, get_dummies by default (dummy_na=False) gives them an all-zero row, so perfect multicollinearity doesn't actually hold. Also, from a pure data-manipulation standpoint (setting modeling aside), it makes sense to keep the symmetry of having one dummy per value of the categorical variable.
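A quick sketch of the missing-value point (the data is invented; only the default dummy_na=False behaviour matters):

    import numpy as np
    import pandas as pd

    s = pd.Series(["a", "b", np.nan, "c"])
    dummies = pd.get_dummies(s).astype(int)  # default: dummy_na=False
    print(dummies)
    #    a  b  c
    # 0  1  0  0
    # 1  0  1  0
    # 2  0  0  0   <- the missing value gets no dummy at all
    # 3  0  0  1

    # the rows no longer all sum to 1, so the k columns are not an exact
    # linear combination of the intercept column
    print(dummies.sum(axis=1).tolist())  # [1, 1, 0, 1]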



In a decision tree (and various ensembles thereof), keeping all the dummies is beneficial: if you remove the first dummy, the model can only isolate that level by selecting "not any of the other dummies", which requires several splits in the tree and is therefore unlikely to be learned.

Then again, it's probably better not to one-hot encode at all for decision trees; for now, though, some packages don't handle categorical variables natively.
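A small sketch of the tree argument, on made-up data where the target is 1 exactly for the level that drop_first would remove:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    cat = pd.Series(["a", "b", "c"] * 50, name="cat")
    y = (cat == "a").astype(int)  # "a" is the level drop_first removes

    X_full = pd.get_dummies(cat).astype(int)                   # k = 3 dummies
    X_drop = pd.get_dummies(cat, drop_first=True).astype(int)  # k - 1 = 2 dummies

    tree_full = DecisionTreeClassifier(random_state=0).fit(X_full, y)
    tree_drop = DecisionTreeClassifier(random_state=0).fit(X_drop, y)

    print(tree_full.get_depth())  # 1: a single split on the "a" dummy isolates the level
    print(tree_drop.get_depth())  # 2: the tree has to learn "b == 0" and then "c == 0"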



K-nearest neighbors also seems to benefit from keeping all levels. Restricted to the dummies of one feature, the taxicab distance between two points with different values is 1 if one of those values is the removed level, and 2 otherwise.

But again, KNN would probably be better off without one-hot encoding, using instead some more informed measure of distance between the category's values, if you can come up with one.
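To make the distance asymmetry concrete, a minimal sketch with an assumed three-level category:

    import pandas as pd
    from scipy.spatial.distance import pdist, squareform

    levels = pd.Series(["a", "b", "c"])

    full = pd.get_dummies(levels).astype(float)                   # k dummies
    drop = pd.get_dummies(levels, drop_first=True).astype(float)  # "a" becomes the all-zero row

    print(squareform(pdist(full, metric="cityblock")))
    # [[0. 2. 2.]
    #  [2. 0. 2.]
    #  [2. 2. 0.]]   every pair of distinct levels is equally far apart

    print(squareform(pdist(drop, metric="cityblock")))
    # [[0. 1. 1.]
    #  [1. 0. 2.]
    #  [1. 2. 0.]]   pairs involving the dropped level "a" look artificially closer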



See also https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding

(In particular, when using regularization in a linear model, it may be worth keeping all dummies.)
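A hedged sketch of that last point, with made-up numbers: with an intercept and an L2 penalty, keeping all k dummies lets the penalty shrink every level toward the overall mean rather than toward an arbitrary reference level.

    import pandas as pd
    from sklearn.linear_model import Ridge

    df = pd.DataFrame({"cat": ["a", "b", "c"] * 4,
                       "y":   [1.0, 2.0, 4.0] * 4})

    X_full = pd.get_dummies(df["cat"]).astype(float)  # all k dummies kept
    model = Ridge(alpha=1.0).fit(X_full, df["y"])

    print(round(model.intercept_, 2))                       # roughly the overall mean of y
    print(dict(zip(X_full.columns, model.coef_.round(2))))  # per-level offsets, shrunk symmetrically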






Answered 7 hours ago by Ben Reiniger (edited 4 hours ago)
Comments:

  • Dan Chaltiel (7 hours ago): Very interesting, but you only answered my examples. If there is no general rule on this matter, what concepts should I learn to be able to tell?

  • Dan Chaltiel (7 hours ago): Additionally, I'm using Python's scikit-learn, which apparently needs one-hot encoding beforehand.

  • Ben Reiniger (4 hours ago): I'm not sure of a general rule. I suspect keeping all dummies is generally better except when the model assumes there is no multicollinearity. As another example, neural networks are linear before their activations, so they can use the multicollinearity to recover the removed dummy internally; but I don't think leaving the dummy in will hurt the model.

  • Dan Chaltiel (4 hours ago): +1 Thanks, your answer definitely helped.










