Categorical data for sklearn's Isolation Forest
I'm trying to do anomaly detection with Isolation Forests (IF) in sklearn.
Besides being a great anomaly detection method, I also want to use it because about half of my features are categorical (font names, etc.).
The cardinality is too high for one-hot encoding (1000+ values, and that would be just one of many such features), and I'm in any case looking for a more robust data representation.
Also, I want to experiment with other clustering techniques later on, so I don't necessarily want to use label encoding, since it misrepresents the data in Euclidean space.
I thus have a two-part question:
1. How will label encoding (i.e. ordinal numbers) affect tree-based methods such as the Isolation Forest? Since they aren't distance-based, they shouldn't make assumptions about ordinal data, right?
2. What other feature transformations can I consider for distance-based models?
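For reference, this is the kind of label-encoded setup I mean; a minimal sketch with made-up data and column names:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data; "font" stands in for one of my real categorical columns.
df = pd.DataFrame({
    "font": ["Calibri", "Arial", "Calibri", "Comic Sans", "Arial"],
    "size": [11, 12, 11, 48, 10],
})

# Label encoding: each category becomes an arbitrary integer code.
df["font_code"] = df["font"].astype("category").cat.codes

clf = IsolationForest(random_state=0)
clf.fit(df[["font_code", "size"]])
scores = clf.decision_function(df[["font_code", "size"]])  # lower = more anomalous
```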
feature-engineering categorical-data ensemble-modeling
asked Jul 25 '18 at 14:29
amateurjustin
2 Answers
I would really try not to use ordinal numbers for categorical data. It imposes a false magnitude and ordering in the model, especially when you have 1,000+ category values. For example, the difference between Brush Script and Calibri could come out very small while the difference between Calibri and Times New Roman comes out unbelievably huge (assuming lexicographic assignment), when really they're all just different fonts.
You could:
- Try to figure out groupings of similar categories that make sense, then one-hot encode those groupings so you don't end up with too many columns.
- One-hot encode the whole thing and then apply a dimensionality reduction technique to bring the feature space back down to something sensible (see the sketch after this list).
- Use an autoencoder or another neural method to learn an embedding of fixed dimension.
One thing you should definitely be careful about is how you combine the result of this process with the other half of your features.
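As a rough sketch of the second option; the column names and component count here are placeholder assumptions, not a definitive recipe:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["font"]   # hypothetical column names
numeric_cols = ["size"]

# One-hot the categorical columns (sparse output), pass numerics through.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", "passthrough", numeric_cols),
])

model = Pipeline([
    ("pre", pre),
    ("svd", TruncatedSVD(n_components=50)),  # compress the 1000+ one-hot columns
    ("iforest", IsolationForest(random_state=0)),
])

# model.fit(df); model.decision_function(df) then scores rows on the
# compressed representation.
```

TruncatedSVD is used here instead of plain PCA because it works directly on the sparse one-hot matrix without densifying it.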
answered Jul 25 '18 at 15:07
Matthew
Hi @Matthew, thanks for the answer. I had never considered doing something like PCA over one-hot vectors. Will that work? Also, since I'll initially be using tree-based methods, does it really matter that ordinal data isn't representative? It's not a distance-based model, after all. But that's just my thinking; I could be completely wrong here, and I haven't gotten around to finding that answer myself.
– amateurjustin Jul 26 '18 at 11:10
That said, I do think embeddings should be the answer. Do you know of any well-defined examples, or maybe even a library at this stage? I need a POC very soon and don't want to painstakingly write code when I could quickly use a library that does the embedding itself.
– amateurjustin Jul 26 '18 at 11:13
I trained an isolation forest on a dataset containing both categorical and numeric features, and it worked properly. How is that possible?
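For illustration, roughly the kind of setup I mean; the data and column names below are made up:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # categorical
    "value": [1.0, 2.5, 1.1, 9.9],             # numeric
})

# sklearn estimators require numeric input: raw strings raise a ValueError,
# so the categorical column has to be integer-coded before fitting.
df["color"] = df["color"].astype("category").cat.codes

IsolationForest(random_state=0).fit(df)
```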
answered 2 days ago
Shivanya
New contributor