Categorical data for sklearn's Isolation Forest
I'm trying to do anomaly detection with Isolation Forests (IF) in sklearn.
Besides being a great anomaly detection method, I also want to use it because about half of my features are categorical (font names, etc.).
The cardinality is too high for one-hot encoding (1000+ values, and that would be just one of many such features), and I'm in any case looking for a more robust data representation.
Also, I want to experiment with other clustering techniques later on, so I don't necessarily want to use label encoding, since it misrepresents the data in Euclidean space.
I thus have a two-part question:
1. How will label encoding (i.e. ordinal numbers) affect tree-based methods such as the Isolation Forest? Since they aren't distance-based, they shouldn't make assumptions about ordinal data, right?
2. What other feature transformations can I consider for distance-based models?
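For reference, this is the kind of label-encoded setup I mean; a minimal sketch with made-up data and column names:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data; "font" stands in for one of my real categorical columns.
df = pd.DataFrame({
    "font": ["Calibri", "Arial", "Calibri", "Comic Sans", "Arial"],
    "size": [11, 12, 11, 48, 10],
})

# Label encoding: each category becomes an arbitrary integer code.
df["font_code"] = df["font"].astype("category").cat.codes

clf = IsolationForest(random_state=0)
clf.fit(df[["font_code", "size"]])
scores = clf.decision_function(df[["font_code", "size"]])  # lower = more anomalous
```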
feature-engineering categorical-data ensemble-modeling
asked Jul 25 '18 at 14:29
amateurjustin
2 Answers
I would really try not to use ordinal numbers for categorical data. It imposes a false magnitude and ordering in the model, especially when you have 1,000+ category values. For example, the difference between Brush Script and Calibri could come out very small while the difference between Calibri and Times New Roman comes out unbelievably huge (assuming lexicographic assignment), when really they're all just different fonts.
You could:
- Try to figure out groupings of similar categories that make sense, then one-hot encode those groupings so you don't end up with too many columns.
- One-hot encode the whole thing and then apply a dimensionality reduction technique to bring the feature space back down to something sensible (see the sketch after this list).
- Use an autoencoder or another neural method to learn an embedding of fixed dimension.
One thing you should definitely be careful about is how you combine the result of this process with the other half of your features.
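As a rough sketch of the second option; the column names and component count here are placeholder assumptions, not a definitive recipe:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["font"]   # hypothetical column names
numeric_cols = ["size"]

# One-hot the categorical columns (sparse output), pass numerics through.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", "passthrough", numeric_cols),
])

model = Pipeline([
    ("pre", pre),
    ("svd", TruncatedSVD(n_components=50)),  # compress the 1000+ one-hot columns
    ("iforest", IsolationForest(random_state=0)),
])

# model.fit(df); model.decision_function(df) then scores rows on the
# compressed representation.
```

TruncatedSVD is used here instead of plain PCA because it works directly on the sparse one-hot matrix without densifying it.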
answered Jul 25 '18 at 15:07
Matthew
Hi @Matthew, thanks for the answer. I had never considered doing something like PCA over one-hot vectors. Will that work? Also, since I'll initially be using tree-based methods, does it really matter that ordinal data isn't representative? It's not a distance-based model, after all. But that's just my thinking; I could be completely wrong here, and I haven't gotten around to finding that answer myself.
– amateurjustin Jul 26 '18 at 11:10
That said, I do think embeddings should be the answer. Do you know of any well-defined examples, or maybe even a library at this stage? I need a POC very soon and don't want to painstakingly write code when I could quickly use a library that does the embedding itself.
– amateurjustin Jul 26 '18 at 11:13
I trained an isolation forest on a dataset containing both categorical and numeric features, and it worked properly. How is that possible?
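For illustration, roughly the kind of setup I mean; the data and column names below are made up:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # categorical
    "value": [1.0, 2.5, 1.1, 9.9],             # numeric
})

# sklearn estimators require numeric input: raw strings raise a ValueError,
# so the categorical column has to be integer-coded before fitting.
df["color"] = df["color"].astype("category").cat.codes

IsolationForest(random_state=0).fit(df)
```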
answered 2 days ago
Shivanya
New contributor