Categorical data for sklearn's Isolation Forest













I'm trying to do anomaly detection with Isolation Forests (IF) in sklearn.
Besides being a great method for anomaly detection, I also want to use it because about half of my features are categorical (font names, etc.).

I have too many categories for one-hot encoding (1000+ for a single feature, and that is just one of many features), and I'm looking for a more robust data representation anyway.

I also want to experiment with other clustering techniques later on, so I'd rather not use label encoding, since it would misrepresent the data in Euclidean space.

So I have a two-part question:

1. How will label encoding (i.e. ordinal numbers) affect tree-based methods such as the Isolation Forest? Since they aren't distance-based, they shouldn't make assumptions about ordinal data, right?

2. What other feature transformations can I consider for distance-based models?








      feature-engineering categorical-data ensemble-modeling







      asked Jul 25 '18 at 14:29









      amateurjustin


          2 Answers

I would really try not to use ordinal numbers for categorical data. It imposes a false magnitude and ordering on the model, especially when you have 1,000+ distinct values. For example, the difference between Brush Script and Calibri could be very small and the difference between Calibri and Times New Roman UNBELIEVABLY HUGE (assuming lexicographical assignment), when really they're all just different fonts.
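To make the lexicographic effect concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder`. The font list is made up for illustration; with its default settings the encoder sorts the category strings, so the integer codes follow alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical font names, purely for illustration.
fonts = [
    "Times New Roman", "Calibri", "Georgia", "Arial",
    "Brush Script", "Helvetica", "Courier New", "Comic Sans MS",
]
X = np.array(fonts).reshape(-1, 1)  # encoders expect a 2-D array

# Default behaviour: categories are sorted, then numbered 0..n-1.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(X).ravel()
code_of = dict(zip(fonts, codes))

# Gaps between codes reflect spelling, not any real similarity between fonts.
gap_small = abs(code_of["Brush Script"] - code_of["Calibri"])
gap_large = abs(code_of["Calibri"] - code_of["Times New Roman"])
print(gap_small, gap_large)
```

With this made-up list, Brush Script and Calibri end up one code apart while Calibri and Times New Roman end up five apart. On the tree-based side of the question: as far as I understand it, an Isolation Forest splits a feature at a random threshold between its minimum and maximum, so on integer codes it can only cut off contiguous runs of this arbitrary ordering, and categories near either end of the ordering tend to be easier to isolate than those in the middle.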



You could:

1. Try to figure out groupings of similar features that make sense, then one-hot those groupings so you wouldn't end up with too many columns.

2. One-hot the whole thing and then try some dimensionality reduction techniques to get the feature space back down to something sensible.

3. Try to use an autoencoder or neural method to learn an embedding of fixed dimension.

One thing you should definitely be careful of is how you combine the result of this process with whatever the other half of your features are.
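As a rough sketch of option 2 plus the combination step, the snippet below one-hot encodes a high-cardinality column, reduces it with `TruncatedSVD` (chosen here because it accepts the sparse one-hot output directly; PCA or another reducer would be an equally valid reading of the suggestion), stacks it next to standardized numeric columns, and fits an `IsolationForest`. The data, the column meanings, and the choice of 20 components are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n_rows = 500

# Hypothetical data: one categorical column with up to 200 levels,
# plus two numeric columns on very different scales.
levels = np.array([f"font_{i}" for i in range(200)])
X_cat = rng.choice(levels, size=(n_rows, 1))
X_num = np.column_stack([
    rng.normal(12.0, 2.0, n_rows),
    rng.normal(5000.0, 800.0, n_rows),
])

# One-hot (sparse by default), then reduce to a fixed number of columns.
cat_block = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    TruncatedSVD(n_components=20, random_state=0),
).fit_transform(X_cat)

# Standardize the numeric half so both blocks are on comparable scales.
num_block = StandardScaler().fit_transform(X_num)

# Combine both halves into one feature matrix.
X_all = np.hstack([num_block, cat_block])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_all)
labels = forest.predict(X_all)        # +1 = inlier, -1 = outlier
scores = forest.score_samples(X_all)  # lower = more anomalous
```

Two things to keep in mind when combining the halves. Scaling matters mostly for the distance-based models mentioned in the question. For the Isolation Forest, the number of columns per block matters more, since features are picked at random for each split: here the categorical block contributes 20 columns against 2 numeric ones, so it will be chosen far more often.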








            answered Jul 25 '18 at 15:07









            Matthew


• Hi @Matthew, thanks for the answer. I had never considered doing something like PCA over one-hot vectors. Will that work? Also, since I'll be using tree-based methods first, does it really matter that ordinal codes aren't representative? I mean, it's not a distance-based model. But that's just what I think; I could be completely wrong here, and I haven't gotten around to finding the answer myself.
  – amateurjustin, Jul 26 '18 at 11:10










• However, I do think that embeddings should be the answer. Do you know of any well-defined examples, or maybe even a library? I need a POC real soon and don't want to painstakingly write code when I could quickly use a library that does the embedding itself.
  – amateurjustin, Jul 26 '18 at 11:13



















I coded an isolation forest with a dataset containing both categorical and numeric features, and it is working properly. How is that possible?








                answered 2 days ago









                Shivanya
