Using TF-IDF with other features in SKLearn

What is the best way to combine text analysis with other features? For example, I have a dataset that contains some text along with other features/categories. scikit-learn's TF-IDF vectoriser transforms text data into sparse matrices, which I can pass directly to a Naive Bayes classifier, for example. But how do I also take the other features into account? Should I de-sparsify the TF-IDF representation of the text and combine the features and the text into one DataFrame? Or can I keep the sparse matrix as a separate column, for example? What's the correct way to do this?

python scikit-learn pandas tfidf

asked Sep 4 '17 at 11:30 by lte__

2 Answers

scikit-learn's FeatureUnion concatenates the features produced by several transformers, such as different vectorizers, into a single feature matrix. An example of combining heterogeneous data, including text, can be found here.
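
As a rough illustration of the FeatureUnion approach, here is a minimal sketch. The toy DataFrame, the column names `text` and `channel`, the helper functions `select_text` and `select_channel`, and the choice of a one-hot encoder for the non-text column are all assumptions made for this example, not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical toy data: one free-text column and one categorical column.
df = pd.DataFrame({
    "text": ["cheap pills online", "meeting at noon",
             "win cash now", "lunch tomorrow?"],
    "channel": ["email", "chat", "email", "chat"],
})
y = [1, 0, 1, 0]


def select_text(frame):
    # TfidfVectorizer expects a 1-D iterable of strings.
    return frame["text"]


def select_channel(frame):
    # OneHotEncoder expects 2-D input, hence the double brackets.
    return frame[["channel"]]


# Each branch selects its column, then turns it into (sparse) features.
text_branch = Pipeline([
    ("select", FunctionTransformer(select_text)),
    ("tfidf", TfidfVectorizer()),
])
category_branch = Pipeline([
    ("select", FunctionTransformer(select_channel)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# FeatureUnion stacks the branch outputs side by side, column-wise.
features = FeatureUnion([
    ("text", text_branch),
    ("category", category_branch),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(df, y)

X = features.transform(df)   # combined feature matrix, one row per sample
predictions = model.predict(df)
```

Because both branches emit sparse output here, the combined matrix stays sparse and can be passed straight to a classifier that accepts sparse input, such as Naive Bayes.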







edited 25 mins ago
answered Sep 4 '17 at 14:58 by Brian Spiering

• The link has expired! – Abhishek Raj, 9 hours ago
• Thanks! Link has been updated. – Brian Spiering, 25 mins ago

Usually you want to keep your matrix sparse for as long as possible, since it saves a lot of memory; that is the whole point of sparse matrices. So even if your classifier requires dense input, you might want to keep the TF-IDF features sparse, add the other features to them in sparse format, and only then make the matrix dense.

To do that, you could use scipy.sparse.hstack, which combines sparse matrices column-wise. scipy.sparse.vstack also exists for stacking row-wise. For dense arrays, the equivalents are numpy.hstack and numpy.vstack (older SciPy versions also exposed these as scipy.hstack and scipy.vstack).
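
A minimal sketch of this, assuming the extra features are already numeric; the toy `texts` and `other` arrays are made up for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: three documents plus two numeric features each.
texts = ["red apple pie", "green apple", "blue sky today"]
other = np.array([[1.0, 0.0],
                  [0.0, 2.5],
                  [3.0, 1.0]])

# TF-IDF output is already a sparse matrix of shape (n_docs, n_terms).
X_text = TfidfVectorizer().fit_transform(texts)

# Wrap the dense features in a sparse container so both blocks match.
X_other = sparse.csr_matrix(other)

# Column-wise concatenation; the result stays sparse.
X = sparse.hstack([X_text, X_other], format="csr")

# Densify only at the very end, and only if the estimator requires it.
X_dense = X.toarray()
```

The rows must be in the same order in both blocks, since hstack simply places the columns side by side.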







answered Sep 5 '17 at 21:46 by Valentin Calomme
