Should I rescale tfidf features?












I have a dataset which contains both text and numeric features.

I have encoded the text ones using the TfidfVectorizer from sklearn.

I would now like to apply logistic regression to the resulting dataframe.

My issue is that the numeric features aren't on the same scale as the ones resulting from tfidf.

I'm unsure about whether to:

  • scale the whole dataframe with StandardScaler prior to passing to a classifier;
  • only scale the numeric features, and leave the ones resulting from tfidf as they are.

nlp feature-engineering feature-scaling tfidf

asked Jun 27 '18 at 16:30 by ignoring_gravity






1 Answer

The most widely accepted view is that bag-of-words, TF-IDF, and similar transformations should be left as is.

According to some, standardizing categorical variables may not be natural. Neither is standardizing TF-IDF because, according to an answer on Stats Stack Exchange:




    [TF-IDF] usually is a two-fold normalization.

    First, each document is normalized to length 1, so there is no bias
    for longer or shorter documents. This equals taking the relative
    frequencies instead of the absolute term counts. This is "TF".

    Second, IDF then is a cross-document normalization, that puts less
    weight on common terms, and more weight on rare terms, by normalizing
    (weighting) each word with the inverse in-corpus frequency.
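
The per-document normalization described in the quote can be checked with a minimal sketch, assuming scikit-learn's TfidfVectorizer with its default settings (norm="l2"); the toy corpus is made up for illustration. Note that scikit-learn applies the length normalization after the IDF weighting, so each row ends up with unit Euclidean length rather than holding plain relative frequencies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with documents of very different lengths.
docs = [
    "the cat sat on the mat",
    "the dog barked",
    "cats and dogs and more cats and dogs in one very long document about pets",
]

# Default settings: IDF weighting followed by L2 normalization of each row.
X = TfidfVectorizer().fit_transform(docs)

# Euclidean length of each document vector (one value per row).
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

print("row norms:", row_norms)
print("value range:", X.min(), X.max())
```

Every document vector has length 1 regardless of how long the document is, and every entry lies between 0 and 1, which is why the TF-IDF columns are already on a bounded, comparable scale.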




TF-IDF is meant to be passed to the algorithm in its raw form. The other numeric features are the ones that could be normalized, if the algorithm needs normalization or the data is too small. Alternatively, use an algorithm that is robust to differing ranges and distributions, such as a tree-based model, or simply rely on regularization. In the end, the cross-validation results should decide.
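
One way to scale only the numeric features is sketched below. It rests on assumptions: the column names (text, price, n_reviews) and the toy data are hypothetical, and ColumnTransformer is just one convenient scikit-learn tool for the job, not something the original post prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and two numeric columns.
df = pd.DataFrame({
    "text": [
        "great product works well",
        "terrible broke after a day",
        "really love it excellent quality",
        "awful waste of money",
        "excellent value very happy",
        "poor quality very disappointed",
        "works great highly recommend",
        "broke quickly do not buy",
    ],
    "price": [20, 15, 35, 12, 25, 10, 30, 14],
    "n_reviews": [120, 8, 300, 5, 150, 12, 210, 9],
})
y = [1, 0, 1, 0, 1, 0, 1, 0]

# TF-IDF on the text column (left unscaled), StandardScaler on the numeric ones.
preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("num", StandardScaler(), ["price", "n_reviews"]),
])

features = preprocess.fit_transform(df)
dense = features.toarray() if hasattr(features, "toarray") else np.asarray(features)

# Output columns follow the transformer order: TF-IDF first, numeric last.
tfidf_part = dense[:, :-2]
numeric_part = dense[:, -2:]
print("numeric column means:", numeric_part.mean(axis=0))
print("tf-idf value range:", tfidf_part.min(), tfidf_part.max())

# The same preprocessing feeds a logistic regression.
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(df, y)
print(model.predict(df.head(2)))
```

Wrapping the preprocessing and the classifier in one pipeline also means the scaler and the vocabulary are fitted only on the training part of each split when the model is cross-validated.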



But categorical features such as bag-of-words, TF-IDF, or other NLP transformations should be left alone for better results.

However, there is also the view that normalizing one-hot encoded variables can be done as a standard step, the same as with any other data set. It is presented by a prominent figure in the field of statistics:

https://stats.stackexchange.com/a/120600/90513
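
Since the answer leaves the final call to cross-validation, the two options from the question can be compared directly. The following is a rough sketch under stated assumptions: the data, the column names, and the build_model helper are all hypothetical, and scores on such a tiny toy set say nothing about a real problem. StandardScaler is used with with_mean=False for the "scale everything" variant because it refuses to center sparse input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and two numeric columns.
df = pd.DataFrame({
    "text": [
        "great product works well", "terrible broke after a day",
        "really love it excellent quality", "awful waste of money",
        "excellent value very happy", "poor quality very disappointed",
        "works great highly recommend", "broke quickly do not buy",
        "love the quality great value", "disappointed terrible support",
        "happy with this excellent product", "waste of money awful quality",
    ],
    "price": [20, 15, 35, 12, 25, 10, 30, 14, 40, 11, 28, 13],
    "n_reviews": [120, 8, 300, 5, 150, 12, 210, 9, 400, 7, 180, 6],
})
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]


def build_model(rescale_tfidf):
    """Hypothetical helper: numeric columns are always standardized;
    TF-IDF columns are additionally rescaled only if rescale_tfidf is True."""
    preprocess = ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),
        ("num", StandardScaler(), ["price", "n_reviews"]),
    ])
    steps = [preprocess]
    if rescale_tfidf:
        # No centering, so sparse input is accepted. The numeric columns
        # already have unit variance, so this mainly affects the TF-IDF columns.
        steps.append(StandardScaler(with_mean=False))
    steps.append(LogisticRegression(max_iter=1000))
    return make_pipeline(*steps)


# 3-fold cross-validated accuracy for each option.
scores_raw = cross_val_score(build_model(False), df, y, cv=3)
scores_all = cross_val_score(build_model(True), df, y, cv=3)

print("TF-IDF left as is:", scores_raw.mean())
print("everything scaled:", scores_all.mean())
```

On real data, the same comparison with a proper scoring metric and more folds would show whether rescaling the TF-IDF columns helps or hurts for the chosen regularization strength.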







answered Dec 23 '18 at 0:13 by wacax (edited Dec 23 '18 at 0:21)





























