How to learn which words are irrelevant in an information retrieval system?
Right now my recommender system for information retrieval uses word embeddings together with TF-IDF weights, as described here:
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/



Using TF-IDF improves results, but irrelevant, high-frequency keywords still have a large impact.
Can I train a system so that it learns which words to pay attention to, preferably in an unsupervised way?



What can you suggest for better information retrieval using word embeddings?
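
For concreteness, here is a minimal sketch of the kind of TF-IDF-weighted averaging the linked post describes. The toy corpus and the random stand-in vectors are placeholders, not my real data or a trained word2vec model; it only illustrates the weighting scheme.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "dogs chase cats"]  # toy corpus
    # Stand-in word vectors; substitute a trained word2vec model here.
    w2v = {w: np.random.rand(50) for d in docs for w in d.split()}

    # Fit TF-IDF on the corpus and extract the IDF weight of each vocabulary term.
    tfidf = TfidfVectorizer().fit(docs)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # get_feature_names() on older scikit-learn

    def embed(doc, dim=50):
        """Average the word vectors of `doc`, each weighted by its IDF."""
        words = [w for w in doc.lower().split() if w in w2v and w in idf]
        if not words:
            return np.zeros(dim)
        return np.mean([w2v[w] * idf[w] for w in words], axis=0)

    doc_vectors = np.vstack([embed(d) for d in docs])

The problem is that even with this IDF weighting, very frequent but uninformative words still pull the averaged document vectors together.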
Tags: nlp, recommender-system, word-embeddings
1 Answer
If you are working with TF-IDF, it's important to experiment with the min_df and max_df parameters. I guess you are using Python, since you linked a Python tutorial. Here is the relevant text for these two parameters from the scikit-learn TfidfVectorizer documentation:




max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.




You will find several rules of thumb on the web. Some suggest an absolute min_df of about 5-7 documents and a max_df of about 80-85%, or even lower. With this you can get rid of garbage, misspelled, or otherwise unwanted tokens. Keep in mind that you need to try different combinations to find the right balance for your model.
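
As a minimal runnable sketch of this filtering (the toy corpus and thresholds below are placeholders; on a real corpus you would start from the rules of thumb above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus: "the" occurs in every document, several terms in only one.
    docs = [
        "the cat runs fast", "the cat sleeps", "the cat eats fish",
        "the cat plays", "the cat hides", "the dog barks",
        "the dog runs fast", "the dog sleeps", "the dog eats meat",
        "the zyzzyva appears once",
    ]

    vectorizer = TfidfVectorizer(
        min_df=2,     # drop terms in fewer than 2 documents (absolute count);
                      # on a real corpus try the 5-7 rule of thumb
        max_df=0.85,  # drop terms in more than 85% of documents (proportion)
    )
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # surviving vocabulary
    print(sorted(vectorizer.stop_words_))      # terms filtered as too rare or too frequent

Everything in stop_words_ is excluded before weighting, so ubiquitous terms like "the" no longer contribute to the document representation at all.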