TextInformationRetrieval content based

I need to know how to avoid giving weight to spam documents (files with repeated keywords) when ranking the top k documents.
information-retrieval
asked Oct 25 '18 at 10:40
Dharani
          1 Answer
This type of spam is called keyword stuffing, and it is a widely used SEO technique. There are several ways to deal with it.

One is to use a pre-trained classifier that assigns a "spam" score to each document. One example is the Waterloo spam classifier, described in "Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets" by Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke.

In addition, several predictors of document (specifically, Web document) quality have been proposed. One of the most effective is the entropy of the document's unigram language model. When a document contains a large number of repeated keywords, the entropy of its language model is unusually low, because those keywords have unusually high probabilities. Another signal of document quality is the percentage of stopwords in the document, since natural-language text contains more stopwords than a keyword-stuffed document does.

More on Web document quality measures can be found in "Quality-Biased Ranking of Web Documents" by Michael Bendersky, W. Bruce Croft, and Yanlei Diao.
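As a rough illustration of the two quality signals mentioned above (unigram entropy and stopword percentage), here is a minimal Python sketch. It is not taken from either paper: the tiny stopword list, the whitespace tokenizer, the threshold values, and the function names (`tokenize`, `unigram_entropy`, `stopword_fraction`, `top_k`) are all assumptions made for the example. It simply drops documents that look keyword-stuffed before taking the top k by retrieval score.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "in", "is", "it", "of", "on", "or", "that", "the", "this", "to",
             "was", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()[]") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the maximum-likelihood unigram model."""
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def stopword_fraction(tokens):
    """Fraction of tokens that appear in the stopword list."""
    if not tokens:
        return 0.0
    return sum(tok in STOPWORDS for tok in tokens) / len(tokens)


def top_k(scored_docs, k, min_entropy=3.0, min_stopword_fraction=0.1):
    """Return the ids of the k best documents that pass both quality checks.

    scored_docs: iterable of (doc_id, text, retrieval_score) tuples.
    The two thresholds are hypothetical defaults and need tuning per collection.
    """
    kept = []
    for doc_id, text, score in scored_docs:
        tokens = tokenize(text)
        if unigram_entropy(tokens) < min_entropy:
            continue  # few distinct terms repeated many times
        if stopword_fraction(tokens) < min_stopword_fraction:
            continue  # too few function words for natural language
        kept.append((doc_id, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in kept[:k]]


docs = [
    ("spam", "cheap flights " * 50, 9.5),
    ("good", "The entropy of the unigram language model is a useful signal "
             "of document quality on the web", 4.2),
]
print(top_k(docs, k=1))
```

Note that entropy is bounded above by the base-2 logarithm of the number of distinct tokens, so very short documents will also fall under a fixed threshold. Instead of hard filtering, the same signals could be used as features that lower a document's score.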













edited 38 mins ago
Oleg
answered Nov 11 '18 at 20:27
Annie Shtok