Best practical algorithm for sentence similarity

I have two sentences, S1 and S2, both of which (usually) have a word count below 15.

What are the most practically useful and successful (machine learning) algorithms for measuring their similarity that are also reasonably easy to implement? A neural network is fine, unless the architecture is as complicated as Google Inception, etc.

I am looking for an algorithm that works well without requiring too much time investment. Are there any algorithms you have found successful and easy to use?

This can, but does not have to, fall into the category of clustering. My background is in machine learning, so any suggestions are welcome :)

Tags: nlp, clustering, word2vec, similarity

asked Nov 23 '17 at 14:40 by DaveTheAl

  • What did you implement? I am facing the same problem: I have to come up with a solution for finding 'k' related articles in a corpus that keeps updating.
    – Dileepa, Aug 15 '18 at 3:07

3 Answers

Cosine similarity for vector space models could be your answer: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
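
As a minimal sketch of that idea (not code taken from the linked post), the snippet below builds plain bag-of-words count vectors and computes their cosine similarity using only the Python standard library. The helper names `bag_of_words` and `cosine_similarity` are made up for this illustration, and whitespace tokenization without any weighting is a simplifying assumption; a real pipeline would typically add proper tokenization and TF-IDF weights.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and count each token."""
    return Counter(sentence.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


s1 = bag_of_words("This is a tree")
s2 = bag_of_words("This is not a tree")
similarity = cosine_similarity(s1, s2)
print(round(similarity, 3))  # 0.894
```

The four shared words give a dot product of 4, and the vector norms are 2 and sqrt(5), hence 4 / (2 * sqrt(5)) ≈ 0.894.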



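Or you could compute a vector representation of each sentence and compare those. But the problem is: what is similarity?

"This is a tree"
"This is not a tree"

These two sentences share almost all of their words, so a purely word-based measure rates them as highly similar even though their meanings are opposite.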

If you want to compare the semantic meaning of the sentences, you will need a word-vector dataset. With word vectors you can check the relationships between words, for example: King - Man + Woman ≈ Queen.
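
To make the vector arithmetic concrete, here is a small sketch that answers the analogy by nearest-neighbour search under cosine similarity. The three-dimensional vectors in `toy_vectors` are invented purely for illustration (a real word-vector dataset has hundreds of dimensions learned from text), and the function name `analogy` is hypothetical.

```python
import math

# Made-up 3-D vectors for illustration only (axes: royalty, male, female).
# Real word vectors are learned from a corpus, not written by hand.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the candidates, as is customary.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With pretrained vectors the same search is run over the whole vocabulary instead of five hand-written entries.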



Siraj Raval has a good Python notebook for creating word-vector datasets:
https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE

answered Nov 23 '17 at 15:09 by Christian Frei

One approach you could try is averaging word vectors generated by word embedding algorithms (word2vec, GloVe, etc.). These algorithms create a vector for each word, and the cosine similarity between two word vectors represents the semantic similarity between the words. Likewise, the cosine similarity between the averaged vectors of two sentences serves as a measure of the similarity between the sentences. A good starting point for learning more about these methods is the paper "How Well Sentence Embeddings Capture Meaning", which discusses several sentence embedding methods. I also suggest you look into "Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features"; the authors claim their approach beats state-of-the-art methods. They also provide the code and some usage instructions in a GitHub repo.
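
A rough sketch of the averaging approach, under stated assumptions: the embeddings in `toy_vectors` are hand-written stand-ins for pretrained word2vec or GloVe vectors, tokenization is a plain whitespace split, and out-of-vocabulary words are simply skipped. The function names are hypothetical.

```python
import math

# Hand-written stand-ins for pretrained embeddings (illustration only).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [vectors[tok] for tok in sentence.lower().split() if tok in vectors]
    if not known:
        raise ValueError("sentence contains no known words")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a = sentence_vector("The cat sleeps", toy_vectors)
b = sentence_vector("A kitten naps", toy_vectors)
c = sentence_vector("Stocks fell", toy_vectors)

# Sentences with no words in common can still score as similar.
print(cosine(a, b) > cosine(a, c))  # True
```

Note that a plain average ignores word order, so it inherits some of the weaknesses of bag-of-words models for sentences that differ only in structure.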







answered Nov 23 '17 at 15:15 by feynman410

bert-as-service (https://github.com/hanxiao/bert-as-service#building-a-qa-semantic-search-engine-in-3-minutes) offers just that: it uses BERT to map each sentence to a fixed-length vector that you can compare with other sentence vectors.

To answer your question: implementing it yourself from scratch would be quite hard, as BERT is not a trivial neural network, but with this solution you can simply plug it into your algorithm that uses sentence similarity.
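
The "plug it in" part might look roughly like the sketch below. The function `rank_by_similarity` is hypothetical and accepts any encoder that maps a list of sentences to a list of vectors. So that the snippet runs without a BERT server, it uses a stand-in encoder backed by made-up two-dimensional vectors; the commented lines show how the bert-as-service client could be passed in instead, assuming its server has been started separately as described in the project's README.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates, encode):
    """Return (score, sentence) pairs sorted from most to least similar.

    `encode` is any callable mapping a list of sentences to a list of
    equal-length vectors, one per sentence.
    """
    vectors = encode([query] + list(candidates))
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), sent)
              for vec, sent in zip(candidate_vecs, candidates)]
    return sorted(scored, reverse=True)


# With a running bert-as-service server (assumption, not executed here):
#   from bert_serving.client import BertClient
#   encode = BertClient().encode

# Stand-in encoder with made-up vectors so the sketch runs offline.
stub_vectors = {
    "How do I reset my password?": [0.9, 0.1],
    "I forgot my password": [0.8, 0.3],
    "Where is the train station?": [0.1, 0.9],
}


def stub_encode(sentences):
    return [stub_vectors[s] for s in sentences]


ranking = rank_by_similarity(
    "How do I reset my password?",
    ["Where is the train station?", "I forgot my password"],
    stub_encode,
)
print(ranking[0][1])  # I forgot my password
```

Keeping the encoder as a parameter means the similarity logic stays the same if you later switch to a different sentence embedding method.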







answered 19 mins ago by Andres Suarez