Are there algorithms for clustering objects with pairwise distances, without computing all pairwise...












1












$begingroup$


I'm looking for a clustering algorithm that clusters objects, by using their pairwise distances, without needing to calculate all pairwise distances.



Normally pairwise clustering is done like this: (see here)




  1. Compute full distance matrix between all pairwise combination of objects

  2. Assuming that the distances there are non-euclidean, one might use Spectral Clustering or Affinity propagation on the distance matrix and retrieve the clustering results.


Here comes the however:



Computing the full distance matrix for all pairwise combination of objects is computationally very expensive. So my though was, whether there are some clustering algorithms that only do lookups on a subset of the pairwise distances, so it is not necessary to compute the full matrix?



I know Spectral Clustering works also on sparse matrices, but since it is theoretically possible to compute all pairwise distances, which ones should be left out?



Exited to hear your ideas, thanks!










share|improve this question







New contributor




Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$

















    1












    $begingroup$


    I'm looking for a clustering algorithm that clusters objects, by using their pairwise distances, without needing to calculate all pairwise distances.



    Normally pairwise clustering is done like this: (see here)




    1. Compute full distance matrix between all pairwise combination of objects

    2. Assuming that the distances there are non-euclidean, one might use Spectral Clustering or Affinity propagation on the distance matrix and retrieve the clustering results.


    Here comes the however:



    Computing the full distance matrix for all pairwise combination of objects is computationally very expensive. So my though was, whether there are some clustering algorithms that only do lookups on a subset of the pairwise distances, so it is not necessary to compute the full matrix?



    I know Spectral Clustering works also on sparse matrices, but since it is theoretically possible to compute all pairwise distances, which ones should be left out?



    Exited to hear your ideas, thanks!










    share|improve this question







    New contributor




    Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.







    $endgroup$















      1












      1








      1





      $begingroup$


      I'm looking for a clustering algorithm that clusters objects, by using their pairwise distances, without needing to calculate all pairwise distances.



      Normally pairwise clustering is done like this: (see here)




      1. Compute full distance matrix between all pairwise combination of objects

      2. Assuming that the distances there are non-euclidean, one might use Spectral Clustering or Affinity propagation on the distance matrix and retrieve the clustering results.


      Here comes the however:



      Computing the full distance matrix for all pairwise combination of objects is computationally very expensive. So my though was, whether there are some clustering algorithms that only do lookups on a subset of the pairwise distances, so it is not necessary to compute the full matrix?



      I know Spectral Clustering works also on sparse matrices, but since it is theoretically possible to compute all pairwise distances, which ones should be left out?



      Exited to hear your ideas, thanks!










      share|improve this question







      New contributor




      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.







      $endgroup$




      I'm looking for a clustering algorithm that clusters objects, by using their pairwise distances, without needing to calculate all pairwise distances.



      Normally pairwise clustering is done like this: (see here)




      1. Compute full distance matrix between all pairwise combination of objects

      2. Assuming that the distances there are non-euclidean, one might use Spectral Clustering or Affinity propagation on the distance matrix and retrieve the clustering results.


      Here comes the however:



      Computing the full distance matrix for all pairwise combination of objects is computationally very expensive. So my though was, whether there are some clustering algorithms that only do lookups on a subset of the pairwise distances, so it is not necessary to compute the full matrix?



      I know Spectral Clustering works also on sparse matrices, but since it is theoretically possible to compute all pairwise distances, which ones should be left out?



      Exited to hear your ideas, thanks!







      clustering similarity graphs distance






      share|improve this question







      New contributor




      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|improve this question







      New contributor




      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|improve this question




      share|improve this question






      New contributor




      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked 2 days ago









      TaizameTaizame

      61




      61




      New contributor




      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      Taizame is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






















          3 Answers
          3






          active

          oldest

          votes


















          2












          $begingroup$

          Well, one may argue that DBSCAN is based on all pairwise distances, but it uuses data index to avoid computing all of them using geometric bounds.



          And there are other examples if you browse through literature.



          For example, the classic CLARA method is an approximation to PAM that avoids computing all pairwise distances.



          And there are many more such techniques.






          share|improve this answer









          $endgroup$





















            1












            $begingroup$

            Quadtree can be used for this purpose.



            enter image description here



            This algo divides 2 D space into clusters. In this example ; we can exclude point 'C' from comparison with 'E' and 'F'



            http://wiki.gis.com/wiki/index.php/Quadtree



            DBSCAN is also useful in some scenarios : https://en.wikipedia.org/wiki/DBSCAN






            share|improve this answer









            $endgroup$





















              1












              $begingroup$

              You can use Locality Sensitive Hashing technique Wiki article



              With this, you can estimate either the Jaccard Similarity (MinHash) or Cosine Similarity (SimHash) between two documents and then apply clustering on the documents collection.



              There is a great example with Python code for MinHash. What I get from the article is the bellow quote




              In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.




              MinHash explanation with Python code






              share|improve this answer









              $endgroup$













                Your Answer





                StackExchange.ifUsing("editor", function () {
                return StackExchange.using("mathjaxEditing", function () {
                StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
                StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
                });
                });
                }, "mathjax-editing");

                StackExchange.ready(function() {
                var channelOptions = {
                tags: "".split(" "),
                id: "557"
                };
                initTagRenderer("".split(" "), "".split(" "), channelOptions);

                StackExchange.using("externalEditor", function() {
                // Have to fire editor after snippets, if snippets enabled
                if (StackExchange.settings.snippets.snippetsEnabled) {
                StackExchange.using("snippets", function() {
                createEditor();
                });
                }
                else {
                createEditor();
                }
                });

                function createEditor() {
                StackExchange.prepareEditor({
                heartbeatType: 'answer',
                autoActivateHeartbeat: false,
                convertImagesToLinks: false,
                noModals: true,
                showLowRepImageUploadWarning: true,
                reputationToPostImages: null,
                bindNavPrevention: true,
                postfix: "",
                imageUploader: {
                brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                allowUrls: true
                },
                onDemand: true,
                discardSelector: ".discard-answer"
                ,immediatelyShowMarkdownHelp:true
                });


                }
                });






                Taizame is a new contributor. Be nice, and check out our Code of Conduct.










                draft saved

                draft discarded


















                StackExchange.ready(
                function () {
                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46950%2fare-there-algorithms-for-clustering-objects-with-pairwise-distances-without-com%23new-answer', 'question_page');
                }
                );

                Post as a guest















                Required, but never shown

























                3 Answers
                3






                active

                oldest

                votes








                3 Answers
                3






                active

                oldest

                votes









                active

                oldest

                votes






                active

                oldest

                votes









                2












                $begingroup$

                Well, one may argue that DBSCAN is based on all pairwise distances, but it uuses data index to avoid computing all of them using geometric bounds.



                And there are other examples if you browse through literature.



                For example, the classic CLARA method is an approximation to PAM that avoids computing all pairwise distances.



                And there are many more such techniques.






                share|improve this answer









                $endgroup$


















                  2












                  $begingroup$

                  Well, one may argue that DBSCAN is based on all pairwise distances, but it uuses data index to avoid computing all of them using geometric bounds.



                  And there are other examples if you browse through literature.



                  For example, the classic CLARA method is an approximation to PAM that avoids computing all pairwise distances.



                  And there are many more such techniques.






                  share|improve this answer









                  $endgroup$
















                    2












                    2








                    2





                    $begingroup$

                    Well, one may argue that DBSCAN is based on all pairwise distances, but it uuses data index to avoid computing all of them using geometric bounds.



                    And there are other examples if you browse through literature.



                    For example, the classic CLARA method is an approximation to PAM that avoids computing all pairwise distances.



                    And there are many more such techniques.






                    share|improve this answer









                    $endgroup$



                    Well, one may argue that DBSCAN is based on all pairwise distances, but it uuses data index to avoid computing all of them using geometric bounds.



                    And there are other examples if you browse through literature.



                    For example, the classic CLARA method is an approximation to PAM that avoids computing all pairwise distances.



                    And there are many more such techniques.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered yesterday









                    Anony-MousseAnony-Mousse

                    4,896624




                    4,896624























                        1












                        $begingroup$

                        Quadtree can be used for this purpose.



                        enter image description here



                        This algo divides 2 D space into clusters. In this example ; we can exclude point 'C' from comparison with 'E' and 'F'



                        http://wiki.gis.com/wiki/index.php/Quadtree



                        DBSCAN is also useful in some scenarios : https://en.wikipedia.org/wiki/DBSCAN






                        share|improve this answer









                        $endgroup$


















                          1












                          $begingroup$

                          Quadtree can be used for this purpose.



                          enter image description here



                          This algo divides 2 D space into clusters. In this example ; we can exclude point 'C' from comparison with 'E' and 'F'



                          http://wiki.gis.com/wiki/index.php/Quadtree



                          DBSCAN is also useful in some scenarios : https://en.wikipedia.org/wiki/DBSCAN






                          share|improve this answer









                          $endgroup$
















                            1












                            1








                            1





                            $begingroup$

                            Quadtree can be used for this purpose.



                            enter image description here



                            This algo divides 2 D space into clusters. In this example ; we can exclude point 'C' from comparison with 'E' and 'F'



                            http://wiki.gis.com/wiki/index.php/Quadtree



                            DBSCAN is also useful in some scenarios : https://en.wikipedia.org/wiki/DBSCAN






                            share|improve this answer









                            $endgroup$



                            Quadtree can be used for this purpose.



                            enter image description here



                            This algo divides 2 D space into clusters. In this example ; we can exclude point 'C' from comparison with 'E' and 'F'



                            http://wiki.gis.com/wiki/index.php/Quadtree



                            DBSCAN is also useful in some scenarios : https://en.wikipedia.org/wiki/DBSCAN







                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered 2 days ago









                            Shamit VermaShamit Verma

                            78426




                            78426























                                1












                                $begingroup$

                                You can use Locality Sensitive Hashing technique Wiki article



                                With this, you can estimate either the Jaccard Similarity (MinHash) or Cosine Similarity (SimHash) between two documents and then apply clustering on the documents collection.



                                There is a great example with Python code for MinHash. What I get from the article is the bellow quote




                                In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.




                                MinHash explanation with Python code






                                share|improve this answer









                                $endgroup$


















                                  1












                                  $begingroup$

                                  You can use Locality Sensitive Hashing technique Wiki article



                                  With this, you can estimate either the Jaccard Similarity (MinHash) or Cosine Similarity (SimHash) between two documents and then apply clustering on the documents collection.



                                  There is a great example with Python code for MinHash. What I get from the article is the bellow quote




                                  In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.




                                  MinHash explanation with Python code






                                  share|improve this answer









                                  $endgroup$
















                                    1












                                    1








                                    1





                                    $begingroup$

                                    You can use Locality Sensitive Hashing technique Wiki article



                                    With this, you can estimate either the Jaccard Similarity (MinHash) or Cosine Similarity (SimHash) between two documents and then apply clustering on the documents collection.



                                    There is a great example with Python code for MinHash. What I get from the article is the bellow quote




                                    In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.




                                    MinHash explanation with Python code






                                    share|improve this answer









                                    $endgroup$



                                    You can use Locality Sensitive Hashing technique Wiki article



                                    With this, you can estimate either the Jaccard Similarity (MinHash) or Cosine Similarity (SimHash) between two documents and then apply clustering on the documents collection.



                                    There is a great example with Python code for MinHash. What I get from the article is the bellow quote




                                    In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.




                                    MinHash explanation with Python code







                                    share|improve this answer












                                    share|improve this answer



                                    share|improve this answer










                                    answered 2 days ago









                                    TasosTasos

                                    995630




                                    995630






















                                        Taizame is a new contributor. Be nice, and check out our Code of Conduct.










                                        draft saved

                                        draft discarded


















                                        Taizame is a new contributor. Be nice, and check out our Code of Conduct.













                                        Taizame is a new contributor. Be nice, and check out our Code of Conduct.












                                        Taizame is a new contributor. Be nice, and check out our Code of Conduct.
















                                        Thanks for contributing an answer to Data Science Stack Exchange!


                                        • Please be sure to answer the question. Provide details and share your research!

                                        But avoid



                                        • Asking for help, clarification, or responding to other answers.

                                        • Making statements based on opinion; back them up with references or personal experience.


                                        Use MathJax to format equations. MathJax reference.


                                        To learn more, see our tips on writing great answers.




                                        draft saved


                                        draft discarded














                                        StackExchange.ready(
                                        function () {
                                        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46950%2fare-there-algorithms-for-clustering-objects-with-pairwise-distances-without-com%23new-answer', 'question_page');
                                        }
                                        );

                                        Post as a guest















                                        Required, but never shown





















































                                        Required, but never shown














                                        Required, but never shown












                                        Required, but never shown







                                        Required, but never shown

































                                        Required, but never shown














                                        Required, but never shown












                                        Required, but never shown







                                        Required, but never shown







                                        Popular posts from this blog

                                        How to label and detect the document text images

                                        Tabula Rosettana

                                        Aureus (color)