Clustering algorithm for a distance matrix












1












$begingroup$


I have a similarity matrix between N objects. For each N objects, I have a measure of how similar they are between each others - 0 being identical (the main diagonal) and increasing values as they get less and less similar. Something like that (actual matrix would be 10 000 x 10 000 if I have say 10 000 elements to cluster together):



[ 0 5 12 7 ]
[ - 0 9 2 ]
[ - - 0 6 ]
[ - - - 0 ]


(Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.



I would like to cluster that into k groups, so as to minimize either the overall score or intra-cluster score. I'm not sure if that would practically make that much of a difference. I haven't checked the maths on that, but I feel that there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even thought rigorously speaking there may be a difference I may not practically speaking care if the results are close enough either way.



A few details:




  • Clusters don't have to be equal sizes

  • I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

  • Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.


On top of an algo/lib that does that, bonus points for:




  • Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

  • For such a lib, support for parallel processing

  • Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)


Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.










share|improve this question









New contributor




Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$

















    1












    $begingroup$


    I have a similarity matrix between N objects. For each N objects, I have a measure of how similar they are between each others - 0 being identical (the main diagonal) and increasing values as they get less and less similar. Something like that (actual matrix would be 10 000 x 10 000 if I have say 10 000 elements to cluster together):



    [ 0 5 12 7 ]
    [ - 0 9 2 ]
    [ - - 0 6 ]
    [ - - - 0 ]


    (Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.



    I would like to cluster that into k groups, so as to minimize either the overall score or intra-cluster score. I'm not sure if that would practically make that much of a difference. I haven't checked the maths on that, but I feel that there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even thought rigorously speaking there may be a difference I may not practically speaking care if the results are close enough either way.



    A few details:




    • Clusters don't have to be equal sizes

    • I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

    • Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.


    On top of an algo/lib that does that, bonus points for:




    • Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

    • For such a lib, support for parallel processing

    • Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)


    Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.










    share|improve this question









    New contributor




    Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.







    $endgroup$















      1












      1








      1





      $begingroup$


      I have a similarity matrix between N objects. For each N objects, I have a measure of how similar they are between each others - 0 being identical (the main diagonal) and increasing values as they get less and less similar. Something like that (actual matrix would be 10 000 x 10 000 if I have say 10 000 elements to cluster together):



      [ 0 5 12 7 ]
      [ - 0 9 2 ]
      [ - - 0 6 ]
      [ - - - 0 ]


      (Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.



      I would like to cluster that into k groups, so as to minimize either the overall score or intra-cluster score. I'm not sure if that would practically make that much of a difference. I haven't checked the maths on that, but I feel that there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even thought rigorously speaking there may be a difference I may not practically speaking care if the results are close enough either way.



      A few details:




      • Clusters don't have to be equal sizes

      • I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

      • Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.


      On top of an algo/lib that does that, bonus points for:




      • Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

      • For such a lib, support for parallel processing

      • Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)


      Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.










      share|improve this question









      New contributor




      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.







      $endgroup$




      I have a similarity matrix between N objects. For each N objects, I have a measure of how similar they are between each others - 0 being identical (the main diagonal) and increasing values as they get less and less similar. Something like that (actual matrix would be 10 000 x 10 000 if I have say 10 000 elements to cluster together):



      [ 0 5 12 7 ]
      [ - 0 9 2 ]
      [ - - 0 6 ]
      [ - - - 0 ]


      (Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.



      I would like to cluster that into k groups, so as to minimize either the overall score or intra-cluster score. I'm not sure if that would practically make that much of a difference. I haven't checked the maths on that, but I feel that there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even thought rigorously speaking there may be a difference I may not practically speaking care if the results are close enough either way.



      A few details:




      • Clusters don't have to be equal sizes

      • I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

      • Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.


      On top of an algo/lib that does that, bonus points for:




      • Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

      • For such a lib, support for parallel processing

      • Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)


      Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.







      python clustering distance






      share|improve this question









      New contributor




      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|improve this question









      New contributor




      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|improve this question




      share|improve this question








      edited 13 hours ago







      Francis Vachon













      New contributor




      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked 13 hours ago









      Francis VachonFrancis Vachon

      83




      83




      New contributor




      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






















          2 Answers
          2






          active

          oldest

          votes


















          1












          $begingroup$

          There are hundreds of algorithms to choose from.




          • Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

          • PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

          • Spectral clustering

          • DBSCAN

          • OPTICS

          • HDBSCAN*

          • Affinity Propagation

          • ...






          share|improve this answer









          $endgroup$





















            0












            $begingroup$

            Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.



            Alternatively, taking a step back, you could try computing distances between



            There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.






            share|improve this answer









            $endgroup$













              Your Answer





              StackExchange.ifUsing("editor", function () {
              return StackExchange.using("mathjaxEditing", function () {
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              });
              });
              }, "mathjax-editing");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "557"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });






              Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.










              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46590%2fclustering-algorithm-for-a-distance-matrix%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1












              $begingroup$

              There are hundreds of algorithms to choose from.




              • Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

              • PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

              • Spectral clustering

              • DBSCAN

              • OPTICS

              • HDBSCAN*

              • Affinity Propagation

              • ...






              share|improve this answer









              $endgroup$


















                1












                $begingroup$

                There are hundreds of algorithms to choose from.




                • Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

                • PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

                • Spectral clustering

                • DBSCAN

                • OPTICS

                • HDBSCAN*

                • Affinity Propagation

                • ...






                share|improve this answer









                $endgroup$
















                  1












                  1








                  1





                  $begingroup$

                  There are hundreds of algorithms to choose from.




                  • Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

                  • PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

                  • Spectral clustering

                  • DBSCAN

                  • OPTICS

                  • HDBSCAN*

                  • Affinity Propagation

                  • ...






                  share|improve this answer









                  $endgroup$



                  There are hundreds of algorithms to choose from.




                  • Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

                  • PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

                  • Spectral clustering

                  • DBSCAN

                  • OPTICS

                  • HDBSCAN*

                  • Affinity Propagation

                  • ...







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered 11 hours ago









                  Anony-MousseAnony-Mousse

                  4,836624




                  4,836624























                      0












                      $begingroup$

                      Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.



                      Alternatively, taking a step back, you could try computing distances between



                      There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.






                      share|improve this answer









                      $endgroup$


















                        0












                        $begingroup$

                        Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.



                        Alternatively, taking a step back, you could try computing distances between



                        There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.






                        share|improve this answer









                        $endgroup$
















                          0












                          0








                          0





                          $begingroup$

                          Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.



                          Alternatively, taking a step back, you could try computing distances between



                          There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.






                          share|improve this answer









                          $endgroup$



                          Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.



                          Alternatively, taking a step back, you could try computing distances between



                          There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered 10 hours ago









                          n1k31t4n1k31t4

                          6,1862319




                          6,1862319






















                              Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.










                              draft saved

                              draft discarded


















                              Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.













                              Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.












                              Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.
















                              Thanks for contributing an answer to Data Science Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46590%2fclustering-algorithm-for-a-distance-matrix%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              How to label and detect the document text images

                              Tabula Rosettana

                              Aureus (color)