Clustering algorithm for a distance matrix

I have a similarity matrix between N objects. For each N objects, I have a measure of how similar they are between each others - 0 being identical (the main diagonal) and increasing values as they get less and less similar. Something like that (actual matrix would be 10 000 x 10 000 if I have say 10 000 elements to cluster together):

[ 0 5 12 7 ]

[ - 0 9  2 ]

[ - - 0  6 ]

[ - - -  0 ]

(Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.

I would like to cluster that into k groups, so as to minimize either the overall score or intra-cluster score. I'm not sure if that would practically make that much of a difference. I haven't checked the maths on that, but I feel that there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score (across all clusters). At the end of the day, even thought rigorously speaking there may be a difference I may not practically speaking care if the results are close enough either way.

A few details:

Clusters don't have to be equal sizes

I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.

On top of an algo/lib that does that, bonus points for:

Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

For such a lib, support for parallel processing

Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)

Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

add a comment |

[ 0 5 12 7 ]

[ - 0 9  2 ]

[ - - 0  6 ]

[ - - -  0 ]

(Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.

A few details:

Clusters don't have to be equal sizes

I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.

On top of an algo/lib that does that, bonus points for:

Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

For such a lib, support for parallel processing

Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)

Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

add a comment |

[ 0 5 12 7 ]

[ - 0 9  2 ]

[ - - 0  6 ]

[ - - -  0 ]

(Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.

A few details:

Clusters don't have to be equal sizes

I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.

On top of an algo/lib that does that, bonus points for:

Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

For such a lib, support for parallel processing

Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)

Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

[ 0 5 12 7 ]

[ - 0 9  2 ]

[ - - 0  6 ]

[ - - -  0 ]

(Supposing index starting at 0) So here object_0 would be most similar to object_1 (5), but object_1 itself would be more similar to object_3 (2), etc.

A few details:

Clusters don't have to be equal sizes

I don't really mind if the number of clusters is an input or if it is being decided by the algo itself in some fashion (thought I would prefer inputing it)

Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.

On top of an algo/lib that does that, bonus points for:

Some library that's already implemented it, so I can hopefully just format the data properly for it and just see if it seems to give good results or not before I sink a whole lot of time in it

For such a lib, support for parallel processing

Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)

Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.

python clustering distance

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

edited 13 hours ago

asked 13 hours ago

Francis Vachon

New contributor

asked 13 hours ago

Francis Vachon

asked 13 hours ago

Francis Vachon

New contributor

Francis Vachon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

add a comment |

2 Answers
2

active

oldest

votes

There are hundreds of algorithms to choose from.

Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

Spectral clustering

DBSCAN

OPTICS

HDBSCAN*

Affinity Propagation

answered 11 hours ago

Anony-Mousse

4,836624

add a comment |

Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but will still give you the examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to be so good at handling high variance in the clusters. If that ends up being important for your use case, you might want to look at using OPTICS, which is better at handling that.

Alternatively, taking a step back, you could try computing distances between

There is also another library called pyclustering, which contains many algorithms, as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so generally are much faster than the versions in more standard python libraries.

answered 10 hours ago

n1k31t4

6,1862319

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46590%2fclustering-algorithm-for-a-distance-matrix%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

There are hundreds of algorithms to choose from.

Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

Spectral clustering

DBSCAN

OPTICS

HDBSCAN*

Affinity Propagation

answered 11 hours ago

Anony-Mousse

4,836624

add a comment |

There are hundreds of algorithms to choose from.

Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

Spectral clustering

DBSCAN

OPTICS

HDBSCAN*

Affinity Propagation

answered 11 hours ago

Anony-Mousse

4,836624

add a comment |

There are hundreds of algorithms to choose from.

Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

Spectral clustering

DBSCAN

OPTICS

HDBSCAN*

Affinity Propagation

answered 11 hours ago

Anony-Mousse

4,836624

There are hundreds of algorithms to choose from.

Hierarchical clustering in it's myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters

PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)

Spectral clustering

DBSCAN

OPTICS

HDBSCAN*

Affinity Propagation

answered 11 hours ago

Anony-Mousse

4,836624

answered 11 hours ago

Anony-Mousse

4,836624

answered 11 hours ago

Anony-Mousse

4,836624

answered 11 hours ago

Anony-Mousse

4,836624

add a comment |

Alternatively, taking a step back, you could try computing distances between

answered 10 hours ago

n1k31t4

6,1862319

add a comment |

Alternatively, taking a step back, you could try computing distances between

answered 10 hours ago

n1k31t4

6,1862319

add a comment |

Alternatively, taking a step back, you could try computing distances between

answered 10 hours ago

n1k31t4

6,1862319

Alternatively, taking a step back, you could try computing distances between

answered 10 hours ago

n1k31t4

6,1862319

answered 10 hours ago

n1k31t4

6,1862319

answered 10 hours ago

n1k31t4

6,1862319

answered 10 hours ago

n1k31t4

6,1862319

add a comment |

Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Francis Vachon is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk