Clustering algorithm for a distance matrix
I have a pairwise dissimilarity (distance) matrix between N objects: for each pair of objects, I have a measure of how similar they are to each other, with 0 meaning identical (the main diagonal) and increasing values meaning less and less similar. Something like this (the actual matrix would be 10 000 x 10 000 if I have, say, 10 000 elements to cluster):
[ 0 5 12 7 ]
[ - 0 9 2 ]
[ - - 0 6 ]
[ - - - 0 ]
(Assuming indices start at 0.) So here object_0 is most similar to object_1 (distance 5), but object_1 itself is most similar to object_3 (distance 2), and so on.
I would like to cluster these into k groups, so as to minimize either the overall score or the intra-cluster scores. I'm not sure that distinction makes much practical difference. I haven't checked the maths, but I suspect there could be cases where minimizing a specific cluster's score wouldn't necessarily minimize the overall score across all clusters. At the end of the day, even though rigorously speaking there may be a difference, I may not care in practice if the results are close enough either way.
A few details:
- Clusters don't have to be equal sizes
- I don't really mind if the number of clusters is an input or if it is decided by the algo itself in some fashion (though I would prefer inputting it)
- Typically I would have about 10 000 elements to cluster. Maybe 40 000 if I push it. But then I may need to repeatedly run it for a bunch of matrices that size.
On top of an algo/lib that does that, bonus points for:
- Some library that's already implemented it, so I can hopefully just format the data properly for it and see whether it gives good results before I sink a whole lot of time into it
- For such a lib, support for parallel processing
- Other nice bells & whistles (like maximum number of iterations, ability to input some stopping criteria, etc.)
Failing that, of course it could also just be some pseudo-code that allows me to cluster them into groups.
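To make the data format concrete, here is a minimal sketch (assuming NumPy) of the toy matrix above, with the upper triangle mirrored into the full symmetric form that most libraries presumably expect:

import numpy as np

# Upper-triangular dissimilarity matrix from the example above
# (zeros below the diagonal stand in for the "-" entries).
D = np.array([
    [0, 5, 12, 7],
    [0, 0,  9, 2],
    [0, 0,  0, 6],
    [0, 0,  0, 0],
], dtype=float)

# Mirror the upper triangle; since the diagonal is zero, this yields a
# full symmetric matrix with D[i, j] == D[j, i].
D = D + D.T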
Tags: python, clustering, distance
asked 13 hours ago by Francis Vachon
2 Answers
There are hundreds of algorithms to choose from.
- Hierarchical clustering, in its myriad variants. Cut the dendrogram as desired, e.g., to get k clusters
- PAM, the closest match to k-means on a distance matrix (minimizes the average distance to the cluster medoid)
- Spectral clustering
- DBSCAN
- OPTICS
- HDBSCAN*
- Affinity Propagation
- ...
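As a concrete starting point, here is a minimal sketch of the hierarchical option using SciPy on a precomputed distance matrix, cutting the dendrogram into k clusters (the matrix and k = 2 are just the toy values from the question):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Full symmetric dissimilarity matrix (toy example from the question).
D = np.array([
    [ 0.,  5., 12.,  7.],
    [ 5.,  0.,  9.,  2.],
    [12.,  9.,  0.,  6.],
    [ 7.,  2.,  6.,  0.],
])

# SciPy's linkage expects the condensed (flattened upper-triangular) form.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method="average")

# Cut the dendrogram into k flat clusters; labels are integers in 1..k,
# one per object.
labels = fcluster(Z, t=2, criterion="maxclust")

For PAM specifically, the KMedoids class in the scikit-learn-extra package accepts metric="precomputed", if adding that dependency is acceptable; worth double-checking against its current docs.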
answered 11 hours ago by Anony-Mousse
Within the documentation for HDBSCAN (Hierarchical DBSCAN), there is a really nice comparison of clustering algorithms. It is a bit biased, highlighting its own strengths (of course), but it will still give you examples and some boilerplate code to get up and running quickly. DBSCAN and HDBSCAN are generally known not to handle high variance within clusters very well. If that ends up being important for your use case, you might want to look at OPTICS, which handles that better.
Alternatively, taking a step back, you could try computing distances between
There is also another library called pyclustering, which contains many algorithms as well as a set of examples. These algorithms are mainly implemented in C++ under the hood, so they are generally much faster than the versions in more standard Python libraries.
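To illustrate the HDBSCAN route on this kind of input, here is a minimal sketch using the hdbscan package's precomputed-distance mode (parameter names are worth verifying against the current docs):

import numpy as np
import hdbscan

# Full symmetric dissimilarity matrix (toy example from the question).
D = np.array([
    [ 0.,  5., 12.,  7.],
    [ 5.,  0.,  9.,  2.],
    [12.,  9.,  0.,  6.],
    [ 7.,  2.,  6.,  0.],
])

# metric="precomputed" tells HDBSCAN to treat the input as pairwise
# distances rather than as feature vectors.
clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
labels = clusterer.fit_predict(D)  # label -1 marks points left as noise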
answered 10 hours ago by n1k31t4