How do we interpret the outputs of DBSCAN clustering?

I am starting to learn DBSCAN for clustering but the interpretation part of it seems to be tricky to understand.

dataset = np.vstack((quotient_times, quotient)).T



scaler = StandardScaler()

dataset = scaler.fit_transform(dataset)



db_scan = DBSCAN(eps=0.6, min_samples=1)

db_scan.fit(dataset)



colors = [int(i % 23) for i in db_scan.labels_]



plt.figure();

plt.title(fname)

plt.xlabel("time")

plt.ylabel("quotient")

plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)

plt.legend()

plt.show()

This produces the following plot.

enter image description here

When i print the values of db_scan.labels_, I am getting

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

I was expective my y-axis values from 0-1 and x-axis values from 1-7. Why do I have the negative scales both on the x-axis and y-axis? and how do we interpret these values and the plot?

asked yesterday

Brown

244

add a comment |

I am starting to learn DBSCAN for clustering but the interpretation part of it seems to be tricky to understand.

dataset = np.vstack((quotient_times, quotient)).T



scaler = StandardScaler()

dataset = scaler.fit_transform(dataset)



db_scan = DBSCAN(eps=0.6, min_samples=1)

db_scan.fit(dataset)



colors = [int(i % 23) for i in db_scan.labels_]



plt.figure();

plt.title(fname)

plt.xlabel("time")

plt.ylabel("quotient")

plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)

plt.legend()

plt.show()

This produces the following plot.

enter image description here

When i print the values of db_scan.labels_, I am getting

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

I was expective my y-axis values from 0-1 and x-axis values from 1-7. Why do I have the negative scales both on the x-axis and y-axis? and how do we interpret these values and the plot?

asked yesterday

Brown

244

add a comment |

I am starting to learn DBSCAN for clustering but the interpretation part of it seems to be tricky to understand.

dataset = np.vstack((quotient_times, quotient)).T



scaler = StandardScaler()

dataset = scaler.fit_transform(dataset)



db_scan = DBSCAN(eps=0.6, min_samples=1)

db_scan.fit(dataset)



colors = [int(i % 23) for i in db_scan.labels_]



plt.figure();

plt.title(fname)

plt.xlabel("time")

plt.ylabel("quotient")

plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)

plt.legend()

plt.show()

This produces the following plot.

enter image description here

When i print the values of db_scan.labels_, I am getting

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

I was expective my y-axis values from 0-1 and x-axis values from 1-7. Why do I have the negative scales both on the x-axis and y-axis? and how do we interpret these values and the plot?

asked yesterday

Brown

244

I am starting to learn DBSCAN for clustering but the interpretation part of it seems to be tricky to understand.

dataset = np.vstack((quotient_times, quotient)).T



scaler = StandardScaler()

dataset = scaler.fit_transform(dataset)



db_scan = DBSCAN(eps=0.6, min_samples=1)

db_scan.fit(dataset)



colors = [int(i % 23) for i in db_scan.labels_]



plt.figure();

plt.title(fname)

plt.xlabel("time")

plt.ylabel("quotient")

plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)

plt.legend()

plt.show()

This produces the following plot.

enter image description here

When i print the values of db_scan.labels_, I am getting

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

I was expective my y-axis values from 0-1 and x-axis values from 1-7. Why do I have the negative scales both on the x-axis and y-axis? and how do we interpret these values and the plot?

machine-learning python scikit-learn clustering dbscan

asked yesterday

Brown

244

asked yesterday

Brown

244

asked yesterday

Brown

244

asked yesterday

Brown

244

asked yesterday

Brown

244

add a comment |

2 Answers
2

active

oldest

votes

The output from db_scan.labels_ is the assigned cluster value for each of the points that you provide as input to the algorithm.

You provided 20 points, so there are 20 labels. As explained in the relevant documentation, you will see:

labels_ : array, shape = [n_samples]
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

So each of your points was assigned to one of two clusters, and there were no points marked as noise (receiving a value of -1).

Interpreting the clusters

Your plot shows the original data, plotted on the axes using the original values; only the colour encodes the cluster to which each point was assigned.

As with many clustering algorithms, DBSCAN measures the distance between points in some n-dimensional space. It uses parameters you provide to decide if points belong to the same group, based on their distance from one another.

Why do I have the negative scales both on the x-axis and y-axis?

The distance metrics used do not really care about the sign of the points (where they lie on an axis), simply the distances. In the case of DBSCAN, it is the Euclidean distance. There are other distances that you can pass to the metric parameter of the DBSCAN algorithm; here is the valid list.

answered yesterday

n1k31t4

6,0312318

$begingroup$
Check the DBSCAN documentation...
$endgroup$
– Anony-Mousse
10 hours ago

add a comment |

Don't blindly copy code from the internet.

You copied code that scales data. Hence, the axis changed. But that wasn't DBSCAN, but that was you invoking StandardScaler.

It is important that you understand what you are doing... You have a time axis and a value axis. These are completely different, and computing Euclidean distance does not make any sense...

What you probably wanted to do with such data is GDBSCAN instead. For Generalized DBSCAN you would define two epsilon thresholds:

One threshold for time: how many time steps apart is still the same cluster?

One threshold for values: what difference in values is still the same cluster?

min_samples should always be at least 3, or you are not doing DBSCAN, but Single-Link clustering.

Then run GDBSCAN on that.

Since sklearn does not have GDBSCAN, you can emulate this here by scaling and maximum norm, but that is a bit hackish.

answered 10 hours ago

Anony-Mousse

4,801624

$begingroup$
you are right but there is no module gdbscan with the python library
$endgroup$
– Brown
10 hours ago

$begingroup$
I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...
$endgroup$
– Anony-Mousse
5 hours ago

$begingroup$
OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.
$endgroup$
– Brown
4 hours ago

$begingroup$
I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!
$endgroup$
– Brown
4 hours ago

$begingroup$
Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.
$endgroup$
– Anony-Mousse
58 mins ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46092%2fhow-do-we-interpret-the-outputs-of-dbscan-clustering%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

The output from db_scan.labels_ is the assigned cluster value for each of the points that you provide as input to the algorithm.

You provided 20 points, so there are 20 labels. As explained in the relevant documentation, you will see:

labels_ : array, shape = [n_samples]
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

So each of your points was assigned to one of two clusters, and there were no points marked as noise (receiving a value of -1).

Interpreting the clusters

Your plot shows the original data, plotted on the axes using the original values; only the colour encodes the cluster to which each point was assigned.

Why do I have the negative scales both on the x-axis and y-axis?

answered yesterday

n1k31t4

6,0312318

$begingroup$
Check the DBSCAN documentation...
$endgroup$
– Anony-Mousse
10 hours ago

add a comment |

The output from db_scan.labels_ is the assigned cluster value for each of the points that you provide as input to the algorithm.

You provided 20 points, so there are 20 labels. As explained in the relevant documentation, you will see:

labels_ : array, shape = [n_samples]
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

So each of your points was assigned to one of two clusters, and there were no points marked as noise (receiving a value of -1).

Interpreting the clusters

Your plot shows the original data, plotted on the axes using the original values; only the colour encodes the cluster to which each point was assigned.

Why do I have the negative scales both on the x-axis and y-axis?

answered yesterday

n1k31t4

6,0312318

$begingroup$
Check the DBSCAN documentation...
$endgroup$
– Anony-Mousse
10 hours ago

add a comment |

The output from db_scan.labels_ is the assigned cluster value for each of the points that you provide as input to the algorithm.

You provided 20 points, so there are 20 labels. As explained in the relevant documentation, you will see:

labels_ : array, shape = [n_samples]
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

So each of your points was assigned to one of two clusters, and there were no points marked as noise (receiving a value of -1).

Interpreting the clusters

Your plot shows the original data, plotted on the axes using the original values; only the colour encodes the cluster to which each point was assigned.

Why do I have the negative scales both on the x-axis and y-axis?

answered yesterday

n1k31t4

6,0312318

The output from db_scan.labels_ is the assigned cluster value for each of the points that you provide as input to the algorithm.

You provided 20 points, so there are 20 labels. As explained in the relevant documentation, you will see:

labels_ : array, shape = [n_samples]
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

So each of your points was assigned to one of two clusters, and there were no points marked as noise (receiving a value of -1).

Interpreting the clusters

Your plot shows the original data, plotted on the axes using the original values; only the colour encodes the cluster to which each point was assigned.

Why do I have the negative scales both on the x-axis and y-axis?

answered yesterday

n1k31t4

6,0312318

answered yesterday

n1k31t4

6,0312318

answered yesterday

n1k31t4

6,0312318

answered yesterday

n1k31t4

6,0312318

$begingroup$
Check the DBSCAN documentation...
$endgroup$
– Anony-Mousse
10 hours ago

add a comment |

$begingroup$
Check the DBSCAN documentation...
$endgroup$
– Anony-Mousse
10 hours ago

Check the DBSCAN documentation...

– Anony-Mousse
10 hours ago

add a comment |

Don't blindly copy code from the internet.

You copied code that scales data. Hence, the axis changed. But that wasn't DBSCAN, but that was you invoking StandardScaler.

It is important that you understand what you are doing... You have a time axis and a value axis. These are completely different, and computing Euclidean distance does not make any sense...

What you probably wanted to do with such data is GDBSCAN instead. For Generalized DBSCAN you would define two epsilon thresholds:

One threshold for time: how many time steps apart is still the same cluster?

One threshold for values: what difference in values is still the same cluster?

min_samples should always be at least 3, or you are not doing DBSCAN, but Single-Link clustering.

Then run GDBSCAN on that.

Since sklearn does not have GDBSCAN, you can emulate this here by scaling and maximum norm, but that is a bit hackish.

answered 10 hours ago

Anony-Mousse

4,801624

$begingroup$
you are right but there is no module gdbscan with the python library
$endgroup$
– Brown
10 hours ago

$begingroup$
I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...
$endgroup$
– Anony-Mousse
5 hours ago

$begingroup$
OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.
$endgroup$
– Brown
4 hours ago

$begingroup$
I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!
$endgroup$
– Brown
4 hours ago

$begingroup$
Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.
$endgroup$
– Anony-Mousse
58 mins ago

add a comment |

Don't blindly copy code from the internet.

You copied code that scales data. Hence, the axis changed. But that wasn't DBSCAN, but that was you invoking StandardScaler.

It is important that you understand what you are doing... You have a time axis and a value axis. These are completely different, and computing Euclidean distance does not make any sense...

What you probably wanted to do with such data is GDBSCAN instead. For Generalized DBSCAN you would define two epsilon thresholds:

One threshold for time: how many time steps apart is still the same cluster?

One threshold for values: what difference in values is still the same cluster?

min_samples should always be at least 3, or you are not doing DBSCAN, but Single-Link clustering.

Then run GDBSCAN on that.

Since sklearn does not have GDBSCAN, you can emulate this here by scaling and maximum norm, but that is a bit hackish.

answered 10 hours ago

Anony-Mousse

4,801624

$begingroup$
you are right but there is no module gdbscan with the python library
$endgroup$
– Brown
10 hours ago

$begingroup$
I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...
$endgroup$
– Anony-Mousse
5 hours ago

$begingroup$
OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.
$endgroup$
– Brown
4 hours ago

$begingroup$
I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!
$endgroup$
– Brown
4 hours ago

$begingroup$
Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.
$endgroup$
– Anony-Mousse
58 mins ago

add a comment |

Don't blindly copy code from the internet.

You copied code that scales data. Hence, the axis changed. But that wasn't DBSCAN, but that was you invoking StandardScaler.

It is important that you understand what you are doing... You have a time axis and a value axis. These are completely different, and computing Euclidean distance does not make any sense...

What you probably wanted to do with such data is GDBSCAN instead. For Generalized DBSCAN you would define two epsilon thresholds:

One threshold for time: how many time steps apart is still the same cluster?

One threshold for values: what difference in values is still the same cluster?

min_samples should always be at least 3, or you are not doing DBSCAN, but Single-Link clustering.

Then run GDBSCAN on that.

Since sklearn does not have GDBSCAN, you can emulate this here by scaling and maximum norm, but that is a bit hackish.

answered 10 hours ago

Anony-Mousse

4,801624

Don't blindly copy code from the internet.

You copied code that scales data. Hence, the axis changed. But that wasn't DBSCAN, but that was you invoking StandardScaler.

It is important that you understand what you are doing... You have a time axis and a value axis. These are completely different, and computing Euclidean distance does not make any sense...

What you probably wanted to do with such data is GDBSCAN instead. For Generalized DBSCAN you would define two epsilon thresholds:

One threshold for time: how many time steps apart is still the same cluster?

One threshold for values: what difference in values is still the same cluster?

min_samples should always be at least 3, or you are not doing DBSCAN, but Single-Link clustering.

Then run GDBSCAN on that.

Since sklearn does not have GDBSCAN, you can emulate this here by scaling and maximum norm, but that is a bit hackish.

answered 10 hours ago

Anony-Mousse

4,801624

answered 10 hours ago

Anony-Mousse

4,801624

answered 10 hours ago

Anony-Mousse

4,801624

answered 10 hours ago

Anony-Mousse

4,801624

$begingroup$
you are right but there is no module gdbscan with the python library
$endgroup$
– Brown
10 hours ago

$begingroup$
I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...
$endgroup$
– Anony-Mousse
5 hours ago

$begingroup$
OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.
$endgroup$
– Brown
4 hours ago

$begingroup$
I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!
$endgroup$
– Brown
4 hours ago

$begingroup$
Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.
$endgroup$
– Anony-Mousse
58 mins ago

add a comment |

$begingroup$
you are right but there is no module gdbscan with the python library
$endgroup$
– Brown
10 hours ago

$begingroup$
I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...
$endgroup$
– Anony-Mousse
5 hours ago

$begingroup$
OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.
$endgroup$
– Brown
4 hours ago

$begingroup$
I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!
$endgroup$
– Brown
4 hours ago

$begingroup$
Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.
$endgroup$
– Anony-Mousse
58 mins ago

you are right but there is no module gdbscan with the python library

– Brown
10 hours ago

I don t think k-means is a good choice for such data, because of its sensitivity to outliers. As for your question, don't expect computers to guess. If you want to plot x and y, tell it to plot x and y. The error exactly tells you what parameter is wrong...

– Anony-Mousse
5 hours ago

OK thank you but why do you have to downvote the question then? I don't think that is fair. There is nothing wrong with being a beginner and that's how you learn new things.

– Brown
4 hours ago

I just asked you for help if you don't think my question is not good enough, you just ignore it. Otherwise, I don't think being harsh and downvoting the question is fair at all. Nobody is born stupid and it is a matter of time and learning things through time. Thanks!

– Brown
4 hours ago

Downvoting means "the question is not likely to help future visitors". This is nothing against you personally. It's just that the documentation of sklearn is fairly detailed, and the answer is there already, so the question is not too useful for others.

– Anony-Mousse
58 mins ago

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk