Knn distance plot for determining eps of DBSCAN












I would like to use the k-NN distance plot to work out which eps value I should choose for the DBSCAN algorithm.
Based on this page:




The idea is to calculate, the average of the distances of every point
to its k nearest neighbors. The value of k will be specified by the
user and corresponds to MinPts. Next, these k-distances are plotted in
an ascending order. The aim is to determine the “knee”, which
corresponds to the optimal eps parameter.




Using Python with NumPy and scikit-learn, I have the following points, with the following distances for 6-NN:



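import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# n_neighbors=len(X): every point, including the query point itself, is returned
nbrs = NearestNeighbors(n_neighbors=len(X)).fit(X)
distances, indices = nbrs.kneighbors(X)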

# Indices

[[0 1 2 3 4 5]
[1 0 2 3 4 5]
[2 1 0 3 4 5]
[3 4 5 0 1 2]
[4 3 5 0 1 2]
[5 4 3 0 1 2]]

# Distances
[[ 0. 1. 2.23606798 2.82842712 3.60555128 5. ]
[ 0. 1. 1.41421356 3.60555128 4.47213595 5.83095189]
[ 0. 1.41421356 2.23606798 5. 5.83095189 7.21110255]
[ 0. 1. 2.23606798 2.82842712 3.60555128 5. ]
[ 0. 1. 1.41421356 3.60555128 4.47213595 5.83095189]
[ 0. 1.41421356 2.23606798 5. 5.83095189 7.21110255]]


Then I computed the average distance:



distances.mean()
2.9269575028354495
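
As a side note on what this number represents: `distances.mean()` averages every entry of the matrix, including the zero self-distances in column 0, into a single scalar, whereas the description quoted above would give one value per point. A minimal sketch of the difference, reusing the same `X`; the names `overall_mean` and `per_point_mean` are introduced here for illustration only:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
distances, _ = NearestNeighbors(n_neighbors=len(X)).fit(X).kneighbors(X)

# One scalar for the whole matrix; column 0 (each point's distance to itself) is included.
overall_mean = distances.mean()

# One value per point: mean distance to the other points, skipping the self-distance column.
per_point_mean = distances[:, 1:].mean(axis=1)

print(overall_mean)
print(per_point_mean.shape)
```

Only a per-point quantity can be sorted and plotted as a curve; a single scalar cannot.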


The problem is that I don't understand how exactly to reproduce their plot in Python, with the distances on the y-axis and the points, ordered by distance, on the x-axis.



Thanks for your help.







python clustering parameter-estimation dbscan














asked Feb 9 '16 at 16:29 by marcL, edited Mar 2 '16 at 15:50 by Kasra Manshaei












  • ![enter image description here](i.stack.imgur.com/KFDbs.png) Why does my neighboring point graph have this shape? Please help me!!! – Dung Le, Oct 10 '17 at 0:37




























2 Answers
You:

1. Take the last column of that matrix (the distance to the k-th nearest neighbor).

2. Sort it in descending order.

3. Plot distance against array index.

4. Hope to see a knee (if the distance function does not work well for your data, there might be none).
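
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data set, `k = 4`, and the helper names `k_distances` and `plot_k_distances` are not from the original post, and matplotlib is assumed to be installed for the plotting part. Because querying the training points themselves returns each point as its own nearest neighbor, `n_neighbors=k + 1` is requested here so that the last column is the distance to the k-th other point; whether k should count the point itself is a convention choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Return each point's distance to its k-th nearest other point, sorted descending."""
    # k + 1 neighbours: column 0 is the query point itself (distance 0).
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nbrs.kneighbors(X)
    kth = distances[:, -1]         # step 1: last column
    return np.sort(kth)[::-1]      # step 2: sort descending


def plot_k_distances(kd, k):
    """Step 3: plot array index (x) against sorted k-distance (y)."""
    import matplotlib.pyplot as plt  # imported here so k_distances works without matplotlib

    plt.plot(np.arange(len(kd)), kd)
    plt.xlabel("points, sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbour")
    plt.show()


# Hypothetical toy data: two tight blobs plus a few scattered points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),
])

k = 4
kd = k_distances(X, k)
# plot_k_distances(kd, k)  # uncomment to display the plot
```

Step 4 is then a visual judgment: a candidate eps is the distance at which the curve bends, if such a bend is visible at all.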

















answered Feb 9 '16 at 19:34 by Anony-Mousse












  • Do I do this for different k on the same plot, or only one k per plot as in the example? And what do you mean by "index"? – marcL, Feb 9 '16 at 20:53

  • Using 6-NN when you only have 6 points is of course nonsense. Do it for an appropriate k. "Index" as in "array index", because you need two dimensions to plot. – Anony-Mousse, Feb 9 '16 at 20:57

  • And I only use the last column of the distance matrix? I ask because the example talks about averaging distances. – marcL, Feb 9 '16 at 22:26

  • That post is incorrect there and in at least one other place (you don't need to set a seed). – Anony-Mousse, Feb 9 '16 at 22:46

  • You only have one k. Why don't you use the DBSCAN paper, rather than mashing up various low-quality websites? – Anony-Mousse, Feb 9 '16 at 22:53
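
To illustrate the point about 6-NN on six points, here is a small sketch on the question's data. It checks that, with `n_neighbors=len(X)`, the last column is simply each point's distance to the farthest point in the data set, which says nothing about local density. The second query with `n_neighbors=3` is an arbitrary smaller choice made for this example, not a recommendation from the thread.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
distances, _ = NearestNeighbors(n_neighbors=len(X)).fit(X).kneighbors(X)

# Full pairwise distance matrix, computed directly for comparison.
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# With k equal to the number of points, the last column is just the farthest point.
print(np.allclose(distances[:, -1], pairwise.max(axis=1)))  # True

# A smaller query: 3 columns = the point itself plus its 2 nearest other points.
dist3, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
sorted_last = np.sort(dist3[:, -1])[::-1]  # one value per point, sorted descending
print(sorted_last)
```

The x-axis "index" mentioned in the comments is then just the position `0 .. len(sorted_last) - 1` in this sorted array.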




















Why do we take the last column of the distance matrix? Please elaborate.















answered 7 mins ago by Neha



