Clustering numbers into fixed buckets


























I am trying to put numeric data into a fixed number of buckets using Python or R.



I have data in key:value format {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, ......, 50: 24.1}, where each key is a device_id and each value is that device's data usage. I need to bucket the values using the nine bin edges (1, 5, 25, 50, 150, 250, 1000, 5000, 10000), so that later I can see which data points are in which bucket.



What algorithm or library can do this in Python or R?
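For reference, here is a minimal pure-Python sketch of what I mean, using only the standard-library bisect module (the sample values are a subset of my data, and the integer bucket indices are just for illustration):

```python
import bisect

# Sample of the device_id: data_usage mapping
usage = {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, 50: 24.1}

# Bin edges; bisect_left(edges, v) returns the index of the
# first edge >= v, i.e. the bucket whose upper edge bounds the value
edges = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

buckets = {device: bisect.bisect_left(edges, value)
           for device, value in usage.items()}
# e.g. 4.7 falls below edge 5 (index 1); 12.3 falls below edge 25 (index 2)
```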






















  • Do you need to then use the data from each bucket? Or do you just want to know how many are in each bucket? – n1k31t4 (10 hours ago)










  • Updated my question. – roy (10 hours ago)
















Tags: python r clustering






asked 10 hours ago by roy (edited 8 hours ago)





1 Answer


















You don't really need to implement an algorithm to achieve this. There are a few tools that will do this for you.



You can get the data assigned to buckets for further processing using Pandas, or simply count how many values fall into each bucket using NumPy.



Assign to buckets



You just need to create a Pandas DataFrame with your data and then call the handy cut function, which will put each value into a bucket/bin of your definition. From the documentation:




Use cut when you need to segment and sort data values into bins.




In [1]: import pandas as pd
In [2]: import numpy as np # to create dummy data


Create some dummy data, put it in a dataframe and define the bins:



In [3]: data = np.random.randint(low=1, high=10001, size=1000)                 
In [4]: df = pd.DataFrame(data=data, columns=["data"])
In [5]: bins = np.array([1,5,25,50,150,250,1000,5000,10000])


Pass the data, along with the bin definitions, to the cut function and assign the result back as a new column in the dataframe:



In [6]: df["bucket"] = pd.cut(df.data, bins)


You can then inspect the first few rows to see that the values have now been labelled with the relevant bucket:



In [7]: df.head()
Out[7]:
   data         bucket
0  8754  (5000, 10000]
1  2970   (1000, 5000]
2  6778  (5000, 10000]
3  2550   (1000, 5000]
4  5226  (5000, 10000]


Counting how many in each bucket



Here is an example using NumPy, to get an idea of the distribution, as a histogram.



Using the data and bins as defined above, we pass them to the numpy histogram function, which will count how many data points fall into each bin:



In [8]: np.histogram(data, bins)
Out[8]:
(array([  0,   2,   1,   8,   6,  61, 417, 505]),
 array([    1,     5,    25,    50,   150,   250,  1000,  5000, 10000]))


The first array tells you how many values fell into each bin, and the second array confirms the bin edges that were used.
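If you have already built the bucket column with pd.cut, pandas can produce the same per-bin counts directly via value_counts on the categorical result — a small sketch (the sample values here are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical usage values, one per bucket region of interest
data = np.array([3, 10, 30, 200, 700, 7000])
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

# pd.cut yields a categorical Series; value_counts(sort=False)
# reports the counts in bin order rather than by frequency
counts = pd.cut(pd.Series(data), bins).value_counts(sort=False)
```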





You can get your dictionary of data into the same form as my dummy data above (a numpy array) like this:



data = np.array([v for v in your_dict.values()])
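Since your keys are device_ids you may not want to discard them: as discussed in the comments, you can instead use the device_id as the DataFrame index so the id-to-bucket mapping is preserved. A sketch, using a hypothetical sample of your dictionary:

```python
import pandas as pd

# Hypothetical sample of the device_id: data_usage mapping from the question
usage = {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, 50: 24.1}

# Use the device_id as the DataFrame index so the mapping survives the binning
df = pd.DataFrame.from_dict(usage, orient="index", columns=["data"])
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]
df["bucket"] = pd.cut(df["data"], bins)

# Each row now pairs a device_id (the index) with its bucket
print(df)
```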





  • Data which is in key:value format is actually device_id:data_usages. And at the end I need to know which device belongs to which bin. – roy (8 hours ago)

  • @roy - then you can use the device_id as the index in the DataFrame. That way you don't lose the information. – n1k31t4 (8 hours ago)











answered 9 hours ago by n1k31t4











