Clustering numbers into fixed buckets
I am trying to put numeric data into a fixed number of buckets using Python or R.
I have data in key:value format, e.g. {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, ......, 50: 24.1}, where each pair is device_id: data_usage.
I need to bucket the data by value, using the bucket edges (1, 5, 25, 50, 150, 250, 1000, 5000, 10000), so that later I can see which data points are in which bucket.
What algorithm or tool can do this in Python or R?
Tags: python, r, clustering
– n1k31t4 (10 hours ago): Do you need to then use the data from each bucket? Or do you just want to know how many are in each bucket?
– roy (10 hours ago): Updated my question
asked 10 hours ago by roy (a new contributor), edited 8 hours ago
1 Answer
You don't really need to implement an algorithm to achieve this. There are a few tools that will do it for you.
You can get the data assigned to buckets for further processing using Pandas, or simply count how many values fall into each bucket using NumPy.

Assign to buckets

You just need to create a Pandas DataFrame with your data and then call the handy cut function, which will put each value into a bucket/bin of your definition. From the documentation:

    Use cut when you need to segment and sort data values into bins.

In [1]: import pandas as pd
In [2]: import numpy as np  # to create dummy data

Create some dummy data, put it in a DataFrame, and define the bins:

In [3]: data = np.random.randint(low=1, high=10001, size=1000)
In [4]: df = pd.DataFrame(data=data, columns=["data"])
In [5]: bins = np.array([1, 5, 25, 50, 150, 250, 1000, 5000, 10000])

Pass the data, along with the bin definitions, to the cut function and assign the result back as a new column in the DataFrame:

In [6]: df["bucket"] = pd.cut(df.data, bins)

You can then inspect the first few rows to see that each value has been labelled with the relevant bucket:

In [7]: df.head()
Out[7]:
   data         bucket
0  8754  (5000, 10000]
1  2970   (1000, 5000]
2  6778  (5000, 10000]
3  2550   (1000, 5000]
4  5226  (5000, 10000]

Counting how many in each bucket

Here is an example using NumPy to get an idea of the distribution, as a histogram. Using the data and bins defined above, we pass them to the NumPy histogram function, which counts how many data points fall into each bin:

In [8]: np.histogram(data, bins)
Out[8]:
(array([  0,   2,   1,   8,   6,  61, 417, 505]),
 array([    1,     5,    25,    50,   150,   250,  1000,  5000, 10000]))

The first array tells you how many values fell into each bin, and the second confirms the bin edges that were used.
You can get your dictionary of data into the same form as the dummy data above (a NumPy array) like this:

data = np.array([v for v in your_dict.values()])
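If you want to see the underlying algorithm rather than rely on a library, the same right-closed binning that pd.cut performs can be sketched in plain Python with the standard library's bisect module. The sample values below are made up for illustration; the edges are those from the question:

```python
import bisect

# Bin edges from the question; treated as right-closed intervals,
# i.e. (1, 5], (5, 25], ..., (5000, 10000], matching pd.cut's default
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

def bucket_of(value):
    # bisect_left finds the first edge >= value; with right-closed
    # intervals that edge is the upper bound of the value's bucket
    i = bisect.bisect_left(bins, value)
    if i == 0 or i == len(bins):
        return None  # value falls outside the outermost edges
    return (bins[i - 1], bins[i])

for v in [12.3, 4.7, 7.4, 15.9, 24.1]:
    print(v, "->", bucket_of(v))
```

Note that, as with pd.cut's defaults, a value equal to the lowest edge (here 1) falls outside every bucket, because the first interval (1, 5] excludes its left edge.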
– roy (8 hours ago): The data, which is in key:value format, is actually device_id:data_usage. And at the end I need to know which device belongs to which bin.
– n1k31t4 (8 hours ago): @roy - then you can use the device_id as the index in the DataFrame. That way you don't lose the information.
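To make the index idea from the comment concrete, here is a minimal sketch, assuming an abbreviated version of the question's dict (the variable names are illustrative). Building a Series from the dict makes the device ids the index, so they survive the binning and you can group devices by bucket:

```python
import pandas as pd

# Abbreviated device_id: data_usage dict from the question (illustrative)
usage = {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, 50: 24.1}
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

# The dict keys (device ids) become the Series index
df = pd.Series(usage, name="data").to_frame()
df["bucket"] = pd.cut(df["data"], bins)

# Which devices belong to which bucket
for bucket, group in df.groupby("bucket", observed=True):
    print(bucket, "->", list(group.index))
```

If you only need per-bucket totals rather than the device lists, `df["bucket"].value_counts()` gives the counts without the loop.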
answered 9 hours ago by n1k31t4