Clustering numbers into fixed buckets
I am trying to put numeric data into a fixed number of buckets using Python or R.
I have data in key:value format, e.g. {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, ......, 50: 24.1}, where each pair is device_id: data_usage.
I need to bucket the data by value, using the bucket edges (1, 5, 25, 50, 150, 250, 1000, 5000, 10000), so that later I can see which data points are in which bucket.
What algorithm or tool can do this in Python or R?
Tags: python, r, clustering
– n1k31t4 (10 hours ago): Do you need to then use the data from each bucket? Or do you just want to know how many are in each bucket?
– roy (10 hours ago): Updated my question
asked 10 hours ago by roy (a new contributor), edited 8 hours ago
1 Answer
You don't really need to implement an algorithm to achieve this. There are a few tools that will do it for you.
You can get the data assigned to buckets for further processing using Pandas, or simply count how many values fall into each bucket using NumPy.

Assign to buckets

You just need to create a Pandas DataFrame with your data and then call the handy cut function, which will put each value into a bucket/bin of your definition. From the documentation:

    Use cut when you need to segment and sort data values into bins.

In [1]: import pandas as pd
In [2]: import numpy as np  # to create dummy data

Create some dummy data, put it in a DataFrame, and define the bins:

In [3]: data = np.random.randint(low=1, high=10001, size=1000)
In [4]: df = pd.DataFrame(data=data, columns=["data"])
In [5]: bins = np.array([1, 5, 25, 50, 150, 250, 1000, 5000, 10000])

Pass the data, along with the bin definitions, to the cut function and assign the result back as a new column in the DataFrame:

In [6]: df["bucket"] = pd.cut(df.data, bins)

You can then inspect the first few rows to see that each value has been labelled with the relevant bucket:

In [7]: df.head()
Out[7]:
   data         bucket
0  8754  (5000, 10000]
1  2970   (1000, 5000]
2  6778  (5000, 10000]
3  2550   (1000, 5000]
4  5226  (5000, 10000]

Counting how many in each bucket

Here is an example using NumPy to get an idea of the distribution, as a histogram. Using the data and bins defined above, we pass them to the NumPy histogram function, which counts how many data points fall into each bin:

In [8]: np.histogram(data, bins)
Out[8]:
(array([  0,   2,   1,   8,   6,  61, 417, 505]),
 array([    1,     5,    25,    50,   150,   250,  1000,  5000, 10000]))

The first array tells you how many values fell into each bin, and the second confirms the bin edges that were used.
You can get your dictionary of data into the same form as the dummy data above (a NumPy array) like this:

data = np.array([v for v in your_dict.values()])
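If you want to see the underlying algorithm rather than rely on a library, the same right-closed binning that pd.cut performs can be sketched in plain Python with the standard library's bisect module. The sample values below are made up for illustration; the edges are those from the question:

```python
import bisect

# Bin edges from the question; treated as right-closed intervals,
# i.e. (1, 5], (5, 25], ..., (5000, 10000], matching pd.cut's default
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

def bucket_of(value):
    # bisect_left finds the first edge >= value; with right-closed
    # intervals that edge is the upper bound of the value's bucket
    i = bisect.bisect_left(bins, value)
    if i == 0 or i == len(bins):
        return None  # value falls outside the outermost edges
    return (bins[i - 1], bins[i])

for v in [12.3, 4.7, 7.4, 15.9, 24.1]:
    print(v, "->", bucket_of(v))
```

Note that, as with pd.cut's defaults, a value equal to the lowest edge (here 1) falls outside every bucket, because the first interval (1, 5] excludes its left edge.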
– roy (8 hours ago): The data, which is in key:value format, is actually device_id:data_usage. And at the end I need to know which device belongs to which bin.
– n1k31t4 (8 hours ago): @roy - then you can use the device_id as the index in the DataFrame. That way you don't lose the information.
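To make the index idea from the comment concrete, here is a minimal sketch, assuming an abbreviated version of the question's dict (the variable names are illustrative). Building a Series from the dict makes the device ids the index, so they survive the binning and you can group devices by bucket:

```python
import pandas as pd

# Abbreviated device_id: data_usage dict from the question (illustrative)
usage = {1: 12.3, 2: 4.7, 3: 7.4, 4: 15.9, 50: 24.1}
bins = [1, 5, 25, 50, 150, 250, 1000, 5000, 10000]

# The dict keys (device ids) become the Series index
df = pd.Series(usage, name="data").to_frame()
df["bucket"] = pd.cut(df["data"], bins)

# Which devices belong to which bucket
for bucket, group in df.groupby("bucket", observed=True):
    print(bucket, "->", list(group.index))
```

If you only need per-bucket totals rather than the device lists, `df["bucket"].value_counts()` gives the counts without the loop.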
answered 9 hours ago by n1k31t4