K-means clustering on text data

I have a large raw dataset on crime and I want to cluster it using k-means. However, I get an error when I run this code:

Rawdata.3means <- kmeans(Rawdata, centers = 3)

Error:

Error in kmeans(Rawdata, centers = 3) : 
  more cluster centers than distinct data points.
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion

It's my first time using R and RStudio, so I would be grateful if you could help me out.

asked yesterday by jen ki (new contributor); edited yesterday by Siong Thye Goh


r dataset clustering k-means rstudio


1 Answer

K-means clusters by computing the means of your data points, so it needs numeric input. If your dataset consists of plain text or factors (i.e. not numbers), it won't work as-is. You need to preprocess the data before you can apply k-means, or most other ML algorithms.
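To see why the warning in the question appears, it can help to check which columns are actually numeric before calling kmeans(). The sketch below uses a made-up toy data frame (the column names are hypothetical, not taken from the crime dataset) and assumes the numeric values are stored as text with currency and plus symbols, as described in the comments.

```r
# Hypothetical toy data standing in for the crime dataset (not the real columns).
# "\u00a3" is the pound sign, written as an escape to avoid encoding problems.
toy <- data.frame(
  region   = c("North", "South", "East", "West"),
  spending = c("\u00a31,200", "\u00a3950+", "\u00a33,400", "\u00a3780"),
  offences = c(120, 95, 340, 78),
  stringsAsFactors = FALSE
)

# Which columns are numeric? Only these can be passed to kmeans() directly.
is_num <- vapply(toy, is.numeric, logical(1))
print(is_num)

# Coercing such text to numbers yields NA, which is what the warning
# "NAs introduced by coercion" is reporting.
coerced <- suppressWarnings(as.numeric(toy$spending))
print(coerced)

# One possible clean-up: keep only digits and the decimal point, then convert.
toy$spending_num <- as.numeric(gsub("[^0-9.]", "", toy$spending))
print(toy$spending_num)
```

Whether stripping the symbols is appropriate depends on what they mean in the real data (a trailing plus may mark a lower bound, for example), so treat this as a starting point only.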

Categorical dataset: your data takes the form of categories, e.g. a column of fruits with the values apple, orange, banana, etc. In that case you can use one-hot encoding, which transforms a category column into multiple columns that each indicate whether the sample belongs to the relevant category (i.e. for a column with 3 fruit types you get 3 new binary (1 or 0) columns: is apple? is orange? is banana?). Read more about how to do it in R here: One hot encoding in R
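As a rough illustration of the fruit example, here is a minimal sketch using base R's model.matrix(). The data frame and its column name are invented for the example; this is one of several ways to one-hot encode in R, not necessarily the one in the linked post.

```r
# Toy categorical column (hypothetical data).
fruits <- data.frame(
  fruit = factor(c("apple", "orange", "banana", "apple", "banana", "orange"))
)

# "~ fruit - 1" removes the intercept, so each level gets its own 0/1 column.
onehot <- model.matrix(~ fruit - 1, data = fruits)
print(onehot)

# The encoded matrix is numeric, so kmeans() accepts it.
set.seed(1)
fit <- kmeans(onehot, centers = 3)
print(fit$cluster)
```

Note that with several factor columns in one formula, model.matrix() keeps all levels only for the first factor, so each column may need to be encoded separately.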

Update: as some suggested in the comments, k-means is not the best approach for clustering categorical data, and in some cases you can get much better results with more suitable approaches. Here is a link to another (more advanced) method for clustering categorical data in R: the ROCK algorithm (Kaggle notebook). You can also read about k-modes, which is similar to k-means but designed for categories, and is implemented in R.

If your dataset is plain text (like tweets or Stack Exchange posts):
One common method is TF-IDF (but there are many more). You can read more here:
Text clustering using R: an introduction for data scientists
and here in a nice Kaggle R notebook:
R : cleaning data, and using TF-IDF
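For a feel of what TF-IDF followed by k-means looks like, here is a small base-R sketch. The documents are invented, the tokenisation is deliberately naive, and the weighting (term frequency times log(N / document frequency)) is one common variant; dedicated text-mining packages offer more complete versions.

```r
# Invented example documents.
docs <- c(
  "burglary reported in city centre",
  "vehicle theft reported near station",
  "burglary and theft in city suburb",
  "court sentencing for fraud case",
  "fraud case sentencing delayed in court"
)

# Naive tokenisation: lower-case, split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term.
tf <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab
tf <- tf / rowSums(tf)  # normalise by document length

# Inverse document frequency: terms found in fewer documents weigh more.
idf <- log(nrow(tf) / colSums(tf > 0))

# TF-IDF: scale each term column by its IDF weight.
tfidf <- sweep(tf, 2, idf, `*`)

# Cluster the documents on their TF-IDF vectors.
set.seed(1)
fit <- kmeans(tfidf, centers = 2, nstart = 10)
print(fit$cluster)
```

On real text you would usually also remove stop words and very rare terms before clustering.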

answered yesterday by Latent; edited 15 hours ago

Maybe you could flesh out your answer a bit more by suggesting what preprocessing could be done to convert strings to a suitable format?
– HFulcher
yesterday

Hi, thanks for replying. In my dataset, I have text and some numeric values with plus and pound symbols. This is where I got the data from, and it's related to crime: old.datahub.io/dataset/uk-criminal-justice/resource/….
– jen ki
yesterday

@jenki, your dataset is categorical. I've added the common method for handling that type of data to the main answer. There are more advanced methods, but one-hot encoding is (as far as I know) the most common method for that type of data.
– Latent
yesterday

Thank you @Latent. I'll look at that.
– jen ki
yesterday

While you can use one-hot encoding and similar, that usually yields quite poor and uninterpretable results. Using a method that is actually designed for text or factors is better.
– Anony-Mousse
16 hours ago

