K-means clustering on text data
$begingroup$
I have a large raw dataset on crime and I want to cluster it using k-means. However, I get the error below when I run this code:
Rawdata.3means <- kmeans(Rawdata, centers = 3)
Error:
Error in kmeans(Rawdata, centers = 3) :
more cluster centers than distinct data points.
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
It's my first time using the R language and RStudio, so I would be grateful if you could help me out.
r dataset clustering k-means rstudio
$endgroup$
edited yesterday
Siong Thye Goh
asked yesterday
jen ki
1 Answer
$begingroup$
K-means uses the mean of your data points to form clusters. If your dataset consists of plain text or other factor-type (i.e. non-numeric) columns, it won't work as-is: R tries to coerce those values to numbers, which produces the NAs reported in your warning. You need a preprocessing step before you can apply k-means or most other ML algorithms.
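To see where the "NAs introduced by coercion" warning comes from, here is a minimal sketch. The small `Rawdata` data frame and its column names (`offence`, `amount`) are made up for illustration and are not the asker's actual data; the cleaning step assumes the numeric values only carry extra symbols such as a pound sign, a plus sign or thousands separators.

```r
# Hypothetical stand-in for Rawdata: a text column and a numeric column stored as text
Rawdata <- data.frame(
  offence = c("Burglary", "Robbery", "Fraud", "Burglary"),
  amount  = c("£1,200", "+300", "£45", "£980"),
  stringsAsFactors = FALSE
)

# Check which columns are actually numeric (here: none of them)
is_num <- vapply(Rawdata, is.numeric, logical(1))
print(is_num)

# Coercing plain text to numbers gives NA, which is what the warning reports
coerced <- suppressWarnings(as.numeric(Rawdata$offence))
print(coerced)

# Keep only digits, decimal points and minus signs, then convert to numeric
Rawdata$amount_num <- as.numeric(gsub("[^0-9.-]", "", Rawdata$amount))
print(Rawdata$amount_num)
```

Running `str(Rawdata)` on the real data is a quick way to spot columns that look numeric but are stored as text.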
- Categorical dataset: your data consists of categories, e.g. a fruit column with values such as apple, orange and banana. You can use one-hot encoding, which transforms the category column into multiple columns, each indicating whether the sample belongs to the relevant category (i.e. for a column with 3 fruit types you get 3 new binary (1 or 0) columns: is apple? is orange? is banana?). Read more about how to do it in R here: One hot encoding in R
Update: as some suggested in the comments, k-means is not the best approach for clustering categorical data, and in some cases you can get much better results with more suitable approaches. Here is a link to another (more advanced) method for clustering categorical data in R: ROCK algorithm (kaggle notebook). You can also read about k-modes, which is similar to k-means but works on categories and is implemented in R.
- Plain-text dataset (like tweets or Stack Exchange posts):
One common method is TF-IDF (there are many more). You can read more here:
Text clustering using R: an introduction for data scientists
and here in a nice kaggle R notebook:
R : cleaning data, and using TF-IDF
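The one-hot encoding step from the first bullet could look roughly like this in base R. This is only a sketch: the `df` data frame, its `fruit` and `size` columns, and the choice of 2 clusters are assumptions for illustration, and `model.matrix()` is just one of several ways to build the indicator columns.

```r
# Hypothetical data: one categorical column plus one numeric column
df <- data.frame(
  fruit = factor(c("apple", "orange", "banana", "apple", "banana", "orange")),
  size  = c(1.0, 2.5, 3.0, 1.2, 2.8, 2.4)
)

# One-hot encode: "- 1" removes the intercept so every fruit level gets its own 0/1 column
X <- model.matrix(~ fruit + size - 1, data = df)
print(colnames(X))

# Put the numeric column on a scale comparable to the 0/1 indicators
X[, "size"] <- as.numeric(scale(X[, "size"]))

# k-means now receives a purely numeric matrix
set.seed(42)  # k-means starts from random centers
fit <- kmeans(X, centers = 2, nstart = 10)
print(table(fit$cluster, df$fruit))
```

As noted in the update, the resulting clusters can be hard to interpret, so treat this as a starting point rather than a recommendation.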
$endgroup$
edited 15 hours ago
answered yesterday
Latent
$begingroup$
Maybe you could flesh out your answer a bit more by suggesting what preprocessing could be done to convert strings to a suitable format?
$endgroup$
– HFulcher
yesterday
$begingroup$
Hi, thanks for replying. In my dataset, I have text and some numeric values with plus and pound symbols. This is where I got the data from, and it's related to crime: old.datahub.io/dataset/uk-criminal-justice/resource/….
$endgroup$
– jen ki
yesterday
$begingroup$
@jenki, your dataset is categorical. I've added to the main answer the common method for handling that type of data. There are more advanced methods, but one-hot encoding is (as far as I know) the most common method for that type of data.
$endgroup$
– Latent
yesterday
$begingroup$
Thank you @Latent. I'll look at that.
$endgroup$
– jen ki
yesterday
$begingroup$
While you can use one-hot encoding and similar, that usually yields quite poor and uninterpretable results. Using a method that is actually designed for text or factors is better.
$endgroup$
– Anony-Mousse
16 hours ago