K-means clustering on text data
















































I have a large raw dataset on crime that I want to cluster using k-means. However, I get an error when I run this code:

Rawdata.3means <- kmeans(Rawdata, centers = 3)

Error:

Error in kmeans(Rawdata, centers = 3) : 
  more cluster centers than distinct data points.
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion

This is my first time using R and RStudio, so I would be grateful for any help.







Tags: r, dataset, clustering, k-means, rstudio














edited yesterday by Siong Thye Goh
asked yesterday by jen ki






















1 Answer





























































































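K-means uses the mean of your data points to compute the cluster centers, so it only works on numeric data. If your dataset is made of plain text or other kinds of factors (i.e. not numbers), it won't work as is: the values that cannot be converted to numbers become NA (the warning you see), which is most likely why kmeans() then reports too few distinct data points. You need an extra preprocessing step that turns your data into numbers before you can apply k-means or most other ML algorithms.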
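To check whether this is what happens with your data, you can look at the column types first. The following is only a minimal sketch: `Rawdata` here is a made-up two-column data frame, not the real crime dataset, and the coercion step merely imitates what the warning message suggests is going on.

```r
# Hypothetical stand-in for the crime data: one text column and one
# "numeric" column that was read in as text because of the symbols.
Rawdata <- data.frame(
  offence = c("Theft", "Burglary", "Fraud", "Theft"),
  cost    = c("\u00a3120", "\u00a3450+", "\u00a390", "\u00a3300"),
  stringsAsFactors = FALSE
)

# Which columns are already numeric? Only those are usable by kmeans().
is_num <- vapply(Rawdata, is.numeric, logical(1))
print(is_num)

# Turning the data frame into a matrix gives a character matrix, and
# converting that to numbers yields NA for every cell that is not a number.
m <- as.matrix(Rawdata)
coerced <- suppressWarnings(as.numeric(m))
print(sum(is.na(coerced)))  # number of cells lost to coercion
```

If every column shows up as non-numeric, the data has to be encoded or cleaned before clustering.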




1. Categorical dataset: your data consists of categories, for example a column of fruits with the values apple, orange, banana, etc. In that case you can use one-hot encoding, which transforms a category column into multiple columns that each indicate whether the sample belongs to the relevant category (i.e. for a column with 3 fruit types you get 3 new binary (1 or 0) columns: is apple? is orange? is banana?). Read more about how to do it in R here: One hot encoding in R
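As a rough illustration of the one-hot idea, here is a sketch in base R. The data frame `df` and its columns `fruit` and `weight` are invented for the example, and `model.matrix()` is just one of several ways to build the indicator columns.

```r
# Hypothetical example data: one categorical column and one numeric column.
df <- data.frame(
  fruit  = factor(c("apple", "orange", "banana", "apple", "banana", "orange")),
  weight = c(150, 130, 120, 160, 115, 140)
)

# "~ fruit - 1" drops the intercept, so every level gets its own 0/1 column.
onehot <- model.matrix(~ fruit - 1, data = df)

# Combine the indicators with the scaled numeric column into one numeric matrix.
x <- cbind(onehot, weight = as.numeric(scale(df$weight)))

set.seed(42)  # k-means starts from random centers
fit <- kmeans(x, centers = 3, nstart = 10)
print(fit$cluster)  # cluster index for each row
```

Scaling the numeric column keeps it from dominating the 0/1 indicators in the distance calculation; whether that is the right weighting depends on your data.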


Update: as suggested in the comments, k-means is often not the best approach for clustering categorical data, and in some cases you can get much better results with a more suitable method. Here is a link to another (more advanced) method for clustering categorical data in R: the ROCK algorithm (Kaggle notebook). You can also read about k-modes, which is similar to k-means but designed for categories and is implemented in R.
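For k-modes, one implementation I am aware of is `kmodes()` in the klaR package on CRAN. The sketch below assumes klaR is installed (the guard simply skips the clustering when it is not), and the `crimes` data frame with its column names is made up for illustration.

```r
# Hypothetical categorical data; the column names are invented.
crimes <- data.frame(
  offence = factor(c("theft", "theft", "fraud", "fraud",
                     "theft", "fraud", "theft", "fraud")),
  region  = factor(c("north", "north", "south", "south",
                     "north", "south", "south", "north")),
  outcome = factor(c("fine", "fine", "prison", "prison",
                     "fine", "prison", "fine", "prison"))
)

# klaR is assumed to be installed; without it this block does nothing.
if (requireNamespace("klaR", quietly = TRUE)) {
  set.seed(1)  # the initial modes are sampled at random
  fit <- klaR::kmodes(crimes, modes = 2)
  print(fit$cluster)  # cluster index for each row
  print(fit$modes)    # most frequent category per column in each cluster
}
```

Unlike the one-hot route, this works on the factor columns directly, so no numeric encoding is needed.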




2. If your dataset is plain text (like tweets or Stack Exchange posts):
  One common method is TF-IDF (but there are many more). You can read more here:
  Text clustering using R: an introduction for data scientists
  and here in a nice Kaggle R notebook:
  R : cleaning data, and using TF-IDF
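To make the plain-text case a bit more concrete, here is a small TF-IDF sketch written in base R by hand so that each step is visible. The four documents are invented, the tokenisation is deliberately naive (lower-case and split on whitespace), and in practice a text-mining package would usually handle this part.

```r
# Hypothetical mini corpus standing in for tweets or posts.
docs <- c(
  "car stolen from street",
  "stolen car found burned",
  "bank fraud by email",
  "email fraud targets bank customers"
)

# Tokenise on whitespace after lower-casing.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per word.
tf <- t(vapply(tokens, function(tk) {
  as.numeric(table(factor(tk, levels = vocab))) / length(tk)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: words found in fewer documents get more weight.
idf   <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

set.seed(42)  # k-means starts from random centers
fit <- kmeans(tfidf, centers = 2, nstart = 10)
print(fit$cluster)  # cluster index for each document
```

The resulting `tfidf` matrix is purely numeric, which is what kmeans() needs; real text usually also needs stop-word removal and stemming before this step.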







answered yesterday by Latent, edited 15 hours ago








• Maybe you could flesh out your answer a bit more by suggesting what preprocessing could be done to convert strings to a suitable format?
  – HFulcher, yesterday

• Hi, thanks for replying. In my dataset, I have text and some numeric values with plus and pound symbols. This is where I got the data from, and it's related to crime: old.datahub.io/dataset/uk-criminal-justice/resource/….
  – jen ki, yesterday

• @jenki, your dataset is categorical. I've added the common method for handling that type of data to the main answer. There are more advanced methods, but one-hot encoding is (as far as I know) the most common one for this type of data.
  – Latent, yesterday

• Thank you @Latent. I'll look at that.
  – jen ki, yesterday

• While you can use one-hot encoding and similar, that usually yields quite poor and uninterpretable results. Using a method that is actually designed for text or factors is better.
  – Anony-Mousse, 16 hours ago
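Regarding the numeric values with plus and pound symbols mentioned in the comments: such columns are usually read in as text, so they need cleaning before they count as numbers. A minimal sketch, assuming values shaped like the made-up `raw_cost` vector below (the real column names and formats may differ):

```r
# Hypothetical values in the style described in the comment.
raw_cost <- c("\u00a31,200", "\u00a3450+", "90", "\u00a33,000+")

# Keep only digits and the decimal point, then convert to numeric.
# Note: this also drops minus signs, so adapt the pattern if values can be negative.
clean_cost <- as.numeric(gsub("[^0-9.]", "", raw_cost))
print(clean_cost)
```

Once cleaned this way, the column can be passed to kmeans() together with any other numeric columns.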































