PCA-like analysis for dataset that has both categorical and continuous variables












I have a dataset containing one categorical variable and multiple continuous variables. The categorical variable is coded as discrete integers, whereas the continuous variables are floats. I believe that the variance in my dataset can be almost entirely described by the single categorical variable and one of the many continuous variables. To justify this, I would like to use PCA, but I'm not sure of the best approach when categorical data is involved. Any suggestions?







Tags: dataset, statistics
















asked Sep 19 '18 at 15:52
Andrew





PCA requires you to be able to define meaningful distances between categories.
– oW_, Dec 21 '18 at 18:42














2 Answers
How many values can the categorical variable take?

Maybe make one column for each possible value, set to 1 if the row's category matches that column and 0 otherwise (one-hot encoding).

I think the categorical variable will then show up in the PCA.
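To illustrate the indicator-column idea, here is a minimal sketch using only NumPy. The toy data, the helper names `one_hot` and `pca_explained_variance_ratio`, and the choice to standardize the continuous columns are all assumptions made for the example, not part of the question.

```python
import numpy as np

def one_hot(codes):
    """Return (levels, indicator matrix) for an array of integer category codes."""
    levels, inverse = np.unique(codes, return_inverse=True)
    # Row i of the identity matrix is the indicator vector for level i.
    return levels, np.eye(len(levels))[inverse]

def pca_explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                      # center every column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values of centered data
    var = s ** 2
    return var / var.sum()

# Hypothetical toy data: one categorical variable and two continuous ones.
rng = np.random.default_rng(0)
category = rng.integers(0, 3, size=200)
x1 = rng.normal(size=200)
x2 = 0.1 * rng.normal(size=200)

levels, indicators = one_hot(category)

# Standardize the continuous columns (an assumption; other scalings are possible).
continuous = np.column_stack([x1, x2])
continuous = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

X = np.hstack([indicators, continuous])
ratios = pca_explained_variance_ratio(X)
print(ratios.round(3))
```

Note that the indicator columns always sum to 1 per row, so one component carries essentially zero variance. How the indicator columns are scaled relative to the continuous ones changes the result, so treat the ratios as dependent on that choice.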

















answered Sep 19 '18 at 19:34
Pieter21

I'm not aware of any dimensionality reduction algorithm (like PCA) that can work directly with categorical values.

However, one approach that could help is to one-hot encode your categorical variables, provided the number of possible values is manageable. Otherwise, keep only the most frequent values and assign the rest to a single catch-all variable.

If you are using pandas DataFrames, get_dummies can be helpful.
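As a rough sketch of that suggestion, the snippet below keeps the most frequent categories, groups the rest under a single label, and then calls `pandas.get_dummies`. The helper name `encode_with_rare_grouping`, the `"other"` label, the column names, and the toy values are hypothetical choices for illustration.

```python
import pandas as pd

def encode_with_rare_grouping(df, column, top_n):
    """One-hot encode `column`, keeping the top_n most frequent values
    and mapping every other value to the label 'other'."""
    values = df[column].astype(str)                      # treat codes as labels
    top_values = values.value_counts().nlargest(top_n).index
    grouped = values.where(values.isin(top_values), other="other")
    dummies = pd.get_dummies(grouped, prefix=column, dtype=float)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Hypothetical toy data: integer-coded category plus two continuous columns.
df = pd.DataFrame({
    "category": [0, 0, 0, 1, 1, 2, 3, 4],
    "x1": [0.5, 1.2, 0.9, 3.1, 2.8, 5.0, 4.4, 6.1],
    "x2": [1.0, 0.8, 1.1, 0.9, 1.2, 1.0, 0.7, 1.3],
})

encoded = encode_with_rare_grouping(df, "category", top_n=2)
print(encoded.columns.tolist())
```

The resulting frame is fully numeric, so it can be passed to whichever PCA implementation you prefer, usually after scaling the continuous columns.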

















answered Dec 21 '18 at 9:09
Arthur Camara
