How can I fill NaN values in a pandas data frame?












Greetings, everyone. I am trying to learn data analysis and machine learning by working through some problems. I found a competition, "House Prices", which is a playground competition. Since I am very new to this field, I got confused after exploring the data. The data has 81 columns, one of which is the target column, the house value. The data contains multiple columns where the majority of values are "NaN". When I ran

nulls = data.isnull().sum()
nulls[nulls > 0]

it showed the columns with missing values:

LotFrontage     259
Alley          1369
MasVnrType        8
MasVnrArea        8
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
FireplaceQu     690
GarageType       81
GarageYrBlt      81
GarageFinish     81
GarageQual       81
GarageCond       81
PoolQC         1453
Fence          1179
MiscFeature    1406

At this point I am totally lost and I don't know how to get rid of these "NaN" values. Any help would be appreciated.







Tags: python, data-cleaning, kaggle














asked Dec 25 '16 at 22:29 by Ahmed Dhanani
edited Nov 16 '17 at 1:38 by timleathart






















          2 Answers
























































You can use the DataFrame.fillna method to fill the NaN values in your data. For example, assuming your data is in a DataFrame called df,

df.fillna(0, inplace=True)

will replace the missing values with the constant value 0. You can also do more clever things, such as replacing the missing values in each numeric column with the mean of that column (numeric_only=True skips string columns, for which a mean is undefined):

df.fillna(df.mean(numeric_only=True), inplace=True)

or carrying forward the last value seen in each column (DataFrame.ffill is the dedicated method for this; recent pandas versions deprecate fillna(method='ffill')):

df.ffill(inplace=True)
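To make forward filling concrete, here is a minimal sketch on a made-up frame (the column names and values are invented for illustration). It uses DataFrame.ffill, the dedicated pandas method for forward filling, and shows one caveat: a NaN in the first row has no earlier value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are made up for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 8.0],
})

# Propagate the last seen value down each column.
filled = df.ffill()

print(filled)
# Column "b" still starts with NaN: nothing precedes the first row.
print(filled.isnull().sum())
```

Forward filling only makes sense when the row order carries meaning (for example, a time series); for unordered rows a column statistic is usually the safer choice.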


Filling in missing values is called imputation. Try several imputation methods and compare which ones work best for your data, for example by checking how each affects your model's validation score.
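Since this dataset mixes numeric and string columns, one possible approach is to impute each kind separately. The following is a minimal sketch rather than a recommendation for every column: the tiny frame and its values are invented for illustration, the choice of median and most frequent value is an assumption, and treating a missing Alley as its own "Missing" category assumes that NaN there means the feature is absent, which should be checked against the data description.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data; all values are invented.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 75.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 162.0],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

# Numeric columns: fill each with its own median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical column with few gaps: fill with the most frequent value.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# Categorical column that is mostly empty: use an explicit label
# (assumption: NaN means "no alley", not "unknown").
df["Alley"] = df["Alley"].fillna("Missing")

print(df)
print(df.isnull().sum())
```

Which statistic suits a column depends on the data, so it is worth comparing a few choices before settling on one.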

















answered Dec 26 '16 at 0:06 by timleathart












• Thanks for the response. The dataset also contains string values. I think df.fillna() will work on float or integer values. Any pointers on converting string values to numeric values? – Ahmed Dhanani, Dec 26 '16 at 13:07

• Ah, I had assumed the data was numeric for some reason. By string values, do you mean categorical data, i.e. strings from a particular set of values? If so, you can use scikit-learn's LabelEncoder. Natural language, on the other hand, is more difficult to deal with. Bag-of-words is probably the easiest to think about, but have a look at these options. – timleathart, Dec 26 '16 at 22:01
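As a rough illustration of the categorical route discussed in these comments, here is a pandas-only sketch of two common encodings: integer codes (similar in spirit to what a label encoder yields, with categories sorted and numbered from 0) and one-hot indicator columns. The column values are invented, and the example assumes the missing values have already been filled.

```python
import pandas as pd

# Hypothetical categorical column whose missing values were already filled.
df = pd.DataFrame({"GarageType": ["Attchd", "Detchd", "Missing", "Attchd"]})

# Integer codes: categories are sorted, then numbered 0..n-1.
codes = df["GarageType"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["GarageType"], prefix="GarageType")

print(codes.tolist())
print(list(one_hot.columns))
```

Integer codes impose an arbitrary ordering on the categories, which some models (linear ones in particular) may misread as a magnitude; one-hot columns avoid that at the cost of a wider table.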




















# Taking care of missing data
import numpy as np
from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22

# Column-wise mean imputation (what axis=0 meant in the old Imputer)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Suppose my array is named X and I want to handle the missing data in the columns indexed 1 and 2 by replacing it with the column mean. SimpleImputer, from the scikit-learn library, is a good class for this (older scikit-learn releases called it Imputer).
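When the data is split into training and test sets, a common practice is to learn the fill values from the training portion only and reuse them on the test portion. Here is a minimal sketch, assuming scikit-learn 0.20 or later, where SimpleImputer (the current class for this kind of imputation) lives in sklearn.impute; the arrays X_train and X_test are hypothetical stand-ins.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical arrays standing in for a train/test split.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan], [5.0, 40.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means from the training data and fills it.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses the training means; nothing is learned from the test data.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
print(X_test_filled)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.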
















answered 45 mins ago by smit patel



