Missing population values in census data












1












$begingroup$


I have population data from Census.gov:



Total US population by age by year from 1940 through 2010



Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.



Specifically it follows this pattern:




  • 1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
    and greater

  • 1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

  • 2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater


The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.



Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.



And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.



Some Observations:




  • The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

  • Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.


Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60



All available population values ages 0-99 for years 1940 through 2010



Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.



Average of age per each year and fitted curve of 4th order



Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.



enter image description here



enter image description here



Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.



Total population 1940 through 2010:



enter image description here



Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).



enter image description here





My question



This is where I could use some guidance to move forward:



When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?



Does one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?



If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?










share|improve this question











$endgroup$




bumped to the homepage by Community 7 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.




















    1












    $begingroup$


    I have population data from Census.gov:



    Total US population by age by year from 1940 through 2010



    Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.



    Specifically it follows this pattern:




    • 1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
      and greater

    • 1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

    • 2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater


    The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.



    Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.



    And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.



    Some Observations:




    • The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

    • Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.


    Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60



    All available population values ages 0-99 for years 1940 through 2010



    Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.



    Average of age per each year and fitted curve of 4th order



    Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.



    enter image description here



    enter image description here



    Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.



    Total population 1940 through 2010:



    enter image description here



    Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).



    enter image description here





    My question



    This is where I could use some guidance to move forward:



    When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?



    Does one or more documented methods naturally apply to the problem as I have described it?
    For example: KNN, PCA, BPCA, Mean, MICE, other?



    If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?










    share|improve this question











    $endgroup$




    bumped to the homepage by Community 7 mins ago


    This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.


















      1












      1








      1





      $begingroup$


      I have population data from Census.gov:



      Total US population by age by year from 1940 through 2010



      Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.



      Specifically it follows this pattern:




      • 1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
        and greater

      • 1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

      • 2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater


      The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.



      Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.



      And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.



      Some Observations:




      • The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

      • Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.


      Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60



      All available population values ages 0-99 for years 1940 through 2010



      Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.



      Average of age per each year and fitted curve of 4th order



      Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.



      enter image description here



      enter image description here



      Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.



      Total population 1940 through 2010:



      enter image description here



      Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).



      enter image description here





      My question



      This is where I could use some guidance to move forward:



      When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?



      Does one or more documented methods naturally apply to the problem as I have described it?
      For example: KNN, PCA, BPCA, Mean, MICE, other?



      If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?










      share|improve this question











      $endgroup$




      I have population data from Census.gov:



      Total US population by age by year from 1940 through 2010



      Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.



      Specifically it follows this pattern:




      • 1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
        and greater

      • 1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

      • 2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater


      The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.



      Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.



      And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.



      Some Observations:




      • The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

      • Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.


      Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60



      All available population values ages 0-99 for years 1940 through 2010



      Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.



      Average of age per each year and fitted curve of 4th order



      Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.



      enter image description here



      enter image description here



      Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.



      Total population 1940 through 2010:



      enter image description here



      Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).



      enter image description here





      My question



      This is where I could use some guidance to move forward:



      When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?



      Does one or more documented methods naturally apply to the problem as I have described it?
      For example: KNN, PCA, BPCA, Mean, MICE, other?



      If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?







      r missing-data data-imputation






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 21 '18 at 20:12









      wacax

      1,91021038




      1,91021038










      asked Feb 22 '18 at 15:09









      ThreadidThreadid

      1062




      1062





      bumped to the homepage by Community 7 mins ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







      bumped to the homepage by Community 7 mins ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
























          1 Answer
          1






          active

          oldest

          votes


















          0












          $begingroup$

          I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:




          1. mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

          2. there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?


          I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.






          share|improve this answer









          $endgroup$













          • $begingroup$
            do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
            $endgroup$
            – Threadid
            Feb 22 '18 at 17:27










          • $begingroup$
            That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
            $endgroup$
            – bradS
            Feb 23 '18 at 11:05











          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "557"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f28188%2fmissing-population-values-in-census-data%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0












          $begingroup$

          I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:




          1. mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

          2. there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?


          I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.






          share|improve this answer









          $endgroup$













          • $begingroup$
            do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
            $endgroup$
            – Threadid
            Feb 22 '18 at 17:27










          • $begingroup$
            That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
            $endgroup$
            – bradS
            Feb 23 '18 at 11:05
















          0












          $begingroup$

          I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:




          1. mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

          2. there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?


          I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.






          share|improve this answer









          $endgroup$













          • $begingroup$
            do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
            $endgroup$
            – Threadid
            Feb 22 '18 at 17:27










          • $begingroup$
            That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
            $endgroup$
            – bradS
            Feb 23 '18 at 11:05














          0












          0








          0





          $begingroup$

          I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:




          1. mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

          2. there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?


          I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.






          share|improve this answer









          $endgroup$



          I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:




          1. mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

          2. there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?


          I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Feb 22 '18 at 16:41









          bradSbradS

          55312




          55312












          • $begingroup$
            do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
            $endgroup$
            – Threadid
            Feb 22 '18 at 17:27










          • $begingroup$
            That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
            $endgroup$
            – bradS
            Feb 23 '18 at 11:05


















          • $begingroup$
            do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
            $endgroup$
            – Threadid
            Feb 22 '18 at 17:27










          • $begingroup$
            That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
            $endgroup$
            – bradS
            Feb 23 '18 at 11:05
















          $begingroup$
          do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
          $endgroup$
          – Threadid
          Feb 22 '18 at 17:27




          $begingroup$
          do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
          $endgroup$
          – Threadid
          Feb 22 '18 at 17:27












          $begingroup$
          That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
          $endgroup$
          – bradS
          Feb 23 '18 at 11:05




          $begingroup$
          That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
          $endgroup$
          – bradS
          Feb 23 '18 at 11:05


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Data Science Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          Use MathJax to format equations. MathJax reference.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f28188%2fmissing-population-values-in-census-data%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          How to label and detect the document text images

          Tabula Rosettana

          Aureus (color)