Missing population values in census data

I have population data from Census.gov:

Total US population by age by year from 1940 through 2010

Depending on the range of decades, the data is missing discrete population values for ages greater than a certain age. Instead an aggregate amount is provided that represents all ages greater than the cutoff.

Specifically it follows this pattern:

1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
and greater

1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater

The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.

Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.

And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.

Some Observations:

The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.

Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60

All available population values ages 0-99 for years 1940 through 2010

Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.

Average of age per each year and fitted curve of 4th order

Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.

enter image description here

Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.

Total population 1940 through 2010:

enter image description here

Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).

enter image description here

My question

This is where I could use some guidance to move forward:

When imputing discrete missing population values by age and year, how do I combine the fitted curve that models changes in population when age increases with the linear regression that models changes in population year over year?

Does one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?

If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

bumped to the homepage by Community♦ 7 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

I have population data from Census.gov:

Total US population by age by year from 1940 through 2010

Specifically it follows this pattern:

1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
and greater

1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater

The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.

Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.

And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.

Some Observations:

The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.

Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60

All available population values ages 0-99 for years 1940 through 2010

Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.

Average of age per each year and fitted curve of 4th order

Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.

enter image description here

Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.

Total population 1940 through 2010:

enter image description here

Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).

enter image description here

My question

This is where I could use some guidance to move forward:

Does one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?

If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

bumped to the homepage by Community♦ 7 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

I have population data from Census.gov:

Total US population by age by year from 1940 through 2010

Specifically it follows this pattern:

1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
and greater

1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater

The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.

Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.

And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.

Some Observations:

The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.

Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60

All available population values ages 0-99 for years 1940 through 2010

Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.

Average of age per each year and fitted curve of 4th order

Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.

enter image description here

Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.

Total population 1940 through 2010:

enter image description here

Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).

enter image description here

My question

This is where I could use some guidance to move forward:

Does one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?

If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

I have population data from Census.gov:

Total US population by age by year from 1940 through 2010

Specifically it follows this pattern:

1940 to 1979: Discrete data from 0 to 84 and aggregate for ages 85
and greater

1980 to 1999: Discrete data from 0 to 99 and aggregate for ages 100 and greater

2000 to 2010: Discrete data from 0 to 84 and aggregate for ages 85 and greater

The desired outcome is to have discrete data points for each age and year from 0-99 and then an aggregated lump sum figure for ages 100 and greater.

Therefore I want to input the missing discrete population values for ages 85 to 100 for years 1940 through 1979 and years 2000 through 2010.

And I want to use the actual discrete population values for ages 85 to 100 for years 1980 through 1989 to achieve that outcome.

Some Observations:

The pattern of missing values is MNAR (Missing Not At Random) - these were systematically omitted but the aggregate value representing the missing detail is provided

Population data for this time frame is deterministic: Population levels rise linearly each year; The duty cycle of a human body is finite and constraints and limits are well known.

Looking at the data we can see that each of the three subsets of years have very similar patterns. More variation in younger ages and variation flattens out for ages greater than 60

All available population values ages 0-99 for years 1940 through 2010

Then if we focus on the years 1980 through 1989 we can fit a nice curve for ages 0 through 100 with a Multiple-R-Squared of .979.

Average of age per each year and fitted curve of 4th order

Then if we narrow the focus to ages 60 to 100 and even narrower to ages 85 to 100 the Multiple-R-Squared increases to .9996.

enter image description here

Now if we flip our focus and look at the increasing levels of population we can observe that these relationships are linear. Population rises at a steady rate year over year.

Total population 1940 through 2010:

enter image description here

Ages 85 through 100 for years 1980 through 1999. Each age is linear. Each age has a slightly lower rate of increase (lesser slope).

enter image description here

My question

This is where I could use some guidance to move forward:

Does one or more documented methods naturally apply to the problem as I have described it?
For example: KNN, PCA, BPCA, Mean, MICE, other?

If there is recommended method can you point me to available R or Python packages and documentation that describes the mechanics of applying a given approach?

r missing-data data-imputation

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

edited Dec 21 '18 at 20:12

wacax

1,91021038

edited Dec 21 '18 at 20:12

wacax

1,91021038

edited Dec 21 '18 at 20:12

wacax

1,91021038

asked Feb 22 '18 at 15:09

Threadid

1062

asked Feb 22 '18 at 15:09

Threadid

1062

asked Feb 22 '18 at 15:09

Threadid

1062

bumped to the homepage by Community♦ 7 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

bumped to the homepage by Community♦ 7 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

1 Answer
1

active

oldest

votes

I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:

mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?

I would suggest a different approach. Actuaries have traditionally produced "life tables" which capture mortality in population cohorts. If you can find a set of tables which is applicable to the period in question, you could use these to calculate population numbers.

answered Feb 22 '18 at 16:41

bradS

55312

$begingroup$
do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
$endgroup$
– Threadid
Feb 22 '18 at 17:27

$begingroup$
That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
$endgroup$
– bradS
Feb 23 '18 at 11:05

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f28188%2fmissing-population-values-in-census-data%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:

mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?

answered Feb 22 '18 at 16:41

bradS

55312

$begingroup$
do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
$endgroup$
– Threadid
Feb 22 '18 at 17:27

$begingroup$
That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
$endgroup$
– bradS
Feb 23 '18 at 11:05

add a comment |

I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:

mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?

answered Feb 22 '18 at 16:41

bradS

55312

$begingroup$
do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
$endgroup$
– Threadid
Feb 22 '18 at 17:27

$begingroup$
That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
$endgroup$
– bradS
Feb 23 '18 at 11:05

add a comment |

I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:

mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?

answered Feb 22 '18 at 16:41

bradS

55312

I think you need to be wary of using curves to extrapolate beyond the age thresholds - specifically I think you should consider:

mortality increases with age; I would imagine it increases at an increasing rate with age (especially at higher ages). Would you be able to capture this effectively?

there is obviously some sort of overlap between populations in subsequent years (e.g. someone who is included in 1985 data could also be in 1986 data). What effect would this have on the data imputation?

answered Feb 22 '18 at 16:41

bradS

55312

answered Feb 22 '18 at 16:41

bradS

55312

answered Feb 22 '18 at 16:41

bradS

55312

answered Feb 22 '18 at 16:41

bradS

55312

$begingroup$
do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
$endgroup$
– Threadid
Feb 22 '18 at 17:27

$begingroup$
That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
$endgroup$
– bradS
Feb 23 '18 at 11:05

add a comment |

$begingroup$
do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.
$endgroup$
– Threadid
Feb 22 '18 at 17:27

$begingroup$
That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.
$endgroup$
– bradS
Feb 23 '18 at 11:05

do you have any thoughts on using the linear models for year over year growth by age? I have a linear equation for age 85 for the actual data from 1980-1999. If I provide 1950 as the independent variable then it will calculate a population value for age 85 for year 1950. This assumes that the linear relationship holds true over the entire period.

– Threadid
Feb 22 '18 at 17:27

That might be a good way to tackle the problem; a down-side would be the number of models you have to build (I'm assuming you plan on building a separate model for age 86, 87, and so on?). Maybe also consider non-linear curves? The life table approach I suggested above does something similar, although it uses year-on-year survival probabilities. So my thinking was to source an appropriate life table, and apply the survival probabilities to extrapolate beyond the truncation.

– bradS
Feb 23 '18 at 11:05

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk