This package provides a high-level wrapper for generating synthetic populations via Census APIs based on the American Community Survey (ACS) 5-Year Estimates. Synthetic populations are virtual representations of people and households produced for small census areas (block groups, tracts) and can be attributed by a variety of demographic, economic, social, worker, student, mobility, housing, health, and communication characteristics found in the ACS.
Synthetic populations are generated by allocating records from the ACS Public Use Microdata Sample (PUMS) from their native spatial resolution of Public-Use Microdata Areas (100,000+ people) to small census areas (typically <8000 people) such that the aggregate characteristics of people and households align closely with population profiles of the small census areas available in the ACS Summary File (SF). This is accomplished using Penalized Maximum-Entropy Dasymetric Modeling (P-MEDM), which seeks to recreate the error variances on each small-area variable estimate in the ACS SF. LiveLike makes it simple to design and solve P-MEDM problems by fetching all of the necessary P-MEDM inputs for a given PUMA via Census APIs.
The bulk of P-MEDM setup is handled automatically by the `acs` module via the Census Microdata API.
In a basic use case, the inputs are simply:
- The 2010 or 2020 PUMA ID (`<State FIPS> + <PUMA FIPS>`, as shown here).
- A Census API key (optional).
Examples are provided in the `notebooks` directory.
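As a minimal sketch of how a PUMA ID can be assembled from its parts, the snippet below concatenates a state FIPS code with a PUMA code. The helper name `make_puma_id` is hypothetical and not part of LiveLike; the sketch assumes the standard Census widths of 2 digits for the state and 5 digits for the PUMA, and the sample values are for illustration only.

```python
def make_puma_id(state_fips, puma_code):
    """Concatenate a 2-digit state FIPS code and a 5-digit PUMA code.

    Hypothetical helper for illustration; not part of LiveLike.
    """
    # Zero-pad so that inputs like 6 or "1603" keep their leading zeros.
    state = str(state_fips).zfill(2)
    puma = str(puma_code).zfill(5)
    if not (state.isdigit() and puma.isdigit()):
        raise ValueError("state and PUMA codes must be numeric")
    if len(state) != 2 or len(puma) != 5:
        raise ValueError("expected a 2-digit state and a 5-digit PUMA code")
    return state + puma


puma_id = make_puma_id("47", "1603")
print(puma_id)  # 4701603
```

Keeping the ID as a string rather than an integer preserves leading zeros for states with single-digit FIPS codes.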
P-MEDM requires a target geography and an aggregate geography to account for error variances. The selected target geography determines the aggregate geography:
Level | Code | Population (approx.) | Aggregate |
---|---|---|---|
Block group | `bg` | 600 - 3000 | Tract |
Tract | `trt` | 1200 - 8000 | Supertract |
LiveLike handles tracts, which have no sub-county aggregation level, using a regionalization approach to generate custom "supertracts" (see `notebooks/tract_supertract_2019.ipynb` for an example).
The ACS 5-Year Estimates are a rolling 5% sample of the United States population weighted to be representative of the release year (vintage), with additional adjustments for factors like income. LiveLike uses the ACS 2019 5-Year Estimates as its default vintage.
Year | Vintage | Available |
---|---|---|
2016 | ACS 2012 - 2016 5-Year Estimates | ✅ |
2017 | ACS 2013 - 2017 5-Year Estimates | ✅ |
2018 | ACS 2014 - 2018 5-Year Estimates | ✅ |
2019 | ACS 2015 - 2019 5-Year Estimates | ✅ |
2020 | ACS 2016 - 2020 5-Year Estimates | ❌ |
2021 | ACS 2017 - 2021 5-Year Estimates | ❌ |
2022 | ACS 2018 - 2022 5-Year Estimates | ❌ |
2023 | ACS 2019 - 2023 5-Year Estimates | ✅ |
Currently, the 2016 through 2019 vintages and the 2023 vintage are supported. The 2020 - 2022 gap is due to mixed geography problems that P-MEDM cannot directly handle: the 2020 and 2021 vintages pair 2010 PUMAs with 2020 small areas, and the 2022 vintage pairs a mixture of 2010 and 2020 PUMAs with 2020 small areas.
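The availability rules above can be expressed as a small validation step. This is only a sketch: the names `SUPPORTED_YEARS`, `DEFAULT_YEAR`, and `check_vintage` are hypothetical and do not reflect how LiveLike checks vintages internally.

```python
# Supported release years taken from the availability table above.
SUPPORTED_YEARS = frozenset({2016, 2017, 2018, 2019, 2023})
DEFAULT_YEAR = 2019


def check_vintage(year=DEFAULT_YEAR):
    """Return the vintage label for a supported year, else raise ValueError.

    Hypothetical helper for illustration; not part of LiveLike.
    """
    if year not in SUPPORTED_YEARS:
        raise ValueError(
            f"ACS {year} 5-Year Estimates are not supported; "
            f"choose one of {sorted(SUPPORTED_YEARS)}"
        )
    # A 5-Year vintage covers the release year and the four years before it.
    return f"ACS {year - 4} - {year} 5-Year Estimates"


print(check_vintage())  # ACS 2015 - 2019 5-Year Estimates
```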
P-MEDM constraints are sets of residential and population characteristics common between the ACS SF and PUMS that can be used to design a P-MEDM model and attribute the synthetic population. LiveLike provides several configurations of prebuilt constraints:
- Base (default): Baseline modeling constraints representing population totals, routine daily activities (workers, students), and mobility characteristics, available in `config.up_base_constraints_selection`.
- Expanded: Baseline modeling constraints with a selection of demographic, social, economic, and housing characteristics, available in `config.up_expanded_constraints_selection`.

The Base constraints can be overwritten by the Expanded ones using:
from config import up_expanded_constraints_selection
acs.puma(..., constraints_selection=up_expanded_constraints_selection)
Several additional constraint themes (health, communications) are available outside the prebuilt configurations and can be added onto a custom constraints selection.
Theme | Description | Base | Expanded | Notes |
---|---|---|---|---|
universe | Sampling universe totals (population, civilian noninstitutionalized population, group quarters population, housing units, occupied housing units). | x | x | |
worker | Worker characteristics (employment, class of worker, industry, occupation, hours worked per week). | x | x | |
student | Student characteristics (grade level attending, public/private school). | x | x | |
mobility | Mobility characteristics (commute time/mode, vehicles available). | x | x | |
demographic | Basic demographics (sex, age) and living arrangement characteristics. | | x | Expanded: Sex by age and household type only |
social | Social characteristics (race/ethnicity, language, place of birth, veteran status). | | x | Expanded: Race/ethnicity only |
economic | Economic characteristics (household income, poverty, educational attainment). | | x | Expanded: Household income and income to poverty ratio only |
housing | Housing characteristics (tenure, dwelling type, year built, number of rooms, house heating fuel). | | x | Expanded: Dwelling type and year built only |
health | Health insurance coverage type. | | | |
communications | Household internet access. | | | |
Constraint selections are passed to `acs.puma(constraint_selection=...)` as a `dict` with keys representing ACS variable themes and values representing specific subjects (tables). If the value passed is a `bool` type, a `True` value will include variables for all subjects in the theme, while a `False` value will bypass that theme (the same as omitting the theme from the selection). If the value passed is a `list` type, only listed subjects will be included in the result.
Example:
custom_constraints_selection = {
    "universe": True,
    "worker": True,
    "student": True,
    "mobility": True,
    "demographic": [
        "sex_age",
        "hhtype",
    ],
    "economic": [
        "hhinc",
        "ipr",
    ],
    "health": True,
    "communications": True,
}
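- Use all variables listed under the `universe`, `worker`, `student`, `mobility`, `health`, and `communications` themes.
- Use only sex by age (`sex_age`) and household type (`hhtype`) from the `demographic` theme.
- Use only household income (`hhinc`) and income to poverty ratio (`ipr`) from the `economic` theme.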
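The `bool` and `list` rules described above can be sketched as a small resolver. This is an illustration rather than LiveLike's implementation: `resolve_selection` and `CATALOG` are hypothetical, and apart from `hhinc` and `ipr` the subject names in the toy catalog are invented.

```python
# Hypothetical theme -> subjects catalog. Apart from "hhinc" and "ipr",
# the subject names here are invented for illustration.
CATALOG = {
    "universe": ["population", "housing_units"],
    "economic": ["hhinc", "ipr", "edu"],
    "health": ["hicov"],
}


def resolve_selection(selection, catalog):
    """Expand a constraints selection into explicit subjects per theme."""
    resolved = {}
    for theme, value in selection.items():
        if theme not in catalog:
            raise KeyError(f"unknown theme: {theme}")
        if isinstance(value, bool):
            # True keeps every subject in the theme; False skips the theme.
            if value:
                resolved[theme] = list(catalog[theme])
        elif isinstance(value, list):
            unknown = [s for s in value if s not in catalog[theme]]
            if unknown:
                raise ValueError(f"unknown subjects for {theme}: {unknown}")
            # A list keeps only the listed subjects.
            resolved[theme] = list(value)
        else:
            raise TypeError(f"{theme}: expected bool or list")
    return resolved


selection = {"universe": True, "economic": ["hhinc", "ipr"], "health": False}
resolved = resolve_selection(selection, CATALOG)
print(resolved)
# {'universe': ['population', 'housing_units'], 'economic': ['hhinc', 'ipr']}
```

Passing `False` and omitting a theme produce the same result, which matches the behavior described for the selection `dict`.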
The constraints file (`livelike/data/constraints.csv`) underlies the constraint selection process, describing relationships between available PUMS variables, P-MEDM constraints, and ACS Summary File (SF) variables, as well as year of availability for constraints. It is used to generate individual-level representations of ACS SF tables/variables based on PUMS data.
- `level`: PUMS file level (`person` or `household`).
- `geo_base_level`: Baseline geography for which the constraint is available (`bg`: block group; `trt`: tract).
- `theme`: Constraint topics/themes. Each theme points to a PUMS/SF crosswalking function in `livelike.pums`.
- `subject`: The subject of the ACS SF table to be represented at the individual level using PUMS data. This column references the function in the `pums` module used to produce a P-MEDM constraint.
- `constraint`: P-MEDM constraining variable name.
- `pums[1...n]`: Multiple columns containing the PUMS variables associated with each P-MEDM constraint table. These are parsed using a regex search for any columns in the file beginning with `pums`.
- `code`: ACS SF variable codes matching each P-MEDM constraint.
- `desc`: P-MEDM constraining variable longform description.
- `begin_year`: The initial year in which the constraint was available.
- `end_year`: The final year in which the constraint was available.
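To make the column layout concrete, the following sketch reads a toy table with the same headers, finds the `pums*` columns with a regex, and keeps the rows available in a given year. The rows and the helper `load_constraints` are invented for illustration, and treating `end_year` as inclusive is an assumption.

```python
import csv
import io
import re

# Toy stand-in for livelike/data/constraints.csv; the rows are invented.
TOY_CSV = """level,geo_base_level,theme,subject,constraint,pums1,pums2,code,desc,begin_year,end_year
person,bg,demographic,sex_age,male_under5,SEX,AGEP,B01001_003E,Male under 5 years,2016,2023
household,bg,economic,hhinc,hhinc_lt10k,HINCP,,B19001_002E,Household income under 10k,2016,2019
"""


def load_constraints(text, year):
    """Return the pums* column names and the rows available in `year`."""
    reader = csv.DictReader(io.StringIO(text))
    # Regex search for any column whose name begins with "pums".
    pums_cols = [c for c in reader.fieldnames if re.match(r"^pums", c)]
    rows = []
    for row in reader:
        # Assumption: begin_year and end_year are both inclusive.
        if int(row["begin_year"]) <= year <= int(row["end_year"]):
            # Collect the non-empty PUMS variables for this constraint.
            row["pums_vars"] = [row[c] for c in pums_cols if row[c]]
            rows.append(row)
    return pums_cols, rows


pums_cols, rows = load_constraints(TOY_CSV, 2023)
print(pums_cols)  # ['pums1', 'pums2']
print([(r["constraint"], r["pums_vars"]) for r in rows])
# [('male_under5', ['SEX', 'AGEP'])]
```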
Using a Census API Key is optional but is recommended to avoid hitting request limits.
- Register for a Census API Key.
- Activate your key via the confirmation email link you receive.
- In the top directory of `livelike`, run:
echo YOUR_CENSUS_API_KEY > censusapikey.txt
The file that is created, `censusapikey.txt`, is not tracked by `git`. This ensures that your personal API key is never exposed on a remote branch.
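For scripts of your own, a key file like this can be read with a few lines of Python. The helper `read_census_api_key` is hypothetical and is not LiveLike's loader; it returns `None` when the file is missing or empty, reflecting that the key is optional.

```python
from pathlib import Path


def read_census_api_key(path="censusapikey.txt"):
    """Return the key stored in `path`, or None if the file is absent/empty.

    Hypothetical helper for illustration; not part of LiveLike.
    """
    key_file = Path(path)
    if not key_file.is_file():
        return None
    # Strip the trailing newline that `echo` writes.
    key = key_file.read_text(encoding="utf-8").strip()
    return key or None


api_key = read_census_api_key()
print("key found" if api_key else "no key; requests may be rate limited")
```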
Utilities for population synthesis can be found in the `homesim` module. Our current approach is to sample from the P-MEDM allocation matrix (
The `multi` module provides utilities for population synthesis across multiple PUMAs, including:
- Making PUMA instances across multiple geographies or replicates (alternative PUMS weights)
- Population synthesis
- Querying and extracting PUMS descriptors from the Census Microdata API
The scripts to rebuild test data are stored in the `utilities` directory. Execute them from the main directory, for example:
python utilities/prep_test_build_puma.py
python utilities/prep_test_notebook_solutions.py
To run the testing suite locally, enter:
bash run_tests.sh
The default P-MEDM solver, `pymedm`, gives different solutions when constraint order varies. This appears to be tied to floating-point underflow errors in `jax`, a core dependency of `pymedm`, which seem to depend on the positions of the model input variables. LiveLike addresses this for both prebuilt and custom constraints by implementing a method in the `puma` constructor that consistently sorts constraints by `theme` and `code`.
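The idea of a deterministic constraint order can be sketched as follows. This is not the method in the `puma` constructor: `sort_constraints` is a hypothetical helper, and the records are toy examples that carry only the two sort keys.

```python
from operator import itemgetter

# Toy constraint records with only the two sort keys.
constraints = [
    {"theme": "worker", "code": "B23025_004E"},
    {"theme": "universe", "code": "B25001_001E"},
    {"theme": "universe", "code": "B01001_001E"},
]


def sort_constraints(records):
    """Return records ordered by theme, then by code within each theme."""
    return sorted(records, key=itemgetter("theme", "code"))


ordered = sort_constraints(constraints)
print([r["code"] for r in ordered])
# ['B01001_001E', 'B25001_001E', 'B23025_004E']

# Any permutation of the input yields the same order, so the solver
# would always see its inputs in the same positions.
assert sort_constraints(list(reversed(constraints))) == ordered
```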
In rare cases, the values of PUMS replicate household weights can be negative. For compatibility with P-MEDM, we zero out these negative values. See this thread for further details.
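Zeroing out negative weights amounts to clipping at zero. The sketch below uses a plain list and a hypothetical helper, `zero_negative_weights`, rather than LiveLike's own data structures.

```python
def zero_negative_weights(weights):
    """Return a copy of the weights with negative values replaced by 0."""
    return [w if w >= 0 else 0 for w in weights]


# Invented replicate weights for five households.
replicate_weights = [12, -3, 0, 45, -1]
cleaned = zero_negative_weights(replicate_weights)
print(cleaned)  # [12, 0, 0, 45, 0]
```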
The P-MEDM `population` constraint is approximated as a sum of the ratio of each household member's person weight (`PWGTP`) to the head of household's weight (which itself roughly matches the household weight). When the head of household's replicate person weight is less than one, we use a placeholder value of 1 so that each additional household member still contributes to the `population` constraint for the household. We welcome community contributions for more robust improvements to this approach.