2. Data workflow
The data treatments applied to the raw census data aim to provide a harmonized version of the census data from the different countries that can be explored easily through the dashboard.
- Goal: Census data from different countries come with different schemas, structures and formats, and the definition of the administrative levels varies between countries. To integrate the different census data into a single platform, we harmonize them by defining a common schema, format and storage structure.
- Input: Raw census data stored in the AWS S3 bucket with the following file storage structure: s3/cities-socio-economic-vulnerability/data/raw/{country_name}/census/{year}/raw/country-census.csv
- Output:
  - Census data harmonized according to the standardized census data schema described in this wiki page. The output files are stored in the AWS S3 bucket with the following file storage structure: s3/cities-socio-economic-vulnerability/data/{environment}/{country_name}/census/{year}/census_{country_name}_{admin_level}_{year}.csv.
  - Table containing the definitions of the census variables
- Transformations:
  - Subset and rename census variables
  - Split the census data depending on the geographical level
  - Store the data in CSV format in AWS file storage
- Link to the scripts:
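
The census transformations listed above can be sketched as follows. This is only an illustration in standard-library Python, not the project's actual script: the `RENAME` mapping, the `LEVEL_COLUMNS` names, the sample rows and the rule used to infer a row's geographical level (the finest identifier column that is filled in) are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping from raw column names to harmonized names.
RENAME = {"POBTOT": "population_total", "POBFEM": "population_female"}
# Hypothetical identifier columns, ordered from coarsest to finest level.
LEVEL_COLUMNS = ["state", "municipality"]


def row_level(row):
    """Return the finest level whose identifier is filled in (assumed rule)."""
    level = None
    for column in LEVEL_COLUMNS:
        if row.get(column, "").strip():
            level = column
    return level


def harmonize(rows):
    """Subset and rename the variables, then group the rows by level."""
    by_level = defaultdict(list)
    for row in rows:
        level = row_level(row)
        if level is None:
            continue  # no identifier at all: skip the row
        out = {column: row[column] for column in LEVEL_COLUMNS}
        out.update({new: row[old] for old, new in RENAME.items()})
        by_level[level].append(out)
    return dict(by_level)


def to_csv(rows):
    """Serialize one level's rows to CSV text, ready to be saved or uploaded."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample standing in for a raw census file.
raw = io.StringIO(
    "state,municipality,POBTOT,POBFEM,UNUSED\n"
    "01,,100,52,x\n"
    "01,001,60,31,y\n"
)
tables = harmonize(csv.DictReader(raw))
for level, rows in tables.items():
    print(level)
    print(to_csv(rows))
```

Each entry of `tables` corresponds to one output file per geographical level; the CSV text can then be uploaded to the bucket.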
- Input: Raw boundaries data stored in the AWS S3 bucket with the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/raw/boundary_level.shp
- Output: Boundaries data processed and transformed according to the standardized boundaries data schema described in this wiki page. The output files are stored in the AWS S3 bucket with the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/processed/level.geojson.
- Transformations:
  - Simplify geometry in order to reduce file sizes
  - Rename variables
  - Store the data in GeoJSON format in AWS file storage
- Link to the scripts:
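
The page does not say which algorithm is used to simplify the geometries. As an illustration only, the sketch below applies the Douglas-Peucker algorithm, one common choice, to a single line or polygon ring given as a list of `(x, y)` tuples. The function names and the tolerance value are hypothetical, and in practice a GIS library routine would normally be used instead of hand-written code.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all points are (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. a closed ring): distance to the single point.
        return math.hypot(px - ax, py - ay)
    # Projection of p on the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_line(points, tolerance):
    """Drop vertices that deviate from the overall shape by at most `tolerance`."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = point_segment_distance(points[i], first, last)
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= tolerance:
        return [first, last]
    # Keep the farthest vertex and simplify both halves recursively.
    left = simplify_line(points[: index + 1], tolerance)
    right = simplify_line(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 0)]
print(simplify_line(line, tolerance=0.1))
```

A larger tolerance removes more vertices and gives smaller files, at the cost of less precise boundaries.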
- Input: Processed census and boundaries data
- Output: GeoJSON files including both the boundaries' simplified geometries and the associated census variables. The output files are stored in the AWS S3 bucket with the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/census_geo/year/level_name/state_name/census_geo.geojson.
- Transformations:
  - Join census with geometries
  - Split data by state and store the outputs according to the defined file storage structure
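
The join and split steps above can be sketched with plain GeoJSON dictionaries. This is a minimal example under stated assumptions: the property names `code` and `state_code`, the helper names and the sample features are hypothetical, and the join keeps only the geometries that have a matching census row.

```python
import json
from collections import defaultdict


def join_census(features, census_rows, key="code"):
    """Attach census values to the GeoJSON features sharing the same `key`."""
    census_by_key = {row[key]: row for row in census_rows}
    joined = []
    for feature in features:
        row = census_by_key.get(feature["properties"].get(key))
        if row is None:
            continue  # no census row for this geometry: drop it (inner join)
        properties = {**feature["properties"], **row}
        joined.append({**feature, "properties": properties})
    return joined


def split_by_state(features, state_key="state_code"):
    """Build one FeatureCollection per state code."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature["properties"][state_key]].append(feature)
    return {
        state: {"type": "FeatureCollection", "features": items}
        for state, items in groups.items()
    }


def make_feature(code, state_code):
    """Build a made-up sample feature with a dummy point geometry."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [0, 0]},
        "properties": {"code": code, "state_code": state_code},
    }


features = [make_feature("a", "01"), make_feature("b", "02"), make_feature("c", "02")]
census = [{"code": "a", "population_total": 100}, {"code": "b", "population_total": 60}]

joined = join_census(features, census)
by_state = split_by_state(joined)
for state, collection in by_state.items():
    # Each collection would be written to .../{state}/census_geo.geojson
    print(state, json.dumps(collection)[:60])
```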
For this first version of the socio-economic-vulnerability project, all the data is stored in this AWS S3 bucket: s3://cities-socio-economic-vulnerability/data/. All data stored in this repository needs to be public so that it can be shared easily between different users and applications.
Environment | Description |
---|---|
raw | The raw data, or any data that needs to be shared but is not ready to use within the dashboard. The data doesn't need to follow a specific schema or format |
dev | The dev folder contains data used for the in-development version of the dashboard |
prd | The prd folder contains the data used for the in-production version of the dashboard. This repository follows the same structure as the dev repository |
The structure of the data repository is as follows:
- raw:
  - country_name_1
    - boundaries
    - census
  - country_name_2
  - ...
- dev:
  - country_name_1
    - boundaries
      - boundary_admin-level-1.geojson
      - boundary_admin-level-1_simplified.geojson
      - boundary_admin-level-2.geojson
      - boundary_admin-level-2_simplified.geojson
      - ...
    - census
      - year 1
        - census_country-name_admin-level-1_year.csv
        - census_country-name_admin-level-2_year.csv
        - ...
      - year 2
        - census_country-name_admin-level-1_year.csv
        - census_country-name_admin-level-2_year.csv
        - ...
    - census_geo
      - year 1
        - admin_level_1
          - state-1-code
            - census_geo.geojson
          - state-2-code
            - census_geo.geojson
          - ...
          - All
            - census_geo.geojson
        - admin_level_2
          - ...
- prd
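
To avoid typing object keys by hand, the layout above can be captured in a few small helpers. The sketch below is hypothetical (the function names are not part of the project) and only covers the `dev` and `prd` environments, which share the same structure; the country name and year in the usage lines are example values.

```python
from pathlib import PurePosixPath


def base_path(environment, country):
    """Return data/{environment}/{country}, checking the environment name."""
    if environment not in ("dev", "prd"):
        raise ValueError("expected 'dev' or 'prd', got %r" % environment)
    return PurePosixPath("data") / environment / country


def census_key(environment, country, year, admin_level):
    """Key of the harmonized census CSV for one administrative level."""
    name = f"census_{country}_{admin_level}_{year}.csv"
    return str(base_path(environment, country) / "census" / str(year) / name)


def boundary_key(environment, country, admin_level, simplified=False):
    """Key of a boundary file, optionally its simplified version."""
    suffix = "_simplified" if simplified else ""
    name = f"boundary_{admin_level}{suffix}.geojson"
    return str(base_path(environment, country) / "boundaries" / name)


def census_geo_key(environment, country, year, admin_level, state_code="All"):
    """Key of the joined census and geometry file for one state (or 'All')."""
    folder = base_path(environment, country) / "census_geo" / str(year)
    return str(folder / admin_level / state_code / "census_geo.geojson")


# Example values only: "mexico" and 2020 are placeholders.
print(census_key("dev", "mexico", 2020, "admin-level-1"))
print(boundary_key("dev", "mexico", "admin-level-2", simplified=True))
print(census_geo_key("prd", "mexico", 2020, "admin_level_1"))
```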
# load library
library(aws.s3)
# set keys
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "my_access_key_id",
  "AWS_SECRET_ACCESS_KEY" = "my secret key",
  "AWS_DEFAULT_REGION" = "storage_region"
)
# file_path = path to the data as stored locally
# object_path = path to the data storage file in the bucket
# bucket_name = name of the target bucket
put_object(file = file_path,
           object = object_path,
           bucket = bucket_name,
           acl = "public-read",
           multipart = TRUE)
-
The first step is to install the boto3 module using your preferred method (pip, conda).
-
To access S3, we first need to authenticate by creating a ServiceResource object with the following code:
import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)
- The service name should be s3.
- The region must be the one in which the S3 bucket is hosted
- aws_access_key_id and aws_secret_access_key are the credentials that should have been provided to you when the account was created
Once authenticated, you can access the different buckets in S3. To list all available buckets, use the following code:
for bucket in s3.buckets.all():
    print(bucket.name)
- To upload a file, one can use the following line of code
s3.Bucket(bucket_name).upload_file(Filename=local_file, Key=server_file)
- bucket_name should be one of the available buckets that were printed before
- local_file is the path to the local file that has to be uploaded to s3
- server_file is the path to where the file will be uploaded in the bucket on s3
Example:
s3.Bucket('cities-socio-economic-vulnerability').upload_file(Filename='census.csv', Key='data/mexico/census/census.csv')
Mexico census raw data is available as tabular CSV files, one for each Mexican state. Each file covers a single state and combines statistics at various resolutions: block, basic geostatistical area (AGEB), locality, municipality and state. Each census file provides a list of 222 census variables.
Field name | Description |
---|---|
ENTIDAD | State code |
NOM_ENT | State name |
MUN | Municipality code |
NOM_MUN | Municipality name |
LOC | Locality code |
NOM_LOC | Locality name |
AGEB | Basic Geostatistical Area code |
MZA | Block number by AGEB |
POBTOT | Total population count |
POBFEM | Population by gender (women) |
POBMAS | Population by gender (men) |
P_0A2 | Population by age (0 to 2 years) |
P_0A2_F | Population by age and gender (Female Population 0 to 2 years) |
P_0A2_M | Population by age and gender (Male Population 0 to 2 years) |
P_3YMAS | Population by age (Population aged 3 years and over) |
P_3YMAS_F | Population by age and gender (Female Population aged 3 years and over) |
P_3YMAS_M | Population by age and gender (Male Population aged 3 years and over) |
... | ... |
A sample of Mexico census data is provided in this csv file.
Different data treatments are applied to the raw census data to provide the final harmonized census output table:
- Census variables selection and renaming: Only a subset of the census variables provided by the raw data is extracted. The list of selected variables is provided in this mapping parameter table.
- Census aggregation: Census values are aggregated at different geographical levels (state, municipality, locality, ageb, and block)
- Data formatting: The different aggregated tables are merged into a single CSV file according to the defined data model. An example of the census output table is provided in this csv file.
The aggregated and processed data is stored in one CSV file combining the different geographical scales. To make it possible to filter census information by variable, the dataset is pivoted to a long format, so that each row corresponds to the value of one census variable for one geographical unit. The table below describes the defined data model:
Field name | Type | Description |
---|---|---|
Level | string | The geographical level corresponding to the variables' values. |
State | string | The state name of the different geographical units. |
Municipality | string | The municipality name of the different geographical units. |
Locality | string | The locality name of the different geographical units. |
AGEB code | string | The Basic Geostatistical Area code. |
Block code | string | The block code. |
Census variable | string | Census variable name. |
Year | string | Year of the census data used. |
Value | float | Census variable value. |
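
As an illustration of the long format described above, the sketch below pivots wide rows (one column per census variable) into rows following the data model. It is an assumption-laden example rather than the project's code: the `to_long` helper, the sample row and its values are made up, and every census value is assumed to be numeric.

```python
# Identifier fields of the data model; every other column is treated
# as a census variable.
ID_FIELDS = ["Level", "State", "Municipality", "Locality", "AGEB code", "Block code"]


def to_long(wide_rows, year):
    """Return one row per (geographical unit, census variable) pair."""
    long_rows = []
    for row in wide_rows:
        identifiers = {field: row.get(field, "") for field in ID_FIELDS}
        for name, value in row.items():
            if name in ID_FIELDS:
                continue
            long_rows.append({
                **identifiers,
                "Census variable": name,
                "Year": str(year),
                "Value": float(value),  # assumes numeric values only
            })
    return long_rows


# Made-up wide row at the state level; year 2020 is a placeholder.
wide = [{"Level": "state", "State": "state-1", "POBTOT": "100", "POBFEM": "52"}]
for record in to_long(wide, 2020):
    print(record)
```

With this layout, filtering the dashboard by variable only requires selecting the rows whose `Census variable` matches.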