2. Data workflow

Saif Shabou edited this page Aug 5, 2022 · 12 revisions

Data workflow v1

The data treatments applied to the raw census data aim to provide a harmonized version of the census data from the different countries that can be explored easily through the dashboard.

Harmonize census data

  • Goal: Countries' census data come in various schemas, structures, and formats, and the definition of administrative levels varies between countries. In order to integrate the different census datasets into a single platform, we need to harmonize them by defining a common schema, format, and storage structure.

  • Input: Raw census data stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/data/raw/{country_name}/census/{year}/raw/country-census.csv

  • Output:

    • Census data harmonized with respect to the standardized census data schema described in this wiki page. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/data/{environment}/{country_name}/census/{year}/census_{country_name}{admin_level}{year}.csv.
    • Table containing the definitions of census variables
  • Transformations:

    • Subset and rename census variables
    • Split the census data depending on the geographical level
    • Store the data in csv format in aws file storage
    • Link to the scripts:
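The subset, rename, and split transformations above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names, the `COLUMN_MAP`, and the `admin_level` column are all illustrative assumptions.

```python
# Hypothetical sketch of the harmonization step: subset and rename raw
# census columns, then split the table by administrative level.
import pandas as pd

# Mapping from raw column names to the harmonized schema (illustrative).
COLUMN_MAP = {"ENTIDAD": "state_code", "POBTOT": "total_population"}

def harmonize(raw: pd.DataFrame, level_col: str = "admin_level") -> dict:
    """Subset/rename columns, then split the table by administrative level."""
    df = raw[[level_col, *COLUMN_MAP]].rename(columns=COLUMN_MAP)
    return {level: part.drop(columns=level_col).reset_index(drop=True)
            for level, part in df.groupby(level_col)}

raw = pd.DataFrame({
    "admin_level": ["state", "state", "municipality"],
    "ENTIDAD": ["01", "02", "01"],
    "POBTOT": [100, 200, 50],
    "UNUSED": [0, 0, 0],          # dropped by the subset step
})
parts = harmonize(raw)
# Each value in `parts` would then be written to its own csv file in S3,
# one per administrative level.
```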

Process boundaries

  • Input: Raw boundaries data stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/raw/boundary_level.shp
  • Output: Boundaries data processed and transformed with respect to the standardized boundaries data schema described in this wiki page. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/processed/level.geojson.
  • Transformations:
    • Simplify geometries in order to reduce file sizes
    • Rename variables
    • Store the data in GeoJSON format in AWS file storage
    • Link to the scripts:
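The simplification step drops vertices that contribute little to a shape so that the output GeoJSON files stay small. In practice a library such as GeoPandas/Shapely (`geometry.simplify(tolerance)`) would do this; the dependency-free sketch below of the classic Ramer-Douglas-Peucker algorithm only illustrates the idea on a line of 2-D points.

```python
# Minimal Ramer-Douglas-Peucker simplification (illustrative only; the
# real pipeline would use a geospatial library on actual geometries).

def point_line_dist(p, a, b):
    """Perpendicular distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # Twice the triangle area divided by the base length gives the height.
    return abs(dx * (ay - py) - (ax - px) * dy) / (dx * dx + dy * dy) ** 0.5

def simplify(points, tolerance):
    """Drop vertices closer than `tolerance` to the chord of their span."""
    if len(points) < 3:
        return list(points)
    dists = [point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    imax = max(range(len(dists)), key=dists.__getitem__)
    if dists[imax] <= tolerance:
        return [points[0], points[-1]]
    left = simplify(points[: imax + 2], tolerance)
    right = simplify(points[imax + 1:], tolerance)
    return left[:-1] + right

# Small wiggles below the tolerance collapse onto the chord.
line = [(0, 0), (1, 0.05), (2, -0.05), (3, 0.04), (4, 0)]
simplified = simplify(line, tolerance=0.1)
```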

Join census with boundaries

  • Input: Processed census and boundaries data
  • Output: GeoJSON files including both boundaries' simplified geometries and the associated census variables. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/census_geo/year/level_name/state_name/census_geo.geojson.
  • Transformations:
    • Join census with geometries
    • Split data by state and store the outputs with respect to the defined file storage structure
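The join-and-split steps above can be sketched with plain pandas. The real pipeline works on GeoJSON geometries; here a placeholder "geometry" string and a shared `unit_id` column stand in for them, both illustrative assumptions.

```python
# Hypothetical sketch of the census/boundaries join and the split by state.
import pandas as pd

census = pd.DataFrame({
    "unit_id": ["A", "B"],
    "state": ["Jalisco", "Sonora"],
    "total_population": [100, 200],
})
boundaries = pd.DataFrame({
    "unit_id": ["A", "B"],
    "geometry": ["POLYGON(...)", "POLYGON(...)"],
})

# Join census values onto their geometries by a shared unit identifier.
census_geo = census.merge(boundaries, on="unit_id", how="left")

# Split by state; each part would then be written to
# .../census_geo/year/level_name/state_name/census_geo.geojson
by_state = {state: part for state, part in census_geo.groupby("state")}
```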

Data storage

Repository structure

For this first version of the socio-economic-vulnerability project, all the data is stored in this AWS S3 bucket: s3://cities-socio-economic-vulnerability/data/. All data stored in this repository needs to be public in order to make sharing between different users and applications easy.

Environment Description
raw The raw data, or any data that we need to share that is not yet ready for use within the dashboard. This data doesn't need to follow specific schemas or formats
dev The dev folder contains data used for the in-development version of the dashboard
prd The prd folder contains the data used for the in-production version of the dashboard. This repository follows the same structure as the dev repository

The structure of the data repository is as follows:

  • raw:
    • country_name_1
      • boundaries
      • census
    • country_name_2
    • ...
  • dev:
    • country_name_1
      • boundaries
        • boundary_admin-level-1.geojson
        • boundary_admin-level-1_simplified.geojson
        • boundary_admin-level-2.geojson
        • boundary_admin-level-2_simplified.geojson
        • ...
      • census
        • year 1
          • census_country-name_admin-level-1_year.csv
          • census_country-name_admin-level-2_year.csv
          • ...
        • year 2
          • census_country-name_admin-level-1_year.csv
          • census_country-name_admin-level-2_year.csv
          • ...
      • census_geo
        • year 1
          • admin_level_1
            • state-1-code
              • census_geo.geojson
            • state-2-code
              • census_geo.geojson
            • ...
            • All
              • census_geo.geojson
          • admin_level_2
          • ...
  • prd
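For illustration, a small helper (hypothetical, not part of the project code) shows how the census object keys under dev or prd are composed, following the file naming shown in the tree above:

```python
# Hypothetical helper composing an S3 object key for a harmonized census
# file; the function name and signature are illustrative only.
def census_key(env: str, country: str, year: int, admin_level: str) -> str:
    return (f"{env}/{country}/census/{year}/"
            f"census_{country}_{admin_level}_{year}.csv")

key = census_key("dev", "mexico", 2020, "admin-level-1")
```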

v2

Data access

Using R

# load library
library(aws.s3)

# set keys
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "my_access_key_id",
  "AWS_SECRET_ACCESS_KEY" = "my secret key",
  "AWS_DEFAULT_REGION" = "storage_region"
)

# file_path = path to the data as stored locally
# object_path = path to the data storage file in the bucket

put_object(file = file_path, 
           object = object_path, 
           bucket = bucket_name,
           acl = "public-read",
           multipart = TRUE)

Using python

  1. The first step is to install the module boto3 using your preferred method (pip, conda).

  2. To access S3, we first need to authenticate by defining a ServiceResource object using the following code

import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)
  • The service name should be s3.
  • The region needs to be set to the one associated with the S3 bucket.
  • aws_access_key_id and aws_secret_access_key are credentials that should have been transmitted to you when the account was created.

After authenticating, you can access the different buckets in S3. To see a list of all available buckets, you can use the following code

for bucket in s3.buckets.all():
    print(bucket.name)
  3. To upload a file, one can use the following line of code
s3.Bucket(bucket_name).upload_file(Filename=local_file, Key=server_file)
  • bucket_name should be one of the available buckets that were printed before
  • local_file is the path to the local file that has to be uploaded to s3
  • server_file is the path to where the file will be uploaded in the bucket on S3

Example:

s3.Bucket('cities-socio-economic-vulnerability').upload_file(Filename='census.csv', Key='data/mexico/census/census.csv')

Mexico

Data collection

Mexico census raw data is available as tabular csv files for each Mexican state. Each file contains census information at the state scale, combining statistics at various resolutions: block, basic geostatistical area (AGEB), locality, municipality, and state. A list of 222 census variables is provided in each census file.

field name Description
ENTIDAD State code
NOM_ENT State name
MUN Municipality code
NOM_MUN Municipality name
LOC Locality code
NOM_LOC Locality name
AGEB Basic Geostatistical Area code
MZA Block number by AGEB
POBTOT Total population count
POBFEM Population by gender (women)
POBMAS Population by gender (men)
P_0A2 Population by age (0 to 2 years)
P_0A2_F Population by age and gender (Female Population 0 to 2 years)
P_0A2_M Population by age and gender (Male Population 0 to 2 years)
P_3YMAS Population by age (Population aged 3 years and over)
P_3YMAS_F Population by age and gender (Female Population aged 3 years and over)
P_3YMAS_M Population by age and gender (Male Population aged 3 years and over)
... ...

A sample of Mexico census data is provided in this csv file.

Data processing

Different data treatments are applied to the raw census data to provide the final harmonized census output table:

  • Census variables selection and renaming: Only a subset of census variables is extracted from the complete list of variables provided by the raw data. The list of variables is provided in this mapping parameter table.
  • Census aggregation: Census values are aggregated at different geographical levels (state, municipality, locality, ageb, and block)
  • Data formatting: Different aggregated tables are merged in a single csv file with respect to the defined data model. An example of census output table is provided in this csv file.

The aggregated and processed data is stored in one csv file combining the different geographical scales. In order to filter census information by variable, the dataset is pivoted into a long format so that each row corresponds to a census variable value per geographical unit. The table below describes the defined data model:

Field name Type Description
Level string The geographical level corresponding to the variables' values.
State string The state name of the different geographical units.
Municipality string The municipality name of the different geographical units.
Locality string The locality name of the different geographical units.
AGEB code string The Basic Geostatistical Area code.
Block code string The block code.
Census variable string Census variable name.
Year string Year of the census data used.
Value float Census variable value.
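The pivot to long format described above can be sketched with pandas `melt`. The sample values below are made up for illustration; the column names follow the data model table.

```python
# Illustrative pivot from a wide census table (one column per variable)
# to the long data model (one row per variable per geographical unit).
import pandas as pd

wide = pd.DataFrame({
    "Level": ["municipality", "municipality"],
    "State": ["Jalisco", "Sonora"],
    "Municipality": ["Guadalajara", "Hermosillo"],
    "Year": ["2020", "2020"],
    "POBTOT": [100.0, 200.0],   # made-up sample values
    "POBFEM": [55.0, 98.0],
})

# Melt the census variable columns into (variable, value) pairs.
long = wide.melt(
    id_vars=["Level", "State", "Municipality", "Year"],
    var_name="Census variable",
    value_name="Value",
)
```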

Data storage

Brazil

Data collection

Data processing

Data storage

Colombia

Data collection

Data processing

Data storage

India

Data collection

Data processing

Data storage