2. Data workflow

Saif Shabou edited this page Aug 5, 2022 · 12 revisions

Data workflow v1

The data treatments applied to the raw census data aim to provide a harmonized version of the census data from the different countries that can be explored easily through the dashboard.

Harmonize census data

  • Goal: Countries' census data come in various schemas, structures, and formats, and the definition of administrative levels varies between countries. In order to integrate the different census datasets into a single platform, we need to harmonize them by defining a common schema, format, and storage structure.

  • Input: Raw census data stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/data/raw/{country_name}/census/{year}/raw/country-census.csv

  • Output:

    • Census data harmonized with respect to the standardized census data schema described in this wiki page. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/data/{environment}/{country_name}/census/{year}/census_{country_name}{admin_level}{year}.csv.
    • Table containing the definitions of census variables
  • Transformations:

    • Subset and rename census variables
    • Split the census data depending on the geographical level
    • Store the data in csv format in aws file storage
    • Link to the scripts:
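The subset, rename, and split transformations above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names, the `COLUMN_MAP`, and the `admin_level` column are all illustrative assumptions.

```python
# Hypothetical sketch of the harmonization step: subset and rename raw
# census columns, then split the table by administrative level.
import pandas as pd

# Mapping from raw column names to the harmonized schema (illustrative).
COLUMN_MAP = {"ENTIDAD": "state_code", "POBTOT": "total_population"}

def harmonize(raw: pd.DataFrame, level_col: str = "admin_level") -> dict:
    """Subset/rename columns, then split the table by administrative level."""
    df = raw[[level_col, *COLUMN_MAP]].rename(columns=COLUMN_MAP)
    return {level: part.drop(columns=level_col).reset_index(drop=True)
            for level, part in df.groupby(level_col)}

raw = pd.DataFrame({
    "admin_level": ["state", "state", "municipality"],
    "ENTIDAD": ["01", "02", "01"],
    "POBTOT": [100, 200, 50],
    "UNUSED": [0, 0, 0],          # dropped by the subset step
})
parts = harmonize(raw)
# Each value in `parts` would then be written to its own csv file in S3,
# one per administrative level.
```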

Process boundaries

  • Input: Raw boundaries data stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/raw/boundary_level.shp
  • Output: Boundaries data processed and transformed with respect to the standardized boundaries data schema described in this wiki page. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/boundaries/processed/level.geojson.
  • Transformations:
    • Simplify geometries in order to reduce file sizes
    • Rename variables
    • Store the data in GeoJSON format in AWS file storage
    • Link to the scripts:
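The simplification step drops vertices that contribute little to a shape so that the output GeoJSON files stay small. In practice a library such as GeoPandas/Shapely (`geometry.simplify(tolerance)`) would do this; the dependency-free sketch below of the classic Ramer-Douglas-Peucker algorithm only illustrates the idea on a line of 2-D points.

```python
# Minimal Ramer-Douglas-Peucker simplification (illustrative only; the
# real pipeline would use a geospatial library on actual geometries).

def point_line_dist(p, a, b):
    """Perpendicular distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # Twice the triangle area divided by the base length gives the height.
    return abs(dx * (ay - py) - (ax - px) * dy) / (dx * dx + dy * dy) ** 0.5

def simplify(points, tolerance):
    """Drop vertices closer than `tolerance` to the chord of their span."""
    if len(points) < 3:
        return list(points)
    dists = [point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    imax = max(range(len(dists)), key=dists.__getitem__)
    if dists[imax] <= tolerance:
        return [points[0], points[-1]]
    left = simplify(points[: imax + 2], tolerance)
    right = simplify(points[imax + 1:], tolerance)
    return left[:-1] + right

# Small wiggles below the tolerance collapse onto the chord.
line = [(0, 0), (1, 0.05), (2, -0.05), (3, 0.04), (4, 0)]
simplified = simplify(line, tolerance=0.1)
```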

Join census with boundaries

  • Input: Processed census and boundaries data
  • Output: GeoJSON files including both boundaries' simplified geometries and the associated census variables. The output files are stored in AWS S3 buckets with respect to the following file storage structure: s3/cities-socio-economic-vulnerability/country_name/census_geo/year/level_name/state_name/census_geo.geojson.
  • Transformations:
    • Join census with geometries
    • Split data by state and store the outputs with respect to the defined file storage structure
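The join-and-split steps above can be sketched with plain pandas. The real pipeline works on GeoJSON geometries; here a placeholder "geometry" string and a shared `unit_id` column stand in for them, both illustrative assumptions.

```python
# Hypothetical sketch of the census/boundaries join and the split by state.
import pandas as pd

census = pd.DataFrame({
    "unit_id": ["A", "B"],
    "state": ["Jalisco", "Sonora"],
    "total_population": [100, 200],
})
boundaries = pd.DataFrame({
    "unit_id": ["A", "B"],
    "geometry": ["POLYGON(...)", "POLYGON(...)"],
})

# Join census values onto their geometries by a shared unit identifier.
census_geo = census.merge(boundaries, on="unit_id", how="left")

# Split by state; each part would then be written to
# .../census_geo/year/level_name/state_name/census_geo.geojson
by_state = {state: part for state, part in census_geo.groupby("state")}
```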

Data storage

Repository structure

For this first version of the socio-economic-vulnerability project, all the data is stored in this AWS S3 bucket: s3://cities-socio-economic-vulnerability/data/. All data stored in this repository needs to be public in order to make sharing between different users and applications easy.

Environment Description
raw The raw data, or any data that we need to share that is not yet ready for use within the dashboard. This data doesn't need to follow specific schemas or formats
dev The dev folder contains data used for the in-development version of the dashboard
prd The prd folder contains the data used for the in-production version of the dashboard. This repository follows the same structure as the dev repository

The structure of the data repository is as follows:

  • raw:
    • country_name_1
      • boundaries
      • census
    • country_name_2
    • ...
  • dev:
    • country_name_1
      • boundaries
        • boundary_admin-level-1.geojson
        • boundary_admin-level-1_simplified.geojson
        • boundary_admin-level-2.geojson
        • boundary_admin-level-2_simplified.geojson
        • ...
      • census
        • year 1
          • census_country-name_admin-level-1_year.csv
          • census_country-name_admin-level-2_year.csv
          • ...
        • year 2
          • census_country-name_admin-level-1_year.csv
          • census_country-name_admin-level-2_year.csv
          • ...
      • census_geo
        • year 1
          • admin_level_1
            • state-1-code
              • census_geo.geojson
            • state-2-code
              • census_geo.geojson
            • ...
            • All
              • census_geo.geojson
          • admin_level_2
          • ...
  • prd
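For illustration, a small helper (hypothetical, not part of the project code) shows how the census object keys under dev or prd are composed, following the file naming shown in the tree above:

```python
# Hypothetical helper composing an S3 object key for a harmonized census
# file; the function name and signature are illustrative only.
def census_key(env: str, country: str, year: int, admin_level: str) -> str:
    return (f"{env}/{country}/census/{year}/"
            f"census_{country}_{admin_level}_{year}.csv")

key = census_key("dev", "mexico", 2020, "admin-level-1")
```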

v2

Data access

Using R

# load library
library(aws.s3)

# set keys
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "my_access_key_id",
  "AWS_SECRET_ACCESS_KEY" = "my secret key",
  "AWS_DEFAULT_REGION" = "storage_region"
)

# file_path = path to the data as stored locally
# object_path = path to the data storage file in the bucket

put_object(file = file_path, 
           object = object_path, 
           bucket = bucket_name,
           acl = "public-read",
           multipart = TRUE)

Using python

  1. The first step is to install the module boto3 using your preferred method (pip, conda).

  2. To access S3, we first need to authenticate by defining a ServiceResource object using the following code

import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name='eu-west-3',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecretkey'
)
  • The service name should be s3.
  • The region needs to be set to the one associated with the S3 bucket.
  • aws_access_key_id and aws_secret_access_key are credentials that should have been transmitted to you when the account was created.

After authenticating, you can access the different buckets in S3. To see a list of all available buckets, you can use the following code

for bucket in s3.buckets.all():
    print(bucket.name)
  3. To upload a file, one can use the following line of code
s3.Bucket(bucket_name).upload_file(Filename=local_file, Key=server_file)
  • bucket_name should be one of the available buckets that were printed before
  • local_file is the path to the local file that has to be uploaded to s3
  • server_file is the path to where the file will be uploaded in the bucket on S3

Example:

s3.Bucket('cities-socio-economic-vulnerability').upload_file(Filename='census.csv', Key='data/mexico/census/census.csv')

Mexico

Data collection

Mexico census raw data is available as tabular csv files for each Mexican state. Each file contains census information at the state scale, combining statistics at various resolutions: block, basic geostatistical area (AGEB), locality, municipality, and state. A list of 222 census variables is provided in each census file.

field name Description
ENTIDAD State code
NOM_ENT State name
MUN Municipality code
NOM_MUN Municipality name
LOC Locality code
NOM_LOC Locality name
AGEB Basic Geostatistical Area code
MZA Block number by AGEB
POBTOT Total population count
POBFEM Population by gender (women)
POBMAS Population by gender (men)
P_0A2 Population by age (0 to 2 years)
P_0A2_F Population by age and gender (Female Population 0 to 2 years)
P_0A2_M Population by age and gender (Male Population 0 to 2 years)
P_3YMAS Population by age (Population aged 3 years and over)
P_3YMAS_F Population by age and gender (Female Population aged 3 years and over)
P_3YMAS_M Population by age and gender (Male Population aged 3 years and over)
... ...

A sample of Mexico census data is provided in this csv file.

Data processing

Different data treatments are applied to the raw census data to provide the final harmonized census output table:

  • Census variables selection and renaming: Only a subset of census variables is extracted from the complete list of variables provided by the raw data. The list of variables is provided in this mapping parameter table.
  • Census aggregation: Census values are aggregated at different geographical levels (state, municipality, locality, ageb, and block)
  • Data formatting: Different aggregated tables are merged in a single csv file with respect to the defined data model. An example of census output table is provided in this csv file.

The aggregated and processed data is stored in one csv file combining the different geographical scales. In order to filter census information by variable, the dataset is pivoted into a long format so that each row corresponds to a census variable value per geographical unit. The table below describes the defined data model:

Field name Type Description
Level string The geographical level corresponding to the variables' values.
State string The state name of the different geographical units.
Municipality string The municipality name of the different geographical units.
Locality string The locality name of the different geographical units.
AGEB code string The Basic Geostatistical Area code.
Block code string The block code.
Census variable string Census variable name.
Year string Year of the census data used.
Value float Census variable value.
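The pivot to long format described above can be sketched with pandas `melt`. The sample values below are made up for illustration; the column names follow the data model table.

```python
# Illustrative pivot from a wide census table (one column per variable)
# to the long data model (one row per variable per geographical unit).
import pandas as pd

wide = pd.DataFrame({
    "Level": ["municipality", "municipality"],
    "State": ["Jalisco", "Sonora"],
    "Municipality": ["Guadalajara", "Hermosillo"],
    "Year": ["2020", "2020"],
    "POBTOT": [100.0, 200.0],   # made-up sample values
    "POBFEM": [55.0, 98.0],
})

# Melt the census variable columns into (variable, value) pairs.
long = wide.melt(
    id_vars=["Level", "State", "Municipality", "Year"],
    var_name="Census variable",
    value_name="Value",
)
```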

Data storage

Brazil

Data collection

Data processing

Data storage

Colombia

Data collection

Data processing

Data storage

India

Data collection

Data processing

Data storage