airflow-de-intro-project

This repository contains a simple introductory project demonstrating the basic elements of a pipeline that ingests data in the form of Parquet files, applies transformations to that data using Pandas, and writes the output as an Iceberg table within the S3 sandbox.
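As a rough sketch of the end-to-end shape of such a pipeline (the bucket paths, database and table names, and the use of awswrangler's Athena Iceberg helper here are illustrative assumptions, not the project's actual configuration):

```python
import awswrangler as wr
import pandas as pd

# Hypothetical locations -- substitute the paths used in your own sandbox.
SOURCE_PATH = "s3://my-sandbox-bucket/raw/people.parquet"
TABLE_LOCATION = "s3://my-sandbox-bucket/curated/people/"
TEMP_PATH = "s3://my-sandbox-bucket/athena-temp/"

# Extract: read the raw Parquet file into a DataFrame.
df = pd.read_parquet(SOURCE_PATH)

# Transform: an illustrative tidy-up step using plain Pandas.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Load: write the result as an Iceberg table via Athena
# (assumes awswrangler >= 3.x, where wr.athena.to_iceberg is available).
wr.athena.to_iceberg(
    df=df,
    database="my_sandbox_db",
    table="people",
    table_location=TABLE_LOCATION,
    temp_path=TEMP_PATH,
)
```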

During the project you will also use basic functions from some internal MoJ packages.

All data in this repository was adapted from a synthetic open dataset originally located at: https://github.com/datablist/sample-csv-files

How to use (WIP)

First fork this repo, then clone your fork locally. This keeps your changes from conflicting with those of other users of this intro project, which matters later on when you start running the pipeline as an Airflow job.

In scripts/functions.py we have defined some basic abstract functions that cover the main stages of this pipeline, with docstrings suggesting which tools should be used to implement each one. A hypothetical sketch of filling in one such stub is shown below.
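The real stub names and docstrings live in scripts/functions.py; as a purely hypothetical illustration (the function name, signature, and column names here are invented for the example), an implementation of an extraction stage might look like:

```python
import pandas as pd

def extract_people_data(source_path: str) -> pd.DataFrame:
    """Read the raw Parquet file and keep only the columns the pipeline needs.

    Hypothetical example -- follow the docstring guidance in
    scripts/functions.py for the real stages and recommended tools.
    """
    df = pd.read_parquet(source_path)
    # Select an explicit column subset so schema drift in the
    # source file fails loudly rather than silently.
    return df[["customer_id", "first_name", "last_name", "country"]]
```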

If your implementation of these functions becomes too complex, it may help to define your own helper functions as intermediate steps, keeping each piece easier to read and test. One way this might look is sketched below.
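For instance (again a hypothetical sketch, with invented column names), a transformation stage could delegate to small single-purpose helpers that can each be unit-tested in isolation:

```python
import pandas as pd

def normalise_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case column names and replace spaces with underscores."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out

def drop_incomplete_rows(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Remove rows missing any of the required fields."""
    return df.dropna(subset=required)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Compose the helpers into the full transformation step."""
    df = normalise_column_names(df)
    return drop_incomplete_rows(df, required=["customer_id"])
```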

We will continue to update this project based on user feedback.
