This repository contains a simple introductory project demonstrating the basic elements of a pipeline that ingests data from parquet files, performs transformations on that data using Pandas, and writes the output as an Iceberg table in the S3 sandbox.
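As a rough illustration of those three stages, the sketch below reads a parquet file with pandas, applies a simple transformation, and writes the result as an Iceberg table using aws-sdk-pandas (awswrangler). The bucket paths, database, and table names are placeholders, and your pipeline may well use different tools (for example the internal MoJ packages mentioned below):

```python
import awswrangler as wr

# Placeholder locations -- substitute the real sandbox paths for your project.
SOURCE_PATH = "s3://my-sandbox-bucket/raw/customers.parquet"  # hypothetical
TABLE_LOCATION = "s3://my-sandbox-bucket/iceberg/customers/"  # hypothetical
TEMP_PATH = "s3://my-sandbox-bucket/athena-temp/"             # hypothetical

# Stage 1: ingest the raw parquet file into a DataFrame.
df = wr.s3.read_parquet(SOURCE_PATH)

# Stage 2: an example transformation -- standardise the column names.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Stage 3: write the result out as an Iceberg table via Athena.
wr.athena.to_iceberg(
    df=df,
    database="my_sandbox_db",  # hypothetical database name
    table="customers",
    table_location=TABLE_LOCATION,
    temp_path=TEMP_PATH,
)
```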
During the project you will also use some basic functions from internal MoJ packages.
All data in this repository was adapted from a synthetic open dataset originally located at: https://github.com/datablist/sample-csv-files
First fork this repo, then clone your fork locally. This keeps your changes from conflicting with those of other users of this intro project, which matters later on when you start running the pipeline as an Airflow job.
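For example, after forking on GitHub you might clone your fork like this (replace `<your-username>` and `<repo-name>` with the real values; both are placeholders here):

```sh
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>
```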
In scripts/functions.py we have defined some basic abstract functions covering the main stages of this pipeline, with instructions in each docstring suggesting tools to use when implementing them.
If your implementation of these functions becomes too complex, define your own helper functions as intermediate steps to keep the code easy to read and test; see the sketch below.
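To illustrate (the actual function names and signatures live in scripts/functions.py; the names below are invented for this example), an implementation might break one stage into a small, testable helper:

```python
import pandas as pd


def _standardise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Helper (hypothetical): snake_case all column names so the
    transformation step stays short and easy to unit test."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for one of the abstract functions in
    scripts/functions.py -- compose small helpers rather than writing
    one long function body."""
    df = _standardise_columns(df)
    df = df.dropna(how="all")  # drop fully-empty rows as a simple example
    return df
```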
We will be updating this project based upon user feedback.