Welcome to Week 2 of the Data Engineering Zoomcamp! This week, we'll cover workflow orchestration with Kestra.
Kestra is an open-source, event-driven orchestration platform that makes both scheduled and event-driven workflows easy. It brings Infrastructure as Code best practices to data, process, and microservice orchestration, so you can build reliable workflows directly from the UI in just a few lines of YAML.
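To give a feel for what "a few lines of YAML" means, here's a minimal hello-world flow. The flow structure (id, namespace, tasks) is standard Kestra; the Log task type shown reflects recent Kestra releases, so treat the exact task type string as an assumption and check the docs for your version.

```yaml
# Minimal sketch of a Kestra flow: one task that logs a message.
id: hello_world
namespace: zoomcamp

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log   # core Log task in recent Kestra releases
    message: Hello, Data Engineering Zoomcamp!
```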
The course will cover the basics of workflow orchestration, why it's important, and how it can be used to build data engineering pipelines.
In this section, we'll cover the basics of workflow orchestration. We'll discuss what it is, why it's important, and how it can be used to build data pipelines.
Videos
- Introduction to Workflow Orchestration
In this section, you'll learn what Kestra is, how to use it, and how to build a Hello-World data pipeline; a minimal Docker Compose sketch follows the video list below.
Videos
- Introduction to Kestra
- Launch Kestra using Docker Compose
- Kestra Fundamentals
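As a rough idea of what launching Kestra with Docker Compose looks like, here is a minimal single-container sketch. The course repo ships its own docker-compose.yml (with a Postgres backend and extra configuration), so prefer that file; the image tag, command, and volume path below are assumptions.

```yaml
# docker-compose.yml — a minimal sketch, not the course's full setup.
services:
  kestra:
    image: kestra/kestra:latest      # official image
    command: server local            # all-in-one mode with an embedded database
    ports:
      - "8080:8080"                  # Kestra UI at http://localhost:8080
    volumes:
      - kestra-data:/app/storage     # assumed default storage path, used to persist data

volumes:
  kestra-data:
```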
Resources
In this section, we'll cover how to ingest the Yellow Taxi data from the NYC Taxi and Limousine Commission (TLC): we'll extract data from CSV files and load it into a local Postgres database running in a Docker container.
Important note: the TLC Trip Record Data provided on the nyc.gov website is currently available only in Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the CSV files available here on GitHub. This is because the Parquet format can be challenging for newcomers, and we want to make the course as accessible as possible: CSV files can be easily inspected with tools like Excel or Google Sheets, or even a simple text editor.
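To sketch the extract-and-load idea in Kestra terms: download one month of Yellow Taxi data over HTTP, then copy it into Postgres. The task types come from Kestra's core HTTP plugin and Postgres (JDBC) plugin, but the URL, table name, credentials, and exact property names below are placeholders and assumptions; the course flows are the reference implementation.

```yaml
# Hedged sketch of an extract-and-load flow: CSV over HTTP into Postgres.
id: postgres_taxi_sketch
namespace: zoomcamp

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    # Placeholder URL; the real course files live on GitHub and are gzipped,
    # so a decompression step would typically sit between these two tasks.
    uri: https://example.com/yellow_tripdata_2019-01.csv

  - id: load
    type: io.kestra.plugin.jdbc.postgresql.CopyIn
    url: jdbc:postgresql://localhost:5432/ny_taxi   # Postgres container from the course setup
    username: root                                  # placeholder credentials
    password: root
    table: yellow_tripdata                          # assumes the table already exists
    from: "{{ outputs.extract.uri }}"               # file produced by the extract task
    format: CSV
    header: true                                    # the CSV has a header row
```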
Now that you've explored how to run ETL locally with Postgres, we'll do the same on GCP (a short flow sketch follows the list below). We'll load the same data to:
- A data lake, using Google Cloud Storage (GCS)
- A data warehouse, using BigQuery
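As an illustration of the GCP leg, a flow might push a file to a GCS bucket and then load it into BigQuery. The task types are from Kestra's GCP plugin, but the bucket, project, dataset, credentials handling, and exact property names here are assumptions; refer to the course flows and plugin docs for the real configuration.

```yaml
# Hedged sketch: upload a file to GCS, then load it into BigQuery.
id: gcp_taxi_sketch
namespace: zoomcamp

tasks:
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.uri }}"   # file produced by an earlier extract task (not shown)
    to: gs://my-zoomcamp-bucket/yellow_tripdata_2019-01.csv   # placeholder bucket

  - id: load_to_bigquery
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://my-zoomcamp-bucket/yellow_tripdata_2019-01.csv
    destinationTable: my-project.zoomcamp.yellow_tripdata     # placeholder project and dataset
    format: CSV
    csvOptions:
      skipLeadingRows: 1   # skip the CSV header row
```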
In this section, we'll cover how you can schedule your data pipelines to run at specific times. We'll also cover how you can backfill your data pipelines to run on historical data.
We'll demonstrate backfills first locally using Postgres and then on GCP using GCS and BigQuery.
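To show the shape of a schedule, the fragment below attaches a monthly cron trigger to a flow; tasks can then template the scheduled date (for example via {{ trigger.date }}) to pick which month of data to process, which is also what makes backfills over historical months possible. The trigger type reflects recent Kestra releases and the cron expression is just an example; treat the exact type and variable names as assumptions and verify them against the Kestra docs.

```yaml
# Hedged flow fragment: a monthly schedule plus a task that uses the trigger date.
triggers:
  - id: monthly
    type: io.kestra.plugin.core.trigger.Schedule   # core Schedule trigger in recent releases
    cron: "0 9 1 * *"                              # 09:00 on the 1st of every month

tasks:
  - id: log_period
    type: io.kestra.plugin.core.log.Log
    # trigger.date is the scheduled date; during a backfill it takes each historical date in turn.
    message: "Processing data for {{ trigger.date | date('yyyy-MM') }}"
```

Backfills themselves are started from the Kestra UI (or API) by selecting a past start date on the trigger; Kestra then creates one execution per missed interval.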
The homework for this week can be found here. Don't worry, it's just a set of multiple-choice questions to test your understanding of Kestra, workflow orchestration, and ETL pipelines for a data lake and data warehouse.
- Check Kestra Docs
- Explore our Blueprints library
- Browse over 600 plugins available in Kestra
- Give us a star on GitHub
- Join our Slack community if you have any questions.
Did you take notes? You can share them here. Just create a PR to this file and add your notes below.