Learning by doing and thinking about the code we write.
This lab is designed to help you get familiar with Apache Airflow. You will learn how to create a simple DAG, schedule it, monitor its execution, and more.
I'll try to go beyond the basic concepts and cover some advanced topics such as TaskGroup, the TaskFlow API, SLAs, data-aware scheduling, the Kubernetes Executor, and more.
Note: You can use the Astro CLI to create a new Airflow project. For more information, see Astro CLI.
- Basic knowledge of Python
  - Variables
  - Functions
  - Control Flow
  - `*args` and `**kwargs`
- Basic knowledge of Docker
  - Knowing `docker compose up` and `docker compose down` is good enough
- Poetry: Package Manager for Python
  - Run `poetry install --no-root` to install dependencies
- Some parts of the lab require basic knowledge of Kubernetes and Helm
  - You can skip those parts if you are not familiar with them
- Setup
  - Lightweight Airflow setup with Docker, see `docker-compose.lite.yaml`
  - Enable Test button in Airflow UI
  - Disable Example DAGs
  - Copy Airflow Configuration
  - Enable Flower UI
- Introduction
  - Data Pipeline
  - Workflow Orchestration
- Overview of Airflow UI and concepts
  - Airflow UI
    - Pause/Unpause
    - Trigger DAG
    - Refresh
    - Recent Tasks
    - DAG Runs
    - Graph View
  - DAGs
  - Operators
  - Tasks
- Writing your first DAG (Single Operator)
  - Create a new DAG with `PythonOperator` (see the sketch below)
    - Defining DAG
    - Schedule
    - Task
  - Test the DAG
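A minimal sketch of what such a first DAG might look like; the DAG id, schedule, and `say_hello` callable are illustrative, not the lab's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello, Airflow!")


with DAG(
    dag_id="my_first_dag",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",     # run once a day
    catchup=False,                  # do not backfill past runs
) as dag:
    hello = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```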
- Writing your second DAG (Multiple Operators)
  - Create a new DAG with `PythonOperator`
  - Define dependencies between tasks (see the sketch below)
  - Test the DAG
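A minimal sketch of a multi-operator DAG with hypothetical extract/transform/load task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="my_second_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # The >> operator defines the order: extract, then transform, then load
    extract >> transform >> load
```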
- Scheduling a DAG (see the sketch below)
  - Fixed Interval
  - Cron Expression
  - Preset Airflow Scheduler
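To illustrate the three options, a sketch with placeholder DAG ids showing a fixed interval, a cron expression, and a preset:

```python
from datetime import datetime, timedelta

from airflow import DAG

# Fixed interval: run every 30 minutes
dag_interval = DAG(
    dag_id="schedule_fixed_interval",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=30),
)

# Cron expression: run at 06:00 every day
dag_cron = DAG(
    dag_id="schedule_cron",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",
)

# Preset: @hourly, @daily, @weekly, ...
dag_preset = DAG(
    dag_id="schedule_preset",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
)
```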
- Google Drive to GCS
  - Create a new DAG
  - Create a new connection for Google Drive via Service Account
  - Use `GoogleDriveToGCSOperator` to copy files from Google Drive to GCS (see the sketch below)
  - Test the DAG
  - `GoogleDriveFileSensor` to wait for a file to be uploaded to Google Drive
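A hedged sketch of the copy task; the connection id, folder id, file name, and bucket are placeholders, and parameter names may differ slightly between versions of the `apache-airflow-providers-google` package:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gdrive_to_gcs import GoogleDriveToGCSOperator

with DAG(
    dag_id="gdrive_to_gcs",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # trigger manually while testing
) as dag:
    copy_file = GoogleDriveToGCSOperator(
        task_id="copy_file",
        gcp_conn_id="google_cloud_default",  # connection backed by the Service Account
        folder_id="<drive-folder-id>",       # placeholder
        file_name="report.csv",              # placeholder
        bucket_name="my-gcs-bucket",         # placeholder
        object_name="raw/report.csv",        # placeholder
    )
```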
- Scraping Data from GitHub to Postgres (see the sketch below)
  - `SimpleHttpOperator` to get data from the GitHub API
  - `PostgresOperator` to insert data into Postgres
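A hedged sketch of the pipeline, assuming a hypothetical `github_api` HTTP connection and a `github_repos` table, neither of which comes from the lab:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="github_to_postgres",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # GET https://api.github.com/repos/apache/airflow via the "github_api"
    # connection (hypothetical); the response body is pushed to XCom.
    fetch_repo = SimpleHttpOperator(
        task_id="fetch_repo",
        http_conn_id="github_api",
        endpoint="repos/apache/airflow",
        method="GET",
        log_response=True,
    )

    # Pulls the XCom value from the previous task and inserts the raw response.
    insert_repo = PostgresOperator(
        task_id="insert_repo",
        postgres_conn_id="postgres_default",
        sql="""
            INSERT INTO github_repos (payload)
            VALUES ('{{ ti.xcom_pull(task_ids="fetch_repo") }}');
        """,
    )

    fetch_repo >> insert_repo
```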
- Triggering another DAG
  - Learn how to trigger another DAG
  - Getting to know `TriggerDagRunOperator` (see the sketch below)
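A minimal sketch with hypothetical `controller_dag` and `target_dag` ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger_target = TriggerDagRunOperator(
        task_id="trigger_target",
        trigger_dag_id="target_dag",              # the DAG to start
        conf={"triggered_by": "controller_dag"},  # optional payload for the run
    )
```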
- Task Decorators - TaskFlow API
  - Simplified way to define tasks
  - Getting to know the `@task` decorator
  - Using `@task` to define tasks like `PythonOperator` (see the sketch below)
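A minimal sketch of the TaskFlow style, with illustrative task names; passing one task's return value into another creates both the XCom and the dependency:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_example():
    @task
    def extract() -> int:
        return 42

    @task
    def transform(value: int) -> int:
        return value * 2

    @task
    def load(value: int) -> None:
        print(f"loaded {value}")

    # Equivalent to extract >> transform >> load, with XComs passed implicitly
    load(transform(extract()))


taskflow_example()
```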
- Testing - In Progress
  - Unit Testing
  - DAG Integrity Testing (see the sketch below)
  - `dag.test()` method
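A hedged sketch of a DAG integrity test with pytest, assuming the DAG files live in a `dags/` folder:

```python
from airflow.models import DagBag


def test_no_import_errors():
    # Loading the DagBag parses every DAG file; any syntax or import
    # problem shows up in import_errors.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```

For quick in-process debugging, `dag.test()` (Airflow 2.5+) runs a whole DAG run without a scheduler, e.g. `dag_bag.get_dag("my_first_dag").test()`.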
- Dataset - Data-aware scheduling - In Progress
  - Trigger a DAG based on data availability (see the sketch below)
  - Wait for many datasets to be available
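A minimal sketch of data-aware scheduling (Airflow 2.4+), with a placeholder dataset URI; `schedule` accepts a list, which is how a DAG waits for many datasets:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

reports = Dataset("s3://my-bucket/reports.csv")  # placeholder URI

with DAG(dag_id="producer", start_date=datetime(2023, 1, 1), schedule="@daily") as producer:
    PythonOperator(
        task_id="write_reports",
        python_callable=lambda: print("writing reports"),
        outlets=[reports],  # marks the dataset as updated when the task succeeds
    )

# Runs whenever every dataset in the list has been updated
with DAG(dag_id="consumer", start_date=datetime(2023, 1, 1), schedule=[reports]) as consumer:
    PythonOperator(task_id="read_reports", python_callable=lambda: print("reading reports"))
```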
- Scaling with the Celery Executor
  - Monitor the task execution with Flower UI (to enable Flower UI, see chapter-0)
  - Add more workers to the Celery Executor
    - Duplicate the `airflow-worker` service in `docker-compose.yml` and rename it
    - Restart Docker
- Task Dependencies (see the sketch below)
  - Basics of defining dependencies between tasks
  - Fan-in and Fan-out
  - Trigger Rules
  - Conditional Trigger
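A minimal sketch of fan-out, fan-in, and a trigger rule, with illustrative task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="dependencies_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    start = EmptyOperator(task_id="start")
    branch_a = EmptyOperator(task_id="branch_a")
    branch_b = EmptyOperator(task_id="branch_b")

    # Runs as soon as one upstream branch succeeds (default is "all_success")
    join = EmptyOperator(task_id="join", trigger_rule="one_success")

    start >> [branch_a, branch_b]   # fan-out
    [branch_a, branch_b] >> join    # fan-in
```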
- Managing Complex Tasks with TaskGroup (see the sketch below)
  - Group tasks together
  - Define dependencies between TaskGroups
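A minimal sketch of grouping tasks and wiring two groups together, with illustrative names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="taskgroup_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    with TaskGroup(group_id="extract") as extract_group:
        EmptyOperator(task_id="extract_orders")
        EmptyOperator(task_id="extract_users")

    with TaskGroup(group_id="load") as load_group:
        EmptyOperator(task_id="load_warehouse")

    # A dependency between two TaskGroups applies to all tasks inside them
    extract_group >> load_group
```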
- SLA (see the sketch below)
  - Define SLA for a DAG
  - Define SLA for a task
  - Define SLA callback
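A hedged sketch of the three SLA pieces, with illustrative values: a DAG-wide default via `default_args`, a per-task `sla`, and an `sla_miss_callback`:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler with the tasks that missed their SLA
    print(f"SLA missed in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="sla_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"sla": timedelta(hours=1)},  # default SLA for every task in the DAG
    sla_miss_callback=on_sla_miss,
) as dag:
    # Per-task SLA overrides the DAG-wide default
    strict_task = EmptyOperator(task_id="strict_task", sla=timedelta(minutes=10))
```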
- Airflow on Kubernetes - In Progress
  - Deploy Airflow on a Kubernetes cluster using `Helm` and `Kind`
  - Use Kubernetes Executor
  - Use `KubernetesPodOperator` (see the sketch below)
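A hedged sketch of `KubernetesPodOperator`, with placeholder image, namespace, and command; note that the exact import path varies across `cncf.kubernetes` provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(dag_id="k8s_pod_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    say_hello = KubernetesPodOperator(
        task_id="say_hello",
        name="say-hello",                 # pod name
        namespace="airflow",              # placeholder namespace
        image="python:3.11-slim",         # any image works; the task runs inside it
        cmds=["python", "-c"],
        arguments=["print('hello from a pod')"],
        get_logs=True,                    # stream pod logs back to Airflow
    )
```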
- Build Airflow Docker Image (Poetry for Package Management)
- Working with DataHub - In Progress
  - Setup DataHub on Local Development
  - Emit Metadata to DataHub (see the sketch below)
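A hedged sketch of emitting metadata with DataHub's Python REST emitter; the server URL, platform, and dataset name are placeholders, and the exact classes may vary across `acryl-datahub` versions:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Points at a locally running DataHub GMS (placeholder URL)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Describe a dataset (platform and name are placeholders)
dataset_urn = make_dataset_urn(platform="postgres", name="mydb.public.github_repos", env="PROD")
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(description="Repositories scraped from the GitHub API"),
)

emitter.emit(mcp)
```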