
Airflow Lab

Learning by doing and thinking about the code we've written.

Introduction

This lab is designed to help you get familiar with Apache Airflow. You will learn how to create a simple DAG, schedule it, monitor its execution, and more.

I'll try to go beyond the basic concepts and cover some advanced topics like TaskGroup, the TaskFlow API, SLAs, data-aware scheduling, the Kubernetes Executor, and more.

Note: You can use the Astro CLI to create a new Airflow project. For more information, see Astro CLI.

Prerequisites

  • Basic knowledge of Python
    • Variables
    • Functions
    • Control Flow
    • *args and **kwargs
  • Basic knowledge of Docker
    • docker compose up and docker compose down are good enough
  • Poetry: Package Manager for Python
    • poetry install --no-root to install dependencies
  • Some parts of the lab require basic knowledge of Kubernetes and Helm
    • You can skip those parts if you are not familiar with them

Lab Instructions

  1. Configuration

    • Lightweight Airflow setup with Docker, see docker-compose.lite.yaml
    • Enable Test button in Airflow UI
    • Disable Example DAGs
    • Copy Airflow Configuration
    • Enable Flower UI
  2. What's Airflow?

    • Data Pipeline
    • Workflow Orchestration
  3. Overview of Airflow UI and concepts

    • Airflow UI
      • Pause/Unpause
      • Trigger DAG
      • Refresh
      • Recent Tasks
      • DAG Runs
      • Graph View
    • DAGs
    • Operators
    • Tasks
  4. Writing your first DAG (Single Operator)

    • Create a new DAG with PythonOperator
    • Defining DAG
      • Schedule
      • Task
    • Test the DAG (a minimal sketch follows this step)
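
A minimal sketch of what this first DAG could look like. The DAG id, schedule, and function name are illustrative rather than taken from the lab files, and it assumes Airflow 2.4+ (where schedule= is accepted):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # The callable that the single task runs.
    print("Hello, Airflow!")


with DAG(
    dag_id="my_first_dag",            # illustrative id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```
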
  5. Writing your second DAG (Multiple Operators)

    • Create a new DAG with PythonOperator
    • Define dependencies between tasks
    • Test the DAG (see the example after this list)
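
Building on the sketch above, a second DAG might chain a few placeholder tasks; the >> syntax declares the order:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="my_second_dag",  # illustrative id
    start_date=datetime(2024, 1, 1),
    schedule=None,           # trigger manually while testing
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # Run extract first, then transform, then load.
    extract >> transform >> load
```
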
  6. Schedule your DAG

    • Fixed Interval
    • Cron Expression
    • Airflow preset schedules, e.g. @daily (all three styles are sketched below)
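
The three scheduling styles side by side, with placeholder DAG ids:

```python
from datetime import datetime, timedelta

from airflow import DAG

# Fixed interval: run every 30 minutes.
fixed = DAG(dag_id="fixed_interval_demo", start_date=datetime(2024, 1, 1),
            schedule=timedelta(minutes=30), catchup=False)

# Cron expression: run at 06:00 every day.
cron = DAG(dag_id="cron_demo", start_date=datetime(2024, 1, 1),
           schedule="0 6 * * *", catchup=False)

# Airflow preset: "@daily" is shorthand for the cron expression "0 0 * * *".
preset = DAG(dag_id="preset_demo", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False)
```
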
  7. Google Drive to GCS

    • Create a new DAG
    • Create a new connection for Google Drive via Service Account
    • Use GoogleDriveToGCSOperator to copy files from Google Drive to GCS
    • Test the DAG (example below)
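
A sketch of the transfer task. The connection id, Drive folder id, file name, and bucket are all placeholders; double-check the parameter names against your apache-airflow-providers-google version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gdrive_to_gcs import GoogleDriveToGCSOperator

with DAG(dag_id="gdrive_to_gcs", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    GoogleDriveToGCSOperator(
        task_id="copy_report",
        gcp_conn_id="google_drive_conn",  # the service-account connection created above
        folder_id="<drive-folder-id>",    # placeholder
        file_name="report.csv",           # placeholder
        bucket_name="<gcs-bucket>",       # placeholder
        object_name="drive/report.csv",   # placeholder target path in GCS
    )
```
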
  8. Working with Sensor

    • Use GoogleDriveFileExistenceSensor to wait for a file to be uploaded to Google Drive (sketch below)
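
Roughly how the sensor slots in, reusing the placeholder connection and file from the previous step; poke_interval and timeout come from the base sensor class:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.suite.sensors.drive import GoogleDriveFileExistenceSensor

with DAG(dag_id="wait_for_drive_file", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    GoogleDriveFileExistenceSensor(
        task_id="wait_for_file",
        gcp_conn_id="google_drive_conn",  # placeholder connection id
        folder_id="<drive-folder-id>",    # placeholder
        file_name="report.csv",           # placeholder
        poke_interval=60,                 # check once a minute
        timeout=60 * 60,                  # give up after an hour
    )
```
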
  9. Scraping Data from GitHub to Postgres

    • Use SimpleHttpOperator to get data from the GitHub API
    • Use PostgresOperator to insert data into Postgres (a combined sketch follows this step)
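
A combined sketch under a few assumptions: an HTTP connection named github_api whose host is https://api.github.com, a Postgres connection named postgres_default, and a repo_stars table that already exists:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(dag_id="github_to_postgres", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    # Fetch repo metadata; the filtered value is pushed to XCom automatically.
    fetch_repo = SimpleHttpOperator(
        task_id="fetch_repo",
        http_conn_id="github_api",        # assumed connection
        endpoint="repos/apache/airflow",  # example repository
        method="GET",
        response_filter=lambda response: json.loads(response.text)["stargazers_count"],
    )

    # Insert the fetched star count; connection and table are assumptions.
    insert_stars = PostgresOperator(
        task_id="insert_stars",
        postgres_conn_id="postgres_default",
        sql="""
            INSERT INTO repo_stars (repo, stars, collected_at)
            VALUES ('apache/airflow',
                    {{ ti.xcom_pull(task_ids='fetch_repo') }},
                    '{{ ds }}');
        """,
    )

    fetch_repo >> insert_stars
```
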
  10. Trigger Other DAGs

    • Learn how to trigger another DAG
    • Getting to know TriggerDagRunOperator (example below)
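
A sketch of a controller DAG; the target DAG id and conf payload are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(dag_id="controller_dag", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    TriggerDagRunOperator(
        task_id="trigger_target",
        trigger_dag_id="my_first_dag",        # id of the DAG to trigger (placeholder)
        conf={"triggered_by": "controller"},  # optional payload for the target run
        wait_for_completion=False,            # fire and forget
    )
```
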
  11. Task Decorators - Taskflow API

    • Simplified way to define tasks
    • Getting to know @task decorator
    • Using @task to define tasks the way PythonOperator does (sketch below)
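
The earlier extract/load idea rewritten with the TaskFlow API; all names are illustrative. Returning a value pushes it to XCom, and calling one task with another's result wires the dependency:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def taskflow_demo():
    @task
    def extract() -> int:
        return 42  # pushed to XCom automatically

    @task
    def load(value: int):
        print(f"got {value}")  # pulled from XCom behind the scenes

    load(extract())  # also declares extract >> load


taskflow_demo()
```
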
  12. Testing - In Progress

    • Unit Testing
    • DAG Integrity Testing
    • The dag.test() method (see the test sketch below)
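
A sketch of the two common styles, assuming the DAGs live in dags/ and one of them has the id my_first_dag; dag.test() needs Airflow 2.5+:

```python
# test_dag_integrity.py (hypothetical file, run with pytest)
from airflow.models import DagBag


def test_no_import_errors():
    # Loading the folder surfaces syntax and import problems in every DAG file.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}


def test_dag_runs_end_to_end():
    # dag.test() executes one DAG run in-process, no scheduler required.
    dag = DagBag(dag_folder="dags/", include_examples=False).get_dag("my_first_dag")
    assert dag is not None
    dag.test()
```
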
  13. Dataset - Data-aware scheduling - In Progress

    • Trigger a DAG based on data availability
    • Wait for multiple datasets to be available (example below)
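
A producer/consumer sketch; the dataset URIs and DAG ids are made up for illustration, and Datasets need Airflow 2.4+:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

orders = Dataset("s3://example/orders.parquet")        # illustrative URI
customers = Dataset("s3://example/customers.parquet")  # illustrative URI

# Producer: a successful run of this task marks the dataset as updated.
with DAG(dag_id="produce_orders", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    PythonOperator(task_id="write_orders", python_callable=lambda: None,
                   outlets=[orders])

# Consumer: runs only once *both* datasets have been updated.
with DAG(dag_id="join_datasets", start_date=datetime(2024, 1, 1),
         schedule=[orders, customers], catchup=False):
    PythonOperator(task_id="join", python_callable=lambda: None)
```
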
  14. Celery Executor (Local)

    • Monitor task execution with the Flower UI (to enable it, see chapter-0)
    • Add more workers to the Celery Executor
      • Duplicate the airflow-worker service in docker-compose.yml and rename it
      • Restart the Docker Compose stack
  15. Dependencies between Tasks

    • Define basic dependencies between tasks
    • Fan-in and Fan-out
    • Trigger Rules
    • Conditional Trigger (a trigger-rule sketch follows this step)
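
A fan-out/fan-in sketch with a trigger rule on the joining task; every task id here is a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(dag_id="trigger_rules_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    start = EmptyOperator(task_id="start")
    branch_a = EmptyOperator(task_id="branch_a")
    branch_b = EmptyOperator(task_id="branch_b")

    # ALL_DONE makes the join run once every upstream task has finished,
    # whether it succeeded or failed (the default is ALL_SUCCESS).
    join = EmptyOperator(task_id="join", trigger_rule=TriggerRule.ALL_DONE)

    start >> [branch_a, branch_b] >> join  # fan-out, then fan-in
```
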
  16. Managing Complex Tasks with TaskGroup

    • Group tasks together
    • Define dependencies between TaskGroups (example below)
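
A sketch of two groups with a dependency declared between them as wholes; group and task ids are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="taskgroup_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    with TaskGroup(group_id="extract") as extract:
        EmptyOperator(task_id="from_api")
        EmptyOperator(task_id="from_db")

    with TaskGroup(group_id="load") as load:
        EmptyOperator(task_id="to_warehouse")

    # Every task in "extract" must finish before "load" starts.
    extract >> load
```
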
  17. SLA - Service Level Agreement

    • Define SLA for a DAG
    • Define SLA for a task
    • Define an SLA callback (sketch below)
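
A sketch of a task-level SLA plus a DAG-level miss callback (SLAs as shown exist in Airflow 2.x; the callback body is a stand-in for real alerting):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Invoked when a run misses an SLA; replace the print with real alerting.
    print(f"SLA missed: {slas}")


with DAG(
    dag_id="sla_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    sla_miss_callback=on_sla_miss,
) as dag:
    # Flag the run if this task hasn't finished within 10 minutes
    # of the scheduled run start.
    PythonOperator(
        task_id="slow_task",
        python_callable=lambda: None,
        sla=timedelta(minutes=10),
    )
```
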
  18. Airflow on Kubernetes - In Progress

    • Deploy Airflow on a Kubernetes cluster using Helm and kind
    • Use the Kubernetes Executor
    • Use KubernetesPodOperator (sketch below)
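
A minimal KubernetesPodOperator sketch; the namespace and image are assumptions, and note that the import path moved across cncf-kubernetes provider versions (older releases use operators.kubernetes_pod):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="k8s_pod_demo", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    KubernetesPodOperator(
        task_id="run_in_pod",
        name="hello-pod",          # pod name (placeholder)
        namespace="airflow",       # assumed namespace
        image="python:3.11-slim",  # arbitrary image for the demo
        cmds=["python", "-c"],
        arguments=["print('hello from a pod')"],
        get_logs=True,             # stream pod stdout into the task log
    )
```
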
  19. Build Airflow Docker Image

    • Build Airflow Docker Image (Poetry for Package Management)
  20. Working with DataHub - In Progress

    • Setup DataHub on Local Development
    • Emit Metadata to DataHub

About

Sharing Apache Airflow and making it easier to follow along hands-on during the session.
