So far in the course, we have processed data for the years 2019 and 2020. Your task is to extend the existing flow to include data for the year 2021.
As a hint, Kestra makes that process really easy:
- You can leverage the backfill functionality in the scheduled flow to backfill the data for the year 2021. Just make sure to select the time period for which data exists, i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
- Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using a `ForEach` task which triggers the flow for each combination using a `Subflow` task.
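To get a feel for how many executions either approach involves, the combinations can be enumerated in a few lines of Python. This is only a sketch for checking coverage, not part of the Kestra flow: the function names `tripdata_filename` and `backfill_combinations` are hypothetical, and the assumption that `green` files follow the same naming pattern as `yellow_tripdata_2020-12.csv` is ours.

```python
from itertools import product

TAXI_TYPES = ["yellow", "green"]  # values of the `taxi` input named in the assignment
YEAR = 2021
MONTHS = range(1, 8)  # January through July, the months for which data exists


def tripdata_filename(taxi: str, year: int, month: int) -> str:
    """Build a file name on the pattern of `yellow_tripdata_2020-12.csv`.

    Assumption: green taxi files use the same pattern.
    """
    return f"{taxi}_tripdata_{year}-{month:02d}.csv"


def backfill_combinations():
    """Yield one (taxi, 'YYYY-MM') pair per execution that needs to run."""
    for taxi, month in product(TAXI_TYPES, MONTHS):
        yield taxi, f"{YEAR}-{month:02d}"


if __name__ == "__main__":
    combos = list(backfill_combinations())
    print(len(combos))  # 2 taxi types x 7 months = 14 executions
    for taxi, year_month in combos:
        year, month = year_month.split("-")
        print(taxi, year_month, tripdata_filename(taxi, int(year), int(month)))
```

A list like this can serve as a checklist when verifying that every month and taxi type was processed, whichever of the two approaches you choose.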
Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
- 128.3 MB
- 134.5 MB
- 364.7 MB
- 692.6 MB
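When comparing a reported file size against a local copy, keep in mind that tools differ in whether "MB" means 1000² or 1024² bytes. The sketch below prints both for a byte count; the helper name `size_in_both_units` is hypothetical, and which convention the Kestra UI uses is not stated in this assignment.

```python
import os
from typing import Tuple


def size_in_both_units(num_bytes: int) -> Tuple[str, str]:
    """Return the size as decimal megabytes and as binary mebibytes."""
    mb = num_bytes / 1000**2   # decimal: 1 MB = 1,000,000 bytes
    mib = num_bytes / 1024**2  # binary: 1 MiB = 1,048,576 bytes
    return f"{mb:.1f} MB", f"{mib:.1f} MiB"


if __name__ == "__main__":
    # Assumption: the extracted CSV has been downloaded to the working directory.
    path = "yellow_tripdata_2020-12.csv"
    if os.path.exists(path):
        print(size_in_both_units(os.path.getsize(path)))
    else:
        print(f"{path} not found; download the output file first")
```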
What is the task run state of tasks `bq_green_tripdata`, `bq_green_tmp_table` and `bq_merge_green` when you run the flow with the `taxi` input set to value `yellow`?
- `SUCCESS`
- `FAILED`
- `SKIPPED`
- `CANCELED`
How do we deal with table schema in the Google Cloud ingestion pipeline?
- We don't define the schema at all because this is a data lake after all
- We let BigQuery autodetect the schema
- Kestra automatically infers the schema from the extracted data
- We explicitly define the schema in the tasks that create external source tables and final target tables
How does Kestra handle backfills in the scheduled flow?
- You need to define backfill properties in the flow configuration
- You have to run CLI commands to backfill the data
- You can run backfills directly from the UI from the Flow Triggers tab by selecting the time period
- Kestra doesn't support backfills
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw2 (TBD: we may prefer creating a form with Kestra Apps, if possible, so that we can automate the grading process)
- Check the link above to see the due date
Will be provided after the due date.