DM-49202: Implement ETL of APDB data to BigQuery for PPDB #10
base: main
Conversation
Force-pushed from 29cdb9f to bfd966a
Codecov Report — Attention: Patch coverage is …
Additional details and impacted files:
@@           Coverage Diff            @@
##             main      #10     +/-  ##
==========================================
- Coverage   44.70%   43.53%   -1.18%
==========================================
  Files          17       19       +2
  Lines         718      735      +17
  Branches       81       78       -3
==========================================
- Hits          321      320       -1
- Misses        363      384      +21
+ Partials       34       31       -3
Force-pushed from bfd966a to 699faf1
My biggest problem with this is that you read APDB data and upload to S3 sequentially in the same process. I think we agreed before that you want to do this in separate processes, with Parquet files staged on local disk until they are moved to S3. Otherwise, any problem on the S3 side will cause you to re-read the APDB, potentially many times, which is very undesirable.
Force-pushed from cbc2172 to 27d7c61
This should be resolved now. I split the process into an exporter, which writes the Parquet files locally, and an uploader, which copies them into GCS. Each runs independently in a separate process with a different CLI command. For now, bookkeeping is done with simple marker files on the local filesystem, but I can add a database for tracking this later, once some refactoring has been done. I will work on batch writing of the Parquet files next.
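For reference, here is a minimal sketch of the marker-file handoff between the two processes. The function names and staging layout are illustrative assumptions rather than the actual ppdb-replication code; it only shows the coordination idea (one directory per chunk, with `.ready`/`.uploaded`/`.failed` markers).

```python
from pathlib import Path

READY = ".ready"
UPLOADED = ".uploaded"
FAILED = ".failed"


def mark_chunk_ready(chunk_dir: Path) -> None:
    """Exporter side: flag a chunk directory once all its Parquet files are written."""
    (chunk_dir / READY).touch()


def find_ready_chunks(staging_root: Path) -> list[Path]:
    """Uploader side: discover chunks that are staged locally but not yet uploaded."""
    return sorted(marker.parent for marker in staging_root.glob(f"*/{READY}"))


def finish_chunk(chunk_dir: Path, ok: bool) -> None:
    """Uploader side: replace .ready with .uploaded on success, or .failed on error."""
    (chunk_dir / READY).unlink(missing_ok=True)
    (chunk_dir / (UPLOADED if ok else FAILED)).touch()
```

Because the exporter only ever creates `.ready` markers and the uploader only ever consumes them, either process can crash and be restarted without forcing a re-read of the APDB.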
Force-pushed from 61fff17 to 9f95d09
Force-pushed from f679278 to 9f48e69
Force-pushed from d9725aa to 7d5cbfa
Conflicts with mypy rule
Force-pushed from b8920f0 to ef701cb
Force-pushed from ef701cb to 9ad2b2c
Force-pushed from ab6b1cb to 6d96996
Force-pushed from a793b48 to 58fb5ef
This PR implements an ETL pipeline for exporting data from Cassandra and loading it into a BigQuery database. Replica chunks are written to Parquet files in a specific directory structure using the new `ppdb-replication export-chunks` command. When a chunk has been fully exported, its directory is marked with a `.ready` file, indicating that it can be uploaded to cloud storage.

The uploader copies the Parquet files into Google Cloud Storage by chunk and can be run using the `ppdb-replication upload-chunks` command. It also generates a manifest file with the chunk information. After a chunk is uploaded, the `.ready` file is replaced with an `.uploaded` file; if there is a failure, a `.failed` file is written instead. It is not safe to run more than one uploader process at a time, and it is not anticipated that this will be needed. (In the future, the marker files are planned to be replaced by a replica chunk database for this coordination, which may be designed so that multiple uploaders can run at once.) After successfully uploading a chunk, the uploader publishes an event to a Pub/Sub topic, triggering a cloud function that starts a Dataflow job to ingest the files into BigQuery.

The cloud function and the Dataflow job are implemented under `cloud_functions/stage_chunk`, which is deliberately separate from the existing Python source tree. Several helper scripts and a `Makefile` are included for deploying the cloud function and the Dataflow container and template. Currently, the `stage-chunk` job copies the data from the Parquet files directly into the BigQuery production tables; in the future this will be updated to use staging tables instead, to avoid situations where a failed load leaves the production tables in an inconsistent state.

External scripts were used to create the target BigQuery database and to set up the necessary cloud infrastructure (these are not included in this PR). These scripts and configuration files will eventually be moved into `idf_deploy`. A working environment can be set up using this shell script. This PR does not represent the final version of this ETL pipeline, but an interim, working version; additional enhancements will be added in separate Jira tickets.
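As a rough illustration of the upload step described above, a single chunk could be copied to GCS and announced on Pub/Sub along the lines below. The bucket layout, manifest fields, and topic name are assumptions made for the sketch, not the actual configuration used by `ppdb-replication upload-chunks`.

```python
import json
from pathlib import Path

from google.cloud import pubsub_v1, storage


def upload_chunk(chunk_dir: Path, bucket_name: str, project: str, topic: str) -> None:
    """Copy one staged chunk's Parquet files to GCS and announce it on Pub/Sub."""
    bucket = storage.Client(project=project).bucket(bucket_name)

    uploaded = []
    for parquet_file in sorted(chunk_dir.glob("*.parquet")):
        blob = bucket.blob(f"chunks/{chunk_dir.name}/{parquet_file.name}")
        blob.upload_from_filename(str(parquet_file))
        uploaded.append(blob.name)

    # Write a manifest describing the chunk alongside the data files.
    manifest = {"chunk_id": chunk_dir.name, "files": uploaded}
    bucket.blob(f"chunks/{chunk_dir.name}/manifest.json").upload_from_string(
        json.dumps(manifest), content_type="application/json"
    )

    # Publish an event so the cloud function can start the Dataflow ingest job.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    publisher.publish(topic_path, data=json.dumps(manifest).encode("utf-8")).result()
```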
TODO
- … (`ppdb-replication run` command). These settings could also be used to reduce load on the APDB, should this be necessary.
- … `apdb._schema.schemaVersion()` (should this be exposed in the `ApdbReplica` interface in `dax_apdb`?)
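For completeness, a Pub/Sub-triggered function that launches a Dataflow flex-template job, as described in the summary above, could look roughly like the sketch below. The entry-point name, environment variables, and message format are assumptions and do not reflect the actual code under `cloud_functions/stage_chunk`.

```python
import base64
import json
import os

import functions_framework
from googleapiclient.discovery import build


@functions_framework.cloud_event
def stage_chunk(cloud_event):
    """Start a Dataflow flex-template job for the chunk announced on Pub/Sub."""
    # Pub/Sub delivers the payload base64-encoded inside the CloudEvent data.
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    manifest = json.loads(payload)  # assumed shape: {"chunk_id": ..., "files": [...]}

    project = os.environ["GCP_PROJECT"]          # assumed deployment settings
    region = os.environ["DATAFLOW_REGION"]
    template = os.environ["TEMPLATE_GCS_PATH"]   # e.g. gs://<bucket>/templates/stage_chunk.json

    dataflow = build("dataflow", "v1b3")
    response = (
        dataflow.projects()
        .locations()
        .flexTemplates()
        .launch(
            projectId=project,
            location=region,
            body={
                "launchParameter": {
                    "jobName": f"stage-chunk-{manifest['chunk_id']}",
                    "containerSpecGcsPath": template,
                    "parameters": {"chunk_id": str(manifest["chunk_id"])},
                }
            },
        )
        .execute()
    )
    print("Launched Dataflow job:", response["job"]["id"])
```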