DM-49202: Implement ETL of APDB data to BigQuery for PPDB #10

Open

JeremyMcCormick wants to merge 38 commits into main from tickets/DM-49202

Conversation

@JeremyMcCormick (Contributor) commented Apr 9, 2025

This PR implements an ETL pipeline for exporting APDB data from Cassandra and loading it into BigQuery.

Replica chunks are written to Parquet files in a specific directory structure using the new ppdb-replication export-chunks command. When a chunk has been fully exported, its directory is marked with a .ready file, indicating that it can be uploaded to cloud storage.
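
As a rough illustration of the exporter's handoff convention, the sketch below writes one chunk's tables to Parquet and then drops the .ready marker. The chunk_<id> directory naming and per-table file names are assumptions for illustration, not necessarily the exact layout produced by export-chunks.

```python
# Illustrative sketch only: directory and file naming are assumptions,
# not the exact convention used by `ppdb-replication export-chunks`.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def export_chunk(base_dir: Path, chunk_id: int, tables: dict[str, pa.Table]) -> None:
    """Write one replica chunk's tables to Parquet and mark it ready for upload."""
    chunk_dir = base_dir / f"chunk_{chunk_id}"  # hypothetical naming scheme
    chunk_dir.mkdir(parents=True, exist_ok=True)

    # One Parquet file per APDB table (e.g. DiaObject, DiaSource, DiaForcedSource).
    for table_name, table in tables.items():
        pq.write_table(table, chunk_dir / f"{table_name}.parquet")

    # The marker file signals that the chunk is complete and safe to upload.
    (chunk_dir / ".ready").touch()
```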

The uploader copies the Parquet files into Google Cloud Storage chunk by chunk and can be run using the ppdb-replication upload-chunks command. It also generates a manifest file with the chunk information. After a chunk is uploaded, its .ready file is replaced with an .uploaded file; if the upload fails, a .failed file is written instead. It is not safe to run more than one uploader process at a time, and it is not anticipated that this would be needed. (In the future, the marker files are planned to be replaced with a replica chunk database for this coordination, which may be designed so that multiple uploaders can run at once.) After successfully uploading a chunk, the uploader publishes an event to a Pub/Sub topic, which triggers a cloud function that starts a Dataflow job to ingest the files into BigQuery.
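
The upload, mark, and notify sequence can be sketched roughly as follows. The bucket layout, manifest fields, and Pub/Sub message format shown here are assumptions for illustration, not the exact implementation behind upload-chunks.

```python
# Hedged sketch of the upload/mark/notify sequence; names are illustrative.
import json
from pathlib import Path

from google.cloud import pubsub_v1, storage


def upload_chunk(chunk_dir: Path, bucket_name: str, project: str, topic: str) -> None:
    """Upload one exported chunk to GCS, flip its marker file, and notify Pub/Sub."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    prefix = chunk_dir.name  # e.g. "chunk_12345" (hypothetical)

    try:
        # Copy each Parquet file for this chunk into GCS.
        files = sorted(chunk_dir.glob("*.parquet"))
        for path in files:
            bucket.blob(f"{prefix}/{path.name}").upload_from_filename(str(path))

        # Write a manifest describing the uploaded chunk.
        manifest = {"chunk": prefix, "files": [p.name for p in files]}
        bucket.blob(f"{prefix}/manifest.json").upload_from_string(json.dumps(manifest))

        # Flip the local marker: .ready -> .uploaded.
        (chunk_dir / ".ready").unlink(missing_ok=True)
        (chunk_dir / ".uploaded").touch()
    except Exception:
        # Record the failure so the chunk is not silently retried or skipped.
        (chunk_dir / ".failed").touch()
        raise

    # Notify the cloud function that a new chunk is staged in GCS.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    publisher.publish(topic_path, data=json.dumps(manifest).encode("utf-8")).result()
```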

The cloud function and the Dataflow job are implemented under cloud_functions/stage_chunk, which is deliberately kept separate from the existing Python source tree. Several helper scripts and a Makefile are included for deploying the cloud function and the Dataflow container and template. Currently, the stage-chunk job copies the data from the Parquet files directly into the BigQuery production tables; in the future this will be updated to use staging tables instead, to avoid situations where a failed load leaves the production tables in an inconsistent state.
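
For reference, a minimal Beam pipeline in the spirit of the stage-chunk job might look like the sketch below. The table reference and Parquet path pattern are placeholders; the real job lives under cloud_functions/stage_chunk and is launched from a Dataflow template.

```python
# Minimal sketch, not the actual stage-chunk implementation: read a chunk's
# Parquet files from GCS and append the rows to a BigQuery table.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(parquet_pattern: str, bq_table: str, pipeline_args: list[str]) -> None:
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadChunkParquet" >> beam.io.ReadFromParquet(parquet_pattern)
            | "WriteToBigQuery"
            >> beam.io.WriteToBigQuery(
                bq_table,  # e.g. "project:dataset.DiaSource" (hypothetical)
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )
```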

External scripts were used to create the target BigQuery database and to set up the necessary cloud infrastructure (these are not included in this PR). These scripts and configuration files will eventually be moved into idf_deploy. A working environment can be set up using this shell script.

This PR does not represent the final version of this ETL pipeline but rather an interim, working version. Additional enhancements will be added under separate Jira tickets.

TODO

  • In the Beam job script, get the parameters from the manifest file instead of inferring them from the GCS object name
  • Add options to the uploader for specifying a wait time between chunk uploads, to avoid overloading BigQuery with updates, as well as the time to wait between scans for chunks that are ready (similar to the existing ppdb-replication run command); see the sketch after this list. These settings could also be used to reduce load on the APDB, should that become necessary.
  • Find a better way to get schema version than apdb._schema.schemaVersion() (should this be exposed in the ApdbReplica interface in dax_apdb?)
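
A rough sketch of how the proposed wait options for the uploader could work (not part of this PR; the option and helper names are hypothetical):

```python
# Hypothetical polling loop for the uploader with configurable wait times.
import time
from pathlib import Path
from typing import Callable


def upload_loop(
    base_dir: Path,
    upload_one: Callable[[Path], None],
    upload_wait: float,
    scan_wait: float,
) -> None:
    """Upload ready chunks, sleeping between uploads and between scans."""
    while True:
        ready = sorted(d for d in base_dir.iterdir() if (d / ".ready").exists())
        for chunk_dir in ready:
            upload_one(chunk_dir)    # e.g. the upload_chunk sketch above
            time.sleep(upload_wait)  # throttle the rate of BigQuery updates
        time.sleep(scan_wait)        # pause before scanning for new chunks
```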

@JeremyMcCormick marked this pull request as draft April 9, 2025 22:37
codecov bot commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 5.26316% with 54 lines in your changes missing coverage. Please review.

Project coverage is 43.53%. Comparing base (bee930f) to head (7d5cbfa).
Report is 5 commits behind head on main.

File                                                 Patch %   Missing lines
python/lsst/dax/ppdb/replicator.py                   8.82%     31 ⚠️
python/lsst/dax/ppdb/scripts/export_chunks_run.py    0.00%     15 ⚠️
python/lsst/dax/ppdb/scripts/upload_chunks_run.py    0.00%     4 ⚠️
python/lsst/dax/ppdb/scripts/__init__.py             0.00%     2 ⚠️
python/lsst/dax/ppdb/scripts/replication_run.py      0.00%     2 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
- Coverage   44.70%   43.53%   -1.18%     
==========================================
  Files          17       19       +2     
  Lines         718      735      +17     
  Branches       81       78       -3     
==========================================
- Hits          321      320       -1     
- Misses        363      384      +21     
+ Partials       34       31       -3     

@JeremyMcCormick changed the title from "DM-49202: Implement export of APDB for PPDB in BigQuery" to "DM-49202: Implement export of APDB data for PPDB in BigQuery" Apr 10, 2025
@andy-slac (Collaborator) left a comment


My biggest problem with this is that you read APDB data and upload to S3 in the same process, sequentially. I think we agreed before that you want to do it in separate processes, with Parquet files staged on local disk until they are moved to S3. Otherwise any problem on the S3 side will cause you to re-read the APDB, potentially many times, which is very undesirable.

@JeremyMcCormick force-pushed the tickets/DM-49202 branch 4 times, most recently from cbc2172 to 27d7c61, April 17, 2025 21:20
@JeremyMcCormick (Contributor, Author) commented Apr 17, 2025

@andy-slac:

> My biggest problem with this is that you read APDB data and upload to S3 in the same process, sequentially.

This should be resolved now. I split the process into an exporter for writing the Parquet files locally and an uploader for copying them into GCS. Each runs independently in a separate process with a different CLI command. For now, bookkeeping is being done with simple marker files on the local filesystem, but I can add a database for tracking this later once some refactoring has been done.

I will work on the batch writing of the Parquet files next.

@JeremyMcCormick changed the title from "DM-49202: Implement export of APDB data for PPDB in BigQuery" to "DM-49202: Implement ETL of APDB data to BigQuery for PPDB" Apr 23, 2025
@JeremyMcCormick marked this pull request as ready for review April 28, 2025 23:14
@JeremyMcCormick force-pushed the tickets/DM-49202 branch 4 times, most recently from d9725aa to 7d5cbfa, April 30, 2025 03:49
@JeremyMcCormick force-pushed the tickets/DM-49202 branch 3 times, most recently from b8920f0 to ef701cb, April 30, 2025 07:09