DM-49202: Implement ETL of APDB data to BigQuery for PPDB #10
base: main
Conversation
Force-pushed from 29cdb9f to bfd966a
Codecov Report — Attention: Patch coverage is …
Additional details and impacted files:
@@           Coverage Diff            @@
##             main      #10     +/-  ##
==========================================
- Coverage   44.70%   43.53%   -1.18%
==========================================
  Files          17       19       +2
  Lines         718      735      +17
  Branches       81       78       -3
==========================================
- Hits          321      320       -1
- Misses        363      384      +21
+ Partials       34       31       -3
Force-pushed from bfd966a to 699faf1
My biggest problem with this is that you read APDB data and upload to S3 sequentially in the same process. I think we agreed before that you want to do this in separate processes, with Parquet files staged on local disk until they are moved to S3. Otherwise, any problem on the S3 side will cause you to re-read the APDB, potentially many times, which is very undesirable.
Force-pushed from cbc2172 to 27d7c61
This should be resolved now. I split the process into an exporter, which writes the Parquet files locally, and an uploader, which copies them into GCS. Each runs independently in a separate process with a different CLI command. For now, bookkeeping is done with simple marker files on the local filesystem, but I can add a database for tracking this later, once some refactoring has been done. I will work on batch writing of the Parquet files next.
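For reference, here is a minimal sketch of the marker-file handoff between the two processes. The function names and staging layout are illustrative assumptions rather than the actual ppdb-replication code; it only shows the coordination idea (one directory per chunk, with `.ready`/`.uploaded`/`.failed` markers).

```python
from pathlib import Path

READY = ".ready"
UPLOADED = ".uploaded"
FAILED = ".failed"


def mark_chunk_ready(chunk_dir: Path) -> None:
    """Exporter side: flag a chunk directory once all its Parquet files are written."""
    (chunk_dir / READY).touch()


def find_ready_chunks(staging_root: Path) -> list[Path]:
    """Uploader side: discover chunks that are staged locally but not yet uploaded."""
    return sorted(marker.parent for marker in staging_root.glob(f"*/{READY}"))


def finish_chunk(chunk_dir: Path, ok: bool) -> None:
    """Uploader side: replace .ready with .uploaded on success, or .failed on error."""
    (chunk_dir / READY).unlink(missing_ok=True)
    (chunk_dir / (UPLOADED if ok else FAILED)).touch()
```

Because the exporter only ever creates `.ready` markers and the uploader only ever consumes them, either process can crash and be restarted without forcing a re-read of the APDB.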
Force-pushed from 61fff17 to 9f95d09
Force-pushed from f679278 to 9f48e69
Force-pushed from d9725aa to 7d5cbfa
Conflicts with mypy rule
Force-pushed from b8920f0 to ef701cb
Force-pushed from ef701cb to 9ad2b2c
Force-pushed from ab6b1cb to 6d96996
Force-pushed from a793b48 to 58fb5ef
This PR implements an ETL pipeline for exporting data from Cassandra and loading it into a BigQuery database. Replica chunks are written to Parquet files in a specific directory structure using the new `ppdb-replication export-chunks` command. When a chunk has been fully exported, its directory is marked with a `.ready` file, indicating that it can be uploaded to cloud storage.

The uploader copies the Parquet files into Google Cloud Storage by chunk and can be run using the `ppdb-replication upload-chunks` command. It also generates a manifest file with the chunk information. After a chunk is uploaded, the `.ready` file is replaced with an `.uploaded` file; if there is a failure, a `.failed` file is written instead. It is not safe to run more than one uploader process at a time, and it is not anticipated that this will be needed. (In the future, the marker files are planned to be replaced by a replica chunk database for this coordination, which may be designed so that multiple uploaders can run at once.) After successfully uploading a chunk, the uploader publishes an event to a Pub/Sub topic, triggering a cloud function that starts a Dataflow job to ingest the files into BigQuery.

The cloud function and the Dataflow job are implemented under `cloud_functions/stage_chunk`, which is deliberately separate from the existing Python source tree. Several helper scripts and a `Makefile` are included for deploying the cloud function and the Dataflow container and template. Currently, the `stage-chunk` job copies the data from the Parquet files directly into the BigQuery production tables; in the future this will be updated to use staging tables instead, to avoid situations where a failed load leaves the production tables in an inconsistent state.

External scripts were used to create the target BigQuery database and to set up the necessary cloud infrastructure (these are not included in this PR). These scripts and configuration files will eventually be moved into `idf_deploy`. A working environment can be set up using this shell script. This PR does not represent the final version of this ETL pipeline, but an interim, working version; additional enhancements will be added in separate Jira tickets.
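As a rough illustration of the upload step described above, a single chunk could be copied to GCS and announced on Pub/Sub along the lines below. The bucket layout, manifest fields, and topic name are assumptions made for the sketch, not the actual configuration used by `ppdb-replication upload-chunks`.

```python
import json
from pathlib import Path

from google.cloud import pubsub_v1, storage


def upload_chunk(chunk_dir: Path, bucket_name: str, project: str, topic: str) -> None:
    """Copy one staged chunk's Parquet files to GCS and announce it on Pub/Sub."""
    bucket = storage.Client(project=project).bucket(bucket_name)

    uploaded = []
    for parquet_file in sorted(chunk_dir.glob("*.parquet")):
        blob = bucket.blob(f"chunks/{chunk_dir.name}/{parquet_file.name}")
        blob.upload_from_filename(str(parquet_file))
        uploaded.append(blob.name)

    # Write a manifest describing the chunk alongside the data files.
    manifest = {"chunk_id": chunk_dir.name, "files": uploaded}
    bucket.blob(f"chunks/{chunk_dir.name}/manifest.json").upload_from_string(
        json.dumps(manifest), content_type="application/json"
    )

    # Publish an event so the cloud function can start the Dataflow ingest job.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    publisher.publish(topic_path, data=json.dumps(manifest).encode("utf-8")).result()
```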
TODO
- … (`ppdb-replication run` command). These settings could also be used to reduce load on the APDB, should this be necessary.
- … `apdb._schema.schemaVersion()` (should this be exposed in the `ApdbReplica` interface in `dax_apdb`?)
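For completeness, a Pub/Sub-triggered function that launches a Dataflow flex-template job, as described in the summary above, could look roughly like the sketch below. The entry-point name, environment variables, and message format are assumptions and do not reflect the actual code under `cloud_functions/stage_chunk`.

```python
import base64
import json
import os

import functions_framework
from googleapiclient.discovery import build


@functions_framework.cloud_event
def stage_chunk(cloud_event):
    """Start a Dataflow flex-template job for the chunk announced on Pub/Sub."""
    # Pub/Sub delivers the payload base64-encoded inside the CloudEvent data.
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    manifest = json.loads(payload)  # assumed shape: {"chunk_id": ..., "files": [...]}

    project = os.environ["GCP_PROJECT"]          # assumed deployment settings
    region = os.environ["DATAFLOW_REGION"]
    template = os.environ["TEMPLATE_GCS_PATH"]   # e.g. gs://<bucket>/templates/stage_chunk.json

    dataflow = build("dataflow", "v1b3")
    response = (
        dataflow.projects()
        .locations()
        .flexTemplates()
        .launch(
            projectId=project,
            location=region,
            body={
                "launchParameter": {
                    "jobName": f"stage-chunk-{manifest['chunk_id']}",
                    "containerSpecGcsPath": template,
                    "parameters": {"chunk_id": str(manifest["chunk_id"])},
                }
            },
        )
        .execute()
    )
    print("Launched Dataflow job:", response["job"]["id"])
```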