-
@we-ai The second approach would benefit our QC process significantly. Currently we have no way of knowing which data have been updated recently, because the BQ tables are regenerated from scratch every time the sync occurs. So we run the QC checks on the entire dataset every time, which is very computationally inefficient. With the approach you propose, we could run our QC checks on only the newly imported data. Another benefit is that the schemas for the BigQuery tables could be persistent. Currently the schemas can change from backup to backup because the tables are rewritten with each backup. We could define the schema in advance of data structure changes, so that we don't get caught off guard.
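To illustrate what "QC on only the newly imported data" could look like, here is a rough sketch only. It assumes the incremental sync stamps each row with an ingestion timestamp; the `synced_at` column, the dataset/table names, and the QC rule below are placeholders, not our real schema:

```ts
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// Placeholder dataset/table/column names for illustration only.
async function qcNewRows(since: Date): Promise<void> {
  const [rows] = await bigquery.query({
    query: `
      SELECT COUNT(*) AS missing_token
      FROM \`my_project.connect_dataset.participants\`
      WHERE synced_at > @since   -- hypothetical ingestion-timestamp column
        AND token IS NULL        -- example QC rule: required field is present
    `,
    params: { since },
  });
  console.log('Newly imported rows failing QC:', rows[0].missing_token);
}

// Check only data imported in the last 24 hours.
qcNewRows(new Date(Date.now() - 24 * 3600 * 1000)).catch(console.error);
```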
-
Full Firestore backups need to happen at least once per day
-
For syncing data from Firestore to BigQuery, the Stream Firestore to BigQuery extension might be helpful.
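For context, my understanding (hedged, from the extension's public docs) is that it streams each document write into a BigQuery changelog table plus a `_raw_latest` view, which would give us incremental data in BigQuery without a full rebuild. A rough sketch of reading its output, with placeholder project/dataset names and assuming the extension was configured with table ID `participants`:

```ts
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// Names below are assumptions; the extension's actual table/view names depend on its configuration.
async function latestParticipantState() {
  const [rows] = await bigquery.query(`
    SELECT document_name, operation, timestamp, data
    FROM \`my_project.firestore_export.participants_raw_latest\`
    ORDER BY timestamp DESC
    LIMIT 10
  `);
  return rows;
}
```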
-
During today's discussion, there were suggestions about developing our own solution to incrementally back up data from Firestore to a Cloud Storage bucket. There are two main steps to it:

The difficulty with this incremental backup approach is that it's not compatible with Google's solution, meaning we would have to develop the whole thing ourselves. That's still achievable, but it will take time and effort.

The main hurdle is file encryption. Files/objects in Cloud Storage buckets are encrypted, and each collection exported from Firestore is encrypted into a generated file. To update the content of a saved collection file in the bucket (scenario 2a above), we need to download the file (which may be large), decrypt it, make the changes, and write it back to the bucket. Alternatively, we can leave the exported full-collection file untouched and maintain one or more update files (scenario 2b above), then decrypt and combine those files when we need to recover data. Both scenarios require extra safeguards and maintenance from us to ensure data integrity. We could remove file encryption to make file updates smoother, but we shouldn't do so. A rough sketch of the scenario 2b idea is included below.

Any comments? @jonasalmeida @danielruss @FrogGirl1123 @anthonypetersen @JoeArmani
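To make scenario 2b a bit more concrete, here is a minimal sketch: the full export stays untouched, incremental changes land in small per-collection delta files, and recovery merges the base export with the deltas in order. The bucket name, file layout, and JSON-lines format are assumptions for illustration (the AVRO idea discussed elsewhere in this thread could replace JSON lines), and encryption/key handling is left out of the sketch.

```ts
import { Storage } from '@google-cloud/storage';

const storage = new Storage();
const bucket = storage.bucket('my-firestore-backups'); // placeholder bucket name

// Scenario 2b sketch: never rewrite the full export; append small delta files instead,
// then merge base + deltas at recovery time.
async function writeDelta(collection: string, changedDocs: Array<{ id: string }>) {
  const name = `deltas/${collection}/${Date.now()}.jsonl`;
  const lines = changedDocs.map((d) => JSON.stringify(d)).join('\n');
  await bucket.file(name).save(lines);
}

async function restoreCollection(collection: string): Promise<Map<string, unknown>> {
  const docs = new Map<string, unknown>();

  // Start from the last full export...
  const [baseContents] = await bucket.file(`full/${collection}.jsonl`).download();
  for (const line of baseContents.toString().split('\n').filter(Boolean)) {
    const doc = JSON.parse(line) as { id: string };
    docs.set(doc.id, doc);
  }

  // ...then replay delta files in chronological order; later deltas win.
  const [deltaFiles] = await bucket.getFiles({ prefix: `deltas/${collection}/` });
  deltaFiles.sort((a, b) => a.name.localeCompare(b.name));
  for (const file of deltaFiles) {
    const [contents] = await file.download();
    for (const line of contents.toString().split('\n').filter(Boolean)) {
      const doc = JSON.parse(line) as { id: string };
      docs.set(doc.id, doc);
    }
  }
  return docs;
}
```

The trade-off is that restores get slower as deltas accumulate, so the deltas would need to be compacted into a fresh full export periodically, which lines up with keeping the bulk export but running it less often.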
-
Here’s a possibility. Please know I’m unaware of actual requirements and have only briefly looked into this.
• This is a rough idea to significantly lower reads/writes/data usage while maintaining Cloud Storage consistency with Firestore data.
• Cloud Storage can’t be queried like Firestore, so we optimize the file path structure for fetching based on Firestore collection names and document IDs.
• The AVRO file structure is lighter weight and faster than JSON and is compatible with Firestore and BigQuery. Parsing may require a dependency such as this one: https://www.npmjs.com/package/avro-js (see the sketch below).
• Timing and flow of these operations could be handled in several ways.
• We could work with this diff data intraday and do the costly bulk export and BQ rebuild operations much less frequently.
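A minimal sketch of the path-structure and AVRO pieces, assuming a placeholder bucket name, a made-up `Participant` record schema, and the avro-js package linked above (illustrative only, not the actual requirements):

```ts
import avro from 'avro-js';            // no official TS types; treat as an untyped dependency
import { Storage } from '@google-cloud/storage';

const storage = new Storage();
const bucket = storage.bucket('my-firestore-mirror'); // placeholder bucket

// Hypothetical record schema; the real one would mirror our Firestore document shape.
const participantType = avro.parse({
  type: 'record',
  name: 'Participant',
  fields: [
    { name: 'id', type: 'string' },
    { name: 'token', type: ['null', 'string'], default: null },
    { name: 'updatedAt', type: 'long' },
  ],
});

// Path is derived from collection name + document ID so a single document can be
// fetched directly, since Cloud Storage can't be queried like Firestore.
function objectPath(collection: string, docId: string): string {
  return `${collection}/${docId}.avro`;
}

async function mirrorDocument(collection: string, docId: string, data: object) {
  const buf = participantType.toBuffer(data);            // Avro-encode the document
  await bucket.file(objectPath(collection, docId)).save(buf);
}

async function readMirroredDocument(collection: string, docId: string) {
  const [buf] = await bucket.file(objectPath(collection, docId)).download();
  return participantType.fromBuffer(buf);                // decode back to an object
}
```

One object per document keeps reads and writes small and lets a single document be fetched or overwritten by path without touching the rest of the collection.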
-
Currently our Firestore data backup step is upstream of the data updates to BigQuery: Firestore collections --> data backups in Bucket --> data loaded to BigQuery. This flow has the following drawbacks:
A better approach would be to decouple Firestore data backups from the data syncs to BigQuery:
Granular data syncs to BigQuery, like the one mentioned above, can be a bit challenging to design and implement. There are likely multiple ways to achieve the goal, so discussion/brainstorming is needed to land on a solution that meets our production needs. A rough sketch of one possible shape is below.
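As one starting point for that brainstorm (hedged: collection, dataset, and table names are placeholders, this uses the firebase-functions v1 trigger style, and it assumes the BigQuery table and schema are defined up front), a Firestore onWrite trigger could stream each document change into a changelog table, independently of the daily full backup:

```ts
import * as functions from 'firebase-functions';
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// One possible shape of a granular sync: every write to a Firestore collection is
// appended to a BigQuery changelog table as it happens.
export const syncParticipantWrites = functions.firestore
  .document('participants/{docId}')
  .onWrite(async (change, context) => {
    const row = {
      document_id: context.params.docId,
      operation: !change.after.exists ? 'DELETE' : !change.before.exists ? 'CREATE' : 'UPDATE',
      data: change.after.exists ? JSON.stringify(change.after.data()) : null,
      changed_at: context.timestamp,
    };
    // Streaming insert; assumes the table and schema were created ahead of time.
    await bigquery.dataset('connect_dataset').table('participants_changelog').insert([row]);
  });
```

This keeps the two concerns decoupled: the trigger only handles incremental changes into BigQuery, while the full Firestore backup to the bucket can run on its own schedule.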