-
@we-ai The second approach would benefit our QC process significantly. Currently we have no way of knowing which data have been updated recently, because the BQ tables are regenerated from scratch every time the sync occurs. So we run the QC checks on the entire dataset every time, which is very computationally inefficient. With the approach you propose, we could run our QC checks on only the newly imported data. Another benefit is that the schemas for the BigQuery tables could be persistent. Currently the schemas can change from backup to backup because the tables are rewritten with each backup. We could define the schema in advance of data structure changes, so that we don't get caught off guard.
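To illustrate what "QC on only the newly imported data" could look like, here is a rough sketch only. It assumes the incremental sync stamps each row with an ingestion timestamp; the `synced_at` column, the dataset/table names, and the QC rule below are placeholders, not our real schema:

```ts
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// Placeholder dataset/table/column names for illustration only.
async function qcNewRows(since: Date): Promise<void> {
  const [rows] = await bigquery.query({
    query: `
      SELECT COUNT(*) AS missing_token
      FROM \`my_project.connect_dataset.participants\`
      WHERE synced_at > @since   -- hypothetical ingestion-timestamp column
        AND token IS NULL        -- example QC rule: required field is present
    `,
    params: { since },
  });
  console.log('Newly imported rows failing QC:', rows[0].missing_token);
}

// Check only data imported in the last 24 hours.
qcNewRows(new Date(Date.now() - 24 * 3600 * 1000)).catch(console.error);
```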
-
Full Firestore backups need to happen at least once per day
-
For syncing data from Firestore to BigQuery, the Stream Firestore to BigQuery extension might be helpful.
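For context, my understanding (hedged, from the extension's public docs) is that it streams each document write into a BigQuery changelog table plus a `_raw_latest` view, which would give us incremental data in BigQuery without a full rebuild. A rough sketch of reading its output, with placeholder project/dataset names and assuming the extension was configured with table ID `participants`:

```ts
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// Names below are assumptions; the extension's actual table/view names depend on its configuration.
async function latestParticipantState() {
  const [rows] = await bigquery.query(`
    SELECT document_name, operation, timestamp, data
    FROM \`my_project.firestore_export.participants_raw_latest\`
    ORDER BY timestamp DESC
    LIMIT 10
  `);
  return rows;
}
```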
-
During today's discussion, there were suggestions about developing our own solution to incrementally back up data from Firestore to a Cloud Storage bucket. There are two main steps to it:

The difficulty with this incremental backup approach is that it's not compatible with Google's solution, meaning we would have to develop the whole thing ourselves. That's still achievable, but it will take time and effort.

The main hurdle is file encryption. Files/objects in Cloud Storage buckets are encrypted, and each collection exported from Firestore is encrypted into a generated file. To update the content of a saved collection file in the bucket (scenario 2a above), we need to download the file (which may be large), decrypt it, make the changes, and write it back to the bucket. Alternatively, we can leave the exported full-collection file untouched and maintain one or more update files (scenario 2b above), then decrypt and combine those files when we need to recover data. Both scenarios require extra safeguards and maintenance from us to ensure data integrity. We could remove file encryption to make file updates smoother, but we shouldn't do so. A rough sketch of the scenario 2b idea is included below.

Any comments? @jonasalmeida @danielruss @FrogGirl1123 @anthonypetersen @JoeArmani
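To make scenario 2b a bit more concrete, here is a minimal sketch: the full export stays untouched, incremental changes land in small per-collection delta files, and recovery merges the base export with the deltas in order. The bucket name, file layout, and JSON-lines format are assumptions for illustration (the AVRO idea discussed elsewhere in this thread could replace JSON lines), and encryption/key handling is left out of the sketch.

```ts
import { Storage } from '@google-cloud/storage';

const storage = new Storage();
const bucket = storage.bucket('my-firestore-backups'); // placeholder bucket name

// Scenario 2b sketch: never rewrite the full export; append small delta files instead,
// then merge base + deltas at recovery time.
async function writeDelta(collection: string, changedDocs: Array<{ id: string }>) {
  const name = `deltas/${collection}/${Date.now()}.jsonl`;
  const lines = changedDocs.map((d) => JSON.stringify(d)).join('\n');
  await bucket.file(name).save(lines);
}

async function restoreCollection(collection: string): Promise<Map<string, unknown>> {
  const docs = new Map<string, unknown>();

  // Start from the last full export...
  const [baseContents] = await bucket.file(`full/${collection}.jsonl`).download();
  for (const line of baseContents.toString().split('\n').filter(Boolean)) {
    const doc = JSON.parse(line) as { id: string };
    docs.set(doc.id, doc);
  }

  // ...then replay delta files in chronological order; later deltas win.
  const [deltaFiles] = await bucket.getFiles({ prefix: `deltas/${collection}/` });
  deltaFiles.sort((a, b) => a.name.localeCompare(b.name));
  for (const file of deltaFiles) {
    const [contents] = await file.download();
    for (const line of contents.toString().split('\n').filter(Boolean)) {
      const doc = JSON.parse(line) as { id: string };
      docs.set(doc.id, doc);
    }
  }
  return docs;
}
```

The trade-off is that restores get slower as deltas accumulate, so the deltas would need to be compacted into a fresh full export periodically, which lines up with keeping the bulk export but running it less often.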
-
Here’s a possibility. Please know I’m unaware of actual requirements and have only briefly looked into this.
• This is a rough idea to significantly lower reads/writes/data usage while maintaining Cloud Storage consistency with Firestore data.
• Cloud Storage can’t be queried like Firestore, so we optimize the file path structure for fetching based on Firestore collection names and document IDs.
• The AVRO file structure is lighter weight and faster than JSON and is compatible with Firestore and BigQuery. Parsing may require a dependency such as this one: https://www.npmjs.com/package/avro-js (see the sketch below).
• Timing and flow of these operations could be handled in several ways.
• We could work with this diff data intraday and do the costly bulk export and BQ rebuild operations much less frequently.
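A minimal sketch of the path-structure and AVRO pieces, assuming a placeholder bucket name, a made-up `Participant` record schema, and the avro-js package linked above (illustrative only, not the actual requirements):

```ts
import avro from 'avro-js';            // no official TS types; treat as an untyped dependency
import { Storage } from '@google-cloud/storage';

const storage = new Storage();
const bucket = storage.bucket('my-firestore-mirror'); // placeholder bucket

// Hypothetical record schema; the real one would mirror our Firestore document shape.
const participantType = avro.parse({
  type: 'record',
  name: 'Participant',
  fields: [
    { name: 'id', type: 'string' },
    { name: 'token', type: ['null', 'string'], default: null },
    { name: 'updatedAt', type: 'long' },
  ],
});

// Path is derived from collection name + document ID so a single document can be
// fetched directly, since Cloud Storage can't be queried like Firestore.
function objectPath(collection: string, docId: string): string {
  return `${collection}/${docId}.avro`;
}

async function mirrorDocument(collection: string, docId: string, data: object) {
  const buf = participantType.toBuffer(data);            // Avro-encode the document
  await bucket.file(objectPath(collection, docId)).save(buf);
}

async function readMirroredDocument(collection: string, docId: string) {
  const [buf] = await bucket.file(objectPath(collection, docId)).download();
  return participantType.fromBuffer(buf);                // decode back to an object
}
```

One object per document keeps reads and writes small and lets a single document be fetched or overwritten by path without touching the rest of the collection.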
-
Currently our Firestore data backup step is upstream of the data updates to BigQuery: Firestore collections --> data backups in Bucket --> data loaded to BigQuery. This flow has the following drawbacks:
A better approach would be to decouple Firestore data backups from the data syncs to BigQuery:
Granular data syncs to BigQuery, like the one mentioned above, can be a bit challenging to design and implement. There are likely multiple ways to achieve the goal, so discussion/brainstorming is needed to land on a solution that meets our production needs. A rough sketch of one possible shape is below.
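As one starting point for that brainstorm (hedged: collection, dataset, and table names are placeholders, this uses the firebase-functions v1 trigger style, and it assumes the BigQuery table and schema are defined up front), a Firestore onWrite trigger could stream each document change into a changelog table, independently of the daily full backup:

```ts
import * as functions from 'firebase-functions';
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// One possible shape of a granular sync: every write to a Firestore collection is
// appended to a BigQuery changelog table as it happens.
export const syncParticipantWrites = functions.firestore
  .document('participants/{docId}')
  .onWrite(async (change, context) => {
    const row = {
      document_id: context.params.docId,
      operation: !change.after.exists ? 'DELETE' : !change.before.exists ? 'CREATE' : 'UPDATE',
      data: change.after.exists ? JSON.stringify(change.after.data()) : null,
      changed_at: context.timestamp,
    };
    // Streaming insert; assumes the table and schema were created ahead of time.
    await bigquery.dataset('connect_dataset').table('participants_changelog').insert([row]);
  });
```

This keeps the two concerns decoupled: the trigger only handles incremental changes into BigQuery, while the full Firestore backup to the bucket can run on its own schedule.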