-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use JSONL to upload to BigQuery #86
Open
judahrand
wants to merge
8
commits into
jmriego:jsonl
Choose a base branch
from
thread:upstream/jsonl
base: jsonl
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
9c2d29f
Use JSON rather than Avro
judahrand 1ffea7f
Add env vars to CI
judahrand e3cae09
Deal with linting errors
judahrand e60f75a
Update unittests
judahrand 53effff
Add `google-cloud-storage` dependency
judahrand 19e3531
Fix bugs
judahrand 7e8af19
Add documentation
judahrand c1272d3
Fix `test_logical_streams_from_pg_with_hard_delete_mapping`
judahrand File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make this optional instead? I'd prefer if we don't force people to use GCS as I know not everyone is using it and it's a breaking change.
Should be OK to upload then read from GCS if this parameter is provided or use
load_table_from_file
if notThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO there's no real reason not to ingest via GCS. It's certainly not a priority for us and I don't think I'd have the time to make the change + add additional testing in the near future. Though you're more than welcome to make the necessary changes. It will uglify the code a bit too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thing is that we have been using it without uploading to GCS for years now and this would make it more difficult for us to have PII compliance unless we deleted the files later. I'm going to merge this to a new jsonl branch and I'll do that change and some testing on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation does delete the files:
pipelinewise-target-bigquery/target_bigquery/db_sync.py
Lines 443 to 446 in c1272d3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! just for the sake of conversation. Let's say we make GCS mandatory. What would be the benefits of doing that?
I can see that it would be more similar to the Snowflake target maintained by PipelineWise that forces you to use a stage or S3. But I don't think you'd win any speed, debug capabilities (if we delete the file in a finally) or anything else.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html
^ This suggests that when your network is fast loading into GCS is better. At least the way we're running this (and I'd guess is common) both our DB and (obviously) BQ are in GCP (in the same region in fact). So the network connection is very fast (10Gbps+) This makes GCS the obvious choice (and probably uncompressed better than compressed).
Do you disagree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also given that we're already not using compression that also suggests that outside GCP loading via GCS is still quicker.