Allow disabling of best effort de-duplication #272

Open
adelcast opened this issue May 19, 2020 · 0 comments

The connector currently adds an insertId to every row, which BigQuery uses to de-duplicate rows that share the same insertId (within a 1 minute window). Using insertIds caps the ingestion rate at 100k rows per second and 100 MB/s.

Insertions without an insertId disable best effort de-duplication [1], which raises the ingestion quota to a maximum of 1 GB/s. For high-throughput applications, it's desirable to disable de-duplication and handle duplicates on the query side.

A config option to disable best effort de-duplication (by omitting insertIds) would be a great knob to have, and could improve ingestion throughput by up to 10x (from 100 MB/s to 1 GB/s).
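
To illustrate the idea, here is a rough sketch of what such a switch could do when building a streaming insert request. It is not the connector's actual code: the function name `build_insert_all_body` and the flag `best_effort_dedup` are hypothetical, and the body only mirrors the `rows` / `insertId` / `json` shape of a `tabledata.insertAll` request.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_insert_all_body(
    rows: Iterable[Tuple[str, Dict[str, Any]]],
    best_effort_dedup: bool = True,
) -> Dict[str, Any]:
    """Build a tabledata.insertAll-style request body.

    `rows` yields (row_id, row_content) pairs. `best_effort_dedup` is a
    hypothetical option name, not an existing connector setting.
    """
    payload_rows: List[Dict[str, Any]] = []
    for row_id, content in rows:
        entry: Dict[str, Any] = {"json": content}
        if best_effort_dedup:
            # With an insertId, BigQuery attempts best effort de-duplication.
            entry["insertId"] = row_id
        # Without an insertId, the row is sent with no de-duplication key.
        payload_rows.append(entry)
    return {"rows": payload_rows}
```

With `best_effort_dedup=False` the row ids are simply dropped from the request, so any duplicates produced by retries would have to be handled at query time.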

[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication
