
add disableBestEffortDeduplication config option #277

Open · wants to merge 1 commit into master
Conversation

adelcast

The connector currently adds an insertId per row, which BigQuery uses to
dedupe rows that share the same insertId (within a 1-minute window). Using
insertIds throttles the ingestion rate to a maximum of 100k rows per second
and 100 MB/s.

Insertions without an insertId disable best-effort de-duplication [1],
which raises the ingestion quota to a maximum of 1 GB/s. For high-throughput
applications, it's desirable to disable dedupe and handle duplication on
the query side.

[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication

Signed-off-by: Alejandro del Castillo [email protected]
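As a rough illustration of the change described above (not the connector's actual code: the flag name `disableBestEffortDeduplication` and the simplified `RowToInsert` stand-in are assumptions), omitting the insertId when the option is set might look like:

```java
import java.util.Map;

public class DedupSketch {
    // Minimal stand-in for com.google.cloud.bigquery.InsertAllRequest.RowToInsert
    static class RowToInsert {
        final String insertId;            // null means "no insertId": dedup disabled
        final Map<String, Object> content;
        RowToInsert(String insertId, Map<String, Object> content) {
            this.insertId = insertId;
            this.content = content;
        }
    }

    // Hypothetical config option added by this PR
    final boolean disableBestEffortDeduplication;

    DedupSketch(boolean disableBestEffortDeduplication) {
        this.disableBestEffortDeduplication = disableBestEffortDeduplication;
    }

    RowToInsert toRow(String recordId, Map<String, Object> content) {
        // With dedup disabled we skip the insertId entirely, lifting the
        // 100k rows/s & 100 MB/s streaming quota toward 1 GB/s.
        String insertId = disableBestEffortDeduplication ? null : recordId;
        return new RowToInsert(insertId, content);
    }

    public static void main(String[] args) {
        DedupSketch fast = new DedupSketch(true);
        DedupSketch safe = new DedupSketch(false);
        System.out.println(fast.toRow("topic-0-42", Map.of("f", 1)).insertId); // null
        System.out.println(safe.toRow("topic-0-42", Map.of("f", 1)).insertId); // topic-0-42
    }
}
```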

@CLAassistant

CLAassistant commented Jun 12, 2020

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Jun 12, 2020

Codecov Report

Merging #277 into master will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #277      +/-   ##
============================================
+ Coverage     70.87%   70.98%   +0.11%     
- Complexity      301      302       +1     
============================================
  Files            32       32              
  Lines          1538     1544       +6     
  Branches        164      165       +1     
============================================
+ Hits           1090     1096       +6     
  Misses          390      390              
  Partials         58       58              
Impacted Files Coverage Δ Complexity Δ
...wepay/kafka/connect/bigquery/BigQuerySinkTask.java 59.37% <100.00%> (+0.42%) 28.00 <0.00> (+1.00)
...ka/connect/bigquery/config/BigQuerySinkConfig.java 82.12% <100.00%> (+0.35%) 29.00 <0.00> (ø)

@adelcast adelcast force-pushed the dev/adelcast/add_dedup_option branch from 6700049 to 37bcaa2 on June 15, 2020 02:44
@C0urante
Collaborator

C0urante commented Jun 15, 2020

@adelcast This looks straightforward, useful, and well-implemented. Do you think it might make sense to expand on the existing unit tests for the BigQuerySinkTask class to verify this change?

@adelcast
Author

I may need some pointers... I'm looking at BigQuerySinkTaskTest.java. The other tests mock at the bigQuery.insertAll level, which is several levels of abstraction above where my change is. Any suggestion on how to create a test here?

@adelcast
Author

hi @C0urante, any comments? would love to get this PR to a place where it can be merged....

@C0urante
Collaborator

@adelcast apologies for the delay! Been working on my own feature for this connector and it's eaten my life a little bit :)

I think there are two ways we might go about testing this:

  1. Bite the bullet and mock everything, eventually capturing the InsertAllRequest that gets passed to a mocked BigQuery instance and making assertions on it (probably making sure that RowToInsert::getId returns null for each RowToInsert in the request's getRows list).
  2. Make BigQuerySinkTask::getRecordRow package-private and unit test with that method as your point of entry.

Personally I think the second option would be cleaner, easier to test, and require less boilerplate, but I'm not sure how the maintainers of the repo feel about that, so just keep in mind that you might get asked to change it if you decide to go in that direction.

Hope this helps!
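Option 1 above can be sketched with a hand-rolled fake in place of a Mockito mock (all type names below are illustrative stand-ins for the BigQuery client API, not the connector's real classes):

```java
import java.util.ArrayList;
import java.util.List;

public class MockSketch {
    // Stand-ins for the BigQuery client types (assumptions, not the real API surface)
    static class RowToInsert {
        final String id;
        RowToInsert(String id) { this.id = id; }
        String getId() { return id; }
    }
    static class InsertAllRequest {
        final List<RowToInsert> rows;
        InsertAllRequest(List<RowToInsert> rows) { this.rows = rows; }
        List<RowToInsert> getRows() { return rows; }
    }
    interface BigQuery { void insertAll(InsertAllRequest request); }

    // Hand-rolled "mock" that just records the request it was given
    static class CapturingBigQuery implements BigQuery {
        InsertAllRequest captured;
        public void insertAll(InsertAllRequest request) { captured = request; }
    }

    // Code under test: with dedup disabled, every row is built without an id
    static void putRecords(BigQuery bq, List<String> recordIds, boolean dedupDisabled) {
        List<RowToInsert> rows = new ArrayList<>();
        for (String id : recordIds) {
            rows.add(new RowToInsert(dedupDisabled ? null : id));
        }
        bq.insertAll(new InsertAllRequest(rows));
    }

    public static void main(String[] args) {
        CapturingBigQuery bq = new CapturingBigQuery();
        putRecords(bq, List.of("t-0-1", "t-0-2"), true);
        // Assert that every RowToInsert in the captured request has a null id
        boolean allNull = bq.captured.getRows().stream().allMatch(r -> r.getId() == null);
        System.out.println(allNull); // true
    }
}
```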

@adelcast adelcast force-pushed the dev/adelcast/add_dedup_option branch from 37bcaa2 to 6b1d4d4 on July 3, 2020 00:02
@adelcast
Author

adelcast commented Jul 3, 2020

@C0urante thanks for the tips! I went ahead and implemented your option 2....

Collaborator

@C0urante left a comment


I'm a little confused--this looks like we are mocking the connector dependencies in the test now instead of calling getRowId in isolation. And it also looks like we're not actually verifying what the return value of getRowId is... am I missing something? 😅

@adelcast
Author

adelcast commented Jul 7, 2020

So this was my thinking: getRecordRow calls getRowId when bestEffortDeduplication = true (to generate an insertId). So an easy way to make sure the switch works is to set bestEffortDeduplication = false, add a record, and verify that getRowId was not called. That implies no insertId was created and the code path is correct. Not exactly your option 2, but it seemed easy enough....
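That verify-not-called idea can be illustrated without Mockito by a hand-rolled spy that counts invocations (the class and flag names below are illustrative, not the connector's real code):

```java
public class SpySketch {
    static class SinkTask {
        final boolean bestEffortDeduplication;
        SinkTask(boolean bestEffortDeduplication) {
            this.bestEffortDeduplication = bestEffortDeduplication;
        }

        String getRowId(String recordId) { return recordId; }

        String getRecordRow(String recordId) {
            // getRowId is only invoked when dedup is switched on
            return bestEffortDeduplication ? getRowId(recordId) : null;
        }
    }

    // Spy: overrides getRowId to count calls, like Mockito's verify(..., times(0))
    static class SpyTask extends SinkTask {
        int getRowIdCalls = 0;
        SpyTask(boolean bestEffortDeduplication) { super(bestEffortDeduplication); }
        @Override String getRowId(String recordId) {
            getRowIdCalls++;
            return super.getRowId(recordId);
        }
    }

    public static void main(String[] args) {
        SpyTask spy = new SpyTask(false);   // dedup switched off
        spy.getRecordRow("topic-0-9");
        // Equivalent of verify(task, times(0)).getRowId(...)
        System.out.println(spy.getRowIdCalls); // 0
    }
}
```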

Collaborator

@C0urante left a comment


Ah, missed the times(0) in there.

I was thinking the getRowId method might be called unconditionally and just return null if best-effort deduplication is disabled. This wouldn't be super clean since the RowToInsert::of static constructor method doesn't permit null row IDs, so we'd have to do some branching based on whether the returned string is non-null or not. The advantage would be that we'd be able to more easily unit test this since all that'd be necessary would be to create and configure the task class, and then verify that getRowId returns null when (and only when) it should.

Still, this looks good enough as-is. ✅
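The alternative described here, calling getRowId unconditionally and branching on its result, could look roughly like this (the `RowToInsert.of` stand-ins only mimic the real static constructors, which reject null row ids; everything else is an illustrative assumption):

```java
import java.util.Map;

public class NullIdSketch {
    static class RowToInsert {
        final String id;
        final Map<String, Object> content;
        private RowToInsert(String id, Map<String, Object> content) {
            this.id = id;
            this.content = content;
        }
        // Mimics the real two-arg constructor, which does not permit null ids
        static RowToInsert of(String id, Map<String, Object> content) {
            if (id == null) throw new IllegalArgumentException("row id must be non-null");
            return new RowToInsert(id, content);
        }
        static RowToInsert of(Map<String, Object> content) {
            return new RowToInsert(null, content);
        }
    }

    static boolean dedupDisabled = true;

    // Called unconditionally; returns null when dedup is disabled
    static String getRowId(String recordId) {
        return dedupDisabled ? null : recordId;
    }

    static RowToInsert buildRow(String recordId, Map<String, Object> content) {
        String rowId = getRowId(recordId);
        // Branch because RowToInsert.of does not accept a null row id
        return rowId == null ? RowToInsert.of(content) : RowToInsert.of(rowId, content);
    }

    public static void main(String[] args) {
        System.out.println(buildRow("t-0-3", Map.of("f", 1)).id); // null
        dedupDisabled = false;
        System.out.println(buildRow("t-0-3", Map.of("f", 1)).id); // t-0-3
    }
}
```

The upside, as noted above, is that a unit test only needs to configure the task and assert on getRowId's return value, with no mocking at all.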

@adelcast
Author

adelcast commented Jul 8, 2020

@C0urante ah, I see what you mean....that would be another way to go. Since you don't seem to have an objection, I'll wait for someone else to chime in, if another approach is preferred. Thanks again for your guidance on this.
