
add disableBestEffortDeduplication config option #277

Open · wants to merge 1 commit into master
Conversation

adelcast

The connector currently adds an insertId per row, which BigQuery uses to
dedupe rows that share the same insertId (within a 1-minute window). Using
insertIds throttles the ingestion rate to a maximum of 100k rows per second
and 100 MB/s.

Insertions without an insertId disable best-effort de-duplication [1],
which raises the ingestion quota to a maximum of 1 GB/s. For high-throughput
applications, it's desirable to disable dedupe and handle duplication on
the query side.

[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication

Signed-off-by: Alejandro del Castillo [email protected]
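As a rough illustration of the change described above (not the connector's actual code: the flag name `disableBestEffortDeduplication` and the simplified `RowToInsert` stand-in are assumptions), omitting the insertId when the option is set might look like:

```java
import java.util.Map;

public class DedupSketch {
    // Minimal stand-in for com.google.cloud.bigquery.InsertAllRequest.RowToInsert
    static class RowToInsert {
        final String insertId;            // null means "no insertId": dedup disabled
        final Map<String, Object> content;
        RowToInsert(String insertId, Map<String, Object> content) {
            this.insertId = insertId;
            this.content = content;
        }
    }

    // Hypothetical config option added by this PR
    final boolean disableBestEffortDeduplication;

    DedupSketch(boolean disableBestEffortDeduplication) {
        this.disableBestEffortDeduplication = disableBestEffortDeduplication;
    }

    RowToInsert toRow(String recordId, Map<String, Object> content) {
        // With dedup disabled we skip the insertId entirely, lifting the
        // 100k rows/s & 100 MB/s streaming quota toward 1 GB/s.
        String insertId = disableBestEffortDeduplication ? null : recordId;
        return new RowToInsert(insertId, content);
    }

    public static void main(String[] args) {
        DedupSketch fast = new DedupSketch(true);
        DedupSketch safe = new DedupSketch(false);
        System.out.println(fast.toRow("topic-0-42", Map.of("f", 1)).insertId); // null
        System.out.println(safe.toRow("topic-0-42", Map.of("f", 1)).insertId); // topic-0-42
    }
}
```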

@CLAassistant

CLAassistant commented Jun 12, 2020

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Jun 12, 2020

Codecov Report

Merging #277 into master will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #277      +/-   ##
============================================
+ Coverage     70.87%   70.98%   +0.11%     
- Complexity      301      302       +1     
============================================
  Files            32       32              
  Lines          1538     1544       +6     
  Branches        164      165       +1     
============================================
+ Hits           1090     1096       +6     
  Misses          390      390              
  Partials         58       58              
Impacted Files Coverage Δ Complexity Δ
...wepay/kafka/connect/bigquery/BigQuerySinkTask.java 59.37% <100.00%> (+0.42%) 28.00 <0.00> (+1.00)
...ka/connect/bigquery/config/BigQuerySinkConfig.java 82.12% <100.00%> (+0.35%) 29.00 <0.00> (ø)

@adelcast adelcast force-pushed the dev/adelcast/add_dedup_option branch from 6700049 to 37bcaa2 on June 15, 2020 02:44
@C0urante
Collaborator

C0urante commented Jun 15, 2020

@adelcast This looks straightforward, useful, and well-implemented. Do you think it might make sense to expand on the existing unit tests for the BigQuerySinkTask class to verify this change?

@adelcast
Author

I may need some pointers... I'm looking at BigQuerySinkTaskTest.java. The other tests mock at the bigQuery.insertAll level, which is several levels of abstraction above where my change is. Any suggestion on how to create a test here?

@adelcast
Author

hi @C0urante, any comments? would love to get this PR to a place where it can be merged....

@C0urante
Collaborator

@adelcast apologies for the delay! Been working on my own feature for this connector and it's eaten my life a little bit :)

I think there are two ways we might go about testing this:

  1. Bite the bullet and mock everything, eventually capturing the InsertAllRequest that gets passed to a mocked BigQuery instance and making assertions on it (probably making sure that RowToInsert::getId returns null for each RowToInsert in the request's getRows list).
  2. Make BigQuerySinkTask::getRecordRow package-private and unit test with that method as your point of entry.

Personally I think the second option would be cleaner, easier to test, and require less boilerplate, but I'm not sure how the maintainers of the repo feel about that, so just keep in mind that you might get asked to change it if you decide to go in that direction.

Hope this helps!
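Option 1 above can be sketched with a hand-rolled fake in place of a Mockito mock (all type names below are illustrative stand-ins for the BigQuery client API, not the connector's real classes):

```java
import java.util.ArrayList;
import java.util.List;

public class MockSketch {
    // Stand-ins for the BigQuery client types (assumptions, not the real API surface)
    static class RowToInsert {
        final String id;
        RowToInsert(String id) { this.id = id; }
        String getId() { return id; }
    }
    static class InsertAllRequest {
        final List<RowToInsert> rows;
        InsertAllRequest(List<RowToInsert> rows) { this.rows = rows; }
        List<RowToInsert> getRows() { return rows; }
    }
    interface BigQuery { void insertAll(InsertAllRequest request); }

    // Hand-rolled "mock" that just records the request it was given
    static class CapturingBigQuery implements BigQuery {
        InsertAllRequest captured;
        public void insertAll(InsertAllRequest request) { captured = request; }
    }

    // Code under test: with dedup disabled, every row is built without an id
    static void putRecords(BigQuery bq, List<String> recordIds, boolean dedupDisabled) {
        List<RowToInsert> rows = new ArrayList<>();
        for (String id : recordIds) {
            rows.add(new RowToInsert(dedupDisabled ? null : id));
        }
        bq.insertAll(new InsertAllRequest(rows));
    }

    public static void main(String[] args) {
        CapturingBigQuery bq = new CapturingBigQuery();
        putRecords(bq, List.of("t-0-1", "t-0-2"), true);
        // Assert that every RowToInsert in the captured request has a null id
        boolean allNull = bq.captured.getRows().stream().allMatch(r -> r.getId() == null);
        System.out.println(allNull); // true
    }
}
```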

@adelcast adelcast force-pushed the dev/adelcast/add_dedup_option branch from 37bcaa2 to 6b1d4d4 on July 3, 2020 00:02
@adelcast
Author

adelcast commented Jul 3, 2020

@C0urante thanks for the tips! I went ahead and implemented your option 2....

Collaborator

@C0urante left a comment


I'm a little confused--this looks like we are mocking the connector dependencies in the test now instead of calling getRowId in isolation. And it also looks like we're not actually verifying what the return value of getRowId is... am I missing something? 😅

@adelcast
Author

adelcast commented Jul 7, 2020

So this was my thinking: getRecordRow calls getRowId when bestEffortDeduplication = true (to generate an insertId). So an easy way to make sure the switch works is to set bestEffortDeduplication = false, add a record, and verify that getRowId was not called. That implies no insertId was created and the code path is correct. Not exactly your option 2, but it seemed easy enough....
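That verify-not-called idea can be illustrated without Mockito by a hand-rolled spy that counts invocations (the class and flag names below are illustrative, not the connector's real code):

```java
public class SpySketch {
    static class SinkTask {
        final boolean bestEffortDeduplication;
        SinkTask(boolean bestEffortDeduplication) {
            this.bestEffortDeduplication = bestEffortDeduplication;
        }

        String getRowId(String recordId) { return recordId; }

        String getRecordRow(String recordId) {
            // getRowId is only invoked when dedup is switched on
            return bestEffortDeduplication ? getRowId(recordId) : null;
        }
    }

    // Spy: overrides getRowId to count calls, like Mockito's verify(..., times(0))
    static class SpyTask extends SinkTask {
        int getRowIdCalls = 0;
        SpyTask(boolean bestEffortDeduplication) { super(bestEffortDeduplication); }
        @Override String getRowId(String recordId) {
            getRowIdCalls++;
            return super.getRowId(recordId);
        }
    }

    public static void main(String[] args) {
        SpyTask spy = new SpyTask(false);   // dedup switched off
        spy.getRecordRow("topic-0-9");
        // Equivalent of verify(task, times(0)).getRowId(...)
        System.out.println(spy.getRowIdCalls); // 0
    }
}
```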

Collaborator

@C0urante left a comment


Ah, missed the times(0) in there.

I was thinking the getRowId method might be called unconditionally and just return null if best-effort deduplication is disabled. This wouldn't be super clean since the RowToInsert::of static constructor method doesn't permit null row IDs, so we'd have to do some branching based on whether the returned string is non-null or not. The advantage would be that we'd be able to more easily unit test this since all that'd be necessary would be to create and configure the task class, and then verify that getRowId returns null when (and only when) it should.

Still, this looks good enough as-is. ✅
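The alternative described here, calling getRowId unconditionally and branching on its result, could look roughly like this (the `RowToInsert.of` stand-ins only mimic the real static constructors, which reject null row ids; everything else is an illustrative assumption):

```java
import java.util.Map;

public class NullIdSketch {
    static class RowToInsert {
        final String id;
        final Map<String, Object> content;
        private RowToInsert(String id, Map<String, Object> content) {
            this.id = id;
            this.content = content;
        }
        // Mimics the real two-arg constructor, which does not permit null ids
        static RowToInsert of(String id, Map<String, Object> content) {
            if (id == null) throw new IllegalArgumentException("row id must be non-null");
            return new RowToInsert(id, content);
        }
        static RowToInsert of(Map<String, Object> content) {
            return new RowToInsert(null, content);
        }
    }

    static boolean dedupDisabled = true;

    // Called unconditionally; returns null when dedup is disabled
    static String getRowId(String recordId) {
        return dedupDisabled ? null : recordId;
    }

    static RowToInsert buildRow(String recordId, Map<String, Object> content) {
        String rowId = getRowId(recordId);
        // Branch because RowToInsert.of does not accept a null row id
        return rowId == null ? RowToInsert.of(content) : RowToInsert.of(rowId, content);
    }

    public static void main(String[] args) {
        System.out.println(buildRow("t-0-3", Map.of("f", 1)).id); // null
        dedupDisabled = false;
        System.out.println(buildRow("t-0-3", Map.of("f", 1)).id); // t-0-3
    }
}
```

The upside, as noted above, is that a unit test only needs to configure the task and assert on getRowId's return value, with no mocking at all.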

@adelcast
Author

adelcast commented Jul 8, 2020

@C0urante ah, I see what you mean....that would be another way to go. Since you don't seem to have an objection, I'll wait for someone else to chime in, if another approach is preferred. Thanks again for your guidance on this.
