Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only list gcs folder blobs in GCSToBQLoadRunnable #270

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zachary-povey
Copy link

@zachary-povey zachary-povey commented May 18, 2020

Functionality for specifying a GCS folder within the bucket to dump objects into was added here. This PR addresses an issue with this feature I have outlined in #271.

I have passed through the folder prefix as a list option to the GCSToBQLoadRunnable, and all tests seem to be passing for me with the update. However a new integration test should added to recreate the issue and confirm this change fixes it.

However, this new test is not a straight forward addition of data and schema, as detailed in the readme section for adding integration tests. I thought it was best discussed here for how it should be implemented, but I would imagine it working as follows:

  • Add some objects in same GCS bucket but in a folder next to the target folder of connector
  • Run connector
  • Assert rows in target folder appeared
  • Assert rows in other folder did not appear

@CLAassistant
Copy link

CLAassistant commented May 18, 2020

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@mtagle mtagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of having a file in another folder and making sure that it does not load from that.

@@ -359,7 +360,8 @@ private void startGCSToBQLoadTask() {
BucketInfo bucketInfo = BucketInfo.of(bucketName);
bucket = gcs.create(bucketInfo);
}
GCSToBQLoadRunnable loadRunnable = new GCSToBQLoadRunnable(getBigQuery(), bucket);
Storage.BlobListOption folderPrefixOption = Storage.BlobListOption.prefix(folderName);
GCSToBQLoadRunnable loadRunnable = new GCSToBQLoadRunnable(getBigQuery(), bucket, folderPrefixOption);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work as expected if there is no folder prefix configured?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. As the GCS_FOLDER_NAME_DEFAULT on the config is set to an empty string, we are filtering for objects with a path prefixed by an empty string which is the same as no filter at all.

At least that was how I thought it would work, and took the passing integration tests as confirmation :)

@criccomini
Copy link
Contributor

@zachary-povey your proposed integration test flow sounds good

@codecov-commenter
Copy link

codecov-commenter commented Jun 9, 2020

Codecov Report

Merging #270 into master will decrease coverage by 0.13%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master     #270      +/-   ##
============================================
- Coverage     70.87%   70.73%   -0.14%     
  Complexity      301      301              
============================================
  Files            32       32              
  Lines          1538     1541       +3     
  Branches        164      164              
============================================
  Hits           1090     1090              
- Misses          390      393       +3     
  Partials         58       58              
Impacted Files Coverage Δ Complexity Δ
...wepay/kafka/connect/bigquery/BigQuerySinkTask.java 58.33% <0.00%> (-0.62%) 27.00 <0.00> (ø)
...ay/kafka/connect/bigquery/GCSToBQLoadRunnable.java 15.38% <0.00%> (-0.11%) 5.00 <0.00> (ø)

@zachary-povey zachary-povey marked this pull request as ready for review June 9, 2020 15:30
@zachary-povey
Copy link
Author

Hi guys - i've added the integration test as described- it felt a bit hacky (I don't do much java) so feel free to make suggestions for refactoring.

On a different note are you aware of the integration test ever having race conditions? I was having intermittent issues where tables were the tables populated by streaming inserts wouldn't get created but on a re-run the problem would go away...

@zachary-povey zachary-povey requested a review from mtagle June 9, 2020 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants