GCSToBQLoadRunnable doesn't respect GCS folder #271

zachary-povey opened this issue May 18, 2020 · 2 comments

Comments

@zachary-povey

I have run into a problem when using the GCS->BQ batch mode of the BigQuerySinkConnector: each connector schedules its own instance of the GCSToBQLoadRunnable, which does not use the GCS folder when listing objects to load into BigQuery.

Because of this, if multiple connectors use the same bucket but different folders, each one loads every object in the bucket, irrespective of the folder it is in, and so you get many duplicates in BQ. Furthermore, only one instance will successfully delete each object; when the other instances try to delete it and fail, they simply retry again and again.

GCSToBQLoadRunnable can be seen here
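
To make the expected behaviour concrete, here is a minimal, self-contained sketch of folder-scoped listing. It is not the connector's code: the class `FolderScopedListing`, the helpers `toPrefix` and `objectsToLoad`, and the object names are all hypothetical, and the bucket is simulated with a plain list of names. Against real GCS, the Java client's `Storage.BlobListOption.prefix(...)` list option applies the same kind of filter on the server side; whether the eventual fix uses it is not something this thread confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FolderScopedListing {

    // Hypothetical helper: turn a configured folder into an object-name prefix.
    // The trailing "/" stops "connector-a" from also matching "connector-a-old/...".
    static String toPrefix(String folder) {
        if (folder == null || folder.isEmpty()) {
            return ""; // no folder configured: every object matches
        }
        return folder.endsWith("/") ? folder : folder + "/";
    }

    // Hypothetical helper: keep only the objects that live under the given folder.
    static List<String> objectsToLoad(List<String> objectNames, String folder) {
        String prefix = toPrefix(folder);
        List<String> selected = new ArrayList<>();
        for (String name : objectNames) {
            if (name.startsWith(prefix)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Simulated bucket shared by two connectors (names are made up).
        List<String> bucket = Arrays.asList(
                "connector-a/batch-1.json",
                "connector-a/batch-2.json",
                "connector-b/batch-1.json");

        // Folder-aware listing: each connector sees only its own objects.
        System.out.println(objectsToLoad(bucket, "connector-a"));
        System.out.println(objectsToLoad(bucket, "connector-b"));

        // Folder ignored (the behaviour reported above): everything is picked up.
        System.out.println(objectsToLoad(bucket, ""));
    }
}
```

With the folder applied, two connectors sharing a bucket never select the same object, which would avoid both the duplicate loads and the repeated failed deletes.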

@mtagle
Contributor

mtagle commented May 18, 2020

Good catch! IIRC, folders for GCS loading were added after the initial implementation, and I think the GCS -> BQ step was just overlooked.

GCSToBQLoadRunnable definitely should be modified to respect the configured folder.

@zachary-povey
Author

Thought that might have been how it happened 🙂

I've raised a draft PR here with a suggested fix. I need a bit of help adding a new integration test that re-creates the issue, though.
