Decouple topics from tables and generalize SchemaRetriever #245
Comments
I'm really glad to read about pluggable
@mkubala no one is working on this at the moment. Our hope was that you would introduce the `TableRouter` as part of your PR.
@criccomini Do you have any update on this issue? Is someone working on it?
@bingqinzhou @mtagle Do you have any information?
@sebco59 we have someone beginning this now! :)
@criccomini Nice! On our side, we tried to use a Single Message Transformation (SMT) to deal with multiple schemas instead of a `TableRouter` implementation, but the SMT approach doesn't work because of the way the connector retrieves schemas.

We have started developing schema retriever logic, but we tried to decouple it from the schema update logic. With the current implementation, BigQuery table schema updates are based entirely on the latest version available in the schema registry, which doesn't handle messages carrying an older, incompatible schema well. What do you think of this approach (SMT plus decoupled schema update logic)?
@criccomini This seems like a great abstraction to place over the connector and allow it to service a variety of use cases. Excited to see this thing still growing and evolving as the years go by :)

I think @sebco59's suggestion to rely on SMTs is a valuable one. A lot of the problems being addressed here aren't specific to BigQuery as an external system and can apply to connectors in general. An SMT is a great way to write code once and reuse it anywhere, and it should also help keep the code base for this connector uncluttered. Specifically, I think users should either use the

With regards to the comment about modifying

I also think that if we evolve the

On the subject of automatic schema unionization, I'm wondering about a potential gotcha: if the connector reads a record with a completely mismatched schema that was not meant to be in the topic to begin with, will automatic unionization of the table still happen? I think it will, as long as any fields in that record's schema which aren't already in the table are optional. We probably want to be cautious about accidentally adding columns to a table schema, since deleting them isn't permitted.
(We are using the SMT approach now. PR is forthcoming :) )
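For readers unfamiliar with the SMT approach discussed above: Kafka Connect ships a built-in `RegexRouter` transform that rewrites a record's topic name before the connector sees it, so the connector's existing topic-to-table mapping routes the record to a different table. A sketch of the connector configuration; the `db-(.*)` pattern and its replacement are made-up examples:

```properties
# Rewrite topics like "db-orders" to "orders" before the connector
# maps topic -> table. Pattern and replacement are illustrative only.
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=db-(.*)
transforms.route.replacement=$1
```

As the thread notes, this only reroutes records; it does not help when the schema retriever itself looks schemas up by topic name.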
Intro
We have had a number of PRs and issues recently that are attempting to do two things: decouple topics from tables, and generalize the `SchemaRetriever`. These include:
#238
#216
#175
#206
#178
And more.
Changes
I believe that we can address these issues with the following changes:
1. Add `TableRouter`: an interface that takes a `SinkRecord` and returns which table it should be written to. (Default: `RegexTableRouter`.)
2. Generalize `SchemaRetriever`: change the `SchemaRetriever` interface to have two methods, `Schema getKeySchema(SinkRecord)` and `Schema getValueSchema(SinkRecord)`. (Default: `IdentitySchemaRetriever`, which just returns `sinkRecord.keySchema()` and `sinkRecord.valueSchema()`.)
3. Change the `*QueryWriter` schema update logic: move schema updates into `AdaptiveBigQueryWriter`.
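To make the routing idea concrete, here is a minimal, self-contained sketch of what regex-based routing could look like. Everything here is hypothetical: the real `TableRouter` would take a whole `SinkRecord` and return a BigQuery table identifier, while this sketch routes on the topic name alone using `java.util.regex`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based topic-to-table routing. */
public class RegexTableRouterSketch {
    // Insertion-ordered so the first configured pattern wins.
    private final Map<Pattern, String> routes = new LinkedHashMap<>();

    public void addRoute(String topicRegex, String tableTemplate) {
        routes.put(Pattern.compile(topicRegex), tableTemplate);
    }

    /** Returns the table name for a topic, using the first matching pattern. */
    public String route(String topic) {
        for (Map.Entry<Pattern, String> e : routes.entrySet()) {
            Matcher m = e.getKey().matcher(topic);
            if (m.matches()) {
                // Expand $1-style capture-group references in the template.
                return m.replaceAll(e.getValue());
            }
        }
        throw new IllegalArgumentException("No route for topic: " + topic);
    }

    public static void main(String[] args) {
        RegexTableRouterSketch router = new RegexTableRouterSketch();
        router.addRoute("db\\.(\\w+)\\.(\\w+)", "$1_$2");
        System.out.println(router.route("db.sales.orders")); // sales_orders
    }
}
```

A real implementation could route on anything the `SinkRecord` carries (key schema, payload fields, etc.), not just the topic name.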
This change will be the largest. If we remove `SchemaRetriever.updateSchema`, we need a way for `AdaptiveBigQueryWriter` to update BigQuery schemas when a batch insert fails. Given the rules in https://cloud.google.com/bigquery/docs/managing-table-schemas, the correct behavior when a schema failure occurs is to have the adaptive writer union all fields from the insert batch with all fields in the existing BigQuery table. An example illustrates: if a table currently has columns `{id, name}` and a record in the failed batch has fields `{id, email}`, the table's schema should be updated to the union `{id, name, email}`, with `email` added as a nullable column (BigQuery allows adding columns but not deleting them, and newly added columns must be `NULLABLE` or `REPEATED`).
This will require that we have access to the insert batch's `SinkRecord` for each row (not just the `RowToInsert`). It will also require that `AdaptiveBigQueryWriter` has the `SchemaRetriever` wired in as well. I think the most straightforward way to handle this is to have `TableWriterBuilder.addRow(SinkRecord record)` instead of `RowToInsert`. The builder can then keep a `SortedMap<SinkRecord, RowToInsert>` and pass it down the stack through to `AdaptiveBigQueryWriter.performWriteRequest`. `AdaptiveBigQueryWriter.attemptSchemaUpdate` can then be changed to implement the field-union logic that I described above.

One area that we'll have to be particularly careful about is dealing with repeated records and nested structures. They need to be properly unioned as well.
Benefits
This approach should give us a ton of flexibility, including:

- Routing based on anything in the `SinkRecord` (topic, key schema, value schema, message payload, etc.)
- Retrieving schemas from the `SinkRecord`'s `.keySchema()` and `.valueSchema()` rather than talking to the schema registry.
- Handling records whose `.keySchema()` and `.valueSchema()` methods aren't usable--you can implement a custom retriever for this case.