Add support for topics with multiple schemas #238
Conversation
This contribution makes kafka-connect-bigquery compatible with topics that carry different types of messages, which is especially useful when Kafka is used for event sourcing. The approach of putting several event types in a single topic is already supported by Kafka and Schema Registry (see https://www.confluent.io/blog/put-several-event-types-kafka-topic/). This PR has been derived from #219 and rebased on top of the recent changes made by @bingqinzhou.
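For context, the linked Confluent post relies on subject-name strategies on the producer side. Below is a minimal sketch of how several Avro event types can be published to one topic; the broker/registry URLs and the generic setup are illustrative assumptions, not part of this PR:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class MultiTypeProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // illustrative
    props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081"); // illustrative
    // Register subjects per record name ("<topic>-<record name>") instead of per topic,
    // so one topic can carry several Avro event types.
    props.put("value.subject.name.strategy",
        "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
    KafkaProducer<Object, Object> producer = new KafkaProducer<>(props);
    // ... send records of different Avro types to the same topic ...
    producer.close();
  }
}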
@mtagle @criccomini @C0urante Any progress on this one?
@mtagle @criccomini @C0urante Is there any chance for this feature to be merged and included in the next release? I don't want to exert pressure on anyone here, but it's been almost 3 months since I started work on this contribution after receiving positive feedback in #175, and my team is blocked with their work on ML & reporting. I'm between a rock and a hard place, and if I should start looking for another solution (write my own custom service or connector?) I'd like to know before the team gets really mad.
Review threads on the following files:
- kcbq-api/src/main/java/com/wepay/kafka/connect/bigquery/api/TopicAndRecordName.java
- kcbq-api/src/main/java/com/wepay/kafka/connect/bigquery/api/SchemaRetriever.java
- kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/config/BigQuerySinkConfig.java
- ...pay/kafka/connect/bigquery/schemaregistry/schemaretriever/SchemaRegistrySchemaRetriever.java
- kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/BigQuerySinkConnector.java
Codecov Report
@@ Coverage Diff @@
## master #238 +/- ##
============================================
+ Coverage 68.69% 68.95% +0.25%
- Complexity 267 273 +6
============================================
Files 32 32
Lines 1460 1498 +38
Branches 152 153 +1
============================================
+ Hits 1003 1033 +30
- Misses 409 417 +8
Partials 48 48
Most of the comments have been addressed. I left some food for thought regarding moving the … Also, I'd be glad if you could take a look at #237 - it's another feature, essential to start using multi-schema topics. When they finally get merged, I'll open a third PR, which brings support for resolving datasets based on the topic and record names.
Overall, it looks really good! Added a few suggested changes.
* @param schemaType schema type used to resolve full subject when recordName is absent.
* @return corresponding schema registry subject.
*/
public String toSubject(KafkaSchemaRecordType schemaType) {
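  // A plausible completion, not the actual implementation (an assumption for illustration):
  // assuming this class holds a `topic` String and an Optional<String> `recordName`, a present
  // record name yields a Confluent TopicRecordNameStrategy-style subject "<topic>-<recordName>",
  // otherwise we fall back to the classic "<topic>-key" / "<topic>-value" naming.
  return recordName
      .map(name -> topic + "-" + name)
      .orElseGet(() -> topic + "-" + schemaType.toString());
}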
This is leaking Confluent Schema Registry stuff into the API layer, which must be agnostic to specific schema implementations. The concept of a subject is specific to Confluent.
In fact, the entire concept of a record name is specific to the Avro/Confluent implementation. Protobufs will probably work with this since they also have record names, but what about JSON? IIRC, there is another PR that is about supporting JSON schemas.
I am concerned that this PR is far too tied to the Confluent Schema Registry, especially in the API layer.
I am open to changing the API interfaces to make it easier on the Confluent Schema Registry, but not in a way that excludes other potential implementations (especially ones like JSON, where there is active interest). Can you suggest how this can be altered to not leak Confluent Schema Registry assumptions into the API?
Good point!
I'm working on decoupling this plugin from that particular schema implementation. The rough idea is to extract and encapsulate all the SR-related logic in specialized classes (so that it should be easy to provide support for different schemas).
The responsibility for assembling Confluent Schema Registry subject names has been moved to SchemaRegistrySchemaRetriever. Also, the recordName field stays optional, so in the case of alternative implementations (like the aforementioned JSON one) developers won't have to bother with it.
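For reference, a minimal sketch of what such a value class could look like with an optional record name; the field and accessor names here are assumptions, the real class lives in kcbq-api:

import java.util.Objects;
import java.util.Optional;

public class TopicAndRecordName {
  private final String topic;
  // Empty for schema implementations (e.g. JSON) that have no record-name concept.
  private final Optional<String> recordName;

  public TopicAndRecordName(String topic, String recordName) {
    this.topic = topic;
    this.recordName = Optional.ofNullable(recordName);
  }

  public String getTopic() { return topic; }
  public Optional<String> getRecordName() { return recordName; }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof TopicAndRecordName)) {
      return false;
    }
    TopicAndRecordName that = (TopicAndRecordName) o;
    return topic.equals(that.topic) && recordName.equals(that.recordName);
  }

  @Override
  public int hashCode() {
    return Objects.hash(topic, recordName);
  }
}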
Ok, so @whynick1 and I have done some talking and investigating. I think we have a good idea about the best approach to achieve not only what you're trying to do, but also to evolve KCBQ in a more generic way. This amounts to:
- Adding a pluggable TableRouter that takes a SinkRecord and returns which table it should be written to. (Default: RegexTableRouter)
- Changing the SchemaRetriever interface to have two methods: Schema getKeySchema(SinkRecord) and Schema getValueSchema(SinkRecord). (Default: IdentitySchemaRetriever, which just returns sinkRecord.keySchema() and sinkRecord.valueSchema())
- Changing the way that schema updates are handled in the AdaptiveBigQueryWriter.
This approach should give us a ton of flexibility, including:
- It will allow you to pick the table each individual message is routed to based on all of the information in the SinkRecord (topic, key schema, value schema, message payload, etc.).
- It will allow us to fix some known schema-evolution bugs in cases where one field is added and another is dropped.
- It will allow us to use the SinkRecord's .keySchema() and .valueSchema() rather than talking to the schema registry for schemas.
- It will make it easy to support JSON messages, even those that return null from the .keySchema() and .valueSchema() methods; you can implement a custom retriever for this case.
I am going to write up a GH issue today that documents in detail what needs to be done to implement these changes, and we can go from there.
Does this sound good?
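To make the proposal concrete, here is a minimal sketch of the proposed pieces. Only the type names and method signatures come from the list above; everything else (packages, visibility, keeping the types in one file) is an illustrative assumption:

import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.sink.SinkRecord;

// Pluggable router: decides which BigQuery table a given record is written to.
interface TableRouter {
  TableId getTable(SinkRecord sinkRecord);
}

// Retriever reduced to two methods, both driven purely by the SinkRecord.
interface SchemaRetriever {
  Schema getKeySchema(SinkRecord sinkRecord);
  Schema getValueSchema(SinkRecord sinkRecord);
}

// Proposed default: echo the schemas Connect already attached to the record,
// with no round-trip to the schema registry.
class IdentitySchemaRetriever implements SchemaRetriever {
  @Override
  public Schema getKeySchema(SinkRecord sinkRecord) {
    return sinkRecord.keySchema();
  }

  @Override
  public Schema getValueSchema(SinkRecord sinkRecord) {
    return sinkRecord.valueSchema();
  }
}

With this shape, a RegexTableRouter default could reproduce today's topic-regex behavior, while custom routers inspect record names, as in the example further down.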
Adding some notes regarding @criccomini's points 1, 2, 3.
- As an example for your case:
import java.util.Map;

import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.sink.SinkRecord;

public class MultiSchemaRegexTableRouter implements TableRouter {
  private Map<TopicAndRecordName, TableId> topicsToBaseTableIds;

  @Override
  public TableId getTable(SinkRecord sinkRecord) {
    // SinkRecord has no recordName() accessor; derive the record name from the value schema.
    TopicAndRecordName key = new TopicAndRecordName(sinkRecord.topic(), sinkRecord.valueSchema().name());
    if (topicsToBaseTableIds.containsKey(key)) {
      return topicsToBaseTableIds.get(key);
    }
    // else: return matched BQ table with topic & record name (regex-based fallback, omitted here)
    return null;
  }
}
- Splitting into getKeySchema and getValueSchema also provides an option to load the SinkRecord's key schema (if it has one) into BQ for various purposes, including deduplication, debugging, etc.
- Automatic schema evolution is currently the default (it should be tunable): when a new record with a new schema is inserted into BQ, KCBQ tries to update the BQ schema with the latest one retrieved from Schema Registry. We want to decouple from that and instead derive the new schema from the Record, since a JSON-like Record might not have a schema.
See #245
I really like this idea!
Yesterday I was struggling with changing existing code on my local branch, responsible for picking datasets based on both topic and record name, so that it would keep the API part agnostic and completely unaware of the different dataset routing strategies. At the end of the day I was a little bit frustrated, because I had to either add yet another configurable class (similar to SchemaRetriever, but for datasets) and make plugin users remember to combine the right schema retriever with the right dataset resolver, or add the dataset-routing responsibility to SchemaRetriever, thus breaking the single-responsibility principle...
The proposed TableRouter.getTable method solves all these problems!
@mkubala are you planning to update this with TableRouter?
Hi @criccomini! I've been working on this as part of my daily job, and since we are behind the original schedule for the BigQuery integration I found myself between a rock and a hard place. My client, who paid for the contribution made so far, does not want to spend more money and wants to see the working integration in our product ASAP. I wonder if there is any chance that you could approve & merge this PR "as it is" and …
I understand. @rhauch please let me know if you start working on that.
@mkubala sorry, I won't have time to work on this.
There is light at the end of the tunnel! I spoke to my client this morning and they will let me finish the feature as part of my daily job if I give them a hard estimate / deadline. How much time will you need to review this PR and release a new version of the plugin?
Hello,
@mustosm Maybe I'll find some spare time in the next couple of weeks to finish this feature.
@mkubala thank you for your answer. It would be better if the feature were released by wepay.
It's been used in the staging environment.
Hi @mkubala 👋
👋