[RDP-1913] Reduce metadata refresh interval #76
Merged
Context
According to https://wise.slack.com/archives/G01P8RBLGCC/p1693400446185769
The goal of this change is to give https://github.com/transferwise/kafka-health-checker more time to demote unhealthy brokers. Producer metadata is refreshed periodically, or when the broker returns certain exceptions. We assume that a faulty broker is in a zombie state, so it won't return the PARTITION_MIGRATED exception that would force a metadata update, and the client has to wait for the next periodic refresh. The default metadata.max.age.ms is 5 min, and the producer's delivery timeout is 7 min. Say trouble starts at 13:01 and the health checker reacts by demoting the broker at 13:04. If the Kafka client's metadata was last refreshed at 13:03, the next refresh happens at 13:08, and by then we have already hit the delivery timeout, which also expires at 13:08. When a record cannot be delivered within the delivery timeout, the producer throws an exception. If a producer writing to a changelog topic fails this way, the whole Kafka Streams task is forced to migrate to another instance, hence the rebalancing.
Checklist
Details from ticket: RDP-1913
Reduce metadata refresh interval for Service Kafka clients