Is zombie fencing implemented correctly? #2025
Replies: 3 comments 7 replies
-
You have to configure a unique https://docs.spring.io/spring-kafka/docs/current/reference/html/#transaction-id-prefix
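For context, when running on Spring Boot the prefix can be set via the `spring.kafka.producer.transaction-id-prefix` property (a minimal sketch; the example value is mine, and each concurrently-running instance would need its own value):

```properties
# enables transactions; producers from this factory get
# transactional.id = <prefix> + an incrementing suffix
spring.kafka.producer.transaction-id-prefix=tx-instance-0.
```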
-
@garyrussell yes, I know that. (And slightly tangential, but I think assigning a UUID to your transaction id prefix defeats the purpose: the whole idea is that if k8s creates a new instance while the zombie is still alive, they will have the same id. If each of them is given a different UUID, then zombie fencing will definitely not work.)
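To make the UUID point concrete, here is a toy sketch in plain Java (not spring-kafka; the class and method names are mine, invented for illustration). If each instance derives its prefix from a fresh random UUID, the replacement can never register a transactional.id the zombie already holds, so the broker never bumps an epoch and nothing gets fenced:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: per-instance UUID prefixes produce
// transactional.ids that can never collide across restarts.
public class UuidPrefixSketch {

    // Mimics "prefix + atomically incremented suffix" id assignment.
    public static String nextId(String prefix, AtomicInteger counter) {
        return prefix + counter.getAndIncrement();
    }

    public static void main(String[] args) {
        // Zombie instance: prefix chosen at startup from a random UUID.
        String zombiePrefix = UUID.randomUUID() + ".";
        String zombieId = nextId(zombiePrefix, new AtomicInteger());

        // Replacement instance: a different random UUID.
        String replacementPrefix = UUID.randomUUID() + ".";
        String replacementId = nextId(replacementPrefix, new AtomicInteger());

        // The ids differ, so the broker never sees a duplicate
        // transactional.id and cannot fence the zombie.
        System.out.println(zombieId.equals(replacementId));
    }
}
```

With a fixed, shared prefix the two ids above would be equal, which is exactly what lets the broker fence the older producer.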
-
@garyrussell Two instances that are supposed to run concurrently MUST have unique ids, I agree, but my example was describing a single replica that k8s lost communication with, so it started another instance. To my understanding that's the classic problem zombie fencing is supposed to solve, and Spring is not doing this correctly because it creates producers on demand, so you cannot guarantee the restarted instance will have the same number of producers as the zombie; if the zombie has more producers, you can still get duplicates. That said, it's very possible that I'm misunderstanding fencing, which is why I opened a discussion and not an issue. If I'm misunderstanding, I'd like to understand better.
-
Hi, I was going through the code and I think there's an issue with how the transactional producers are created on-the-fly that can break zombie fencing, but I might misunderstand how the implementation works.
To my understanding, when we begin a transaction we start in the transaction manager:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/transaction/KafkaTransactionManager.java
Line 146 in c061fce
We check if there's an available producer, and if not we turn to the factory to create it:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/ProducerFactoryUtils.java
Line 96 in c061fce
Assuming we're configured without producerPerConsumerPartition, we'll go to createTransactionalProducer:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/DefaultKafkaProducerFactory.java
Line 653 in c061fce
Then it seems to check again whether there's an available producer in the cache (I wonder why we check twice, but that's beside the point), and eventually we create the transactional producer here, with a transactional id whose suffix is an atomically incremented integer:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/DefaultKafkaProducerFactory.java
Line 793 in c061fce
Now, imagine we have a service instance in k8s where 3 threads start a transaction at the same time: we get 3 producers with transactional ids x1, x2 and x3.
Then that instance becomes a zombie, i.e. it loses connectivity with k8s, and k8s starts up an instance to replace it.
The new instance, due to different timings, only starts 2 transactions at the same time, so it only gets 2 producers, x1 and x2.
Meanwhile, our zombie is producing from all 3 of its producers.
Kafka will correctly fence and filter out the messages from x1 and x2, because it sees new producers with the same ids, but x3 will still be allowed to send messages, breaking our zombie fencing and our exactly-once semantics.
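The scenario can be sketched in plain Java (a toy model, not spring-kafka or real broker code; FencedBroker-style epoch bookkeeping is reduced to set arithmetic, and all names here are hypothetical). Each instance hands out ids on demand as prefix plus an AtomicInteger suffix, so the set of ids an instance claims depends on how many producers it happened to create:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the scenario above: transactional.ids are
// assigned on demand as prefix + an incrementing suffix, so two
// instances sharing a prefix only "overlap" on the ids both created.
public class FencingSketch {

    static class Instance {
        final String prefix;
        final AtomicInteger counter = new AtomicInteger();
        Instance(String prefix) { this.prefix = prefix; }
        String createProducerId() { return prefix + counter.getAndIncrement(); }
    }

    public static void main(String[] args) {
        String prefix = "tx.";

        // Zombie instance: 3 concurrent transactions -> 3 producers.
        Instance zombie = new Instance(prefix);
        Set<String> zombieIds = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            zombieIds.add(zombie.createProducerId()); // tx.0, tx.1, tx.2
        }

        // Replacement instance: only 2 concurrent transactions.
        Instance replacement = new Instance(prefix);
        Set<String> replacementIds = new HashSet<>();
        for (int i = 0; i < 2; i++) {
            replacementIds.add(replacement.createProducerId()); // tx.0, tx.1
        }

        // Only ids re-registered by the replacement get a bumped epoch;
        // any id the zombie holds exclusively is never fenced.
        Set<String> unfenced = new HashSet<>(zombieIds);
        unfenced.removeAll(replacementIds);
        System.out.println(unfenced); // the zombie's third producer id
    }
}
```

Under these assumptions the zombie's third id is never re-registered by the replacement, which is the gap the question is about.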
Is this a bug or am I misunderstanding something?
Thanks.