Is zombie fencing implemented correctly? #2025
Replies: 3 comments 7 replies
-
You have to configure a unique https://docs.spring.io/spring-kafka/docs/current/reference/html/#transaction-id-prefix
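For context, when running on Spring Boot the prefix can be set via the `spring.kafka.producer.transaction-id-prefix` property (a minimal sketch; the example value is mine, and each concurrently-running instance would need its own value):

```properties
# enables transactions; producers from this factory get
# transactional.id = <prefix> + an incrementing suffix
spring.kafka.producer.transaction-id-prefix=tx-instance-0.
```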
-
@garyrussell yes, I know that. (And slightly tangential, but I think assigning a UUID to your transaction id prefix defeats the purpose: the whole idea is that if k8s creates a new instance while the zombie is still alive, they will have the same id. If each of them is given a different UUID, then zombie fencing will definitely not work.)
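To make the UUID point concrete, here is a toy sketch in plain Java (not spring-kafka; the class and method names are mine, invented for illustration). If each instance derives its prefix from a fresh random UUID, the replacement can never register a transactional.id the zombie already holds, so the broker never bumps an epoch and nothing gets fenced:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: per-instance UUID prefixes produce
// transactional.ids that can never collide across restarts.
public class UuidPrefixSketch {

    // Mimics "prefix + atomically incremented suffix" id assignment.
    public static String nextId(String prefix, AtomicInteger counter) {
        return prefix + counter.getAndIncrement();
    }

    public static void main(String[] args) {
        // Zombie instance: prefix chosen at startup from a random UUID.
        String zombiePrefix = UUID.randomUUID() + ".";
        String zombieId = nextId(zombiePrefix, new AtomicInteger());

        // Replacement instance: a different random UUID.
        String replacementPrefix = UUID.randomUUID() + ".";
        String replacementId = nextId(replacementPrefix, new AtomicInteger());

        // The ids differ, so the broker never sees a duplicate
        // transactional.id and cannot fence the zombie.
        System.out.println(zombieId.equals(replacementId));
    }
}
```

With a fixed, shared prefix the two ids above would be equal, which is exactly what lets the broker fence the older producer.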
-
@garyrussell Two instances that are supposed to run concurrently MUST have unique ids, I agree, but my example was describing a single replica that k8s lost communication with, so it started another instance. To my understanding that's the classic problem zombie fencing is supposed to solve, and Spring is not doing this correctly because it creates producers on demand, so you cannot guarantee the restarted instance will have the same number of producers as the zombie; if the zombie has more producers, you can still get duplicates. That said, it's very possible that I'm misunderstanding fencing, which is why I opened a discussion and not an issue. If I'm misunderstanding, I'd like to understand better.
-
Hi, I was going through the code and I think there's an issue with how the transactional producers are created on-the-fly that can break zombie fencing, but I might misunderstand how the implementation works.
To my understanding, when we begin a transaction we start in the transaction manager:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/transaction/KafkaTransactionManager.java
Line 146 in c061fce
We check if there's an available producer, and if not we turn to the factory to create it:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/ProducerFactoryUtils.java
Line 96 in c061fce
Assuming we're configured without producerPerConsumerPartition, we'll go to createTransactionalProducer:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/DefaultKafkaProducerFactory.java
Line 653 in c061fce
Then it seems to check again whether there's an available producer in the cache (I wonder why we check twice, but that's beside the point), and eventually we create the transactional producer here, with a transactional id whose suffix is an atomically incremented integer:
spring-kafka/spring-kafka/src/main/java/org/springframework/kafka/core/DefaultKafkaProducerFactory.java
Line 793 in c061fce
Now, imagine we have a service instance in k8s where 3 threads start a transaction at the same time: we get 3 producers with transactional ids x1, x2 and x3.
Then that instance becomes a zombie, i.e. it loses connectivity with k8s, and k8s starts up an instance to replace it.
The new instance, due to different timings, only starts 2 transactions at the same time, so it only gets 2 producers, x1 and x2.
Meanwhile, our zombie is producing from all 3 of its producers.
Kafka will correctly fence and filter out the messages from x1 and x2, because it sees new producers with the same ids, but x3 will still be allowed to send messages, breaking our zombie fencing and our exactly-once semantics.
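The scenario can be sketched in plain Java (a toy model, not spring-kafka or real broker code; FencedBroker-style epoch bookkeeping is reduced to set arithmetic, and all names here are hypothetical). Each instance hands out ids on demand as prefix plus an AtomicInteger suffix, so the set of ids an instance claims depends on how many producers it happened to create:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the scenario above: transactional.ids are
// assigned on demand as prefix + an incrementing suffix, so two
// instances sharing a prefix only "overlap" on the ids both created.
public class FencingSketch {

    static class Instance {
        final String prefix;
        final AtomicInteger counter = new AtomicInteger();
        Instance(String prefix) { this.prefix = prefix; }
        String createProducerId() { return prefix + counter.getAndIncrement(); }
    }

    public static void main(String[] args) {
        String prefix = "tx.";

        // Zombie instance: 3 concurrent transactions -> 3 producers.
        Instance zombie = new Instance(prefix);
        Set<String> zombieIds = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            zombieIds.add(zombie.createProducerId()); // tx.0, tx.1, tx.2
        }

        // Replacement instance: only 2 concurrent transactions.
        Instance replacement = new Instance(prefix);
        Set<String> replacementIds = new HashSet<>();
        for (int i = 0; i < 2; i++) {
            replacementIds.add(replacement.createProducerId()); // tx.0, tx.1
        }

        // Only ids re-registered by the replacement get a bumped epoch;
        // any id the zombie holds exclusively is never fenced.
        Set<String> unfenced = new HashSet<>(zombieIds);
        unfenced.removeAll(replacementIds);
        System.out.println(unfenced); // the zombie's third producer id
    }
}
```

Under these assumptions the zombie's third id is never re-registered by the replacement, which is the gap the question is about.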
Is this a bug or am I misunderstanding something?
Thanks.