EventHub timeouts lead to Kafka producer clients getting stuck and always failing with InvalidPidMappingException (with default enable.idempotence=true) #261

Open
lgo opened this issue Nov 14, 2024 · 5 comments

Comments


lgo commented Nov 14, 2024

I've hit this a number of times, and while I've worked around it, I'm filing this for two reasons:

  • This feels like a bug in the interaction between EventHub and the Kafka client, given that I have not encountered it in similar situations with Kafka. That said, I'm not sure I've even seen timeouts on actual Kafka deployments with our setup, as we have transparent retries
  • For anyone else who runs into this mysterious problem, hopefully you find this and can resolve the issue

I have relatively low timeouts configured because of specific requirements on some topics, with the following Kafka producer client configuration:

retries=1
linger.ms=2
request.timeout.ms=5000
delivery.timeout.ms=10011 # (request timeout * (retry + 1) + linger + 1)
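
For reference, a minimal sketch of the arithmetic in the comment above. Plugging the listed values into that formula gives 10003 ms, so the configured 10011 ms sits slightly above it. The class and method names are hypothetical. The second check mirrors the lower bound the Kafka producer enforces on its configuration (delivery.timeout.ms must be at least linger.ms + request.timeout.ms). Note that the formula does not account for retry.backoff.ms, the pause between attempts.

```java
final class DeliveryTimeoutCheck {
    /** Formula from the config comment: request timeout * (retries + 1) + linger + 1. */
    static long formulaMs(long requestTimeoutMs, int retries, long lingerMs) {
        return requestTimeoutMs * (retries + 1) + lingerMs + 1;
    }

    /** Lower bound on delivery.timeout.ms: linger.ms + request.timeout.ms. */
    static boolean satisfiesClientMinimum(long deliveryTimeoutMs,
                                          long requestTimeoutMs,
                                          long lingerMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    public static void main(String[] args) {
        // Values taken from the configuration listed in this issue.
        long formula = formulaMs(5000, 1, 2);
        System.out.println("formula result: " + formula + " ms");
        System.out.println("configured 10011 ms meets the minimum: "
                + satisfiesClientMinimum(10011, 5000, 2));
    }
}
```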

In several situations (e.g. EventHub server restarts due to upgrades, excess consumer load hammering EventHub), we've observed that once any timeout occurs, the Kafka producer client gets stuck and always fails with the following error:

org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.

The Kafka client in this situation will not self-recover, even if EventHub has recovered. Recovery is manual, by re-initializing the Kafka producer client. Of course, this only occurs with the default Kafka setting of enable.idempotence=true, which introduces client transaction IDs. I've found this easy to reproduce by inducing a high load on EventHub, such as an amplified Kafka consumption rate: say, a consumer deployed 100s of times or a Spark streaming job with many tasks.
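
A minimal sketch of the manual recovery described above: a wrapper that closes and re-creates the client once an error is classified as fatal. Everything here is hypothetical. `P` stands in for `KafkaProducer`, and the `isFatal` predicate would check for `InvalidPidMappingException` in a real setup. With the real producer, send errors usually surface through the send callback or the returned `Future` rather than as a synchronous exception, so the check would have to live there.

```java
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Holds a producer-like client and replaces it after a fatal error (hypothetical helper). */
final class ReinitializingClient<P extends AutoCloseable> {
    private final Supplier<P> factory;          // builds a fresh client
    private final Predicate<Throwable> isFatal; // decides which errors force a rebuild
    private P current;
    private int reinitCount = 0;

    ReinitializingClient(Supplier<P> factory, Predicate<Throwable> isFatal) {
        this.factory = factory;
        this.isFatal = isFatal;
        this.current = factory.get();
    }

    /** Runs the action on the current client; on a fatal error, swaps the client and rethrows. */
    synchronized <R> R call(Function<P, R> action) {
        try {
            return action.apply(current);
        } catch (RuntimeException e) {
            if (isFatal.test(e)) {
                replaceClient();
            }
            throw e; // the caller decides whether to retry on the fresh client
        }
    }

    private void replaceClient() {
        try {
            current.close();
        } catch (Exception closeError) {
            // Best effort: the old client is already considered unusable.
        }
        current = factory.get();
        reinitCount++;
    }

    synchronized int reinitCount() {
        return reinitCount;
    }

    public static void main(String[] args) {
        // Stand-in client that does nothing; a real setup would build a KafkaProducer here.
        ReinitializingClient<AutoCloseable> client = new ReinitializingClient<>(
                () -> () -> { },
                error -> error instanceof IllegalStateException);
        try {
            client.call(p -> { throw new IllegalStateException("simulated fatal error"); });
        } catch (IllegalStateException expected) {
            System.out.println("re-initialized " + client.reinitCount() + " time(s)");
        }
    }
}
```

The wrapper rethrows instead of retrying by itself, so the failed record is not silently re-sent on the new client; whether to re-send is left to the caller.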

Author

lgo commented Nov 15, 2024

Ah, seems like KIP-588 is relevant. It doesn't seem to be resolved, but it does have a couple of changes related to it. It's still a mystery to me why we're only seeing this for EventHub when it's likely we're also seeing timeouts on other clients but without the lasting impact.

Member

danewalton commented Jan 17, 2025

We're hitting this issue as well, and we don't have transactions enabled.

Author

lgo commented Jan 17, 2025

We were advised that our retries/timeouts were simply too low and that our options were to either increase them closer to the defaults (which I believe were simply unacceptably high for our use case) or set enable.idempotence=false to disable the involvement of internal producer IDs (for idempotence). I do still believe there's an Azure EventHub bug here, because in all this time we have not observed the same non-recoverable errors with actual Kafka brokers, but 🤷
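
A rough sketch of the two options as producer properties. The class and method names are hypothetical. The thread does not state which values the advice had in mind, so the first option simply uses what I understand to be the Kafka Java client defaults in recent versions (an assumption), while the second keeps the low timeouts from this issue and only turns idempotence off.

```java
import java.util.Properties;

final class IdempotenceWorkarounds {
    /** Option 1: move retries and timeouts back toward the client defaults (assumed values). */
    static Properties raisedTimeouts() {
        Properties p = new Properties();
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE)); // assumed client default
        p.setProperty("request.timeout.ms", "30000");                // assumed client default
        p.setProperty("delivery.timeout.ms", "120000");              // assumed client default
        return p;
    }

    /** Option 2: keep the low timeouts from this issue, but disable idempotence. */
    static Properties idempotenceDisabled() {
        Properties p = new Properties();
        p.setProperty("retries", "1");
        p.setProperty("linger.ms", "2");
        p.setProperty("request.timeout.ms", "5000");
        p.setProperty("delivery.timeout.ms", "10011");
        p.setProperty("enable.idempotence", "false"); // no producer ID is requested
        return p;
    }

    public static void main(String[] args) {
        System.out.println(raisedTimeouts());
        System.out.println(idempotenceDisabled());
    }
}
```

With idempotence disabled, a retried send can write a duplicate record, so the second option trades duplicate protection for avoiding the stuck producer ID.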

Author

lgo commented Jan 17, 2025

(and to be clear, this is about having idempotence enabled, which does some internal ID management, hence my earlier mention of transactions, rather than the public-facing transactional commits feature of Kafka)

@danewalton
Member

Yeah, we have added the recommended configs as well, and will have to wait and see if we hit it again. But we also aren't using transactions, so it seems weird to me that we would ever get this error. We don't set the transactional.id value, and therefore, according to the docs, transactions are not used. Docs for InvalidPidMappingException are

[Image: screenshot of the InvalidPidMappingException documentation]

so my thinking is we shouldn't even get this... odd
