
KAFKA-19122: updateClusterMetadata receives multiple PartitionInfo #19803


Open · wants to merge 1 commit into trunk

Conversation

@brandboat (Member) commented May 24, 2025

This patch resolves the following issues in MetadataCache#toCluster:

  • Avoids duplicate Node entries when a broker has multiple endpoints (illustrated in the sketch below).
  • Fixes a bug where fenced brokers cause a NullPointerException (NPE).
  • Ensures missing topic IDs are properly populated in the cluster metadata.
  • Deletes an unused test code snippet.
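
For the first bullet, here is a minimal, self-contained sketch of the deduplication idea, using simplified stand-in record types rather than Kafka's real Node and BrokerRegistration classes; keying the map by broker id keeps exactly one Node per broker even when a registration advertises several endpoints:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NodesByIdSketch {
    // Simplified stand-ins for Kafka's Node and BrokerRegistration.
    record Node(int id, String host, int port) {}
    record BrokerRegistration(int id, List<Node> nodes) {}

    public static void main(String[] args) {
        List<BrokerRegistration> brokers = List.of(
                // broker 0 advertises two endpoints (e.g. two listeners)
                new BrokerRegistration(0, List.of(new Node(0, "b0", 9092), new Node(0, "b0", 9093))),
                new BrokerRegistration(1, List.of(new Node(1, "b1", 9092))));

        // Keying by broker id yields one Node per broker, so multiple
        // endpoints no longer produce duplicate Node entries.
        Map<Integer, Node> nodesById = brokers.stream()
                .collect(Collectors.toMap(BrokerRegistration::id, b -> b.nodes().get(0)));

        System.out.println(nodesById.size()); // 2 brokers, not 3 endpoints
    }
}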

@brandboat brandboat requested a review from chia7712 May 24, 2025 16:44
@m1a2st (Collaborator) left a comment


Thanks @brandboat for this patch; I left a question.

Comment on lines +150 to +151
Map<Integer, Node> nodesById = image.cluster().brokers().values().stream()
        .collect(Collectors.toMap(BrokerRegistration::id, broker -> broker.nodes().get(0)));
@m1a2st (Collaborator)

Why shouldn’t we filter out the fenced broker here?

@brandboat (Member, Author) May 25, 2025

"Fixes a bug where fenced brokers cause a NullPointerException (NPE)."

As I mentioned in the PR description, this results in a NullPointerException: a PartitionRegistration can still include a fenced broker in its replicas, so if we filter fenced brokers out of nodesById, the later lookup for that replica finds nothing and the NPE fires.

@brandboat (Member, Author)

I'm not quite sure why we filtered out fenced brokers in the first place—was there a reason for doing so?

@m1a2st (Collaborator) May 25, 2025

I'm not sure, but I'm a bit confused: shouldn't the partition leader not be a fenced broker? KIP-841 has the following invariants (see the sketch after this list):

  • a fenced or in-controlled-shutdown replica is not eligible to be in the ISR; and
  • a fenced or in-controlled-shutdown replica is not eligible to become leader.
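
As a rough illustration only, these two invariants boil down to an eligibility check like the hypothetical helper below; this is a sketch of the rule, not Kafka's actual controller code:

public class Kip841EligibilitySketch {
    // Hypothetical helper mirroring the KIP-841 invariants quoted above.
    static boolean eligibleForIsrOrLeadership(boolean fenced, boolean inControlledShutdown) {
        // A replica that is fenced or in controlled shutdown may neither
        // stay in the ISR nor become the leader.
        return !fenced && !inControlledShutdown;
    }

    public static void main(String[] args) {
        System.out.println(eligibleForIsrOrLeadership(true, false));  // false: fenced replica
        System.out.println(eligibleForIsrOrLeadership(false, false)); // true: healthy replica
    }
}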

Maybe I'm misunderstanding something.

@brandboat (Member, Author)

Pardon me, I meant "replicas", not leader.

Maybe an example will help clarify things:
Let’s say we have 3 brokers — broker0, broker1, and broker2 — and a topic called my-topic. Partition 0 of this topic has 3 replicas, one on each broker. Now suppose broker2 unexpectedly shuts down.

Here’s what happens next:

  1. DynamicTopicClusterQuotaPublisher#onMetadataUpdate gets triggered
  2. That leads to MetadataCache#toCluster being called
  3. In toCluster, nodesById gets constructed without the fenced broker (broker2 is filtered out)
  4. Then MetadataCache#toArray uses this nodesById, but in the TopicsImage the PartitionRegistration for my-topic partition-0 still has replicas [0, 1, 2]. Since broker2 isn't in nodesById, we get a NullPointerException, kaboom! (See the sketch below.)
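
Here is a minimal, self-contained sketch of that failure mode, using stand-in types and a hypothetical toArray that mirrors the lookup described above (not the actual MetadataCache code):

import java.util.Map;

public class FencedBrokerNpeSketch {
    record Node(int id, String host, int port) {}

    // Hypothetical stand-in for MetadataCache#toArray: resolve replica ids to Nodes.
    static Node[] toArray(int[] replicas, Map<Integer, Node> nodesById) {
        Node[] nodes = new Node[replicas.length];
        for (int i = 0; i < replicas.length; i++) {
            nodes[i] = nodesById.get(replicas[i]); // null if the broker was filtered out
        }
        return nodes;
    }

    public static void main(String[] args) {
        // broker2 is fenced and was filtered out of nodesById ...
        Map<Integer, Node> nodesById = Map.of(
                0, new Node(0, "b0", 9092),
                1, new Node(1, "b1", 9092));

        // ... but the PartitionRegistration still lists it as a replica.
        int[] replicas = {0, 1, 2};
        Node[] nodes = toArray(replicas, nodesById);

        // nodes[2] is null, so the first dereference blows up: kaboom!
        System.out.println(nodes[2].host()); // throws NullPointerException
    }
}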
