GH-3879: Add cache to optimize header match performance. #3934

Open · wants to merge 5 commits into main

Conversation

chickenchickenlove (Contributor)

Fixes: #3879
Issue link: #3879

In the previous discussion, we decided to use LinkedHashMap for the LRU cache implementation.
However, I think we missed the concurrency of KafkaListenerContainer.
We can set concurrency when we create a KafkaListener, e.g. @KafkaListener(concurrency = 10, ...).

But LinkedHashMap is not thread-safe.
So, I use the ConcurrentLruCache provided by spring-framework to ensure thread safety.

What
Add an LRU cache for pattern matching in KafkaHeaderMapper.

Why?
To reduce the CPU time spent on pattern matching in KafkaHeaderMapper.
Commonly, many Kafka records in the same topic carry the same header names.
Currently, a pattern match has O(M*N) time complexity, where M is the
pattern length and N is the header-name length.

If the results of pattern matches are cached and reused by KafkaHeaderMapper,
we can expect an improvement in CPU usage.
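
For illustration, a minimal sketch of the idea using Spring's ConcurrentLruCache; the pattern and cache size are example values only, not the actual ones in this PR:

import java.util.regex.Pattern;

import org.springframework.util.ConcurrentLruCache;

class HeaderMatchCacheSketch {

    private final Pattern pattern = Pattern.compile("my-header-.*"); // example pattern only

    // On a cache miss, the generator runs the (relatively expensive) pattern match once;
    // subsequent look-ups for the same header name return the cached boolean.
    private final ConcurrentLruCache<String, Boolean> matchCache =
            new ConcurrentLruCache<>(1000, this.pattern.asMatchPredicate()::test);

    boolean matches(String headerName) {
        return this.matchCache.get(headerName);
    }

}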

Signed-off-by: Sanghyeok An <[email protected]>
Signed-off-by: Sanghyeok An <[email protected]>
@chickenchickenlove changed the title from "spring-projectsGH-3879: Add cache to optimize header match performance." to "GH-3879: Add cache to optimize header match performance." on May 30, 2025
@artembilan (Member) left a comment

Just one nit-pick.
I love everything else.
Thank you again!

Signed-off-by: Sanghyeok An <[email protected]>
@artembilan (Member)

the pattern matching logic would still be executed repeatedly each time.

Right, my bad.
The problem is not about multi-value header names, which is how the caching concern was raised originally.
That sent my thinking in the wrong direction.

The real problem is that we always run the pattern matching, when we could just cache the matching outcome and skip it the next time we see the same header name in a record.

So, it sounds like we have 3 states: single-value, multi-value and don't map.
I think we can use a LinkedHashMap as the cache implementation and a Boolean object as the value for a header name.
Then, every time we get the value from the cache (or, better to say, from a method backed by the cache), we would get the respective state for that header name.
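
For illustration, a minimal sketch of that LinkedHashMap-based cache; the capacity is an arbitrary example, and TRUE could mean multi-value, FALSE single-value, with an absent key meaning "don't map":

import java.util.LinkedHashMap;
import java.util.Map;

// Not thread-safe on its own: containers sharing one mapper would need external
// synchronization. Value semantics (illustrative): TRUE = multi-value,
// FALSE = single-value, absent key = not evaluated yet ("don't map").
class LruHeaderStateCache extends LinkedHashMap<String, Boolean> {

    private static final int MAX_ENTRIES = 1000; // arbitrary example capacity

    LruHeaderStateCache() {
        super(16, 0.75f, true); // access-order, so the eldest entry is the least recently used
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > MAX_ENTRIES;
    }

}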

Does this make sense?
Thanks

@chickenchickenlove (Contributor, Author)

@artembilan
Thanks for your comments! 🙇‍♂️

Don’t map refers to patterns that have not yet been evaluated, right?
If so, the current implementation already covers what you’re describing.

// 1.
if (this.headerMatchedCache.isMultiValuePattern(headerName)) {
    return true;
}

// 2.
if (this.headerMatchedCache.isSingleValuePattern(headerName)) {
    return false;
}

// At this point, the header is in the 'don't map' state.
// In case the pattern matches as multi-value:
// 3.
this.headerMatchedCache.cacheAsMultiValueHeader(headerName);
...

// In case the pattern doesn't match multi-value
// 4.
this.headerMatchedCache.cacheAsSingleValueHeader(headerName);

// After this, it is no longer in the 'don't map' state.

The don’t map state will be resolved to either multi-value or single-value as a result of the pattern match.
So, after step 4, the state will be determined as either multi-value or single-value.

Also, ConcurrentMessageListenerContainer shares the KafkaHeaderMapper instance among its child containers.
So, they can cause a race condition when the LinkedHashMap is full and removeEldestEntry() is called.

This is the reason why I used two separate ConcurrentLruCache instances.

IMHO, the race condition that might occur in removeEldestEntry() could be a minor issue.
If you agree that the race condition in removeEldestEntry() is negligible, I'll replace ConcurrentLruCache with LinkedHashMap instead.

What do you think?
Please let me know your opinion 🙇‍♂️ .

@artembilan (Member)

I see now.
The matchesForInbound() is called unconditionally for any header.
We get to the caching logic only later, in fromUserHeader().
I was under the impression that we always cache the header name that has just undergone pattern matching, independently of the outcome.
So, I think the optimization you are doing here is just half of the story.
Maybe it would be better to look into the algorithm one more time and optimize all of the pattern matching?
Or not do it at all.

It would break the contract of doesMatch() to return a nullable Boolean instead.
But it would indeed be a great improvement if we cached everything, independently of whether it is multi-value, single-value, or won't be mapped at all.
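
Something like this hypothetical sketch, where all three outcomes are cached explicitly (all names and rules here are made up for illustration):

import org.springframework.util.ConcurrentLruCache;

class TriStateHeaderCacheSketch {

    enum HeaderState { MULTI_VALUE, SINGLE_VALUE, NOT_MAPPED }

    // On a miss, resolve() runs the pattern matching once and the outcome is cached,
    // whatever it is (multi-value, single-value, or not mapped at all).
    private final ConcurrentLruCache<String, HeaderState> cache =
            new ConcurrentLruCache<>(1000, this::resolve);

    HeaderState stateFor(String headerName) {
        return this.cache.get(headerName);
    }

    // Placeholder for the real pattern-matching logic.
    private HeaderState resolve(String headerName) {
        if (headerName.startsWith("multi-")) { // made-up rule for illustration
            return HeaderState.MULTI_VALUE;
        }
        if (headerName.startsWith("single-")) { // made-up rule for illustration
            return HeaderState.SINGLE_VALUE;
        }
        return HeaderState.NOT_MAPPED;
    }

}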

What do you think?

@chickenchickenlove (Contributor, Author)

chickenchickenlove commented Jun 4, 2025

@artembilan
Thank you for sharing your thoughts. 🙇‍♂️
I think it would be better to keep the current algorithm.

Currently, the method doesMatchMultiValueHeader(), which determines whether a header is a multi-value header, is called in two places:
one is toHeaders(), the other is fromHeaders().

The method matchesForInbound(...) is only called within toHeaders().
However, pattern matching for multi-value headers is performed in both toHeaders() and fromHeaders().

Also, in the context of toHeaders(), we need to consider headers like jsonTypes, so there are actually 4 logical states (don't map, json types, single-value, multi-value).

Furthermore, I believe we should respect reserved header names for safety, such as KafkaHeaders.DELIVERY_ATTEMPT and KafkaHeaders.LISTENER_INFO.

To summarize:

  1. Pattern matching is performed in two places; however, matchesForInbound(...) covers only one of them.
  2. Moving the pattern matching logic to the top of the method could potentially affect reserved header names.
  3. If we try to determine the header type at the beginning, we might end up handling more than just the current four states; this may also grow in the future (e.g., 5 or even 10 states), making the logic more complex.

For these reasons, I think it would be better to keep the current algorithm.
What do you think?
Please let me know!

@artembilan (Member)

Right. I meant matchesForInbound(...) as an example of where we do pattern matching.
Of course I have fromHeaders() in mind as well.
Sorry if my original comment was not clear.

The rest of your analysis makes sense.
For me it feels like no caching at all would keep the logic cleaner, but since you have already done a decent job on the multi/single-value caching, we may just apply your fix as is.
In the end it is fully internal.

On the other hand, I wonder if we can come up with a CompositeHeaderMatcher which would do caching as well.
This way we would always cache any pattern matching results.
Right, that may lead to duplicate entries in the original this.matchers and in our new this.multiValueHeaderMatchers, but in the end they are just strings from those header names.
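
A rough, hypothetical sketch of such a caching composite (the nested delegate interface is simplified from the real HeaderMatcher contract):

import java.util.List;

import org.springframework.util.ConcurrentLruCache;

// Hypothetical composite: delegates to a list of matchers and memoizes the combined
// outcome per header name, so every pattern-matching result is cached.
class CachingCompositeHeaderMatcher {

    interface HeaderMatcher { // simplified from the real HeaderMatcher contract

        boolean matchHeader(String headerName);

    }

    private final List<HeaderMatcher> delegates;

    private final ConcurrentLruCache<String, Boolean> cache;

    CachingCompositeHeaderMatcher(List<HeaderMatcher> delegates) {
        this.delegates = List.copyOf(delegates);
        this.cache = new ConcurrentLruCache<>(1000,
                headerName -> this.delegates.stream().anyMatch(matcher -> matcher.matchHeader(headerName)));
    }

    boolean matchHeader(String headerName) {
        return this.cache.get(headerName);
    }

}
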
Does it make sense?

@chickenchickenlove (Contributor, Author)

@artembilan
Thank you for the thoughtful feedback. 🙇‍♂️

I have considered introducing a CompositeHeaderMatcher, but it seems difficult to implement effectively. 🥲
Currently, the HeaderMapper depends on the HeaderMatcher interface, and this interface is structured in a way that separates the required functionality across different classes (NeverMatchHeaderMatcher, SimplePatternBasedHeaderMatcher).

As a result, even if we introduce a CompositeHeaderMatcher within the current implementation, it would be hard to utilize it effectively in methods such as doesMatchMultiValue(), and we could not keep the code clean either.
Therefore, IMHO, it is better to keep the current HeaderMatcher implementation as it is.

By the way, through our conversation, I came up with another idea.
It may use a bit more memory, but it could lead to a cleaner implementation.

The idea is for each HeaderMapper to maintain two separate caches:

  1. One for storing headers that matched itself.
  2. Another for storing headers that did not match but have already been seen.

For example,

protected static class SimplePatternBasedHeaderMatcher implements HeaderMatcher {

   private final ConcurrentLruCache<String, Boolean> matchedCache = new ConcurrentLruCache<>(1000, ...);

   private final ConcurrentLruCache<String, Boolean> notMatchedCache = new ConcurrentLruCache<>(1000, ...);

   public boolean matchHeader(String headerName) {
      String header = headerName.toLowerCase(...);
      if (this.matchedCache.contains(header)) {
         return true;
      }
      if (this.notMatchedCache.contains(header)) {
         return false;
      }

      // Keep the original logic, and add the result to the caches.
      ...
   }

}

When using a ConcurrentLruCache with a capacity of 1000 and storing 1000 key-value pairs where keys are strings of length 50, the memory usage is approximately:

(2 bytes * 50) * 1000 entries = 100,000 bytes ≈ 100KB.

Therefore, using this approach would require about 200KB of additional memory per mapper, but it could lead to a cleaner implementation.
Considering that modern JVMs typically have heap memory in the range of gigabytes, and that the number of HeaderMatcher instances is likely to be only a few dozen, the impact on memory usage would probably be negligible.
Additionally, having the HeaderMatcher reference the cache internally feels more semantically natural.
Also, with this structure, even if new cache-related states are introduced, no significant changes should be required from an extensibility perspective.

If my suggestion was heading in an unreasonable direction, I apologize.
I'd appreciate hearing your thoughts on this. 🙇‍♂️

@artembilan (Member)

This is great conversation, @chickenchickenlove , and I appreciate all your thinking and patience with my stubbornness.
Let me take this locally and see what I can do from a pure code perspective.
I admit that all my review so far was just here on GH 😉

@artembilan (Member) left a comment

Here is what I'm thinking about this caching implementation:
AbstractKafkaHeaderMapper.zip

Sorry about the zip file; it is not easy to show the different parts of the class changes otherwise.
Note that your current HeaderPatternMatchCache is fully eliminated.

Signed-off-by: Sanghyeok An <[email protected]>
@chickenchickenlove (Contributor, Author)

@artembilan
Thanks for your comments and code!
It makes sense to me and is better 👍
I hadn't considered that a Function interface could be used that way with ConcurrentLruCache 😅.

Thanks for your patience!! 🙇‍♂️
I made a new commit.
When you have time, please take another look!

Signed-off-by: Sanghyeok An <[email protected]>
@artembilan (Member) left a comment

LGTM
Let's hear from @sobychacko for a double check, since we have fought back and forth with this feature for a while.

Thank you again!

@chickenchickenlove (Contributor, Author)

Thanks for your review!!!
It was very helpful to me, again.
Thanks, always 🙇‍♂️
