Prefix Caching with HBM and latency test #1278
base: main
Conversation
Looks good, I left a few comments.
Modify the recursive erase to avoid hitting the recursion limit with long contexts.
LGTM!
Synced with Yuyan offline, I think the current approach looks good.
Rebase and squash.
looks good
Description
Implements prefix caching in HBM and a latency test in inference_microbenchmark.
Prefix tokens are stored in a trie that serves as a fast lookup index for PrefixCache entries.
Inserting a longer key replaces a shorter key as the longest-common-prefix match.
The shorter key will never be returned again, even if the longer key is later erased, and is expected to be evicted in the future.
A key is assumed to have the same length as its tokens, so it can be used to slice the prompt and the cached value.
The caller should check the common-prefix length of the returned key.
Erasing a key that does not end at a leaf does nothing.
If the erased key ends at a leaf, that node is deleted, and its ancestors may become leaves afterwards. A minimal sketch of these trie semantics is shown below.
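A minimal sketch of the lookup, insert, and erase semantics described above, assuming keys are tuples of token ids; `PrefixTrie` and its method names are illustrative placeholders, not the classes added by this PR:

```python
from typing import Optional


class _Node:
  """One trie node; `key` is set only where a stored key ends."""

  def __init__(self):
    self.children = {}  # token id -> _Node
    self.key = None     # full key (tuple of token ids) ending here, if any


class PrefixTrie:
  """Indexes stored keys (tuples of token ids) by their prefixes."""

  def __init__(self):
    self._root = _Node()

  def insert(self, key):
    """Inserts key; any shorter key along its path is no longer returned."""
    node = self._root
    for tok in key:
      node.key = None  # a longer key replaces a shorter key on its path
      node = node.children.setdefault(tok, _Node())
    node.key = key

  def longest_common_prefix_key(self, tokens) -> Optional[tuple]:
    """Returns a stored key sharing the longest common prefix with tokens.

    The returned key can be longer than the shared prefix, so the caller
    must check the common-prefix length before slicing prompt and value.
    """
    node, best = self._root, None
    for tok in tokens:
      nxt = node.children.get(tok)
      if nxt is None:
        break
      node = nxt
      if node.key is not None:
        best = node.key
    if best is None and node is not self._root:
      # No key ends on the matched path; fall back to a longer key below it.
      while node.key is None and node.children:
        node = next(iter(node.children.values()))
      best = node.key
    return best

  def erase(self, key) -> bool:
    """Deletes key only if it ends at a leaf; its parent may become a leaf."""
    parent, node = None, self._root
    for tok in key:
      nxt = node.children.get(tok)
      if nxt is None:
        return False
      parent, node = node, nxt
    if parent is None or node.children or node.key != key:
      return False  # erasing a key that is not a leaf does nothing
    del parent.children[key[-1]]
    return True
```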
The value is moved into the cache, so the same value reference must not be used after add_to_cache; JAX may modify the value even when another Python reference to it is held.
If a value is needed after add_to_cache, copy it before calling add_to_cache.
Values retrieved from the cache are always copied before being returned, so callers cannot modify the cached entry. The sketch below illustrates this copy discipline.
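A hedged sketch of the copy discipline, using a plain dict as a stand-in for the HBM-resident prefix cache; `add_to_cache` and `fetch` are assumed names, not the PR's exact API:

```python
import jax
import jax.numpy as jnp

_cache = {}  # stand-in for the HBM-resident prefix cache


def add_to_cache(key, value):
  # The cache takes ownership of `value`; callers must not reuse the reference.
  _cache[key] = value


def fetch(key):
  # Always return a copy so callers cannot modify the cached arrays.
  return jax.tree_util.tree_map(lambda x: x.copy(), _cache[key])


tokens = (1, 2, 3, 4)
kv_value = {"decoder_layer_0": jnp.ones((1, 4, 8, 128))}  # toy KV cache entry

# Copy first if the value is still needed after insertion.
kv_for_later_use = jax.tree_util.tree_map(lambda x: x.copy(), kv_value)
add_to_cache(tokens, kv_value)

restored = fetch(tokens)  # a copy; mutating it leaves the cached entry intact
```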
Adds a PrefixCaching benchmark test to inference_microbenchmark.
It uses half of the prefill_length as the common prefix and saves 100 prefixes in the cache.
Loading the cache (including jax.Array.copy) appears to be independent of the prefill_length (tested with 128 and 1024), even though the saved cache sizes differ.
jax.profiler shows the copy operation consumes a similar amount of time on TPU; this may be because the sizes are not large or different enough to show a significant impact. A hedged sketch of how such prompts could be constructed is shown after this paragraph.
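A sketch of how prompts sharing a common prefix could be generated for a benchmark like this; the names and vocabulary size are illustrative, and the actual test lives in inference_microbenchmark:

```python
import numpy as np

prefill_length = 1024
num_cached_prefixes = 100
common_prefix_length = prefill_length // 2  # half of prefill_length is shared
vocab_size = 32000  # illustrative vocabulary size

rng = np.random.default_rng(0)
common_prefix = rng.integers(0, vocab_size, size=common_prefix_length)

# Each prompt shares the common prefix and fills the rest with unique tokens.
prompts = [
    np.concatenate(
        [common_prefix,
         rng.integers(0, vocab_size, size=prefill_length - common_prefix_length)])
    for _ in range(num_cached_prefixes)
]
```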
Part of the results are shown below.
FIXES: b/389788256
TESTED: unittest
Checklist
Before submitting this PR, please make sure (put X in square brackets):