Commit
Add prefix caching in HBM and a latency test
Implement prefix caching in HBM and add a latency test to inference_microbenchmark.

Prefix tokens are stored in a trie, which serves as a fast lookup index for the PrefixCache entries held in the cache.

Inserting a longer key replaces a shorter key as the longest-common-prefix key. The shorter key is never returned again, even if the longer key is later erased, and should be evicted in the future. A key is assumed to have the same length as its tokens, so it can be used to slice both the prompt and the cached value. The caller should check the common-prefix length of the returned key.

Erasing a key that is not a leaf does nothing. If the erased key ends at a leaf, that node is deleted, along with any ancestors that become leaves as a result.

The value is moved into the cache, so the caller must not reuse the same value reference after add_to_cache. A value retrieved from the cache is returned by reference and must not be modified either.

Add a PrefixCaching benchmark test to inference_microbenchmark. It uses a proportion of the prefill_length from the config as the common prefix and saves a config-specified number of entries into the cache.
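The insert, lookup, and erase rules described above can be illustrated with a small trie. This is only a sketch under stated assumptions, not the code from this commit: the names `PrefixTrieSketch`, `longest_common_prefix_key`, and `common_prefix_length` are hypothetical, keys are plain tuples of token ids, and the value store (the HBM side) is left out entirely.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Key = Tuple[int, ...]


class _Node:
    """One trie node; the path from the root to this node spells a token prefix."""

    def __init__(self, parent: Optional["_Node"] = None, token: Optional[int] = None):
        self.parent = parent
        self.token = token
        self.children: Dict[int, "_Node"] = {}


class PrefixTrieSketch:
    """Hypothetical index: only root-to-leaf paths count as returnable keys."""

    def __init__(self) -> None:
        self._root = _Node()

    def insert(self, key: Sequence[int]) -> None:
        # Extending an existing path turns the old leaf into an inner node,
        # so a shorter key is shadowed by the longer one.
        node = self._root
        for tok in key:
            child = node.children.get(tok)
            if child is None:
                child = _Node(node, tok)
                node.children[tok] = child
            node = child

    def longest_common_prefix_key(self, tokens: Sequence[int]) -> Optional[Key]:
        # Walk down while the tokens match, then follow any branch to a leaf.
        node = self._root
        path: List[int] = []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            path.append(tok)
            node = child
        if not path:
            return None  # no stored key shares even the first token
        while node.children:
            tok, node = next(iter(node.children.items()))
            path.append(tok)
        return tuple(path)

    def erase(self, key: Sequence[int]) -> None:
        node = self._root
        for tok in key:
            node = node.children.get(tok)
            if node is None:
                return  # key is not in the trie
        if node is self._root or node.children:
            return  # erasing a non-leaf key does nothing
        # Delete the leaf, then every ancestor that became a leaf in turn.
        while node.parent is not None and not node.children:
            del node.parent.children[node.token]
            node = node.parent


def common_prefix_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by a and b (the caller-side check)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

With this sketch, inserting `(1, 2)` and then `(1, 2, 3)` makes a lookup for `(1, 2, 9)` return `(1, 2, 3)`, and `common_prefix_length((1, 2, 9), (1, 2, 3))` gives 2, which is the part of the prompt the caller could reuse. Erasing `(1, 2)` has no effect, while erasing `(1, 2, 3)` removes the whole path, so `(1, 2)` is not returned afterwards.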
1 parent d50683d, commit c8fc38c
Showing 4 changed files with 990 additions and 0 deletions.