server : separate the notion of position and KV tokens, remove prompt truncation #13576
Conversation
Why not just keep track of the KV image delta in one variable? It looks incredibly inefficient to scan through the whole prompt to find the KV length when it could easily be computed as the length of the prompt cache plus the KV image delta, which gets updated as new images are found during prompt processing. Since there will be a small number of images, it's also easy to maintain a companion image-info `std::vector` which stores the position and KV usage of each image in the prompt, and to scan through that when dumping parts of the KV during a context shift. That array will be very short, so it can be scanned efficiently and maintained easily, without walking through the text tokens of the prompt at all. Some applications might have very long text prompts, such as scanning in a PDF with text and images in it, and PDFs can easily reach 50k to 100k text tokens; it is not good to force a walk through that whole prompt every time just to find out how much KV is used.
@steampunque I don't understand what you're saying about the KV delta. Can you make a code example? The scan only happens on slot start, not on every decode. In the normal case, the variable `server_tokens::n_kv` will be used, so no scan is needed.
Just to illustrate the high level concept:
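A rough C++ sketch of the idea (all names here are hypothetical, not actual llama.cpp APIs):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical companion structure, one entry per image in the prompt.
// There are rarely more than a handful of images, so any scan over this
// vector is cheap no matter how long the surrounding text prompt is.
struct image_info {
    size_t pos;  // position of the image within the prompt
    size_t n_kv; // number of KV cells the image occupies
};

std::vector<image_info> images; // filled once while tokenizing the prompt
size_t kv_image_delta = 0;      // extra KV cells used by images beyond their positions

// As each image is found during prompt processing:
//     images.push_back({img_pos, img_n_kv});
//     kv_image_delta += img_n_kv - 1; // image takes img_n_kv cells but 1 position

// Total KV usage then becomes an O(1) computation:
//     size_t n_kv_used = prompt_cache_len + kv_image_delta;
```

During a context shift, only the short `images` vector would need to be scanned to work out how many KV cells a discarded range actually frees.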
This can be extended if needed by passing a `std::vector` to the `mtmd_tokenize` routine which returns an array describing where the images are positioned and how much KV they consume, in case other processing needs it later. There will never be thousands of images (in most cases only one to a few tens), but there can be tens to hundreds of thousands of text tokens wrapped around a small number of images, so this image-information array, if even needed, is short.

As an aside, I never understood the whole idea of the built-in "context shift" functionality when the KV gets full. I believe this functionality should be completely removed from the server, not just auto-disabled when images are present. If the KV hits a capacity limit, generation should just stop there and a "full" status should be returned, so the caller can do whatever they want with it instead of the server executing some arbitrary token-removal heuristic. For a multiturn conversation, context shifts can easily corrupt the prompt by removing tokens in the middle of a prompt template. This problem should not even exist, because whatever process is calling the server should be responsible for handling a full KV, not the server itself. In an extended multiturn situation the calling process can delete earlier parts of the conversation to keep going, and there will never be any worry about a context shift inside the server invalidating the prompt. If the input prompt is too big, the server should likewise not truncate it; it should return no tokens with a "KV full" status.
Then how does your idea work in the case where we want to discard the last N positions (not N tokens) from the […]?

It's the […]
tools/server/server.cpp (outdated)
```cpp
slot.n_past      = slot.cache_tokens.get_common_prefix(prompt_tokens);
slot.n_kv_tokens = slot.cache_tokens.n_kv_tokens(slot.n_past);
```
@steampunque The "scan" loop for counting tokens is only needed in this particular logic and nowhere else in the code base, so currently there is no risk of performance degradation. We can simply merge these two calls into a single API, so we end up with a single loop.
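A minimal self-contained sketch of what that merged call might look like (simplified stand-in types; the real `server_tokens` API may differ):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for a cached entry: one position in the sequence,
// occupying n_kv KV cells (1 for a text token, possibly more for an image).
struct entry {
    int    id;   // token id, or an identifier for an image chunk
    size_t n_kv; // KV cells this entry occupies
};

// Hypothetical merged API: a single pass returns both the common-prefix
// length (in positions) and the number of KV cells that prefix occupies,
// replacing the separate get_common_prefix() and n_kv_tokens() loops.
static std::pair<size_t, size_t> common_prefix_and_kv(
        const std::vector<entry> & cache,
        const std::vector<entry> & prompt) {
    size_t n_past = 0;
    size_t n_kv   = 0;
    while (n_past < cache.size() && n_past < prompt.size() &&
           cache[n_past].id == prompt[n_past].id) {
        n_kv += cache[n_past].n_kv;
        n_past++;
    }
    return {n_past, n_kv};
}
```

`slot.n_past` and `slot.n_kv_tokens` could then both be assigned from the one result.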
Done in 678d7b1
What I showed was just a very high-level rough concept illustrating having `mtmd_tokenize` return information about the images, and then using that information later in downstream prompt processing, to avoid recomputing things that have already been figured out in the tokenize routine. More specifically: […]
I think you still don't get it. We basically already had the API `mtmd_image_tokens_get_n_tokens`, which returns the number of KV tokens an image takes. This API costs nothing; read its source code to understand. There is no need to do a manual image → number-of-tokens mapping like you said.
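For illustration, summing the KV usage of a tokenized multimodal prompt then looks roughly like this. This is a sketch: the `mtmd_input_chunk_*` helpers are recalled from the `mtmd.h` of that era and their exact signatures may differ.

```cpp
#include "mtmd.h" // tools/mtmd/mtmd.h

// Sketch: total KV usage of a tokenized multimodal prompt, with no manual
// image bookkeeping needed on the caller's side.
static size_t prompt_kv_usage(const mtmd_input_chunks * chunks) {
    size_t n_kv = 0;
    for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
        const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
        if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
            // mtmd_image_tokens_get_n_tokens just reads a precomputed field
            n_kv += mtmd_image_tokens_get_n_tokens(mtmd_input_chunk_get_tokens_image(chunk));
        } else {
            size_t n_tokens = 0;
            mtmd_input_chunk_get_tokens_text(chunk, &n_tokens);
            n_kv += n_tokens;
        }
    }
    return n_kv;
}
```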
You made many changes to this PR after my comment, so some of my concerns are already addressed. My comments were mainly addressing the original PR, where you were scanning through the entire prompt, seeking out images at every position, to compute the KV size. That should not be necessary if you already know where the images are and how much space they take, via a short companion array computed when tokenizing the prompt. If you have another way to do that efficiently, without the companion array, while still avoiding a walk through the whole prompt looking for images, that is fine.
I think we can improve the naming by avoiding the notion of "KV cache". We have two types of counts to consider: tokens and positions. So we can call them `n_tok`/`n_tokens` and `n_pos`/`n_positions`. Introducing `n_kv`/`n_kv_tokens` seems unnecessary and a bit confusing. The "KV cache" is an implementation detail, and we should aim for the examples and the tools not to refer to it; they should only know about the "cache" or the "memory" of the context.
@ggerganov I made an important change in de8956a:

I know it may sound quite cumbersome, but I think we can potentially add a notion of "view" for […]. Also, while working on this, I think it's now time to start thinking about how to clean up / break the […]
The motivation for this API is that currently we have a bug where the context usage of Qwen VL models (using M-RoPE) is incorrectly tracked. This is because for Qwen VL, an image can occupy multiple KV entries (let's not use the term "tokens" here so it's less confusing), while the image only occupies one position. For example:
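A hypothetical illustration (the numbers are invented for clarity, consistent with the description above): suppose the prompt is `describe <image> please` and the image decodes to 16 embeddings.

```
"describe" ->  1 KV entry,   position 0
<image>    -> 16 KV entries, position 1 (all image entries share it)
"please"   ->  1 KV entry,   position 2

KV entries used: 18        next position (n_past): 3
```

Tracking only `n_past` would report 3 where the memory actually holds 18 entries, which is the mis-tracking described here.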
My idea is to simply separate the notions of `n_past` and `n_kv_tokens`. But it turns out to be a bit more complicated than I initially thought.

To make it a bit easier, I ended up removing the prompt-truncation logic. I remember someone added this feature with the argument that processing the prompt, only to have the context shift kick in and remove processed tokens, is wasteful. However, I don't agree with that: […]

- `n_prompt_tokens`
- `n_prompt_tokens_processed`
Other than that, I'm 90% sure the other things are being "migrated" correctly to `n_kv_tokens`; but just in case, @ggerganov please take your time to review this. Thanks!!

Note: if this PR gets merged, we must also add a breaking change in #9291.