DeepSeek V2/V3 implementation refactored to allow non-MLA and MLA #12313
Conversation
@jukofyork I wanted to try this, but there seems to be a problem with DeepSeek R1 model conversion in your branch:
@fairydreaming I'm actually just reverting this as I realised it was going to be really hard to maintain. I'm now just merging the older "with flash attention" PR with the `-mla` options, but trying to use at least:

```cpp
struct ggml_tensor * q_states = ggml_concat(ctx0, q_nope_absorbed, q_pe, 0);
cb(q_states, "q_states", il);

struct ggml_tensor * k_states = ggml_concat(ctx0, kv_compressed, k_pe_view, 0);
cb(k_states, "k_states", il);

struct ggml_tensor * v_states = kv_compressed;
cb(v_states, "v_states", il);

// these nodes are added to the graph together so that they are not reordered
// by doing so, the number of splits in the graph is reduced
ggml_build_forward_expand(gf, q_states);
ggml_build_forward_expand(gf, k_states);
ggml_build_forward_expand(gf, v_states);

llm_build_kv_store(ctx0, hparams, cparams, kv_self, gf, k_states, v_states, n_tokens, kv_head, cb, il);
```

I'll have it done in a couple of hours and there won't be any need to requant then too (closing this for now).
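For context, `ggml_concat` along dimension 0 just stacks the per-token feature dimensions, so the cached `k_states` rows end up being the compressed latent plus the RoPE part. A minimal standalone toy of that step (the sizes are assumed, not taken from the branch):

```cpp
// Toy illustration of what the concat above produces: joining a [512 x n_tokens]
// compressed latent with a [64 x n_tokens] RoPE part along dim 0 gives
// [576 x n_tokens] rows to store in the cache. The sizes are assumed.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t n_tokens = 4;

    struct ggml_init_params ip = { /*.mem_size =*/ 8*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * kv_compressed = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, n_tokens);
    struct ggml_tensor * k_pe          = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  64, n_tokens);

    // concatenate along dimension 0 (the per-token feature dimension)
    struct ggml_tensor * k_states = ggml_concat(ctx, kv_compressed, k_pe, 0);

    printf("k_states: %lld x %lld\n", (long long) k_states->ne[0], (long long) k_states->ne[1]);  // 576 x 4
    ggml_free(ctx);
    return 0;
}
```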
IMPORTANT: This will require re-quantising all models that use this PR!!!
This is a vastly tidied up continuation of #11446 and #12227 which allows the use of the `-mla` (`--mla-attn`) option:

- With the `-mla` option it essentially converts MLA into MQA (with very low KV-cache overhead, but at the cost of more compute; see the rough cache-size sketch below).
- The `build_deepseek2()` code now uses the proper `llm_build_kv()` calls for both the non-MLA and MLA branches.
- The forced F32 upcast, no 2D x 2D optimisations, and the splitting of the `q_b` and `kv_b` tensors to extract the MQA (ie: RoPE part) separately (see below).

NOTE: This will require re-quantising all models that use this, but this won't change and I intend to run some experiments over the next few days to find better quant rules for the newly split-up tensors (to hopefully avoid so many of the numerical problems that seem to plague this model).
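To put the "very low KV-cache overhead, but at the cost of more compute" trade-off in rough numbers, here is a small back-of-the-envelope sketch. The head count and dimensions are illustrative values roughly in line with the published DeepSeek-V2/V3 configs, not numbers read from this PR:

```cpp
// Rough per-token, per-layer KV-cache element counts for the two paths.
// All dimensions below are illustrative assumptions.
#include <cstdio>

int main() {
    const long n_head           = 128;
    const long qk_nope_head_dim = 128;  // non-RoPE part of each Q/K head
    const long qk_rope_head_dim = 64;   // RoPE part (shared across heads for K)
    const long v_head_dim       = 128;
    const long kv_lora_rank     = 512;  // compressed KV latent width

    // non-MLA path: cache the full per-head K and V (MHA-style)
    const long mha_per_token = n_head * (qk_nope_head_dim + qk_rope_head_dim)   // K
                             + n_head * v_head_dim;                             // V

    // MLA-as-MQA path: cache only the compressed latent plus the shared RoPE part
    const long mla_per_token = kv_lora_rank + qk_rope_head_dim;

    printf("MHA cache elements per token/layer:        %ld\n", mha_per_token);
    printf("MLA-as-MQA cache elements per token/layer: %ld\n", mla_per_token);
    printf("reduction: ~%.0fx\n", (double) mha_per_token / (double) mla_per_token);
    return 0;
}
```

With these assumed dimensions that works out to roughly 40960 vs 576 cache elements per token and layer, i.e. around a 70x reduction.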
I also plan to see if I can get back some of the lost performance my previous PR gave (but at the cost of a vastly more complex/unmaintainable `build_deepseek2()` due to all the 2D/3D views it used). DONE

I have left context shifting disabled for now, but I have been careful to move the RoPE parts to the first `n_rot` parameters, so it should be possible eventually to get this working with `build_k_shift()` and `build_defrag()`, etc. (a rough sketch of the idea follows below). I can't cleanly add this currently though, and if I try it will likely end up a confusing mess of overriding the GGUF file parameters for `n_embd_k_gqa` and `n_embd_v_gqa`. I've tried to do this as cleanly as the current code allows in: `llama-kv-cache.cpp::llama_kv_cache_init()`, `llama.cpp::llm_build_kv_store()` and `llama.cpp::llm_build_kqv()`. I'm also not 100% clear on the ins-and-outs of the YaRN implementation and how it works for context shifting, etc.
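On the context-shifting point, a minimal standalone sketch of the idea that only the first `n_rot` values of each cached K row would need re-rotating during a K-shift. This is a toy, not the PR's actual code: the rope mode, the YaRN-style parameters and the sizes are all assumptions for illustration:

```cpp
// Toy sketch: because the RoPE part occupies the first n_rot values of each
// cached K row, a context shift only needs to re-rotate those leading dims
// and can leave the rest of the row untouched. All parameters are assumed.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t n_embd_k = 576;  // assumed per-token K row (512 latent + 64 RoPE)
    const int64_t n_rot    = 64;   // only these leading dims carry RoPE
    const int64_t n_kv     = 32;   // cached tokens in this toy example

    struct ggml_init_params ip = { /*.mem_size =*/ 64*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    // toy "K cache": one row of n_embd_k values per cached token
    struct ggml_tensor * k_cache = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, n_kv);

    // per-token position deltas that a context shift would apply
    struct ggml_tensor * shift = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_kv);

    // rope works per "head", so view the cache as [n_embd_k, 1 head, n_kv];
    // passing n_rot as n_dims rotates only the first n_rot values of each row
    // and passes the remaining (non-RoPE) values through unchanged
    struct ggml_tensor * k_view    = ggml_reshape_3d(ctx, k_cache, n_embd_k, 1, n_kv);
    struct ggml_tensor * k_shifted = ggml_rope_ext(
        ctx, k_view, shift, NULL,
        /*n_dims     =*/ n_rot,
        /*mode       =*/ GGML_ROPE_TYPE_NEOX,  // assumed rope style
        /*n_ctx_orig =*/ 4096,
        /*freq_base  =*/ 10000.0f, /*freq_scale  =*/ 1.0f,
        /*ext_factor =*/ 0.0f,     /*attn_factor =*/ 1.0f,
        /*beta_fast  =*/ 32.0f,    /*beta_slow   =*/ 1.0f);

    // graph construction only -- computing it would go through the usual
    // backend/scheduler machinery
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, k_shifted);

    printf("k_shifted: %lld x %lld x %lld\n",
           (long long) k_shifted->ne[0], (long long) k_shifted->ne[1], (long long) k_shifted->ne[2]);

    ggml_free(ctx);
    return 0;
}
```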
Things in `llama.cpp` and `ggml` I'm still a bit unsure of:

- All the places I need to add the `-mla` option, and I'm not entirely confident I have them all (I looked at how the `-fa` option was used and tried to copy that as best I could).
- Should I be using the `nb[]` values? I'm currently just quantising everything to BF16 (for the attention tensors anyway), so it's possible some of my views are not going to work when quantised... 😕
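On the `nb[]` question, a small standalone illustration of why views need to be built from the tensor's own byte strides (`nb[]` / `ggml_row_size()`) rather than from element counts once quantised types are involved. This is just a sketch of general `ggml` behaviour, not code from this PR, and the sizes are arbitrary:

```cpp
// Standalone illustration of row strides for F32 vs a block-quantised type,
// and of a view built from the tensor's own nb[] values. Sizes are arbitrary.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t ne0 = 576, ne1 = 8;

    struct ggml_init_params ip = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * t_f32 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  ne0, ne1);
    struct ggml_tensor * t_q8  = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, ne0, ne1);

    // for F32 the row stride is simply ne0 * sizeof(float), but for block-quantised
    // types it is (ne0 / block_size) * block_bytes, so assuming "elements * type size"
    // silently breaks once the tensor is quantised
    printf("F32  row stride: %zu bytes (ggml_row_size: %zu)\n", t_f32->nb[1], ggml_row_size(GGML_TYPE_F32,  ne0));
    printf("Q8_0 row stride: %zu bytes (ggml_row_size: %zu)\n", t_q8->nb[1],  ggml_row_size(GGML_TYPE_Q8_0, ne0));

    // a view of the first 512 values of each row, built from the tensor's own nb[1];
    // for a quantised tensor the same split is only valid if 512 lands on a block boundary
    struct ggml_tensor * v = ggml_view_2d(ctx, t_f32, 512, ne1, t_f32->nb[1], 0);
    printf("view: %lld x %lld\n", (long long) v->ne[0], (long long) v->ne[1]);

    ggml_free(ctx);
    return 0;
}
```

Since BF16 rows are just tightly-packed 2-byte elements, views at arbitrary element offsets are fine there; it is the block-quantised types where a split has to land on a block boundary.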