DeepSeek V2/V3 implementation refactored to allow non-MLA and MLA #12313
Conversation
@jukofyork I wanted to try this, but there seems to be a problem with DeepSeek R1 model conversion in your branch:
@fairydreaming I'm actually just reverting this as I realised it was going to be really hard to maintain. I'm now just merging the older "with flash attention" PR with the `-mla` options, but trying to use at least:

```cpp
struct ggml_tensor * q_states = ggml_concat(ctx0, q_nope_absorbed, q_pe, 0);
cb(q_states, "q_states", il);

struct ggml_tensor * k_states = ggml_concat(ctx0, kv_compressed, k_pe_view, 0);
cb(k_states, "k_states", il);

struct ggml_tensor * v_states = kv_compressed;
cb(v_states, "v_states", il);

// these nodes are added to the graph together so that they are not reordered
// by doing so, the number of splits in the graph is reduced
ggml_build_forward_expand(gf, q_states);
ggml_build_forward_expand(gf, k_states);
ggml_build_forward_expand(gf, v_states);

llm_build_kv_store(ctx0, hparams, cparams, kv_self, gf, k_states, v_states, n_tokens, kv_head, cb, il);
```

I'll have it done in a couple of hours and there won't be any need to requant then too (closing this for now).
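For context, `ggml_concat` along dimension 0 just stacks the per-token feature dimensions, so the cached `k_states` rows end up being the compressed latent plus the RoPE part. A minimal standalone toy of that step (the sizes are assumed, not taken from the branch):

```cpp
// Toy illustration of what the concat above produces: joining a [512 x n_tokens]
// compressed latent with a [64 x n_tokens] RoPE part along dim 0 gives
// [576 x n_tokens] rows to store in the cache. The sizes are assumed.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t n_tokens = 4;

    struct ggml_init_params ip = { /*.mem_size =*/ 8*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * kv_compressed = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 512, n_tokens);
    struct ggml_tensor * k_pe          = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  64, n_tokens);

    // concatenate along dimension 0 (the per-token feature dimension)
    struct ggml_tensor * k_states = ggml_concat(ctx, kv_compressed, k_pe, 0);

    printf("k_states: %lld x %lld\n", (long long) k_states->ne[0], (long long) k_states->ne[1]);  // 576 x 4
    ggml_free(ctx);
    return 0;
}
```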
IMPORTANT: This will require re-quantising all models that use this PR!!!
This is a vastly tidied up continuation of #11446 and #12227 which allows the use of the `-mla` (`--mla-attn`) option:

- With the `-mla` option it essentially converts MLA into MQA (with very low KV-cache overhead, but at the cost of more compute; see the rough cache-size sketch below).
- The `build_deepseek2()` code now uses the proper `llm_build_kv()` calls for both the non-MLA and MLA branches.
- The forced F32 upcast, no 2D x 2D optimisations, and the splitting of the `q_b` and `kv_b` tensors to extract the MQA (ie: RoPE part) separately (see below).

NOTE: This will require re-quantising all models that use this, but this won't change and I intend to run some experiments over the next few days to find better quant rules for the newly split-up tensors (to hopefully avoid so many of the numerical problems that seem to plague this model).
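To put the "very low KV-cache overhead, but at the cost of more compute" trade-off in rough numbers, here is a small back-of-the-envelope sketch. The head count and dimensions are illustrative values roughly in line with the published DeepSeek-V2/V3 configs, not numbers read from this PR:

```cpp
// Rough per-token, per-layer KV-cache element counts for the two paths.
// All dimensions below are illustrative assumptions.
#include <cstdio>

int main() {
    const long n_head           = 128;
    const long qk_nope_head_dim = 128;  // non-RoPE part of each Q/K head
    const long qk_rope_head_dim = 64;   // RoPE part (shared across heads for K)
    const long v_head_dim       = 128;
    const long kv_lora_rank     = 512;  // compressed KV latent width

    // non-MLA path: cache the full per-head K and V (MHA-style)
    const long mha_per_token = n_head * (qk_nope_head_dim + qk_rope_head_dim)   // K
                             + n_head * v_head_dim;                             // V

    // MLA-as-MQA path: cache only the compressed latent plus the shared RoPE part
    const long mla_per_token = kv_lora_rank + qk_rope_head_dim;

    printf("MHA cache elements per token/layer:        %ld\n", mha_per_token);
    printf("MLA-as-MQA cache elements per token/layer: %ld\n", mla_per_token);
    printf("reduction: ~%.0fx\n", (double) mha_per_token / (double) mla_per_token);
    return 0;
}
```

With these assumed dimensions that works out to roughly 40960 vs 576 cache elements per token and layer, i.e. around a 70x reduction.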
I also plan to see if I can get back some of the lost performance my previous PR gave (but at the cost of a vastly more complex/unmaintainable `build_deepseek2()` due to all the 2D/3D views it used). DONE

I have left context shifting disabled for now, but I have been careful to move the RoPE parts to the first `n_rot` parameters, so it should be possible eventually to get this working with `build_k_shift()` and `build_defrag()`, etc. (a rough sketch of the idea follows below). I can't cleanly add this currently though, and if I try it will likely end up a confusing mess of overriding the GGUF file parameters for `n_embd_k_gqa` and `n_embd_v_gqa`. I've tried to do this as cleanly as the current code allows in: `llama-kv-cache.cpp::llama_kv_cache_init()`, `llama.cpp::llm_build_kv_store()` and `llama.cpp::llm_build_kqv()`. I'm also not 100% clear on the ins-and-outs of the YaRN implementation and how it works for context shifting, etc.
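On the context-shifting point, a minimal standalone sketch of the idea that only the first `n_rot` values of each cached K row would need re-rotating during a K-shift. This is a toy, not the PR's actual code: the rope mode, the YaRN-style parameters and the sizes are all assumptions for illustration:

```cpp
// Toy sketch: because the RoPE part occupies the first n_rot values of each
// cached K row, a context shift only needs to re-rotate those leading dims
// and can leave the rest of the row untouched. All parameters are assumed.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t n_embd_k = 576;  // assumed per-token K row (512 latent + 64 RoPE)
    const int64_t n_rot    = 64;   // only these leading dims carry RoPE
    const int64_t n_kv     = 32;   // cached tokens in this toy example

    struct ggml_init_params ip = { /*.mem_size =*/ 64*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    // toy "K cache": one row of n_embd_k values per cached token
    struct ggml_tensor * k_cache = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, n_kv);

    // per-token position deltas that a context shift would apply
    struct ggml_tensor * shift = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_kv);

    // rope works per "head", so view the cache as [n_embd_k, 1 head, n_kv];
    // passing n_rot as n_dims rotates only the first n_rot values of each row
    // and passes the remaining (non-RoPE) values through unchanged
    struct ggml_tensor * k_view    = ggml_reshape_3d(ctx, k_cache, n_embd_k, 1, n_kv);
    struct ggml_tensor * k_shifted = ggml_rope_ext(
        ctx, k_view, shift, NULL,
        /*n_dims     =*/ n_rot,
        /*mode       =*/ GGML_ROPE_TYPE_NEOX,  // assumed rope style
        /*n_ctx_orig =*/ 4096,
        /*freq_base  =*/ 10000.0f, /*freq_scale  =*/ 1.0f,
        /*ext_factor =*/ 0.0f,     /*attn_factor =*/ 1.0f,
        /*beta_fast  =*/ 32.0f,    /*beta_slow   =*/ 1.0f);

    // graph construction only -- computing it would go through the usual
    // backend/scheduler machinery
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, k_shifted);

    printf("k_shifted: %lld x %lld x %lld\n",
           (long long) k_shifted->ne[0], (long long) k_shifted->ne[1], (long long) k_shifted->ne[2]);

    ggml_free(ctx);
    return 0;
}
```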
Things in `llama.cpp` and `ggml` I'm still a bit unsure of:

- All the places I need to add the `-mla` option, and I'm not entirely confident I have them all (I looked at how the `-fa` option was used and tried to copy that as best I could).
- Should I be using the `nb[]` values? I'm currently just quantising everything to BF16 (for the attention tensors anyway), so it's possible some of my views are not going to work when quantised... 😕
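On the `nb[]` question, a small standalone illustration of why views need to be built from the tensor's own byte strides (`nb[]` / `ggml_row_size()`) rather than from element counts once quantised types are involved. This is just a sketch of general `ggml` behaviour, not code from this PR, and the sizes are arbitrary:

```cpp
// Standalone illustration of row strides for F32 vs a block-quantised type,
// and of a view built from the tensor's own nb[] values. Sizes are arbitrary.
#include <cstdio>
#include "ggml.h"

int main() {
    const int64_t ne0 = 576, ne1 = 8;

    struct ggml_init_params ip = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * t_f32 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  ne0, ne1);
    struct ggml_tensor * t_q8  = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, ne0, ne1);

    // for F32 the row stride is simply ne0 * sizeof(float), but for block-quantised
    // types it is (ne0 / block_size) * block_bytes, so assuming "elements * type size"
    // silently breaks once the tensor is quantised
    printf("F32  row stride: %zu bytes (ggml_row_size: %zu)\n", t_f32->nb[1], ggml_row_size(GGML_TYPE_F32,  ne0));
    printf("Q8_0 row stride: %zu bytes (ggml_row_size: %zu)\n", t_q8->nb[1],  ggml_row_size(GGML_TYPE_Q8_0, ne0));

    // a view of the first 512 values of each row, built from the tensor's own nb[1];
    // for a quantised tensor the same split is only valid if 512 lands on a block boundary
    struct ggml_tensor * v = ggml_view_2d(ctx, t_f32, 512, ne1, t_f32->nb[1], 0);
    printf("view: %lld x %lld\n", (long long) v->ne[0], (long long) v->ne[1]);

    ggml_free(ctx);
    return 0;
}
```

Since BF16 rows are just tightly-packed 2-byte elements, views at arbitrary element offsets are fine there; it is the block-quantised types where a split has to land on a block boundary.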