llama : refactor llama_context, llama_kv_cache, llm_build_context (v2) #12181

ggerganov · 2025-03-04T15:12:48Z

Overview

The implementation in #11213 became too complicated, trying to make a lot of changes at once. This is an alternative implementation, which does not involve the abstraction of the llama_context. The PR introduces some new abstractions, improves the graph build handling and is an initial step for the next changes listed in section "Next" below.

Rework the old llm_build_context into new llm_graph_context implemented in llama-graph.h/.cpp
Introduce llm_graph_input_... classes for handling graph inputs in a safer and cleaner way
Introduce llm_graph_result for extracting important tensors such as embeddings and logits, instead of searching for them by tensor name
Introduce llm_memory_i concept that will abstract different cache/memory mechanisms. For now we have only llama_kv_cache as a type of memory
Rework session saving/loading using new llama_io_write_i and llama_io_read_i interfaces
Remove "worst case" concept from the graph building logic
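To illustrate the llm_graph_result item above, here is a minimal sketch of the difference between looking a tensor up by name and having the graph builder hand back the tensors of interest directly. Everything in it (fake_tensor, fake_graph_result, build_graph, the node names) is a hypothetical stand-in, not the actual llama.cpp class layout.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for ggml_tensor; only the name matters for this illustration.
struct fake_tensor {
    std::string name;
};

// Name-based lookup: scan every node and compare names.
// A renamed or missing tensor is only noticed at runtime (nullptr).
inline fake_tensor * find_by_name(std::vector<fake_tensor> & nodes, const std::string & name) {
    for (auto & t : nodes) {
        if (t.name == name) {
            return &t;
        }
    }
    return nullptr;
}

// Result-based access: the builder records the important tensors while building.
struct fake_graph_result {
    fake_tensor * t_logits = nullptr;
    fake_tensor * t_embd   = nullptr;
};

// Hypothetical builder: creates two nodes and fills in the result struct.
inline fake_graph_result build_graph(std::vector<fake_tensor> & nodes) {
    nodes.clear();
    nodes.reserve(2); // no reallocation below, so the stored pointers stay valid
    nodes.push_back({"result_norm"});
    nodes.push_back({"result_output"});

    fake_graph_result res;
    res.t_embd   = &nodes[0];
    res.t_logits = &nodes[1];
    return res;
}
```

With this shape, a caller reads res.t_logits instead of calling find_by_name(nodes, "result_output"), so the dependency on a string constant disappears.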

API changes

The current changes are only necessary to make the API more consistent in following the naming convention. To migrate, simply replace the old API calls with the new ones.

Deprecate llama_kv_cache_... API
Add llama_kv_self_... API

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;

    auto model_path = "/home/user/models/DeepSeek-R1-Distill-Qwen-14B-IQ2_M.gguf";
    auto model = llama_load_model_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    auto context_params = llama_context_default_params();
    auto ctx = llama_init_from_model(model, context_params);
    fputs("context created\n", stdout);
    fflush(stdout);

    auto state_size = llama_state_get_size(ctx);
    fputs(("State size: " + std::to_string(state_size) + "\n").c_str(), stdout);
    fflush(stdout);

    llama_free(ctx);
    llama_free_model(model);

    llama_backend_free();
}

ggerganov · 2025-03-15T07:01:33Z

@giladgd #12397 should fix this.

fairydreaming · 2025-03-17T18:16:15Z

@ggerganov I noticed that T5 models no longer work correctly after merging this PR so I investigated possible causes.

I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase. As a result, the T5 encoder currently uses a causal attention mask, which is wrong. Another problem is that in the T5 decoder implementation, build_attn() with llm_graph_input_attn_kv_unified inp expects a "2D" V tensor, as indicated by this assert:

llama.cpp/src/llama-graph.cpp

Line 1381 in 01e8f21

assert(v_cur->ne[0] == n_embd_v_gqa && v_cur->ne[1] == n_tokens);

but you pass "3D" V tensor here:

llama.cpp/src/llama-model.cpp

Line 9566 in 01e8f21

Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

This results in ggml_transpose() transposing the wrong dimensions in non-debug builds and an assertion failure in debug builds.

I found that removing this single line fixes the problem. I'd correct it myself, but I'm not sure how you intend to handle the is_encoding problem, so I'm leaving it to you.
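The shape mismatch described in this comment can be reproduced without ggml. The sketch below uses a hypothetical fake_shape struct in place of ggml_tensor and assumes n_embd_v_gqa = n_embd_head * n_head_kv; it only models the dimension bookkeeping, not the actual tensor data.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Stand-in for the shape part of ggml_tensor: ne[i] is the element count of dimension i.
struct fake_shape {
    std::array<int64_t, 4> ne;
};

// Mirrors the idea of ggml_reshape_3d: same number of elements, new dimensions.
inline fake_shape reshape_3d(const fake_shape & t, int64_t ne0, int64_t ne1, int64_t ne2) {
    assert(t.ne[0]*t.ne[1]*t.ne[2]*t.ne[3] == ne0*ne1*ne2);
    return fake_shape{{ne0, ne1, ne2, 1}};
}

// The check from the quoted assert, written as a predicate so it can be tested without aborting.
inline bool is_2d_v(const fake_shape & v, int64_t n_embd_v_gqa, int64_t n_tokens) {
    return v.ne[0] == n_embd_v_gqa && v.ne[1] == n_tokens;
}
```

A "2D" V of shape [n_embd_v_gqa, n_tokens] passes is_2d_v(); after reshape_3d(v, n_embd_head, n_head_kv, n_tokens) it does not, because ne[1] is now n_head_kv rather than n_tokens. When the assert is compiled out, any code that treats dimensions 0 and 1 as embedding and tokens ends up operating on head size and head count instead.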

ggerganov · 2025-03-17T18:54:43Z

> Another problem is that in the T5 decoder implementation, build_attn() with llm_graph_input_attn_kv_unified inp expects a "2D" V tensor, as indicated by this assert:

Thanks for catching that. I broke this in commit 70ef653. The reason was that I wanted the PR to produce the same graphs as on master, and this extra reshape was causing some small differences. I think it is best to restore the reshape so that all 3 tensors (Q, K, V) are passed as 3D tensors for consistency.

> I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase. As a result, the T5 encoder currently uses a causal attention mask, which is wrong.

Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).

fairydreaming · 2025-03-17T20:41:32Z

> > I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase. As a result, the T5 encoder currently uses a causal attention mask, which is wrong.
>
> Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).

Do you mean something like this?

if (llama_model_has_encoder(model)) {
   llama_set_causal_attn(lctx, false);
   llama_encode(...);
   llama_set_causal_attn(lctx, true);
}

I just tested it and it works fine. Maybe an extra assert in encode() that prints some info if causal_attn is set to true would be good too; otherwise existing code will silently stop working correctly for no apparent reason.

ggerganov · 2025-03-18T08:40:51Z

Yes, that's what I have in mind. But it is too cumbersome and error-prone. Maybe temporarily we should set causal_attn = false internally for all encode calls and restore it to the value it had before the call.

Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.
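One way to picture the "set causal_attn = false for the encode call and restore it afterwards" idea is a small scope guard. This is a sketch only: fake_context, non_causal_scope, and encode_uses_causal_mask are hypothetical names, and nothing here claims to match how the eventual patch is implemented.

```cpp
#include <cassert>

// Hypothetical stand-in for the part of the context that holds the attention setting.
struct fake_context {
    bool causal_attn = true;
};

// Saves the current causal_attn value, forces it to false,
// and restores the saved value when the scope ends.
class non_causal_scope {
public:
    explicit non_causal_scope(fake_context & ctx) : ctx_(ctx), saved_(ctx.causal_attn) {
        ctx_.causal_attn = false;
    }
    ~non_causal_scope() { ctx_.causal_attn = saved_; }

    non_causal_scope(const non_causal_scope &) = delete;
    non_causal_scope & operator=(const non_causal_scope &) = delete;

private:
    fake_context & ctx_;
    bool           saved_;
};

// Hypothetical encode(): returns the mask setting that was in effect while encoding.
inline bool encode_uses_causal_mask(fake_context & ctx) {
    non_causal_scope guard(ctx);
    return ctx.causal_attn; // evaluated before the guard restores the old value
}
```

Because the restore happens in the destructor, the caller's setting survives early returns and error paths inside the encode call, which is the part that is easy to get wrong when user code toggles the flag by hand.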

fairydreaming · 2025-03-18T08:59:08Z

> Yes, that's what I have in mind. But it is too cumbersome and error-prone. Maybe temporarily we should set causal_attn = false internally for all encode calls and restore it to the value it had before the call.
>
> Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.

@ggerganov I guess the "cleanest" solution would be to add llm_graph_input_attn_no_cache_enc and build_attn_inp_no_cache_enc(), which would be used only by the encoder and would create the KQ mask for the encoder. I see that you already do a similar thing with inp->pos_bucket: there are separate build_inp_pos_bucket_enc() and build_inp_pos_bucket_dec() methods in llm_graph_context for the encoder and decoder.

It could always create a non-causal mask, since I don't know of any models that use causal attention in the encoder. If one appears, handling it would be a matter of adding a new causal_attn_enc flag in hparams and cparams and creating the KQ mask for the encoder based on its value.
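For readers unfamiliar with the distinction, the sketch below builds a simplified additive KQ mask in both modes. It assumes a single sequence with no padding and token positions equal to their index; build_kq_mask is a hypothetical helper, not a llama.cpp function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds an n_tokens x n_tokens additive attention mask in row-major order:
// mask[i*n_tokens + j] is added to the score of query i attending to key j.
// 0.0f keeps the score; -INFINITY removes it after softmax.
inline std::vector<float> build_kq_mask(std::size_t n_tokens, bool causal) {
    std::vector<float> mask(n_tokens * n_tokens, 0.0f);
    if (causal) {
        for (std::size_t i = 0; i < n_tokens; ++i) {
            for (std::size_t j = i + 1; j < n_tokens; ++j) {
                mask[i*n_tokens + j] = -INFINITY; // query i must not see later key j
            }
        }
    }
    return mask;
}
```

An encoder would use build_kq_mask(n, false), so every token attends to the whole input. Applying the causal variant to an encoder hides the right-hand context from each token, which matches the symptom reported for T5.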

ggerganov · 2025-03-18T09:21:06Z

It's hard to decide how to do it exactly. For now, here is a simple patch that should work:

#12447

fairydreaming · 2025-03-18T18:03:48Z

@ggerganov There seems to be another problem with the refactor that manifests when using the CUDA backend with T5 models. From what I understand, the problem is that you copy the encoder output here:

llama.cpp/src/llama-context.cpp

Line 1149 in c6af216

memcpy(cross.v_embd.data(), embd, ggml_nbytes(t_embd));

without making sure the encoder graph has finished computing. When I added a ggml_backend_synchronize() call earlier, it started working correctly:

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 42332acf..8d441b0c 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1100,7 +1100,8 @@ int llama_context::encode(llama_batch & inp_batch) {
                 {
                     // extract token embeddings
                     GGML_ASSERT(n_tokens*n_embd <= (int64_t) embd_size);
-                    ggml_backend_tensor_get_async(backend_embd, t_embd, embd, 0, n_tokens*n_embd*sizeof(float));
+                    ggml_backend_synchronize(backend_embd);
+                    ggml_backend_tensor_get(t_embd, embd, 0, n_tokens*n_embd*sizeof(float));
                 } break;
             case LLAMA_POOLING_TYPE_MEAN:
             case LLAMA_POOLING_TYPE_CLS:
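The ordering problem behind this diff can be modelled deterministically without a GPU. In the sketch below, fake_backend is a hypothetical stand-in whose asynchronous reads are queued and only executed by synchronize(); on a real backend the unsynchronized read would return stale or partially written data rather than clean zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical backend: asynchronous operations are queued and only run by synchronize().
struct fake_backend {
    std::vector<std::function<void()>> pending;

    // Loosely analogous to an async tensor read: returns before dst is written.
    void tensor_get_async(const std::vector<float> & src, float * dst) {
        pending.push_back([&src, dst] {
            std::memcpy(dst, src.data(), src.size() * sizeof(float));
        });
    }

    // Runs every queued operation; afterwards all destination buffers are valid.
    void synchronize() {
        for (auto & op : pending) {
            op();
        }
        pending.clear();
    }
};

// Copies the "encoder output" into a second buffer, optionally skipping the synchronization.
inline std::vector<float> copy_encoder_output(fake_backend & be,
                                              const std::vector<float> & t_embd,
                                              bool sync_first) {
    std::vector<float> embd(t_embd.size(), 0.0f);
    be.tensor_get_async(t_embd, embd.data());
    if (sync_first) {
        be.synchronize();
    }
    std::vector<float> cross(embd.size());
    std::memcpy(cross.data(), embd.data(), embd.size() * sizeof(float)); // reads embd immediately
    be.synchronize(); // drain the queue before embd goes out of scope
    return cross;
}
```

With sync_first set to true the returned buffer equals the encoder output; with false it still holds the initial zeros, because the queued read had not run when the second copy was made.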

giladgd · 2025-03-22T20:24:33Z

I'm getting a segmentation fault when using llama_adapter_lora_init with the latest master, and I think it might be related to this PR since I haven't encountered it before.
It only happens when not offloading layers to the GPU.

Here's simple reproduction code:

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0;

    // https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf";
    auto model = llama_model_load_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    // https://huggingface.co/ngxson/test_gguf_lora_adapter/blob/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf
    auto lora_path = "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf";
    auto lora = llama_adapter_lora_init(model, lora_path);
    fputs("lora created\n", stdout);
    fflush(stdout);

    llama_adapter_lora_free(lora);
    llama_model_free(model);

    llama_backend_free();
}

Here's a stack trace from gdb on an Ubuntu 22.04 machine when compiled with no GPU support:

Stack trace

#0  0x00007fffceb0d360 in ggml_backend_cpu_aarch64_buffer_set_tensor (buffer=<optimized out>, tensor=0x5faeb80, data=0x609d090, offset=<optimized out>, size=262144) at /home/user/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp:5632
        tensor_traits = 0x0
        OK = <optimized out>
#1  0x00007fffceee8a75 in operator() (dev=0x5c3a3a0, orig=<optimized out>, __closure=<synthetic pointer>) at /home/user/llama.cpp/src/llama-adapter.cpp:316
        offs = 100330736
        size = 262144
        ctx_gguf = <optimized out>
        read_buf = <optimized out>
        gguf_file = <optimized out>
        ctx_gguf = <optimized out>
        read_buf = <optimized out>
        gguf_file = <optimized out>
        offs = <optimized out>
        size = <optimized out>
#2  llama_adapter_lora_init_impl (model=..., path_lora=0x1c26f6e6022 <error: Cannot access memory at address 0x1c26f6e6022>, adapter=...) at /home/user/llama.cpp/src/llama-adapter.cpp:321
        orig = {a = <optimized out>, b = 0x5f8b910}
        dev = {a = 0x5c3a3a0, b = 0x5d688f0}
        it = {first = "blk.9.ffn_up.weight", second = {a = 0x5faeb80, b = 0x5faecf0}}
        __for_range = std::unordered_map with 0 elements = {[""] = {a = 0x0, b = 0xc3a4c491c3b8c290}<error reading variable: Cannot access memory at address 0xc2b8c2a0c3a3c4b8>...}
        __for_begin = <optimized out>
        __for_end = <optimized out>
        gguf_file = {pimpl = std::unique_ptr<llama_file::impl> = {get() = 0x5b417c0}}
        read_buf = std::vector of length 262144, capacity 262144 = {0 '\000', 60 '<', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000'...}
        set_tensor = <optimized out>
        __func__ = "llama_adapter_lora_init_impl"
        ctx_init = 0x5c3a3a0
        meta_gguf_params = <optimized out>
        ctx_gguf = std::unique_ptr<gguf_context> = {get() = 0x5faeb80}
        ctx = std::unique_ptr<ggml_context> = {get() = 0x7fffffff7bb8}
        n_tensors = <optimized out>
        ctx_map = std::map with 2 elements = {[0x7fffceb41720 <ggml_backend_cpu_aarch64_buffer_type()::ggml_backend_cpu_buffer_type_aarch64>] = 0x5a19490, [0x7ffff408bd40 <ggml_backend_cpu_buffer_from_ptr_type()::ggml_backend_cpu_buffer_type>] = 0x5a1b920}
        ctx_for_buft = <optimized out>
        ab_map = std::map with 225 elements = {["blk.0.attn_k.weight"] = {a = 0x5f64aa0, b = 0x5f64c10}, ["blk.0.attn_output.weight"] = {a = 0x5f64d80, b = 0x5f64ef0}, ["blk.0.attn_q.weight"] = {a = 0x5f65060, b = 0x5f651d0}, ["blk.0.attn_v.weight"] = {a = 0x5f65340, b = 0x5f654b0}, ["blk.0.ffn_down.weight"] = {a = 0x5f64200, 
            b = 0x5f64370}, ["blk.0.ffn_gate.weight"] = {a = 0x5f644e0, b = 0x5f64650}, ["blk.0.ffn_up.weight"] = {a = 0x5f647c0, b = 0x5f64930}, ["blk.1.attn_k.weight"] = {a = 0x5f65ec0, b = 0x5f66030}, ["blk.1.attn_output.weight"] = {a = 0x5f661a0, b = 0x5f66310}, ["blk.1.attn_q.weight"] = {a = 0x5f66480, b = 0x5f665f0}, 
          ["blk.1.attn_v.weight"] = {a = 0x5f66760, b = 0x5f668d0}, ["blk.1.ffn_down.weight"] = {a = 0x5f65620, b = 0x5f65790}, ["blk.1.ffn_gate.weight"] = {a = 0x5f65900, b = 0x5f65a70}, ["blk.1.ffn_up.weight"] = {a = 0x5f65be0, b = 0x5f65d50}, ["blk.10.attn_k.weight"] = {a = 0x5f672e0, b = 0x5f67450}, 
          ["blk.10.attn_output.weight"] = {a = 0x5f675c0, b = 0x5f67730}, ["blk.10.attn_q.weight"] = {a = 0x5f678a0, b = 0x5f67a10}, ["blk.10.attn_v.weight"] = {a = 0x5f67b80, b = 0x5f67cf0}, ["blk.10.ffn_down.weight"] = {a = 0x5f66a40, b = 0x5f66bb0}, ["blk.10.ffn_gate.weight"] = {a = 0x5f66d20, b = 0x5f66e90}, 
          ["blk.10.ffn_up.weight"] = {a = 0x5f67000, b = 0x5f67170}, ["blk.11.attn_k.weight"] = {a = 0x5f68700, b = 0x5f68870}, ["blk.11.attn_output.weight"] = {a = 0x5f689e0, b = 0x5f68b50}, ["blk.11.attn_q.weight"] = {a = 0x5f68cc0, b = 0x5f68e30}, ["blk.11.attn_v.weight"] = {a = 0x5f68fa0, b = 0x5f69110}, 
          ["blk.11.ffn_down.weight"] = {a = 0x5f67e60, b = 0x5f67fd0}, ["blk.11.ffn_gate.weight"] = {a = 0x5f68140, b = 0x5f682b0}, ["blk.11.ffn_up.weight"] = {a = 0x5f68420, b = 0x5f68590}, ["blk.12.attn_k.weight"] = {a = 0x5f69b20, b = 0x5f69c90}, ["blk.12.attn_output.weight"] = {a = 0x5f69e00, b = 0x5f69f70}, 
          ["blk.12.attn_q.weight"] = {a = 0x5f6a0e0, b = 0x5f6a250}, ["blk.12.attn_v.weight"] = {a = 0x5f6a3c0, b = 0x5f6a530}, ["blk.12.ffn_down.weight"] = {a = 0x5f69280, b = 0x5f693f0}, ["blk.12.ffn_gate.weight"] = {a = 0x5f69560, b = 0x5f696d0}, ["blk.12.ffn_up.weight"] = {a = 0x5f69840, b = 0x5f699b0}, 
          ["blk.13.attn_k.weight"] = {a = 0x5f6af40, b = 0x5f6b0b0}, ["blk.13.attn_output.weight"] = {a = 0x5f6b220, b = 0x5f6b390}, ["blk.13.attn_q.weight"] = {a = 0x5f6b500, b = 0x5f6b670}, ["blk.13.attn_v.weight"] = {a = 0x5f6b7e0, b = 0x5f6b950}, ["blk.13.ffn_down.weight"] = {a = 0x5f6a6a0, b = 0x5f6a810}, 
          ["blk.13.ffn_gate.weight"] = {a = 0x5f6a980, b = 0x5f6aaf0}, ["blk.13.ffn_up.weight"] = {a = 0x5f6ac60, b = 0x5f6add0}, ["blk.14.attn_k.weight"] = {a = 0x5f6c360, b = 0x5f6c4d0}, ["blk.14.attn_output.weight"] = {a = 0x5f6c640, b = 0x5f6c7b0}, ["blk.14.attn_q.weight"] = {a = 0x5f6c920, b = 0x5f6ca90}, 
          ["blk.14.attn_v.weight"] = {a = 0x5f6cc00, b = 0x5f6cd70}, ["blk.14.ffn_down.weight"] = {a = 0x5f6bac0, b = 0x5f6bc30}, ["blk.14.ffn_gate.weight"] = {a = 0x5f6bda0, b = 0x5f6bf10}, ["blk.14.ffn_up.weight"] = {a = 0x5f6c080, b = 0x5f6c1f0}, ["blk.15.attn_k.weight"] = {a = 0x5f6d780, b = 0x5f6d8f0}, 
          ["blk.15.attn_output.weight"] = {a = 0x5f6da60, b = 0x5f6dbd0}, ["blk.15.attn_q.weight"] = {a = 0x5f6dd40, b = 0x5f6deb0}, ["blk.15.attn_v.weight"] = {a = 0x5f6e020, b = 0x5f6e190}, ["blk.15.ffn_down.weight"] = {a = 0x5f6cee0, b = 0x5f6d050}, ["blk.15.ffn_gate.weight"] = {a = 0x5f6d1c0, b = 0x5f6d330}, 
          ["blk.15.ffn_up.weight"] = {a = 0x5f6d4a0, b = 0x5f6d610}, ["blk.16.attn_k.weight"] = {a = 0x5f6eba0, b = 0x5f6ed10}, ["blk.16.attn_output.weight"] = {a = 0x5f6ee80, b = 0x5f6eff0}, ["blk.16.attn_q.weight"] = {a = 0x5f6f160, b = 0x5f6f2d0}, ["blk.16.attn_v.weight"] = {a = 0x5f6f440, b = 0x5f6f5b0}, 
          ["blk.16.ffn_down.weight"] = {a = 0x5f6e300, b = 0x5f6e470}, ["blk.16.ffn_gate.weight"] = {a = 0x5f6e5e0, b = 0x5f6e750}, ["blk.16.ffn_up.weight"] = {a = 0x5f6e8c0, b = 0x5f6ea30}, ["blk.17.attn_k.weight"] = {a = 0x5f6ffc0, b = 0x5f70130}, ["blk.17.attn_output.weight"] = {a = 0x5f702a0, b = 0x5f70410}, 
          ["blk.17.attn_q.weight"] = {a = 0x5f70580, b = 0x5f706f0}, ["blk.17.attn_v.weight"] = {a = 0x5f70860, b = 0x5f709d0}, ["blk.17.ffn_down.weight"] = {a = 0x5f6f720, b = 0x5f6f890}, ["blk.17.ffn_gate.weight"] = {a = 0x5f6fa00, b = 0x5f6fb70}, ["blk.17.ffn_up.weight"] = {a = 0x5f6fce0, b = 0x5f6fe50}, 
          ["blk.18.attn_k.weight"] = {a = 0x5f713e0, b = 0x5f71550}, ["blk.18.attn_output.weight"] = {a = 0x5f716c0, b = 0x5f71830}, ["blk.18.attn_q.weight"] = {a = 0x5f719a0, b = 0x5f71b10}, ["blk.18.attn_v.weight"] = {a = 0x5f71c80, b = 0x5f71df0}, ["blk.18.ffn_down.weight"] = {a = 0x5f70b40, b = 0x5f70cb0}, 
          ["blk.18.ffn_gate.weight"] = {a = 0x5f70e20, b = 0x5f70f90}, ["blk.18.ffn_up.weight"] = {a = 0x5f71100, b = 0x5f71270}, ["blk.19.attn_k.weight"] = {a = 0x5f72800, b = 0x5f72970}, ["blk.19.attn_output.weight"] = {a = 0x5f72ae0, b = 0x5f72c50}, ["blk.19.attn_q.weight"] = {a = 0x5f72dc0, b = 0x5f72f30}, 
          ["blk.19.attn_v.weight"] = {a = 0x5f730a0, b = 0x5f73210}, ["blk.19.ffn_down.weight"] = {a = 0x5f71f60, b = 0x5f720d0}, ["blk.19.ffn_gate.weight"] = {a = 0x5f72240, b = 0x5f723b0}, ["blk.19.ffn_up.weight"] = {a = 0x5f72520, b = 0x5f72690}, ["blk.2.attn_k.weight"] = {a = 0x5f73c20, b = 0x5f73d90}, 
          ["blk.2.attn_output.weight"] = {a = 0x5f73f00, b = 0x5f74070}, ["blk.2.attn_q.weight"] = {a = 0x5f741e0, b = 0x5f74350}, ["blk.2.attn_v.weight"] = {a = 0x5f744c0, b = 0x5f74630}, ["blk.2.ffn_down.weight"] = {a = 0x5f73380, b = 0x5f734f0}, ["blk.2.ffn_gate.weight"] = {a = 0x5f73660, b = 0x5f737d0}, 
          ["blk.2.ffn_up.weight"] = {a = 0x5f73940, b = 0x5f73ab0}, ["blk.20.attn_k.weight"] = {a = 0x5f75040, b = 0x5f751b0}, ["blk.20.attn_output.weight"] = {a = 0x5f75320, b = 0x5f75490}, ["blk.20.attn_q.weight"] = {a = 0x5f75600, b = 0x5f75770}, ["blk.20.attn_v.weight"] = {a = 0x5f758e0, b = 0x5f75a50}, 
          ["blk.20.ffn_down.weight"] = {a = 0x5f747a0, b = 0x5f74910}, ["blk.20.ffn_gate.weight"] = {a = 0x5f74a80, b = 0x5f74bf0}, ["blk.20.ffn_up.weight"] = {a = 0x5f74d60, b = 0x5f74ed0}, ["blk.21.attn_k.weight"] = {a = 0x5f76460, b = 0x5f765d0}, ["blk.21.attn_output.weight"] = {a = 0x5f76740, b = 0x5f768b0}...}
        str_endswith = <optimized out>
#3  0x00007fffceee90b2 in llama_adapter_lora_init (model=0x59ad9e0, path_lora=0x7ffff41505d0 "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf") at /home/user/llama.cpp/src/llama-adapter.cpp:333
        adapter = 0x5d69b40
        __func__ = "llama_adapter_lora_init"
#4  0x00007ffff412563b in repro () at /home/user/repro/repro.cpp:236
        model_params = {devices = 0x0, n_gpu_layers = 0, split_mode = LLAMA_SPLIT_MODE_LAYER, main_gpu = 0, tensor_split = 0x0, progress_callback = 0x0, progress_callback_user_data = 0x0, kv_overrides = 0x0, vocab_only = false, use_mmap = true, use_mlock = false, check_tensors = false}
        model_path = 0x7ffff4150568 "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"
        model = 0x59ad9e0
        lora_path = 0x7ffff41505d0 "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf"
        lora = <optimized out>

ggerganov · 2025-03-25T09:18:52Z

@giladgd Could you confirm that the cause is actually #12332? I think weights repacking is currently not compatible with using LoRA adapters.

giladgd · 2025-03-25T23:16:23Z

@ggerganov I've run some tests and you're right, #12332 is indeed the cause. Sorry for the confusion.

okaris · 2025-06-20T05:49:48Z

Did this change enable the API as promised in #11577?

> The plan is to refactor the llama_kv_cache to become a public object with an API that would allow such kind of operations (#11213).

My primary goal is to be able to simply modify n_ctx without reloading the entire model.

ggerganov · 2025-06-20T06:00:49Z

The promised API change is now available - see the llama_memory_ API.

Changing the n_ctx however is still not supported.

You don't have to reload the model to change n_ctx. Just create a new context using the same model.

okaris · 2025-06-20T08:14:17Z

Thanks for the quick reply!
I’ve found llama_init_from_model to create a new context. I suppose it’s my responsibility to free the previous context before I initialize and set the new one, correct?

ggerganov · 2025-06-20T08:18:11Z

Yes, you have to free it with llama_free().

Note that if you want to move the existing memory data (i.e. the cache that you have in the old context) to the new context, you can use the llama_state_ API. See the llama-save-load-state example.

github-actions bot added the android, examples, python, and server labels Mar 4, 2025

ggerganov force-pushed the gg/llama-kv-cache-v2 branch 7 times, most recently from 766edbf to 62ba774 March 7, 2025 11:20

ggerganov marked this pull request as ready for review March 7, 2025 11:26

ggerganov requested a review from ngxson as a code owner March 7, 2025 11:26

slaren approved these changes Mar 11, 2025

ggerganov force-pushed the gg/llama-kv-cache-v2 branch from 62ba774 to a170669 March 11, 2025 11:53

ggerganov added 14 commits March 12, 2025 16:04

llama : refactor llama_context, llama_kv_cache, llm_build_context

5590925

ggml-ci

graph : don't mutate the KV cache during defrag

75624a2

ggml-ci

context : reduce virtuals + remove test function

5aa3518

ggml-ci

context : move interface implementation to source file + factory

0a6648c

ggml-ci

graph : move KV cache build functions to llama_context impl

cc9fa25

ggml-ci

graph : remove model reference from build_pooling

29c9ef5

ggml-ci

graph : remove llama_model reference

bc82560

ggml-ci

kv_cache : provide rope factors

ff95ffd

ggml-ci

graph : rework inputs to use only unique_ptr, remove attn input abstraction

562a478

ggml-ci

context : remove llama_context_i abstraction

d0cb319

ggml-ci

context : clean-up

a4fc4e8

ggml-ci

graph : clean-up

af9f6b8

ggml-ci

llama : remove redundant keywords (struct, enum)

226ff01

ggml-ci

model : adapt gemma3

5fc6dbd

ggml-ci

ggerganov mentioned this pull request Mar 15, 2025

context : fix init of n_outputs #12397

Merged

steampunque mentioned this pull request Mar 17, 2025

Eval bug: b4882 broke t5 #12435

Closed

ggerganov mentioned this pull request Mar 18, 2025

context : always use non-causal attention for encoder graphs #12447

Merged

ngxson mentioned this pull request Mar 18, 2025

Eval bug: Gemma3 <unused32> spam #12433

Closed

giladgd mentioned this pull request Mar 22, 2025

feat: Gemma 3 Support withcatai/node-llama-cpp#440

Closed

s-u mentioned this pull request Mar 22, 2025

Regression: e0dbec0 (aka #12181) breaks pooled embeddings: mean #12517

Closed

Yangxiaoz mentioned this pull request Mar 23, 2025

Eval bug: A Silu operand overflow occurred , causing the program to malfunction. #12523

Closed

ExtReMLapin mentioned this pull request Mar 26, 2025

Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used #10860

Open

ggerganov mentioned this pull request Mar 26, 2025

llama : make loras compatible with repacking #12593

Merged

ngxson mentioned this pull request Mar 27, 2025

llama : fix non-causal mask for gemma 3 #12615

Merged

compilade mentioned this pull request May 2, 2025

llama : initial Mamba-2 support #9126

Merged

MarcusDunn mentioned this pull request May 21, 2025

compile error for update-llama-cpp-2025-05-21 branch utilityai/llama-cpp-rs#739

Closed
