llama_decode + llama_get_logits_ith() JNI Android integration, GGML_ABORT("fatal error"); #14245
-
My question is: is there a function for populating the `output_ids` array based on `ubatch.logits`?

I'm working with the latest llama.cpp and using the `llama_decode()` + `llama_get_logits_ith()` workflow in a JNI integration, and I'm getting a fatal error from `float * llama_context::get_logits_ith(int32_t i)`. I call `llama_decode()` with a batch containing 1 token, at position `n_cur`, with `logits[0] = true`. Then I call `llama_get_logits_ith(n_cur - 1)`. The first decode works (prompt prefill), and the first generated token works too. On the second iteration, calling `llama_get_logits_ith(n_cur - 1)` aborts because `output_ids[n_cur - 1] == -1`, which the assertion takes to mean that `batch.logits` wasn't set during decode. However, I did set `logits[0] = true` in every decode batch. Why is this still failing?

This is the function:

```cpp
extern "C"
}
```

Gemini told me this: Can you help me reason through what causes `output_ids[i] == -1` even when `logits = true`?
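For reference, here is a minimal sketch of the kind of single-token decode step described above. It is not the JNI wrapper from the post (whose body isn't reproduced here), just an illustration that assumes the batch is filled by hand via `llama_batch_init()`; the comments mark where the index mismatch occurs.

```cpp
// Minimal sketch, not the poster's JNI wrapper: one-token-per-step decode
// using the plain llama.cpp C API (llama.h). Error handling omitted.
#include "llama.h"

static float * decode_one_token(llama_context * ctx, llama_token tok, llama_pos n_cur) {
    llama_batch batch = llama_batch_init(/*n_tokens*/ 1, /*embd*/ 0, /*n_seq_max*/ 1);

    batch.n_tokens     = 1;
    batch.token[0]     = tok;    // the newly sampled token
    batch.pos[0]       = n_cur;  // its absolute position in the sequence
    batch.n_seq_id[0]  = 1;
    batch.seq_id[0][0] = 0;
    batch.logits[0]    = true;   // request logits for batch index 0

    llama_decode(ctx, batch);

    // llama_get_logits_ith() expects an index into *this batch*, not the
    // sequence position. Only batch index 0 requested logits here, so
    // llama_get_logits_ith(ctx, n_cur - 1) finds output_ids[n_cur - 1] == -1
    // and hits GGML_ABORT("fatal error").
    float * logits = llama_get_logits_ith(ctx, 0); // or -1 for "last output"

    llama_batch_free(batch);
    return logits; // points into the context's buffer, still valid after batch_free
}
```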
-
Try to replace this line:

```cpp
float* logits = llama_get_logits_ith(wrapper->ctx, n_cur - 1);
```

with

```cpp
float* logits = llama_get_logits_ith(wrapper->ctx, -1);
```
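To spell out why this works (my reading of `llama_context::get_logits_ith`, offered as an explanation of the suggestion rather than an authoritative statement): the index passed to `llama_get_logits_ith()` is a position within the most recently decoded batch, and `output_ids[]` maps those batch positions to rows of the internal logits buffer. It is filled by `llama_decode()` from `batch.logits`; there is no public function for populating it yourself. During prefill the batch holds the whole prompt, so `n_cur - 1` happens to coincide with the last batch index and works; in the later one-token batches only index 0 requested logits, so `output_ids[n_cur - 1]` is `-1` and the abort fires. Passing `-1` always selects the last output row, whatever the batch size. The same convention applies if the loop samples through the `llama_sampler` API, as in this small sketch (the sampler argument is an assumption, not taken from the post):

```cpp
// Assumed setup, not from the original post: sampling after a one-token decode.
#include "llama.h"

static llama_token sample_after_decode(llama_context * ctx, llama_sampler * smpl) {
    // idx follows the same rule as llama_get_logits_ith(): it is a batch
    // index, and -1 means "the last token that requested logits".
    return llama_sampler_sample(smpl, ctx, /*idx*/ -1);
}
```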