cannot find tokenizer merges in model file #120

Open
flatsiedatsie opened this issue Sep 27, 2024 · 18 comments
Labels
llama.cpp related Issues related to llama.cpp upstream source code, mostly unrelated to wllama

Comments

@flatsiedatsie (Contributor)

Noticed this error loading the Llama 1B and 3B models.

[Screenshot: 2024-09-27 22:57:33]

I'm updating Wllama now, hopefully that fixes it.

@felladrin (Contributor)

I agree it may be because the outdated version was built against an older llama.cpp: Llama 1B is available to try on https://github.ngxson.com/wllama/examples/main/dist/ and it works fine there. I've also tried the 3B model, and it's all good.
Let us know if the update solved the issue!

@flatsiedatsie (Contributor, Author)

Odd, it didn't solve it.

I tried re-downloading the model itself, but that didn't help.

Then I tried Firefox for comparison, and actually noticed the same error.

[Screenshot: 2024-09-28 00:58:45]

I'm attempting a non-chunked version of the model next.

https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked
->
https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf

@flatsiedatsie (Contributor, Author)

Bingo: the non-chunked version works.

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I re-chunked the 1B using the very latest version of llama.cpp.

Now it loads, but only outputs a single word before giving this error:

[Screenshot: 2024-09-28 09:51:14]

(Edit: looking back, this error may have just been my code trying to unload Wllama after inference completed, and failing.)

@felladrin (Contributor)

Could you try these splits and confirm whether they work? (They're the ones I'm using without issues on Wllama v1.16.2.)

First we need to find out whether the problem is in:

  1. The split files
  2. The wllama lib
  3. Your code around wllama (a config or sampling setting could be causing the problem, so you could first try setting the sampling to { temp: 0 }; see the sketch after this list)
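To illustrate point 3, here's a minimal sketch of what I mean by minimal sampling. The option names (nPredict, sampling) follow wllama's createCompletion API as I know it, so double-check them against the version you're on:

import { Wllama } from "@wllama/wllama";

// Construct and load as usual (wasm paths and model URL omitted here).
const wllama = new Wllama(/*...*/);
await wllama.loadModelFromUrl(/*...*/);

// temp: 0 makes decoding effectively greedy, ruling out the sampling
// config as the source of the bad output.
const output = await wllama.createCompletion("Tell me a joke.", {
  nPredict: 64,
  sampling: { temp: 0 },
});
console.log(output);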

@flatsiedatsie (Contributor, Author)

Here is the chunked model that only outputs one word by the way: https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

Strange. Now I don't get any output.

[Screenshot: 2024-09-28 17:49:40] [Screenshot: 2024-09-28 17:48:22]

I don't normally see it doing that looping action. Could it be loading each chunk as if it were the complete model?

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

Setting the sampling to minimal worked!

I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.

I keep 'allow_offline' enabled all the time. Is that a bad idea?

@felladrin (Contributor)

I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.

I keep 'allow_offline' enabled all the time. Is that a bad idea?

Not at all! I leave it always enabled too! :D

Setting the sampling to minimal worked!

Interesting! Did it work with https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf?

If so, can we conclude it's something specific to the config passed to Wllama? (If that's the case, have you found the specific config combination that caused the issue?)

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I used to have these settings enabled all the time too, but I've removed them now:

//model_settings['n_seq_max'] = 1;
//model_settings['n_batch'] = 1024; //2048

Re-enabling them as a test had no (negative) effect.

I still have this enabled:

model_settings['embeddings'] = false;

Should I remove that?
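For context, this is roughly how those settings get passed in on my side. It's a sketch; whether loadModelFromUrl accepts these exact llama.cpp-style option names may differ between wllama versions:

import { Wllama } from "@wllama/wllama";

const wllama = new Wllama(/*...*/);

// The settings mentioned above, using llama.cpp-style names.
const model_settings = {
  // n_seq_max: 1,    // currently commented out
  // n_batch: 1024,   // currently commented out (was 2048)
  embeddings: false,  // still enabled
};

await wllama.loadModelFromUrl(
  "https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf",
  model_settings,
);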

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I got the vague notion that .gguf files have template information within them?

Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.

Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?

[
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
    {
        "role": "assistant",
        "content": "There are 2 R's in the word \"strawberry\"."
    }
]

Aha! There is a function to get the Jinja template from the GGUF, and Wllama uses @huggingface/jinja as a dependency to apply that template.

@flatsiedatsie (Contributor, Author)

[Screenshot: 2024-09-29 09:52:22]

@felladrin (Contributor)

[Screenshot: 2024-09-29 09:52:22]

Does this error also happen when configuring Wllama with n_threads: 1? (forcing it to single-thread)
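Something like this (a sketch; I'm assuming n_threads is accepted in the model load config, as it is in llama.cpp, so verify the option name for your wllama version):

import { Wllama } from "@wllama/wllama";

const wllama = new Wllama(/*...*/);

// Force single-threaded inference to rule out multi-threading issues.
await wllama.loadModelFromUrl(/*...*/, {
  n_threads: 1,
});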

@felladrin (Contributor)

I got the vague notion that .gguf files have template information within them?

Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.

Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?

[
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
    {
        "role": "assistant",
        "content": "There are 2 R's in the word \"strawberry\"."
    }
]

There is, by using the @huggingface/jinja package (the same one Transformers.js uses).

Here's the same logic used in https://github.ngxson.com/wllama/examples/main/dist/:

import { Template } from "@huggingface/jinja";
import { Wllama } from "@wllama/wllama";

// Minimal message shape used by the template below.
type Message = { role: string; content: string };

const wllama = new Wllama(/*...*/);

await wllama.loadModelFromUrl(/*...*/);

export const formatChat = async (wllama: Wllama, messages: Message[]) => {
  const defaultChatTemplate =
    "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}";

  const template = new Template(
    wllama.getChatTemplate() ?? defaultChatTemplate,
  );

  return template.render({
    messages,
    bos_token: await wllama.detokenize([wllama.getBOS()]),
    eos_token: await wllama.detokenize([wllama.getEOS()]),
    add_generation_prompt: true,
  });
};

const messages = [
    {
        "role": "user",
        "content": "Hi!"
    },
    {
        "role": "assistant",
        "content": "Hello! How may I help you today?"
    },
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
]

const prompt = await formatChat(wllama, messages);
// <|im_start|>user
// Hi!<|im_end|>
// <|im_start|>assistant
// Hello! How may I help you today?<|im_end|>
// <|im_start|>user
// How many R's are there in the word strawberry?<|im_end|>
// <|im_start|>assistant

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 29, 2024

Oh wow, diving into your info I realized there is even an abstraction layer above Transformers.js.

(Edit: wait, no, that's just for using the API.)

@flatsiedatsie (Contributor, Author)

I've implemented your templating approach, thank you! Much simpler than creating an entire Transformers.js instance.

@danielhanchen

danielhanchen commented Sep 30, 2024

This might be related to unslothai/unsloth#1065 and unslothai/unsloth#1062. Temporary fixes are provided there for Unsloth finetuners, and I can confirm with the Hugging Face team at ggml-org/llama.cpp#9692 that it's the tokenizers causing the issues.

@ngxson added the "llama.cpp related" label on Sep 30, 2024.
@ngxson (Owner)

ngxson commented Sep 30, 2024

This problem is reported on the upstream repo: ggml-org/llama.cpp#9692
