cannot find tokenizer merges in model file #120

Open
flatsiedatsie opened this issue Sep 27, 2024 · 18 comments
Labels
llama.cpp related Issues related to llama.cpp upstream source code, mostly unrelated to wllama

Comments

@flatsiedatsie (Contributor)

Noticed this error loading the Llama 1B and 3B models.

[Screenshot: 2024-09-27 22:57:33]

I'm updating Wllama now, hopefully that fixes it.

@felladrin (Contributor)

I agree it may be because the outdated version was built against an older llama.cpp: Llama 1B is available to try on https://github.ngxson.com/wllama/examples/main/dist/ and it works fine there. I've also tried the 3B model, and it's all good.
Let us know if the update solved the issue!

@flatsiedatsie (Contributor, Author)

Odd, it didn't solve it.

I tried re-downloading the model itself, but that didn't help.

Then I tried Firefox for comparison, and actually noticed the same error.

[Screenshot: 2024-09-28 00:58:45]

I'm attempting a non-chunked version of the model next.

https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked
->
https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf

@flatsiedatsie (Contributor, Author)

Bingo: the non-chunked version works.

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I re-chunked the 1B using the very latest version of llama.cpp.

Now it loads, but only outputs a single word before giving this error:

[Screenshot: 2024-09-28 09:51:14]

(Edit: looking back, this error may have just been my code trying to unload Wllama after inference completed, and failing.)

@felladrin (Contributor)

Could you try these splits and confirm whether they work? (They're the ones I'm using without issues on Wllama v1.16.2.)

First we need to find out whether the problem is in:

  1. The split files
  2. The wllama lib
  3. Your code around wllama (a config or sampling setting could be causing the problem, so you could first try setting the sampling to { temp: 0 }; see the sketch after this list)
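To illustrate point 3, here's a minimal sketch of what I mean by minimal sampling. The option names (nPredict, sampling) follow wllama's createCompletion API as I know it, so double-check them against the version you're on:

import { Wllama } from "@wllama/wllama";

// Construct and load as usual (wasm paths and model URL omitted here).
const wllama = new Wllama(/*...*/);
await wllama.loadModelFromUrl(/*...*/);

// temp: 0 makes decoding effectively greedy, ruling out the sampling
// config as the source of the bad output.
const output = await wllama.createCompletion("Tell me a joke.", {
  nPredict: 64,
  sampling: { temp: 0 },
});
console.log(output);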

@flatsiedatsie (Contributor, Author)

Here is the chunked model that only outputs one word by the way: https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

Strange. Now I don't get any output.

[Screenshot: 2024-09-28 17:49:40] [Screenshot: 2024-09-28 17:48:22]

I don't normally see it doing that looping action. Could it be loading each chunk as if it were the complete model?

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

Setting the sampling to minimal worked!

I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.

I keep 'allow_offline' enabled all the time. Is that a bad idea?

@felladrin (Contributor)

I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.

I keep 'allow_offline' enabled all the time. Is that a bad idea?

Not at all! I leave it always enabled too! :D

Setting the sampling to minimal worked!

Interesting! Did it work with https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf?

If so, can we conclude it's something specific to the config passed to Wllama? (If that's the case, have you found the specific config combination that caused the issue?)

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I used to have these settings enabled all the time too, but I've removed them now:

//model_settings['n_seq_max'] = 1;
//model_settings['n_batch'] = 1024; //2048

Re-enabling them as a test had no (negative) effect.

I still have this enabled:

model_settings['embeddings'] = false;

Should I remove that?
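For context, this is roughly how those settings get passed in on my side. It's a sketch; whether loadModelFromUrl accepts these exact llama.cpp-style option names may differ between wllama versions:

import { Wllama } from "@wllama/wllama";

const wllama = new Wllama(/*...*/);

// The settings mentioned above, using llama.cpp-style names.
const model_settings = {
  // n_seq_max: 1,    // currently commented out
  // n_batch: 1024,   // currently commented out (was 2048)
  embeddings: false,  // still enabled
};

await wllama.loadModelFromUrl(
  "https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf",
  model_settings,
);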

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 28, 2024

I got the vague notion that .gguf files have template information within them?

Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.

Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?

[
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
    {
        "role": "assistant",
        "content": "There are 2 R's in the word \"strawberry\"."
    }
]

Aha! There is a function to get the Jinja template from the GGUF, and Wllama uses @huggingface/jinja as a dependency to apply that template.

@flatsiedatsie (Contributor, Author)

[Screenshot: 2024-09-29 09:52:22]

@felladrin (Contributor)

[Screenshot: 2024-09-29 09:52:22]

Does this error also happen when configuring Wllama with n_threads: 1? (forcing it to single-thread)
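Something like this (a sketch; I'm assuming n_threads is accepted in the model load config, as it is in llama.cpp, so verify the option name for your wllama version):

import { Wllama } from "@wllama/wllama";

const wllama = new Wllama(/*...*/);

// Force single-threaded inference to rule out multi-threading issues.
await wllama.loadModelFromUrl(/*...*/, {
  n_threads: 1,
});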

@felladrin (Contributor)

I got the vague notion that .gguf files have template information within them?

Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.

Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?

[
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
    {
        "role": "assistant",
        "content": "There are 2 R's in the word \"strawberry\"."
    }
]

There is, by using the @huggingface/jinja package (the same one Transformers.js uses).

Here's the same logic used in https://github.ngxson.com/wllama/examples/main/dist/:

import { Template } from "@huggingface/jinja";
import { Wllama } from "@wllama/wllama";

// Minimal message shape used by the template below.
type Message = { role: string; content: string };

const wllama = new Wllama(/*...*/);

await wllama.loadModelFromUrl(/*...*/);

export const formatChat = async (wllama: Wllama, messages: Message[]) => {
  const defaultChatTemplate =
    "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}";

  const template = new Template(
    wllama.getChatTemplate() ?? defaultChatTemplate,
  );

  return template.render({
    messages,
    bos_token: await wllama.detokenize([wllama.getBOS()]),
    eos_token: await wllama.detokenize([wllama.getEOS()]),
    add_generation_prompt: true,
  });
};

const messages = [
    {
        "role": "user",
        "content": "Hi!"
    },
    {
        "role": "assistant",
        "content": "Hello! How may I help you today?"
    },
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
]

const prompt = await formatChat(wllama, messages);
// <|im_start|>user
// Hi!<|im_end|>
// <|im_start|>assistant
// Hello! How may I help you today?<|im_end|>
// <|im_start|>user
// How many R's are there in the word strawberry?<|im_end|>
// <|im_start|>assistant

@flatsiedatsie (Contributor, Author)

flatsiedatsie commented Sep 29, 2024

Oh wow, diving into your info I realized there is even an abstraction layer above Transformers.js.

(Edit: wait, no, that's just for using the API.)

@flatsiedatsie (Contributor, Author)

I've implemented your templating approach, thank you! Much simpler than creating an entire Transformers.js instance.

@danielhanchen

danielhanchen commented Sep 30, 2024

This might be related to unslothai/unsloth#1065 and unslothai/unsloth#1062. Temporary fixes are provided there for Unsloth finetuners, and I can confirm with the Hugging Face team at ggml-org/llama.cpp#9692 that it's the tokenizers causing the issues.

@ngxson added the "llama.cpp related" label on Sep 30, 2024.
@ngxson (Owner)

ngxson commented Sep 30, 2024

This problem is reported on the upstream repo: ggml-org/llama.cpp#9692
