Post-processing steps of tokenizer (string replacements) are not included in the GGUF model #1093

Jeronymous · 2025-02-02T13:39:13Z

It seems that the string replacements in the tokenizer's post-processing are not included in the GGUF model.
As a result, some LLMs with elaborate tokenizers can produce slightly odd output text with tools like ollama that use GGUF models.

I noticed it with Lucie Instruct: https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct#test-with-ollama

The tokenizer includes several post-processing steps that are discarded:
https://huggingface.co/OpenLLM-France/Lucie-7B/raw/main/tokenizer.json

"decoder": {
    "type": "Sequence",
    "decoders": [
      {
        "type": "ByteFallback"
      },
      {
        "type": "Metaspace",
        "replacement": "▁",
        "add_prefix_space": true,
        "prepend_scheme": "always"
      },
      {
        "type": "Fuse"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\n "
        },
        "content": "\n"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\t "
        },
        "content": "\t"
      },
...

Those replacements are supposed to remove the extra space introduced during pre-processing. That space is added to get "uniform" subword tokens, i.e. the same token represents a word whether it comes after a space or after something starting a new sentence (start of string, apostrophe, quotation mark, ...).
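To make the missing step concrete, here is a minimal Python sketch of what applying the two "Replace" decoders shown above would look like on already-decoded text. It is not how llama.cpp or the Hugging Face tokenizers library implements decoding; the function name `apply_string_replacements` and the sample string are hypothetical, and it only handles the `"String"` pattern form. It assumes the replacements run on the full fused string, which matches this config since "Fuse" comes before the "Replace" steps.

```python
import json

# Decoder fragment as quoted in this issue (only the steps shown before the "...").
DECODER_CONFIG = json.loads(r"""
{
  "type": "Sequence",
  "decoders": [
    {"type": "ByteFallback"},
    {"type": "Metaspace", "replacement": "\u2581",
     "add_prefix_space": true, "prepend_scheme": "always"},
    {"type": "Fuse"},
    {"type": "Replace", "pattern": {"String": "\n "}, "content": "\n"},
    {"type": "Replace", "pattern": {"String": "\t "}, "content": "\t"}
  ]
}
""")


def apply_string_replacements(text, decoder_config):
    """Apply, in order, every literal-string "Replace" step of the config.

    Hypothetical helper: other decoder types (ByteFallback, Metaspace, Fuse)
    are assumed to have been handled already and are skipped here.
    """
    for step in decoder_config.get("decoders", []):
        if step.get("type") != "Replace":
            continue
        pattern = step.get("pattern", {})
        if "String" in pattern:  # literal match only, no regex handling
            text = text.replace(pattern["String"], step["content"])
    return text


# Text as it might come out when the Replace steps are dropped:
raw = "Hello\n world\t !"
print(repr(apply_string_replacements(raw, DECODER_CONFIG)))
# prints 'Hello\nworld\t!'
```

The extra space after the newline and the tab is exactly what shows up in the output when these steps are discarded.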

@ggerganov I would be happy to contribute to this repo to solve this bug :)

ggerganov · 2025-02-03T09:00:09Z

Patches are welcome - better to open in the llama.cpp repository for this kind of change.

Jeronymous · 2025-02-03T16:17:28Z

Thank you for your answer @ggerganov
Do you have any hints on where I should look?
Is there already a class for text post-processing building blocks, or would it be something new?
