tokenkit is a toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
- 2025-04-02: The initial release of tokenkit with support for cross-tokenizer distillation via ALM and Zero-Shot Tokenizer Transfer via FVT!
LLMs are bound to the tokenizer they were pretrained with. This limits their adaptability, reusability and modularity. Tokenizer transfer can lift this limitation. For example:
- If we want to reuse an LLM trained primarily on English in another language, we might want to update its tokenizer to one that is more suitable for the new language.
- If we want to combine (e.g., token-level ensemble) two LLMs, we need to transfer them to a common tokenizer.
- If we want to experiment with better tokenization schemes (e.g., byte-level tokenization), we might want to transfer an existing LLM to this tokenizer instead of training a new one expensively from scratch.
- If we want to transfer knowledge from a large teacher model to a smaller student model which uses a different tokenizer, we might want to use cross-tokenizer distillation to transfer the teacher's knowledge to the student directly, without first transferring the teacher to the student's tokenizer.
This library aims to let you accomplish all of this.
tokenkit is primarily implemented in Jax, using PyTorch for data loading (so your PyTorch installation does not need to support an accelerator). Recommended installation:
# Clone the repository & install the library
git clone https://github.com/bminixhofer/tokenkit
cd tokenkit
# Create a new virtual environment
python -m venv tokenkit_env
. tokenkit_env/bin/activate
# Install torch & jax 0.5.0
# Jax installation instructions: https://docs.jax.dev/en/latest/installation.html#installation
# PyTorch installation instructions: https://pytorch.org/get-started/locally/
# For example:
pip install torch "jax[tpu]==0.5.0" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
# Currently, tokenkit relies on forks of `transformers` and `lm_eval`
pip install git+https://github.com/bminixhofer/transformers
pip install git+https://github.com/bminixhofer/lm-evaluation-harness
# Install the library and the remaining dependencies
pip install -r requirements.txt
pip install -e .
pip install paxml==1.4.0 praxis==1.4.0 --no-deps
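After installing, a quick sanity check (a minimal sketch using only the public jax and torch APIs, nothing tokenkit-specific) confirms that Jax can see your accelerator while PyTorch stays on CPU:

# post-install sanity check: Jax should list your accelerator (TPU/GPU/CPU),
# while a CPU-only PyTorch build is sufficient, since PyTorch is only used for data loading
import jax
import torch

print("jax devices:", jax.devices())
print("torch version:", torch.__version__)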
tokenkit supports Approximate Likelihood Matching (ALM) for cross-tokenizer distillation. ALM usually performs best, but we have also implemented the following baselines:
- Dual Space Knowledge Distillation (DSKD)
- Universal Logit Distillation (ULD)
- Minimum Edit Distance Logit Alignment (MinED)
You can run cross-tokenizer distillation using the scripts/cross_tokenizer_distill.py script. See the examples for how to transfer to different subword tokenizers and to byte-level tokenization.
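For intuition, the core idea that makes cross-tokenizer distillation possible is to compare likelihoods of text chunks whose boundaries coincide under both tokenizers, rather than per-token logits (which live in different vocabularies). The following is an illustrative sketch of that idea only, not tokenkit's ALM implementation; the toy log-probabilities, the chunking, and the L1 loss are assumptions made for illustration:

import numpy as np

# Toy per-token log-probabilities for the same text under two different
# tokenizations (teacher: 4 tokens, student: 6 tokens).
teacher_token_logprobs = np.array([-0.5, -1.2, -0.3, -0.9])
student_token_logprobs = np.array([-0.4, -0.2, -1.0, -0.1, -0.6, -0.5])

# Chunk boundaries chosen so that both tokenizations close a chunk at the
# same byte offsets (e.g., at whitespace). Each entry is how many tokens a
# chunk spans under the respective tokenizer.
teacher_chunk_lens = [1, 2, 1]
student_chunk_lens = [2, 3, 1]

def chunk_log_likelihoods(token_logprobs, chunk_lens):
    """Sum per-token log-probs into per-chunk log-likelihoods."""
    idx = np.cumsum([0] + chunk_lens)
    return np.array([token_logprobs[a:b].sum() for a, b in zip(idx[:-1], idx[1:])])

teacher_chunks = chunk_log_likelihoods(teacher_token_logprobs, teacher_chunk_lens)
student_chunks = chunk_log_likelihoods(student_token_logprobs, student_chunk_lens)

# Chunk likelihoods live in a shared space even though the vocabularies differ,
# so the student can be trained to match the teacher chunk-by-chunk.
loss = np.abs(student_chunks - teacher_chunks).mean()
print(loss)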
tokenkit supports Zero-Shot Tokenizer Transfer (ZeTT) via Fast Vocabulary Transfer (FVT). Zero-Shot Tokenizer Transfer is usually used to obtain a good initialization for additional training, but can in some cases also be useful on its own. See our ZeTT paper for more details.
You can run Zero-Shot Tokenizer Transfer using the scripts/zett.py script.
🚧 We are working on implementing more ZeTT methods (including hypernetwork training introduced here).
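For intuition, the idea behind FVT is to initialize each new token's embedding as the mean of the old model's embeddings of the old-tokenizer tokens it decomposes into. The sketch below illustrates that idea only; it is not tokenkit's implementation, and fvt_init and its arguments are hypothetical names:

import numpy as np

def fvt_init(new_vocab_tokens, old_tokenizer, old_embeddings):
    """Initialize embeddings for a new vocabulary from old-tokenizer embeddings.

    new_vocab_tokens: list of token strings in the new vocabulary
    old_tokenizer:    tokenizer of the original model (e.g., a HF tokenizer)
    old_embeddings:   [old_vocab_size, hidden_dim] embedding matrix
    """
    hidden_dim = old_embeddings.shape[1]
    new_embeddings = np.zeros((len(new_vocab_tokens), hidden_dim), dtype=old_embeddings.dtype)
    for i, token in enumerate(new_vocab_tokens):
        # Decompose the new token with the old tokenizer and average the
        # embeddings of the resulting pieces.
        old_ids = old_tokenizer.encode(token, add_special_tokens=False)
        if old_ids:
            new_embeddings[i] = old_embeddings[np.array(old_ids)].mean(axis=0)
    return new_embeddings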
tokenkit supports autoregressive generation & loglikelihood scoring evaluation by implementing a Jax backend for the LM Evaluation Harness. Alongside generating from single models, you can also generate from token-level ensembles of models. There are some predefined ensembles in configs/models. For example, this evaluates a token-level ensemble of Llama and Qwen on MMLU:
python3 scripts/eval_lockstep.py \
models=llama_qwen \
eval.tasks=[mmlu]
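Conceptually, once the member models share (or have been transferred to) a common tokenizer, a token-level ensemble can combine their next-token distributions at every decoding step. The sketch below illustrates this with a uniform average of probabilities; it is not tokenkit's ensembling code, and ensemble_next_token_probs is a hypothetical name:

import numpy as np

def ensemble_next_token_probs(per_model_logits):
    """Average next-token distributions of models sharing one vocabulary.

    per_model_logits: list of [vocab_size] logit arrays, one per model.
    """
    probs = []
    for logits in per_model_logits:
        z = logits - logits.max()          # shift for numerical stability
        p = np.exp(z) / np.exp(z).sum()    # softmax
        probs.append(p)
    return np.mean(probs, axis=0)          # uniform mixture over models

# Example: two toy "models" over a 5-token vocabulary.
print(ensemble_next_token_probs([np.array([1.0, 0.5, -2.0, 0.0, 0.3]),
                                 np.array([0.8, 0.9, -1.5, 0.2, 0.1])]))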
To evaluate pretrained byte-level models, you'll need to pass embeddings with which to expand the input ids (i.e., embeddings to use as n-gram embeddings). For example:
python3 scripts/eval.py \
+main.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-Byte\' \
+main.tokenizer_name=\'benjamin/Gemma2-2B-IT-Byte:source=Gemma2:conversion=prebyteified\' \
expand_input_ids=true \
+expand.pretrained_model_name_or_path=\'benjamin/gemma-2-2b-it-flax\' \
+expand.tokenizer_name=\'google/gemma-2-2b-it:source=Gemma2\' \
eval.tasks=[mmlu]
To evaluate any other model (e.g., subword-to-subword transferred models), use, for example, the following:
python3 scripts/eval.py \
+main.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer\' \
+main.tokenizer_name=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer:source=Gemma2:conversion=prebyteified\' \
eval.tasks=[mmlu]
To refer to this repository or to cite Approximate Likelihood Matching, please use this citation:
@article{alm,
title={Cross-Tokenizer Distillation via Approximate Likelihood Matching},
author={Minixhofer, Benjamin and Vuli{\'c}, Ivan and Ponti, Edoardo Maria},
journal={arXiv preprint arXiv:2503.20083},
year={2025}
}
Please use this citation for Zero-Shot Tokenizer Transfer:
@inproceedings{zett,
title={Zero-Shot Tokenizer Transfer},
author={Benjamin Minixhofer and Edoardo Ponti and Ivan Vuli{\'c}},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=RwBObRsIzC}
}
Constituent projects (ALM, ZeTT) were supported by a Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no. 221137; 2022–) awarded to Ivan Vulić, by the Google Cloud Research Credits program with the award GCP329647813, and by Cloud TPUs from Google’s TPU Research Cloud (TRC). The name tokenkit and the README layout were inspired by mergekit.