kyutai-labs/moshivis

MđŸ‘ïžshiVis: Teaching Speech Models to Converse about Images

[Preprint] [Demo] [Models on Hugging Face]

MoshiVis is a Vision Speech Model (VSM) that builds directly on the speech-text foundation model Moshi, augmenting it with the ability to freely discuss an image while maintaining its natural conversation style and low latency. In total, MoshiVis adds $\sim$ 206M adapter parameters on top of the 7B Moshi model and a pretrained, frozen 400M PaliGemma2 vision encoder.

This repository currently contains the inference code to run your own MoshiVis server, with support for three different backends behind a web UI frontend. We also plan to release training/finetuning code in the future. For more information about our speech codec Mimi and speech model Moshi, please visit the original Moshi repo. For more technical details on MoshiVis, see our blog post and preprint.

Talk to MoshiVis now on our live demo!

Schema representing the structure of MoshiVis.

To inject visual inputs into Moshi's stream of speech tokens, we extend the core transformer with a cross-attention mechanism that attends to the image features. To maintain Moshi's low latency and reduce memory usage, the cross-attention projection weights are shared across layers. Moreover, to ensure that Moshi's original conversational abilities are not lost in the process, the cross-attention modules feature a gating mechanism that allows the model to modulate the visual input stream at will.
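
For intuition, here is a minimal PyTorch sketch of this idea. It is not the MoshiVis implementation: the dimensions, class names, and exact gating form (a per-token sigmoid gate) are illustrative assumptions; only the two core ingredients, shared cross-attention projections and a per-layer gate, are taken from the description above.

import torch
import torch.nn as nn


class SharedCrossAttention(nn.Module):
    """Cross-attention whose projection weights are shared by all transformer layers."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Queries come from the speech-token stream; keys/values from the image features.
        out, _ = self.attn(speech, image, image)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Per-layer block: a gate modulates how much visual signal is added back."""

    def __init__(self, dim: int, shared_xattn: SharedCrossAttention):
        super().__init__()
        self.shared_xattn = shared_xattn      # the *same* instance in every layer
        self.gate_proj = nn.Linear(dim, dim)  # input-dependent gate (illustrative choice)

    def forward(self, speech: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        visual = self.shared_xattn(speech, image)
        gate = torch.sigmoid(self.gate_proj(speech))  # in [0, 1], per token and channel
        return speech + gate * visual                 # residual update, modulated at will


# Toy usage: one shared cross-attention module reused by every layer.
dim, num_heads, num_layers = 512, 8, 4
shared = SharedCrossAttention(dim, num_heads)
layers = nn.ModuleList([GatedCrossAttentionBlock(dim, shared) for _ in range(num_layers)])

speech = torch.randn(1, 16, dim)  # (batch, speech tokens, dim)
image = torch.randn(1, 64, dim)   # (batch, image tokens, dim)
for layer in layers:
    speech = layer(speech, image)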

For more details on MoshiVis, including our training pipeline, synthetic data generation pipeline, and ablation experiments on the gating mechanism, see our preprint.

Model Release

We release MoshikaVis, based on the original Moshika (female voice) checkpoints from Moshi's open-source release. For the image embedding part, we rely on a publicly available, off-the-shelf image-text encoder: the released checkpoints use the frozen weights of a vision encoder from the PaliGemma2 family, specifically the weights provided on Hugging Face. Note that for convenience, each MoshiVis checkpoint contains the full model: i.e., the vision adaptation module weights are bundled together with the weights of Mimi (the speech codec), the Helium text tokenizer, the image encoder, and the base Moshi model.

For each model, we release several variants compatible with three different backends and quantization formats. Further instructions for each backend can be found below.

Backend   Moshika
PyTorch   BF16
Rust      BF16, Q8_0
MLX       BF16

All model weights (excluding the bundled vision encoder) are released under the CC-BY 4.0 license; the bundled vision encoder (PaliGemma2's vision encoder) is released under the Gemma license.

Organisation of the Repository

For the frontend, we recommend using the provided web UI as it allows for additional echo cancellation that helps the overall model quality. To obtain the client, you can either (i) build it yourself from the sources in client as described here or (ii) download the pre-built static version we provide:

# Download prebuilt client sources
# option 1: using uv dependency manager
uv run scripts/get_static_client.py

# OR option 2: with pip
pip install fire rich huggingface_hub
python scripts/get_static_client.py

Most commands below will serve this UI by default using the https protocol (see more info here). To connect via https, you will need to generate SSL certificates first, as follows:

# Generate the SSL certificates in the root directory
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout key.pem -out cert.pem

We provide three different backends for the MoshiVis inference stack in this repo. While we hope that the present codebase will work on Windows, we do not provide official support for it.

For the PyTorch and MLX backends, we recommend using uv to set up and run the code, as it will manage all dependencies for you transparently.

uv is provided as a lightweight binary and can be installed as:

curl -LsSf https://astral.sh/uv/install.sh | sh

PyTorch Backend

Note: At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory ($\sim$ 24GB).

You can start the MoshiVis PyTorch server with the following command and then access the web UI at https://localhost:8088

cd kyuteye_pt
uv run server configs/moshika-vis.yaml --port 8088

Note that if your GPU is on a remote machine, you may need to forward the remote port 8088 to your localhost using ssh's -L flag (e.g., ssh -L 8088:localhost:8088 user@remote-host), then connect to https://localhost:8088 as mentioned previously.

Rust Backend

For the Rust backend, you will need a recent version of the Rust toolchain. To compile with GPU support, you will also need a valid CUDA installation that includes nvcc.

In order to run the Rust inference server, use the following command:

cd kyuteye_rs
pip install pkg-config
cargo run --features cuda --bin moshi-backend -r -- --config configs/config-moshika-vis.json standalone --vis

When using macOS, you can replace --features cuda with --features metal.

Alternatively you can use config-moshika-vis-q8.json rather than config-moshika-vis.json to use the quantized q8 model. You can also change some of the server options (e.g., starting port) in the json file directly.

Once the server prints 'standalone worker listening', the model is ready. By default, the Rust server will be accessible at https://localhost:8088.

MLX Backend

We provide an MLX model checkpoint in bfloat16, as well as quantized checkpoints in q4 and q8 formats.

To start the MoshiVis MLX backend, run one of the following commands:

cd kyuteye_mlx
# In bfloat16 - weights will be downloaded from HF
uv run server

# In q4
uv run server -q 4

# In q8
uv run server -q 8

You can then access the web UI at http://localhost:8008.

Note that, unlike for the other backends, not all settings available in the web UI are propagated to the MLX backend. Instead, you can configure some options directly via the command line, e.g., --text-temperature.

Frontends

WebUI

We recommend using the WebUI frontend as explained here. If you want to build the sources yourself, follow these steps (further installation and build instructions can be found in the client directory):

via NPM.

cd client
npm install
npm run build

via Docker. If you have Docker installed, you can also build the client with:

docker buildx bake client

After building the sources, the static files for the web UI can be found in the client/dist directory, which the different backends use by default.

Rust Command Line

Alternatively, we also provide a command line interface for the Rust backend:

cd kyuteye_rs;
cargo run --bin moshi-cli -r -- tui --host localhost

Troubleshooting

http vs https

By default, the web UI server starts with the https protocol rather than http: accessing a server that is not localhost via http may cause issues with the microphone in the web UI (some browsers only allow microphone access over https).

To use an https connection, you will first need to set up SSL certificates:

# Generate the SSL certificates in the root directory
# requires the openssl command-line tool (usually preinstalled; otherwise install it via your system package manager)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout key.pem -out cert.pem

Note that if you want to use an http connection instead, you can:

  • For the PyTorch backend, add the flag --ssl False
  • For the MLX backend, http is the default and https can be used with --ssl certdir where certdir is the directory that contains the certificates.

Note that when using https, you may get warnings from the browser about the site being unsafe. In Chrome, for instance, you can bypass these by selecting "Details" or "Advanced", then "Visit this unsafe site" or "Proceed to localhost (unsafe)".

License

The present code is provided under the MIT license for the Python parts and under the Apache license for the Rust backend. The web client code is provided under the MIT license.

The model weights (excluding the vision encoder) are released under the CC-BY 4.0 license; the vision encoder is released under the Gemma license, as noted above.

All images displayed in the web UI are obtained under the free Unsplash license. For the precise list of image URLs and authors, please refer to this file.

Datasets

We also release two data-related artifacts to accompany MoshiVis:

  ‱ In the ssvd directory, we include code and instructions to reproduce our synthetic visual dialogue datasets described in Section 3.3 and Appendix E of our preprint.
  ‱ For evaluation purposes, we also release Babillage on Hugging Face, which contains spoken versions of three common VLM benchmarks (COCO-Captions 2014, OCR-VQA, and VQAv2) for evaluating the model's visual understanding from audio prompts; see the loading sketch below.
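
As a hedged illustration, a Babillage split could presumably be loaded with the Hugging Face datasets library; the repository id, configuration name, and split below are assumptions made for illustration, so check the Babillage page on Hugging Face for the actual names.

from datasets import load_dataset

# Hypothetical repository id / configuration / split -- verify against the
# Babillage page on Hugging Face before running.
babillage = load_dataset("kyutai/Babillage", "vqav2", split="test")
print(babillage[0])  # inspect one example's fields (e.g., audio prompt, question, answers)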

Citation

If you use MoshiVis in your research, please cite our work:

@article{kyutai2025moshivis,
  author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
  Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year = {2025},
  title = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url = {https://arxiv.org/abs/2503.15633}
}

@techreport{kyutai2024moshi,
  title = {Moshi: a speech-text foundation model for real-time dialogue},
  author = {Alexandre Défossez and Laurent Mazaré and Manu Orsini and
  Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year = {2024},
  eprint = {2410.00037},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS},
  url = {https://arxiv.org/abs/2410.00037},
}