[Preprint] [Demo] [Models on Hugging Face]
MoshiVis is a Vision Speech Model (VSM) that builds directly on the speech-text foundation model Moshi and augments it with the ability to freely discuss an image while maintaining its natural conversation style and low latency. Concretely, MoshiVis adds lightweight cross-attention adapter modules on top of Moshi, together with an off-the-shelf frozen vision encoder (see below).
This repository currently contains inference code to run your own MoshiVis server, supporting three different backends, via a web UI frontend. We are also planning to release training/finetuning code in the future. For more information about our speech codec Mimi and speech model Moshi, please visit the original Moshi repo. For more technical details on MoshiVis, see our blog post and preprint.
Talk to MoshiVis now on our live demo!
To inject visual inputs into Moshi's stream of speech tokens, we extend the core transformer with a cross-attention mechanism that infuses visual information into the speech token stream. To maintain Moshi's low latency and reduce memory usage, the cross-attention projection weights are shared across layers. Moreover, to ensure that Moshi's original conversational abilities are not lost in the process, the cross-attention modules feature a gating mechanism that allows the model to modulate the visual input stream at will.
For more details on MoshiVis, including our training pipeline, synthetic data generation pipeline, and ablation experiments on the gating mechanism, see our preprint.
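To make this concrete, here is a minimal, purely illustrative PyTorch sketch of the idea. It is not the MoshiVis implementation: the module names, the sigmoid gate parameterization, and all dimensions are assumptions chosen for the example; see the preprint for the actual design.

```python
# Illustrative sketch only (NOT the MoshiVis code): gated cross-attention from
# the speech token stream onto image embeddings, with projection weights that
# are shared across transformer layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCrossAttentionProjections(nn.Module):
    """Q/K/V/output projections created once and reused by every adapter layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.kv = nn.Linear(dim, 2 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)


class GatedCrossAttention(nn.Module):
    """Cross-attention from speech tokens onto image embeddings, with a gate."""

    def __init__(self, dim: int, num_heads: int, proj: SharedCrossAttentionProjections):
        super().__init__()
        self.proj = proj  # shared across layers -> fewer parameters, less memory
        self.num_heads = num_heads
        # Input-dependent gate (one simple way to realize it): the model can
        # open or close the visual stream depending on the current context.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, speech: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        b, t, d = speech.shape
        h = self.num_heads
        q = self.proj.q(speech).reshape(b, t, h, d // h).transpose(1, 2)
        k, v = self.proj.kv(image).chunk(2, dim=-1)
        k = k.reshape(b, -1, h, d // h).transpose(1, 2)
        v = v.reshape(b, -1, h, d // h).transpose(1, 2)
        ctx = F.scaled_dot_product_attention(q, k, v)
        ctx = ctx.transpose(1, 2).reshape(b, t, d)
        # Residual injection modulated by the gate: a gate near 0 recovers the
        # original speech-only behaviour of the base model.
        return speech + self.gate(speech) * self.proj.out(ctx)


# Toy usage: one shared projection set, one gated adapter per transformer layer.
dim, heads, n_layers = 512, 8, 4
shared = SharedCrossAttentionProjections(dim)
adapters = nn.ModuleList([GatedCrossAttention(dim, heads, shared) for _ in range(n_layers)])
speech = torch.randn(1, 16, dim)   # (batch, time, dim) speech-token features
image = torch.randn(1, 64, dim)    # (batch, patches, dim) image embeddings
for adapter in adapters:
    speech = adapter(speech, image)
```

The point of sharing a single projection set across layers is to keep the adapter's parameter count and memory footprint small, while the per-layer gate lets the model fall back to Moshi's original speech-only behaviour whenever the image is irrelevant to the conversation.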
We release MoshikaVis, based on the original Moshika (female voice) checkpoints from Moshi's open-source release. For the image embedding part, we rely on publicly available off-the-shelf image-text encoders: the checkpoints we release use the frozen weights of a vision encoder from the PaliGemma2 family, specifically the weights provided on Hugging Face. Note that for convenience, each MoshiVis checkpoint contains the full model: i.e., the vision adaptation module weights are bundled together with the weights of Mimi (speech codec), the Helium text tokenizer, the image encoder, and the base Moshi model.
For each model, we release several variants compatible with three different backends and quantization formats. Further instructions for each backend can be found below.
| Backend | Moshika |
|---------|---------|
| PyTorch | BF16 |
| Rust | BF16, Q8_0 |
| MLX | BF16, Q4, Q8 |
All model weights (excluding the bundled vision encoder) are released under the CC-BY 4.0 license; the bundled vision encoder (PaliGemma2's vision encoder) is released under the Gemma license.
For the frontend, we recommend using the provided web UI, as it allows for additional echo cancellation that improves the overall model quality. To obtain the client, you can either (i) build it yourself from the sources in the `client` directory as described here, or (ii) download the pre-built static version we provide:
```bash
# Download prebuilt client sources
# option 1: using uv dependency manager
uv run scripts/get_static_client.py

# OR option 2: with pip
pip install fire rich huggingface_hub
python scripts/get_static_client.py
```
Most commands below will serve this UI by default using the `https` protocol (see more info here). To connect via `https`, you will need to generate SSL certificates first, as follows:
```bash
# Generate the SSL certificates in the root directory
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout key.pem -out cert.pem
```
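If you want to double-check the generated certificate, `openssl x509 -in cert.pem -noout -dates` prints its validity window; this is just a standard `openssl` sanity check and is not required by the server.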
We provide three different backends for the MoshiVis inference stack in this repo. While we hope that the present codebase will work on Windows, we do not provide official support for it.
- A PyTorch version in the `kyuteye_pt` directory.
- A Rust version (as used in the online demo) in the `kyuteye_rs` directory.
- An MLX version (tested on a MacBook Pro M3) in the `kyuteye_mlx` directory.
For the PyTorch and MLX backends, we recommend using uv to set up and run the code, as it will manage all dependencies for you transparently. `uv` is provided as a lightweight binary and can be installed as:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
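Alternatively, if you already have a Python environment available, `uv` can also be installed with `pip install uv`.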
Note: At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (~24GB).
You can start the MoshiVis PyTorch server with the following command and then access the web UI at https://localhost:8088
```bash
cd kyuteye_pt
uv run server configs/moshika-vis.yaml --port 8088
```
Note that if your GPU is on a remote machine, you may need to forward the remote 8088 port to your localhost using the `ssh -L` flag, then connect to https://localhost:8088 as mentioned previously.
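For instance, assuming the remote machine is reachable as `user@remote-host` (a placeholder), `ssh -L 8088:localhost:8088 user@remote-host` forwards the port so that the web UI becomes available at https://localhost:8088 on your local machine.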
For the Rust backend, you will need a recent version of the Rust toolchain. To compile with GPU support, you will need a valid CUDA installation, in particular with `nvcc`.
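If Rust is not installed yet, the toolchain can be obtained via rustup, e.g. `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`.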
In order to run the Rust inference server, use the following command:

```bash
cd kyuteye_rs
pip install pkg-config
cargo run --features cuda --bin moshi-backend -r -- --config configs/config-moshika-vis.json standalone --vis
```
When using macOS, you can replace `--features cuda` with `--features metal`.
Alternatively, you can use `config-moshika-vis-q8.json` rather than `config-moshika-vis.json` to use the quantized q8 model. You can also change some of the server options (e.g., the starting port) in the JSON file directly.
Once the server prints 'standalone worker listening', the model is ready. By default, the Rust server will be accessible at https://localhost:8088.
We provide an MLX model checkpoint in `bfloat16`, as well as quantized checkpoints using `q4` and `q8`.
To start the MoshiVis MLX backend you can then run the following commands:
```bash
cd kyuteye_mlx

# In bfloat16 - weights will be downloaded from HF
uv run server

# In q4
uv run server -q 4

# In q8
uv run server -q 8
```
You can then access the web UI at http://localhost:8008.
Note that unlike the other backends, not all settings available in the web UI are propagated to the MLX backend. Instead, you can configure some options directly via the command line, e.g. `--text-temperature`.
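For instance, `uv run server -q 4 --text-temperature 0.8` would start the quantized q4 model with a lower text sampling temperature; the value and flag combination here are illustrative, so check the backend's command-line help for the authoritative list of options.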
We recommend using the web UI frontend as explained here. If you want to build the sources yourself, follow these steps (further installation and build instructions can be found in the `client` directory):
Via NPM:

```bash
cd client
npm install
npm run build
```
Via Docker: if you have `docker` installed, you can also build the client via

```bash
docker buildx bake client
```
After building the sources, the static directory for the web UI can be found in `client/dist` and will be used as the default for the different backends.
Alternatively, we also provide a command line interface for the Rust backend:
```bash
cd kyuteye_rs
cargo run --bin moshi-cli -r -- tui --host localhost
```
By default, the web UI server starts with the `https` protocol rather than `http`: accessing a server that is not localhost via `http` may cause issues with using the microphone in the web UI (in some browsers, this is only allowed over `https`).
To use an `https` connection, you will first need to set up SSL certificates:
```bash
# Generate the SSL certificates in the root directory
# (requires the openssl command-line tool)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout key.pem -out cert.pem
```
Note that if you want to use an `http` connection instead, you can:
- For the PyTorch backend, add the flag `--ssl False` (see the example after this list).
- For the MLX backend, `http` is the default; `https` can be used with `--ssl certdir`, where `certdir` is the directory that contains the certificates.
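For instance, combining the flags shown above, `uv run server configs/moshika-vis.yaml --port 8088 --ssl False` (run from `kyuteye_pt`) should start the PyTorch backend over plain `http`; this combination is assembled from the individual flags documented here rather than tested verbatim.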
Note that when using `https`, you may get warnings from the browser about the site being unsafe. When using Chrome for instance, you can bypass these by selecting "Details" or "Advanced", then "Visit this unsafe site" or "Proceed to localhost (unsafe)".
The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend. The web client code is provided under the MIT license.
The model weights (excluding the vision encoder) are released under the CC-BY 4.0 license; the vision encoder is licensed under Apache 2.0.
All images displayed in the web UI are obtained under the free Unsplash license. For the precise list of image URLs and authors, please refer to this file.
We also release two data-related artifacts to accompany MoshiVis:
- In the `ssvd` directory, we include code and instructions to reproduce our synthetic visual dialogue datasets described in Section 3.3 and Appendix E of our preprint.
- For evaluation purposes, we also release Babillage on Hugging Face, which contains spoken versions of three common VLM benchmarks (COCO-Captions 2014, OCR-VQA, and VQAv2) for prompting the model's visual understanding in audio form.
If you use MoshiVis in your research, please cite our work:
```bibtex
@article{kyutai2025moshivis,
  author  = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
             Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year    = {2025},
  title   = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url     = {https://arxiv.org/abs/2503.15633}
}
```

```bibtex
@techreport{kyutai2024moshi,
  title         = {Moshi: a speech-text foundation model for real-time dialogue},
  author        = {Alexandre Défossez and Laurent Mazaré and Manu Orsini and
                   Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year          = {2024},
  eprint        = {2410.00037},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2410.00037}
}
```