
tts : implement sesame CSM + Mimi decoder #12648

Open
ngxson wants to merge 31 commits into master

Conversation

@ngxson (Collaborator) commented Mar 29, 2025

Related to #12392

Tbh it is more complicated than expected.

This PR only contains the backbone + decoder.

How to try this?

By default, all GGUF files are downloaded from the ggml-org Hugging Face account.

# build (make sure to have LLAMA_CURL enabled)
cmake -B build -DLLAMA_CURL=ON
cmake --build build -j --target llama-tts-csm

# run it
./build/bin/llama-tts-csm -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Alternatively, GGUF files can be converted using convert_mimi_to_gguf.py and convert_csm_to_gguf.py under the examples/tts directory. These scripts use transformers.AutoModel under the hood, so they also handle downloading the safetensors files automatically.

Note: it pronounces "Xuan" incorrectly, but the rest is OK

output.mp4

How does Sesame CSM work?

The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).

  1. The input text is first processed by the backbone; the output is (1) an RVQ semantic code and (2) the raw embedding from the last layer, after the norm
  2. These 2 outputs from the backbone are then passed into the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens
  3. At this point, 32 RVQ tokens have been generated; they are then "squashed" back into one single vector, which is passed back to the backbone
  4. Repeat from step 1 to generate the next codes (see the rough sketch after the flowchart below)
flowchart TD
    A[Input Text, vocab 128_256 tokens] -- prompt input --> B

    subgraph Backbone
        B[Backbone transformer]
        B --> C[Output logits, vocab 65632 tokens]
        B --> D[Output Raw embd, vector of 2048 elem]
    end

    D -- vector input --> Proj
    C -- sampling --> Stoken[RVQ semantic token]
    Stoken --> Fin
    Stoken --> H

    subgraph Decoder
        Proj[Projector, reduce size to 1024]
        Fin[Input vocab: 65632 tokens] -- vector dim 2048 --> Proj
        Proj --> F[Decoder transformer]
        F --> G[Output logits: vocab 2051 tokens]
    end

    G -- sampling --> HH[RVQ acoustic token]
    HH -- generate next token --> Fin
    HH -- repeated 31 times --> H[Collected 32 RVQ tokens & audio embeddings, matrix: 2048 x 32]

    H -- sum all vectors --> I[single vector of 2048]
    I -- generate next token --> B
    I -- is zero vec? --> K[Stop generation]
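
A rough pseudocode sketch of this loop, for reference. The helper functions (run_backbone, run_decoder, audio_embd) are illustrative stand-ins, not the actual functions in this PR:

#include <algorithm>
#include <cstdint>
#include <vector>

// illustrative stand-ins for the real backbone / decoder passes
struct backbone_out {
    int32_t            semantic_tok; // sampled from the 65632-entry logits
    std::vector<float> raw_embd;     // 2048-dim embedding after the final norm
};
backbone_out       run_backbone(const std::vector<float> & prev_squashed);                        // backbone transformer + sampling
int32_t            run_decoder (const std::vector<float> & raw_embd, int32_t prev_tok, int i_cb); // projector + decoder transformer + sampling
std::vector<float> audio_embd  (int32_t tok, int i_cb);                                           // 2048-dim audio embedding lookup

void generate_frames() {
    std::vector<float> squashed(2048, 0.0f); // in the real code, the text prompt drives the first backbone pass

    while (true) {
        // (1) backbone: one semantic token + the raw 2048-dim embedding
        backbone_out bb = run_backbone(squashed);

        // (2) decoder: the next 31 acoustic tokens, generated auto-regressively
        std::vector<int32_t> frame = { bb.semantic_tok };
        for (int i = 1; i < 32; i++) {
            frame.push_back(run_decoder(bb.raw_embd, frame.back(), i));
        }

        // (3) "squash": sum the 32 audio embeddings into a single 2048-dim vector
        std::fill(squashed.begin(), squashed.end(), 0.0f);
        for (int i = 0; i < 32; i++) {
            const std::vector<float> e = audio_embd(frame[i], i);
            for (size_t j = 0; j < squashed.size(); j++) {
                squashed[j] += e[j];
            }
        }

        // (4) a zero vector signals the end of generation; otherwise it is fed back to the backbone
        if (std::all_of(squashed.begin(), squashed.end(), [](float v) { return v == 0.0f; })) {
            break;
        }
        // frame now holds the 32 RVQ codes for this step (later decoded to audio by Mimi)
    }
}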


@github-actions bot added the examples and python labels Mar 29, 2025
@ngxson mentioned this pull request Mar 30, 2025
@ngxson changed the title from "tts : implement sesame backbone + decoder" to "tts : implement sesame CSM + Mimi decoder" Mar 30, 2025
@ngxson marked this pull request as ready for review March 30, 2025 12:30
@arch-btw (Contributor) commented:

Really nice!

I'm having some issues with longer sentences, or is that just a limitation of the model?
For example:

-p "[0]Hi! How are you? I hope you"

Works, but:

-p "[0]Hi! How are you? I hope you are doing well"

Will go into an infinite loop of token generation.

@ngxson (Collaborator, Author) commented Mar 30, 2025

I think my implementation still has some problems, but I'm not sure where. I never get the logits to 100% match what the safetensors model generates.

Will reach out to the Sesame team to confirm whether I'm doing this correctly.

@ngxson (Collaborator, Author) commented Mar 30, 2025

It should now perform better on long text; tested with the text below:

[0]Einstein's parents were secular, middle-class Jews. His father, Hermann Einstein, was originally a featherbed salesman and later ran an electrochemical factory with moderate success. His mother, the former Pauline Koch, ran the family household. He had one sister, Maria (who went by the name Maja), born two years after Albert.

Note: long text can be entered via -f file.txt

Result (the long silence at the end was added by the wav --> mp4 conversion; the original wav file doesn't have it):

output.2.mp4


// then, decode the semantic_tok to generate acoustic tokens
llama_token tok = semantic_tok;
int n_codes = 32;
@ggerganov (Member) commented Mar 31, 2025

Based on the description in the PR, shouldn't this be:

Suggested change
int n_codes = 32;
int n_codes = 31;

Edit: ref

  • These 2 outputs from the backbone are then passed into the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens
  • At this point, 32 RVQ tokens have been generated; they are then "squashed" back into one single vector, which is passed back to the backbone

@ngxson (Collaborator, Author) commented Mar 31, 2025

Yeah, in fact it's a bit tricky here, I should document this a bit more clearly:

After the decode, we also want to get the embeddings of the 31 generated tokens (acoustic tokens), so that we can "squash" them back into a 1D vector. We could do that by:

  1. Having the audio_embd codebook in the backbone and looking it up --> possible, but a bit messy, because we would need to modify the cgraph quite a lot
  2. Having the audio_embd codebook inside the user-space code --> may require too much hacking
  3. Reusing the audio_embd from the decoder --> seems to be the simplest way

So what I ended up doing is: after the 31st acoustic token is generated, I do another decoder decode pass just to look up the audio_embd of that 31st token. The output logits of this pass are discarded (that's also why the conversion script adds a full-zero codebook page just for this pass, so that build_lora_mm_id doesn't read out of bounds). A rough sketch of this extra pass is below.

The equivalent python is here (note: their _embed_tokens can do both text token and audio token embedding lookup): https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L155-L158
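
A rough sketch of that extra pass (the helper decoder_step is hypothetical, not the PR's actual function):

#include <cstdint>
#include <vector>

// hypothetical stand-in: run one decoder step for token `tok` at codebook page
// `i_cb`; the step both looks up the token's audio_embd row and produces logits
std::vector<float> decoder_step(int32_t tok, int i_cb, std::vector<float> & logits_out);

// after sampling the 31st acoustic token, run one more decoder step purely to
// reuse its audio_embd lookup for the "squash" sum; the logits are discarded
// (the conversion script's extra all-zero codebook page keeps build_lora_mm_id
// from reading out of bounds during this throwaway pass)
std::vector<float> embd_of_last_acoustic_token(int32_t tok, int i_cb) {
    std::vector<float> unused_logits;
    return decoder_step(tok, i_cb, unused_logits);
}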

@ggerganov (Member) commented:

I tried the mimi instructions, but the generated audio is corrupted:

python examples/tts/convert_mimi_to_gguf.py

make -j && ./bin/llama-mimi kyutai-mimi.gguf dummy1
output.mp4

However, the ./bin/llama-tts-csm example works correctly:

output.mp4

Will go into an infinite loop of token generation.

You might want to try top-k sampling with a small k ~ 5. OuteTTS also often tends to go into infinite loops with greedy sampling.

@ngxson (Collaborator, Author) commented Mar 31, 2025

I tried the mimi instructions, but the generated audio is corrupted

Thanks for testing. This should be fixed in my latest commit: e31a75c. The input codes should be in a "streaming" layout, meaning 1-31, 1-31, 1-31, ...

You might want to try top-k sampling with a small k ~ 5. OuteTTS also often tends to go into infinite loops with greedy sampling.

Hmm, yeah, I'm thinking about re-using the existing sampling infrastructure, so that I can add temperature and top-k sampling. The problem is that the output vocab size is always smaller than the model's defined vocab size. One trick I have in mind is to set the unused logits to -inf in the cgraph (by doing a ggml_view then ggml_scale(cur, -inf)). WDYT?

@ggerganov (Member) commented:

The problem is that the output vocab size is always smaller than the model's defined vocab size.

Could we not fix the vocab size when creating the models?

@ngxson (Collaborator, Author) commented Mar 31, 2025

Hmm, ok, I see what you mean. I was looking at llama_sampler_sample and thought it had a fixed n_vocab taken from the llama_context.

But it turns out I can just make my own llama_sampler_sample with a different n_vocab, will give it a try!
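
Something along these lines should work, sampling over only the decoder's real output vocab with the public sampler API (a sketch, not the exact code in this PR):

#include "llama.h"

#include <vector>

// sample from the first n_vocab_out entries of a logits row, using an existing
// sampler chain (e.g. top-k + temperature + dist); n_vocab_out is the decoder's
// actual output vocab, which is smaller than the model's defined vocab size
static llama_token sample_restricted(const float * logits, int n_vocab_out, llama_sampler * smpl) {
    std::vector<llama_token_data> cur;
    cur.reserve(n_vocab_out);
    for (llama_token id = 0; id < n_vocab_out; id++) {
        cur.push_back({ id, logits[id], 0.0f });
    }

    llama_token_data_array cur_p = { cur.data(), cur.size(), /*selected =*/ -1, /*sorted =*/ false };

    llama_sampler_apply(smpl, &cur_p);
    return cur_p.data[cur_p.selected].id;
}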

@ngxson (Collaborator, Author) commented Mar 31, 2025

I added top-k 50 and temp 0.9 sampling; these values are taken from the python code. It does work better, but in some cases it still struggles with long text.
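
For reference, a minimal sketch of such a chain with the public sampler API (values as above; not necessarily the exact code in this PR):

#include "llama.h"

static llama_sampler * make_csm_sampler() {
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_top_k(50));                // top-k 50
    llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.9f));               // temperature 0.9
    llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED)); // final probabilistic pick
    return smpl;
}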

I think that's because they also train the model with audio and text tokens interleaved, but I still haven't found the python code for it. I only found this on their website:

Training samples are structured as alternating interleaved patterns of text and audio, with speaker identity encoded directly in the text representation.

@ggerganov (Member) commented:

Does the Python implementation also struggle with that? If not, then it might indicate a bug in the ggml implementation.

@ngxson (Collaborator, Author) commented Mar 31, 2025

With this text:

[0]How do we know when someone truly understands us? It is rarely just our words—it is in the subtleties of voice: the rising excitement, the thoughtful pause, the warm reassurance. Voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion. Today's digital voice assistants lack essential qualities to make them truly useful. Without unlocking the full power of voice, they cannot hope to effectively collaborate with us. A personal assistant who speaks only in a neutral tone has difficulty finding a permanent place in our daily lives after the initial novelty wears off.

The llama.cpp version finishes generation after about 800 codes, the result is:

output.3.mp4

On my local macbook, with mlx-audio, the output is cut off after ~10s

audio_000.mp4

On the HF space, the generation seems fine, though it gets cut off after 30s (I think it's limited so that the Zero GPU timeout is not reached).

audio.2.mp4

So I think both the llama.cpp and mlx-audio implementations are missing something. I haven't yet had time to try the official python code (given that it's only runnable on an nvidia GPU). If someone has the time & the GPU, could you please try?

@ngxson (Collaborator, Author) commented Apr 1, 2025

Ok, so after confirming with the Sesame team, the problem was that I misidentified the bug. I thought that the summation in the "squash" step covers only the 31 acoustic embeddings, but it is actually the sum of all 32 embeddings. The reason the sum of 32 didn't work for me earlier was that I used greedy sampling.

Now, with both the sum of 32 and top-k/temp sampling implemented, it works like magic!

(Note: the silence added at the end was due to the conversion to mp4; the original file doesn't have it)

output.4.mp4

The Sesame team also confirmed to me that the input text and audio are interleaved by turn: <text_utt1><audio_utt1><text_utt2><audio_utt2>...<text_uttN><audio_uttN>. Should be easy to implement, will do that today.

@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph can be run 32 times without synchronization. This is indeed what they do in the python implementation. The key is to have a sampling function that can run in the cgraph. Currently, the llama.cpp impl can do 300 t/s on my macbook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?

@ggerganov (Member) commented:

(Note: the silence added at the end was due to the conversion to mp4; the original file doesn't have it)

Btw, you can convert very easily with ffmpeg and it won't have silence:

ffmpeg -i output.wav output.mp4

@ggerganov (Member) commented:

@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph can be run 32 times without synchronization. This is indeed what they do in the python implementation. The key is to have a sampling function that can run in the cgraph. Currently, the llama.cpp impl can do 300 t/s on my macbook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?

Yes, GPU sampling should be supported eventually in libllama. The samplers would need to implement a call that appends nodes to an existing cgraph:

llama.cpp/include/llama.h

Lines 1180 to 1183 in 2bb3597

// TODO: API for internal libllama usage for appending the sampling to an existing ggml_cgraph
//void (*apply_ggml) (struct llama_sampler * smpl, ...);
};

@ngxson (Collaborator, Author) commented Apr 2, 2025

Ok, so I added support for multi-turn text input, but the generated audio has a silence gap between the 2 turns.

I observed kind of the same thing on the python demo, so I think it's something to do with the model.

@ngxson requested a review from ggerganov April 2, 2025 15:33
@ggerganov (Member) commented:

but the generated audio has a silence gap between the 2 turns.

I am doing some testing and I think what is confusing it is the newlines in the input. If I remove the newlines, it seems to work better:

csm-demo.txt

[0]Hey how are you doing.[1]Pretty good, pretty good.[0]I'm great, so happy to be speaking to you. What about you?[1]Me too, this is some cool stuff huh?

Maybe double-check that the tokenization is correct, compared to the HF space demo?
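
One quick way to check the llama.cpp side is to dump the token ids and compare them with the tokenizer output of the HF demo (a sketch using the standard llama_model_get_vocab / llama_tokenize API, assuming the model is already loaded):

#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

// print the token ids produced for `text` so they can be compared externally
static void dump_tokens(const llama_model * model, const std::string & text) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    std::vector<llama_token> toks(text.size() + 16); // generous upper bound
    const int n = llama_tokenize(vocab, text.c_str(), (int) text.size(),
                                 toks.data(), (int) toks.size(),
                                 /*add_special =*/ true, /*parse_special =*/ true);
    for (int i = 0; i < n; i++) {
        printf("%d ", toks[i]);
    }
    printf("\n");
}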
