tts : implement sesame CSM + Mimi decoder #12648
Conversation
Really nice! I'm having some issues with longer sentences, or is that just the model's limitations?
Works, but:
It will go into an infinite loop of token generation.
I think my implementation still has some problems, but I'm not sure where. I never get the logits to 100% match what the safetensors model generates. I will reach out to the Sesame team to confirm whether I'm doing this correctly.
It should now perform better on long text; tested with the text below.
Note: long text can be entered via
Result (the long silence at the end is due to the wav --> mp4 conversion; the original wav file doesn't have it): output.2.mp4
examples/tts/tts-csm.cpp
// then, decode the semantic_tok to generate acoustic tokens
llama_token tok = semantic_tok;
int n_codes = 32;
Based on the description in the PR, shouldn't this be:
int n_codes = 32; --> int n_codes = 31;
Edit: ref
- These 2 outputs from the backbone are then passed into the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens
- At this point, 32 RVQ tokens have been generated; they then get "squashed" back into one single vector, then passed back to the backbone
Yeah, in fact it's a bit tricky here; I should document this a bit more clearly:
After the decode, we also want to get the embeddings of the 31 generated tokens (acoustic tokens), so that we can "squash" them back into a 1D vector. We could do that by:
- Having the `audio_embd` codebook in the backbone and looking it up --> possible, but a bit messy, because we would need to modify the cgraph quite a lot
- Having the `audio_embd` codebook inside the user-space code --> may require too much hacking
- Reusing the `audio_embd` from the decoder --> seems to be the simplest way
So what I ended up doing is: after the 31st acoustic token is generated, I do another `decoder.decode` pass just to look up the `audio_embd` of that 31st token. The output logits of this pass are discarded (that's also why, in the conversion script, we add a full-zero codebook page just for this pass, so that `build_lora_mm_id` doesn't read out of bounds).
The equivalent Python is here (note: their `_embed_tokens` can do both text-token and audio-token embedding lookup): https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L155-L158
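For illustration, a minimal sketch of that extra lookup pass, assuming a decoder context `ctx_dc` created with embeddings output enabled and an `acoustic_toks` array holding the 31 sampled acoustic tokens; the actual tts-csm.cpp code may read the embedding back differently:

```cpp
// sketch only: run one extra decoder pass whose sole purpose is to fetch the
// embedding of the 31st acoustic token; its logits are intentionally discarded
llama_token last_tok = acoustic_toks[30];                 // the 31st acoustic token
llama_batch batch    = llama_batch_get_one(&last_tok, 1);
if (llama_decode(ctx_dc, batch) != 0) {
    fprintf(stderr, "extra embedding-lookup pass failed\n");
    return 1;
}
// read back the embedding produced by this pass (n_embd floats);
// the full-zero codebook page added by the conversion script exists only so
// that build_lora_mm_id does not read out of bounds here
const float * embd_31 = llama_get_embeddings(ctx_dc);
```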
I tried the mimi instructions, but the generated audio is corrupted:
python examples/tts/convert_mimi_to_gguf.py
make -j && ./bin/llama-mimi kyutai-mimi.gguf dummy1 output.mp4
However, the output.mp4
You might want to try
Thanks for testing. This should be fixed in my last commit: e31a75c. The input codes should be in "streaming" layout, meaning 1-31, 1-31, 1-31, ...
Hmm, yeah, I'm thinking about re-using the existing sampling infrastructure, so that I can add temperature and top-k sampling. The problem is that the output vocab size is always smaller than the model's defined vocab size. One trick that I have in mind is to set the unused logits to -inf in the cgraph (by doing
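As a rough illustration of the same idea done in user code (not necessarily how the PR implements it; `n_vocab_out` and `n_vocab_model` are made-up names), the tail of the logits that has no real output token can be forced to -inf right after decoding, so the standard samplers never pick it:

```cpp
#include <cmath> // for INFINITY

// sketch: mask out logits beyond the real output vocab so the regular
// llama.cpp samplers (top-k, temperature, ...) only consider valid tokens
float * logits = llama_get_logits_ith(ctx, -1);           // logits of the last token
for (int v = n_vocab_out; v < n_vocab_model; ++v) {
    logits[v] = -INFINITY;
}
```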
Could we not fix the vocab size when creating the models?
Hmm ok, I see what you mean, I was looking at
But it turns out I can just make my own
I added top-k 50 and temperature 0.9 sampling; these values are taken from the Python code. It does work better, but in some cases it still struggles with long text. I think it's because they also train the model to have audio and text tokens interleaved, but I still haven't found the Python code for that. I only found this on their website:
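For reference, wiring those values through the standard llama.cpp sampler chain could look roughly like this (a sketch; the PR may structure the sampling differently):

```cpp
// sketch: top-k 50 + temperature 0.9, followed by a random draw
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(50));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.9f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

llama_token tok = llama_sampler_sample(smpl, ctx, -1);    // sample from the last logits
// ...
llama_sampler_free(smpl);
```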
Does the Python implementation also struggle with that? If not, then it might indicate a bug in the ggml implementation.
With this text:
The llama.cpp version finishes generation after about 800 codes; the result is: output.3.mp4
On my local macbook, with audio_000.mp4
On the HF space, the generation seems fine, though it gets cut off after 30s (I think it's limited so that the Zero GPU timeout is not reached): audio.2.mp4
So I think both llama.cpp and
Ok, so after confirming with the Sesame team, the problem was that I had misidentified the bug. I thought that the summation in the "squash" step is over the 31 acoustic embeddings, but it is actually the sum of all 32 embeddings. The reason why the sum of 32 didn't work for me earlier was that I used greedy sampling. Now, with both the sum of 32 and top-k/temperature sampling implemented, it works like magic!
(Note: the silence added at the end is due to the conversion to mp4; the original file doesn't have that) output.4.mp4
The Sesame team also confirmed to me that the input text and audio will be interleaved by turn.
@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph can be run 32 times without synchronization. This is indeed what they did in the Python implementation. The key is to have a sampling function that can run in the cgraph. Currently, the llama.cpp implementation can do 300 t/s on my macbook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?
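To make the corrected "squash" step concrete, a minimal sketch (assuming `embd[k]` holds the n_embd-dimensional embedding of codebook k's token for the current frame):

```cpp
#include <vector>

// sketch: the frame vector is the sum of all 32 codebook embeddings
// (1 semantic + 31 acoustic), not just the 31 acoustic ones
std::vector<float> squashed(n_embd, 0.0f);
for (int k = 0; k < 32; ++k) {
    for (int j = 0; j < n_embd; ++j) {
        squashed[j] += embd[k][j];
    }
}
// "squashed" is then fed back into the backbone as its next input embedding
```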
Btw, you can convert very easily with `ffmpeg -i output.wav output.mp4`
Yes, GPU sampling should be supported eventually; see lines 1180 to 1183 in 2bb3597.
Ok, so I added support for multi-turn text input, but the generated audio has a silence gap between the 2 turns. I observed kind of the same thing on the Python demo, so I think it's something to do with the model.
I am doing some testing, and I think what is confusing it is the newlines in the input. If I remove the newlines, it seems to work better: csm-demo.txt
Maybe double-check that the tokenization is correct, compared to the HF space demo?
Related to #12392
Tbh it is more complicated than expected.
This PR only contains the backbone + decoder:
How to try this?
By default, all GGUF files are downloaded from the ggml-org Hugging Face account.
Alternatively, GGUF files can be converted using `convert_mimi_to_gguf.py` and `convert_csm_to_gguf.py` under the `examples/tts` directory. These scripts use `transformers.AutoModel` under the hood, so they will also handle downloading the safetensors files automatically.
Note: it pronounces "Xuan" incorrectly, but the rest is OK
output.mp4
How does Sesame CSM work?
The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).
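Pieced together from the discussion above, one generation step looks roughly like this (a sketch with placeholder helpers, not the actual tts-csm.cpp API; the backbone's two outputs are simplified to just the semantic token here):

```cpp
#include "llama.h"
#include <vector>

// placeholder declarations, for illustration only
llama_token              backbone_step (const std::vector<float> & prev_frame_embd);  // -> semantic token (codebook 0)
std::vector<llama_token> decoder_steps (llama_token semantic_tok, int n_codes);       // -> 31 acoustic tokens
std::vector<float>       sum_embeddings(const std::vector<llama_token> & frame);      // sum of all 32 codebook embeddings

void generate_frame(std::vector<float> & frame_embd, std::vector<llama_token> & frame) {
    const llama_token semantic_tok = backbone_step(frame_embd); // 1. backbone: autoregressive llama-3-style step
    frame = decoder_steps(semantic_tok, 31);                    // 2. decoder: next 31 RVQ acoustic tokens
    frame.insert(frame.begin(), semantic_tok);                  // 3. full frame = 32 RVQ tokens
    frame_embd = sum_embeddings(frame);                         // 4. "squash" (sum of 32 embeddings) and feed back into the backbone
}
```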