voicebox? #13

lucidrains · 2023-06-19T19:46:03Z

Maintainer

can someone knowledgeable comment on this new paper using flow matching? https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/

Is it better than soundstorm?

seastar105 · 2023-06-22T05:30:26Z

it seems Voicebox is quite similar to NaturalSpeech2. both are fully non-autoregressive thanks to a duration predictor, while soundstorm needs an autoregressive Text-to-Semantic (T2S) transformer for TTS. so Voicebox has more controllability than soundstorm.
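
for anyone unfamiliar with the duration predictor idea, here is a toy sketch of why it removes the need for autoregressive alignment and why it helps controllability. this is not code from either paper; `expand_by_duration` and the example durations are made up for illustration. once each phoneme has a frame count, the frame-level sequence is just a repeat, so timing can be edited by changing the numbers:

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its number of frames (a simple length regulator)."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


# hypothetical durations, as if they came from a duration predictor
aligned = expand_by_duration(["h", "ə", "l", "oʊ"], [2, 3, 1, 4])
print(aligned)

# slowing down one phoneme is just an edit to its duration
slower = expand_by_duration(["h", "ə", "l", "oʊ"], [2, 6, 1, 4])
print(len(aligned), len(slower))
```

the acoustic model then predicts all frames in parallel given this frame-level conditioning, so there is no learned attention that could skip or repeat.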

NaturalSpeech2 predicts soundstream's latents and Voicebox predicts mel spectrograms, but Voicebox also tried Encodec's pre-quantized and post-quantized latents as prediction targets. that's why i said the two are similar.

skipping and repeating have been a classic robustness problem in autoregressive TTS models since tacotron, due to unstable attention. the T2S model may not suffer from this problem, but i'd say Voicebox is better than soundstorm, at least in terms of controllability.

12 replies

lucidrains Jul 1, 2023
Maintainer Author

actually had a couple of TTS companies reach out asking me to implement voicebox. I think soundstorm is better though; am I missing something?

lucidrains Jul 1, 2023
Maintainer Author

i've seen so much research fall by the wayside in favor of the bitter lesson

unless there is something remarkable about flow matching, should just scale up. look at audiopalm..

lucidrains Jul 1, 2023
Maintainer Author

i guess i should read the flow matching paper tomorrow, just to see if i'm missing something, or perhaps the technique is orthogonal to NAR, and can be incorporated into soundstorm

lucidrains Aug 3, 2023
Maintainer Author

ahh ok, yeah it isn't an orthogonal concept. they claim it to be more efficient than diffusion (18 vs 250 steps). under construction now at https://github.com/lucidrains/voicebox-pytorch
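
a rough toy sketch of where the step savings could come from, under heavy assumptions: 1-D data, a straight-line (optimal transport) path with the minimum noise scale set to 0, and an analytic velocity field standing in for a trained network. none of the names below come from the paper or the repo. training regresses a network onto `x1 - x0` along the straight path, and sampling integrates the learned velocity with a fixed-step ODE solver:

```python
import random


def ot_path_and_target(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus the regression target."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # what the network would be trained to predict at (x_t, t)
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


MU = 3.0  # toy assumption: every data point equals MU


def toy_velocity(x, t):
    # stand-in for a trained network: the exact marginal velocity when all data equals MU
    return (MU - x) / (1.0 - t)


random.seed(0)
noise = random.gauss(0.0, 1.0)
for steps in (1, 4, 18):
    print(steps, euler_sample(toy_velocity, noise, steps))
```

in this toy case the paths are exactly straight, so even a single Euler step lands on the data. a real learned field is only approximately straight, which is presumably why a handful of steps is still needed, but far fewer than a typical diffusion sampler.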

Jiang-Stan Oct 17, 2023

Hi, lucidrains,
It seems that Voicebox inference is much faster than other diffusion-based methods?

Which method is Voicebox compared to? NaturalSpeech2?

Are there any obvious strengths or weaknesses between NaturalSpeech2 and Voicebox? Could you give me some insight?

Thanks a lot in advance! Hope to hear from you.
