voicebox? #13
-
It seems Voicebox is quite similar to NaturalSpeech2. Both are fully non-autoregressive thanks to a duration predictor, while SoundStorm needs an autoregressive Text-to-Semantic (T2S) transformer for TTS, so Voicebox has more controllability than SoundStorm. NaturalSpeech2 predicts SoundStream's latent and Voicebox predicts a mel spectrogram, but Voicebox also tried EnCodec's pre-quantized and post-quantized latents as prediction targets. That's why I said the two are similar. Skipping and repeating words has been a classic robustness problem in autoregressive TTS models since Tacotron, due to unstable attention. The T2S model may not suffer from that problem, but I'd still say Voicebox is better than SoundStorm, at least in terms of controllability.
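To make the controllability point concrete, here is a minimal sketch of generic duration-based length regulation (FastSpeech-style), not code from Voicebox or NaturalSpeech2. The function name `length_regulate`, the `rate` parameter, and the phoneme durations are all made up for illustration. The idea is that once per-phoneme durations are explicit, every output frame is tied to exactly one phoneme, so nothing can be skipped or repeated, and speaking rate can be changed by scaling the durations.

```python
from typing import List, Sequence


def length_regulate(
    phonemes: Sequence[str],
    durations: Sequence[int],
    rate: float = 1.0,
) -> List[str]:
    """Expand phonemes to a frame-level sequence using explicit durations.

    `rate` is a hypothetical speed knob: rate > 1 speaks faster (fewer frames),
    rate < 1 speaks slower (more frames).
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if rate <= 0:
        raise ValueError("rate must be positive")

    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        # Keep at least one frame per phoneme so none is ever dropped.
        n_frames = max(1, round(duration / rate))
        frames.extend([phoneme] * n_frames)
    return frames


# Illustrative values only: in a real system a duration predictor supplies these.
phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 2, 6]

aligned = length_regulate(phonemes, durations)           # normal speed
slower = length_regulate(phonemes, durations, rate=0.5)  # half speed
```

A non-autoregressive acoustic model would then predict all frames in parallel, conditioned on this frame-level sequence. An autoregressive T2S model has no such explicit alignment to edit.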
-
Can someone knowledgeable comment on this new paper using flow matching? https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
Is it better than SoundStorm?