Replies: 2 comments
-
This is a cool idea! If it's not too much effort, I would say go for it. One thing that was nice about training Voicebox was that the semantic tokens had a direct length correlation with the audio, so you could take random crops of any sample to maximize your ability to fill batches with data. One way to make that possible here would be to preprocess the dataset with Montreal Forced Aligner (or something similar) to get the text <-> audio alignment up front, and use it (only during training) to crop longer samples down to your max duration or less while still keeping the 1:1 correspondence of text to audio. I should also mention that tuning the learning rate / warmup a bit for your dataset is probably worth a few experiments if you can afford them; the hyperparameters specified in the paper aren't really optimal if your dataset is two orders of magnitude smaller.
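To illustrate the alignment-based cropping idea, here's a minimal sketch assuming you've already extracted word-level timestamps (word, start, end) from the aligner's output; the helper name and data layout are hypothetical, not from any particular codebase:

```python
import random

def crop_with_alignment(words, audio, sr, max_secs):
    """Crop a sample to at most `max_secs` using a word-level alignment.

    `words` is a list of (word, start_sec, end_sec) tuples, e.g. parsed from
    a Montreal Forced Aligner TextGrid. Returns (cropped_text, cropped_audio)
    so the text still corresponds 1:1 to the audio.
    """
    total = words[-1][2]
    if total <= max_secs:
        return " ".join(w for w, _, _ in words), audio

    # pick a random word to start the crop from
    start_idx = random.randrange(len(words))
    start_t = words[start_idx][1]

    # extend the crop word-by-word while it still fits the duration budget
    end_idx = start_idx
    while end_idx + 1 < len(words) and words[end_idx + 1][2] - start_t <= max_secs:
        end_idx += 1

    end_t = words[end_idx][2]
    text = " ".join(w for w, _, _ in words[start_idx : end_idx + 1])
    return text, audio[int(start_t * sr) : int(end_t * sr)]
```

Cropping only on word boundaries keeps the transcript valid; a real version would also want to skip crops that start mid-silence or fall below a minimum duration.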
-
hey folks, I've adapted the flow matching + E2 TTS setup to solve a slightly different problem: speech restoration from heavily degraded samples. I'm quite impressed with the results; the network is only 10% trained, in a completely self-supervised manner, and the loss is still dropping. You can hear examples at https://sparkling-rabanadas-3082be.netlify.app/ — a few things I've observed there might be helpful for text-to-speech too.
I am planning to try to train a multilingual model on 300k hours of speech.
-
Hi folks - the original model requires a ton of compute to train, so I wanted to open a discussion topic for @lucasnewman to share tips.
I have limited cloud credits left to try to train this, so I want to maximize the results I get. I'm also trying to train a larger model on more data:
the paper says 200k hours of data gave the same performance as 50k hours of data with 2x the compute.
On my end, one thing I've noticed is that batches end up containing a lot of padding zeros when the dataset has variable sample lengths. I was thinking of binning the data by duration and building a dynamic batch loader to improve this, but I'm wondering if there are other tricks.
I see that Lucas is just using shorter-duration samples only, which probably helps get decent utilization without spending much time implementing anything.
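For what it's worth, the duration-binning idea can be sketched very simply: sort indices by precomputed duration, chunk them into batches, then shuffle the batch order so training still sees a mixed stream. The function and argument names here are hypothetical, just to show the shape of it:

```python
import random

def bucket_batches(durations, batch_size):
    """Group dataset indices into batches of similar duration to cut padding.

    `durations` maps each dataset index to its length in seconds (assumed
    precomputed once during preprocessing). Sorting then chunking keeps each
    batch's lengths close together, so padding to the batch max wastes far
    fewer frames on zeros than random batching would.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = [order[i : i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)  # shuffle across batches, keep within-batch grouping
    return batches
```

A fancier version would pack by a token/frame budget per batch instead of a fixed sample count, so short-sample batches hold more items; the sorted-then-chunked scheme above is the minimal case of that.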