Replies: 2 comments
-
This is a cool idea! If it's not too much effort, I would say go for it. One thing that was nice about training Voicebox was that the semantic tokens had a direct length correlation with the audio, so you could take random crops of any sample to maximize your ability to fill batches with data. One way to make that possible here would be to preprocess the dataset with Montreal Forced Aligner (or something similar) to get the text <-> audio alignment up front, and use it (only during training) to crop longer samples down to your max duration or less while still keeping the 1:1 correspondence of text to audio. I should also mention that tuning the learning rate / warmup a bit for your dataset is probably worth a few experiments if you can afford them; the hyperparameters specified in the paper aren't really optimal if your dataset is two orders of magnitude smaller.
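To illustrate the alignment-based cropping idea, here's a minimal sketch assuming you've already extracted word-level timestamps (word, start, end) from the aligner's output; the helper name and data layout are hypothetical, not from any particular codebase:

```python
import random

def crop_with_alignment(words, audio, sr, max_secs):
    """Crop a sample to at most `max_secs` using a word-level alignment.

    `words` is a list of (word, start_sec, end_sec) tuples, e.g. parsed from
    a Montreal Forced Aligner TextGrid. Returns (cropped_text, cropped_audio)
    so the text still corresponds 1:1 to the audio.
    """
    total = words[-1][2]
    if total <= max_secs:
        return " ".join(w for w, _, _ in words), audio

    # pick a random word to start the crop from
    start_idx = random.randrange(len(words))
    start_t = words[start_idx][1]

    # extend the crop word-by-word while it still fits the duration budget
    end_idx = start_idx
    while end_idx + 1 < len(words) and words[end_idx + 1][2] - start_t <= max_secs:
        end_idx += 1

    end_t = words[end_idx][2]
    text = " ".join(w for w, _, _ in words[start_idx : end_idx + 1])
    return text, audio[int(start_t * sr) : int(end_t * sr)]
```

Cropping only on word boundaries keeps the transcript valid; a real version would also want to skip crops that start mid-silence or fall below a minimum duration.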
-
hey folks, I've adapted the flow matching + E2 TTS setup to solve a slightly different problem: speech restoration from heavily degraded samples. I'm quite impressed with the results; the network is only 10% trained, in a completely self-supervised manner, and the loss is still dropping. You can hear examples at https://sparkling-rabanadas-3082be.netlify.app/ — a few things I've observed there might be helpful for text-to-speech too.
I am planning to try to train a multilingual model on 300k hours of speech.
-
Hi folks - the original model requires a ton of compute to train, so I wanted to open a discussion topic for @lucasnewman to share tips.
I have limited cloud credits left to try to train this, so I want to maximize the results I get. I'm also trying to train a larger model on more data:
the paper says 200k hours of data gave the same performance as 50k hours of data with 2x the compute.
On my end, one thing I've noticed is that batches end up containing a lot of padding zeros when the dataset has variable sample lengths. I was thinking of binning the data by duration and building a dynamic batch loader to improve this, but I'm wondering if there are other tricks.
I see that Lucas is just using shorter-duration samples only, which probably helps get decent utilization without spending much time implementing anything.
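For what it's worth, the duration-binning idea can be sketched very simply: sort indices by precomputed duration, chunk them into batches, then shuffle the batch order so training still sees a mixed stream. The function and argument names here are hypothetical, just to show the shape of it:

```python
import random

def bucket_batches(durations, batch_size):
    """Group dataset indices into batches of similar duration to cut padding.

    `durations` maps each dataset index to its length in seconds (assumed
    precomputed once during preprocessing). Sorting then chunking keeps each
    batch's lengths close together, so padding to the batch max wastes far
    fewer frames on zeros than random batching would.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = [order[i : i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)  # shuffle across batches, keep within-batch grouping
    return batches
```

A fancier version would pack by a token/frame budget per batch instead of a fixed sample count, so short-sample batches hold more items; the sorted-then-chunked scheme above is the minimal case of that.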