Step By Step Tutorial? #85

MethanJess · 2023-11-26T02:18:24Z

I understand that this may be a rookie request, but it would be greatly appreciated if a user-friendly guide/tutorial could be made. I have been struggling for hours trying to understand how to use this project.

I would just like to clone a voice using a 4-hour audio file and use it as text-to-speech, locally on Windows, but I don't even know where to begin.

The README is very vague and hard to understand. It doesn't even teach you how to install eSpeak-ng on Windows (which is supposed to be a requirement).

Answered by IIEleven11


IIEleven11 · 2023-11-27T13:31:36Z

I have a guide I can share. I hesitate, though, as it's a complex repo, which in turn requires a complex guide. There are a lot of very nuanced things that can go wrong. While I do have a very successful and impressive result, I'm almost sure the steps I am taking could be improved or corrected in some way. I'd be open to any corrections, of course. Give me a bit and I'll post it.

4 replies

MethanJess Nov 27, 2023
Author

Hello @IIEleven11, I'd like to see the guide, even if it's still being worked on.
I'm not sure what Kreevoz means by saying I want "a system-wide TTS integration under Windows". I would just like to output the generated audio in any way possible (like just downloading the audio as an MP3).

If it doesn't work on Windows, I could try a Linux virtual machine using QEMU or something.

Kreevoz Nov 27, 2023

Ah OK, never mind, I misunderstood your original intention, Jess.

IIEleven11 Nov 27, 2023

Hmm, I meant for fine-tuning. Are you just trying to run something like zero-shot? There is a Hugging Face UI in one of these posts that is very simple. Also, I totally forgot I was going to share the guide. I'll tidy it up and post it now.

As for Windows... My experience with the Coqui repo and Windows has been nothing but a nightmare and a massive waste of time. It almost feels like they intentionally broke or obfuscated it at this point. I use WSL2 though, which I highly suggest you try.

IIEleven11 Nov 27, 2023

Oh, another thing: I am taking an RVC v2 Mangio Crepe model I made with the same dataset and 1000 epochs, and mapping it on top of the model I made with StyleTTS2. I found the issue with RVC was being forced to use models trained on a completely different speaker. Yeah, this makes it more complicated, but "indistinguishable from a human" is the goal. You can ignore that part of the guide, or I can omit it, I guess.

persuc · 2023-11-27T16:03:49Z

If anyone prefers to run the model locally, I made a hacky CLI to do that. https://github.com/persuck/StyleTTS2

0 replies

fakerybakery · 2023-11-27T16:53:47Z

Try the Hugging Face demo or the local GUI.

For the offline local web GUI, check out the GPL fork (GPL because of phonemizer): https://github.com/NeuralVox/StyleTTS2

0 replies

IIEleven11 · 2023-11-27T18:07:19Z

OK, this took longer than I thought, but I tested it and it works. I suggest you follow everything exactly. Also keep in mind that I have no degree and taught myself to code, so this is kind of scary, but here you go. I tried to automate as much as I could for you. Let me know how it goes!

https://github.com/IIEleven11/StyleTTS2FineTune

2 replies

martinambrus Dec 6, 2023

@IIEleven11 would you have a tutorial on building a model from scratch? I'm currently trying to train a custom EN-US model from ~1000 WAV files and I would be grateful for pointers, namely some numbers such as a good validation loss for the 1st / 2nd epoch and such.

I tried this with a small sample of 222 WAV files and the model somewhat started to pick up the speech, but after 200 + 100 epochs the speech was still barely recognizable.

I'm fairly new to TTS training, so I'd appreciate any input :)

IIEleven11 Dec 6, 2023

I haven't trained from scratch with StyleTTS2, so I wouldn't feel comfortable answering that. But I can tell you that 222 WAV files is probably too small a dataset for training.

A very, very big lesson I have learned is that the dataset is extremely important. Way more important than you think. At each point I always thought my dataset was good enough. I would train, hear poor output, and go look at my dataset, only to find issues with the segmentation, quality, or artifacts/background noise. It's a very tedious task, but if you're having issues, that'd be a great place to start.

Instead of training with that small a dataset, maybe attempt to fine-tune with it. That way you could at least get a feel for how the model is reacting to your audio and which parts of it are easily trained on and which aren't.

teamblubee · 2023-12-14T16:21:57Z

To train from scratch, you can do this:

Ensure your WAVs are 24 kHz.
Then get transcriptions and align the audio to the text.
Segment the long audio files into shorter segments. BEWARE: they need to be AT LEAST 1 second long and probably shorter than 10 seconds.
Phonemize this data so each line has the form wav file | phonemes | 0. The last field is the speaker ID; default it to 0 if you are following this tutorial.
Break that data into 3 sets: 1 for training, 1 for validation, 1 for OOD (out-of-distribution) testing.
Once you have those files created, fill them in under the data params of the config file. You will need at least 2 configs: one for pretraining, where the mel spectrogram will be created, and another for training.
Run train_first.py with config_file pointing to the first configuration and let it run; you'll get some .pth files.
Do the same with train_second.py, setting first_stage_path to the last .pth file you received from the first stage.
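
The clip checks and the three-way split from the steps above could be sketched roughly like this. It is only an illustration, not part of the StyleTTS2 repo: the helper names (clip_ok, split_entries, write_list), the 5% split fractions, and the output file name are all made up for the example, and it assumes the phonemes for each clip were already produced by a separate phonemizer step.

```python
import random
import wave
from pathlib import Path

TARGET_RATE = 24_000          # sample rate the steps above call for
MIN_SEC, MAX_SEC = 1.0, 10.0  # duration bounds from the steps above


def clip_ok(path):
    """Return True if the WAV is 24 kHz and between 1 and 10 seconds long."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    return rate == TARGET_RATE and MIN_SEC <= seconds <= MAX_SEC


def split_entries(entries, val_frac=0.05, ood_frac=0.05, seed=0):
    """Shuffle (wav_name, phonemes) pairs and split into train/val/OOD.

    The fractions are arbitrary example values, not recommendations.
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    n_ood = max(1, int(len(shuffled) * ood_frac))
    val = shuffled[:n_val]
    ood = shuffled[n_val:n_val + n_ood]
    train = shuffled[n_val + n_ood:]
    return train, val, ood


def write_list(entries, out_path, speaker_id=0):
    """Write one 'wav|phonemes|speaker' line per entry."""
    lines = [f"{name}|{phonemes}|{speaker_id}" for name, phonemes in entries]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

You would run clip_ok over every segment first, drop the ones that fail, then pass the surviving (wav name, phonemes) pairs to split_entries and write each set with write_list, e.g. write_list(train, "train_list.txt"). The resulting paths are what go into the data params of the config.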

For some, the biggest issue will be GPU memory. For most projects it will take at least 16 GB of memory, but that will not be enough for the second-phase training, which I've seen take up to 80 GB of memory.

0 replies
