inference code example #8
I'll put together some code this weekend. |
Okay, so I took a shot at hacking together some inference code. I trained a model for 400k steps on the MushanW/GLOBE dataset; when I test it I get a cacophony which is starting to resemble the TTS prompts, but the intermediate mel spectrogram is of very poor quality, so something might be wrong with my approach.
|
I'm also looking forward to it |
I haven't figured out what's up with the text conditioning yet, but here's a rough sample (it doesn't use the duration predictor) of the generation flow in a notebook. I left in some debugging outputs so you can see the flow resolving visually. The voice cloning aspect seems to work fine with different speakers, fwiw, they just say nonsense at the moment 😅 (This is from a quick ~100M param model I trained with ~1/100th the FLOPs used in the paper.) |
@lucasnewman does it always output the reference audio, regardless of what you use as input for the reference text? I also left out the duration predictor; I wound up simply doubling the input duration and doubling the reference text, since if it can't do the doubling, it sure won't work for anything else 🙃 I couldn't get it to generate anything aside from the input reference audio. I was told by the author to "train it more", but I put considerable resources into it and it never improved. |
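For anyone following along, here is a minimal sketch of that "double the duration, double the text" setup. The variable names, the tensor layout, and the `duration` keyword are assumptions on my part; only the general `e2tts.sample(...)` call pattern comes from this thread.

```python
# Placeholders: a trained model `e2tts`, a reference mel `ref_mel` of shape
# (batch, frames, mel_bins), and its transcript string `ref_text`.
ref_frames = ref_mel.shape[1]

sample = e2tts.sample(
    ref_mel.to('cuda'),
    text = [ref_text + ' ' + ref_text],  # doubled reference text
    duration = ref_frames * 2,           # doubled total length, passed as an int
)
```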
Yep, this is exactly what I'm doing, and more or less what you see in the notebook — I just hard-coded the duration to keep it simple.
Make sure you picked up the duration fix from a few days ago if you're explicitly passing it as an int, because otherwise it will stop generation after the conditioning. You don't need to retrain your model, as it only affects sampling. |
The output sample seems to be gibberish after what I assume to be the prompt(?) Thank you for your work though! generated.mp4 |
@lucasnewman The code I was using was based on a modified version of Voicebox, though I did try training an early version of this repo, but at the time it was giving NaNs. Just to be clear: if you put in any other text, do you still get the exact reference audio? The model I trained always ignored the text embeddings; I'm just wondering if you have the same issue. It looks like it just learns to pass through the input. Another thing you can try is to pass in a masked region and see if it can do the training objective in inference mode. Can it do the infill? (When I tested this with my model, it was just gibberish, but the unmasked regions were basically the original audio.) |
I haven't tried, but that would correlate to what I was referencing with the text conditioning.
Yeah, this is effectively the same task with a different mask region, so I would expect similar results for now since the text conditioning doesn't seem to be working right. I don't actually have a ton of extra time to spend on debugging it, but you're welcome to run some experiments! The latest version of the code is almost exactly what I trained. |
I'm trying to train this model with another repo (I've slightly changed the Voicebox repo) with around 300 hours of data. Only at around 400,000 iterations did it start to output something sounding like speech (but the speech was gibberish). I also get random noise a lot, as if the model were unable to fill in the blanks. The voice the model uses also does not resemble the target voice for now. I'm thinking about increasing gradient accumulation to match the paper's batch size, in case the model just doesn't see enough data per iteration. |
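As a reference for the gradient accumulation idea, here is a minimal sketch in plain PyTorch; `model`, `optimizer`, and `dataloader` are placeholders, and the loss call just mirrors the (mel, text) interface used elsewhere in this thread.

```python
accum_steps = 8  # effective batch size = accum_steps * per-step batch size

optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = model(batch['mel'], text = batch['text'])
    (loss / accum_steps).backward()  # scale so the accumulated gradient matches one large batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```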
I think the model is hard to train when trying to directly learn the alignment from text to mel spectrogram. Has anyone gotten a reasonable result? I also get speech with the same timbre, but the speech does not match the text input, so I think the model doesn't learn the alignment properly. |
Is there any pre-trained checkpoint available for this model? Thanks in advance. |
I was eventually able to get inference to work by changing this line: sample = e2tts.sample(mel[:,:25].to("cuda"), text = text). But I only get noise as output. Has anyone else been able to get inference to work? |
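A minimal sketch of that call in context; only the `e2tts.sample(mel[:, :25], text = ...)` line comes from the comment above, and the mel layout and variable names are assumptions.

```python
import torch

# Placeholders: `e2tts` is a trained model, `mel` a reference mel spectrogram
# of shape (batch, frames, mel_bins), moved to the same device as the model.
prompt = mel[:, :25].to('cuda')                 # first 25 frames as the audio prompt
text = ['the sentence you want the model to speak']

with torch.no_grad():
    sample = e2tts.sample(prompt, text = text)  # the call from the comment above
```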
shared checkpoint here |
@eschmidbauer Were you successful in getting it to generate speech from the text input? |
@Coice no, but maybe my inference script needs work. Maybe someone else is able to generate speech and share the code |
I'll add what I know in case someone's interested. I'm trying to train E2 TTS with another repo on a small dataset. I had to rewrite some code because I used the Voicebox repo. I tried to train it for a couple of weeks, but the network only generated noise. I decided to print gradients for all parameters and found out that the attention gradients were always zero. After some digging I found out I had accidentally set my attention dropout to 1. When I fixed it, I got something resembling speech instead of noise. The model still can't pronounce many sounds properly, but at least I can see now that it learns. If your model outputs only noise even at 400,000 iterations (400,000 is just an example; theoretically at this stage it should be able to generate something), I would recommend double-checking the gradients: maybe there's some mistake and the gradients are None, or they might be zero, or you might have vanishing gradients. |
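A minimal sketch of that gradient check, assuming a standard PyTorch training loop where `model` is the TTS module; run it right after `loss.backward()` and before `optimizer.step()`.

```python
for name, param in model.named_parameters():
    if param.grad is None:
        print(f'{name}: grad is None')  # parameter never received a gradient
    else:
        print(f'{name}: grad norm = {param.grad.norm().item():.3e}')  # exactly 0.0 or tiny values are suspicious
```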
@cyber-phys FWIW that lines up with my napkin math and your sample sounds similar to my experiments. I tried a trick where I used a scale factor that ramps from (0, 1] for the random times selection for a few thousand steps, forcing the model to learn stronger conditioning from close-to-noise time steps, which seemed to help a little bit with pronunciation in a low data regime (you could recognize words with ~10k hours of audio training), but nothing close to the quality of Voicebox, which obviously has a big alignment advantage. It seems like you need a bunch of training over 50k+ hours of audio to make a dent on this one, which is kind of cool because it's possible to just brute force the alignment, but also probably out of reach for most academic/unfunded settings, unfortunately. |
I have a training job running that has seen around 2,000,000 samples of speech out of 13M total. I am training on multilingual datasets, so most likely it will take a while before it can produce coherent speech. But it does "speak" a combination of languages now, with no apparent alignment to the text prompt. You can find the latest checkpoint here https://drive.google.com/drive/folders/11m6ftmJbxua7-pVkQCA6qbfLMlsfC_Ls?usp=drive_link
|
I would like to share a sample based on another modified repo: it really needs a lot of resources (I only have 4x4090 GPUs) and nearly two weeks of training, and the result seems to need more training still. I only used two datasets: GigaSpeech and LibriTTS. yy-2.mp4 text: you are very handsome. ref_yy-2.mp4 |
Could you show the code for the scale factor trick? Or link to it if it's included in this repo. |
It was just an experiment to see if the text conditioning was working at all — I'm not sure it's a great idea in general. My intuition was that the joint training objective is particularly difficult for alignment because the "fingerprint" of the flow is pretty well established in the first ~2-3% of the ODE steps and at that point the model will primarily use the flow from the previous timestep for prediction. If we force the model to predict from near-noise earlier, we can bias the training objective towards the text conditioning at the start. (Also I forgot to mention that I used phonemes instead of the raw byte encoding to make it a little easier on the model because I'm using a smaller dataset.) You can reproduce it with something like:
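A rough sketch of what that could look like on the model side; the helper name and the exact place it hooks into the forward pass are assumptions.

```python
import torch

def sample_flow_times(batch: int, time_scale: float, device = None) -> torch.Tensor:
    # Uniform times in [0, 1), compressed toward 0 by time_scale in (0, 1].
    # time_scale = 1.0 recovers the usual uniform time sampling.
    return torch.rand((batch,), device = device) * time_scale
```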
And then in your trainer class define num_time_scale_steps and do:
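A sketch of the trainer-side ramp; `step`, `num_time_scale_steps`, and the floor value are illustrative names and choices, not necessarily the exact ones used above.

```python
def current_time_scale(step: int, num_time_scale_steps: int, floor: float = 1e-2) -> float:
    # Ramps linearly from `floor` toward 1.0 over num_time_scale_steps, then stays at 1.0.
    return min(1.0, max(floor, step / num_time_scale_steps))
```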
|
Good idea. How do you do inference when applying time_scale? Will it use fewer NFE steps? |
Has anyone gotten a good result already (both text alignment and voice similarity/quality)? 4xA100 for a week, 30k hours, and it still produces speech inconsistent with the text. |
@acul3 I'm working with a similar setup, just using GLOBE. Here’s what I have so far—could you share your intermediate results as well?
e2tts_680k.mp4 |
@changjinhan how long did you train it? I am training multilingual (Indonesian and Malay); the output is acceptable, but it seems to have a hard time following the text. Here is my config:
Can you share yours? Appreciated. |
@acul3 I trained it for 6 days and my config is as follows:
|
Gotcha, I had refactored some stuff and mistakenly entered 'unet' since the paper mentioned 'unet style skip connections' lol... that could definitely be it.
I think I changed too many things at once, so I'm going to try to just replicate your initial quick-training results with a smaller model and the phonemizer, and change one thing at a time from there. I had some issues using the phonemizer in the repo, actually; I will post in issues. |
dataset used
|
Hello, can you share the pretrained model? |
@manmay-nakhashi you are welcome to join us for dinner, but you'll have to fly here 😄 hit me up if you are ever in SF though! |
I think I will also just fix the transformer to always use concat unet skips, unless there are any objections |
@lucidrains ha ha, I am most likely coming there next month. |
@manmay-nakhashi that's when we are planning on dinner! when are you arriving? send me an email and we can coordinate |
@ILG2021 I'll share one pretrained model tomorrow. It's not fully trained, but it can at least save some alignment time. |
Thank you. |
Has anyone noticed this repository: https://github.com/feizc/FluxMusic? They also implicitly use a doublestream structure. This kind of multimodal fusion architecture might truly be the direction for future development. |
Looking forward to you sharing the model, @manmay-nakhashi. |
@lucidrains can we keep the interpolation method in? |
added it back! |
@skirdey are you near SF? do you want to join us for dinner mid next month? you were the one who alerted me to this paper! |
@manmay-nakhashi thanks for your checkpoint, but the current lucidrains main branch doesn't work with it. Could you share the correct commit/branch/link to the actual code (especially the inference script)? |
@Oktai15 I trained on 0.9.1, but I'll look over the weekend at what's different between the current version and 0.9.1 that makes it not work. |
@manmay-nakhashi can you share the inference script? |
you can always just install the specific version |
Will the trained duration predictor model be saved? And at inference time, is the duration model used? |
@manmay-nakhashi would you mind sharing what code you used for inference on 0.9.1? thanks! |
|
Thank you @manmay-nakhashi for the code. Unfortunately, I still get errors with the tokenizer on ==0.9.1 when trying to encode text. |
@BlazJurisic Take the new G2P changes from the current master |
Hi @manmay-nakhashi,
|
I may have some misunderstanding about the DurationPredictor; can anyone give me some help? |
Does that mean 25 frames of the target speaker are enough to synthesize their timbre? |
Trained on more than 100,000 hours of speech, over 500,000 steps, about half an epoch, but still cannot get a good result. |
There are all kinds of overlapping sounds in the sampled results; could there be something wrong with the sampling process? |
Since E2 TTS takes a mel spectrogram as input and its job is to fill in the masked part (this is why there is a crop in the input mel), I believe the result still contains the cropped input. Therefore, we need to remove the beginning part of the output. FYI, I am also having difficulties inferring with this model. In the paper, the author said that G2P is removed, so using |
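A minimal sketch of that trimming step, assuming the sampled output has shape (batch, frames, mel_bins) and `cond_frames` is the number of conditioning frames that were passed in (both assumptions).

```python
import torch

def strip_conditioning(sample: torch.Tensor, cond_frames: int) -> torch.Tensor:
    # The sampled mel still contains the conditioning prefix; keep only the generated frames.
    return sample[:, cond_frames:]
```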
Is there inference code? I could not find any, but I read through other issues and found this.
Originally posted by @manmay-nakhashi in #1 (comment)