
inference code example #8

Open · eschmidbauer opened this issue Jul 18, 2024 · 146 comments

Comments

@eschmidbauer

eschmidbauer commented Jul 18, 2024

Is there inference code? I could not find any, but I read through other issues and found this.

          I'll write an inference script next so we can do some quick experiments.

Originally posted by @manmay-nakhashi in #1 (comment)

@manmay-nakhashi
Contributor

I'll put together some code this weekend.

@G-structure
Contributor

Okay, so I took a shot at hacking together some inference code. I trained a model for 400k steps on the MushanW/GLOBE dataset; when I test it I get a cacophony that is starting to resemble the TTS prompts, but the intermediate mel spectrogram is of very poor quality, so something might be wrong with my approach.

image

import os

import torch
import torchaudio
from torchaudio.transforms import GriffinLim, InverseSpectrogram, InverseMelScale, Resample, Speed

from einops import rearrange
from accelerate import Accelerator

from torch.optim import Adam

from e2_tts_pytorch.e2_tts import (
    E2TTS,
    DurationPredictor,
    MelSpec
)

duration_predictor = DurationPredictor(
    transformer = dict(
        dim = 80,
        depth = 2,
    )
)

model = E2TTS(
    duration_predictor = duration_predictor,
    transformer = dict(
        dim = 80,
        depth = 4,
        skip_connect_type = 'concat'
    )
)

n_fft = 1024
sample_rate = 22050
checkpoint_path = "./e2tts.pt"

def exists(v):
    return v is not None

def vocoder(melspec):
    inverse_melscale_transform = InverseMelScale(n_stft=n_fft // 2 + 1, n_mels=80, sample_rate=sample_rate, norm="slaney", f_min=0, f_max=8000)
    spectrogram = inverse_melscale_transform(melspec)
    transform = GriffinLim(n_fft=n_fft, hop_length=256, power=2)
    waveform = transform(spectrogram)
    return waveform

def load_checkpoint(checkpoint_path, model, accelerator, optimizer):
    if not exists(checkpoint_path) or not os.path.exists(checkpoint_path):
        return 0

    checkpoint = torch.load(checkpoint_path)
    accelerator.unwrap_model(model).load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['step']

accelerator = Accelerator(
            log_with="all",
        )

optimizer = Adam(model.parameters(), lr=1e-4)

start_step = load_checkpoint(checkpoint_path=checkpoint_path, model=model, accelerator=accelerator, optimizer=optimizer)

ref_waveform, ref_sample_rate = torchaudio.load("ref.wav", normalize=True)
resampler = Resample(orig_freq=ref_sample_rate, new_freq=sample_rate)
ref_waveform = resampler(ref_waveform)
speed_factor = sample_rate / ref_sample_rate
respeed = Speed(ref_sample_rate, speed_factor)
ref_waveform = respeed(ref_waveform)
ref_waveform_resampled = ref_waveform[0]

mel_model = MelSpec()
mel = mel_model(ref_waveform_resampled)
# duplicate the reference mel along the batch dimension so it matches the two target texts below
mel = torch.cat([mel, mel], dim=0)
mel = rearrange(mel, 'b d n -> b n d')

text = ["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.", "Waves crashed against the cliffs, their thunderous applause echoing for miles."]
# condition on the first 25 mel frames of the reference and let the model generate the rest
sample = model.sample(mel[:,:25], text = text, vocoder=vocoder)
sample = sample.to('cpu')

waveform = sample

mono_channel_1 = waveform[0].unsqueeze(0)
mono_channel_2 = waveform[1].unsqueeze(0)

torchaudio.save("output_channel_1.wav", mono_channel_1, sample_rate)
torchaudio.save("output_channel_2.wav", mono_channel_2, sample_rate)

@changjinhan

I'm also looking forward to it

@lucasnewman
Contributor

I haven't figured out what's up with the text conditioning yet, but here's a rough sample (it doesn't use the duration predictor) of the generation flow in a notebook. I left in some debugging outputs so you can see the flow resolving visually. The voice cloning aspect seems to work fine with different speakers, fwiw, they just say nonsense at the moment 😅

(This is from a quick ~100M param model I trained with ~1/100th the FLOPs used in the paper.)

generate.ipynb

@Coice

Coice commented Jul 23, 2024

@lucasnewman does it always output the reference audio, regardless of what you use as input for the reference text? I also left out the duration predictor, I wound up just simply doubling the input duration and doubling the reference text, since if it can't do the doubling, it sure won't work for anything else 🙃

I couldn't get it to generate anything aside from the input reference audio. I was told by the author "train it more", but I put considerable resources into it, and it never improved.

@lucasnewman
Contributor

@lucasnewman does it always output the reference audio, regardless of what you use as input for the reference text? I also left out the duration predictor, I wound up just simply doubling the input duration and doubling the reference text, since if it can't do the doubling, it sure won't work for anything else 🙃

Yep, this is exactly what I'm doing, and more or less what you see in the notebook — I just hard-coded the duration to keep it simple.

I couldn't get it to generate anything aside from the input reference audio. I was told by the author "train it more", but I put considerable resources into it, and it never improved.

Make sure you took the duration fix from a few days ago if you're explicitly passing it as an int, because otherwise it will stop generation after the conditioning. You don't need to retrain your model, as it only affects sampling.
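For anyone following along, a minimal sketch of that sanity check, i.e. hard-coding the duration as an int that covers the conditioning plus the continuation, with the reference text doubled (variable names here are illustrative, not from the repo):

# cond_mel: (1, frames, mels) mel of the reference clip; ref_text: its transcript
cond_len = cond_mel.shape[1]

sample = e2tts.sample(
    cond_mel,
    text = [ref_text + ' ' + ref_text],   # doubled reference text
    duration = cond_len * 2,              # explicit int: reference frames plus as many again to generate
)
# everything after the first cond_len frames is the newly generated portion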

@Ryu1845

Ryu1845 commented Jul 23, 2024

I haven't figured out what's up with the text conditioning yet, but here's a rough sample (it doesn't use the duration predictor) of the generation flow in a notebook. I left in some debugging outputs so you can see the flow resolving visually. The voice cloning aspect seems to work fine with different speakers, fwiw, they just say nonsense at the moment 😅

(This is from a quick ~100M param model I trained with ~1/100th the FLOPs used in the paper.)

generate.ipynb

The output sample seems to be gibberish after what I assume to be the prompt(?)

Thank you for your work though!

generated.mp4

@Coice

Coice commented Jul 23, 2024

@lucasnewman The code I was using was based off of a modified version of voicebox, though I did try training an early version of this repo, but at the time it was giving nan's.

Just to be clear, if you put in any other text, do you still get the exact reference audio? The model I trained always ignored the text embeddings, I'm just wondering if you have the same issue. It looks like it just learns to pass through the input.

Another thing you can try is, pass in a masked region to see if it can do the training objective in inference mode. Can it do the infill? (When I tested this with my model, it was just gibberish, but the unmasked regions were the original audio basically).

@lucasnewman
Contributor

@lucasnewman The code I was using was based off of a modified version of voicebox, though I did try training an early version of this repo, but at the time it was giving nan's.

Just to be clear, if you put in any other text, do you still get the exact reference audio? The model I trained always ignored the text embeddings, I'm just wondering if you have the same issue. It looks like it just learns to pass through the input.

I haven't tried, but that would correlate to what I was referencing with the text conditioning.

Another thing you can try is, pass in a masked region to see if it can do the training objective in inference mode. Can it do the infill? (When I tested this with my model, it was just gibberish, but the unmasked regions were the original audio basically).

Yeah, this is effectively the same task with a different mask region, so I would expect similar results for now since the text conditioning doesn't seem to be working right. I don't actually have a ton of extra time to spend on debugging it, but you're welcome to run some experiments! The latest version of the code is almost exactly what I trained.

@juliakorovsky

I'm trying to train this model with another repo (I've slightly changed the Voicebox repo) on around 300 hours of data. Only at around 400,000 iterations did it start to output something sounding like speech (but the speech was gibberish). I also get random noise a lot, as if the model were unable to fill in the blanks. The voice the model uses also does not resemble the target voice for now. I'm thinking about increasing gradient accumulation to match the paper's batch size, in case the model just doesn't see enough data per iteration.

@HaiFengZeng

HaiFengZeng commented Jul 29, 2024

I think the model is hard to train when it has to directly learn the alignment from text to mel spectrogram. Has anyone gotten a reasonable result? I also get speech with the same timbre, but the speech is not what is expected for the text input, so I think the model doesn't learn the alignment properly.
Sorry for this; after training for a longer time, some of the results seem to become more related to the input text (some words missing, some just wrong words...), but they are much better.

@eschmidbauer
Author

I'm trying to test inference again. It is very slow, and it appears the code is using the CPU instead of the GPU: in the image, GPU is at 0%, VRAM is at 9%, and CPU is at 2718%.
Any ideas why this might be happening?
image
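In case it helps, a minimal sketch of pushing everything onto the GPU explicitly (assuming CUDA is available); otherwise the model and tensors in the scripts above stay on the CPU:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)        # move the E2TTS weights to the GPU
mel = mel.to(device)            # the conditioning mel must live on the same device

with torch.inference_mode():    # skip autograd bookkeeping during sampling
    sample = model.sample(mel[:, :25], text = text)

sample = sample.cpu()           # bring the result back for torchaudio.save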

@AbrarMahmud

Is there any pre-trained checkpoint available for this model?

Thanks in advance.

@eschmidbauer
Author

I was eventually able to get inference to work by changing this line:

sample = e2tts.sample(mel[:,:25].to("cuda"), text = text)

But I only get noise on the output. Has anyone else been able to get inference to work?

@eschmidbauer
Author

shared checkpoint here

@Coice

Coice commented Aug 6, 2024

@eschmidbauer Were you successful in getting it to generate speech from the text input?

@eschmidbauer
Author

@Coice no, but maybe my inference script needs work. Maybe someone else is able to generate speech and can share the code.

@juliakorovsky

juliakorovsky commented Aug 6, 2024

I was eventually able to get inference to work by changing this line:

sample = e2tts.sample(mel[:,:25].to("cuda"), text = text)

But I only get noise on the output. Has anyone else been able to get inference to work?

I'll add what I know in case someone's interested. I'm trying to train E2 TTS with another repo on a small dataset; I had to rewrite some code because I used the Voicebox repo. I tried to train it for a couple of weeks, but the network only generated noise. I decided to print the gradients for all parameters and found that the attention gradients were always zero. After some digging I found I had accidentally set my attention dropout to 1. When I fixed it I got something resembling speech instead of noise. The model still can't pronounce many sounds properly, but at least I can see now that it learns. If your model outputs only noise even at 400,000 iterations (400,000 is just an example; theoretically at this stage it should be able to generate something), I would recommend double-checking the gradients: maybe there's some mistake and the gradients are None, or they might be zero, or you might have vanishing gradients.
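To make that gradient check concrete, here is a minimal sketch (plain PyTorch, not tied to any particular repo; model is whatever module you are training):

# run this right after loss.backward(), before optimizer.step()
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: grad is None (parameter not reached by backprop)")
        continue
    grad_norm = param.grad.norm().item()
    if grad_norm == 0.0:
        print(f"{name}: grad norm is exactly zero")
    elif grad_norm < 1e-8:
        print(f"{name}: grad norm {grad_norm:.2e} (possibly vanishing)")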

@G-structure
Contributor

Yeah, so as we are finding, this model requires a considerable amount of training...

From the paper:

We utilized the Libriheavy dataset [30] to train our models. The Libriheavy dataset comprises 50,000 hours of read English speech from 6,736 speakers, accompanied by transcriptions that preserve case and punctuation marks.

We modeled the 100-dimensional log mel-filterbank features, extracted every 10.7 milliseconds from audio samples with a 24 kHz sampling rate.

All models were trained for 800,000 mini-batch updates with an effective mini-batch size of 307,200 audio frames.

Meaning the model saw roughly 0.913 hours of audio per mini-batch (307,200 frames × 10.7 ms), or about 730,000 hours total across the 800,000 updates: roughly 15 epochs over Libriheavy.
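The arithmetic behind those numbers, for anyone who wants to check it (frame rate and batch size taken from the paper excerpt above):

frames_per_batch = 307_200        # effective mini-batch size in audio frames
frame_hop_s = 0.0107              # 10.7 ms per log-mel frame
updates = 800_000
libriheavy_hours = 50_000

hours_per_batch = frames_per_batch * frame_hop_s / 3600
total_hours = hours_per_batch * updates
epochs = total_hours / libriheavy_hours

print(f"{hours_per_batch:.4f} h per mini-batch, {total_hours:,.0f} h total, ~{epochs:.1f} epochs")
# -> 0.9131 h per mini-batch, 730,453 h total, ~14.6 epochs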

From the WER graphs, we observe that the Voicebox models demonstrated a good WER even at the 10% training point, owing to the use of frame-wise phoneme alignment. On the other hand, E2 TTS required significantly more training to converge. Interestingly, E2 TTS achieved a better WER at the end of the training. We speculate this is because the E2 TTS model learned a more effective grapheme-to-phoneme mapping based on the large training data, compared to what was used for Voicebox. From the SIM-o graphs, we also observed that E2 TTS required more training iteration, but it ultimately achieved a better result at the end of the training. We believe this suggests the superiority of E2 TTS, where the audio model and duration model are jointly learned as a single flow-matching Transformer.

image

From this chart we can infer that the model doesn't really start to learn how to speak until it has seen roughly 73,000 hours of audio (the 10% training point).

Here is the output I am getting after training a model with the same specs as the paper on 4,440.9 hours of audio:
It is trying to say "son, he would tell him. son, he would tell him". Note the first utterance is the reference audio; the second half is the TTS generation.

e2tts_500k.mp4

@lucasnewman
Contributor

@cyber-phys FWIW that lines up with my napkin math and your sample sounds similar to my experiments.

I tried a trick where I used a scale factor that ramps from (0, 1] for the random times selection for a few thousand steps, forcing the model to learn stronger conditioning from close-to-noise time steps, which seemed to help a little bit with pronunciation in a low data regime (you could recognize words with ~10k hours of audio training), but nothing close to the quality of Voicebox, which obviously has a big alignment advantage.

It seems like you need a bunch of training over 50k+ hours of audio to make a dent on this one, which is kind of cool because it's possible to just brute force the alignment, but also probably out of reach for most academic/unfunded settings, unfortunately.

@skirdey

skirdey commented Aug 7, 2024

I have a training job running that has seen around 2,000,000 speech samples out of 13M total. I am training on multilingual datasets, so it will most likely take a while before it can produce coherent speech. But it does "speak" a combination of languages now, with no apparent alignment to the text prompt.
output_1089.webm

You can find latest checkpoint here https://drive.google.com/drive/folders/11m6ftmJbxua7-pVkQCA6qbfLMlsfC_Ls?usp=drive_link

from e2_tts_pytorch.e2_tts import E2TTS, DurationPredictor
from schedulefree import AdamWScheduleFree

# Initialize the duration predictor and TTS model
duration_predictor = DurationPredictor(
    transformer=dict(
        dim=512,
        depth=6,
        heads=2,
        dim_head=64,
        max_seq_len=4000
    )
)

e2tts = E2TTS(
    duration_predictor=duration_predictor,
    num_channels=100,
    transformer=dict(
        dim=1024,
        depth=24,
        skip_connect_type='concat',
        heads=16,
        dim_head=64,
        max_seq_len=4000
    ),
    text_num_embeds=256,
    cond_drop_prob=0.25,
)
optimizer = AdamWScheduleFree(e2tts.parameters(), lr=3e-5)
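One caveat on AdamWScheduleFree (from the schedulefree package): as far as I know it has to be switched between modes around training and evaluation, roughly like this (a usage reminder, not code from this repo):

optimizer.train()   # before the training steps
# ... forward / backward / optimizer.step() ...
optimizer.eval()    # before validation, sampling, or saving a checkpoint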

@HaiFengZeng

I would like to share a sample based on another modified repo: it really needs a lot of resources (I only have 4x4090 GPUs) and nearly two weeks of training, and the result seems to need even more training. I only used two datasets: GigaSpeech and LibriTTS.

yy-2.mp4

text: you are very handsome.

ref_yy-2.mp4

@juliakorovsky

@cyber-phys FWIW that lines up with my napkin math and your sample sounds similar to my experiments.

I tried a trick where I used a scale factor that ramps from (0, 1] for the random times selection for a few thousand steps, forcing the model to learn stronger conditioning from close-to-noise time steps, which seemed to help a little bit with pronunciation in a low data regime (you could recognize words with ~10k hours of audio training), but nothing close to the quality of Voicebox, which obviously has a big alignment advantage.

It seems like you need a bunch of training over 50k+ hours of audio to make a dent on this one, which is kind of cool because it's possible to just brute force the alignment, but also probably out of reach for most academic/unfunded settings, unfortunately.

Could you show the code for the scale factor trick? Or link to it if it's included in this repo.

@lucasnewman
Contributor

lucasnewman commented Aug 7, 2024

It was just an experiment to see if the text conditioning was working at all — I'm not sure it's a great idea in general.

My intuition was that the joint training objective is particularly difficult for alignment because the "fingerprint" of the flow is pretty well established in the first ~2-3% of the ODE steps and at that point the model will primarily use the flow from the previous timestep for prediction. If we force the model to predict from near-noise earlier, we can bias the training objective towards the text conditioning at the start.

(Also I forgot to mention that I used phonemes instead of the raw byte encoding to make it a little easier on the model because I'm using a smaller dataset.)

You can reproduce it with something like:

diff --git a/e2_tts_pytorch/e2_tts.py b/e2_tts_pytorch/e2_tts.py
index b43a8d5..c83cf05 100644
--- a/e2_tts_pytorch/e2_tts.py
+++ b/e2_tts_pytorch/e2_tts.py
@@ -694,7 +694,8 @@ class E2TTS(Module):
         *,
         text: Int['b nt'] | List[str] | None = None,
         times: Int['b'] | None = None,
-        lens: Int['b'] | None = None,
+        times_scale: Float | None = None,
+        lens: Int['b'] | None = None
     ):
         # handle raw wave
 
@@ -740,6 +741,10 @@ class E2TTS(Module):
         # t is random times from above
 
         times = torch.rand((batch,), dtype = dtype, device = self.device)
+
+        if exists(times_scale):
+            times = times * max(1e-3, times_scale)
+        
         t = rearrange(times, 'b -> b 1 1')
 
         # sample xt (w in the paper)

And then in your trainer class define num_time_scale_steps and do:

if global_step < self.num_time_scale_steps:
    scale_progress = float(global_step) / float(self.num_time_scale_steps)
    times_scale = min(1.0, scale_progress)
else:
    times_scale = None

loss, ... = self.model(mel_spec, text = text, lens = mel_lengths, times_scale = times_scale)

@lucidrains
Owner

Screen Shot 2024-08-07 at 1 46 59 PM

@HaiFengZeng

It was just an experiment to see if the text conditioning was working at all — I'm not sure it's a great idea in general.

My intuition was that the joint training objective is particularly difficult for alignment because the "fingerprint" of the flow is pretty well established in the first ~2-3% of the ODE steps and at that point the model will primarily use the flow from the previous timestep for prediction. If we force the model to predict from near-noise earlier, we can bias the training objective towards the text conditioning at the start.

(Also I forgot to mention that I used phonemes instead of the raw byte encoding to make it a little easier on the model because I'm using a smaller dataset.)


Good idea. How do you do inference when applying time_scale? Will it use fewer NFE steps?

@acul3

acul3 commented Aug 13, 2024

Has anyone gotten a good result already (both text alignment and sound similarity/quality)?

4xA100 for a week, 30k hours of data, and it still produces speech inconsistent with the text.

@changjinhan

changjinhan commented Aug 13, 2024

@acul3 I'm working with a similar setup, just using GLOBE. Here’s what I have so far—could you share your intermediate results as well?

Text: (separated from the main spacecraft and began its descent to the moon's surface.) Waves crashed against the cliffs, their thunderous applause echoing for miles.

e2tts_680k.mp4

@acul3

acul3 commented Aug 13, 2024

@changjinhan how long did you train it?

I am training multilingual (Indonesian and Malay).

The output is acceptable, but it seems to have a hard time following the text.

Here is my config:

from e2_tts_pytorch.e2_tts import E2TTS, DurationPredictor
from schedulefree import AdamWScheduleFree

# Initialize the duration predictor and TTS model
duration_predictor = DurationPredictor(
    transformer=dict(
        dim=512,
        depth=6,
        heads=2,
        dim_head=64,
        max_seq_len=4000
    )
)

e2tts = E2TTS(
    duration_predictor=duration_predictor,
    num_channels=100,
    transformer=dict(
        dim=1024,
        depth=24,
        skip_connect_type='concat',
        heads=16,
        dim_head=64,
        max_seq_len=4000
    ),
    text_num_embeds=256,
    cond_drop_prob=0.25,
)
optimizer = AdamWScheduleFree(e2tts.parameters(), lr=3e-5)

can you share yours? appreciated

@changjinhan

changjinhan commented Aug 14, 2024

@acul3 I trained it for 6 days and my config is as follows:

from itertools import chain

from torch.optim import Adam
from e2_tts_pytorch.e2_tts import E2TTS, DurationPredictor

duration_predictor = DurationPredictor(
    transformer = dict(
        dim = 512,
        depth = 6,
    )
)

e2tts = E2TTS(
    num_channels=80,
    transformer = dict(
        dim = 512,
        depth = 12,
        skip_connect_type = 'concat'
    )
)
optimizer = Adam(chain(e2tts.parameters(), duration_predictor.parameters()), lr=7.5e-5)

@darylsew

darylsew commented Sep 9, 2024

Gotcha, I had refactored some stuff and mistakenly entered 'unet' since the paper mentioned 'UNet-style skip connections', lol.. that could def be it.
The learning rate calculated in my failed experiment was:

Batch size ratio: 0.7083
Adjusted learning rate: 5.31e-05

I think I changed too many things at once, so I'm going to try to just replicate your initial quick-training results with a smaller model and the phonemizer, and change one thing at a time from there. I had some issues using the phonemizer in the repo actually; will post in issues.

@manmay-nakhashi
Contributor

manmay-nakhashi commented Sep 9, 2024

datasets used:
https://huggingface.co/datasets/MushanW/GLOBE
https://huggingface.co/datasets/ShoukanLabs/AniSpeech
https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
lr = 3e-4
epochs: 4
extended phonemes like @lucasnewman did

e2tts = E2TTS(
    tokenizer = 'phoneme_en',
    cond_drop_prob = 0.2,
    interpolated_text=True,
    transformer = dict(
        dim = 384,
        depth = 12,
        heads = 6,
        max_seq_len = 2048,
        skip_connect_type = 'concat'
    ),
    mel_spec_kwargs = dict(
        filter_length = 1024,
        hop_length = 256,
        win_length = 1024,
        n_mel_channels = 100,
        sampling_rate = 24000,
    ),
    immiscible=True,
    frac_lengths_mask = (0.5, 0.9)
)

generated.wav

@ILG2021

ILG2021 commented Sep 9, 2024

datasets used: https://huggingface.co/datasets/MushanW/GLOBE https://huggingface.co/datasets/ShoukanLabs/AniSpeech https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 lr = 3e-4 epochs: 4 extended phonemes like @lucasnewman did [...]

Hello, can you share the pretrained model?

@lucidrains
Owner

lucidrains commented Sep 9, 2024

@manmay-nakhashi you are welcome to join us for dinner, but you'll have to fly here 😄 hit me up if you are ever in SF though!

@lucidrains
Owner

I think I will also just fix the transformer to always use concat UNet skips, unless there are any objections.

@manmay-nakhashi
Contributor

@lucidrains ha ha, I am most likely coming there next month.

@lucidrains
Owner

@manmay-nakhashi that's when we are planning on dinner! when are you arriving? send me an email and we can coordinate

@manmay-nakhashi
Contributor

@ILG2021 I'll share a pretrained model tomorrow. It's not fully trained, but it can at least save some alignment time.

@ILG2021

ILG2021 commented Sep 9, 2024

@ILG2021 I'll share a pretrained model tomorrow. It's not fully trained, but it can at least save some alignment time.

Thank you.

@JingRH

JingRH commented Sep 11, 2024

Has anyone noticed this repository: https://github.com/feizc/FluxMusic? They also implicitly use a doublestream structure. This kind of multimodal fusion architecture might truly be the direction for future development.

@yangyyt

yangyyt commented Sep 11, 2024

@ILG2021 I'll share a pretrained model tomorrow. It's not fully trained, but it can at least save some alignment time.

Looking forward to you sharing the model, @manmay-nakhashi.

@manmay-nakhashi
Contributor

manmay-nakhashi commented Sep 11, 2024

e2-tts-pytorch-small
@ILG2021 @yangyyt @lucidrains

@manmay-nakhashi
Contributor

@lucidrains can we keep the interpolation method in?
I have trained a model and it works.
This checkpoint was trained with interpolated_text=True.

@lucidrains
Owner

@lucidrains can we keep the interpolation method in? I have trained a model and it works. This checkpoint was trained with interpolated_text=True.

added it back!

@lucidrains
Owner

@skirdey are you near SF? do you want to join us for dinner mid next month? you were the one who alerted me to this paper!

@Oktai15

Oktai15 commented Sep 12, 2024

@manmay-nakhashi thanks for your checkpoint, but the current lucidrains main branch doesn't work with it. Could you share the correct commit/branch/link to the actual code (especially the inference script)?

@manmay-nakhashi
Contributor

manmay-nakhashi commented Sep 12, 2024

@Oktai15 I trained on 0.9.1, but I'll check over the weekend what the difference is between the current version and 0.9.1 that breaks it.

@Oktai15

Oktai15 commented Sep 12, 2024

@manmay-nakhashi can you share the inference script?

@lucidrains
Owner

@manmay-nakhashi thanks for your checkpoint, but the current lucidrains main branch doesn't work with it. Could you share the correct commit/branch/link to the actual code (especially the inference script)?

you can always just install the specific version
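For example, assuming the package is published on PyPI under the same name as the repo:

pip install e2-tts-pytorch==0.9.1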

@evelynyhc

Will the trained duration predictor model be saved? And at inference time, is the duration model used?

@fakerybakery

@Oktai15 I trained on 0.9.1, but I'll check over the weekend what the difference is between the current version and 0.9.1 that breaks it.

@manmay-nakhashi would you mind sharing what code you used for inference on 0.9.1? thanks!

@manmay-nakhashi
Contributor

manmay-nakhashi commented Oct 3, 2024

@fakerybakery

import datetime
from pathlib import Path

import torch
import torchaudio
from torchaudio.transforms import MelSpectrogram
torch.backends.cudnn.benchmark = True

from einops import rearrange

from vocos import Vocos

from e2_tts_pytorch.e2_tts import E2TTS, DurationPredictor

import matplotlib.pyplot as plt
from IPython.display import Audio

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

checkpoint_path = ""
audio_path = ""
text = ""

# load the model
duration_predictor = DurationPredictor( 
    transformer = dict(
        dim = 384,
        depth = 8,
        heads = 6)).to('cuda')

e2tts = E2TTS(
    tokenizer = 'phoneme_en',
    cond_drop_prob = 0.2,
    interpolated_text=True,
    transformer = dict(
        dim = 384,
        depth = 12,
        heads = 6,
        max_seq_len = 2048,
        skip_connect_type = 'concat'
    ),
    mel_spec_kwargs = dict(
        filter_length = 1024,
        hop_length = 256,
        win_length = 1024,
        n_mel_channels = 100,
        sampling_rate = 24000,
    ),
    immiscible=True,
    frac_lengths_mask = (0.5, 0.9)
)

checkpoint = torch.load(checkpoint_path, map_location='cpu', weights_only = True)
e2tts.load_state_dict(checkpoint['model_state_dict'], strict=False)
e2tts.eval()
duration_predictor.eval()
# load a sample audio file
audio, sr = torchaudio.load(Path(audio_path).expanduser())
if sr != 24000:
    resampler = torchaudio.transforms.Resample(sr, 24000)
    audio = resampler(audio)        
original_mel_spec = e2tts.mel_spec(audio).squeeze(0)

# visualize the mel spectrogram
plt.figure(figsize=(6, 4))
plt.imshow(original_mel_spec.numpy(), origin='lower', aspect='auto')
plt.colorbar()
plt.show()

# mask off the second half

mask_length = original_mel_spec.shape[1] // 2
mel_spec = rearrange(original_mel_spec, 'd n -> 1 n d')[:, :mask_length]

print(f"Text: {text}")

# if you want to use an accelerator, e.g. cuda or mps
device = torch.device('cuda')
e2tts = e2tts.to(device)
mel_spec = mel_spec.to(device)

start_date = datetime.datetime.now()

with torch.inference_mode():
    text = e2tts.tokenizer([text]).to(device)
    print(mel_spec.shape,  original_mel_spec.shape[1])
    duration = duration_predictor(mel_spec, text = text)
    print(duration)
    generated = e2tts.sample(
        cond = mel_spec,
        text = text,
        duration = original_mel_spec.shape[1],
        steps = 32,
        cfg_strength = 1.0
    )
    
    print(f"Generated: {generated.shape} in {datetime.datetime.now() - start_date}")
    generated_mel_spec = rearrange(generated, '1 n d -> 1 d n')

# vocode into audio

wave = vocos.decode(rearrange(original_mel_spec, 'd n -> 1 d n').cpu())
print(f"wave: {wave.shape}")

wave2 = vocos.decode(generated_mel_spec.cpu())
print(f"wave2: {wave2.shape}")

# show previews of the original and generated audio

print("Original:")

torchaudio.save("original.wav", wave, 24_000)

print("Generated:")

torchaudio.save("generated.wav", wave2, 24_000)

@BlazJurisic

Thank you @manmay-nakhashi for the code. Unfortunately, I still get errors with the tokenizer on ==0.9.1 when trying to encode text.

@manmay-nakhashi
Contributor

@BlazJurisic Take the new g2p changes from the current master.

@fakerybakery

Hi @manmay-nakhashi,
Thanks for sharing the code!
I'm getting the following issue when trying to run it:

Traceback (most recent call last):
  File "run.py", line 97, in <module>
    print(f"Generated: {generated.shape} in {datetime.datetime.now() - start_date}")
                        ^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'shape'

@JohnHerry

I may have some misunderstanding about the DurationPredictor; can anyone give me some help?
In the code I see that the DurationPredictor predicts a total duration from the prompt speech, which means it is predicting the length of the prompt speech in frames.
Then during inference it gives a total length of len1 + len2, where len1 is filled with the prompt speech frames and len2 (predicted by the DurationPredictor from the prompt speech, plus 1) is what E2-TTS should actually predict.
That means that no matter how long the input text is, the total number of generated speech frames is fixed at len2, and len2 is predicted by a DurationPredictor that is trained mainly on mel frames rather than text tokens. Is that a reasonable result?
Shouldn't the DurationPredictor predict the total length mainly based on the input text tokens, instead of the prompt speech frames?
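(For reference, in the inference script shared earlier in this thread the pieces fit together roughly like this; the duration argument to sample() is the total length in frames, conditioning included, and the variable names below are illustrative:)

prompt_frames = cond_mel.shape[1]     # len1: frames taken from the prompt speech
total_frames = prompt_frames * 2      # len1 + len2: total length, hard-coded here like most posts above

out = e2tts.sample(cond = cond_mel, text = text_tokens, duration = total_frames)
# out[:, :prompt_frames] reproduces the prompt; out[:, prompt_frames:] is the generated continuation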

@JohnHerry

okay so I took a shot at hacking together some inference code. I trained a model for 400k steps on the MushanW/GLOBE dataset [...]

sample = model.sample(mel[:,:25], text = text, vocoder=vocoder)

Does that mean 25 frames of the target speaker are enough to synthesize their timbre?

@JohnHerry

JohnHerry commented Dec 2, 2024

Trained on more than 100,000 hours of speech for over 500,000 steps, about half an epoch, but I still cannot get a good result.
The training loss does not seem to go down anymore.
And the DurationPredictor, trained to 380,000 steps, seems to always predict the value 1, which is completely useless.
When testing inference, I use a constant duration=1000, meaning the conditioning and generated mel are 1000 frames in total. The default sample function generates pure noise, nothing useful.
I then changed the diffusion steps from the default 35 to 100, and it can produce something: the conditioning audio, then a puff of noise, and then generated audio for the input text. The generated audio loses the beginning of the text, and it sounds different from the conditioning speaker.
And because the constant duration=1000 may not fit the text, the generated audio has very bad prosody, even more robot-like than traditional FastSpeech TTS.

@JohnHerry

The sample results contain all kinds of overlapping audio artifacts; is there something wrong with the sampling process?

@ndhuynh02

ndhuynh02 commented Dec 17, 2024

Since E2TTS takes a mel spectrogram as input and its job is to fill in the masked part (this is why there is a crop in the input mel), I believe the result still returns the cropped input. Therefore, we need to remove the beginning part of the output.

FYI, I am also having difficulty running inference with this model. In the paper, the authors say that G2P is removed, so using phoneme_en as the tokenizer doesn't seem right. In addition, model hyperparameters like depth, dim, heads, etc. need modification to match what is described in the paper.
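A minimal sketch of that trimming step, reusing the names from the inference script shared earlier (mel_spec is the conditioning prefix, original_mel_spec the full reference mel); this is an illustration, not code from the repo:

cond_frames = mel_spec.shape[1]

generated = e2tts.sample(
    cond = mel_spec,
    text = text,
    duration = original_mel_spec.shape[1],   # total length, conditioning included
    steps = 32,
)

new_part = generated[:, cond_frames:, :]     # drop the prefix that just reproduces the conditioning
wave_new = vocos.decode(rearrange(new_part, '1 n d -> 1 d n').cpu())
torchaudio.save("generated_only.wav", wave_new, 24_000)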
