
New TTS Model request #3

Open
rishikksh20 opened this issue Nov 19, 2021 · 19 comments

@rishikksh20

rishikksh20 commented Nov 19, 2021

Recently, two papers on Transformer-based TTS popped up:

  1. DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021
  2. Emphasis control for parallel neural TTS

I think both are easy to implement and well suited for this repo.

@keonlee9420
Owner

Hi @rishikksh20, thanks for the requests! I can see that they fit well with this project. I will look into it and hope that I can merge them with this repo :)

@rishikksh20
Author

rishikksh20 commented Nov 22, 2021

Hi @keonlee9420, DelightfulTTS is similar to the Phone-Level Mixture Density Network, but here, instead of using a complicated GMM-based model, the authors directly use latent representations for the prosody predictor and prosody encoder. The phoneme-level prosody encoder and utterance-level encoder are similar to this. I think they simply use a Global Style Token (GST) module as the utterance-level encoder.
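For reference, a GST-style utterance-level encoder can be sketched roughly as below. This is a minimal, hypothetical sketch in PyTorch (class name, dimensions, and head count are all assumptions, not the paper's exact architecture): a pooled reference embedding attends over a bank of learnable style tokens, and the attention-weighted sum serves as the utterance-level prosody embedding.

```python
import torch
import torch.nn as nn

class GSTUtteranceEncoder(nn.Module):
    """Minimal GST-style utterance-level encoder (hypothetical sketch)."""

    def __init__(self, ref_dim=128, token_dim=128, n_tokens=10, n_heads=4):
        super().__init__()
        # Bank of learnable style tokens shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, n_heads, batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (B, ref_dim), e.g. pooled from a mel reference encoder.
        q = self.query_proj(ref_embedding).unsqueeze(1)          # (B, 1, token_dim)
        kv = torch.tanh(self.tokens).unsqueeze(0)                # (1, n_tokens, token_dim)
        kv = kv.expand(q.size(0), -1, -1)                        # (B, n_tokens, token_dim)
        out, _ = self.attn(q, kv, kv)                            # attend over style tokens
        return out.squeeze(1)                                    # (B, token_dim)
```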

@rishikksh20
Author

DelightfulTTS learns phoneme-level prosody implicitly, whereas Emphasis Control for Parallel Neural TTS learns the same explicitly by extracting features from this repo.

@rishikksh20
Author

rishikksh20 commented Nov 22, 2021

I think DelightfulTTS is an all-in-one solution: it uses a non-autoregressive architecture with conformer blocks, and it has both utterance-level and phoneme-level predictors as well.
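The way the two prosody levels feed into the text encoding could be sketched as follows. This is a hypothetical PyTorch sketch (class name and dimensions are assumptions): the utterance-level embedding is broadcast over all phoneme positions, while the phoneme-level embedding is added per position.

```python
import torch
import torch.nn as nn

class ProsodyCombiner(nn.Module):
    """Hypothetical sketch: combine utterance- and phoneme-level prosody
    with the text-encoder output, DelightfulTTS-style."""

    def __init__(self, d_model=256, utt_dim=128, phon_dim=4):
        super().__init__()
        self.utt_proj = nn.Linear(utt_dim, d_model)
        self.phon_proj = nn.Linear(phon_dim, d_model)

    def forward(self, text_enc, utt_emb, phon_emb):
        # text_enc: (B, T, d_model)  -- per-phoneme text encoding
        # utt_emb:  (B, utt_dim)     -- one vector per utterance
        # phon_emb: (B, T, phon_dim) -- one vector per phoneme
        h = text_enc + self.utt_proj(utt_emb).unsqueeze(1)  # broadcast over phonemes
        h = h + self.phon_proj(phon_emb)                    # add per-phoneme prosody
        return h
```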

@keonlee9420
Owner

Thank you for the summary. The DelightfulTTS model seems worth a try, as you described. I will try it and share it through an update soon!

@rishikksh20
Author

@keonlee9420 Hi, were you able to train DelightfulTTS successfully?

@keonlee9420
Owner

Yes, but it shows an overfitting issue. I guess this issue originates from the limited capacity of the prosody predictor, since I can confirm that the prosody embedding extracted by the prosody extractor actually improves expressiveness, including the validation loss.

@rishikksh20
Author

Did you train the predictor and extractor simultaneously, or did you train the extractor for 100k steps first, then pause it and start predictor training with teacher forcing, as mentioned in the AdaSpeech paper?

@rishikksh20
Author

In my case I made some modifications to the architecture. I used the same extractors as mentioned in the DelightfulTTS paper, but I am not using any predictor at the utterance level, because I want to use it like GST-Tacotron by passing an external reference mel. For the phoneme-level predictor, I used a predictor architecture similar to the original AdaSpeech's, which resembles the duration and pitch predictors. I trained the phoneme-level extractor for 100k steps, stopped it, and then started predictor training.
While training this with batch size 32, the loss behaved perfectly up to about 2,000 steps, but after around 2,200 steps the loss started increasing, did not converge, and the output was just noise. But when I passed a detached hidden state to the phoneme-level extractor, it trained perfectly and the latent variable also works: I am able to change emotion using the latent variable of the phoneme-level predictor.
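The "detached hidden state" trick described above can be sketched in a few lines. This is a minimal PyTorch sketch with stand-in linear layers (not the real model): passing the text-encoder output through `.detach()` cuts the gradient path, so the extractor's loss cannot destabilize the text encoder.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 16)    # stand-in for the text encoder
extractor = nn.Linear(16, 4)  # stand-in for the phoneme-level prosody extractor

x = torch.randn(2, 5, 8)               # (batch, phonemes, features)
hidden = encoder(x)                    # text-encoder output ("hidden state")
prosody = extractor(hidden.detach())   # detach: no gradient flows to `encoder`
prosody.sum().backward()

assert encoder.weight.grad is None         # encoder untouched by this loss
assert extractor.weight.grad is not None   # extractor still trains normally
```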

@keonlee9420
Owner

Ah, thanks for sharing. I trained jointly, without any detach or schedule, from the first step. So what you mean is:

  1. train only the prosody extractor (not the predictor) until 100k steps
  2. start training the prosody predictor, but with a detached prosody embedding from the prosody extractor (while the prosody extractor continues training)

Right? Or in 2, do you mean that no gradient flows back to the prosody extractor either?

@rishikksh20
Author

I suggest 1.
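The suggested schedule (option 1) could be sketched like this. This is a hypothetical PyTorch sketch with stand-in modules and an assumed MSE objective, not the actual training code: the extractor trains from the start, and after 100k steps the predictor loss is added in a teacher-forcing style against the detached extractor output.

```python
import torch
import torch.nn as nn

extractor = nn.Linear(8, 4)  # stand-in prosody extractor
predictor = nn.Linear(8, 4)  # stand-in prosody predictor
opt = torch.optim.Adam(list(extractor.parameters()) + list(predictor.parameters()))

def training_step(step, text_hidden, target_prosody):
    # Stage 1 (always on): train the extractor against a prosody target.
    prosody = extractor(text_hidden)
    loss = nn.functional.mse_loss(prosody, target_prosody)
    if step >= 100_000:
        # Stage 2: predictor learns to match the extractor's output;
        # detach so the predictor loss cannot push the extractor around.
        pred = predictor(text_hidden)
        loss = loss + nn.functional.mse_loss(pred, prosody.detach())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```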

@rishikksh20
Author

@keonlee9420 In your experience, which performs better when you have only 20 hours of speech data: a normal Transformer encoder or a Conformer?

@rishikksh20
Author

As per this article, Microsoft's TTS API is built on DelightfulTTS.

@hdmjdp

hdmjdp commented Jan 30, 2022

> I suggest 1

@rishikksh20 Can you share your code?

@hdmjdp

hdmjdp commented Jan 30, 2022

> detached hidden state

@rishikksh20 Does this refer to the text encoder output?

@rishikksh20
Author

> > detached hidden state
>
> @rishikksh20 Does this refer to the text encoder output?

Yes.

@hdmjdp

hdmjdp commented Jan 30, 2022

@rishikksh20 After 100k steps, do the params of the prosody extractor keep updating, or are they frozen?
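For context, the two options being asked about can be sketched as follows (a minimal PyTorch sketch with a stand-in module, not the actual code): leaving the extractor as-is keeps its parameters updating, while freezing disables gradient computation for them.

```python
import torch.nn as nn

extractor = nn.Linear(16, 4)  # stand-in prosody extractor

# Option A: do nothing -- parameters keep receiving gradients and updating.
assert all(p.requires_grad for p in extractor.parameters())

# Option B: freeze after 100k steps -- optimizer no longer changes them.
extractor.requires_grad_(False)
assert all(not p.requires_grad for p in extractor.parameters())
```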

@v-nhandt21

Is there any confirmation on the quality of the Transformer encoder versus the Conformer? I found that the conformer in DelightfulTTS is a little different from the ASR one.

@rishikksh20
Author

@v-nhandt21 Yes, the conformer in TTS is a modified version of the ASR one.
