Discussion on the TTS vocoder #1863

Open
mah92 opened this issue Feb 15, 2025 · 6 comments

@mah92
Contributor

mah92 commented Feb 15, 2025

Hi,
This is a question and a possible improvement proposal.
I tried to train a HiFi-GAN vocoder for other sample rates (24 kHz and 16 kHz), but every attempt produced a noisy voice.
After seeing this, I noticed that the original HiFi-GAN repo used two V100 GPUs for two weeks to get a good model, which explains why I kept failing. So I tried using the existing 22050 Hz versions instead.
After experimenting with the different HiFi-GAN generator versions (v1, v2, and v3), I also noticed that v2 is much faster than v1 and faster than v3, with no noticeable difference in quality. So I reached the same decision as the defaults in your repo.
The original repository has two versions of the HiFi-GAN v2 vocoder, here and here. Which of them are you using? It is not mentioned anywhere.

| Folder Name | Generator | Dataset | Fine-Tuned |
| --- | --- | --- | --- |
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes (Tacotron2) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |

Note that LJSpeech is single-speaker English, VCTK is multi-speaker English, and the universal dataset is a combination of LibriSpeech (multi-speaker English), VCTK, and LJSpeech.
Note: I previously thought that the universal dataset was multilingual, which is not true.
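
As a rough illustration of how the table above narrows down the choice, the sketch below encodes the listed folders as plain Python data and filters them by generator version and training data. The `Checkpoint` class, the `MULTI_SPEAKER` set, and the `candidates` helper are hypothetical names made up for this example; they are not part of the HiFi-GAN repository or of sherpa-onnx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    folder: str       # folder name as listed in the table
    generator: str    # "V1", "V2", or "V3"
    dataset: str      # training dataset
    fine_tuned: bool  # True for the Tacotron2 fine-tuned variants


# Rows copied from the table above.
CHECKPOINTS = [
    Checkpoint("LJ_V1", "V1", "LJSpeech", False),
    Checkpoint("LJ_V2", "V2", "LJSpeech", False),
    Checkpoint("LJ_V3", "V3", "LJSpeech", False),
    Checkpoint("LJ_FT_T2_V1", "V1", "LJSpeech", True),
    Checkpoint("LJ_FT_T2_V2", "V2", "LJSpeech", True),
    Checkpoint("LJ_FT_T2_V3", "V3", "LJSpeech", True),
    Checkpoint("VCTK_V1", "V1", "VCTK", False),
    Checkpoint("VCTK_V2", "V2", "VCTK", False),
    Checkpoint("VCTK_V3", "V3", "VCTK", False),
    Checkpoint("UNIVERSAL_V1", "V1", "Universal", False),
]

# Assumption for this sketch: datasets treated as multi-speaker,
# following the note above (VCTK, and Universal which includes VCTK).
MULTI_SPEAKER = {"VCTK", "Universal"}


def candidates(generator: str, multi_speaker_only: bool = True) -> list[str]:
    """Return folder names matching a generator version.

    With multi_speaker_only=True, keep only checkpoints trained on
    multi-speaker data.
    """
    return [
        c.folder
        for c in CHECKPOINTS
        if c.generator == generator
        and (not multi_speaker_only or c.dataset in MULTI_SPEAKER)
    ]


if __name__ == "__main__":
    print(candidates("V2"))
    print(candidates("V2", multi_speaker_only=False))
```

With these rows, the only multi-speaker V2 checkpoint is `VCTK_V2`, which is the one worth comparing against the single-speaker `LJ_V2`.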

@mah92
Contributor Author

mah92 commented Feb 15, 2025

This matters because I see problems with low-pitched voices (they sound "older"). If you used the single-speaker version, this could be improved by using the multi-speaker vocoder.

@csukuangfj
Collaborator

It is LJ_V{1,2,3}.

@mah92
Contributor Author

mah92 commented Feb 15, 2025

So I will test and compare, then report on the advantage of VCTK_V2.

@mah92
Contributor Author

mah92 commented Feb 15, 2025

And the miracle happened. Musa has become 20 years younger! Thank God...
I swapped in the VCTK vocoder and suddenly the man's voice became clear...
For comparison examples, see: this and that
For the new model with sherpa metadata, see here

@mah92
Contributor Author

mah92 commented Feb 15, 2025

The Khadijah (female) voice has not changed noticeably with the VCTK vocoder. If anything, the female voice reads some letters better.

@mah92
Contributor Author

mah92 commented Feb 17, 2025

I propose that you simply replace generator_v1, generator_v2, and generator_v3 in your repo with the new ones, as it is not mentioned anywhere that VCTK is not used... @csukuangfj
