
Are sentence and word duration losses necessary for unsupervised alignment? #11

Open
xiaoyangnihao opened this issue Apr 27, 2022 · 3 comments

Comments

@xiaoyangnihao

Are sentence- and word-level duration losses necessary in unsupervised alignment for robust duration prediction?

@keonlee9420
Owner

Hi @xiaoyangnihao, I'm not sure about robustness, but they do help with the correctness (accuracy) of pauses, and hence with naturalness. The effect is greatest when your dataset has complex punctuation rules.
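For readers unfamiliar with the idea: a span-level duration loss sums the per-phoneme predicted durations over each word (or the whole sentence) and penalizes the difference from the target span length. The sketch below is a hedged illustration with made-up numbers and an L1 penalty, not the repo's exact formulation; `word_ids` is a hypothetical phoneme-to-word mapping.

```python
import numpy as np

# Hypothetical example: 5 phonemes grouped into 2 words.
pred_dur = np.array([3.0, 5.0, 2.0, 4.0, 6.0])  # predicted frames per phoneme
tgt_dur = np.array([2.0, 6.0, 2.0, 3.0, 8.0])   # target frames (e.g. from the aligner)
word_ids = np.array([0, 0, 1, 1, 1])            # phoneme-to-word mapping (assumed)

# Word-level loss: compare summed durations within each word (L1), then average.
words = np.unique(word_ids)
word_loss = 0.0
for w in words:
    mask = word_ids == w
    word_loss += abs(pred_dur[mask].sum() - tgt_dur[mask].sum())
word_loss /= len(words)

# Sentence-level loss: compare total predicted vs. total target duration.
sent_loss = abs(pred_dur.sum() - tgt_dur.sum())

print(word_loss, sent_loss)  # 0.5 1.0
```

Because the penalty acts on span totals rather than individual phonemes, errors inside a word can cancel, which nudges the model toward correct word and sentence lengths (and thus pause placement) without over-constraining each phoneme.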

@xiaoyangnihao
Author

xiaoyangnihao commented May 11, 2022


Thanks for your reply. By the way, in the paper "One TTS Alignment To Rule Them All", the alignment module takes the encoder outputs and the mel spectrogram as inputs, whereas in your repo the alignment module takes the text embeddings and the mels as inputs. Have you run an experiment comparing these two choices?
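For context, the aligner in "One TTS Alignment To Rule Them All" builds a soft alignment from pairwise distances between a projected text representation (either text embeddings or encoder outputs would plug in here) and projected mel frames, with a softmax over the text axis. A minimal NumPy sketch of that distance-plus-softmax step, with hypothetical dimensions and random stand-in projections:

```python
import numpy as np

# Hypothetical sizes: T_text tokens, T_mel frames, shared projection dim D.
T_text, T_mel, D = 5, 20, 8
rng = np.random.default_rng(0)

# Either text embeddings or encoder outputs can serve as the text keys;
# in the real model both sides pass through learned projections first.
text_keys = rng.standard_normal((T_text, D))   # stand-in projected text
mel_queries = rng.standard_normal((T_mel, D))  # stand-in projected mel frames

# Negative pairwise L2 distance between every mel frame and every token.
dist = -np.linalg.norm(mel_queries[:, None, :] - text_keys[None, :, :], axis=-1)

# Softmax over the text axis: each mel frame distributes its probability
# mass over the text tokens, giving a (T_mel, T_text) soft alignment.
attn = np.exp(dist - dist.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

assert attn.shape == (T_mel, T_text)
assert np.allclose(attn.sum(axis=1), 1.0)
```

Since the distance is taken after projection in either case, swapping text embeddings for encoder outputs only changes what the projection sees, which is plausibly why the two variants behave similarly in practice.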

@keonlee9420
Owner

I just followed NeMo's implementation, and I don't think there was a specific reason for that choice.
