
DiffSinger output is non-deterministic for a given score #1440

Open
1 task done
yezhiyi9670 opened this issue Mar 9, 2025 · 0 comments

Acknowledgement

  • I have read Getting-Started and FAQ

🐛 Describe the bug

Even for identical scores, DiffSinger singers produce different audio on different runs. This could be considered a bug, since it prevents a singing voice synthesis project from producing predictable output, effectively making the project "unmaintainable": for example, once the cache is gone, it is impossible to correct a flaw in one part of the score without affecting the rest.

I understand that generative deep learning models are expected to have some randomness, since they are in effect "sampling" from a distribution. However, reproducibility of the audio output is very important. As far as I know, two approaches could fix this:

  • Explicitly check the relevant cache data into the project file (or into a file next to it), so that the cache is not volatile and the output stays reproducible. This can take up a lot of disk space, but having such an option is worthwhile.
  • Use a fixed seed (ideally changeable by the user per musical part, or even per sentence) during inference. The ONNX runtime seems to have an interface for setting a seed. To ensure reproducibility, the seed must also be written into the project file.
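The fixed-seed idea can be sketched in stdlib Python. The derivation scheme and function names below are illustrative assumptions, not OpenUtau's or DiffSinger's actual code: a per-sentence seed is derived from a project-level seed stored in the project file, so each sentence samples independently but reproducibly.

```python
import hashlib
import random

def sentence_seed(project_seed: int, sentence_id: str) -> int:
    """Derive a stable per-sentence seed from the project-level seed.

    Hypothetical scheme: hash the project seed together with a sentence
    identifier, so editing one sentence's seed never affects the others.
    """
    digest = hashlib.sha256(f"{project_seed}:{sentence_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def sample_noise(seed: int, n: int) -> list:
    """Stand-in for the diffusion model's initial noise draw."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

seed = sentence_seed(project_seed=42, sentence_id="part1/sentence3")
run1 = sample_noise(seed, 4)
run2 = sample_noise(seed, 4)
assert run1 == run2  # identical noise in -> identical rendered audio out
```

With a scheme like this, clearing the cache is harmless: as long as the project-level seed is saved in the project file, re-rendering draws the same noise and produces the same audio.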

How to reproduce the bug

  1. Select a DiffSinger-based singer.
  2. Write a piece of score.
  3. Load the predicted pitch curve from the model.
  4. Play the piece to render the audio. Then click Tools > Clear cache in the main window and play again to re-render it. Repeat several times and compare the generated audio.
    ❎ The audio differs between runs even though no changes were made to the score.
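To make the comparison in the last step objective rather than by ear, one can fingerprint each rendered file with a content hash; any bit-level difference between re-renders then shows up immediately. This is just a sketch, and the wav payloads below are fake placeholders standing in for real rendered files:

```python
import hashlib

def audio_fingerprint(wav_bytes: bytes) -> str:
    """SHA-256 of the raw file contents; any bit-level change is visible."""
    return hashlib.sha256(wav_bytes).hexdigest()

# Deterministic synthesis would give every re-render the same fingerprint.
render_run1 = b"RIFF-fake-wav-payload-run1"
render_run2 = b"RIFF-fake-wav-payload-run2"
print(audio_fingerprint(render_run1) == audio_fingerprint(render_run2))
# prints False: the payloads differ, as the re-rendered audio does here
```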

For example, I am using the QiXuan_v2.5.0 and the following piece:

The score containing "天上的星星不说话,地上的娃娃想妈妈;" in E key

This phrase is taken from a well-known song that is probably already included in the training data, so I have deliberately transposed it up by two semitones.

Here are the waveforms of 5 samples obtained from the same score. They are slightly different.

Image of five different audio samples obtained from the same score

They also have audible differences, especially in the pronunciation of "上" (shang), "说" (shuo), and "想" (xiang). In one of the samples, "上" is pronounced almost as "sheng". Having such a mistake occur unpredictably is unacceptable.

OS & Version

Windows 11 Version 24H2 (OS Build 26100.3323)

Logs

log20250309.txt
