Even for identical scores, DiffSinger singers produce different audio on different runs. This could be considered a bug, since it undermines the ability of a singing voice synthesis project to produce predictable output and makes the project "unmaintainable". For example, once the cache is gone, it is not possible to correct a flaw in one part of the score without also changing the audio of the untouched parts.
I understand that generative deep learning models are expected to have some randomness, since they are actually "sampling" from some distribution. However, reproducibility of the audio output is very important. As far as I know, two approaches could fix this:
Explicitly check the relevant cache data into the project file (or into a file next to the project file), so that the cache is no longer volatile and the output stays reproducible. This can take up a lot of disk space, but it is worthwhile to have such an option.
Use a fixed seed during inference (ideally changeable by the user for each musical part or even each sentence). The ONNX runtime seems to have an interface for setting a seed. To ensure reproducibility, the seed must also be written into the project file.
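To illustrate the second approach, here is a minimal sketch in Python of how a single seed stored in the project could be expanded into a stable seed for each phrase, so that re-rendering one phrase starts from the same noise every time. This is not OpenUtau or DiffSinger code: `phrase_seed`, `initial_noise` and `render_noise` are hypothetical names, the standard-library `random` generator merely stands in for the Gaussian noise a diffusion sampler starts from, and how the seed would actually reach the model is left open.

```python
import hashlib
import random


def phrase_seed(project_seed: int, phrase_index: int) -> int:
    """Derive a stable per-phrase seed from a seed stored in the project file.

    Hashing keeps the result identical across runs and machines
    (unlike Python's built-in hash(), which is salted per process).
    """
    text = f"{project_seed}:{phrase_index}".encode("utf-8")
    digest = hashlib.sha256(text).digest()
    return int.from_bytes(digest[:8], "big")


def initial_noise(seed: int, length: int) -> list[float]:
    """Stand-in for the Gaussian noise a diffusion sampler starts from."""
    rng = random.Random(seed)  # private generator, no global state involved
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def render_noise(project_seed: int, phrase_index: int, length: int = 8) -> list[float]:
    """Noise for one phrase; depends only on the stored seed and the phrase index."""
    return initial_noise(phrase_seed(project_seed, phrase_index), length)


if __name__ == "__main__":
    # Same project seed and phrase: identical noise on every run.
    print(render_noise(42, 0) == render_noise(42, 0))
    # A different phrase gets its own, independent noise.
    print(render_noise(42, 0) == render_noise(42, 1))
```

With a scheme like this, editing one phrase would not disturb the others, and a user who dislikes one take could change only that phrase's seed.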
Explains how to reproduce the bug
Select a DiffSinger-based singer.
Write a piece of score.
Load the predicted pitch curve from the model.
Play the piece to have the audio rendered. Then click Tools > Clear cache in the main window and play again to have the audio re-rendered. Repeat several times and compare the generated audio.
❎ The audio will be different although no changes are made to the score between the runs.
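The comparison in the last step can also be done without listening. The sketch below assumes each run has been exported to its own WAV file (the file names are hypothetical) and counts how many byte-wise distinct files there are. Byte equality is a stricter test than audible equality, but a fully deterministic renderer should pass it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def distinct_renders(paths: list[Path]) -> int:
    """Number of byte-wise different files among the given renders."""
    return len({file_digest(p) for p in paths})


if __name__ == "__main__":
    # Hypothetical exports of the same score, one per run.
    renders = [Path(f"run{i}.wav") for i in range(1, 6)]
    existing = [p for p in renders if p.exists()]
    print(f"{distinct_renders(existing)} distinct file(s) among {len(existing)} render(s)")
```

A result of 1 would mean that every run produced exactly the same audio.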
For example, I am using QiXuan_v2.5.0 with the following piece:
This phrase is taken from a well-known song which is probably already included in the training data, so I have deliberately transposed it up by 2 keys.
Here are the waveforms of 5 samples rendered from this score. They are slightly different.
The differences are in fact audible, especially in the pronunciation of "上"(shang), "说"(shuo) and "想"(xiang). In one of the samples, "上" is almost pronounced as "sheng". An unpredictable chance of such a mistake is unacceptable.
OS & Version
Windows 11 Version 24H2 (OS Build 26100.3323)
Logs
log20250309.txt