I am trying to achieve deterministic and consistent audio generation (fixed speaker, prosody, and emotion) for mixed Chinese-English text. To ensure the output is identical across multiple runs, I set the temperature parameter to 0 (or used greedy search).
However, this leads to the following issues:
Infinite/Long Silence: Instead of the expected speech, the model produces a very long audio file (approx. 5 minutes) that contains only silence.
Slow Inference: Inference takes an unusually long time to complete.
Reproducibility Goal: For context, what I need is exactly the same audio output on every run for the same input text and speaker reference.
Steps to Reproduce:
Use Qwen3-TTS-12Hz-0.6B-Base (or CustomVoice version).
Input a mixed-language text (e.g., "李长乐在今天的音乐meeting上做了一个简短的PPT分享,然后发了一封 email,提醒大家project 的 deadline快到了。2025年10月30日,电话号码是19883593891").
Set temperature=0 (or use greedy search) and run inference.
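To make the reproducibility goal concrete, here is a toy sketch of the behaviour I am after: keeping sampling enabled but fixing the random seed, so that two runs yield the same token sequence. It uses only the Python standard library and is not the Qwen3-TTS API; `sample_tokens`, `toy_logits`, and the seed value are hypothetical names and numbers chosen for illustration. Whether Qwen3-TTS exposes a seed, or produces bit-identical audio under a fixed seed, is an assumption I have not verified.

```python
import math
import random


def sample_tokens(logits, temperature, seed, steps):
    """Draw `steps` token ids from a fixed toy distribution using a seeded RNG."""
    if temperature <= 0:
        # Dividing by zero is undefined; greedy decoding needs a separate argmax path.
        raise ValueError("temperature must be > 0 for sampling")
    rng = random.Random(seed)  # private RNG, independent of global random state
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Unnormalised softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=steps)


# Hypothetical logits, not taken from any real model.
toy_logits = [2.0, 1.5, 0.3, -1.0]

# Same seed and temperature: the two runs produce the same sequence.
run_a = sample_tokens(toy_logits, temperature=0.8, seed=1234, steps=20)
run_b = sample_tokens(toy_logits, temperature=0.8, seed=1234, steps=20)

# Greedy decoding on a fixed distribution always picks the same token,
# which loosely mirrors how a greedy decoder can get stuck repeating itself.
greedy = [max(range(len(toy_logits)), key=toy_logits.__getitem__)] * 20

print(run_a == run_b)
```

In this toy setting the seeded runs match while still drawing varied tokens, whereas the greedy path repeats a single token. If the real model behaves similarly, a fixed seed with a non-zero temperature might be a workable alternative to temperature=0.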