Long silent output (5+ minutes) and slow inference when setting temperature=0 for mixed Chinese-English text

I am trying to achieve deterministic and consistent audio generation (fixed speaker, prosody, and emotion) for mixed Chinese-English text. To ensure the output is identical across multiple runs, I set the temperature parameter to 0 (or used greedy search).
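To make "temperature=0" concrete, here is a minimal, model-independent sketch of the two usual routes to repeatable decoding: greedy argmax, and temperature sampling with a fixed seed. It does not use the Qwen3-TTS API. The `sample_token` and `sample_run` helpers and the toy logits are hypothetical and for illustration only, and whether the model exposes a seed is not assumed here.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token id; temperature <= 0 is treated as greedy argmax (an assumption of this sketch)."""
    if temperature <= 0:
        # Greedy decoding: always the highest-scoring token, no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax weights (shifted by the peak for numerical stability).
    peak = max(logits)
    weights = [math.exp((x - peak) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def sample_run(logits, temperature, seed, steps=20):
    """Draw `steps` tokens from the same toy distribution with a freshly seeded RNG."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(steps)]

# Toy logits, purely illustrative.
logits = [2.0, 1.5, 0.3, -1.0]

# Greedy: identical on every call, regardless of RNG state.
greedy = [sample_token(logits, 0.0, random.Random()) for _ in range(5)]

# Seeded sampling: identical across runs as long as the seed matches.
run_a = sample_run(logits, 0.9, seed=1234)
run_b = sample_run(logits, 0.9, seed=1234)
print(greedy, run_a == run_b)
```

In a real model, bit-identical audio may additionally depend on deterministic GPU kernels and identical preprocessing, which this sketch does not cover.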

However, this leads to the following issues:

    Very long silence: The model generates a very long silent audio file (approx. 5 minutes) instead of the expected speech.

    Slow inference: Generation takes an unusually long time to complete.

Expected behavior: the same input text and speaker reference should produce exactly the same audio on every run.

Steps to Reproduce:

    Use Qwen3-TTS-12Hz-0.6B-Base (or the CustomVoice version).

    Input a mixed-language text (e.g., "李长乐在今天的音乐meeting上做了一个简短的PPT分享，然后发了一封 email，提醒大家project 的 deadline快到了。2025年10月30日，电话号码是19883593891").

    Set temperature=0 (or use greedy search) and generate the audio.
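To quantify the silent output after running these steps, a small check like the following could help. This is a rough sketch, not part of Qwen3-TTS: the `silent_fraction` helper, the 20 ms frame size, and the `1e-3` RMS threshold are all assumptions, and the waveform here is synthetic (1 second of tone followed by 4 seconds of zeros) rather than real model output.

```python
import math

def silent_fraction(samples, sample_rate, frame_ms=20, threshold=1e-3):
    """Return the fraction of fixed-size frames whose RMS amplitude is below `threshold`."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    silent = 0
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            silent += 1
    return silent / len(frames)

# Synthetic stand-in for a generated waveform: float samples in [-1, 1].
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
silence = [0.0] * (4 * sample_rate)

fraction = silent_fraction(tone + silence, sample_rate)
duration_s = len(tone + silence) / sample_rate
print(f"duration: {duration_s:.1f} s, silent frames: {fraction:.0%}")
```

Reporting the total duration and silent fraction for a temperature=0 run next to a default-temperature run would make the difference easy to compare.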
