Long silent output (5+ minutes) and slow inference when setting temperature=0 for mixed Chinese-English text #21

@wangxuefei-25


I am trying to achieve deterministic and consistent audio generation (fixed speaker, prosody, and emotion) for mixed Chinese-English text. To ensure the output is identical across multiple runs, I set the temperature parameter to 0 (or used greedy search).

However, this leads to the following issues:

Infinite/Long Silence: The model generates a very long silent audio file (approx. 5 minutes) instead of the expected speech.

Slow Inference: The inference process takes an unusually long time before completing.

Reproducibility Goal: I need the exact same audio output for the same input text and speaker reference.
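One way to make "exact same audio output" checkable is to compare hashes of the generated audio bytes across runs. The sketch below is only an illustration using the standard library; the helper names `audio_digest` and `runs_are_identical` are hypothetical and not part of Qwen3-TTS, and it assumes the audio from each run is available as raw bytes (for example, read back from the saved file).

```python
import hashlib
from typing import Iterable


def audio_digest(audio_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of one run's audio bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()


def runs_are_identical(outputs: Iterable[bytes]) -> bool:
    """Return True when every run produced byte-identical audio."""
    digests = {audio_digest(chunk) for chunk in outputs}
    return len(digests) <= 1


if __name__ == "__main__":
    # Placeholder bytes stand in for real generated audio (assumption).
    run_a = b"\x00\x01\x02\x03"
    run_b = b"\x00\x01\x02\x03"
    print("identical:", runs_are_identical([run_a, run_b]))
```

Byte-level comparison is strict: two files that sound the same but differ in a single sample or in header metadata will be reported as different.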

Steps to Reproduce:

Use Qwen3-TTS-12Hz-0.6B-Base (or the CustomVoice version).

Input mixed-language text (e.g., "李长乐在今天的音乐meeting上做了一个简短的PPT分享,然后发了一封 email,提醒大家project 的 deadline快到了。2025年10月30日,电话号码是19883593891").

Set temperature=0.
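For context on the last step, here is a toy sketch of how temperature is commonly applied when picking the next token, and of how a fixed random seed can also make sampling repeatable at a non-zero temperature. This is not the Qwen3-TTS decoding code: `pick_token` and `decode` are hypothetical names, the logits are made up, and whether the model exposes a seed or treats temperature=0 as greedy search is an assumption.

```python
import math
import random


def pick_token(logits, temperature, rng):
    """Pick an index from logits; temperature == 0 is treated as greedy argmax."""
    if temperature == 0:
        # Greedy search: always take the highest-scoring index.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def decode(logits, temperature, seed, steps=8):
    """Draw `steps` tokens from fixed toy logits using a seeded generator."""
    rng = random.Random(seed)
    return [pick_token(logits, temperature, rng) for _ in range(steps)]


if __name__ == "__main__":
    toy_logits = [0.1, 2.0, 0.5]  # made-up scores for three tokens
    print("greedy:", decode(toy_logits, temperature=0, seed=1))
    print("seeded:", decode(toy_logits, temperature=0.9, seed=42))
```

In this toy setup the greedy path returns the same index at every step because the logits never change, while the seeded path returns the same sequence on every run with the same seed.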
