In the official Alibaba Qwen3-TTS, both ref_audio and ref_text can be passed to improve cloning quality.
After several comparisons between this implementation and vllm-omni's implementation (which is WAY slower, by the way), vllm-omni's outputs do seem better. With this implementation the voice sounds about the same, but it misses the typical pacing and pronunciation of the source voice.