Summary
The AudioTokenizerDecoder (WavTokenizer vocoder) runs entirely on CPU despite
AudioTokenizerDecoder backend: Vulkan0 being reported. Code generation is fully
GPU-accelerated (~1.5s), but vocoder decode remains at ~17s regardless of backend,
quantization, or GPU used.
The root cause has been identified and confirmed: a backend/buffer mismatch in
gguf_loader.cpp causes the scheduler to fall back to CPU for all vocoder ops.
Environment
- OS: NixOS 25.11 (Linux 6.19.8-xanmod1)
- GPUs: AMD Radeon RX 9070 XT + AMD Radeon AI PRO R9700 (both RDNA4 / gfx1201)
- Vulkan driver: RADV GFX1201 (Mesa), Vulkan 1.4.341
- ROCm version: 7.2.0 (also tested; vocoder decode is just as slow, total RTF 11.0x vs 3.6x with Vulkan)
Benchmark results
Input: "Hallo, ich bin ein Sprachmodell auf deiner Radeon R9700." (~5s audio)
| Backend | Code generation | Vocoder decode | Total | RTF |
|---|---|---|---|---|
| ROCm (HIP), f16 | 42446 ms | 18070 ms | 60516 ms | 11.0x |
| Vulkan, f16 | 1743 ms | 17016 ms | 18760 ms | 3.6x |
| Vulkan, q4_k / q8_0 | 1513 ms | 18348 ms | 19861 ms | 3.5x |
Code generation is about 24x faster with Vulkan than with ROCm (1743 ms vs 42446 ms at f16).
Vocoder decode stays at 17-18 s across all backends, GPUs, and quantization levels.
Root cause (confirmed)
The bug: load_tensor_data_from_file creates a temporary backend
In src/gguf_loader.cpp, load_tensor_data_from_file allocates tensor memory
using a temporary backend that it frees as soon as the data is loaded:
// gguf_loader.cpp
ggml_backend_t backend = ggml_backend_init_by_type(preferred_backend_type, nullptr);
// ...
buffer = ggml_backend_alloc_ctx_tensors(model_ctx, backend);
// ... load data ...
ggml_backend_free(backend); // ← temporary backend freed here
Then AudioTokenizerDecoder::load_model() calls init_preferred_backend(), which
returns a different backend object (the shared backend from get_shared_backend_state()).
The GGML backend scheduler checks: does the tensor's buffer belong to the current
compute backend? Since the buffer was allocated on the freed temporary backend and
the compute backend is a different object, the answer is no → CPU fallback for
all ops.
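The mismatch described above can be illustrated with a toy model. This is only a sketch of the explanation given in this report, not ggml's actual scheduler logic: ToyBackend, ToyBuffer and place_op are hypothetical stand-ins, and the placement rule is reduced to a single object-identity check.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for illustration only; none of these types are ggml's.
struct ToyBackend { std::string name; };
struct ToyBuffer  { const ToyBackend* owner; };  // backend the buffer was allocated with

// Simplified placement rule as described in this report: an op stays on the
// compute backend only if its weights were allocated with that same object.
inline std::string place_op(const ToyBuffer& weights, const ToyBackend& compute) {
    assert(weights.owner != nullptr);
    return weights.owner == &compute ? compute.name : "CPU";
}

// Vocoder path: weights allocated with a temporary backend, compute on the shared one.
inline std::string vocoder_placement() {
    ToyBackend temporary{"Vulkan0"};
    ToyBackend shared{"Vulkan0"};
    ToyBuffer weights{&temporary};
    return place_op(weights, shared);
}

// Transformer path: weights allocated with the backend that also runs the graph.
inline std::string transformer_placement() {
    ToyBackend shared{"Vulkan0"};
    ToyBuffer weights{&shared};
    return place_op(weights, shared);
}
```

Both toy backends carry the name Vulkan0, which mirrors why the reported backend name alone does not reveal the problem.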
Proof via GGML_SCHED_DEBUG=2
$ GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Test." -o /dev/null 2>&1 \
| grep "## SPLIT" | tail -3
## SPLIT #0: Vulkan0 # 2 inputs: [inp_code] [inp_pos] ← code predictor
## SPLIT #0: Vulkan0 # 2 inputs: [inp_step_embd] [inp_pos]
## SPLIT #0: CPU # 0 inputs ← ENTIRE vocoder on CPU
All transformer/code predictor graphs run on Vulkan0. The vocoder graph — the
very last split — runs on CPU.
Why the transformer works but the vocoder doesn't
The transformer uses a different loading path in tts_transformer.cpp:
// tts_transformer.cpp — CORRECT
ggml_backend_t backend = init_preferred_backend("TTSTransformer", &error_msg_);
model_.buffer = ggml_backend_alloc_ctx_tensors(model_.ctx, backend);
// buffer and compute backend are the SAME object ✓
The vocoder uses load_tensor_data_from_file which creates its own temporary
backend — not the shared backend returned by init_preferred_backend.
Suggested fix
Change load_tensor_data_from_file to use init_preferred_backend /
release_preferred_backend instead of ggml_backend_init_by_type /
ggml_backend_free, so all components share the same backend object.
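One way such an init/release pair can guarantee a single backend object is reference counting. The sketch below is an assumption about the pattern, not the project's implementation: acquire_backend, release_backend, shared_state and ToyBackend are hypothetical names, and the real init_preferred_backend / release_preferred_backend may work differently.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for a compute backend; not a ggml type.
struct ToyBackend { std::string name; };

// Hypothetical shared state: one backend instance plus a reference count.
struct SharedBackendState {
    std::unique_ptr<ToyBackend> backend;
    int refs = 0;
};

inline SharedBackendState& shared_state() {
    static SharedBackendState state;
    return state;
}

// Hands out the shared backend, creating it on first use.
inline ToyBackend* acquire_backend() {
    SharedBackendState& s = shared_state();
    if (!s.backend) {
        s.backend = std::make_unique<ToyBackend>(ToyBackend{"Vulkan0"});
    }
    ++s.refs;
    return s.backend.get();
}

// Drops one reference; the backend is destroyed only after the last release.
inline void release_backend() {
    SharedBackendState& s = shared_state();
    assert(s.refs > 0);
    if (--s.refs == 0) {
        s.backend.reset();
    }
}
```

With this shape, a loader that acquires and releases the backend around tensor allocation gets the same pointer the decoder later computes on, provided the decoder holds its own reference.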
Additionally, AudioTokenizerDecoder::load_model() must save the normalized
codebook data before ggml_backend_alloc_ctx_tensors (which moves tensors to
GPU memory), and re-upload it afterwards — because t->data becomes a GPU
pointer after allocation and can no longer be used as a host source.
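The save-then-re-upload order can be sketched as follows. ToyTensor, toy_alloc_on_device and toy_upload are hypothetical stand-ins used to keep the example self-contained; in ggml the upload step would normally go through ggml_backend_tensor_set, which copies from a host pointer into a backend tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy tensor: `host` is readable only until the tensor is moved to the device.
struct ToyTensor {
    std::vector<float> host;    // host-side data before allocation
    std::vector<float> device;  // stands in for GPU memory
    bool on_device = false;
};

// Stand-in for device allocation: afterwards the host view is gone.
inline void toy_alloc_on_device(ToyTensor& t) {
    t.device.assign(t.host.size(), 0.0f);
    t.host.clear();
    t.on_device = true;
}

// Stand-in for an explicit host-to-device upload.
inline void toy_upload(ToyTensor& t, const std::vector<float>& src) {
    assert(t.on_device && src.size() == t.device.size());
    std::copy(src.begin(), src.end(), t.device.begin());
}

inline void load_codebook(ToyTensor& codebook) {
    const std::vector<float> staged = codebook.host;  // 1. save while host-readable
    toy_alloc_on_device(codebook);                    // 2. tensor now lives on the device
    toy_upload(codebook, staged);                     // 3. re-upload from the saved copy
}
```

The staged copy costs one extra host buffer per codebook during loading and can be dropped as soon as the upload returns.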
Expected impact
Once the vocoder weights are allocated in a buffer that belongs to the compute
backend, the scheduler should dispatch all vocoder ops to Vulkan. Based on the CUDA
benchmark in issue #10 (vocoder decode: 899ms on CUDA), vocoder decode is expected
to drop from ~17s to ~1-2s, bringing total RTF below 1.0x (real-time).
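The projection can be checked with a few lines of arithmetic. The sketch assumes RTF is total processing time divided by audio duration, which roughly matches the table (about 18.8 s for about 5 s of audio gives 3.6x to 3.8x); the function names are illustrative.

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration.
// Values below 1.0 mean synthesis is faster than real time.
inline double real_time_factor(double processing_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return processing_ms / audio_ms;
}

// Projected total: measured code generation plus an assumed vocoder decode time.
inline double projected_rtf(double codegen_ms, double vocoder_ms, double audio_ms) {
    return real_time_factor(codegen_ms + vocoder_ms, audio_ms);
}
```

With ~1.5 s of code generation and ~5 s of audio, a 1-2 s vocoder decode gives an RTF of roughly 0.5 to 0.7, while the current ~17 s decode gives about 3.7.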
Reproduction
# Build with Vulkan
cmake -S ggml -B ggml/build -G Ninja -DGGML_VULKAN=ON \
-DVulkan_GLSLC_EXECUTABLE="$(which glslc)" -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build --target ggml-vulkan -j$(nproc)
cmake --build ggml/build -j$(nproc)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Confirm: last SPLIT is CPU (vocoder)
GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Hello." -o /dev/null 2>&1 \
| grep "## SPLIT" | tail -3