Vocoder decode runs on CPU despite Vulkan backend: backend/buffer mismatch in load_tensor_data_from_file #20

@staubsauger

Summary

The AudioTokenizerDecoder (WavTokenizer vocoder) runs entirely on CPU even though the log reports "AudioTokenizerDecoder backend: Vulkan0". Code generation is fully GPU-accelerated (~1.5s), but vocoder decode stays at ~17-18s regardless of backend, quantization, or GPU used.

The root cause has been identified and confirmed: a backend/buffer mismatch in gguf_loader.cpp causes the scheduler to fall back to CPU for all vocoder ops.

Environment

  • OS: NixOS 25.11 (Linux 6.19.8-xanmod1)
  • GPUs: AMD Radeon RX 9070 XT + AMD Radeon AI PRO R9700 (both RDNA4 / gfx1201)
  • Vulkan driver: RADV GFX1201 (Mesa), Vulkan 1.4.341
  • ROCm version: 7.2.0 (also tested; vocoder decode time is the same, total RTF is 11.0x vs 3.6x with Vulkan)

Benchmark results

Input: "Hallo, ich bin ein Sprachmodell auf deiner Radeon R9700." (~5s audio)

| Backend | Code generation | Vocoder decode | Total | RTF |
|---|---|---|---|---|
| ROCm (HIP), f16 | 42446 ms | 18070 ms | 60516 ms | 11.0x |
| Vulkan, f16 | 1743 ms | 17016 ms | 18760 ms | 3.6x |
| Vulkan, q4_k / q8_0 | 1513 ms | 18348 ms | 19861 ms | 3.5x |

Code generation is about 24x faster with Vulkan than with ROCm (1743 ms vs 42446 ms at f16). Vocoder decode stays at 17-18 s across all backends, GPUs, and quantization levels.

Root cause (confirmed)

The bug: load_tensor_data_from_file creates a temporary backend

In src/gguf_loader.cpp, load_tensor_data_from_file allocates tensor memory using a temporary backend that it immediately frees:

// gguf_loader.cpp
ggml_backend_t backend = ggml_backend_init_by_type(preferred_backend_type, nullptr);
// ...
buffer = ggml_backend_alloc_ctx_tensors(model_ctx, backend);
// ... load data ...
ggml_backend_free(backend);  // ← temporary backend freed here

Then AudioTokenizerDecoder::load_model() calls init_preferred_backend(), which returns a different backend object (the shared backend from get_shared_backend_state()).

The GGML backend scheduler checks: does the tensor's buffer belong to the current compute backend? Since the buffer was allocated on the freed temporary backend and the compute backend is a different object, the answer is no → CPU fallback for all ops.
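The check described above can be pictured with a small toy model. This is not ggml code: `ToyBackend`, `ToyBuffer`, and `pick_backend` are hypothetical names, and the real scheduler logic is more involved. The sketch only mirrors the behavior reported in this issue, where two backend objects with the same name are still treated as different backends.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins, NOT ggml types: all names here are hypothetical.
struct ToyBackend {
    std::string name;
};

struct ToyBuffer {
    const ToyBackend* owner;  // backend object the buffer was allocated with
};

// Mimics the behavior described in this issue: ops stay on the compute
// backend only if the weight buffer belongs to that same backend object.
std::string pick_backend(const ToyBuffer& weights, const ToyBackend& compute) {
    assert(weights.owner != nullptr);
    return weights.owner == &compute ? compute.name : "CPU";
}

// Vocoder-style path: allocate with a temporary backend, compute on another.
std::string vocoder_path() {
    ToyBackend temporary{"Vulkan0"};
    ToyBackend shared{"Vulkan0"};
    ToyBuffer weights{&temporary};
    return pick_backend(weights, shared);  // "CPU": same name, different object
}

// Transformer-style path: allocate and compute with the same backend object.
std::string transformer_path() {
    ToyBackend shared{"Vulkan0"};
    ToyBuffer weights{&shared};
    return pick_backend(weights, shared);  // "Vulkan0"
}
```

In this model `vocoder_path()` yields "CPU" while `transformer_path()` yields "Vulkan0", which matches the split output reported in this issue.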

Proof via GGML_SCHED_DEBUG=2

$ GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Test." -o /dev/null 2>&1 \
    | grep "## SPLIT" | tail -3

## SPLIT #0: Vulkan0 # 2 inputs: [inp_code] [inp_pos] ← code predictor
## SPLIT #0: Vulkan0 # 2 inputs: [inp_step_embd] [inp_pos]
## SPLIT #0: CPU # 0 inputs ← ENTIRE vocoder on CPU

All transformer/code predictor graphs run on Vulkan0. The vocoder graph — the very last split — runs on CPU.
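For longer logs it can help to pull the backend name out of each split line automatically. The helper below is a minimal sketch that assumes only the line shape shown in this issue (`## SPLIT #<n>: <backend> # <k> inputs`); `split_backends` is a hypothetical name and not part of the project or of ggml.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Extracts the backend name from each "## SPLIT #<n>: <backend> # <k> inputs"
// line. Lines that do not contain the marker are ignored.
std::vector<std::string> split_backends(const std::string& log) {
    std::vector<std::string> backends;
    std::istringstream in(log);
    std::string line;
    const std::string marker = "## SPLIT #";
    while (std::getline(in, line)) {
        const std::size_t start = line.find(marker);
        if (start == std::string::npos) continue;
        const std::size_t colon = line.find(": ", start);
        if (colon == std::string::npos) continue;
        const std::size_t name_begin = colon + 2;
        const std::size_t name_end = line.find(' ', name_begin);
        // substr clamps the length, so a missing trailing space is fine.
        backends.push_back(line.substr(name_begin, name_end - name_begin));
    }
    return backends;
}
```

If the last element of the returned vector is "CPU", the final graph (the vocoder in this issue) was scheduled on the CPU.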

Why the transformer works but the vocoder doesn't

The transformer uses a different loading path in tts_transformer.cpp:

// tts_transformer.cpp — CORRECT
ggml_backend_t backend = init_preferred_backend("TTSTransformer", &error_msg_);
model_.buffer = ggml_backend_alloc_ctx_tensors(model_.ctx, backend);
// buffer and compute backend are the SAME object ✓

The vocoder uses load_tensor_data_from_file which creates its own temporary backend — not the shared backend returned by init_preferred_backend.

Suggested fix

Change load_tensor_data_from_file to use init_preferred_backend / release_preferred_backend instead of ggml_backend_init_by_type / ggml_backend_free, so all components share the same backend object.
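The idea behind sharing one backend object can be sketched as a small reference-counted holder. This is an illustration only: `SharedBackendState`, `acquire`, and `release` are hypothetical names, and whether the project's init_preferred_backend / release_preferred_backend actually count users this way is an assumption.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a backend handle; NOT a ggml type.
struct ToyBackend {
    std::string name;
};

// Hypothetical process-wide state: one backend object plus a user count.
class SharedBackendState {
public:
    // Returns the single shared backend, creating it on first use.
    ToyBackend* acquire() {
        if (users_ == 0) backend_ = new ToyBackend{"Vulkan0"};
        ++users_;
        return backend_;
    }

    // Drops one user; the backend is destroyed only when nobody uses it.
    void release() {
        assert(users_ > 0 && "release without matching acquire");
        if (--users_ == 0) {
            delete backend_;
            backend_ = nullptr;
        }
    }

    int users() const { return users_; }

private:
    ToyBackend* backend_ = nullptr;
    int users_ = 0;
};
```

With this pattern the loader and the decoder both receive the same pointer, so a buffer allocated by the loader still belongs to the backend the decoder computes on, and releasing the loader's reference does not destroy it.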

Additionally, AudioTokenizerDecoder::load_model() must save the normalized codebook data before ggml_backend_alloc_ctx_tensors (which moves tensors to GPU memory), and re-upload it afterwards — because t->data becomes a GPU pointer after allocation and can no longer be used as a host source.
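The save-then-re-upload order can be sketched with a toy tensor. Nothing here is ggml code: `ToyTensor`, `ToyDevice`, and `move_to_device` are hypothetical, and the toy "device" is ordinary host memory so the example stays self-contained. In ggml the upload step would normally go through ggml_backend_tensor_set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy tensor, NOT ggml: `data` is a host pointer before allocation and is
// treated as an opaque device pointer afterwards.
struct ToyTensor {
    float* data = nullptr;
    std::size_t count = 0;
    bool on_device = false;
};

// Toy "device memory": host code is expected to write it only via upload().
struct ToyDevice {
    std::vector<float> memory;

    // Rebinds the tensor to device storage; old host contents are NOT copied.
    void alloc(ToyTensor& t) {
        memory.assign(t.count, 0.0f);
        t.data = memory.data();
        t.on_device = true;
    }

    // Explicit host-to-device copy, playing the role of a tensor-set call.
    void upload(ToyTensor& t, const std::vector<float>& host) {
        assert(t.on_device && host.size() == t.count);
        std::memcpy(t.data, host.data(), host.size() * sizeof(float));
    }
};

// Step 1: snapshot host data. Step 2: allocate. Step 3: re-upload the snapshot.
void move_to_device(ToyTensor& t, ToyDevice& dev) {
    assert(!t.on_device);
    const std::vector<float> staged(t.data, t.data + t.count);  // copy while still on host
    dev.alloc(t);
    dev.upload(t, staged);
}
```

The important detail is that the snapshot is taken before `alloc` rebinds the pointer; taking it afterwards would read device storage instead of the normalized codebook.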

Expected impact

If the vocoder weights land on the same GPU buffer as the compute backend, the scheduler should dispatch all vocoder ops to Vulkan. Based on the CUDA benchmark in issue #10 (vocoder decode: 899ms on CUDA), the expected improvement is from ~17s to ~1-2s vocoder decode, bringing total RTF below 1.0x (real-time).
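The RTF estimate can be checked with a few lines of arithmetic. The sketch below assumes RTF means processing time divided by audio duration, and it derives the audio duration from the rounded RTF in the benchmark table, so the results are approximate. The function names are hypothetical.

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration (same unit).
double rtf(double processing_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return processing_ms / audio_ms;
}

// Audio duration implied by a measured total time and its reported RTF.
double implied_audio_ms(double total_ms, double reported_rtf) {
    assert(reported_rtf > 0.0);
    return total_ms / reported_rtf;
}

// Projected RTF if only the vocoder decode time changes.
double projected_rtf(double codegen_ms, double vocoder_ms, double audio_ms) {
    return rtf(codegen_ms + vocoder_ms, audio_ms);
}
```

For the Vulkan f16 row, `implied_audio_ms(18760, 3.6)` gives roughly 5.2 s of audio. Plugging in the measured 1743 ms of code generation and the estimated 1000 to 2000 ms of vocoder decode gives a projected RTF of roughly 0.5x to 0.7x.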

Reproduction

# Build with Vulkan
cmake -S ggml -B ggml/build -G Ninja -DGGML_VULKAN=ON \
  -DVulkan_GLSLC_EXECUTABLE="$(which glslc)" -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build --target ggml-vulkan -j$(nproc)
cmake --build ggml/build -j$(nproc)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Confirm: last SPLIT is CPU (vocoder)
GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Hello." -o /dev/null 2>&1 \
    | grep "## SPLIT" | tail -3
