Vocoder decode runs on CPU despite Vulkan backend: backend/buffer mismatch in load_tensor_data_from_file

<h2>Summary</h2>
<p>The <code>AudioTokenizerDecoder</code> (WavTokenizer vocoder) runs entirely on CPU even though
<code>AudioTokenizerDecoder backend: Vulkan0</code> is reported. Code generation is fully
GPU-accelerated (~1.5s), but vocoder decode stays at ~17-18s regardless of backend,
quantization, or GPU used.</p>
<p>The root cause has been identified and confirmed: <strong>a backend/buffer mismatch in
<code>gguf_loader.cpp</code> causes the scheduler to fall back to CPU for all vocoder ops.</strong></p>
<h2>Environment</h2>
<ul>
<li><strong>OS:</strong> NixOS 25.11 (Linux 6.19.8-xanmod1)</li>
<li><strong>GPUs:</strong> AMD Radeon RX 9070 XT + AMD Radeon AI PRO R9700 (both RDNA4 / gfx1201)</li>
<li><strong>Vulkan driver:</strong> RADV GFX1201 (Mesa), Vulkan 1.4.341</li>
<li><strong>ROCm version:</strong> 7.2.0 (also tested; vocoder decode is just as slow, and total RTF is 11.0x vs 3.6x with Vulkan)</li>
</ul>
<h2>Benchmark results</h2>
<p>Input: <code>"Hallo, ich bin ein Sprachmodell auf deiner Radeon R9700."</code> (~5s audio)</p>

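<table>
<thead>
<tr><th>Backend</th><th>Code generation</th><th>Vocoder decode</th><th>Total</th><th>RTF</th></tr>
</thead>
<tbody>
<tr><td>ROCm (HIP), f16</td><td>42446 ms</td><td>18070 ms</td><td>60516 ms</td><td>11.0x</td></tr>
<tr><td>Vulkan, f16</td><td>1743 ms</td><td>17016 ms</td><td>18760 ms</td><td>3.6x</td></tr>
<tr><td>Vulkan, q4_k / q8_0</td><td>1513 ms</td><td>18348 ms</td><td>19861 ms</td><td>3.5x</td></tr>
</tbody>
</table>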


<p>Code generation is about <strong>24x faster</strong> with Vulkan than with ROCm
(1743 ms vs 42446 ms at f16). Vocoder decode stays at 17-18 s across
all backends, GPUs, and quantization levels.</p>
<h2>Root cause (confirmed)</h2>
<h3>The bug: load_tensor_data_from_file creates a temporary backend</h3>
<p>In <code>src/gguf_loader.cpp</code>, <code>load_tensor_data_from_file</code> allocates tensor memory
using a <strong>temporary</strong> backend that it frees as soon as the data is loaded:</p>
<pre><code class="language-cpp">// gguf_loader.cpp
ggml_backend_t backend = ggml_backend_init_by_type(preferred_backend_type, nullptr);
// ...
buffer = ggml_backend_alloc_ctx_tensors(model_ctx, backend);
// ... load data ...
ggml_backend_free(backend);  // ← temporary backend freed here
</code></pre>
<p>Then <code>AudioTokenizerDecoder::load_model()</code> calls <code>init_preferred_backend()</code>, which
returns a <strong>different backend object</strong> (the shared backend from <code>get_shared_backend_state()</code>).</p>
<p>The GGML backend scheduler checks whether each tensor's buffer belongs to the current
compute backend. Because the buffer was allocated on the temporary backend, which has
since been freed, and the compute backend is a different object, the check
<strong>fails</strong> and the scheduler falls back to CPU for all ops.</p>
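<p>The check described above can be illustrated with a small toy model. This is only a
sketch of the reasoning in this issue, not ggml's scheduler code: <code>ToyBackend</code>,
<code>ToyBuffer</code>, and <code>pick_backend</code> are hypothetical names, and the real
scheduler logic may differ in detail.</p>

```cpp
#include <cassert>
#include <string>

// Toy stand-ins, NOT ggml types: they only model the ownership check
// described in this issue.
struct ToyBackend {
    std::string name;  // e.g. "Vulkan0"
};

struct ToyBuffer {
    const ToyBackend* owner;  // backend object that allocated the buffer
};

// Returns the name of the backend an op would be scheduled on: the compute
// backend if it owns the weight buffer, otherwise the CPU fallback.
std::string pick_backend(const ToyBuffer& weights,
                         const ToyBackend& compute,
                         const ToyBackend& cpu) {
    assert(weights.owner != nullptr);
    // Identity comparison: two backends with the same name are still
    // different objects, which is the situation reported here.
    return weights.owner == &compute ? compute.name : cpu.name;
}

// Vocoder path as reported: weights allocated on a temporary backend.
std::string schedule_vocoder_buggy() {
    ToyBackend cpu{"CPU"};
    ToyBackend temporary{"Vulkan0"};  // created inside the loader
    ToyBackend shared{"Vulkan0"};     // used later for compute
    ToyBuffer weights{&temporary};
    return pick_backend(weights, shared, cpu);
}

// Transformer path: buffer and compute use the same backend object.
std::string schedule_transformer() {
    ToyBackend cpu{"CPU"};
    ToyBackend shared{"Vulkan0"};
    ToyBuffer weights{&shared};
    return pick_backend(weights, shared, cpu);
}
```

<p>In this model <code>schedule_vocoder_buggy()</code> yields <code>"CPU"</code> even though both
backends are named <code>Vulkan0</code>, while <code>schedule_transformer()</code> yields
<code>"Vulkan0"</code>.</p>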
<h3>Proof via GGML_SCHED_DEBUG=2</h3>
<pre><code>$ GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Test." -o /dev/null 2&gt;&amp;1 \
    | grep "## SPLIT" | tail -3

## SPLIT #0: Vulkan0 # 2 inputs: [inp_code] [inp_pos]   ← code predictor
## SPLIT #0: Vulkan0 # 2 inputs: [inp_step_embd] [inp_pos]
## SPLIT #0: CPU # 0 inputs                              ← ENTIRE vocoder on CPU
</code></pre>
<p>All transformer/code predictor graphs run on Vulkan0. The vocoder graph — the
very last split — runs on CPU.</p>
<h3>Why the transformer works but the vocoder doesn't</h3>
<p>The transformer uses a different loading path in <code>tts_transformer.cpp</code>:</p>
<pre><code class="language-cpp">// tts_transformer.cpp — CORRECT
ggml_backend_t backend = init_preferred_backend("TTSTransformer", &amp;error_msg_);
model_.buffer = ggml_backend_alloc_ctx_tensors(model_.ctx, backend);
// buffer and compute backend are the SAME object ✓
</code></pre>
<p>The vocoder uses <code>load_tensor_data_from_file</code> which creates its own temporary
backend — <strong>not</strong> the shared backend returned by <code>init_preferred_backend</code>.</p>
<h2>Suggested fix</h2>
<p>Change <code>load_tensor_data_from_file</code> to use <code>init_preferred_backend</code> /
<code>release_preferred_backend</code> instead of <code>ggml_backend_init_by_type</code> /
<code>ggml_backend_free</code>, so all components share the same backend object.</p>
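<p>The acquire/release pattern behind this suggestion might look roughly like the sketch
below. It is a toy model under stated assumptions: <code>ToyBackend</code>,
<code>acquire_shared_backend</code>, and <code>load_weights_on_shared_backend</code> are
hypothetical stand-ins, and the actual signatures of <code>init_preferred_backend</code> and
<code>release_preferred_backend</code> in this repository may differ.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for a ggml backend handle; NOT the real type.
struct ToyBackend {
    std::string name;
};

// Hypothetical counterpart of init_preferred_backend: every caller gets
// the same backend object for as long as at least one user holds it.
std::shared_ptr<ToyBackend> acquire_shared_backend() {
    static std::weak_ptr<ToyBackend> cached;
    std::shared_ptr<ToyBackend> backend = cached.lock();
    if (!backend) {
        backend = std::make_shared<ToyBackend>(ToyBackend{"Vulkan0"});
        cached = backend;
    }
    return backend;
}

// Loader after the fix: allocate on the shared backend instead of
// creating and freeing a private one.
std::shared_ptr<ToyBackend> load_weights_on_shared_backend() {
    std::shared_ptr<ToyBackend> backend = acquire_shared_backend();
    assert(backend != nullptr);
    // ... allocate tensors and upload data on `backend` ...
    return backend;  // dropping the shared_ptr plays the role of release
}

// True when loader and compute end up on the same backend object.
bool loader_and_compute_share_backend() {
    std::shared_ptr<ToyBackend> compute = acquire_shared_backend();
    std::shared_ptr<ToyBackend> loader = load_weights_on_shared_backend();
    return compute.get() == loader.get();
}
```

<p>Reference counting is used here so the backend outlives the loader as long as the
decoder still holds it, which is the property the temporary backend lacks.</p>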
<p>Additionally, <code>AudioTokenizerDecoder::load_model()</code> must save the normalized
codebook data before <code>ggml_backend_alloc_ctx_tensors</code> (which moves tensors to
GPU memory), and re-upload it afterwards — because <code>t-&gt;data</code> becomes a GPU
pointer after allocation and can no longer be used as a host source.</p>
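<p>The save-then-re-upload order can be sketched with a toy tensor and a toy device.
Everything here is hypothetical (<code>ToyTensor</code>, <code>ToyDevice</code>,
<code>roundtrip_codebook</code>). In ggml the upload step would presumably go through
<code>ggml_backend_tensor_set</code>, but the exact placement in
<code>load_model()</code> is an assumption.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy tensor: `data` is host memory before allocation and stands for an
// opaque device pointer afterwards. NOT ggml's tensor type.
struct ToyTensor {
    void* data;
    std::size_t nbytes;
};

// Toy "device" whose memory starts out zeroed, so any host data that was
// not saved beforehand is lost once the tensor is repointed.
struct ToyDevice {
    std::vector<unsigned char> memory;

    // Stand-in for backend allocation: repoints t.data at device memory.
    void alloc(ToyTensor& t) {
        memory.assign(t.nbytes, 0);
        t.data = memory.data();
    }

    // Stand-in for an explicit host-to-device upload.
    void upload(ToyTensor& t, const void* host, std::size_t nbytes) {
        assert(nbytes <= t.nbytes);
        std::memcpy(t.data, host, nbytes);
    }
};

// Runs the three steps and returns what ends up in device memory.
std::vector<float> roundtrip_codebook(std::vector<float> normalized) {
    assert(!normalized.empty());
    ToyTensor codebook{normalized.data(), normalized.size() * sizeof(float)};

    // 1. Save a host copy while `data` still points at host memory.
    std::vector<float> host_copy(normalized.size());
    std::memcpy(host_copy.data(), codebook.data, codebook.nbytes);

    // 2. Allocation repoints `data`; it is no longer a host source.
    ToyDevice device;
    device.alloc(codebook);

    // 3. Re-upload the saved values into the device buffer.
    device.upload(codebook, host_copy.data(), codebook.nbytes);

    // Read back for inspection (the toy device is host-readable).
    std::vector<float> result(normalized.size());
    std::memcpy(result.data(), codebook.data, codebook.nbytes);
    return result;
}
```

<p>Skipping step 1 in this model leaves the device buffer zeroed, which mirrors the
failure mode the paragraph above warns about.</p>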
<h2>Expected impact</h2>
<p>If the vocoder weights land in a buffer that belongs to the compute backend,
the scheduler should dispatch all vocoder ops to Vulkan. Based on the CUDA
benchmark in issue #10 (vocoder decode: 899ms on CUDA), the expected improvement
is from ~17s to ~1-2s vocoder decode, bringing total RTF below 1.0x (faster than real time).</p>
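<p>The RTF estimate can be checked with a few lines of arithmetic. This sketch assumes RTF
is processing time divided by audio duration and takes the "~5s" audio length as exactly
5000 ms, which is an approximation. The function names are illustrative only.</p>

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration.
// Values below 1.0 mean faster than real time.
double real_time_factor(double processing_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return processing_ms / audio_ms;
}

// Projected RTF for the Vulkan f16 run if vocoder decode takes
// `vocoder_ms`. 1743 ms is the measured code generation time from the
// benchmark table; 5000 ms is the approximate audio length (assumption).
double projected_rtf(double vocoder_ms) {
    const double code_generation_ms = 1743.0;
    const double audio_ms = 5000.0;
    return real_time_factor(code_generation_ms + vocoder_ms, audio_ms);
}
```

<p>With a 2000 ms vocoder decode this gives about 0.75, and with 1000 ms about 0.55, so
both ends of the expected range land below 1.0 under these assumptions.</p>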
<h2>Reproduction</h2>
<pre><code class="language-bash"># Build with Vulkan
cmake -S ggml -B ggml/build -G Ninja -DGGML_VULKAN=ON \
  -DVulkan_GLSLC_EXECUTABLE="$(which glslc)" -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build --target ggml-vulkan -j$(nproc)
cmake --build ggml/build -j$(nproc)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Confirm: last SPLIT is CPU (vocoder)
GGML_SCHED_DEBUG=2 ./build/qwen3-tts-cli -m models -t "Hello." -o /dev/null 2&gt;&amp;1 \
  | grep "## SPLIT" | tail -3
</code></pre></body></html>
