CUDA crash during mmproj/clip image encode on discrete NVIDIA GPU #230

@TheFatCatCompany

Summary

Multimodal image description (vision GGUF + mmproj projector) crashes the process
with "Lost connection to device" during the first native image encode
(mtmd_helper_eval_chunks) on a discrete NVIDIA GPU. The same code path works on
version 0.6.12 and appears to have regressed somewhere in 0.8.0 → 0.8.3 (native pin
b9587 → b9694). I have not narrowed this down further yet, since I only moved to
0.8.x to try out MTP and still need to test more.

It is worth mentioning that MTP and llamadart chat sessions work wonderfully when I use text only.

The crash is not caused by:

  • speculative decoding / MTP (disabled, still crashes),
  • enableThinking (forced false, still crashes),
  • model flash-attention (FlashAttention.disabled on model load, still crashes),
  • GPU selection (pinned mainGpu to the 4080, still crashes),
  • running the projector on CPU vs GPU (CPU use_gpu=false also crashes).

At first I thought it could be related to FlashAttention or CUDA, but the run with everything forced to CPU crashed too.

Environment

llamadart: 0.8.1 / 0.8.2 / 0.8.3 (all reproduce)
Known-good: 0.6.12
Native pin: leehack/llamadart-native@b9694 (0.8.2+), b9587 (0.8.0)
OS: Windows 11 10.0.26200 (x64)
GPU: NVIDIA GeForce RTX 4080 Laptop (12 GB), CUDA backend
Also present (but unused): Intel Iris Xe
Backends bundled: [cuda, vulkan] (Windows), CPU
Model: Unsloth Gemma-4-E4B-It 4-bit quant (tried after the 26B MoE, since I originally thought it could be OOM)
Projector: gemma4v vision + gemma4a audio mmproj

Reproduction

  1. Load the Gemma-4 vision GGUF with GpuBackend.auto/cuda, all layers offloaded to GPU.
  2. loadMultimodalProjector(mmproj.gguf) — succeeds; logs clip_ctx: CLIP using CUDA0 backend.
  3. engine.create([LlamaImageContent(path), LlamaTextContent(prompt)], enableThinking: false).
  4. Stream the result. The process dies with "Lost connection to device" about 5 s in,
    before the first token, during the native image encode.

What I tried

Attempt | Result
Disable MTP | no change
Force enableThinking: false for vision | no change
FlashAttention.disabled on model load | no change
GpuBackend.vulkan for the model | model on Vulkan → crash
Pin mainGpu to discrete RTX 4080 (splitMode.none) | correct GPU, still crash
Force whole vision engine to CPU (gpuLayers: 0, mtmd use_gpu=false) | crash again

Asks for upstream

Honestly, I hope I missed something myself. If anybody has the time, I'd like to know whether you can reproduce what I've seen, and whether you can get image input working (or not) in an environment similar to mine.

Local workaround

I just pinned my pubspec to 0.6.12 for now, and commented out the MTP paths so I can hopefully re-enable them later once I can get images working. Not all that fancy, but it is what it is, I suppose.
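
For anyone who wants the same stopgap, the pin looks roughly like this. It is a minimal sketch that assumes the package is resolved from pub.dev under the name llamadart and that nothing else in the dependency tree requires 0.8.x; the rest of the pubspec is omitted.

```yaml
# pubspec.yaml (fragment)
dependencies:
  # Exact version, no caret, so the resolver cannot move to 0.8.x.
  llamadart: 0.6.12
```

Writing the version without a caret matters here: ^0.6.12 would still allow later 0.6.x releases, while the bare version holds the dependency at the one release I know works.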
