Summary
Multimodal image describe (vision GGUF + mmproj projector) crashes the process with Lost connection to device during the first native image encode (mtmd_helper_eval_chunks) on a discrete NVIDIA GPU. The same code path works on 0.6.12 and regressed somewhere in 0.8.0 → 0.8.3 (native pin b9587 → b9694). I think that is the range, but I need to test more, since I only moved to 0.8.x to try out MTP.
It is worth mentioning that MTP and llamadart chat sessions seem to work wonderfully when I am just using text.
The crash is not caused by:
- speculative decoding / MTP (disabled, still crashes),
- enableThinking (forced false, still crashes),
- model flash-attention (FlashAttention.disabled on model load, still crashes),
- GPU selection (pinned mainGpu to the 4080, still crashes),
- running the projector on CPU vs GPU (CPU use_gpu=false also crashes).
So at first I thought it could be something related to FlashAttention or CUDA, but running on CPU crashed too.
Environment
| Component | Details |
|---|---|
| llamadart | 0.8.1 / 0.8.2 / 0.8.3 (all reproduce) |
| Known-good | 0.6.12 |
| Native pin | leehack/llamadart-native@b9694 (0.8.2+), b9587 (0.8.0) |
| OS | Windows 11 10.0.26200 (x64) |
| GPU | NVIDIA GeForce RTX 4080 Laptop (12 GB), CUDA backend |
| Also present (but unused) | Intel Iris Xe |
| Backends bundled | [cuda, vulkan] (Windows), CPU |
| Model | Unsloth Gemma-4-E4B-It 4-bit quant, after trying 26B MoE since I originally thought it could be OOM |
| Projector | gemma4v vision + gemma4a audio mmproj |
Reproduction
- Load the Gemma-4 vision GGUF with GpuBackend.auto/cuda, all layers offloaded to GPU.
- Call loadMultimodalProjector(mmproj.gguf). It succeeds and logs clip_ctx: CLIP using CUDA0 backend.
- Call engine.create([LlamaImageContent(path), LlamaTextContent(prompt)], enableThinking: false).
- Stream the result. Process dies with Lost connection to device ~5s in, before the first token, during the native image encode.
What I tried
| Attempt | Result |
|---|---|
| Disable MTP | no change |
| Force enableThinking: false for vision | no change |
| FlashAttention.disabled on model load | no change |
| GpuBackend.vulkan for the model | model on Vulkan → crash |
| Pin mainGpu to discrete RTX 4080 (splitMode.none) | correct GPU, still crash |
| Force whole vision engine to CPU (gpuLayers:0 → mtmd use_gpu=false) | crash again |
Asks for upstream
Honestly, I hope I just missed something myself. If anybody has the time, I'd like to know whether you can replicate what I've seen, and whether you can get it working (or not) in an environment similar to mine.
Local workaround
I just pinned my pubspec to 0.6.12 for now, and commented out the MTP paths so I can hopefully re-enable them later once I can get images working. Not all that fancy, but it is what it is, I suppose.
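For anyone who wants to apply the same workaround, the pin could look roughly like the fragment below. This is a minimal sketch that assumes the package is pulled from pub.dev under the name llamadart; the rest of the pubspec is omitted.

```yaml
# pubspec.yaml (fragment)
dependencies:
  # Exact version with no caret, so pub will not resolve to any 0.8.x release.
  llamadart: 0.6.12
```

Writing the version without a leading ^ keeps the resolver on exactly 0.6.12. Run dart pub get (or flutter pub get) afterwards so pubspec.lock picks up the downgrade.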