mtmd : add post-decode callback by ggerganov · Pull Request #24645 · ggml-org/llama.cpp

ggerganov · 2026-06-15T09:35:06Z

Overview

This resolves the [TAG_MTMD_DRAFT_PROCESSING] TODO for synchronizing the target and draft contexts, and avoids including llama-ext.h in server-context.cpp.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. pi:llama.cpp/Qwen3.6-27B

Assisted-by: pi:llama.cpp/Qwen3.6-27B

ngxson · 2026-06-15T10:14:37Z

Just to confirm: apart from avoiding the llama-ext.h include, this also makes sure that decoding the spec always calls common_speculative_process instead of calling llama_decode directly, right?

If so, I agree that the callback added here is acceptable.

ggerganov · 2026-06-15T10:29:26Z

> Just to confirm: apart from avoiding the llama-ext.h include, this also makes sure that decoding the spec always calls common_speculative_process instead of calling llama_decode directly, right?

Yes, I think that as long as we process the exact same batches both with the target and draft/spec contexts, they should remain synchronized.

There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.
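
The synchronization argument above can be illustrated with a toy model. This is a minimal sketch and not the code from this PR: toy_batch, toy_context, post_decode_cb and eval_chunks are hypothetical names invented for the example, and the actual change goes through the mtmd and common_speculative APIs, whose signatures are not shown in this thread. The sketch only shows the principle: the helper decodes a batch with the target context, then hands the exact same batch to a callback, which mirrors it into the draft context.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT llama.cpp or mtmd types.
struct toy_batch {
    std::vector<int> tokens;
};

struct toy_context {
    std::vector<int> seen; // every token this context has processed so far

    void decode(const toy_batch & batch) {
        seen.insert(seen.end(), batch.tokens.begin(), batch.tokens.end());
    }
};

// Invoked after the target context has decoded a batch.
using post_decode_cb = std::function<void(const toy_batch &)>;

// Decodes each chunk with the target context, then passes the same batch to the callback.
void eval_chunks(toy_context & target, const std::vector<toy_batch> & chunks, const post_decode_cb & cb) {
    for (const toy_batch & batch : chunks) {
        target.decode(batch);
        if (cb) {
            cb(batch);
        }
    }
}

// Returns true if the draft context ends up with the same history as the target.
bool demo_contexts_stay_in_sync() {
    toy_context target;
    toy_context draft;

    std::vector<toy_batch> chunks;
    chunks.push_back(toy_batch{ {1, 2, 3} });
    chunks.push_back(toy_batch{ {4, 5} });

    eval_chunks(target, chunks, [&draft](const toy_batch & batch) {
        draft.decode(batch); // mirror the exact same batch into the draft context
    });

    assert(target.seen.size() == 5); // both chunks reached the target
    return target.seen == draft.seen;
}
```

Because the caller never decodes on its own, there is a single place where batches reach the draft context, which is the property the question above asks about.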

ngxson · 2026-06-15T10:35:13Z

> There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.

OK, I will start working on this today (unless you want to take it over).

ggerganov · 2026-06-15T10:59:55Z

> > There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.
>
> OK, I will start working on this today (unless you want to take it over).

I plan to do a few refactors that have piled up lately (Metal, memory, tests, ggml backend), but I wasn't planning to start on llama_batch yet. So if you want to do it, go ahead.

My general idea for llama_batch is to make a new llama_batch_ext that is opaque (i.e. accessed only via a dedicated API), and to introduce a corresponding llama_process(llama_batch_ext * batch) instead of separate encode/decode calls. The new llama_batch_ext would carry more information: for example, it should be able to provide both vision and target-model embeddings, so that the Qwen MTP can work correctly.
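
The opaque-batch idea can be sketched as follows. Everything here is an assumption made for illustration: llama_batch_ext and llama_process are only proposed in this comment, so the sketch uses hypothetical toy_ names, and the fields, the two embedding kinds and the return codes are invented. It shows the general pattern only: callers hold a pointer to an incomplete type, fill it through accessor functions, and submit it through one entry point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ---- what a public header might expose (all names hypothetical) ----
struct toy_batch_ext; // opaque: callers never see the fields

enum toy_embd_kind { TOY_EMBD_VISION, TOY_EMBD_TARGET };

toy_batch_ext * toy_batch_ext_init();
void   toy_batch_ext_free(toy_batch_ext * batch);
void   toy_batch_ext_add_token(toy_batch_ext * batch, int token);
void   toy_batch_ext_add_embd(toy_batch_ext * batch, toy_embd_kind kind, const float * data, size_t n);
int    toy_process(toy_batch_ext * batch); // single entry point; 0 on success, -1 on empty batch

// ---- private implementation, free to change without breaking callers ----
struct toy_batch_ext {
    std::vector<int>   tokens;
    std::vector<float> embd_vision;
    std::vector<float> embd_target;
};

toy_batch_ext * toy_batch_ext_init() { return new toy_batch_ext(); }

void toy_batch_ext_free(toy_batch_ext * batch) { delete batch; }

void toy_batch_ext_add_token(toy_batch_ext * batch, int token) {
    batch->tokens.push_back(token);
}

void toy_batch_ext_add_embd(toy_batch_ext * batch, toy_embd_kind kind, const float * data, size_t n) {
    std::vector<float> & dst = kind == TOY_EMBD_VISION ? batch->embd_vision : batch->embd_target;
    dst.insert(dst.end(), data, data + n);
}

int toy_process(toy_batch_ext * batch) {
    const bool empty = batch->tokens.empty() && batch->embd_vision.empty() && batch->embd_target.empty();
    return empty ? -1 : 0; // a real implementation would run the model here
}

// Builds a batch carrying tokens plus both kinds of embeddings and submits it once.
int demo_opaque_batch() {
    toy_batch_ext * batch = toy_batch_ext_init();

    const float vision[2] = { 0.1f, 0.2f };
    const float target[2] = { 0.3f, 0.4f };

    toy_batch_ext_add_token(batch, 42);
    toy_batch_ext_add_embd(batch, TOY_EMBD_VISION, vision, 2);
    toy_batch_ext_add_embd(batch, TOY_EMBD_TARGET, target, 2);

    const int rc = toy_process(batch);
    toy_batch_ext_free(batch);
    return rc;
}
```

Keeping the struct definition out of the public header is what lets extra inputs be added later without changing the layout that callers compile against.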

mtmd : add post-decode callback

7da52ff

Assisted-by: pi:llama.cpp/Qwen3.6-27B

ggerganov mentioned this pull request Jun 15, 2026

server : unify mtmd image processing with post-decode callback #24520

Closed

1 task

github-actions bot added the examples and server labels Jun 15, 2026

ggerganov marked this pull request as ready for review June 15, 2026 10:29

ggerganov requested review from a team as code owners June 15, 2026 10:29

ngxson approved these changes Jun 15, 2026

View reviewed changes

ggerganov merged commit e3cab40 into master Jun 15, 2026
25 checks passed

ggerganov deleted the gg/server-spec-mtmd-cont branch June 15, 2026 13:02
