mtmd : add post-decode callback #24645

Merged
ggerganov merged 1 commit into master from gg/server-spec-mtmd-cont on Jun 15, 2026

Conversation

@ggerganov

Member

Overview

alt #24520

This resolves the [TAG_MTMD_DRAFT_PROCESSING] TODO for synchronizing the target and draft contexts, and avoids including llama-ext.h in server-context.cpp.

Requirements

Assisted-by: pi:llama.cpp/Qwen3.6-27B
@ngxson

ngxson commented Jun 15, 2026

Collaborator

Just to confirm: apart from avoiding the llama-ext.h include, this also makes sure that decoding the spec always calls common_speculative_process instead of calling llama_decode directly, right?

If so, I agree that the callback added here is acceptable

@ggerganov

Member Author

Just to confirm: apart from avoiding the llama-ext.h include, this also makes sure that decoding the spec always calls common_speculative_process instead of calling llama_decode directly, right?

Yes, I think that as long as we process the exact same batches both with the target and draft/spec contexts, they should remain synchronized.
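The synchronization argument can be illustrated with a toy model. The following is only a sketch: `toy_batch`, `toy_context`, `post_decode_cb`, `decode_chunks` and `run_synchronized` are hypothetical names made up for this example and are not part of llama.cpp or mtmd. It assumes that a context's state is fully determined by the ordered sequence of batches it has processed, so forwarding every target batch to the draft through a post-decode callback keeps the two identical.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not llama.cpp types.
using toy_batch = std::vector<int>; // token ids of one batch

struct toy_context {
    std::vector<int> seen; // every token this context has processed, in order

    void process(const toy_batch & batch) {
        seen.insert(seen.end(), batch.begin(), batch.end());
    }
};

// Invoked after each batch has been processed by the target context.
using post_decode_cb = std::function<void(const toy_batch &)>;

// Process the batches with the target and report each one to the callback.
void decode_chunks(toy_context & target,
                   const std::vector<toy_batch> & batches,
                   const post_decode_cb & cb) {
    for (const auto & batch : batches) {
        target.process(batch);
        if (cb) {
            cb(batch); // the caller decides what to do with the same batch
        }
    }
}

// Mirror every target batch into the draft; returns true if both ended up equal.
bool run_synchronized(toy_context & target,
                      toy_context & draft,
                      const std::vector<toy_batch> & batches) {
    assert(target.seen == draft.seen); // both contexts must start in sync
    decode_chunks(target, batches, [&draft](const toy_batch & batch) {
        draft.process(batch);
    });
    return target.seen == draft.seen;
}
```

Because the draft only ever sees batches through the callback, there is no second code path that could feed it different input from the target.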

There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.

@ggerganov ggerganov marked this pull request as ready for review June 15, 2026 10:29
@ggerganov ggerganov requested review from a team as code owners June 15, 2026 10:29
@ngxson

ngxson commented Jun 15, 2026

Collaborator

There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.

ok I will start working on this today (unless you want to take over it)

@ggerganov

Member Author

There is still some incorrectness when doing multi-modal processing with MTP, but to fix that we have to rework the llama_batch.

ok I will start working on this today (unless you want to take over it)

I plan to do a few refactors that have piled up lately (Metal, memory, tests, ggml backend), but I wasn't planning to start on llama_batch yet. So if you want to do it, go ahead.

My general idea for llama_batch is to make a new llama_batch_ext that is opaque (i.e. accessed only via a dedicated API), and to introduce a corresponding llama_process(llama_batch_ext * batch) instead of separate encode/decode calls. The new llama_batch_ext would carry more information: for example, it should be able to provide both vision and target-model embeddings, so that the Qwen MTP can work correctly.
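To make the opaque-handle idea concrete, here is a minimal sketch of the pattern. Every name in it (`toy_batch_ext`, `toy_embd_kind`, `toy_batch_ext_init`, `toy_batch_ext_add_token`, `toy_batch_ext_add_embd`, `toy_process`, `toy_batch_ext_free`) is hypothetical; the actual llama_batch_ext and llama_process are not designed yet, so this shows only the general shape under that assumption: callers see a forward-declared struct, fill it through functions, and hand it to a single processing entry point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ---- what a public header might expose (all names hypothetical) ----
struct toy_batch_ext; // opaque: callers never see the fields

enum toy_embd_kind { TOY_EMBD_VISION, TOY_EMBD_TARGET };

toy_batch_ext * toy_batch_ext_init();
void            toy_batch_ext_free(toy_batch_ext * batch);
void            toy_batch_ext_add_token(toy_batch_ext * batch, int token);
void            toy_batch_ext_add_embd(toy_batch_ext * batch, toy_embd_kind kind,
                                       const float * data, std::size_t n);
// Single entry point replacing separate encode/decode calls.
// In this toy it just returns how many inputs the batch carries.
std::size_t     toy_process(const toy_batch_ext * batch);

// ---- what the implementation file would contain ----
struct toy_batch_ext {
    std::vector<int>                tokens;
    std::vector<std::vector<float>> vision_embd; // embeddings from a vision encoder
    std::vector<std::vector<float>> target_embd; // embeddings from the target model
};

toy_batch_ext * toy_batch_ext_init() { return new toy_batch_ext(); }

void toy_batch_ext_free(toy_batch_ext * batch) { delete batch; }

void toy_batch_ext_add_token(toy_batch_ext * batch, int token) {
    batch->tokens.push_back(token);
}

void toy_batch_ext_add_embd(toy_batch_ext * batch, toy_embd_kind kind,
                            const float * data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::vector<float> copy;
    if (n > 0) {
        copy.assign(data, data + n); // the batch owns its own copy
    }
    if (kind == TOY_EMBD_VISION) {
        batch->vision_embd.push_back(copy);
    } else {
        batch->target_embd.push_back(copy);
    }
}

std::size_t toy_process(const toy_batch_ext * batch) {
    return batch->tokens.size() + batch->vision_embd.size() + batch->target_embd.size();
}
```

Since the struct layout stays private, new kinds of input can be added later without changing the callers, which is the main appeal of an opaque handle over a plain struct.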

@ggerganov ggerganov merged commit e3cab40 into master Jun 15, 2026
25 checks passed
@ggerganov ggerganov deleted the gg/server-spec-mtmd-cont branch June 15, 2026 13:02