UPSTREAM PR #16933: server: add minimax-m2 reasoning format override for MiniMax-M2 compatibility by DajanaV · Pull Request #45 · auroralabs-loci/llama.cpp

DajanaV · 2025-11-02T14:04:26Z

Description

MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning. This PR adds a minimal reasoning format override that injects a synthetic opening <think> tag while keeping all reasoning content inline, ensuring compatibility with existing clients without modifying the current parsing behavior.
This approach is equivalent to reasoning_format=none but with synthetic prefix injection. When set via --reasoning-format minimax-m2 at server startup, it overrides client API requests that specify reasoning_format=auto, allowing the model to receive the full reasoning block it needs while remaining compatible with all OpenAI-compatible clients.
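The override rule described above can be sketched as a small resolution function. This is an illustration only: the enum `reasoning_format` and the functions `parse_reasoning_format` and `resolve_reasoning_format` are hypothetical names, not the PR's code, and the behaviour for client values other than `auto` is an assumption (the description only covers `auto`).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for common_reasoning_format; only the values needed here.
enum class reasoning_format { none, auto_, deepseek, minimax_m2 };

// Map a format name (as passed to --reasoning-format or sent by a client) to the enum.
inline reasoning_format parse_reasoning_format(const std::string & name) {
    if (name == "none")       return reasoning_format::none;
    if (name == "auto")       return reasoning_format::auto_;
    if (name == "deepseek")   return reasoning_format::deepseek;
    if (name == "minimax-m2") return reasoning_format::minimax_m2;
    throw std::invalid_argument("unknown reasoning format: " + name);
}

// Pick the format used for one request.
// Documented case: server started with minimax-m2 and the client asks for auto,
// so the server setting wins.
// Assumption of this sketch: any other client value is honoured as requested.
inline reasoning_format resolve_reasoning_format(reasoning_format server_cli,
                                                 reasoning_format client_request) {
    if (server_cli == reasoning_format::minimax_m2 &&
        client_request == reasoning_format::auto_) {
        return server_cli;
    }
    return client_request;
}
```

With this rule, a client that sends `reasoning_format=auto` to a server started with `--reasoning-format minimax-m2` still gets the inline `<think>` behaviour without any client-side change.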

Changes

Add COMMON_REASONING_FORMAT_MINIMAX_M2 enum value to common_reasoning_format
Implement minimax-m2 format parsing that bypasses reasoning extraction
Inject synthetic <think>\n chunk before first generated token when minimax-m2 is active
Track injection state with minimax_reasoning_prefix_injected and minimax_reasoning_prefix_streamed slot flags
Prepend <think>\n to generated_text for final response and chat parsing
Prevent client reasoning_format=auto from overriding server CLI setting
Add minimax-m2 to CLI help, README.md, and code documentation
Handle LLAMA_TOKEN_NULL in send_partial_response to skip token recording
Update process_token to preserve delta_to_send for streaming correctness
Defer synthetic prefix injection until first generated token for better UX
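A minimal sketch of the deferred prefix injection listed above, assuming a simplified slot type. The flag names come from the change list; the struct `slot_state`, the function `on_token`, and the exact meaning assigned to each flag are assumptions made for illustration, not the server's actual code.

```cpp
#include <string>
#include <vector>

// Hypothetical per-slot state. The two flags mirror the names in the change list;
// what each one guards is an assumption of this sketch.
struct slot_state {
    bool minimax_m2_active = false;
    bool minimax_reasoning_prefix_injected = false; // prefix already added to generated_text
    bool minimax_reasoning_prefix_streamed = false; // prefix already sent to the client
    std::string generated_text;
};

// Called once per generated token piece. Returns the chunks to stream, in order.
// The synthetic "<think>\n" is emitted only when the first real token arrives,
// so nothing is sent for a request that never produces a token.
inline std::vector<std::string> on_token(slot_state & slot, const std::string & piece) {
    std::vector<std::string> chunks;
    if (slot.minimax_m2_active) {
        if (!slot.minimax_reasoning_prefix_injected) {
            slot.generated_text.insert(0, "<think>\n");
            slot.minimax_reasoning_prefix_injected = true;
        }
        if (!slot.minimax_reasoning_prefix_streamed) {
            chunks.push_back("<think>\n");
            slot.minimax_reasoning_prefix_streamed = true;
        }
    }
    slot.generated_text += piece;
    chunks.push_back(piece);
    return chunks;
}
```

Keeping separate flags for the accumulated text and the stream lets the final response and the streamed chunks each receive the prefix exactly once.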

Testing

Tested with MiniMax-M2-230B model using --reasoning-format minimax-m2 flag on stock Svelte UI.

This commit updates the macos-13 runners to macos-15-intel. The motivation for this change is that the macos-13 runners are scheduled to be retired on 2025-12-04. Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/

…6354) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
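One way to picture the "manually computed range" mentioned in the commit message above is a helper that derives the range from the buffer size and binding offset and checks it against the device limit. This is a sketch under assumptions: the function `descriptor_range` is hypothetical, it uses plain integers rather than Vulkan types, and the commit message does not say how the real code computes or validates the range.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: compute an explicit descriptor range instead of relying
// on VK_WHOLE_SIZE. Returns no value when the binding would be invalid, either
// because the offset lies past the end of the buffer or because the range
// exceeds the device's maxStorageBufferRange limit.
inline std::optional<std::uint64_t> descriptor_range(std::uint64_t buffer_size,
                                                     std::uint64_t offset,
                                                     std::uint64_t max_storage_buffer_range) {
    if (offset > buffer_size) {
        return std::nullopt;
    }
    const std::uint64_t range = buffer_size - offset;
    if (range > max_storage_buffer_range) {
        return std::nullopt;
    }
    return range;
}
```

An explicit range makes the limit check visible at the call site, whereas a whole-size binding on a buffer larger than the limit would be invalid usage, as the commit message notes.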

* initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order

* feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. 
Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration, there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:

```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
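The scenario above can be reproduced with a scalar model of the loop structure. This is not the SVE code from the commit: `scale_main_loop`, `scale_single_pass`, and `scale_looped` are hypothetical functions, and `epr` stands in for ggml_f32_epr.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the main loop: consumes 2*epr elements per iteration and
// returns np, the index of the first leftover element.
inline std::size_t scale_main_loop(std::vector<float> & y, float v, std::size_t epr) {
    assert(epr > 0);
    const std::size_t step = 2 * epr;
    const std::size_t np   = y.size() - (y.size() % step);
    for (std::size_t i = 0; i < np; ++i) {
        y[i] *= v;
    }
    return np;
}

// Behaviour described as the original: a single leftover pass of at most epr elements.
inline void scale_single_pass(std::vector<float> & y, float v, std::size_t epr) {
    const std::size_t np  = scale_main_loop(y, v, epr);
    const std::size_t end = std::min(np + epr, y.size());
    for (std::size_t i = np; i < end; ++i) {
        y[i] *= v;
    }
}

// Behaviour described as the fix: loop over the leftovers in chunks of up to epr,
// so all of the up to (2*epr - 1) leftover elements are scaled.
inline void scale_looped(std::vector<float> & y, float v, std::size_t epr) {
    const std::size_t np = scale_main_loop(y, v, epr);
    for (std::size_t start = np; start < y.size(); start += epr) {
        const std::size_t end = std::min(start + epr, y.size());
        for (std::size_t i = start; i < end; ++i) {
            y[i] *= v;
        }
    }
}
```

With n = 25 and epr = 8, the single-pass version leaves element 24 untouched, while the looped version scales all 25 elements.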

This commit removes jina-reranker-v1-tiny-en model files that are no longer present on Hugging Face. The motivation for this is that it clears the 404 errors from the CI logs, which can be a little confusing when looking at the logs for the first time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649

* fix: Fix duplicate fake image before token on first slice Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use double-newline before overview image Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove incorrect newline at the end of granite chat template gen prompt There should not be one, even for the language models. Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * tests: Remove bad newline from granite chat template test (legacy) Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* webui: recognize AsciiDoc files as valid text files * webui: add an updated static webui build * webui: add the updated dependency list * webui: re-add an updated static webui build This also reverts commit 742dbb837939c176a813868c268d28ebd3fafb7c.

* mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen

…iframe (#16757) * webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog Extended MarkdownContent to flag previewable code languages, add a preview button alongside copy controls, manage preview dialog state, and share styling for the new button group Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls * webui: fullscreen HTML preview dialog using bits-ui * Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: pedantic style tweak for CodePreviewDialog close button * webui: remove overengineered preview language logic * chore: update webui static build --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

…a (#16784) * webui: auto-refresh /props on inference start to resync model metadata - Add no-cache headers to /props and /slots - Throttle slot checks to 30s - Prevent concurrent fetches with promise guard - Trigger refresh from chat streaming for legacy and ModelSelector - Show dynamic serverWarning when using cached data * fix: restore proper legacy behavior in webui by using unified /props refresh Updated assistant message bubbles to show each message's stored model when available, falling back to the current server model only when the per-message value is missing When the model selector is disabled, now fetches /props and prioritizes that model name over chunk metadata, then persists it with the streamed message so legacy mode properly reflects the backend configuration * fix: detect first valid SSE chunk and refresh server props once * fix: removed the slots availability throttle constant and state * webui: purge ai-generated cruft * chore: update webui static build

…tibility

MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning. This mode injects a synthetic opening <think> tag in the stream while keeping all reasoning tags inline in message.content, ensuring the model receives the full reasoning block it needs.

Changes:

- Add COMMON_REASONING_FORMAT_MINIMAX_M2 enum value to common_reasoning_format
- Implement minimax-m2 format parsing that bypasses reasoning extraction
- Inject synthetic <think>\n chunk at slot start when minimax-m2 is active
- Track injection state with minimax_reasoning_prefix_injected slot flag
- Prepend <think>\n to generated_text for final response and chat parsing
- Prevent client reasoning_format=auto from overriding server CLI setting
- Add minimax-m2 to CLI help, README.md, and code documentation
- Handle LLAMA_TOKEN_NULL in send_partial_response to skip token recording
- Update process_token to preserve delta_to_send for streaming correctness

loci-review · 2025-11-02T15:18:11Z

Performance Analysis Summary: PR #45 MiniMax-M2 Reasoning Format

Critical Function Analysis

Core Inference Functions

All critical inference functions show no measurable performance impact:

llama_decode: 49,003,916 ns (baseline: 49,003,848 ns) - No change
llama_encode: 12,329,226 ns (baseline: 12,329,209 ns) - No change
llama_tokenize: 834,827 ns (baseline: 834,829 ns) - No change
llama_model_quantize: 6,891,692 ns (baseline: 6,891,727 ns) - No change
llama_batch_init: 257 ns (baseline: 257 ns) - No change
llama_memory_clear: 49 ns (baseline: 49 ns) - No change

Key Finding: None of the core inference functions were modified or show performance degradation.

Performance Impact Analysis by KPI

1. Tokens Per Second

Impact: No measurable impact on inference throughput

Core inference functions (llama_decode, llama_encode, llama_tokenize) show no performance changes
Changes are isolated to server-side chat parsing and streaming logic
The MiniMax-M2 feature operates at the application layer without affecting token processing pipelines

2. Power Consumption

Impact: Negligible change across all binaries

build.bin.libllama.so: -0.0% change (306,978.11 nJ vs 306,978.33 nJ baseline)
build.bin.libggml-base.so: 0.0% change
build.bin.libggml-cpu.so: 0.0% change
build.bin.libggml.so: 0.0% change

Analysis: Power consumption remains stable as core computational functions are unaffected.

3. Quantization Efficiency

Impact: No impact

llama_model_quantize function shows no performance changes
Quantization logic remains unchanged
MiniMax-M2 feature operates independently of model quantization

4. Memory Usage

Impact: Minimal increase in server memory usage

Affected Areas: Chat parsing and streaming buffers
New Allocations: Synthetic <think> tag injection requires temporary string allocations
Per-Slot Overhead: Two additional boolean flags per server slot
Core Memory Systems: KV cache and model memory management unaffected

5. Batch Processing

Impact: No impact

llama_batch_init and related batch functions show no changes
Batch processing logic operates independently of chat formatting
Token batching efficiency preserved

Root Cause Analysis

The observed performance degradations (PLT stub functions showing +0.043% to +0.113% increases) stem from:

Dynamic Linking Overhead

Additional code paths in server module increase binary size
Modified symbol table layout affects PLT resolution efficiency
String manipulation functions add to the dynamic symbol table

String Processing Overhead

text_to_parse.insert(0, "<think>\n") operations in chat parsing
Additional conditional branches in parsing hot paths
Memory allocation patterns for synthetic content injection

Action Items

Code Optimization

String Operations: Replace insert(0, ...) with more efficient string building patterns
Conditional Logic: Consolidate reasoning format checks into switch statements
Memory Management: Use string views or pre-allocated buffers for synthetic content
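As an illustration of the "more efficient string building" suggestion, the sketch below builds the prefixed text in a pre-sized buffer instead of calling insert(0, ...) on an existing string. The function `with_prefix` is hypothetical, and whether this is measurably faster in the server is not established by the analysis above.

```cpp
#include <string>

// Hypothetical helper: build prefix + body in one pre-sized buffer.
// insert(0, ...) on an existing string has to shift the whole body and may
// reallocate; reserving the final size first avoids repeated growth.
inline std::string with_prefix(const std::string & prefix, const std::string & body) {
    std::string out;
    out.reserve(prefix.size() + body.size());
    out.append(prefix);
    out.append(body);
    return out;
}
```

For a short prefix such as "<think>\n" and a typical response, both approaches produce the same string; the difference is only in how many bytes are moved.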

Build Optimization

Symbol Ordering: Optimize linker script to improve symbol locality
Static Linking: Consider static linking for frequently used string functions
Profile-Guided Optimization: Enable PGO to optimize new code paths

Architecture Improvements

Strategy Pattern: Implement reasoning format handlers as separate classes
Lazy Evaluation: Defer synthetic prefix operations until required
Buffer Pooling: Use pre-allocated buffers for chat parsing operations

Conclusion

The MiniMax-M2 feature implementation introduces minimal performance overhead isolated to server-side chat processing. Core inference performance remains unaffected, ensuring no impact on tokens per second throughput. The observed degradations in PLT functions reflect dynamic linking overhead from additional code paths rather than algorithmic inefficiencies in critical inference functions.

allozaur and others added 30 commits October 3, 2025 10:11

webui : Fix messages payload sent to chat completions (#16402)

136bda7

* fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output

vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316)

e308efd

Capture model name only after first token (streaming) or completed request (#16405)

7723327

* feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output

vulkan: Fix FA coopmat1 invalid array indexing (#16365)

0e1f838

When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.

Fix missing messages on sibling navigation (#16408)

84c8e30

* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output

ggml : fix graph reallocation with multiple chunks (#16396)

638d330

reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower
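The condition in this commit message can be sketched as a per-chunk comparison. The functions `needs_realloc_total` and `needs_realloc_per_chunk` and the vector-of-sizes representation are hypothetical and are not ggml's allocator code.

```cpp
#include <cstddef>
#include <vector>

// Check that only looks at the total size. It misses the case where one chunk
// grows while another shrinks by at least as much.
inline bool needs_realloc_total(const std::vector<std::size_t> & current,
                                const std::vector<std::size_t> & required) {
    std::size_t cur = 0;
    std::size_t req = 0;
    for (std::size_t s : current)  { cur += s; }
    for (std::size_t s : required) { req += s; }
    return req > cur;
}

// Per-chunk check: reallocate if there are more chunks than before or if any
// single chunk needs more bytes than it currently has.
inline bool needs_realloc_per_chunk(const std::vector<std::size_t> & current,
                                    const std::vector<std::size_t> & required) {
    if (required.size() > current.size()) {
        return true;
    }
    for (std::size_t i = 0; i < required.size(); ++i) {
        if (required[i] > current[i]) {
            return true;
        }
    }
    return false;
}
```

For current sizes {100, 100} and required sizes {150, 20}, the total drops from 200 to 170, yet the first chunk no longer fits, so only the per-chunk check reports that reallocation is needed.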

llama : fix shapes for bert/mpt q/k norm (#16409)

946f71e

metal : fix loop bound in ggml_mem_ranges (#16412)

606a73f

chat : support Magistral thinking (#16413)

128d522

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral

rpc : check src buffer when copying tensor (#16421)

f392839

Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.

vulkan: use a more appropriate amount of threads when generating shaders (#16418)

86df2c9

* use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax

ggml webgpu: actually add softmax, fix rms_norm offset (#16400)

3526657

* implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit

server: update readme to mention n_past_max metric (#16436)

c5fef0f

ggml-org/llama.cpp#15361 added new metric exported, but I've missed this doc.

nix : removed metal for nix (#16118)

1d49ca3

ggml : fix unaligned access in AMX code (#16315)

a23b9bd

ci : refactor sdk caching to minimize storage (#16414)

3a002af

* refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]

llama : add --no-host to disable host buffers (#16310)

3df2244

* implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <slarengh@gmail.com>

metal : various optimizations + refactoring (#16446)

8ae32dc

* metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt

metal : add support for non-padded FA KV (#16148)

0a319bb

* metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement

memory : use sequential equal splits for recurrent modules (#16442)

0123ff3

CISC and others added 13 commits November 1, 2025 11:01

common : allow --system-prompt-file for diffusion-cli (#16903)

961660b

Add a setting to display message generation statistics (#16901)

d8b860a

* feat: Add setting to display message generation statistics * chore: build static webui output

vendor : update cpp-httplib to 0.27.0 (#16846)

dd5e8ca

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

scripts : add script to bench models (#16894)

7fd205a

ggml: add s390x cpu-feats (#16774)

d38d9f0

devops: fix failing s390x docker build (#16918)

a864132

CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (#16917)

7db35a7

server: defer minimax-m2 synthetic <think> until first generated token

39351b1

DajanaV temporarily deployed to PROD__AL_DEMO November 2, 2025 14:04 — with GitHub Actions Inactive

DajanaV force-pushed the main branch 12 times, most recently from b655780 to 94ec54d Compare November 3, 2025 20:09

DajanaV closed this Nov 3, 2025

DajanaV force-pushed the main branch from 94ec54d to 92c0c2f Compare November 3, 2025 23:53

loci-dev pushed a commit that referenced this pull request Feb 19, 2026

Merge pull request #45 from cavusmustafa/tmp_fix_multithread

cb92f77

Temp fix for multithreading bug

DajanaV wants to merge 6924 commits into main from upstream-PR16933-branch_ServeurpersoCom-reasoning-format-minimax-m2

20 participants
