dhdaines commented Jan 4, 2026

Applies on top of #2108, which has the necessary changes to MTMD.

This adds chat formats to support https://huggingface.co/ggml-org/granite-docling-258M-GGUF and its ancestor, https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF (and various other SmolVLM models).

To use Granite-Docling effectively for table structure, equation, and layout extraction, special tokens must be enabled in the chat completion output. This PR therefore adds a flag to all of the chat completion functions that matches what --special does in llama-cli (it is enabled by default in llama-mtmd-cli).
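A minimal usage sketch, assuming the new flag surfaces as a `special` keyword argument on `create_chat_completion` (the parameter name and model path are illustrative, and the mmproj/chat-handler wiring needed for real image input is omitted):

```python
from llama_cpp import Llama

# Hypothetical illustration of the flag added in this PR; the model path is a
# placeholder and the image/projector setup needed for real Docling use is omitted.
llm = Llama(model_path="granite-docling-258M-Q8_0.gguf", n_ctx=4096)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Convert this table to OTSL."}],
    special=True,  # keep special tokens (structure tags) in the output, like --special in llama-cli
)
print(result["choices"][0]["message"]["content"])
```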

dhdaines marked this pull request as ready for review January 4, 2026 18:05

dhdaines commented Jan 4, 2026

Ready for review!

dhdaines force-pushed the granite-docling branch 3 times, most recently from 3a04e21 to 8790ce6 on January 6, 2026 at 19:55
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field
- FIX CRITICAL: kv_cache_seq_rm must preserve seq_id=-1 semantics (all sequences)
  * The wrapper was incorrectly converting -1 to 0, breaking context rewind
  * This caused 'discontinuity' errors on multi-turn conversations
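A sketch of the intended pass-through, assuming low-level bindings that mirror the upstream llama.cpp names (llama_get_memory, llama_memory_seq_rm); the real wrapper lives elsewhere in the package, this only illustrates the sentinel handling:

```python
import llama_cpp

def kv_cache_seq_rm(ctx, seq_id: int, p0: int, p1: int) -> bool:
    """Remove positions [p0, p1) from a sequence.

    seq_id == -1 means "all sequences"; p0 == -1 / p1 == -1 mean "from the
    start" / "to the end".  These sentinels must be forwarded unchanged.
    """
    # Bug before this fix (illustrative): clamping with max(seq_id, 0) turned
    # the "all sequences" sentinel into sequence 0, breaking context rewind and
    # producing the 'discontinuity' errors seen on multi-turn conversations.
    mem = llama_cpp.llama_get_memory(ctx)
    return llama_cpp.llama_memory_seq_rm(mem, seq_id, p0, p1)
```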

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)
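A migration sketch, assuming the low-level context params now expose flash_attn_type with the enum values listed above and that the high-level Llama constructor still accepts the old boolean:

```python
import llama_cpp

# Low-level API: the bool field is gone; set the enum instead.
params = llama_cpp.llama_context_default_params()
# params.flash_attn = True     # old field, removed
params.flash_attn_type = 1     # 1 = ENABLED, 0 = DISABLED, -1 = AUTO (the new default)

# High-level API: flash_attn=True/False is still accepted and mapped onto the
# enum by the compatibility wrapper; leaving it unset now selects AUTO.
llm = llama_cpp.Llama(model_path="model.gguf", flash_attn=True)
```

With the server default now None, flash attention is chosen automatically by llama.cpp instead of being forced off.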

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*; migration sketch below)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params
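A migration sketch for the removed kv_self functions, assuming the new bindings mirror the upstream llama.cpp names (llama_get_memory, llama_memory_clear, llama_memory_seq_keep):

```python
import llama_cpp

# Before (removed):  llama_cpp.llama_kv_self_clear(ctx)
#                    llama_cpp.llama_kv_self_seq_keep(ctx, seq_id)

def clear_context(ctx) -> None:
    # Operations now go through the memory handle attached to the context.
    mem = llama_cpp.llama_get_memory(ctx)
    llama_cpp.llama_memory_clear(mem, True)  # True clears the data buffers as well as the metadata

def keep_only(ctx, seq_id: int) -> None:
    # Drop every sequence except seq_id.
    mem = llama_cpp.llama_get_memory(ctx)
    llama_cpp.llama_memory_seq_keep(mem, seq_id)
```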
Ralf Waldukat and others added 5 commits on January 14, 2026 at 08:35
After external code review (GPT-5.2), fixed three critical issues and made several supporting improvements:

1. CRITICAL: Fixed tokens[:-1] bug in prefix matching
   - Was silently breaking prefix matching for ALL models
   - Caused false rewind detection and cache inefficiency
   - Impact: transformer AND recurrent models alike (see the distilled sketch below)

2. CRITICAL: Implement proper reset() for recurrent models
   - Now actually clears llama_memory backend state
   - Root cause fix for 'sequence positions not consecutive' crash
   - Without this, reset was a no-op for recurrent models

3. CRITICAL: Enforce strict append policy for recurrent models
   - Prevents KV cache rewinding that's impossible without state snapshots
   - Forces full reset on history edits instead of crashing

4. Performance: Cache _is_recurrent to avoid repeated FFI calls

5. Documentation: Simplified comments and updated docstring

6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated

Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models.
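A distilled, standalone sketch of the prefix-matching fix from item 1 (the real code is a method on the high-level wrapper; the helper name here is illustrative):

```python
def longest_token_prefix(cached: list[int], new: list[int]) -> int:
    """Count how many leading tokens the cached context and the new prompt share."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Before the fix the comparison dropped the last prompt token:
#     n = longest_token_prefix(cached_tokens, new_tokens[:-1])   # buggy
# so a fully cached prompt never matched completely, which was misread as a
# history edit, triggered a false rewind, and hurt cache reuse.  The fix
# compares the full token list:
#     n = longest_token_prefix(cached_tokens, new_tokens)
```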

Reviewed-by: GPT-5.2 (OpenAI)
Tested-by: pytest + Mistral-Small-3.2-24B
Fixes: abetlen#2108 (recurrent model crashes)
Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)