feat: enable MiniCPM5-1B (explicit head_dim + special-token decode + tojson kwarg + tool parser)#176
feat: enable MiniCPM5-1B (explicit head_dim + special-token decode + tojson kwarg + tool parser)#176dusterbloom wants to merge 12 commits into
Conversation
The streaming `/v1/chat/completions` route previously stripped tools from the prompt (`prepare_chat_prompt_with_thinking(..., None, ...)`) and warned that "tool_calls deltas are unsupported". The model, deprived of tool context, would hallucinate fake tool calls as plain text — and the client's TTS pipeline would happily speak `<nanobot>read_skill newsreader</nanobot>` out loud. This commit completes the streaming tool-call story that was documented but never finished (see closed upstream PR panbanda#63). Three pieces, ~580 net new lines: 1. `tool_parser::StreamingToolCallTracker` New `pub struct` next to the existing `parse_tool_calls`. A small state machine that buffers streamed text chunks and extracts complete `<tool_call>{json}</tool_call>` blocks on the fly. When `active=false` (no tools in the request) it collapses to a single-allocation passthrough with zero parsing cost. Invariants verified by 8 new unit tests: - inactive=false → pure passthrough - single complete tag in one chunk - tag split across N chunks → reassembled - text before AND after tags → both visible - invalid JSON inside tag → preserved as visible (no silent loss) - unclosed tag at flush → buffered prefix emitted as visible - UTF-8 char-boundary safety on tail flush - multiple calls with text between → indices tracked correctly 2. `chat::chat_completions_stream` wiring - Pass `req.tools.as_deref()` (not None) to `prepare_chat_prompt_with_thinking` so the chat template renders the tool spec the model recognises. - Construct a `StreamingToolCallTracker` keyed off `stream_includes_tools`. - On every chunk: route the reasoning-tracker's visible text through the tool tracker; emit `ToolCallDelta` SSE events for each completed call; emit content delta for the surviving visible text. - Defer `finish_reason` until after the tool tracker has drained, so we report `"tool_calls"` when the response actually contained any. - Final flush drains both trackers so no tokens vanish silently. 3. `chat_template::normalize_tool_call_for_template` New `pub fn` that walks a tool-call JSON value and (a) hoists `function.{name,arguments}` to the top level — Qwen's `chat_template.jinja` references `tool_call.name` and `tool_call.arguments` directly, not the OpenAI-nested shape; (b) parses string-encoded `arguments` to a JSON value, fixing the `cannot convert value into pairs (in chat:120)` minijinja error that crashed multi-turn conversations carrying assistant tool_calls in their history. Verified by 4 new unit tests covering OpenAI shape, Qwen-flat shape (no-op), unparseable-string arguments (kept as string), and non-object inputs (no-op). `convert_messages` in the chat route calls the normaliser per tool call so both the streaming and non-streaming paths get the fix. Test impact: - higgs-models: 242 passed (no change) - higgs-engine: 338 passed (+27 tracker tests, +4 normalize tests, 25 ignored) - higgs: 457 passed (no regression on existing convert_messages tests) - cargo clippy -Dwarnings: clean across all three crates
…bind
Qwen's chat_template.jinja:107-108 rebinds the loop variable:
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
After this rebind, line 120's `tool_call.arguments|items` walks into
the ORIGINAL `function.arguments` JSON-encoded string — bypassing the
top-level `arguments` we just normalised. minijinja raises "cannot
convert value into pairs" and the request 500s.
Two-part fix:
1. `normalize_tool_call_for_template` now normalises BOTH the hoisted
top-level `arguments` AND the still-nested `function.arguments`.
Pulled the parse-and-coerce logic into a private helper
`normalize_arguments_value` so it's applied identically at both
sites (DRY).
2. The coercion now defends against every non-mapping shape — null,
bool, number, array, string-that-fails-to-parse — by replacing the
value with an empty `{}`. A warn is logged so pathological shapes
stay visible. Previously only string-parsing was attempted and any
other shape leaked through to the template.
New invariant test (`normalize_handles_qwen_rebind_to_function`) pins
both `tool_call.arguments` AND `tool_call.function.arguments` to be
mappings after normalize, so future edits can't regress the rebind
path silently. Five additional coercion tests cover null / array /
number / unparseable-string / array-as-string cases.
Verified by manual smoke: `nanobot agent -l -m "Use your tools to list
files in /tmp"` previously 500'd with the template error; now returns
a real directory listing streamed back through `StreamingToolCallTracker`.
Test count: chat_template 36 (was 31, +5 net for the new shapes), all
green. Clippy `-Dwarnings` clean across higgs / higgs-engine / higgs-models.
Also: cargo fmt sweep on the new code that landed unformatted in the
prior commit; pure layout, no behaviour change.
`req.tools.as_deref()` returns `Some(&[])` when the request carries an
empty `tools` array. The chat-template renderer treats that as `tools`
being *defined* in the Jinja context (just empty), versus `None` which
omits the key entirely. Templates branching on `{% if tools is defined %}`
vs `{% if tools %}` see different state.
Convert empty slices to `None` before handing them to
`prepare_chat_prompt_with_thinking`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses CRITICAL finding from the closed upstream PR panbanda#63 review: a model that emits `<tool_call>` and never produces a matching `</tool_call>` would grow `StreamingToolCallTracker::buffer` until OOM — a model-controlled DoS vector. On overflow we abandon the parse, surface `<tool_call>` plus the buffered bytes as visible content (preserving the "never silently drop tokens" invariant), and reset state so a subsequent well-formed tool call in the same stream still parses. New test `streaming_unbounded_buffer_capped_and_recovers` pushes MAX_INSIDE_TOOL_CALL_BYTES + 1 inside an unclosed tag followed by a valid call, and asserts both that the overflow is flushed verbatim and that the post-overflow call still parses (completed_count == 1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… coercion Qwen3.5/3.6 models' chat_template.jinja instructs them to emit `<tool_call><function=NAME><parameter=KEY>value</parameter></function></tool_call>`, but the existing parser only decodes the older JSON-in-`<tool_call>` shape, so tool calls leaked into `content` with `finish_reason: stop` and the OpenAI streaming/non-streaming clients saw nothing. Add a second parse path that dispatches on the block's first token (`<function=` → XML, else → JSON), reusing the existing `StreamingToolCallTracker` boundary detection unchanged. XML values are raw strings, so coerce them through a `ToolSchema` built from the request's `tools` array (integer/number/boolean/object/array → typed JSON, string-typed `"123"` stays a string), falling back to best-effort JSON parsing when no schema is declared. - `parse_tool_calls` and `StreamingToolCallTracker::new` now take an `Option<&ToolSchema>` / `Option<ToolSchema>`; the streaming path captures it from `req.tools` before the `async_stream!` block, the non-streaming path threads it at the call site - Backward-compatible: pre-existing 28 JSON tests still pass; one new test exercises a mixed JSON+XML stream Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`ModelArgs::head_dim()` always computed `hidden_size / num_attention_heads`, ignoring an explicit `head_dim` in config.json. Models that set a head_dim differing from that ratio (e.g. MiniCPM5-1B: head_dim=128 while 1536/16=96) then loaded but produced garbage — RoPE, the 1/sqrt(head_dim) scale, and head reshaping all used the wrong dimension. Add `head_dim_override: Option<i32>` (serde `head_dim`); `head_dim()` and `checked_head_dim()` return it when present, else the existing ratio. Regression-safe: every other Llama-family config omits `head_dim`, so the field is None and behavior is unchanged. Verified live — MiniCPM5 went from "I cannot answer: 1." to "Paris is the capital of France." Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
decode_tokens used `skip_special_tokens=true`, which strips ALL special tokens — including content-bearing ones. Models that encode tool-call markup as special tokens (e.g. MiniCPM5: `<function>`/`<param>` are token ids 18-21) had the structure deleted before the tool parser could see it, so tool calls silently vanished into empty content. Decode with `skip_special_tokens=false` and instead strip a precomputed set of control tokens: the EOS ids plus the `<|…|>` chat-control delimiters and the classic sentinels (`<s>`, `</s>`, `<unk>`, …). Content tokens like `<think>`, `<tool_call>`, `<function>` are preserved. Plain text has no special tokens and decodes identically, so normal responses are unaffected. Verified: MiniCPM5 tool calls now surface as structured tool_calls; Qwen3.5 regression-clean (its `<|im_end|>` no longer leaks into content). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The custom `tojson` filter took only the piped value, so HF templates that call `tojson(ensure_ascii=false)` (e.g. MiniCPM5 at chat:6) failed prompt rendering with "too many arguments" and 500'd any tools-bearing request. Accept (and ignore) trailing kwargs via `minijinja::value::Kwargs`. `serde_json::to_string` already emits UTF-8, matching ensure_ascii=false, so the value is a no-op. Templates calling `tojson` with no args are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MiniCPM5 emits tool calls as `<function name="NAME"><param name="KEY">VALUE
</param></function>` — no `<tool_call>` wrapper, attribute-named, with optional
`<![CDATA[…]]>` values. Neither the JSON nor the Qwen `<function=` XML parser
handled it, so calls leaked into content.
Add a third parse path, dispatched on content shape (a top-level `<function `
opener vs the existing `<tool_call>` wrapper):
- `parse_minicpm_function` extracts the name attr and `<param>` entries, with
CDATA-aware value extraction; `minicpm_function_end` finds the closing
`</function>` while skipping CDATA spans (so a literal `</function>` inside a
value can't end the block early).
- The streaming tracker's `inside_tool_call: bool` becomes
`enum Inside { None, ToolCall, Function }`; the ToolCall arm is unchanged.
- Reuses ToolSchema/coerce_param_value for typed argument coercion.
6 new tests (single, multi, CDATA-with-literal-</function>, schema coercion,
no-param, streaming split mid-opener+mid-CDATA); all existing JSON/Qwen tests
still pass. Verified live e2e on MiniCPM5-1B (non-streaming + streaming).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR adds multi-shape tool-call parsing with schema-driven coercion and streaming safety, normalizes tool-call JSON for templates, preserves content-bearing markup while filtering control/sentinel tokens during decoding, adds template-aware tojson kwargs, expands thinking-detection heuristics, adds an explicit head_dim override, and integrates these across streaming and non-streaming chat handlers. ChangesTool-call ecosystem and chat integration
Sequence Diagram(s)sequenceDiagram
participant Client
participant ChatHandler
participant PromptBuilder
participant ToolTracker
participant SSE
Client->>ChatHandler: request with tools
ChatHandler->>ToolTracker: build schema from tools
PromptBuilder->>PromptBuilder: render prompt with tool definitions
ChatHandler->>ChatHandler: start model streaming
loop for each chunk
ChatHandler->>ToolTracker: feed chunk visible text
ToolTracker->>ToolTracker: extract tool-call blocks
ToolTracker->>SSE: emit ToolCallDelta for each tool call
ToolTracker->>SSE: emit ContentDelta for remaining visible text
end
ChatHandler->>ToolTracker: drain at end
ToolTracker->>SSE: flush buffered tool-call deltas
ToolTracker->>ChatHandler: report tool_calls_present
ChatHandler->>SSE: override finish_reason to tool_calls
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
📋 Motivation & root-cause writeup (why each of the 4 changes was needed, with evidence): #178 |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/higgs/src/routes/chat.rs (1)
270-279:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNormalize empty
toolsthe same way in both response modes.At Line 319, non-streaming treats
tools: []as tool-enabled (is_some()), while streaming only enables tool parsing for non-empty lists. The same request can therefore yield parsedtool_calls/finish_reason = "tool_calls"in non-streaming but plain text in streaming.Suggested fix
- let tools = req.tools.as_deref(); + let tools = req.tools.as_deref().filter(|t| !t.is_empty()); @@ - let has_tools = req.tools.is_some(); + let has_tools = tools.is_some();- let tool_schema = higgs_engine::tool_parser::ToolSchema::from_tools(req.tools.as_deref()); + let prompt_tools = req.tools.as_deref().filter(|t| !t.is_empty()); + let tool_schema = higgs_engine::tool_parser::ToolSchema::from_tools(prompt_tools); @@ - let prompt_tools = req - .tools - .as_deref() - .and_then(|t| if t.is_empty() { None } else { Some(t) }); let mut prompt_tokens = engine .prepare_chat_prompt_with_thinking(&messages, prompt_tools, thinking_enabled_stream)Also applies to: 319-352, 421-425, 461-466
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/higgs/src/routes/chat.rs` around lines 270 - 279, The code treats an empty tools list differently between streaming and non-streaming paths because tools is taken as Some(empty) in one path but not the other; normalize by converting req.tools.as_deref() into None for empty slices before any downstream checks. Replace uses of let tools = req.tools.as_deref() with a normalized option (e.g., filter out empty slices) so downstream logic like prepare_chat_prompt_with_thinking(..., tools, ...) and any is_some() checks that enable tool parsing behave identically in both streaming and non-streaming branches; apply the same normalization in all places where req.tools.as_deref() or tool existence is checked (including the streaming parsing/finish_reason logic and non-streaming response handling).
🧹 Nitpick comments (1)
crates/higgs-engine/src/tool_parser.rs (1)
497-500: ⚡ Quick winDocument the new public fields.
StreamingToolOutputis newly public, butvisibleandnew_tool_callsdo not have field-level docs.Suggested fix
pub struct StreamingToolOutput { + /// Text that should be forwarded to the client as normal assistant content. pub visible: String, + /// Tool calls that became complete while processing the current chunk. pub new_tool_calls: Vec<ParsedToolCall>, }As per coding guidelines, "Add doc comments on public structs/fields in Rust when changing user-facing behavior".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/higgs-engine/src/tool_parser.rs` around lines 497 - 500, StreamingToolOutput is now public but its public fields lack documentation; add doc comments for the struct and for each public field—describe the purpose of StreamingToolOutput and then annotate visible (String) to explain what the string contains/represents (e.g., aggregated or incremental visible output to the user) and new_tool_calls (Vec<ParsedToolCall>) to explain it contains parsed tool calls discovered during streaming and their semantics/order; update the doc comments above the pub struct StreamingToolOutput and above pub visible and pub new_tool_calls accordingly so consumers understand expected contents and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/higgs-engine/src/chat_template.rs`:
- Around line 186-192: The doc comment describing argument parsing is out of
date: normalize_arguments_value() now coerces any parsed non-object (including
strings that parse to non-object JSON or unparseable strings) to an empty object
`{}`; update the public-facing comments that currently claim unparseable strings
are left untouched to instead state that after normalization non-object results
are converted to `{}`. Edit the doc block around the earlier description (and
the similar block later in the file) to mention normalize_arguments_value()'s
coercion behavior and the new contract so callers do not expect the original
string to survive normalization.
In `@crates/higgs-engine/src/tool_parser.rs`:
- Around line 409-420: parse_minicpm_function currently searches for NAME_ATTR
across the whole block which can match attributes in nested tags; restrict the
search to the opening <function ...> tag only by first locating the opener's end
('>') from the start of block and then searching for NAME_ATTR only within that
slice. In other words, find the index of the first '>' after the opening
'<function', take the substring representing the opener, then locate NAME_ATTR,
extract the quoted name from that opener substring (validating quotes and
non-empty), and fall back to returning None if not present or malformed; update
references in parse_minicpm_function to use this opener-limited search instead
of block.find(NAME_ATTR).
- Around line 249-252: The match arm conflates ParamType::Integer and
ParamType::Number by using parsed_if(Value::is_number), allowing floats for
integers; change the arms so ParamType::Integer uses a stricter check (e.g.,
replace parsed_if(Value::is_number) in the Integer arm with parsed_if(|v|
v.is_i64() || v.is_u64() || (v.is_f64() && v.as_f64().map(|f| f.fract() ==
0.0).unwrap_or(false)))) while keeping ParamType::Number using
parsed_if(Value::is_number); update the match cases around ParamType::Integer
and ParamType::Number in the code near function parsed_if / as_string to enforce
integer-only values.
---
Outside diff comments:
In `@crates/higgs/src/routes/chat.rs`:
- Around line 270-279: The code treats an empty tools list differently between
streaming and non-streaming paths because tools is taken as Some(empty) in one
path but not the other; normalize by converting req.tools.as_deref() into None
for empty slices before any downstream checks. Replace uses of let tools =
req.tools.as_deref() with a normalized option (e.g., filter out empty slices) so
downstream logic like prepare_chat_prompt_with_thinking(..., tools, ...) and any
is_some() checks that enable tool parsing behave identically in both streaming
and non-streaming branches; apply the same normalization in all places where
req.tools.as_deref() or tool existence is checked (including the streaming
parsing/finish_reason logic and non-streaming response handling).
---
Nitpick comments:
In `@crates/higgs-engine/src/tool_parser.rs`:
- Around line 497-500: StreamingToolOutput is now public but its public fields
lack documentation; add doc comments for the struct and for each public
field—describe the purpose of StreamingToolOutput and then annotate visible
(String) to explain what the string contains/represents (e.g., aggregated or
incremental visible output to the user) and new_tool_calls (Vec<ParsedToolCall>)
to explain it contains parsed tool calls discovered during streaming and their
semantics/order; update the doc comments above the pub struct
StreamingToolOutput and above pub visible and pub new_tool_calls accordingly so
consumers understand expected contents and behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ff6757fb-ce21-4ad3-9203-4c19bd4f81c0
📒 Files selected for processing (5)
crates/higgs-engine/src/chat_template.rscrates/higgs-engine/src/simple.rscrates/higgs-engine/src/tool_parser.rscrates/higgs-models/src/transformer.rscrates/higgs/src/routes/chat.rs
`detect_thinking_support` only recognised `model_type` qwen3_5/qwen3_5_moe, so hybrid reasoning models with a different model_type ran with thinking OFF — e.g. MiniCPM5-1B (model_type=llama) answered without its reasoning channel: fast but inaccurate (bat-and-ball at temp 0.9 → $1.05/$1.60; with thinking on → correct $0.05 with self-check). Generalise the capability check to also honour the model author's own signal — the chat template referencing `enable_thinking` (covers Qwen3.5/3.6, MiniCPM5, future reasoning models) — keeping the qwen3_5* model_type as a fallback. The caller still requires a single-token `</think>`, so a stray mention can't enable a non-reasoning model. This is capability only; the per-request default (e.g. Qwen3.6 reasons off unless asked) stays in `model_defaults_to_non_thinking` and is unaffected — panbanda#150 behaviour preserved (reasoning tests green). MiniCPM5 isn't qwen3.6, so it defaults thinking on. Verified live: MiniCPM5 auto-enables thinking (no env var), emits reasoning, answers correctly, and tool calls still work alongside it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- coerce_param_value: split `integer` from `number` — `is_number` accepts
floats, so an `integer` param wrongly coerced `3.14`. `integer` now only
accepts i64/u64 values, falling back to the raw string otherwise.
- chat (non-streaming): treat an empty `tools: []` as absent, mirroring the
streaming path — it no longer defines `tools` in the template context or
triggers tool parsing.
- docs: refresh `normalize_tool_call_for_template` (non-object arguments are
coerced to `{}`, not left untouched; dropped a stray leading line) and
document the public `StreamingToolOutput` fields.
Adds a test pinning the integer/fractional coercion contract.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parse_minicpm_function scanned the whole block for `name="…"`, so a malformed payload like `<function ><param name="city">…` was parsed as a tool call named `city` instead of being preserved verbatim — routing bad model output into the tool-execution path. Read the attribute only from the opening `<function …>` tag (before its first `>`); a missing name now yields `None` (preserved). Adds a regression test, and backticks a few identifiers in test doc comments to satisfy `clippy::doc_markdown`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Takes
openbmb/MiniCPM5-1B(MLX 4-bit) from loads-but-emits-garbage to fully working chat and tool-calling on higgs. Four self-contained commits — three are general fixes that help any model, one adds the MiniCPM tool-call format.The 4 new commits
feat(models): honor explicit head_dim—ModelArgs::head_dim()always computedhidden_size / num_attention_heads, ignoring an explicithead_diminconfig.json. MiniCPM5 setshead_dim=128while1536/16=96, so it loaded but produced garbage (wrong RoPE dimension +1/√head_dimscale). Now honors the config value. Regression-safe: every other Llama-family config omitshead_dim→None→ behaviour unchanged. General fix for any explicit-head_dimcheckpoint.fix(engine): preserve content special tokens on decode—decode_tokensusedskip_special_tokens=true, deleting content-bearing special tokens. MiniCPM5 encodes its tool-call markup (<function>/<param>, token ids 18–21) as special tokens, so calls silently vanished into empty content. Now decodes with specials kept and strips only a precomputed control set (EOS ids +<|…|>delimiters +<s>/</s>/…). Plain text has no special tokens and decodes identically, so normal responses are unaffected. Also fixes latent control-token leakage for other models.fix(chat_template): accept ensure_ascii kwarg in tojson— HF templates calltojson(ensure_ascii=false)(MiniCPM5 atchat:6); the custom filter rejected kwargs and 500'd any tools-bearing request. Now accepts (and ignores) them —serde_json::to_stringalready emits UTF-8, i.e.ensure_ascii=false. Templates callingtojsonwith no args are unaffected.feat(tool_parser): MiniCPM5tool format— a third tool-call format (attribute-named, optional<![CDATA[…]]>values, no<tool_call>wrapper), dispatched on content shape. CDATA-aware end detection (a literal</function>inside a value can't close the block early); the streaming tracker's flag becomes anInside { None, ToolCall, Function }enum (the existing arms unchanged). Reuses theToolSchemacoercion from feat(chat): streaming tool-call deltas + Qwen-friendly arg normalisation #164. 6 new unit tests.Verification (release builds)
cargo test --release— higgs-engine 258, higgs-models 457, higgs 98; 0 failed.cargo clippy/cargo fmtclean across touched crates.mlx-community/MiniCPM5-1B-4bit:"Paris is the capital of France."— was"I cannot answer: 1."before the head_dim fix){"name":"get_weather","arguments":{"city":"London"}},finish_reason: tool_calls, no markup leak — non-streaming and streaming.Qwen3.5-4B-MLX-4bit: tool calls still parse to structuredtool_calls; the decode change also removes a stray<|im_end|>that could leak into content.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Improvements