Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7807 MiB):
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes, VRAM: 7807 MiB
version: 8988 (6118c04)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)
Models
gemma-4-E4B-it-Q4_K_M.gguf
mmproj-gemma-4-E4B-it-BF16.gguf
lmstudio-community version: https://huggingface.co/lmstudio-community/gemma-4-E4B-it-GGUF
Problem description & steps to reproduce
When llama-server streams a tool call response, the first SSE delta incorrectly combines name, id, type, and the opening argument fragment ({) in a single chunk. Most OpenAI-compatible clients expect the defining chunk to carry an empty arguments string, with the actual argument bytes arriving as separate subsequent fragments. As a result, clients that accumulate argument fragments lose the leading { and end up with invalid JSON, causing tool calls to execute with empty parameters.
At first I worked around this with a Python proxy, then got help from Claude Code to track down the fix in llama-server.
Analysis
The analysis below was produced with Claude Code and is heavily summarized.
Affected component
tools/server/server-task.cpp — server_task_result_cmpl_partial::update()
Observed behaviour
First SSE chunk received by client:
{
  "choices": [{
    "delta": {
      "tool_calls": [{
        "index": 0,
        "id": "R9g09d5ky0p4gQIJl6pyaJZWEj2jnaYL",
        "type": "function",
        "function": {
          "name": "query_agent",
          "arguments": "{"
        }
      }]
    }
  }]
}
Subsequent fragments (one per token):
{"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"agent_name\""}}]}}]}
{"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":":\"search\""}}]}}]}
...
A client that accumulates fragments by index gets:
arg_fragments[0] = "\"agent_name\":\"search\",..." ← missing leading "{"
Jason.decode(...) (or equivalent) fails → tool call executed with arguments: {}.
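To check whether a captured stream has this shape, something like the following could be used. It is only a sketch: the function name `find_combined_tool_call_chunks` and the placeholder id `call_0` are made up for illustration, and the input is assumed to be the already-decoded JSON payloads of the SSE chunks.

```python
def find_combined_tool_call_chunks(chunks):
    """Return (chunk position, tool-call index) for every delta that carries
    both a function name and non-empty argument text."""
    hits = []
    for pos, chunk in enumerate(chunks):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                fn = tc.get("function") or {}
                if fn.get("name") and fn.get("arguments"):
                    hits.append((pos, tc.get("index")))
    return hits


# Shortened version of the stream shown above (the id is a placeholder).
captured = [
    {"choices": [{"delta": {"tool_calls": [{
        "index": 0, "id": "call_0", "type": "function",
        "function": {"name": "query_agent", "arguments": "{"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{
        "index": 0, "function": {"arguments": "\"agent_name\""}}]}}]},
]

print(find_combined_tool_call_chunks(captured))  # [(0, 0)]
```

An empty result means no chunk mixes a name with argument text.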
Expected behaviour
First chunk (defining) — name and id only, no argument content:
{"function": {"name": "query_agent", "arguments": ""}}
Second chunk — first argument fragment:
{"function": {"arguments": "{"}}
Subsequent chunks continue accumulating until the complete JSON is assembled:
{"agent_name":"search","question":"..."} → parses correctly.
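The difference between the two streams can be reproduced with a small model of a strict client. This is a sketch under an assumption: the client treats the chunk carrying `name` as a pure header and ignores any `arguments` on it. That is my reading of how the affected clients behave, not code taken from any of them, and `accumulate_strict`, `try_parse`, and the id `call_0` are illustrative names.

```python
import json


def accumulate_strict(deltas):
    """Hypothetical strict client: the chunk that carries `name` is treated
    as a header, so any `arguments` text on it is dropped."""
    names, frags = {}, {}
    for delta in deltas:
        for tc in delta.get("tool_calls", []):
            idx = tc["index"]
            fn = tc.get("function", {})
            frags.setdefault(idx, [])
            if "name" in fn:
                names[idx] = fn["name"]  # header chunk: arguments ignored
                continue
            frags[idx].append(fn.get("arguments", ""))
    return {idx: (names.get(idx), "".join(parts)) for idx, parts in frags.items()}


def try_parse(text):
    """Return the decoded JSON value, or None if `text` is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


tail = [
    {"tool_calls": [{"index": 0, "function": {"arguments": "\"agent_name\""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": ":\"search\"}"}}]},
]
# Observed: name and "{" arrive together in the first delta.
observed = [{"tool_calls": [{"index": 0, "id": "call_0", "type": "function",
             "function": {"name": "query_agent", "arguments": "{"}}]}] + tail
# Expected: header with empty arguments, then "{" as its own fragment.
expected = [
    {"tool_calls": [{"index": 0, "id": "call_0", "type": "function",
     "function": {"name": "query_agent", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "{"}}]},
] + tail

print(try_parse(accumulate_strict(observed)[0][1]))  # None
print(try_parse(accumulate_strict(expected)[0][1]))  # {'agent_name': 'search'}
```

A client that appends `arguments` from every chunk, including the first, would assemble valid JSON from both streams, so only header-style accumulators are affected.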
Root cause
update_chat_msg() has a filter_tool_calls parameter (default false) that, when
true, activates splitting logic (lines 167–218) that:
- Emits a name-only header chunk before any arguments
- Emits argument bytes as separate argument-only fragments
However, the streaming partial-result path in server_task_result_cmpl_partial::update()
never passed filter_tool_calls = true:
// server-task.cpp:1378 — before fix
state.update_chat_msg(content, true, oaicompat_msg_diffs);
// ↑ filter_tool_calls defaults to false
With filter_tool_calls = false, line 165 simply does diffs = std::move(all_diffs),
bypassing the splitting logic entirely and emitting raw diffs from compute_diffs() which
combine name + first argument fragment in one diff.
The non-streaming final-result path (server-task.h:384) passes is_partial = false
and is unaffected — the complete arguments JSON is always valid at that point.
Fix
One-line change in tools/server/server-task.cpp:1378:
// Before
state.update_chat_msg(content, true, oaicompat_msg_diffs);
// After
state.update_chat_msg(content, true, oaicompat_msg_diffs, /* filter_tool_calls= */ true);
This works for my case, but I'm not sure whether it is safe globally.
Affected clients
Any OpenAI-compatible streaming client that:
- Accumulates function.arguments fragments by index across multiple SSE chunks
- Attempts to JSON-decode only after finish_reason: tool_calls
Notably: req_llm (Elixir), LangChain, LlamaIndex, and similar agent frameworks.
Workaround (pre-fix)
A proxy can split the combined chunk before forwarding:
def _split_defining_tool_call_chunks(chunk):
    # Split any tool-call delta that carries both a name and argument text
    # into a name-only chunk followed by an arguments-only chunk.
    extra_frags = []
    for choice in chunk.get("choices", []):
        for tc in choice.get("delta", {}).get("tool_calls", []):
            fn = tc.get("function", {})
            if fn.get("name") and fn.get("arguments"):
                args = fn["arguments"]
                fn["arguments"] = ""  # defining chunk keeps name/id/type only
                extra_frags.append({
                    "choices": [{"delta": {"tool_calls": [
                        {"index": tc["index"], "function": {"arguments": args}}
                    ]}, "index": choice.get("index", 0), "finish_reason": None}]
                })
    # The original chunk (mutated in place) goes first, then the split-off fragments.
    return [chunk] + extra_frags
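For completeness, here is a rough sketch of how a proxy might apply such a splitter to individual SSE lines. The name `rewrite_sse_data_line` and the `split_chunk` parameter are assumptions for illustration; `split_chunk` is any callable that takes a decoded chunk and returns a list of chunks, such as the function above. Event framing (the blank line after each `data:` line) is left to the caller.

```python
import json


def rewrite_sse_data_line(line, split_chunk):
    """Return the SSE lines that should replace `line`.

    Lines that are not `data:` payloads, the `[DONE]` sentinel, and payloads
    that are not JSON objects are passed through unchanged.
    """
    if not line.startswith("data:"):
        return [line]
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return [line]
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return [line]
    if not isinstance(chunk, dict):
        return [line]
    # One output line per chunk returned by the splitter.
    return ["data: " + json.dumps(c) for c in split_chunk(chunk)]
```

Usage would look like `rewrite_sse_data_line(raw_line, _split_defining_tool_call_chunks)`, writing each returned line followed by a blank line to the downstream client.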
First Bad Commit
566059a Autoparser - complete refactoring of parser architecture (#18675)
The filter_tool_calls parameter and splitting logic were added in this commit, but the streaming path (server_task_result_cmpl_partial::update,
server-task.cpp:1378) was never updated to pass true. The fix code exists and is correct — it was simply never activated.
Relevant log output
Logs
# First SSE delta — name + arguments opening brace COMBINED in one chunk:
sse tool_call delta: [{"index": 0, "id": "R9g09d5ky0p4gQIJl6pyaJZWEj2jnaYL", "type": "function", "function": {"name": "query_agent", "arguments": "{"}}]
# Subsequent deltas — argument fragments only (one per token):
sse tool_call delta: [{"index": 0, "function": {"arguments": "\"agent"}}]
sse tool_call delta: [{"index": 0, "function": {"arguments": "_"}}]
sse tool_call delta: [{"index": 0, "function": {"arguments": "name"}}]
sse tool_call delta: [{"index": 0, "function": {"arguments": "\":"}}]
sse tool_call delta: [{"index": 0, "function": {"arguments": "\"search\""}}]
sse finish_reason=tool_calls
# Client accumulates fragments by index → assembled string:
# arg_fragments[0] = "\"agent_name\":\"search\",..." ← leading "{" is lost
# JSON decode fails → tool executed with empty arguments
# Resulting tool error sent back by client:
"Invalid parameters: required :agent_name option not found, received options: []"