chat : fix Llama 3.x throwing runtime exception if response contains { by jpohhhh · Pull Request #20806 · ggml-org/llama.cpp

jpohhhh · 2026-03-20T15:29:26Z

NOTE: I'm aware the server binary has a flag to disable tool call parsing altogether; @pwilkin mentioned it when closing #20800. I opened this PR because that flag cannot help API callers, and this is a severe regression: all Llama 3.x requests with tools throw a runtime exception if the response is freeform and contains { (e.g., a hello world C program). This PR's description clarifies that; the other theoretically left open the option to close if the report was server-binary-only and the PR contributor was amenable to disabling all tool calls with Llama 3.x.

For templates like Llama 3.3 where tool_start is "{" (no distinctive marker), the content parser stops at any brace and the tools parser takes over. If the model output contains braces that aren't valid tool calls, the tools parser fails with nothing to absorb the remaining input. For example, "write me a C program" returns a 500 unless the server was started with --skip-chat-parsing. That's fine for the server if you know a priori that Llama 3.x will be used with it and can afford to disable tool calls altogether. It won't work for API users.

Regression introduced in 566059a (Autoparser #18675, 2026-03-06).

Two failure modes on current master:

- Content silently truncated at first "{" (partial match)
- server: HTTP 500 crash (full parse throws), server API: runtime exception
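
The two failure modes can be illustrated with a toy model of the parse. This is a minimal sketch, not llama.cpp's actual parser: the function `parse_content_then_tools`, its `partial` flag, the sample string `C_PROGRAM`, and the use of Python's `json` module to stand in for the tools parser are all assumptions made for illustration.

```python
import json

# Hypothetical freeform model output that contains braces but no tool call.
C_PROGRAM = 'int main(void) { puts("hi"); return 0; }'


def parse_content_then_tools(output: str, partial: bool) -> dict:
    """Toy model: content runs up to the first '{'; the rest must be a tool call."""
    brace = output.find("{")
    if brace == -1:
        return {"content": output, "tool_calls": []}
    content, rest = output[:brace], output[brace:]
    try:
        call = json.loads(rest)  # json.JSONDecodeError is a ValueError
        if not (isinstance(call, dict) and "name" in call):
            raise ValueError("not a tool call")
    except ValueError:
        if partial:
            # Partial parse: keep what matched so far, silently drop the rest.
            return {"content": content, "tool_calls": []}
        # Full parse: nothing is left to absorb the remaining input.
        raise RuntimeError("failed to parse tool call")
    return {"content": content, "tool_calls": [call]}


truncated = parse_content_then_tools(C_PROGRAM, partial=True)
print(repr(truncated["content"]))  # only the text before the first '{' survives
```

In this model, calling the same function with `partial=False` on `C_PROGRAM` raises `RuntimeError`, which corresponds to the exception that the description above says surfaces as an HTTP 500 in the server.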

Fix: wrap the existing parser in a choice() with a content-only fallback. The tools path is tried first; when it fails, the fallback returns everything as content. No behavior change for valid tool calls.
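
The shape of this fix can be sketched with a tiny ordered-choice combinator. This is an illustration under stated assumptions, not the PR's code: `choice`, `tools_parser`, `content_only`, and `parse_llama3` are hypothetical Python stand-ins, and `json.loads` substitutes for the real tool-call grammar.

```python
import json
from typing import Callable, Optional

Result = dict
Parser = Callable[[str], Optional[Result]]


def tools_parser(output: str) -> Optional[Result]:
    """Content up to the first '{', then a JSON tool call; None on failure."""
    brace = output.find("{")
    if brace == -1:
        return None
    try:
        call = json.loads(output[brace:])
    except ValueError:  # includes json.JSONDecodeError
        return None
    if not (isinstance(call, dict) and "name" in call):
        return None
    return {"content": output[:brace], "tool_calls": [call]}


def content_only(output: str) -> Optional[Result]:
    """Fallback: accept the whole output as plain content."""
    return {"content": output, "tool_calls": []}


def choice(*alternatives: Parser) -> Parser:
    """Ordered choice: return the result of the first alternative that succeeds."""
    def parse(output: str) -> Optional[Result]:
        for alternative in alternatives:
            result = alternative(output)
            if result is not None:
                return result
        return None
    return parse


# Tools path first; when it fails, everything is returned as content.
parse_llama3 = choice(tools_parser, content_only)

print(parse_llama3('int main(void) { puts("hi"); return 0; }'))
print(parse_llama3('{"name": "get_weather", "parameters": {"city": "Paris"}}'))
```

Because the tools path is tried first, a well-formed tool call takes the same route as before; only input that the tools path rejects reaches the fallback.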

Unit test:

cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build --target test-chat
./build/bin/test-chat

Server repro (Llama 3.2 3B, temp=0, tools enabled):

llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --jinja

# 200 before 566059a, 500 after

curl http://localhost:8080/v1/chat/completions -d '{
"messages": [{"role": "user", "content": "Write a hello world C program. Just the code, no explanation."}],
"tools": [{"type": "function", "function": {
"name": "get_weather", "description": "Get weather",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
}}],
"temperature": 0, "max_tokens": 200
}'


pwilkin · 2026-03-20T16:07:13Z

Please stop this endless spam of PRs to imaginary issues you cannot reproduce with real-life scenarios. Open an issue with an actual model query or message history first.

jpohhhh · 2026-03-20T16:13:16Z

> Please stop this endless spam of PRs to imaginary issues you cannot reproduce with real-life scenarios. Open an issue with an actual model query or message history first.

See the server commands; those are an actual model query.

Moving forward, it is clear you are intending to communicate "file an issue with the server one liners before the PR", which I will do.

jpohhhh requested review from a team and pwilkin as code owners March 20, 2026 15:29

github-actions Bot added the testing Everything test related label Mar 20, 2026

pwilkin closed this Mar 20, 2026

jpohhhh wants to merge 1 commit into ggml-org:master from jpohhhh:fix-json-native-content-fallback-v2
