chat : fix Llama 3.x throwing runtime exception if response contains { #20806
Closed
jpohhhh wants to merge 1 commit into
For templates like Llama 3.3 where tool_start is "{" (no distinctive
marker), the content parser stops at any brace and the tools parser
takes over. If the model output contains braces that aren't valid tool
calls, the tools parser fails with nothing to absorb the remaining
input.
Regression introduced in 566059a (Autoparser ggml-org#18675, 2026-03-06).
Two failure modes on current master:
- Content silently truncated at first "{" (partial match)
- HTTP 500 crash (full parse throws)
Fix: wrap the existing parser in a choice() with a content-only
fallback. The tools path is tried first; when it fails, the fallback
returns everything as content. No behavior change for valid tool calls.
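The failure and the fallback can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not llama.cpp's parser API, and `tools_path`, `content_only`, and this `choice` are hypothetical stand-ins for the combinators named above. The tool-call check (a JSON object with a "name" key) is likewise a simplification.

```python
import json
from typing import Callable, Optional

# Hypothetical parser type: returns a result dict on success, None on failure.
Parser = Callable[[str], Optional[dict]]


def tools_path(text: str) -> Optional[dict]:
    """Content runs up to the first "{"; a tool call must consume the rest."""
    brace = text.find("{")
    if brace == -1:
        return {"content": text, "tool_call": None}
    try:
        call = json.loads(text[brace:])
    except json.JSONDecodeError:
        return None  # braces present, but not a valid tool call
    if not (isinstance(call, dict) and "name" in call):
        return None
    return {"content": text[:brace], "tool_call": call}


def content_only(text: str) -> Optional[dict]:
    """Fallback: treat the whole output as plain content."""
    return {"content": text, "tool_call": None}


def choice(*parsers: Parser) -> Parser:
    """Try each parser in order and return the first successful result."""
    def parse(text: str) -> Optional[dict]:
        for parser in parsers:
            result = parser(text)
            if result is not None:
                return result
        return None
    return parse


# The tools path is tried first; the content-only fallback absorbs everything else.
parse_with_fallback = choice(tools_path, content_only)
```

In this sketch, a C program such as `int main(void) { return 0; }` makes `tools_path` fail on its own, while `parse_with_fallback` returns it unchanged as content. A well-formed call still goes through the tools path.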
Unit test:
cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build --target test-chat
./build/bin/test-chat
Server repro (Llama 3.2 3B, temp=0, tools enabled):
llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --jinja
# 200 before 566059a, 500 after
curl http://localhost:8080/v1/chat/completions -d '{
"messages": [{"role": "user", "content": "Write a hello world C program. Just the code, no explanation."}],
"tools": [{"type": "function", "function": {
"name": "get_weather", "description": "Get weather",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
}}],
"temperature": 0, "max_tokens": 200
}'
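For API callers who would rather script the repro than use curl, the same request can be sent from Python's standard library. This is a minimal sketch: `build_payload` and `post_chat` are hypothetical helper names, and the base URL assumes the `llama-server` instance started above is listening on localhost:8080.

```python
import json
import urllib.error
import urllib.request


def build_payload() -> dict:
    """Same request body as the curl command: one tool, freeform C-code prompt."""
    return {
        "messages": [{
            "role": "user",
            "content": "Write a hello world C program. Just the code, no explanation.",
        }],
        "tools": [{"type": "function", "function": {
            "name": "get_weather", "description": "Get weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }}],
        "temperature": 0,
        "max_tokens": 200,
    }


def post_chat(base_url: str = "http://localhost:8080") -> int:
    """POST the payload and return the HTTP status code, including error codes."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; report the code instead
```

Calling `post_chat()` against a running server returns the status code to compare with the before/after note above.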
Member
Please stop this endless spam of PRs to imaginary issues you cannot reproduce with real-life scenarios. Open an issue with an actual model query or message history first.
Contributor
Author
See the server commands above; those are an actual model query. Going forward, I understand the request is to file an issue with the server one-liners before opening a PR, which I will do.
NOTE: I'm aware the server binary has a flag to disable tool call parsing altogether; @pwilkin mentioned it when closing #20800. I opened this PR because that flag cannot help API callers, and because this is a severe regression: every Llama 3.x request with tools throws a runtime exception if the response is freeform and contains "{" (e.g. a hello world C program). This PR's description makes that explicit. The description of #20800 theoretically left open the option to close it, on the reading that the report was server-binary-only and the contributor was amenable to disabling all tool calls with Llama 3.x.
For templates like Llama 3.3 where tool_start is "{" (no distinctive marker), the content parser stops at any brace and the tools parser takes over. If the model output contains braces that aren't valid tool calls, the tools parser fails with nothing to absorb the remaining input. For example, "write me a C program" returns a 500 unless the server is started with --skip-chat-parsing. That is fine for the server if you know a priori that Llama 3.x will be used and can afford to disable tool calls altogether. It won't work for API users.
Regression introduced in 566059a (Autoparser #18675, 2026-03-06).