Summary
When response_format (JSON schema) is used with enable_thinking: true, grammar enforcement is completely inactive. The model generates unconstrained output. With enable_thinking: false, grammar works correctly.
This was previously raised in #12276 (closed as stale, with a request to reopen). Filing fresh with a clean reproduction on a current build that includes the autoparser (#18675).
Evidence: grammar not enforced for ANY model with thinking ON
Tested two models with the same json_schema response_format requesting fields plan_summary, steps, acceptance_criteria:
| Model | Thinking ON + response_format | Grammar enforced? |
|---|---|---|
| Qwen3.5-35B-A3B | 500 "Failed to parse input" (fenced JSON in error body) | No (grammar didn't prevent fences, PEG parser rejects) |
| Qwen3-VL-8B | 200 but wrong schema (task_id, subtasks instead of requested fields) | No (grammar didn't enforce schema, but bare JSON passes PEG) |
| Qwen3.5-35B-A3B (thinking OFF) | 200, correct schema | Yes |
The 500 with Qwen3.5 is just the loud symptom — that model wraps JSON in markdown fences, which the PEG parser rejects. The quieter failure (Qwen3-VL) shows the real bug: grammar is not applied at all when thinking is enabled. Models that naturally produce bare JSON appear to succeed but silently ignore the schema.
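Because the quiet failure returns a 200, it is easy to miss unless the client checks the returned content against the schema. A minimal sketch of such a check follows, assuming the message content has already been extracted from the response as a string. The function name `schema_problems` and the sample strings are hypothetical illustrations, not captured server output, and this is not a full JSON Schema validator: it only checks the required fields from the schema used in this report.

```python
import json

# Required fields taken from the json_schema in this report.
REQUIRED_TOP_LEVEL = ("plan_summary", "steps", "acceptance_criteria")
REQUIRED_STEP_KEYS = ("step_number", "description", "specialist")


def schema_problems(content: str) -> list[str]:
    """Return the ways `content` deviates from the task_plan schema (empty if none)."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        # Markdown-fenced output lands here, since a fence is not valid JSON.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = [
        f"missing required field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in data
    ]
    steps = data.get("steps")
    if isinstance(steps, list):
        for i, step in enumerate(steps):
            if not isinstance(step, dict):
                problems.append(f"steps[{i}] is not an object")
                continue
            for key in REQUIRED_STEP_KEYS:
                if key not in step:
                    problems.append(f"steps[{i}] missing required field: {key}")
    elif "steps" in data:
        problems.append("steps is not an array")
    return problems


# Illustrative inputs only (hypothetical, shaped like the failures described above).
wrong_schema = '{"task_id": "t1", "subtasks": []}'
fenced = '```json\n{"plan_summary": "x"}\n```'
conforming = json.dumps({
    "plan_summary": "x",
    "steps": [{"step_number": 1, "description": "d", "specialist": "s"}],
    "acceptance_criteria": ["a"],
})
```

With a check like this, a bare-JSON response that uses `task_id` and `subtasks` is flagged as missing all three required fields even though the HTTP status is 200.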
Reproduction
Build: includes #18675 (autoparser), commit d088d5b
# Start llama-server with thinking ON
llama-server --chat-template-kwargs '{"enable_thinking": true}' \
  --model Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 32768 --cont-batching --n-gpu-layers 999 --port 8081

# Send the request (from a second shell, once the server is up)
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-UD-Q4_K_XL",
    "messages": [
      {"role": "system", "content": "You are a systems architect. Produce a task plan as JSON matching the schema provided."},
      {"role": "user", "content": "Design a caching layer for a multi-tenant SaaS application with Redis and Memcached backends."}
    ],
    "temperature": 0,
    "max_tokens": 8192,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "task_plan",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "plan_summary": {"type": "string"},
            "steps": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "step_number": {"type": "integer"},
                  "description": {"type": "string"},
                  "specialist": {"type": "string"}
                },
                "required": ["step_number", "description", "specialist"]
              }
            },
            "acceptance_criteria": {
              "type": "array",
              "items": {"type": "string"}
            }
          },
          "required": ["plan_summary", "steps", "acceptance_criteria"]
        }
      }
    }
  }'
Result (thinking ON, Qwen3.5-35B-A3B) — 500:
{"error":{"code":500,"message":"Failed to parse input at pos 778: ```json\n{\"project\":{\"name\":\"Multi-Tenant SaaS Caching Layer\"...","type":"server_error"}}
Result (thinking OFF, same request) — correct schema:
{"finish_reason":"stop","content":"{\"plan_summary\":\"Develop a basic user authentication system.\",\"steps\":[{\"step_number\":1,...}],\"acceptance_criteria\":[...]}","usage":{"completion_tokens":69}}
Deterministic at temperature=0. Reproduced multiple times.
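For anyone who wants to compare the two cases without restarting the server, here is a rough sketch that builds the same request with thinking toggled per request. It assumes the server honors a `chat_template_kwargs` field in the request body and that it takes precedence over the `--chat-template-kwargs` startup flag; I have not verified the precedence, so treat that as an assumption. `BASE_URL`, `build_payload`, and `post_chat` are hypothetical names introduced for this sketch.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8081"  # assumption: same port as the reproduction

# Same schema as in the curl reproduction.
TASK_PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "plan_summary": {"type": "string"},
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "step_number": {"type": "integer"},
                    "description": {"type": "string"},
                    "specialist": {"type": "string"},
                },
                "required": ["step_number", "description", "specialist"],
            },
        },
        "acceptance_criteria": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["plan_summary", "steps", "acceptance_criteria"],
}


def build_payload(enable_thinking: bool) -> dict:
    """Build the reproduction request with thinking toggled per request."""
    return {
        "model": "Qwen3.5-35B-A3B-UD-Q4_K_XL",
        "messages": [
            {"role": "system", "content": "You are a systems architect. Produce a task plan as JSON matching the schema provided."},
            {"role": "user", "content": "Design a caching layer for a multi-tenant SaaS application with Redis and Memcached backends."},
        ],
        "temperature": 0,
        "max_tokens": 8192,
        # Assumption: the server accepts this per-request override.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "task_plan", "strict": True, "schema": TASK_PLAN_SCHEMA},
        },
    }


def post_chat(payload: dict, base_url: str = BASE_URL) -> tuple[int, str]:
    """POST the payload; return (HTTP status, raw body), including for 4xx/5xx."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Keep the error body: the 500 case carries the fenced JSON in its message.
        return err.code, err.read().decode("utf-8")
```

Calling `post_chat(build_payload(True))` and `post_chat(build_payload(False))` against a running server gives the status and raw body for each case, so the two can be compared side by side. The error body is returned rather than raised because the 500 response is itself part of the evidence.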
Environment
--reasoning-format deepseek
Related