Grammar enforcement not applied when thinking is enabled (response_format + enable_thinking) #20345

@shanevcantwell

Summary

When response_format (JSON schema) is used with enable_thinking: true, grammar enforcement is completely inactive. The model generates unconstrained output. With enable_thinking: false, grammar works correctly.

This was previously raised in #12276 (closed as stale, with a request to reopen). Filing fresh with a clean reproduction on a current build that includes the autoparser (#18675).

Evidence: grammar not enforced for ANY model with thinking ON

Two models were tested with the same json_schema response_format, which requests the fields plan_summary, steps, and acceptance_criteria:

| Model | Thinking | Result with response_format | Grammar enforced? |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B | ON | 500 "Failed to parse input" (fenced JSON in error body) | No: grammar did not prevent fences, and the PEG parser rejects the output |
| Qwen3-VL-8B | ON | 200 but wrong schema (task_id, subtasks instead of requested fields) | No: grammar did not enforce the schema, but bare JSON passes the PEG parser |
| Qwen3.5-35B-A3B | OFF | 200, correct schema | Yes |

The 500 with Qwen3.5 is just the loud symptom — that model wraps JSON in markdown fences, which the PEG parser rejects. The quieter failure (Qwen3-VL) shows the real bug: grammar is not applied at all when thinking is enabled. Models that naturally produce bare JSON appear to succeed but silently ignore the schema.
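Because the quieter failure returns 200, the status code alone cannot reveal it. A minimal sketch of a client-side check that separates the cases described above, assuming the completion's content has already been extracted as a string. `classify_content` and its return labels are hypothetical names for illustration, not part of llama.cpp:

```python
import json

# Top-level fields requested by the schema in this report.
REQUIRED_FIELDS = ("plan_summary", "steps", "acceptance_criteria")


def classify_content(content, required=REQUIRED_FIELDS):
    """Classify a completion's content string (hypothetical helper).

    Returns one of:
      "fenced"          - output is wrapped in a markdown code fence
      "not_json"        - output is not parseable as JSON
      "schema_mismatch" - valid JSON, but not an object with all required fields
      "ok"              - JSON object containing every required top-level field
    """
    text = content.strip()
    if text.startswith("```"):
        return "fenced"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(data, dict) or any(key not in data for key in required):
        return "schema_mismatch"
    return "ok"


# Hypothetical samples shaped like the three observed outcomes.
fenced = '```json\n{"project": {"name": "Multi-Tenant SaaS Caching Layer"}}\n```'
wrong_keys = '{"task_id": "t1", "subtasks": []}'
conforming = '{"plan_summary": "s", "steps": [], "acceptance_criteria": []}'

print(classify_content(fenced))      # fenced
print(classify_content(wrong_keys))  # schema_mismatch
print(classify_content(conforming))  # ok
```

This only checks top-level field names; it is meant to flag the silent case quickly, not to replace grammar enforcement on the server.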

Reproduction

Build: includes #18675 (autoparser), commit d088d5b

# Start llama-server with thinking ON
llama-server --chat-template-kwargs '{"enable_thinking": true}' \
  --model Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 32768 --cont-batching --n-gpu-layers 999 --port 8081

curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen3.5-35B-A3B-UD-Q4_K_XL",
  "messages": [
    {"role": "system", "content": "You are a systems architect. Produce a task plan as JSON matching the schema provided."},
    {"role": "user", "content": "Design a caching layer for a multi-tenant SaaS application with Redis and Memcached backends."}
  ],
  "temperature": 0,
  "max_tokens": 8192,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "task_plan",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "plan_summary": {"type": "string"},
          "steps": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "step_number": {"type": "integer"},
                "description": {"type": "string"},
                "specialist": {"type": "string"}
              },
              "required": ["step_number", "description", "specialist"]
            }
          },
          "acceptance_criteria": {
            "type": "array",
            "items": {"type": "string"}
          }
        },
        "required": ["plan_summary", "steps", "acceptance_criteria"]
      }
    }
  }
}'

Result (thinking ON, Qwen3.5-35B-A3B) — 500:

{"error":{"code":500,"message":"Failed to parse input at pos 778: ```json\n{\"project\":{\"name\":\"Multi-Tenant SaaS Caching Layer\"...","type":"server_error"}}

Result (thinking OFF, same request) — correct schema:

{"finish_reason":"stop","content":"{\"plan_summary\":\"Develop a basic user authentication system.\",\"steps\":[{\"step_number\":1,...}],\"acceptance_criteria\":[...]}","usage":{"completion_tokens":69}}

Deterministic at temperature=0. Reproduced multiple times.
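To check a 200 response against the full schema rather than by eye, a small validator that covers only the keywords this reproduction uses (`type`, `properties`, `required`, `items`) is sufficient. The following is a sketch under that assumption; `validate` is a hypothetical helper and not a general JSON Schema implementation:

```python
import json

# Python types accepted for each JSON Schema "type" used in the reproduction.
TYPE_MAP = {"object": dict, "array": list, "string": str, "integer": int}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    expected = TYPE_MAP[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for "integer".
    if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if expected is dict:
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    elif expected is list and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


# Schema copied from the request above.
SCHEMA = {
    "type": "object",
    "properties": {
        "plan_summary": {"type": "string"},
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "step_number": {"type": "integer"},
                    "description": {"type": "string"},
                    "specialist": {"type": "string"},
                },
                "required": ["step_number", "description", "specialist"],
            },
        },
        "acceptance_criteria": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["plan_summary", "steps", "acceptance_criteria"],
}

# Hypothetical response content using the wrong top-level fields.
bad = json.loads('{"task_id": "t1", "subtasks": []}')
for err in validate(bad, SCHEMA):
    print(err)
```

For the hypothetical `bad` sample, this reports one missing-field error for each of the three required top-level fields.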

Environment

Related
