Grammar enforcement not applied when thinking is enabled (response_format + enable_thinking) #20345

## Summary

When `response_format` with a JSON schema is combined with `enable_thinking: true`, grammar enforcement appears to be completely inactive and the model generates unconstrained output. With `enable_thinking: false`, the same request is constrained to the schema correctly.

This was previously raised in #12276 (closed as stale, with a request to reopen). Filing fresh with a clean reproduction on a current build that includes the autoparser (#18675).

## Evidence: grammar not enforced with thinking ON (both models tested)

Tested two models with the same `json_schema` response_format requesting fields `plan_summary`, `steps`, `acceptance_criteria`:

| Model | Thinking | Result with `response_format` | Grammar enforced? |
|-------|----------|-------------------------------|-------------------|
| Qwen3.5-35B-A3B | ON | 500 "Failed to parse input" (fenced JSON in error body) | No: grammar didn't prevent the fences, so the PEG parser rejects the output |
| Qwen3-VL-8B | ON | 200 but **wrong schema** (`task_id`, `subtasks` instead of requested fields) | No: grammar didn't enforce the schema, but bare JSON passes the PEG parser |
| Qwen3.5-35B-A3B | OFF | 200, correct schema | **Yes** |

The 500 with Qwen3.5 is just the loud symptom — that model wraps JSON in markdown fences, which the PEG parser rejects. The quieter failure (Qwen3-VL) shows the real bug: **grammar is not applied at all when thinking is enabled.** Models that naturally produce bare JSON appear to succeed but silently ignore the schema.
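Because the quiet failure returns a 200, it is easy to miss without a client-side check. Below is a minimal sketch of one, assuming the message content has already been extracted from the response. The function name `classify_content` and its labels are hypothetical, and it only checks the top-level required keys from the schema in this report, not full JSON Schema validation.

```python
import json

# Top-level keys required by the schema used in this report.
REQUIRED_KEYS = {"plan_summary", "steps", "acceptance_criteria"}


def classify_content(content, required=REQUIRED_KEYS):
    """Label message content as 'fenced', 'not_json', 'wrong_schema', or 'ok'."""
    text = content.strip()
    # Markdown fences mean the grammar did not constrain the first token.
    if text.startswith("```"):
        return "fenced"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "not_json"
    # Valid JSON that is not an object, or that lacks required keys,
    # is the silent failure mode described above.
    if not isinstance(data, dict):
        return "wrong_schema"
    missing = set(required) - set(data)
    return "wrong_schema" if missing else "ok"
```

A response shaped like the Qwen3-VL one (`task_id`, `subtasks`) would be labelled `wrong_schema` even though the server returned 200.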

## Reproduction

Build: includes #18675 (autoparser), commit d088d5b74

```bash
# Terminal 1: start llama-server with thinking ON (runs in the foreground)
llama-server --chat-template-kwargs '{"enable_thinking": true}' \
  --model Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 32768 --cont-batching --n-gpu-layers 999 --port 8081

# Terminal 2: request structured output with a strict JSON schema
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen3.5-35B-A3B-UD-Q4_K_XL",
  "messages": [
    {"role": "system", "content": "You are a systems architect. Produce a task plan as JSON matching the schema provided."},
    {"role": "user", "content": "Design a caching layer for a multi-tenant SaaS application with Redis and Memcached backends."}
  ],
  "temperature": 0,
  "max_tokens": 8192,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "task_plan",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "plan_summary": {"type": "string"},
          "steps": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "step_number": {"type": "integer"},
                "description": {"type": "string"},
                "specialist": {"type": "string"}
              },
              "required": ["step_number", "description", "specialist"]
            }
          },
          "acceptance_criteria": {
            "type": "array",
            "items": {"type": "string"}
          }
        },
        "required": ["plan_summary", "steps", "acceptance_criteria"]
      }
    }
  }
}'
```

**Result (thinking ON, Qwen3.5-35B-A3B) — 500:**
```json
{"error":{"code":500,"message":"Failed to parse input at pos 778: ```json\n{\"project\":{\"name\":\"Multi-Tenant SaaS Caching Layer\"...","type":"server_error"}}
```

**Result (thinking OFF, same request) — correct schema:**
```json
{"finish_reason":"stop","content":"{\"plan_summary\":\"Develop a basic user authentication system.\",\"steps\":[{\"step_number\":1,...}],\"acceptance_criteria\":[...]}","usage":{"completion_tokens":69}}
```

Deterministic at temperature=0. Reproduced multiple times.
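
For an A/B comparison without restarting the server, the same request body can be sent twice with thinking toggled per request. This is a rough sketch, not a verified procedure: it assumes llama-server accepts a `chat_template_kwargs` field in the request body and that a per-request value takes precedence over the `--chat-template-kwargs` flag, neither of which was confirmed in this report. The helper names `with_thinking` and `post_chat` are hypothetical.

```python
import copy
import json
import urllib.error
import urllib.request


def with_thinking(payload, enabled):
    """Return a copy of the request body with enable_thinking set per request.

    Assumption: the server reads `chat_template_kwargs` from the request body.
    """
    body = copy.deepcopy(payload)
    body.setdefault("chat_template_kwargs", {})["enable_thinking"] = enabled
    return body


def post_chat(payload, url="http://localhost:8081/v1/chat/completions"):
    """POST the body and return (HTTP status, raw response text)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A 500 such as "Failed to parse input" still carries a useful body.
        return err.code, err.read().decode("utf-8")
```

With `base` holding the JSON body from the `curl` command above, `post_chat(with_thinking(base, True))` and `post_chat(with_thinking(base, False))` can then be compared side by side.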

## Environment

- llama-server built from d088d5b74 (includes autoparser #18675)
- Models tested: Qwen3.5-35B-A3B-UD-Q4_K_XL, Qwen3-VL-8B-Instruct-Q8_0
- `--reasoning-format deepseek`
- Linux, RTX 8000

## Related

- #12276 — Original feature request for grammar + reasoning (closed as stale, reopen requested)
- #19051 — Grammar fails silently when GBNF parsing errors occur (closed as stale)
- #18675 — Autoparser (merged, present in this build, doesn't resolve this)
- vLLM and SGLang both support grammar enforcement with reasoning models
