Can't disable thinking/reasoning mode + JSON format issues with OpenAI-compatible API #24

@Milor123

I'm using the ghcr.io/z-lab/paroquant:serve container with z-lab/Qwen3.5-9B-PARO. The server starts fine and exposes an OpenAI-compatible endpoint at /v1/chat/completions.

Problem 1: Can't disable thinking mode

According to the Qwen3.5 official docs, thinking can be disabled via request body:

{
  "extra_body": {
    "chat_template_kwargs": {"enable_thinking": false}
  }
}

However, the server ignores these fields. The Docker logs show:

The following fields were present in the request but ignored: {'repeat_penalty', 'extra_body'}

It appears --reasoning-parser qwen3 is hardcoded in the container startup, which forces thinking mode server-side and prevents per-request disabling.
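One possible explanation for the log line above: extra_body is a keyword argument of the OpenAI Python client, which merges its contents into the top level of the JSON body before sending. It is not itself a field of the HTTP API, so a raw request that contains a literal "extra_body" key may simply be dropped. Below is a minimal sketch of hoisting those fields to the top level. It assumes the server accepts chat_template_kwargs as a top-level field, as vLLM-style servers do; that is not confirmed for this container. flatten_extra_body is a hypothetical helper name, and the rename of repeat_penalty to repetition_penalty is likewise an assumption about what the server calls that parameter.

```python
import json
from typing import Any


def flatten_extra_body(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of payload with the contents of "extra_body" moved to the top level."""
    flat = {k: v for k, v in payload.items() if k != "extra_body"}
    for key, value in payload.get("extra_body", {}).items():
        # Existing top-level keys win over keys nested in extra_body.
        flat.setdefault(key, value)
    # Assumption: the server names this sampling parameter "repetition_penalty".
    if "repeat_penalty" in flat:
        flat.setdefault("repetition_penalty", flat.pop("repeat_penalty"))
    return flat


request = {
    "model": "z-lab/Qwen3.5-9B-PARO",
    "messages": [{"role": "user", "content": "Hello world"}],
    "repeat_penalty": 1.05,
    "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
}

# The flattened body is what would be POSTed to /v1/chat/completions.
print(json.dumps(flatten_extra_body(request), indent=2))
```

If the "ignored fields" warning no longer lists these keys after flattening, the server is at least reading them; whether it then honors enable_thinking still needs to be checked against the actual response.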

Problem 2: JSON path extraction fails when thinking is enabled

When thinking is forced on, the model generates long <|im_start|>think...<|im_end|> blocks before the actual response. Standard JSON path extractors (like the ones used by Calibre plugins) expect clean text and get stuck parsing thinking content.

Current docker run command

docker run --pull=always --rm -it --gpus all --ipc=host -p 8888:8000 \
  -v C:\Users\User\Documents\Clonitaditos\Qwen3.5-9B-PARO\.cache\paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve \
  --model z-lab/Qwen3.5-9B-PARO \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1

Questions

  1. Is there a way to disable thinking mode from the docker run command, for example with a flag like --enable-thinking false or --reasoning-parser none?
  2. Is there a planned option to make extra_body.chat_template_kwargs.enable_thinking respected at runtime?
  3. Any workaround for now (e.g., post-processing the response to strip <|im_start|>think...<|im_end|> blocks)?
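For question 3, a client-side workaround could look like the sketch below. It assumes the reasoning arrives inline in the message content and is wrapped in <think>...</think>, which is the delimiter pair Qwen3-family models commonly emit. The function name strip_thinking and the default tags are assumptions, not something the container documents; the delimiters are parameters so the ones quoted in this issue can be passed instead.

```python
import re


def strip_thinking(text: str, open_tag: str = "<think>", close_tag: str = "</think>") -> str:
    """Remove every open_tag...close_tag block, plus any text before a stray close_tag."""
    block = re.compile(re.escape(open_tag) + r".*?" + re.escape(close_tag), re.DOTALL)
    cleaned = block.sub("", text)
    # Some chat templates pre-fill the opening tag, so only the closing tag
    # shows up in the output; in that case drop everything up to and including it.
    if close_tag in cleaned:
        cleaned = cleaned.split(close_tag, 1)[1]
    return cleaned.strip()


raw = "<think>\nThe user wants a translation.\n</think>\n\nHola mundo"
print(strip_thinking(raw))
```

With the delimiters written in this issue, the call would be strip_thinking(raw, "<|im_start|>think", "<|im_end|>"). This only cleans the text after the fact; the model still spends tokens on the reasoning.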

Request format we tried

{
  "model": "z-lab/Qwen3.5-9B-PARO",
  "messages": [
    {"role": "system", "content": "Translate..."},
    {"role": "user", "content": "Hello world"}
  ],
  "temperature": 0.1,
  "top_p": 0.1,
  "top_k": 50,
  "repeat_penalty": 1.05,
  "min_p": 0.0,
  "extra_body": {
    "chat_template_kwargs": {"enable_thinking": false}
  }
}
