I'm using the ghcr.io/z-lab/paroquant:serve container with z-lab/Qwen3.5-9B-PARO. The server starts fine and exposes an OpenAI-compatible endpoint at /v1/chat/completions.
Problem 1: Can't disable thinking mode
According to the Qwen3.5 official docs, thinking can be disabled via request body:
{
  "extra_body": {
    "chat_template_kwargs": {"enable_thinking": false}
  }
}
However, these fields are silently ignored. Docker logs show:
The following fields were present in the request but ignored: {'repeat_penalty', 'extra_body'}
It appears --reasoning-parser qwen3 is hardcoded in the container startup, which forces thinking mode server-side and prevents per-request disabling.
Problem 2: JSON path extraction fails when thinking is enabled
When thinking is forced on, the model generates long <|im_start|>think...<|im_end|> blocks before the actual response. Standard JSON path extractors (like the ones used by Calibre plugins) expect clean text and get stuck parsing thinking content.
Current docker run command
docker run --pull=always --rm -it --gpus all --ipc=host -p 8888:8000 \
  -v C:\Users\User\Documents\Clonitaditos\Qwen3.5-9B-PARO\.cache\paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve \
  --model z-lab/Qwen3.5-9B-PARO \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1
Questions
- Is there a way to disable thinking mode from the docker run command? A flag like --enable-thinking false or --reasoning-parser none?
- Is there a planned option to make extra_body.chat_template_kwargs.enable_thinking respected at runtime?
- Any workaround for now (e.g., post-processing the response to strip <|im_start|>think...<|im_end|> blocks)?
Request format we tried
{
  "model": "z-lab/Qwen3.5-9B-PARO",
  "messages": [
    {"role": "system", "content": "Translate..."},
    {"role": "user", "content": "Hello world"}
  ],
  "temperature": 0.1,
  "top_p": 0.1,
  "top_k": 50,
  "repeat_penalty": 1.05,
  "min_p": 0.0,
  "extra_body": {
    "chat_template_kwargs": {"enable_thinking": false}
  }
}
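One thing that may be worth double-checking: in the OpenAI Python client, extra_body is a client-side parameter whose contents are merged into the top level of the JSON actually sent, so a server that receives a literal "extra_body" key will usually ignore it (which matches the log line above). If the request is POSTed as raw JSON, placing chat_template_kwargs at the top level may behave differently. A sketch (endpoint assumed from the docker run above; whether the server then honors the flag still depends on the hardcoded reasoning parser):

```python
import json
import urllib.request

# Endpoint assumed from the port mapping above (-p 8888:8000).
URL = "http://localhost:8888/v1/chat/completions"

payload = {
    "model": "z-lab/Qwen3.5-9B-PARO",
    "messages": [
        {"role": "system", "content": "Translate..."},
        {"role": "user", "content": "Hello world"},
    ],
    "temperature": 0.1,
    # chat_template_kwargs at the TOP level, not wrapped in extra_body:
    # extra_body is an OpenAI Python client concept that the client unwraps
    # before sending, so a raw request places these keys directly.
    "chat_template_kwargs": {"enable_thinking": False},
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # run against a live server
```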