Skip to content

fix: keep partial content when reasoning block is truncated by max_tokens#47

Open
rivetphilbot wants to merge 44 commits into
1CatAI:mainfrom
rivetphilbot:fix-empty-content-truncated-reasoning
Open

fix: keep partial content when reasoning block is truncated by max_tokens#47
rivetphilbot wants to merge 44 commits into
1CatAI:mainfrom
rivetphilbot:fix-empty-content-truncated-reasoning

Conversation

@rivetphilbot

Copy link
Copy Markdown

Problem

A non-streaming chat completion can come back with an empty content field even though the model generated coherent tokens. It happens whenever a request runs out of token budget while still inside the <think> reasoning block (i.e. max_tokens is reached before </think> is emitted).

Root cause

In OpenAIServingChat (non-streaming path, vllm/entrypoints/openai/chat_completion/serving.py), after the reasoning parser runs:

reasoning, content = reasoning_parser.extract_reasoning(parser_input_text, request=request)
if output.token_ids is not None and content is not None:
    try:
        content_ids = reasoning_parser.extract_content_ids(as_list(output.token_ids))
        content = tokenizer.decode(content_ids, skip_special_tokens=True)
    except Exception:
        pass

extract_reasoning correctly returns the partial thinking text as content. The handler then re-decodes from token ids via extract_content_ids, which returns [] when the reasoning block was never closed. tokenizer.decode([]) yields "", which overwrites the correct, non-empty content. The response ships empty.

Reproduced on a Qwen3 reasoning model: a prompt that asks the model to "think step by step" with a small max_tokens returns HTTP 200, finish_reason: "length", dozens of generated tokens — and empty content. The same prompt with adequate max_tokens (so </think> is reached) answers correctly.

Fix

Only override content with the re-decoded text when extract_content_ids actually returns ids. When it returns [] (unclosed think block), keep the content that extract_reasoning already produced.

The closed-think happy path is unchanged — extract_content_ids returns a non-empty list there and the re-decode proceeds exactly as before.

Verification

On a V100 build serving Qwen3 with the qwen3 reasoning parser:

  • Truncated-think request (max_tokens small) — before: empty content; after: partial reasoning text returned.
  • Ample-budget request (</think> reached) — unchanged, correct answer.
  • Long-context needle retrieval (20K, 51K tokens) — unaffected.

Scoped to one block, 11/-4 lines, no line-ending changes.

yangzhuxinyzx and others added 30 commits March 21, 2026 12:23
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com>
(cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com>
(cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
修复了错误的名字
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
yangzhuxinyzx and others added 13 commits May 13, 2026 19:00
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
rivetphilbot added a commit to rivetphilbot/1Cat-vLLM that referenced this pull request Jun 1, 2026
…kens

extract_content_ids() returns [] when the <think> block was never closed
(generation truncated by max_tokens mid-reasoning). Decoding an empty id
list blanked out the content the reasoning parser already produced, so the
client got an empty/cut-off response on truncation. Only override content
when there are content ids. Mirrors 1CatAI#47.
The non-streaming chat handler re-decodes the completion text from
token ids via `extract_content_ids` after `extract_reasoning` has
already produced the content string.

`extract_content_ids` returns an empty list when the reasoning block
was never closed -- e.g. generation hit `max_tokens` while still inside
`<think>`. `tokenizer.decode([])` then yields an empty string, which
overwrites the (correct, non-empty) content that `extract_reasoning`
already extracted. The response goes out with empty `content` despite
the model having generated coherent tokens.

Only override `content` with the re-decoded text when `extract_content_ids`
actually returns ids. When it returns `[]`, keep what `extract_reasoning`
produced so truncated-think responses still carry their partial text.

The closed-think happy path is unaffected -- `extract_content_ids`
returns a non-empty list there and the re-decode proceeds as before.
@rivetphilbot rivetphilbot force-pushed the fix-empty-content-truncated-reasoning branch from dbde62d to 6235d4d Compare June 1, 2026 03:46
rjiangnju pushed a commit to rjiangnju/1Cat-vLLM-FP8 that referenced this pull request Jun 5, 2026
…kens

extract_content_ids() returns [] when the <think> block was never closed
(generation truncated by max_tokens mid-reasoning). Decoding an empty id
list blanked out the content the reasoning parser already produced, so the
client got an empty/cut-off response on truncation. Only override content
when there are content ids. Mirrors 1CatAI#47.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants