fix: keep partial content when reasoning block is truncated by max_tokens by rivetphilbot · Pull Request #47 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-19T10:14:07Z

Problem

A non-streaming chat completion can come back with an empty content field even though the model generated coherent tokens. It happens whenever a request runs out of token budget while still inside the <think> reasoning block (i.e. max_tokens is reached before </think> is emitted).

Root cause

In OpenAIServingChat (non-streaming path, vllm/entrypoints/openai/chat_completion/serving.py), after the reasoning parser runs:

reasoning, content = reasoning_parser.extract_reasoning(parser_input_text, request=request)
if output.token_ids is not None and content is not None:
    try:
        content_ids = reasoning_parser.extract_content_ids(as_list(output.token_ids))
        content = tokenizer.decode(content_ids, skip_special_tokens=True)
    except Exception:
        pass

extract_reasoning correctly returns the partial thinking text as content. The handler then re-decodes from token ids via extract_content_ids, which returns [] when the reasoning block was never closed. tokenizer.decode([]) yields "", which overwrites the correct, non-empty content. The response ships empty.

Reproduced on a Qwen3 reasoning model: a prompt that asks the model to "think step by step" with a small max_tokens returns HTTP 200, finish_reason: "length", dozens of generated tokens — and empty content. The same prompt with adequate max_tokens (so </think> is reached) answers correctly.

Fix

Only override content with the re-decoded text when extract_content_ids actually returns ids. When it returns [] (unclosed think block), keep the content that extract_reasoning already produced.

The closed-think happy path is unchanged — extract_content_ids returns a non-empty list there and the re-decode proceeds exactly as before.

Verification

On a V100 build serving Qwen3 with the qwen3 reasoning parser:

Truncated-think request (max_tokens small) — before: empty content; after: partial reasoning text returned.
Ample-budget request (</think> reached) — unchanged, correct answer.
Long-context needle retrieval (20K, 51K tokens) — unaffected.

Scoped to one block, 11/-4 lines, no line-ending changes.

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

Updated the WeChat group QR code image in the README.

修复了错误的名字

Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.

The non-streaming chat handler re-decodes the completion text from token ids via `extract_content_ids` after `extract_reasoning` has already produced the content string. `extract_content_ids` returns an empty list when the reasoning block was never closed -- e.g. generation hit `max_tokens` while still inside `<think>`. `tokenizer.decode([])` then yields an empty string, which overwrites the (correct, non-empty) content that `extract_reasoning` already extracted. The response goes out with empty `content` despite the model having generated coherent tokens. Only override `content` with the re-decoded text when `extract_content_ids` actually returns ids. When it returns `[]`, keep what `extract_reasoning` produced so truncated-think responses still carry their partial text. The closed-think happy path is unaffected -- `extract_content_ids` returns a non-empty list there and the re-decode proceeds as before.

…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.

yangzhuxinyzx and others added 30 commits March 21, 2026 12:23

[Core] Import 1Cat-vLLM-0.0.2 runtime and build system

4683901

[CI/Build] Vendor lmdeploy source for standalone builds

92c6efb

[Kernel] Add validation, examples, and benchmark assets

5262499

[Doc] Publish 1Cat-vLLM-0.0.2 release snapshot

b3b1abd

[Doc] Update rebuilt wheel download links

6fd0f8d

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)

[Bugfix] Vendor runtime Python packages for source builds

a8783b0

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)

[CI/Build][Doc] Add verified SM70 Docker runtime path

1e6c257

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

Add files via upload

f29bd45

Change WeChat group QR code image

d6c28dc

Updated the WeChat group QR code image in the README.

Update README.md

18e5223

Add files via upload

3c7a8a3

Update Dockerfile.sm70-wheel

f5d2e15

修复了错误的名字

Add files via upload

feb8402

docs: update wechat group qr code

c1dce83

docs: update WeChat group QR code

82f59c8

Release 1Cat-vLLM 0.0.3

92a785c

Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.

Merge 1CatAI main history for 0.0.3

eea9d81

Update README.md

04bb4b7

Update README.md

7a7549c

Update README.md

6276450

Update README.md

a1bf487

docs: clarify wheel runtime directory

197f1cc

[Kernel] Add V100 FA2 fp8 KV cache audits

58ebaa6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Trim V100 startup memory defaults

3b539f9

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

QRcode-update

437b358

[Core] Prepare 1.0.0 V100 release

a4daad6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Update 1.0.0 wheel install and MTP launch

761ae33

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Simplify public launch commands

0741a30

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Restore validated MTP launch profile

36536e5

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add MTP throughput note

29b73ec

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

yangzhuxinyzx and others added 13 commits May 13, 2026 19:00

[Bugfix] Restore spec proposer compatibility

0ac0632

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add TP2 MTP launch profile

05ac1a4

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Archive FP8 MTP investigation state

8b536c1

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

bf37452

[Kernel] Add SM70 FP8 MoE fast path

69749dd

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Credit flash-attention-v100

d18b16c

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Bugfix] Stabilize MTP state handling

acd2a31

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

06f7a38

docs: update WeChat group QR code to Group 3

f1a64a7

[Build] Prepare 1Cat-vLLM 1.0.1 release

42f23f6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Build] Prepare 1Cat-vLLM 1.1.0 beta release

a645fcb

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Refocus README on project overview

530ac4d

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

432f197

This was referenced Jun 1, 2026

[Bugfix] Default Qwen3 reasoning parser to prompt-has-open-think #52

Closed

ci: fix CRLF line endings in shell scripts #46

Closed

rivetphilbot mentioned this pull request Jun 1, 2026

[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes #55

Open

rivetphilbot force-pushed the fix-empty-content-truncated-reasoning branch from dbde62d to 6235d4d Compare June 1, 2026 03:46

yangzhuxinyzx force-pushed the main branch from 63b05fc to 00323f2 Compare June 15, 2026 02:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: keep partial content when reasoning block is truncated by max_tokens#47

fix: keep partial content when reasoning block is truncated by max_tokens#47
rivetphilbot wants to merge 44 commits into
1CatAI:mainfrom
rivetphilbot:fix-empty-content-truncated-reasoning

rivetphilbot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

rivetphilbot commented May 19, 2026

Problem

Root cause

Fix

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants