fix: keep partial content when reasoning block is truncated by max_tokens #47
Open
rivetphilbot wants to merge 44 commits into
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
Fixed the incorrect name.
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
This was referenced Jun 1, 2026
rivetphilbot added a commit to rivetphilbot/1Cat-vLLM that referenced this pull request on Jun 1, 2026
…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.
The non-streaming chat handler re-decodes the completion text from token ids via `extract_content_ids` after `extract_reasoning` has already produced the content string. `extract_content_ids` returns an empty list when the reasoning block was never closed -- e.g. generation hit `max_tokens` while still inside `<think>`. `tokenizer.decode([])` then yields an empty string, which overwrites the (correct, non-empty) content that `extract_reasoning` already extracted. The response goes out with empty `content` despite the model having generated coherent tokens. Only override `content` with the re-decoded text when `extract_content_ids` actually returns ids. When it returns `[]`, keep what `extract_reasoning` produced so truncated-think responses still carry their partial text. The closed-think happy path is unaffected -- `extract_content_ids` returns a non-empty list there and the re-decode proceeds as before.
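The guard described above can be sketched in isolation. This is a minimal illustration, not the vLLM handler code: `ToyTokenizer`, `finalize_content_buggy`, and `finalize_content_fixed` are hypothetical names invented here, and the only behavior borrowed from the description is that decoding an empty id list yields an empty string.

```python
from typing import Optional, Sequence


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: maps ids to words."""

    def __init__(self, vocab: dict[int, str]) -> None:
        self.vocab = vocab

    def decode(self, ids: Sequence[int]) -> str:
        # Decoding an empty id list yields "", as described above.
        return " ".join(self.vocab[i] for i in ids)


def finalize_content_buggy(
    content: Optional[str], content_ids: Sequence[int], tokenizer: ToyTokenizer
) -> Optional[str]:
    # Unconditional re-decode: an empty id list wipes out existing content.
    return tokenizer.decode(content_ids)


def finalize_content_fixed(
    content: Optional[str], content_ids: Sequence[int], tokenizer: ToyTokenizer
) -> Optional[str]:
    # Re-decode only when there are content ids; otherwise keep the content
    # that the reasoning parser already produced.
    if content_ids:
        return tokenizer.decode(content_ids)
    return content


tok = ToyTokenizer({1: "the", 2: "answer"})
truncated = finalize_content_fixed("partial thinking", [], tok)
closed = finalize_content_fixed("stale text", [1, 2], tok)
```

With an empty id list the fixed version returns the content it was handed, while the closed-think case still goes through the re-decode, which is the property the change relies on.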
dbde62d to 6235d4d
rjiangnju pushed a commit to rjiangnju/1Cat-vLLM-FP8 that referenced this pull request on Jun 5, 2026
…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.
Problem
A non-streaming chat completion can come back with an empty `content` field even though the model generated coherent tokens. It happens whenever a request runs out of token budget while still inside the `<think>` reasoning block (i.e. `max_tokens` is reached before `</think>` is emitted).

Root cause
In `OpenAIServingChat` (non-streaming path, `vllm/entrypoints/openai/chat_completion/serving.py`), after the reasoning parser runs:
1. `extract_reasoning` correctly returns the partial thinking text as `content`.
2. The handler then re-decodes from token ids via `extract_content_ids`, which returns `[]` when the reasoning block was never closed.
3. `tokenizer.decode([])` yields `""`, which overwrites the correct, non-empty `content`. The response ships empty.

Reproduced on a Qwen3 reasoning model: a prompt that asks the model to "think step by step" with a small `max_tokens` returns HTTP 200, `finish_reason: "length"`, dozens of generated tokens, and empty `content`. The same prompt with adequate `max_tokens` (so `</think>` is reached) answers correctly.

Fix
Only override `content` with the re-decoded text when `extract_content_ids` actually returns ids. When it returns `[]` (unclosed think block), keep the `content` that `extract_reasoning` already produced.

The closed-think happy path is unchanged: `extract_content_ids` returns a non-empty list there and the re-decode proceeds exactly as before.

Verification
On a V100 build serving Qwen3 with the `qwen3` reasoning parser:
- Truncated think block (`max_tokens` small): before, empty `content`; after, partial reasoning text returned.
- Closed think block (`</think>` reached): unchanged, correct answer.

Scoped to one block, 11/-4 lines, no line-ending changes.
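
The symptom checked during verification can also be detected mechanically from a parsed response. The sketch below is an assumption-laden helper, not part of this change: `looks_like_truncated_think` is a hypothetical name, the field names follow the OpenAI-style chat completion response shape (`choices[0].message.content`, `choices[0].finish_reason`, `usage.completion_tokens`), and the two sample payloads are made up for illustration rather than captured from a server.

```python
from typing import Any


def looks_like_truncated_think(response: dict[str, Any]) -> bool:
    """Flag a response that hit the length limit, generated tokens,
    and still carries no content (the symptom described above)."""
    choice = response["choices"][0]
    content = choice["message"].get("content")
    generated = response.get("usage", {}).get("completion_tokens", 0)
    return (
        choice.get("finish_reason") == "length"
        and generated > 0
        and not content
    )


# Illustrative payloads only; values are invented for the example.
before_fix = {
    "choices": [{"finish_reason": "length", "message": {"content": ""}}],
    "usage": {"completion_tokens": 64},
}
after_fix = {
    "choices": [
        {"finish_reason": "length", "message": {"content": "Let me think..."}}
    ],
    "usage": {"completion_tokens": 64},
}

flag_before = looks_like_truncated_think(before_fix)
flag_after = looks_like_truncated_think(after_fix)
```

A check like this could be run against both the small-`max_tokens` and the adequate-`max_tokens` request to confirm that only the former ever showed the symptom.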