refactor: unify rec multi round decode mode with one-stage flag.#1000

Merged
DragonFive merged 1 commit into jd-opensource:main from LMX-xin:feat/refactor_rec_attention
Mar 12, 2026

Conversation

@LMX-xin (Collaborator) commented Mar 5, 2026

What This PR Changes

  1. Uses a single gflag to select decode mode:
    • FLAGS_enable_xattention_one_stage == true -> one-stage
    • FLAGS_enable_xattention_one_stage == false -> two-stage
  2. Removes legacy/extra mode-selection paths:
    • enable_xattention_two_stage_decode
    • enable_xattention_one_stage_decode
    • is_xattention_two_stage_decode_enabled()
  3. Removes cache-state-based mode gating:
    • No branch selection based on xattention_two_stage_decode_cache.has_value()
    • No branch selection based on aggregated two_stage_* .defined() checks

@LMX-xin (Collaborator, Author) commented Mar 5, 2026

I will rebase after #933 merges.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a significant refactoring to unify the multi-round decode modes for rec under a single flag, FLAGS_enable_xattention_one_stage. The changes are extensive, touching attention kernels, metadata builders, and the CUDA graph executor. The two modes, one-stage and two-stage decode, are now cleanly separated, with the two-stage path implementing a shared/unshared attention optimization. New components like XAttentionWorkspace and xattention_planinfo have been added to support this. Crucially, a new test has been added to compare the outputs of both decode paths, ensuring correctness of the refactoring. The implementation appears solid and consistent across the codebase. I have no high or critical severity comments.

Comment thread on xllm/core/layers/cuda/xattention_test.cpp (outdated)
@LMX-xin force-pushed the feat/refactor_rec_attention branch from 4ed51fb to 43e1cc4 on March 11, 2026 03:52
@LMX-xin marked this pull request as ready for review on March 11, 2026 06:26
JimHsiung previously approved these changes Mar 11, 2026
DragonFive previously approved these changes Mar 11, 2026
@DragonFive (Collaborator) left a comment


LGTM

magicheng0816 previously approved these changes Mar 11, 2026
Comment thread on xllm/core/runtime/rec_worker_impl.cpp (outdated)
@LMX-xin dismissed stale reviews from magicheng0816, DragonFive, and JimHsiung via b6134fc on March 11, 2026 08:14
@LMX-xin force-pushed the feat/refactor_rec_attention branch from 43e1cc4 to b6134fc on March 11, 2026 08:14
@DragonFive merged commit 10b8122 into jd-opensource:main Mar 12, 2026
13 of 39 checks passed


6 participants