refactor: unify rec multi round decode mode with one-stage flag. #1000
I will rebase after #933 is merged.
Code Review
This pull request introduces a significant refactoring to unify the multi-round decode modes for rec under a single flag, FLAGS_enable_xattention_one_stage. The changes are extensive, touching attention kernels, metadata builders, and the CUDA graph executor. The two modes, one-stage and two-stage decode, are now cleanly separated, with the two-stage path implementing a shared/unshared attention optimization. New components like XAttentionWorkspace and xattention_planinfo have been added to support this. Crucially, a new test has been added to compare the outputs of both decode paths, ensuring correctness of the refactoring. The implementation appears solid and consistent across the codebase. I have no high or critical severity comments.
What This PR Changes
- FLAGS_enable_xattention_one_stage == true -> one-stage
- FLAGS_enable_xattention_one_stage == false -> two-stage
- enable_xattention_two_stage_decode
- enable_xattention_one_stage_decode
- is_xattention_two_stage_decode_enabled()
- xattention_two_stage_decode_cache.has_value()
- two_stage_* .defined() checks
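
The flag-to-mode mapping above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: `DecodeMode`, `select_decode_mode`, and `to_string` are hypothetical names, and the boolean parameter stands in for the value of `FLAGS_enable_xattention_one_stage`.

```cpp
#include <string_view>

// Hypothetical enum for illustration; the PR may represent the mode differently.
enum class DecodeMode { kOneStage, kTwoStage };

// Maps the single flag value to a decode mode.
// `enable_one_stage` stands in for FLAGS_enable_xattention_one_stage.
constexpr DecodeMode select_decode_mode(bool enable_one_stage) {
  return enable_one_stage ? DecodeMode::kOneStage : DecodeMode::kTwoStage;
}

// Human-readable name for logging or test output (hypothetical helper).
constexpr std::string_view to_string(DecodeMode mode) {
  return mode == DecodeMode::kOneStage ? "one-stage" : "two-stage";
}

// Compile-time checks that the mapping matches the description above.
static_assert(select_decode_mode(true) == DecodeMode::kOneStage);
static_assert(select_decode_mode(false) == DecodeMode::kTwoStage);
```

Deriving the mode from one flag at a single point means the two paths cannot be enabled together, which is presumably the motivation for dropping the separate per-mode switches.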