[feat] rollout indexer replay support#1183

Open
yueming-yuan wants to merge 1 commit into radixark:main from yueming-yuan:indexer-replay

Conversation

@yueming-yuan
Collaborator

Summary

  • Add a generic IndexerReplayManager and sequential replay registration for indexer top-k streams.
  • Thread indexer_topk through SGLang rollout/session/OpenAI response handling and training data plumbing.
  • Add generic rollout indexer replay shape args without DeepSeek-V4-specific fallbacks.

Tests

  • python -m compileall miles tests/fast/rollout/generate_utils/test_indexer_replay.py tests/fast/backends/megatron_utils/test_replay_utils.py
  • uvx ruff check ... on touched files
  • uvx black --check ... on touched files
  • python -m pytest --confcutdir=tests/fast/rollout/generate_utils tests/fast/rollout/generate_utils/test_indexer_replay.py tests/fast/rollout/generate_utils/test_sample_utils.py tests/fast/rollout/generate_utils/test_openai_endpoint_utils.py -k 'not test_create_fetches_session_server_instance_id'
  • python -m pytest --confcutdir=tests/fast/backends/megatron_utils tests/fast/backends/megatron_utils/test_replay_utils.py
  • python -m pytest --confcutdir=tests/fast/utils tests/fast/utils/test_types.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements indexer replay, which captures indexer top-k decisions from the rollout engine and reuses them during training. Changes span CLI argument handling, data processing in the Megatron and SGLang backends, and sample management utilities. Key feedback identifies a critical issue: sequential replay registration fails under pipeline parallelism because the streams are not sliced per stage. In addition, the IndexerReplayManager configuration needs adjustments to support sequence parallelism and should read its thresholds from the model configuration instead of hardcoding them.

Comment on lines +30 to +33
if replay_data.shape[1] != len(replay_list):
    raise AssertionError(
        f"replay data has {replay_data.shape[1]} streams, but {len(replay_list)} modules registered replay"
    )
Contributor

high

The _register_replay_list_sequential function is incompatible with pipeline parallelism (PP > 1). replay_data (sourced from SGLang) contains indexer streams for all layers in the model, whereas replay_list only contains the modules registered on the current PP rank. This discrepancy will cause the AssertionError to trigger on any rank where the number of local indexer modules does not match the total number of streams.

To fix this, you should use the pipeline stage offset to slice the replay_data streams, similar to the logic implemented in _register_replay_list_moe.
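
A minimal sketch of the slicing this comment suggests, using plain nested lists in place of the real tensor. The helper name `slice_streams_for_stage` and the `stage_offset` and `num_local_modules` parameters are hypothetical; the actual offset logic in `_register_replay_list_moe` is not shown in this thread and may differ.

```python
def slice_streams_for_stage(replay_data, stage_offset, num_local_modules):
    """Keep only the indexer streams owned by one pipeline stage.

    replay_data is indexed as [token][stream] -> list of top-k indices,
    a stand-in for a tensor whose dimension 1 is the stream dimension.
    stage_offset and num_local_modules are hypothetical parameters.
    """
    total_streams = len(replay_data[0]) if replay_data else 0
    end = stage_offset + num_local_modules
    if stage_offset < 0 or end > total_streams:
        raise ValueError(
            f"stage needs streams [{stage_offset}, {end}) "
            f"but replay data has only {total_streams}"
        )
    # Slice the stream dimension for every token.
    return [token_streams[stage_offset:end] for token_streams in replay_data]


# Toy data: 3 tokens, 6 streams (one per indexer layer), top-k of 2.
replay_data = [
    [[t * 100 + s, t * 100 + s + 50] for s in range(6)]
    for t in range(3)
]

# Assume this rank owns 2 indexer modules starting at global stream 2.
local_data = slice_streams_for_stage(replay_data, stage_offset=2, num_local_modules=2)

# After slicing, the stream count matches the local module count,
# so a check like the one quoted above would no longer fire.
assert len(local_data[0]) == 2
```

With the streams sliced first, the existing length check can stay in place as a guard against genuinely mismatched data.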

Comment on lines +207 to +209
if_sp_region = False
enable_check_replay_result = False
replay_check_threshold = 0.7
Contributor

high

There are two issues in the IndexerReplayManager configuration:

  1. if_sp_region is set to False. If indexers are used within transformer layers (which is typical for models like DeepSeek-V3/V4) and sequence parallelism is enabled, this will cause a shape mismatch crash in _get_replay_result because the scores tensor will be sliced while the replayed top_indices will not be.
  2. replay_check_threshold is set to 0.7. This is extremely loose compared to RoutingReplayManager (0.01). A 70% allowed mismatch effectively disables the utility of the CI correctness check for indexer replay, as training behavior would likely diverge significantly with such a high mismatch rate.

Additionally, ensure these parameters are retrieved from the model configuration rather than being hardcoded to maintain consistency with repository guidelines.
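
The shape mismatch in point 1 can be illustrated with a small sketch. It assumes the sequence is split into equal contiguous chunks per rank, which is how Megatron-style sequence parallelism is usually described; the helper `slice_rows_for_sp_rank` and its parameters are hypothetical and are not taken from this PR.

```python
def slice_rows_for_sp_rank(top_indices, sp_rank, sp_size):
    """Return the rows of top_indices that belong to one sequence-parallel rank.

    top_indices is indexed as [token] -> list of top-k indices.
    Assumes equal, contiguous chunks per rank (an assumption of this sketch).
    """
    seq_len = len(top_indices)
    if seq_len % sp_size != 0:
        raise ValueError(f"sequence length {seq_len} not divisible by {sp_size}")
    chunk = seq_len // sp_size
    start = sp_rank * chunk
    return top_indices[start:start + chunk]


# Full-sequence replayed indices: 4 tokens, top-k of 2.
top_indices = [[0, 1], [2, 3], [4, 5], [6, 7]]

# With sp_size=2, each rank's scores cover only 2 of the 4 tokens.
local_score_rows = 2

# Unsliced indices have 4 rows and cannot line up with 2 rows of scores.
assert len(top_indices) != local_score_rows

# Slicing the indices the same way restores matching shapes.
local_indices = slice_rows_for_sp_rank(top_indices, sp_rank=1, sp_size=2)
assert len(local_indices) == local_score_rows
```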

Suggested change
-if_sp_region = False
-enable_check_replay_result = False
-replay_check_threshold = 0.7
+if_sp_region = config.if_sp_region
+enable_check_replay_result = False
+replay_check_threshold = config.replay_check_threshold

References
  1. Model parameters, such as index_topk, should be retrieved from the model configuration rather than being hardcoded.
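
To make the threshold concern concrete, here is a rough sketch of how a mismatch rate might be compared against the two thresholds mentioned in the review (0.7 and 0.01). The function `topk_mismatch_rate` and the metric it computes are assumptions for illustration; the check actually used by the replay managers is not shown in this thread.

```python
def topk_mismatch_rate(replayed, recomputed):
    """Fraction of replayed indices that are absent from the recomputed top-k.

    Both arguments are indexed as [token] -> list of top-k indices.
    This metric is a hypothetical stand-in for the real check.
    """
    if len(replayed) != len(recomputed):
        raise ValueError("token counts differ")
    total = 0
    missing = 0
    for rep_row, rec_row in zip(replayed, recomputed):
        rec_set = set(rec_row)
        total += len(rep_row)
        missing += sum(1 for idx in rep_row if idx not in rec_set)
    return missing / total if total else 0.0


replayed = [[0, 1, 2, 3], [4, 5, 6, 7]]
recomputed = [[0, 1, 2, 9], [4, 5, 8, 9]]

# 3 of the 8 replayed indices are missing from the recomputed sets.
rate = topk_mismatch_rate(replayed, recomputed)

passes_loose = rate <= 0.7    # threshold hardcoded in the PR
passes_strict = rate <= 0.01  # threshold cited for RoutingReplayManager
```

In this toy case more than a third of the indices disagree, yet the 0.7 threshold still accepts the result while the 0.01 threshold rejects it.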

@yueming-yuan yueming-yuan changed the title from "Add rollout indexer replay support" to "[feat] rollout indexer replay support" May 22, 2026
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>