[feat] rollout indexer replay support#1183
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements indexer replay functionality, enabling the capture and re-use of indexer top-k decisions from the rollout engine during training. Changes span across CLI argument handling, data processing in Megatron and SGLang backends, and sample management utilities. Key feedback identifies a critical issue where the sequential replay registration fails under pipeline parallelism due to missing stream slicing. Additionally, the configuration for the IndexerReplayManager needs adjustments to support sequence parallelism and should avoid hardcoded thresholds by utilizing model configuration parameters.
| if replay_data.shape[1] != len(replay_list): | ||
| raise AssertionError( | ||
| f"replay data has {replay_data.shape[1]} streams, but {len(replay_list)} modules registered replay" | ||
| ) |
There was a problem hiding this comment.
The _register_replay_list_sequential function is incompatible with pipeline parallelism (PP > 1). replay_data (sourced from SGLang) contains indexer streams for all layers in the model, whereas replay_list only contains the modules registered on the current PP rank. This discrepancy will cause the AssertionError to trigger on any rank where the number of local indexer modules does not match the total number of streams.
To fix this, you should use the pipeline stage offset to slice the replay_data streams, similar to the logic implemented in _register_replay_list_moe.
| if_sp_region = False | ||
| enable_check_replay_result = False | ||
| replay_check_threshold = 0.7 |
There was a problem hiding this comment.
There are two issues in the IndexerReplayManager configuration:
if_sp_regionis set toFalse. If indexers are used within transformer layers (which is typical for models like DeepSeek-V3/V4) and sequence parallelism is enabled, this will cause a shape mismatch crash in_get_replay_resultbecause thescorestensor will be sliced while the replayedtop_indiceswill not be.replay_check_thresholdis set to0.7. This is extremely loose compared toRoutingReplayManager(0.01). A 70% allowed mismatch effectively disables the utility of the CI correctness check for indexer replay, as training behavior would likely diverge significantly with such a high mismatch rate.
Additionally, ensure these parameters are retrieved from the model configuration rather than being hardcoded to maintain consistency with repository guidelines.
| if_sp_region = False | |
| enable_check_replay_result = False | |
| replay_check_threshold = 0.7 | |
| if_sp_region = config.if_sp_region | |
| enable_check_replay_result = False | |
| replay_check_threshold = config.replay_check_threshold |
References
- Model parameters, such as index_topk, should be retrieved from the model configuration rather than being hardcoded.
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
1057aad to
a40696d
Compare
Summary
IndexerReplayManagerand sequential replay registration for indexer top-k streams.indexer_topkthrough SGLang rollout/session/OpenAI response handling and training data plumbing.Tests
python -m compileall miles tests/fast/rollout/generate_utils/test_indexer_replay.py tests/fast/backends/megatron_utils/test_replay_utils.pyuvx ruff check ...on touched filesuvx black --check ...on touched filespython -m pytest --confcutdir=tests/fast/rollout/generate_utils tests/fast/rollout/generate_utils/test_indexer_replay.py tests/fast/rollout/generate_utils/test_sample_utils.py tests/fast/rollout/generate_utils/test_openai_endpoint_utils.py -k 'not test_create_fetches_session_server_instance_id'python -m pytest --confcutdir=tests/fast/backends/megatron_utils tests/fast/backends/megatron_utils/test_replay_utils.pypython -m pytest --confcutdir=tests/fast/utils tests/fast/utils/test_types.py