Goal:

1. Maximize the utilization of heterogeneous resources.
2. Adapt resource allocation to keep pace with growing rollout time.
3. Improve robustness to handle dynamic resource availability.

- Intra-stage (rollout) balance
  - Rollout workload balance, starting from homogeneous rollout instances.
    - [ ] Round-robin request assignment.
    - [ ] Per-sample tracking and workload balance (need to check whether this affects training progress).
    - [ ] Decouple the data plane and the control plane (replicate requests to all rollout instances and send control messages during rollout).
- Inter-stage (rollout vs. training) balance
  - Rollout buffer zone
    - [ ] Estimate the time gap between the weight update finishing and the rollout being ready.
    - [ ] Training engine runs rollouts locally before receiving streamed batches.
  - Dynamic rollout instance allocation
    - [ ] Add rollout instances at runtime when rollout time grows.
  - Pack sequences in the rollout manager
    - [ ] Send rollout prompts to the manager in a batch.
    - [ ] Rollout manager decomposes the batch into per-sample requests and sends them to rollout instances.
    - [ ] Manage the order of rollout results and pack them into micro-batches.
    - [ ] Dynamic batch size with a lower bound (block if under; return all when asked).
- Fault tolerance
  - Handle multiple failure cases
    - [ ] Spot instance preemption
    - [ ] Failure during weight transfer
    - [ ] Failure during rollout
- Weight compression
  - [ ] Quantization + lossless compression to reduce the size of the weights before transfer.
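
The rollout-manager items above (batch submission, per-sample decomposition, round-robin assignment, ordered results, and a dynamic batch size with a lower bound) could fit together roughly as follows. This is a minimal sketch, not the planned implementation: `RolloutManager`, `submit_batch`, `on_result`, `next_micro_batch`, and the `flush` flag are all hypothetical names, and returning `None` stands in for blocking.

```python
from itertools import cycle


class RolloutManager:
    """Hypothetical rollout manager; all names here are illustrative."""

    def __init__(self, instances, min_batch):
        self._next_instance = cycle(instances)  # round-robin over instances
        self.min_batch = min_batch              # lower bound on micro-batch size
        self.assignment = {}                    # sample_id -> instance (per-sample tracking)
        self.done = {}                          # sample_id -> result, in arrival order
        self.num_submitted = 0
        self.next_to_emit = 0                   # results are released in submission order

    def submit_batch(self, prompts):
        """Decompose a batch of prompts into per-sample requests."""
        requests = []
        for prompt in prompts:
            sample_id = self.num_submitted
            self.num_submitted += 1
            instance = next(self._next_instance)
            self.assignment[sample_id] = instance
            requests.append((sample_id, instance, prompt))
        return requests

    def on_result(self, sample_id, result):
        """Record a finished sample; results may arrive out of order."""
        self.done[sample_id] = result

    def _ready_ids(self):
        # Contiguous run of finished samples starting at next_to_emit.
        ready = []
        sample_id = self.next_to_emit
        while sample_id in self.done:
            ready.append(sample_id)
            sample_id += 1
        return ready

    def next_micro_batch(self, flush=False):
        """Return all in-order results once the lower bound is met.

        Returns None when fewer than min_batch results are ready (a real
        manager would block here), unless flush=True asks for whatever
        is available.
        """
        ready = self._ready_ids()
        if not ready or (len(ready) < self.min_batch and not flush):
            return None
        self.next_to_emit += len(ready)
        return [(sample_id, self.done.pop(sample_id)) for sample_id in ready]
```

Releasing only the contiguous prefix of finished samples keeps micro-batches in submission order, at the cost of a slow sample holding back later ones. Whether strict ordering is required is part of the open question about the effect on training progress.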
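
For the weight-compression item, one possible shape is per-tensor symmetric int8 quantization followed by a lossless compressor. The sketch below uses only the Python standard library (`array` and `zlib`) on a flat list of floats. The function names, the int8 choice, and the use of zlib are assumptions made for illustration; a real system would more likely operate on framework tensors. Note that the quantization step is lossy and only the second stage is lossless.

```python
import zlib
from array import array


def compress_weights(weights):
    """Quantize floats to int8 (lossy), then zlib-compress the bytes (lossless)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale per tensor; fall back to 1.0 for an all-zero or empty tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return scale, zlib.compress(quantized.tobytes())


def decompress_weights(scale, payload):
    """Invert the lossless stage, then dequantize back to floats."""
    quantized = array("b")
    quantized.frombytes(zlib.decompress(payload))
    return [q * scale for q in quantized]
```

The receiver needs both `scale` and the compressed payload. The per-element reconstruction error is bounded by about half of `scale`, so the accuracy impact depends on the weight distribution and would need to be measured.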