[Q3 Development Plan]: Full elastic and fault tolerant support #1

@libertyeagle

Goal:

  1. Maximize the utilization of heterogeneous resources.
  2. Adapt resource allocation to keep pace with growing rollout time.
  3. Improve robustness to handle dynamic availability.
  • Intra-stage (rollout) balance
    • Rollout workload balance, starting with homogeneous rollout instances.
      • Round-robin request assignment.
      • Per-sample tracking and workload balance (need to check whether this affects training progress).
      • Decouple the data plane from the control plane (replicate requests to all rollout instances and send control messages during rollout).
  • Inter-stage (rollout vs. training) balance
    • Rollout buffer zone
      • Estimate the time gap between when an update finishes and when rollout is ready.
      • The training engine performs rollouts locally before it receives streamed batches.
    • Dynamic rollout instance allocation
      • Add rollout instances at runtime when rollout time grows.
  • Pack sequences in rollout manager
    • Send rollout prompts to manager in a batch.
    • The rollout manager decomposes the batch into per-sample requests and sends them to rollout instances.
    • Manage the order of rollout results and pack them into micro-batches.
    • Dynamic batch size with a lower bound (block if under; return all when asked).
  • Fault tolerance
    • Handle multiple failure cases
      • Spot instance preemption
      • Failure during weight transfer
      • Failure during rollout
  • Weight compression
    • Quantization plus lossless compression to reduce the size of the weights before transfer.
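
To make the weight compression item concrete, here is a minimal sketch of one possible pipeline: symmetric int8 quantization with a single shared scale, followed by zlib as the lossless stage. The function names, the int8 scheme, the header layout, and the choice of zlib are all assumptions for illustration, not something this plan has settled on. Note that only the zlib stage is lossless; the quantization step introduces an error of at most half a quantization step per weight.

```python
import struct
import zlib

HEADER = "<dI"  # hypothetical header: float64 scale + uint32 element count


def compress_weights(weights):
    """Quantize floats to int8 with one shared scale, then zlib-compress."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = bytearray()
    for w in weights:
        q = max(-127, min(127, round(w / scale)))  # clamp to the int8 range
        quantized.append(q & 0xFF)                 # store as two's complement
    header = struct.pack(HEADER, scale, len(weights))
    return header + zlib.compress(bytes(quantized))


def decompress_weights(blob):
    """Invert compress_weights, up to quantization error."""
    scale, count = struct.unpack_from(HEADER, blob)
    raw = zlib.decompress(blob[struct.calcsize(HEADER):])
    values = struct.unpack(f"<{count}b", raw)      # signed int8 values
    return [v * scale for v in values]
```

A real implementation would more likely quantize per tensor or per channel and operate on framework tensors rather than Python lists; the round trip above only shows where the lossy and lossless stages sit relative to the transfer.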

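The rollout manager items above (batch submission, per-sample decomposition, round-robin assignment, ordered packing, and a lower bound on batch size) could fit together roughly as in the sketch below. The `RolloutManager` class, its method names, and the `flush` flag are hypothetical. Two simplifying assumptions: the manager returns `None` instead of blocking when fewer results than the lower bound are ready, and "ordered" means it only releases a contiguous prefix of results in submission order.

```python
from itertools import cycle


class RolloutManager:
    """Hypothetical sketch: split a prompt batch, assign round-robin, pack results."""

    def __init__(self, instances, min_batch_size):
        self._instances = cycle(instances)  # round-robin over rollout instances
        self._min_batch_size = min_batch_size
        self._submitted = 0                 # total samples handed out so far
        self._done = {}                     # sample id -> finished rollout result
        self._next_to_emit = 0              # first sample id not yet returned

    def submit(self, prompts):
        """Decompose a batch into (sample_id, instance, prompt) requests."""
        requests = []
        for prompt in prompts:
            requests.append((self._submitted, next(self._instances), prompt))
            self._submitted += 1
        return requests

    def complete(self, sample_id, result):
        """Record the result of one per-sample request."""
        self._done[sample_id] = result

    def next_micro_batch(self, flush=False):
        """Return finished results in submission order.

        Returns None while fewer than min_batch_size results are ready,
        unless flush=True, in which case everything ready is returned.
        """
        end = self._next_to_emit
        while end in self._done:
            end += 1
        ready = end - self._next_to_emit
        if ready == 0 or (ready < self._min_batch_size and not flush):
            return None
        batch = [self._done.pop(i) for i in range(self._next_to_emit, end)]
        self._next_to_emit = end
        return batch
```

For example, with two instances and a lower bound of 2, submitting three prompts assigns them to the instances alternately; a micro-batch is released once samples 0 and 1 have both completed, and the last sample is returned only when the caller passes `flush=True`. Per-sample workload tracking would replace the `cycle` iterator with a least-loaded choice, which this sketch does not attempt.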