
# Bumblebee Roadmap

Updated 2026-04-14

## Positioning

Bumblebee provides atomic distributed-training operations. Upper-layer systems such as VERL or custom training loops compose those operations instead of being forced into a monolithic framework.

## Phase 1: Config Unification + Legacy Cleanup ✅

- Shared `ParallelConfig` / `OptimizerConfig` aligned with VERL defaults
- `BBConfig` / `BridgeConfig` moved under their respective backend directories
- Removed `SessionConfig` + `bb/session.py` + `bb/helpers.py` (~1100 lines)
- Added `register_model` / `register_runtime` APIs (usage sketch after this list)
- Introduced per-impl typed `ImplConfig` registration
- `forward_backward` now returns a typed `ForwardResult`
- Unified the `loss_fn` signature across both backends
- Enforced zero `runtime/` dependency on `benchmarks/`
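
A minimal sketch of how the Phase 1 pieces fit together. `register_model` and the per-impl typed config are named in this roadmap; the import path and keyword names (`impl_config`, `builder`) are illustrative assumptions, not the finalized signature:

```python
from dataclasses import dataclass

from bb import register_model  # import path is an assumption


@dataclass
class Qwen3MoeImplConfig:
    """Per-impl typed ImplConfig (fields are illustrative)."""
    num_experts: int = 64
    moe_router_topk: int = 8


def build_qwen3_moe(impl_cfg: Qwen3MoeImplConfig, parallel_cfg):
    """Construct the model under the given parallel layout (stub)."""
    raise NotImplementedError


# Keyword names below are hypothetical; only register_model itself
# is confirmed by this roadmap.
register_model(
    "qwen3_moe",
    impl_config=Qwen3MoeImplConfig,
    builder=build_qwen3_moe,
)
```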

## Phase 2: GitHub Public Surface Cleanup (Current)

Goal: bring public docs, examples, package metadata, and real code interfaces back into alignment.

| Item | Status |
| --- | --- |
| Align README / guides with the typed runtime contract | In progress |
| Make `register_model(...)` the single recommended registration path | In progress |
| Align `pyproject.toml`, README, and installation path | In progress |
| Make `build_model()` the default documented public path (sketch below) | In progress |
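
For orientation, this is the kind of single-call entry point the `build_model()` row points at; everything about this signature beyond the function name is a guess:

```python
from bb import build_model, ParallelConfig  # import path is an assumption

# Parameter and field names here are illustrative, not the documented API.
handle = build_model(
    "qwen3_moe",  # a name previously registered via register_model(...)
    parallel=ParallelConfig(tensor_model_parallel_size=2),
)
```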

## Phase 3: VERL Integration

Goal: a VERL PPO loop can train through Bumblebee's runtime atomics.

Bumblebee already covers the core operations required by VERL actor/critic workers (a composition sketch follows the table):

| VERL call site | Bumblebee equivalent | Status |
| --- | --- | --- |
| `forward_backward_func` | `rt.forward_backward(handle, data, loss_fn)` | ✅ |
| `optimizer.step()` | `rt.optimizer_step(handle) -> (ok, gn, nz)` | ✅ |
| `optimizer.zero_grad()` + `zero_grad_buffer()` | `rt.zero_grad(handle)` | ✅ |
| `lr_scheduler.step()` | `rt.lr_scheduler_step(handle)` | ✅ |
| load/offload model | `rt.to(handle, "cuda"/"cpu")` | ✅ |
| `train()` / `eval()` | `rt.train_mode(handle)` / `rt.eval_mode(handle)` | ✅ |
| `export_weights` | `rt.export_weights(handle)` | ✅ |
| save/load checkpoint | `rt.save_checkpoint(handle, path)` / `rt.load_checkpoint(handle, path)` | ✅ |
| `is_mp_src_rank` | `rt.is_mp_src_rank_with_outputs(handle)` | ✅ |
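
Composed, one actor update on the VERL side looks roughly like this. The `rt.*` call shapes come straight from the table above; the surrounding function, `ppo_loss_fn`, and the logging helper are assumed context:

```python
def actor_update_step(rt, handle, data, ppo_loss_fn):
    """One PPO actor step built only from Bumblebee runtime atomics."""
    rt.train_mode(handle)
    rt.zero_grad(handle)

    # Returns a typed ForwardResult (per Phase 1)
    result = rt.forward_backward(handle, data, ppo_loss_fn)

    ok, grad_norm, num_zeros = rt.optimizer_step(handle)
    if ok:
        rt.lr_scheduler_step(handle)

    # Only model-parallel source ranks hold real outputs
    if rt.is_mp_src_rank_with_outputs(handle):
        log_metrics(result, grad_norm)  # hypothetical logging helper
    return result
```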

Remaining gaps:

| Gap | Priority | Notes |
| --- | --- | --- |
| `grad_offload` | P1 | VERL supports param / optimizer / grad offload; BB still lacks grad offload |
| Router replay | P1 | Router `forward` should accept pre-recorded routing decisions (sketch below) |
| Distributed checkpointing | P2 | MC-style reshardable optimizer state |
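
To make the router-replay gap concrete, here is one possible shape for it; this is purely an assumption about what "accept pre-recorded routing decisions" could mean, not a planned interface:

```python
import torch


def router_forward(hidden, gate_weight, top_k=8, replay=None):
    """MoE router that can replay pre-recorded decisions.

    `replay` is a hypothetical (indices, weights) pair captured on an
    earlier forward; when given, gating is skipped so the exact same
    experts fire again (e.g. a recompute matching the rollout pass).
    """
    if replay is not None:
        return replay
    logits = hidden @ gate_weight.t()          # (tokens, n_experts)
    probs = logits.softmax(dim=-1)
    weights, indices = torch.topk(probs, top_k, dim=-1)
    return indices, weights / weights.sum(-1, keepdim=True)
```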

VERL-side adaptation (not implemented inside BB; shim sketch after this list):

- Config mapping: `McoreEngineConfig` -> `BBConfig` / `BridgeConfig`
- Data conversion: `TensorDict` -> BB data iterator
- Loss passthrough: VERL's `loss_fn(model_output, data, dp_group=None)` can be passed directly (`dp_group` is unused on the BB side)
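
A sketch of that shim. `McoreEngineConfig`, `BBConfig`, and `TensorDict` are the class names from the list above, but every field accessed here is an illustrative assumption:

```python
from bb import BBConfig  # import path is an assumption


def mcore_to_bb_config(mc_cfg):
    """Map VERL's McoreEngineConfig onto BBConfig (field names assumed)."""
    return BBConfig(
        tensor_model_parallel_size=mc_cfg.tensor_model_parallel_size,
        pipeline_model_parallel_size=mc_cfg.pipeline_model_parallel_size,
    )


def tensordict_iter(td, micro_batch_size):
    """Yield micro-batches from a TensorDict as plain dicts for BB."""
    for start in range(0, td.batch_size[0], micro_batch_size):
        yield {k: v[start : start + micro_batch_size] for k, v in td.items()}
```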

## Phase 4: Shared Infrastructure Priorities

Before continuing model-specific feature work, prioritize the shared capabilities that have the most impact on a first real training run.

| Feature | Priority | Why first |
| --- | --- | --- |
| Resume | P0 | Still missing in 3 models; directly affects real training recovery |
| Offload | P0 | Still missing in 3 models; directly affects RLHF and memory-constrained scenarios |
| THD | P1 | Strongly tied to PackedBatch and VERL integration (layout sketch below) |
| THD + CP | P1 | Important combination, but narrower than plain THD |
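
For readers new to the term: THD is the packed variable-length layout `(total_tokens, heads, dim)` with no batch or padding dimensions, delimited by cumulative sequence lengths. A small illustration (the PackedBatch tie-in is the roadmap's; the tensor names here are mine):

```python
import torch

seq_lens = torch.tensor([5, 3, 8])              # three variable-length sequences
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)  # [0, 5, 8, 16] sequence boundaries

n_heads, d_head = 16, 128
total_tokens = int(seq_lens.sum())
q = torch.randn(total_tokens, n_heads, d_head)  # THD: no padding materialized

# Attention kernels then take (q, k, v, cu_seqlens, max_seqlen), so no
# compute is spent on padding tokens.
```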

## Phase 5: BB ↔ Bridge Accuracy Alignment

Goal: under the same configuration, BB and Bridge should produce matching loss / grad_norm / speed behavior (a comparison-harness sketch follows the table).

| Step | Notes |
| --- | --- |
| Fixed-seed step-by-step comparison | loss diff < 1e-5, grad_norm diff < 1e-3 |
| Difference debugging | bisect divergence across `forward` -> `backward` -> `optimizer_step` |
| Speed alignment | tok/s/gpu gap between BB and Bridge under 2% |
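
A sketch of that fixed-seed harness. The `rt.*` calls and thresholds come from this document; the `ForwardResult.loss` field name, handle setup, and batch source are assumptions:

```python
import torch


def compare_step_by_step(bb_rt, br_rt, bb_h, br_h, batches, loss_fn, steps=50):
    """Drive both runtimes on identical data and assert they track."""
    torch.manual_seed(1234)  # identical seed for both runtimes
    for step, data in enumerate(batches):
        if step >= steps:
            break
        r_bb = bb_rt.forward_backward(bb_h, data, loss_fn)
        r_br = br_rt.forward_backward(br_h, data, loss_fn)
        _, gn_bb, _ = bb_rt.optimizer_step(bb_h)
        _, gn_br, _ = br_rt.optimizer_step(br_h)
        assert abs(r_bb.loss - r_br.loss) < 1e-5, f"loss diverged at step {step}"
        assert abs(gn_bb - gn_br) < 1e-3, f"grad_norm diverged at step {step}"
        bb_rt.zero_grad(bb_h)
        br_rt.zero_grad(br_h)
```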

## Phase 6: New Models / New Primitives

### Qwen3.5

If the architecture stays close to Qwen3, most code should be reusable, leaving roughly 300 lines of new code:

- `config.py` + `model.py` + `checkpoint.py` + `protocol.py`
- `register_model("qwen3_5_moe", ...)`
- Accuracy + speed alignment

### DeepSeek V3.2

The architecture differs more substantially (MLA rather than GQA), so new primitives are required (an MLA sketch follows this list):

- MLA (Multi-head Latent Attention) module
- Auxiliary-loss-free MoE router
- Multi-Token Prediction (MTP)
- Roughly 800 lines of new code
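
To show why MLA needs a new primitive rather than a GQA tweak: keys and values are reconstructed from a shared low-rank latent, and that latent is what gets cached. A heavily simplified sketch (real MLA also compresses queries and carries a decoupled RoPE branch, both omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLASketch(nn.Module):
    """Simplified Multi-head Latent Attention (no RoPE, no query compression)."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # shared KV latent
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)  # (b, t, d_latent): the only tensor cached at inference
        k = self.k_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```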

## Quality Bar

| Area | Requirement |
| --- | --- |
| Accuracy | Every model must align step-by-step against `BridgeRuntime` |
| Speed | Per-configuration tok/s/gpu should stay within 2% of Bridge |
| Tests | GPU integration tests are required (currently near-zero coverage; still P0) |
| Boundaries | `model/` must not import `runtime/`; `runtime/` must not import `benchmarks/` (test sketch below) |
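
The import boundary is mechanically checkable. A sketch of such a test; the `bb.` package prefix and on-disk paths are assumptions about the repo layout:

```python
import ast
import pathlib

# directory -> import prefix that must never appear inside it
FORBIDDEN = {"bb/model": "bb.runtime", "bb/runtime": "bb.benchmarks"}


def test_import_boundaries():
    for pkg_dir, banned in FORBIDDEN.items():
        for path in pathlib.Path(pkg_dir).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                names = []
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                for name in names:
                    assert not name.startswith(banned), f"{path} imports {name}"
```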