Updated 2026-04-14
Bumblebee provides atomic distributed-training operations. Upper-layer systems such as VERL or custom training loops compose those operations instead of being forced into a monolithic framework.
- Shared `ParallelConfig`/`OptimizerConfig` aligned with VERL defaults
- `BBConfig`/`BridgeConfig` moved under their respective backend directories
- Removed `SessionConfig` + `bb/session.py` + `bb/helpers.py` (~1100 lines)
- Added `register_model`/`register_runtime` APIs
- Introduced per-impl typed `ImplConfig` registration
- `forward_backward` now returns typed `ForwardResult`
- Unified `loss_fn` signature across both backends
- Enforced zero runtime/ dependency on benchmarks/
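The typed contract above can be sketched roughly as follows. Only the names `forward_backward`, `ForwardResult`, and the unified `loss_fn(model_output, data, dp_group=None)` signature come from this document; the fields on `ForwardResult` and the stub bodies are illustrative assumptions, not the real implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class ForwardResult:
    """Hypothetical shape of the typed result; the real ForwardResult's
    fields may differ."""
    loss: float
    num_tokens: int
    metrics: dict = field(default_factory=dict)

def loss_fn(model_output: Any, data: dict,
            dp_group: Optional[Any] = None) -> float:
    """Unified loss_fn signature across both backends
    (dp_group is accepted but unused on the BB side)."""
    ...

def forward_backward(model: Callable, data: dict,
                     loss_fn: Callable) -> ForwardResult:
    """Toy forward_backward showing the contract: run the model,
    apply loss_fn, and return a typed ForwardResult instead of a
    raw tuple or dict."""
    output = model(data)
    loss = loss_fn(output, data)
    return ForwardResult(loss=loss, num_tokens=len(data["input_ids"]))
```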
Goal: bring public docs, examples, package metadata, and real code interfaces back into alignment.
| Item | Status |
|---|---|
| Align README / guides with the typed runtime contract | In progress |
| Make `register_model(...)` the single recommended registration path | In progress |
| Align `pyproject.toml`, README, and installation path | In progress |
| Make `build_model()` the default documented public path | In progress |
Goal: a VERL PPO loop can train through Bumblebee's runtime atomics.
Bumblebee already covers the core operations required by VERL actor/critic workers:
| VERL call site | Bumblebee equivalent | Status |
|---|---|---|
| `forward_backward_func` | `rt.forward_backward(handle, data, loss_fn)` | ✅ |
| `optimizer.step()` | `rt.optimizer_step(handle) -> (ok, gn, nz)` | ✅ |
| `optimizer.zero_grad()` + `zero_grad_buffer()` | `rt.zero_grad(handle)` | ✅ |
| `lr_scheduler.step()` | `rt.lr_scheduler_step(handle)` | ✅ |
| load/offload model | `rt.to(handle, "cuda"/"cpu")` | ✅ |
| `train()` / `eval()` | `rt.train_mode(handle)` / `rt.eval_mode(handle)` | ✅ |
| `export_weights` | `rt.export_weights(handle)` | ✅ |
| save/load checkpoint | `rt.save_checkpoint(handle, path)` / `rt.load_checkpoint(handle, path)` | ✅ |
| `is_mp_src_rank` | `rt.is_mp_src_rank_with_outputs(handle)` | ✅ |
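As a sketch of how a VERL-style update composes these atomics: the `MockRuntime` below is a stand-in, not the real runtime; only the method names come from the table above, and the `(ok, grad_norm, num_zeros)` return shape of `optimizer_step` follows the `(ok, gn, nz)` row.

```python
class MockRuntime:
    """Stand-in runtime that records the order of atomic calls."""
    def __init__(self):
        self.calls = []

    def zero_grad(self, handle):
        self.calls.append("zero_grad")

    def forward_backward(self, handle, data, loss_fn):
        self.calls.append("forward_backward")
        return loss_fn(sum(data), data)  # toy "model output"

    def optimizer_step(self, handle):
        self.calls.append("optimizer_step")
        return True, 0.5, 0  # (ok, grad_norm, num_zeros)

    def lr_scheduler_step(self, handle):
        self.calls.append("lr_scheduler_step")

def train_step(rt, handle, data, loss_fn):
    """One actor/critic update composed purely from runtime atomics,
    in the order a PPO loop would issue them."""
    rt.zero_grad(handle)
    loss = rt.forward_backward(handle, data, loss_fn)
    ok, grad_norm, _ = rt.optimizer_step(handle)
    rt.lr_scheduler_step(handle)
    return loss, grad_norm
```

The point of the sketch is that the loop owns the sequencing; Bumblebee only supplies the atomics.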
Remaining gaps:
| Gap | Priority | Notes |
|---|---|---|
| `grad_offload` | P1 | VERL supports param / optimizer / grad offload; BB still lacks grad offload |
| Router replay | P1 | Router forward should accept pre-recorded routing decisions |
| Distributed checkpointing | P2 | MC-style reshardable optimizer state |
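The router-replay gap could be closed along these lines: the first forward pass records its top-k decisions, and a later recomputation passes them back in so tokens route identically. Everything below (function names, data shapes) is a hypothetical sketch, not BB's API.

```python
def topk_route(scores, k):
    """Pick the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def route_batch(router_scores, k, replay=None):
    """Route each token to k experts.

    router_scores: per-token lists of expert scores.
    replay: pre-recorded per-token decisions; when given, they are used
    verbatim instead of recomputing top-k (the "router replay" behavior).
    The returned decisions can be stored and fed back as `replay`.
    """
    if replay is not None:
        return replay
    return [topk_route(scores, k) for scores in router_scores]
```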
VERL-side adaptation (not implemented inside BB):
- Config mapping: `McoreEngineConfig` -> `BBConfig`/`BridgeConfig`
- Data conversion: TensorDict -> BB data iterator
- Loss passthrough: VERL's `loss_fn(model_output, data, dp_group=None)` can be passed directly (`dp_group` is unused on the BB side)
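The data-conversion step might look like the following sketch, with plain dicts of lists standing in for a TensorDict; the function name and micro-batching scheme are assumptions for illustration.

```python
def tensordict_to_microbatches(batch, micro_bs):
    """Split a dict of equal-length per-key sequences (standing in for
    a TensorDict) into micro-batch dicts, yielded in order, as a BB
    data iterator would consume them."""
    n = len(next(iter(batch.values())))
    for start in range(0, n, micro_bs):
        yield {k: v[start:start + micro_bs] for k, v in batch.items()}
```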
Before continuing model-specific feature work, prioritize the shared capabilities that most affect real first use.
| Feature | Priority | Why first |
|---|---|---|
| Resume | P0 | Still missing in 3 models; directly affects real training recovery |
| Offload | P0 | Still missing in 3 models; directly affects RLHF and memory-constrained scenarios |
| THD | P1 | Strongly tied to PackedBatch and VERL integration |
| THD + CP | P1 | Important combination, but narrower than plain THD |
Goal: under the same configuration, BB and Bridge should produce matching loss / grad_norm / speed behavior.
| Step | Notes |
|---|---|
| Fixed-seed step-by-step comparison | loss diff < 1e-5, grad_norm diff < 1e-3 |
| Difference debugging | forward -> backward -> optimizer_step |
| Speed alignment | tok/s/gpu gap between BB and Bridge under 2% |
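The step-by-step comparison can be driven by a small harness like this; the per-step `(loss, grad_norm)` pairs would come from BB and Bridge runs under the same seed, and the default tolerances follow the table above. The harness itself is a sketch, not existing tooling.

```python
def compare_runs(bb_steps, bridge_steps, loss_tol=1e-5, grad_tol=1e-3):
    """Compare per-step (loss, grad_norm) pairs from two runtimes.

    Returns None when every step is within tolerance, otherwise
    (step_index, metric_name, abs_diff) for the first divergence,
    which is where forward -> backward -> optimizer_step debugging
    would start.
    """
    for step, ((l1, g1), (l2, g2)) in enumerate(zip(bb_steps, bridge_steps)):
        if abs(l1 - l2) >= loss_tol:
            return step, "loss", abs(l1 - l2)
        if abs(g1 - g2) >= grad_tol:
            return step, "grad_norm", abs(g1 - g2)
    return None
```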
If the architecture stays close to Qwen3, most code should be reusable, with ~300 lines of new code:
- `config.py` + `model.py` + `checkpoint.py` + `protocol.py`
- `register_model("qwen3_5_moe", ...)`
- Accuracy + speed alignment
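A minimal registry illustrating the `register_model` / `build_model` path: only those two function names and the `"qwen3_5_moe"` key come from this document; the registry internals and the builder signature are assumptions.

```python
_MODEL_REGISTRY = {}

def register_model(name, builder):
    """Register a model builder under a string key; duplicate
    registrations are rejected so collisions fail loudly."""
    if name in _MODEL_REGISTRY:
        raise ValueError(f"model {name!r} already registered")
    _MODEL_REGISTRY[name] = builder

def build_model(name, **kwargs):
    """Default documented public path: look up the builder by name
    and construct the model with the given kwargs."""
    return _MODEL_REGISTRY[name](**kwargs)
```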
The architecture differs more substantially (MLA != GQA), so new primitives are required:
- MLA attention module
- Auxiliary-loss-free MoE router
- Multi-Token Prediction
- ~800 lines of new code
| Area | Requirement |
|---|---|
| Accuracy | Every model must align step-by-step against BridgeRuntime |
| Speed | Per-configuration tok/s/gpu should stay within 2% of Bridge |
| Tests | GPU integration tests are required (currently near-zero coverage; still P0) |
| Boundaries | model/ must not import runtime/; runtime/ must not import benchmarks/ |
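The boundary rule can be enforced mechanically in CI; below is a sketch using the stdlib `ast` module. The directory names follow the table above, while the function name and check shape are assumptions.

```python
import ast

def forbidden_imports(source, banned_prefixes):
    """Return the names imported in `source` that fall under any banned
    top-level package prefix, e.g. runtime/ code importing benchmarks/.
    An empty result means the boundary holds for this file."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in banned_prefixes):
                hits.append(name)
    return hits
```

Running this over every file in runtime/ with `["benchmarks"]` banned (and over model/ with `["runtime"]` banned) would turn the boundary rows above into a failing test rather than a convention.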