
# Bumblebee Roadmap

Updated 2026-04-14

## Positioning

Bumblebee provides atomic distributed-training operations. Upper-layer systems such as VERL or custom training loops compose those operations instead of being forced into a monolithic framework.

## Phase 1: Config Unification + Legacy Cleanup ✅

- Shared `ParallelConfig` / `OptimizerConfig` aligned with VERL defaults
- `BBConfig` / `BridgeConfig` moved under their respective backend directories
- Removed `SessionConfig` + `bb/session.py` + `bb/helpers.py` (~1100 lines)
- Added `register_model` / `register_runtime` APIs (usage sketch after this list)
- Introduced per-impl typed `ImplConfig` registration
- `forward_backward` now returns a typed `ForwardResult`
- Unified the `loss_fn` signature across both backends
- Enforced zero `runtime/` dependency on `benchmarks/`
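
A minimal sketch of how the Phase 1 pieces fit together. `register_model` and the per-impl typed config are named in this roadmap; the import path and keyword names (`impl_config`, `builder`) are illustrative assumptions, not the finalized signature:

```python
from dataclasses import dataclass

from bb import register_model  # import path is an assumption


@dataclass
class Qwen3MoeImplConfig:
    """Per-impl typed ImplConfig (fields are illustrative)."""
    num_experts: int = 64
    moe_router_topk: int = 8


def build_qwen3_moe(impl_cfg: Qwen3MoeImplConfig, parallel_cfg):
    """Construct the model under the given parallel layout (stub)."""
    raise NotImplementedError


# Keyword names below are hypothetical; only register_model itself
# is confirmed by this roadmap.
register_model(
    "qwen3_moe",
    impl_config=Qwen3MoeImplConfig,
    builder=build_qwen3_moe,
)
```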

## Phase 2: GitHub Public Surface Cleanup (Current)

Goal: bring public docs, examples, package metadata, and real code interfaces back into alignment.

| Item | Status |
| --- | --- |
| Align README / guides with the typed runtime contract | In progress |
| Make `register_model(...)` the single recommended registration path | In progress |
| Align `pyproject.toml`, README, and installation path | In progress |
| Make `build_model()` the default documented public path (sketch below) | In progress |
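
For orientation, this is the kind of single-call entry point the `build_model()` row points at; everything about this signature beyond the function name is a guess:

```python
from bb import build_model, ParallelConfig  # import path is an assumption

# Parameter and field names here are illustrative, not the documented API.
handle = build_model(
    "qwen3_moe",  # a name previously registered via register_model(...)
    parallel=ParallelConfig(tensor_model_parallel_size=2),
)
```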

## Phase 3: VERL Integration

Goal: a VERL PPO loop can train through Bumblebee's runtime atomics.

Bumblebee already covers the core operations required by VERL actor/critic workers (a composition sketch follows the table):

| VERL call site | Bumblebee equivalent | Status |
| --- | --- | --- |
| `forward_backward_func` | `rt.forward_backward(handle, data, loss_fn)` | ✅ |
| `optimizer.step()` | `rt.optimizer_step(handle) -> (ok, gn, nz)` | ✅ |
| `optimizer.zero_grad()` + `zero_grad_buffer()` | `rt.zero_grad(handle)` | ✅ |
| `lr_scheduler.step()` | `rt.lr_scheduler_step(handle)` | ✅ |
| load/offload model | `rt.to(handle, "cuda"/"cpu")` | ✅ |
| `train()` / `eval()` | `rt.train_mode(handle)` / `rt.eval_mode(handle)` | ✅ |
| `export_weights` | `rt.export_weights(handle)` | ✅ |
| save/load checkpoint | `rt.save_checkpoint(handle, path)` / `rt.load_checkpoint(handle, path)` | ✅ |
| `is_mp_src_rank` | `rt.is_mp_src_rank_with_outputs(handle)` | ✅ |
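
Composed, one actor update on the VERL side looks roughly like this. The `rt.*` call shapes come straight from the table above; the surrounding function, `ppo_loss_fn`, and the logging helper are assumed context:

```python
def actor_update_step(rt, handle, data, ppo_loss_fn):
    """One PPO actor step built only from Bumblebee runtime atomics."""
    rt.train_mode(handle)
    rt.zero_grad(handle)

    # Returns a typed ForwardResult (per Phase 1)
    result = rt.forward_backward(handle, data, ppo_loss_fn)

    ok, grad_norm, num_zeros = rt.optimizer_step(handle)
    if ok:
        rt.lr_scheduler_step(handle)

    # Only model-parallel source ranks hold real outputs
    if rt.is_mp_src_rank_with_outputs(handle):
        log_metrics(result, grad_norm)  # hypothetical logging helper
    return result
```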

Remaining gaps:

| Gap | Priority | Notes |
| --- | --- | --- |
| `grad_offload` | P1 | VERL supports param / optimizer / grad offload; BB still lacks grad offload |
| Router replay | P1 | Router `forward` should accept pre-recorded routing decisions (sketch below) |
| Distributed checkpointing | P2 | MC-style reshardable optimizer state |
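
To make the router-replay gap concrete, here is one possible shape for it; this is purely an assumption about what "accept pre-recorded routing decisions" could mean, not a planned interface:

```python
import torch


def router_forward(hidden, gate_weight, top_k=8, replay=None):
    """MoE router that can replay pre-recorded decisions.

    `replay` is a hypothetical (indices, weights) pair captured on an
    earlier forward; when given, gating is skipped so the exact same
    experts fire again (e.g. a recompute matching the rollout pass).
    """
    if replay is not None:
        return replay
    logits = hidden @ gate_weight.t()          # (tokens, n_experts)
    probs = logits.softmax(dim=-1)
    weights, indices = torch.topk(probs, top_k, dim=-1)
    return indices, weights / weights.sum(-1, keepdim=True)
```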

VERL-side adaptation (not implemented inside BB; shim sketch after this list):

- Config mapping: `McoreEngineConfig` -> `BBConfig` / `BridgeConfig`
- Data conversion: `TensorDict` -> BB data iterator
- Loss passthrough: VERL's `loss_fn(model_output, data, dp_group=None)` can be passed directly (`dp_group` is unused on the BB side)
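
A sketch of that shim. `McoreEngineConfig`, `BBConfig`, and `TensorDict` are the class names from the list above, but every field accessed here is an illustrative assumption:

```python
from bb import BBConfig  # import path is an assumption


def mcore_to_bb_config(mc_cfg):
    """Map VERL's McoreEngineConfig onto BBConfig (field names assumed)."""
    return BBConfig(
        tensor_model_parallel_size=mc_cfg.tensor_model_parallel_size,
        pipeline_model_parallel_size=mc_cfg.pipeline_model_parallel_size,
    )


def tensordict_iter(td, micro_batch_size):
    """Yield micro-batches from a TensorDict as plain dicts for BB."""
    for start in range(0, td.batch_size[0], micro_batch_size):
        yield {k: v[start : start + micro_batch_size] for k, v in td.items()}
```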

## Phase 4: Shared Infrastructure Priorities

Before continuing model-specific feature work, prioritize the shared capabilities that have the most impact on a first real training run.

| Feature | Priority | Why first |
| --- | --- | --- |
| Resume | P0 | Still missing in 3 models; directly affects real training recovery |
| Offload | P0 | Still missing in 3 models; directly affects RLHF and memory-constrained scenarios |
| THD | P1 | Strongly tied to PackedBatch and VERL integration (layout sketch below) |
| THD + CP | P1 | Important combination, but narrower than plain THD |
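
For readers new to the term: THD is the packed variable-length layout `(total_tokens, heads, dim)` with no batch or padding dimensions, delimited by cumulative sequence lengths. A small illustration (the PackedBatch tie-in is the roadmap's; the tensor names here are mine):

```python
import torch

seq_lens = torch.tensor([5, 3, 8])              # three variable-length sequences
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)  # [0, 5, 8, 16] sequence boundaries

n_heads, d_head = 16, 128
total_tokens = int(seq_lens.sum())
q = torch.randn(total_tokens, n_heads, d_head)  # THD: no padding materialized

# Attention kernels then take (q, k, v, cu_seqlens, max_seqlen), so no
# compute is spent on padding tokens.
```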

## Phase 5: BB ↔ Bridge Accuracy Alignment

Goal: under the same configuration, BB and Bridge should produce matching loss / grad_norm / speed behavior (a comparison-harness sketch follows the table).

| Step | Notes |
| --- | --- |
| Fixed-seed step-by-step comparison | loss diff < 1e-5, grad_norm diff < 1e-3 |
| Difference debugging | bisect divergence across `forward` -> `backward` -> `optimizer_step` |
| Speed alignment | tok/s/gpu gap between BB and Bridge under 2% |
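
A sketch of that fixed-seed harness. The `rt.*` calls and thresholds come from this document; the `ForwardResult.loss` field name, handle setup, and batch source are assumptions:

```python
import torch


def compare_step_by_step(bb_rt, br_rt, bb_h, br_h, batches, loss_fn, steps=50):
    """Drive both runtimes on identical data and assert they track."""
    torch.manual_seed(1234)  # identical seed for both runtimes
    for step, data in enumerate(batches):
        if step >= steps:
            break
        r_bb = bb_rt.forward_backward(bb_h, data, loss_fn)
        r_br = br_rt.forward_backward(br_h, data, loss_fn)
        _, gn_bb, _ = bb_rt.optimizer_step(bb_h)
        _, gn_br, _ = br_rt.optimizer_step(br_h)
        assert abs(r_bb.loss - r_br.loss) < 1e-5, f"loss diverged at step {step}"
        assert abs(gn_bb - gn_br) < 1e-3, f"grad_norm diverged at step {step}"
        bb_rt.zero_grad(bb_h)
        br_rt.zero_grad(br_h)
```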

## Phase 6: New Models / New Primitives

### Qwen3.5

If the architecture stays close to Qwen3, most code should be reusable, leaving roughly 300 lines of new code:

- `config.py` + `model.py` + `checkpoint.py` + `protocol.py`
- `register_model("qwen3_5_moe", ...)`
- Accuracy + speed alignment

### DeepSeek V3.2

The architecture differs more substantially (MLA rather than GQA), so new primitives are required (an MLA sketch follows this list):

- MLA (Multi-head Latent Attention) module
- Auxiliary-loss-free MoE router
- Multi-Token Prediction (MTP)
- Roughly 800 lines of new code
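
To show why MLA needs a new primitive rather than a GQA tweak: keys and values are reconstructed from a shared low-rank latent, and that latent is what gets cached. A heavily simplified sketch (real MLA also compresses queries and carries a decoupled RoPE branch, both omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLASketch(nn.Module):
    """Simplified Multi-head Latent Attention (no RoPE, no query compression)."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # shared KV latent
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)  # (b, t, d_latent): the only tensor cached at inference
        k = self.k_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```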

## Quality Bar

| Area | Requirement |
| --- | --- |
| Accuracy | Every model must align step-by-step against `BridgeRuntime` |
| Speed | Per-configuration tok/s/gpu should stay within 2% of Bridge |
| Tests | GPU integration tests are required (currently near-zero coverage; still P0) |
| Boundaries | `model/` must not import `runtime/`; `runtime/` must not import `benchmarks/` (test sketch below) |
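
The import boundary is mechanically checkable. A sketch of such a test; the `bb.` package prefix and on-disk paths are assumptions about the repo layout:

```python
import ast
import pathlib

# directory -> import prefix that must never appear inside it
FORBIDDEN = {"bb/model": "bb.runtime", "bb/runtime": "bb.benchmarks"}


def test_import_boundaries():
    for pkg_dir, banned in FORBIDDEN.items():
        for path in pathlib.Path(pkg_dir).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                names = []
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                for name in names:
                    assert not name.startswith(banned), f"{path} imports {name}"
```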