Bumblebee

A lightweight distributed training library for large language models. Bumblebee exposes a runtime API for orchestration, composable primitives for implementation work, and model composition plus registration hooks for bringing architectures into the system.

~4,000 lines of infrastructure covering roughly what takes ~33,000 lines in Megatron-Core. A library, not a framework.

Chinese version

Bumblebee repackages Megatron's parts into composable building blocks

Why Bumblebee

  • Megatron as Baseline: BridgeRuntime provides a Megatron-aligned validation path, so new implementations can be checked step-by-step against loss and grad-norm baselines.
  • Composable Primitives: TP/EP/PP/CP, GQA attention, MoE, checkpointing, and optimizer layers are exposed as replaceable building blocks instead of hidden framework internals.
  • Model Composer Friendly: model composition plugs in through protocol.py plus register_model(...), so model code can live in your own repository.
  • Agent-Native: the codebase is organized so agents can read primitives, generate model scaffolding, and validate against the runtime contract.

Public Surface

bumblebee/
├── runtime/       Public runtime API: create_runtime, RuntimeConfig, PackedBatch, ...
├── primitive/     Building blocks: modules, parallel, checkpointing, optimizers
└── model/         Model registration + reference implementations

Recommended public entrypoints:

  • from bumblebee.runtime import create_runtime, RuntimeConfig, BBConfig, BridgeConfig
  • from bumblebee.runtime.contracts import PackedBatch, ForwardResult, ModelHandle
  • from bumblebee.model import register_model

Bridge note:

  • BridgeConfig keeps Megatron/mbridge tuning in override_ddp_config, override_transformer_config, and override_optimizer_config.
  • Use transformer keys such as attention_backend, moe_aux_loss_coeff, recompute_granularity, and recompute_modules inside override_transformer_config.
  • Old bridge top-level aliases such as attention_backend_override, router_aux_loss_coef, recompute, and recompute_modules have been removed.
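For illustration, here is a minimal sketch of how those keys might be grouped on BridgeConfig. It assumes the override_* fields accept plain dicts of Megatron/mbridge keys; the values shown are placeholders, not recommended settings:

from bumblebee.runtime import BridgeConfig

# Sketch only: the override_* fields are assumed to take plain dicts of
# Megatron/mbridge keys; the values below are illustrative placeholders.
bridge_cfg = BridgeConfig(
    override_transformer_config={
        "attention_backend": "flash",          # example backend choice
        "moe_aux_loss_coeff": 1e-3,
        "recompute_granularity": "selective",
        "recompute_modules": ["core_attn"],
    },
    override_ddp_config={},        # DDP tuning goes here
    override_optimizer_config={},  # optimizer tuning goes here
)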

Installation

pip install -e .

Hard runtime dependencies declared by the package:

  • torch>=2.4
  • safetensors>=0.4
  • transformer-engine[pytorch]

Notes:

  • Bumblebee is intentionally TE-only. Missing hard dependencies should fail immediately instead of silently falling back.
  • BridgeRuntime additionally expects an mbridge / Megatron-Core environment. The bb backend is the simpler starting point for package-level bring-up.
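As a quick, hand-rolled sanity check (not part of Bumblebee's API), you can import the hard dependencies up front so a missing one fails immediately rather than at first use:

# Environment sanity check mirroring the TE-only, no-silent-fallback policy.
import torch                        # torch>=2.4
import safetensors                  # safetensors>=0.4
import transformer_engine.pytorch   # transformer-engine[pytorch]

print("torch", torch.__version__)
print("hard dependencies import cleanly")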

Quick Start

Quick Benchmark Command

# BB path
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/Qwen3-30B-A3B \
  backend.model_name=qwen3_moe \
  backend.impl=lite \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

# Bridge baseline path (requires mbridge / Megatron environment)
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bridge \
  hf_path=/models/Qwen3-30B-A3B \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

See Benchmark Workflow for the recommended comparison order and result-saving options.

Train with bumblebee.runtime

from bumblebee.runtime import BBConfig, ParallelConfig, RuntimeConfig, create_runtime

cfg = RuntimeConfig(
    backend="bb",
    hf_path="/models/Qwen3-30B-A3B",
    backend_cfg=BBConfig(
        model_name="qwen3_moe",
        parallel=ParallelConfig(tp=8, ep=4),
    ),
)

rt = create_runtime(cfg)
handle = rt.build_model()

# `data` is your training batch (e.g. a PackedBatch), prepared elsewhere.
with rt.train_mode(handle):
    rt.zero_grad(handle)
    result = rt.forward_backward(handle, data, loss_fn=None, num_microbatches=2)
    ok, grad_norm, num_zeros = rt.optimizer_step(handle)
    rt.lr_scheduler_step(handle)

print(result.metrics["loss"], ok, grad_norm, num_zeros)

Public contract notes:

  • create_runtime(RuntimeConfig(...)) captures the default backend config.
  • rt.build_model() is the recommended public path; passing overrides to build_model(...) is an advanced escape hatch.
  • forward_backward(...) returns ForwardResult, not a raw dict.
  • optimizer_step(...) returns (update_successful, grad_norm, num_zeros).
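Putting the contract together, here is a short multi-step loop sketch; next_batch is a hypothetical stand-in for your own data pipeline and is not part of Bumblebee:

# Sketch of a multi-step training loop on the public contract above.
# `next_batch(step)` is a hypothetical helper; supply your own batches.
for step in range(10):
    with rt.train_mode(handle):
        rt.zero_grad(handle)
        result = rt.forward_backward(
            handle, next_batch(step), loss_fn=None, num_microbatches=2
        )
        ok, grad_norm, num_zeros = rt.optimizer_step(handle)
        rt.lr_scheduler_step(handle)

    # optimizer_step reports (update_successful, grad_norm, num_zeros);
    # forward_backward returns a ForwardResult carrying a metrics dict.
    print(f"step {step}: loss={result.metrics['loss']} "
          f"grad_norm={grad_norm} zeros={num_zeros} update_ok={ok}")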

Compose a Model

my_repo/my_model/
  config.py            # architecture params
  model.py             # primitive composition
  checkpoint.py        # HF <-> BB weight mapping
  protocol.py          # build_* / load_hf_weights / make_forward_step
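The protocol hooks are only named above (build_* / load_hf_weights / make_forward_step), so the skeleton below is a rough sketch of one plausible shape; the exact function names, signatures, and parameters are assumptions to adapt to the real contract:

# my_repo/my_model/protocol.py (sketch; signatures are assumed, not canonical)

def build_model(config, parallel):
    """Compose the model from Bumblebee primitives (one of the build_* hooks)."""
    ...

def load_hf_weights(model, hf_path):
    """Map HF checkpoint tensors onto the composed model (see checkpoint.py)."""
    ...

def make_forward_step(model):
    """Return the forward-step callable the runtime drives during training."""
    ...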

Register it:

from bumblebee.model import register_model

register_model(
    "my_model",
    package="my_repo.my_model",
    hf_model_types=["my_model"],
    impls={
        "lite": "my_repo.my_model.protocol",
    },
)

Then run:

torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/my_model \
  backend.model_name=my_model

More Benchmark Examples

# BB lite path
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/Qwen3-30B-A3B \
  backend.model_name=qwen3_moe \
  backend.impl=lite \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

# Bridge baseline path (requires mbridge/Megatron environment)
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bridge \
  hf_path=/models/Qwen3-30B-A3B \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

Documentation

Current docs live in this repository.

Current Priorities

The current GitHub-facing priorities are:

  1. Stabilize the public runtime contract and examples.
  2. Close the minimum VERL / RLHF loop around PackedBatch, per-token loss plumbing, and export/offload usage.
  3. Fill shared cross-model gaps before model-specific feature work:
    • Resume
    • Offload
    • THD
    • THD + CP
  4. Continue the next primitive/model wave:
    • MLA attention
    • Auxiliary-loss-free MoE
    • MTP
    • MLP
    • MoELayer extraction into primitive/

Research Directions

Bumblebee's architecture still keeps research-friendly extension points:

  • Kernel replacement and backend comparison
  • Parallelism strategy experiments
  • MoE routing and expert scheduling
  • Model composition through protocol modules and registry
  • Agent-assisted implementation and verification loops

License

Apache 2.0
