Bumblebee

A lightweight distributed training library for large language models. Bumblebee exposes a runtime API for orchestration, composable primitives for implementation work, and model composition plus registration hooks for bringing architectures into the system.

~4,000 lines of infrastructure covering roughly what takes ~33,000 lines in Megatron-Core. A library, not a framework.

Chinese version

Bumblebee repackages Megatron's parts into composable building blocks

Why Bumblebee

  • Megatron as Baseline: BridgeRuntime provides a Megatron-aligned validation path, so new implementations can be checked step-by-step against loss and grad-norm baselines.
  • Composable Primitives: TP/EP/PP/CP, GQA attention, MoE, checkpointing, and optimizer layers are exposed as replaceable building blocks instead of hidden framework internals.
  • Model Composer Friendly: model composition plugs in through protocol.py plus register_model(...), so model code can live in your own repository.
  • Agent-Native: the codebase is organized so agents can read primitives, generate model scaffolding, and validate against the runtime contract.

Public Surface

bumblebee/
├── runtime/       Public runtime API: create_runtime, RuntimeConfig, PackedBatch, ...
├── primitive/     Building blocks: modules, parallel, checkpointing, optimizers
└── model/         Model registration + reference implementations

Recommended public entrypoints:

  • from bumblebee.runtime import create_runtime, RuntimeConfig, BBConfig, BridgeConfig
  • from bumblebee.runtime.contracts import PackedBatch, ForwardResult, ModelHandle
  • from bumblebee.model import register_model

Bridge note:

  • BridgeConfig keeps Megatron/mbridge tuning in override_ddp_config, override_transformer_config, and override_optimizer_config.
  • Use transformer keys such as attention_backend, moe_aux_loss_coeff, recompute_granularity, and recompute_modules inside override_transformer_config.
  • Old bridge top-level aliases such as attention_backend_override, router_aux_loss_coef, recompute, and recompute_modules have been removed.
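For illustration, here is a minimal sketch of how those keys might be grouped on BridgeConfig. It assumes the override_* fields accept plain dicts of Megatron/mbridge keys; the values shown are placeholders, not recommended settings:

from bumblebee.runtime import BridgeConfig

# Sketch only: the override_* fields are assumed to take plain dicts of
# Megatron/mbridge keys; the values below are illustrative placeholders.
bridge_cfg = BridgeConfig(
    override_transformer_config={
        "attention_backend": "flash",          # example backend choice
        "moe_aux_loss_coeff": 1e-3,
        "recompute_granularity": "selective",
        "recompute_modules": ["core_attn"],
    },
    override_ddp_config={},        # DDP tuning goes here
    override_optimizer_config={},  # optimizer tuning goes here
)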

Installation

pip install -e .

Hard runtime dependencies declared by the package:

  • torch>=2.4
  • safetensors>=0.4
  • transformer-engine[pytorch]

Notes:

  • Bumblebee is intentionally TE-only. Missing hard dependencies should fail immediately instead of silently falling back.
  • BridgeRuntime additionally expects an mbridge / Megatron-Core environment. The bb backend is the simpler starting point for package-level bring-up.
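As a quick, hand-rolled sanity check (not part of Bumblebee's API), you can import the hard dependencies up front so a missing one fails immediately rather than at first use:

# Environment sanity check mirroring the TE-only, no-silent-fallback policy.
import torch                        # torch>=2.4
import safetensors                  # safetensors>=0.4
import transformer_engine.pytorch   # transformer-engine[pytorch]

print("torch", torch.__version__)
print("hard dependencies import cleanly")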

Quick Start

Quick Benchmark Command

# BB path
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/Qwen3-30B-A3B \
  backend.model_name=qwen3_moe \
  backend.impl=lite \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

# Bridge baseline path (requires mbridge / Megatron environment)
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bridge \
  hf_path=/models/Qwen3-30B-A3B \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

See Benchmark Workflow for the recommended comparison order and result-saving options.

Train with bumblebee.runtime

from bumblebee.runtime import BBConfig, ParallelConfig, RuntimeConfig, create_runtime

cfg = RuntimeConfig(
    backend="bb",
    hf_path="/models/Qwen3-30B-A3B",
    backend_cfg=BBConfig(
        model_name="qwen3_moe",
        parallel=ParallelConfig(tp=8, ep=4),
    ),
)

rt = create_runtime(cfg)
handle = rt.build_model()

# `data` is your training batch (e.g. a PackedBatch), prepared elsewhere.
with rt.train_mode(handle):
    rt.zero_grad(handle)
    result = rt.forward_backward(handle, data, loss_fn=None, num_microbatches=2)
    ok, grad_norm, num_zeros = rt.optimizer_step(handle)
    rt.lr_scheduler_step(handle)

print(result.metrics["loss"], ok, grad_norm, num_zeros)

Public contract notes:

  • create_runtime(RuntimeConfig(...)) captures the default backend config.
  • rt.build_model() is the recommended public path; passing overrides to build_model(...) is an advanced escape hatch.
  • forward_backward(...) returns ForwardResult, not a raw dict.
  • optimizer_step(...) returns (update_successful, grad_norm, num_zeros).
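Putting the contract together, here is a short multi-step loop sketch; next_batch is a hypothetical stand-in for your own data pipeline and is not part of Bumblebee:

# Sketch of a multi-step training loop on the public contract above.
# `next_batch(step)` is a hypothetical helper; supply your own batches.
for step in range(10):
    with rt.train_mode(handle):
        rt.zero_grad(handle)
        result = rt.forward_backward(
            handle, next_batch(step), loss_fn=None, num_microbatches=2
        )
        ok, grad_norm, num_zeros = rt.optimizer_step(handle)
        rt.lr_scheduler_step(handle)

    # optimizer_step reports (update_successful, grad_norm, num_zeros);
    # forward_backward returns a ForwardResult carrying a metrics dict.
    print(f"step {step}: loss={result.metrics['loss']} "
          f"grad_norm={grad_norm} zeros={num_zeros} update_ok={ok}")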

Compose a Model

my_repo/my_model/
  config.py            # architecture params
  model.py             # primitive composition
  checkpoint.py        # HF <-> BB weight mapping
  protocol.py          # build_* / load_hf_weights / make_forward_step
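The protocol hooks are only named above (build_* / load_hf_weights / make_forward_step), so the skeleton below is a rough sketch of one plausible shape; the exact function names, signatures, and parameters are assumptions to adapt to the real contract:

# my_repo/my_model/protocol.py (sketch; signatures are assumed, not canonical)

def build_model(config, parallel):
    """Compose the model from Bumblebee primitives (one of the build_* hooks)."""
    ...

def load_hf_weights(model, hf_path):
    """Map HF checkpoint tensors onto the composed model (see checkpoint.py)."""
    ...

def make_forward_step(model):
    """Return the forward-step callable the runtime drives during training."""
    ...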

Register it:

from bumblebee.model import register_model

register_model(
    "my_model",
    package="my_repo.my_model",
    hf_model_types=["my_model"],
    impls={
        "lite": "my_repo.my_model.protocol",
    },
)

Then run:

torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/my_model \
  backend.model_name=my_model

More Benchmark Examples

# BB lite path
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bb \
  hf_path=/models/Qwen3-30B-A3B \
  backend.model_name=qwen3_moe \
  backend.impl=lite \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

# Bridge baseline path (requires mbridge/Megatron environment)
torchrun --nproc-per-node=8 -m bumblebee.benchmarks.bench \
  backend=bridge \
  hf_path=/models/Qwen3-30B-A3B \
  backend.parallel.tp=4 \
  backend.parallel.ep=8 \
  bench.seq_len=8192 \
  bench.num_microbatches=16 \
  bench.steps=10 \
  bench.warmup=0

Documentation

Current docs live in this repository.

Current Priorities

The current GitHub-facing priorities are:

  1. Stabilize the public runtime contract and examples.
  2. Close the minimum VERL / RLHF loop around PackedBatch, per-token loss plumbing, and export/offload usage.
  3. Fill shared cross-model gaps before model-specific feature work:
    • Resume
    • Offload
    • THD
    • THD + CP
  4. Continue the next primitive/model wave:
    • MLA attention
    • Auxiliary-loss-free MoE
    • MTP
    • MLP
    • MoELayer extraction into primitive/

Research Directions

Bumblebee's architecture still keeps research-friendly extension points:

  • Kernel replacement and backend comparison
  • Parallelism strategy experiments
  • MoE routing and expert scheduling
  • Model composition through protocol modules and registry
  • Agent-assisted implementation and verification loops

License

Apache 2.0
