Releases: foundry-org/foundry
Foundry 0.0.2
Foundry 0.0.2
SGLang graduates to a fully validated engine. This release brings the SGLang
integration to parity with vLLM across single GPU, data parallel, and expert
parallel — with a self-contained recipe and no vLLM build dependency for EP.
Highlights
- SGLang single GPU / DP / EP all validated end-to-end.
SAVE → LOAD → query verified on single-GPU Qwen3-1.7B / 4B / 14B,
data-parallel (DP=2) Qwen3-1.7B, and expert-parallel (EP=2, DeepEP
low-latency) Qwen3-30B-A3B-FP8. Foundry-restored decode graphs match baseline
throughput within run-to-run noise; cold start drops from ~30 s of capture to
a ~0.4 s restore. - NVSHMEM auto-detection — no manual path, no vLLM ep_kernels.
Foundry now resolves DeepEP'slibnvshmem_host.sofrom thenvidia-nvshmem
wheel (atorchdependency) the same way it auto-detectslibcuda_hook.so.
DeepEP installs via SGLang's ownci_install_deepep.sh; the vLLM EP-kernel
helper is no longer required. - Minimal SGLang fork — 47 lines, activation only.
The fork carries just a CLI flag, a config field, and threeapply_server_args
call sites (server_args.py,foundry_shim.py,scheduler.py,
data_parallel_controller.py). All save/load logic lives in the integration
layer; the edits are inert unless--foundry-graph-extension-config-pathis set.
Engine integrations
-
SGLang — integration for SGLang v0.5.13. Working configurations: single
GPU, data parallel (DP), expert parallel (EP, DeepEP low-latency + DP-attention with fa3). Self-contained recipes underrecipe/sglang/(shared TOML pair +
per-config serve scripts, mirroringrecipe/vllm/) for Qwen3-1.7B (single),
Qwen3-1.7B (DP), and Qwen3-30B-A3B-FP8 (EP). TP attention stays unsupported
(NCCL all-reduce vs the VMM region) — EP uses DP-attention instead. EP requires
SGLang's pinned kernels — DeepEP9af0e0d(not vLLM's29d31c0),
sgl-deep-gemm >= 0.1.2, andflash-attn-3(fa3 sidesteps a FlashInfer
ragged-prefill off-by-one under this config). -
Manual DeepEP / NVSHMEM bootstrap (SGLang has no
prepare_comm_buffer).
vLLM exposesprepare_communication_buffer_for_model, a clean upstream site
where foundry brings up the NVSHMEM runtime. SGLang has no equivalent — it
creates the singleton DeepEPBufferlazily on the first MoE dispatch, inside
the warmup forwards foundry suppresses, which would pushBuffer(...)into the
captured stream ("operation not permitted when stream is capturing"). Foundry
instead walks the model to theDeepEPDispatcherand forces buffer creation
(bootstrap_deepep_buffer) before the capture loop on SAVE and before replay
on LOAD, withinit_nvshmem_for_loaded_modulesrun once on LOAD before any
NVSHMEM-kernel graph replays.
Fixes
- Per-rank VMM device binding (DP/TP/EP).
set_allocation_regionbinds to
the current CUDA device, so the integration now callsset_device(gpu_id)
before reserving the region — rank > 0 previously reserved oncuda:0and
faulted. - Set the main CUDA context on graph-build pool workers (C++). EP graphs
carryNODE_EVENT_RECORD/WAITnodes, so on-demand prep callscuEventCreate
onSimpleThreadPoolworkers that had no current context. Added
cuCtxSetCurrent(main_ctx)as the first call on those workers (mirroring the
background thread) — without it LOAD faulted with "invalid device context".
Dense graphs never hit this path, which is why it surfaced only on SGLang EP. - EP capture/load correctness. A SAVE-only warmup pass (triggers DeepGEMM
per-shape JIT + lazy init outside the captured graph),deepep_adaptermode
init on LOAD (replay asserts on_captured_deepep_mode, which the replaced
capture loop never sets), and the FlashInfer per-bs pre-pass gated off for fa3
while still populatingdecode_cuda_graph_metadatapost-load for replay.
Docs
- New
docs/sglang/set (overview, direct-edits, hooks, memory-lifecycle,
save-load-workflow, memory-consistency) and a self-containedrecipe/sglang/
README with install, run, performance, and troubleshooting. - Top-level README parallelism status table updated to mark SGLang single GPU,
DP, and EP as validated.
Foundry v0.0.1
Foundry 0.0.1
First public release of Foundry — a CUDA-graph persistence library that
captures an entire model's CUDA graphs (plus their device context: modules,
workspaces, VMM layout) once and replays them at startup, eliminating compile,
warmup, and capture from cold-start time.
Highlights
- Deterministic memory layout — zero patching on graph load.
Foundry indirects memory allocation to the same reserved memory region
with monotonic cursor advancing for both SAVE and LOAD. Therefore,
during CUDA graph reconstruction, captured pointers resolve as-is — no
pointer rewriting, no per-graph fix-ups, no runtime recompilation. - CUDA module extraction — library-agnostic execution context saving.
A driver-levelLD_PRELOADhook (libcuda_hook.so) intercepts CUDA module
loads at the driver boundary, so foundry captures the device-resident code
irrespective of which framework, kernel library, or codegen path produced
it (PyTorch, Inductor, cuBLAS NVJET, NVSHMEM, DeepEP, DeepGEMM, …). - Fast graph reconstruction via template-based CUDA graph grouping.
Captured graphs are grouped by topology into templates with node-param sets;
On load, the template is rebuilt once per group and instances are reconstructed
on demand, keeping load fast and asynchronous.
Engine integrations
- vLLM — compatible with vLLM v0.21. Working configurations: single GPU,
data parallel (DP), expert parallel (EP, DeepEP low-latency). End-to-end
recipes underrecipe/vllm/for Qwen3-1.7B, Qwen3-14B (DP),
Qwen3-30B-A3B (EP), Qwen3-30B-A3B-FP8 (EP). - SGLang — integration layer for SGLang v0.5.13 (adapted fork coming
soon). - Integration architecture. Engine-specific logic lives in a thin
integration layer (foundry/integration/<engine>/); engine forks contain
only minimal hook calls.
Verified kernel & comm support
- cuBLAS NVJET kernels (Hopper+).
- torch.compile modules.
- NVSHMEM / DeepEP validated.
- DeepGEMM FP8 MoE validated.
Dependency
- PyTorch 2.11.0 (compatible with 2.9 – 2.11).
- CUDA 12+, CMake 4.0+, Boost.
Documentation
- Integration design notes and per-engine recipes under
docs/andrecipe/. - vLLM recipe README covers save/load workflow, archive layout, and required
env settings.
Repository hygiene
- Open-source pre-commit hooks (ruff, ruff-format, clang-format,
markdownlint, actionlint, DCO sign-off). - Smoke tests covering re-export imports and archive round-trip.
Roadmap
- Adapted vLLM and SGLang forks published alongside the release.
- Tensor parallel support.
- Host-process checkpoint/restore.
- PD-disaggregated serving.
- Single shared communication-stub capture across replicas.