This document is contributor-facing and focuses on how Butterfly is structured and why the current implementation behaves the way it does.
Naming note: Butterfly / BNA is the project name everywhere user-facing. Wayfinder / HCSA are legacy identifiers that survive in some module names (bna/attention_hcsa.py), CLI mode strings (--mode wayfinder is a legacy alias for --mode butterfly), and historical benchmark paths. They mean the same runtime; new readers should treat any "wayfinder" / "HCSA" mention as "Butterfly".
Source: bna/graph/abi.py
Butterfly uses a backend-agnostic graph ABI with two key tensors:
neigh_idx: paddedint32adjacency list with-1as padding. Shape[T, D]or[H, T, D].edge_type:uint8labels in{PAD, CYCLE, WINDOW, LANDMARK, REWIRE}.
This ABI is shared across integration and attention-dispatch code paths so the topology contract stays stable across PyTorch and MLX.
Let positions be 0..T-1. Build a Hamiltonian cycle C over these positions.
For token i, candidate neighbors are:
Wmost recent predecessors (causal window)- cycle neighbors of
iinC - optional landmarks at stride
s
Apply causal mask j < i to obtain directed sparse neighborhoods.
Cost model (d = head/embedding dim):
- Dense causal: per-token fan-in
T-1, total workO(T^2 d) - HCSA: bounded fan-in
D <= W + 2 + T/s(often closer toW + 1 + T/safter masking), total workO(T D d)
Example at T=4096, W=64, s=64:
- HCSA candidate fan-in
D=130 - Dense fan-in
4095 - approximately
31xfewer edges per token
Key implementation mechanisms (current tree):
- Vectorized all-head dispatch (
bna/mlx/fused_attention.py)
- processes all query heads together with fused SDPA kernels
- avoids per-head Python loops and unnecessary eval barriers
- supports contiguous-window path and gather path depending on graph/data alignment
- Perms-only graph construction (
bna/topology/core.py)
Topology.construct_perms_only()builds cycle permutation views without always materializing full neighbor-index graphs- reduces topology-build overhead on permute-centric paths
- Exact graph horizon behavior in integrations (
bna/integrations/glm_mlx.py)
- graph horizons align with KV cache sizing decisions
- helps active-row path selection avoid unnecessary random gather behavior in key cases
| Path | Purpose |
|---|---|
bna/attention_hcsa.py, bna/model.py |
Core sparse attention + reference GPT |
bna/cycles.py, bna/graph_strategies.py |
Cycle construction + strategy wrappers |
bna/graph/abi.py |
Graph ABI (neighbor indices + edge typing) |
bna/graph/analysis.py |
Diagnostics: spectral gap, random-walk mixing, resilience |
bna/topology/core.py |
Topology runtime and perms-only construction |
bna/compiler/, configs/graph_specs/*.wf |
Graph-spec compiler + cache artifacts |
bna/mlx/attention.py |
MLX dispatch paths (dense, gather, permute-window) |
bna/mlx/fused_attention.py |
Vectorized fused dispatch paths |
bna/integrations/ |
Model integrations (GLM, Qwen, GPT-2) |
bna/torch/ |
PyTorch/CUDA backend |
scripts/ |
Benchmarks, training, ablations, diagnostics |
tests/ |
Correctness and regression tests |
benchmarks/, results/ |
Measurement artifacts |
- Keep benchmark execution sequential for reproducibility and memory safety.
- Keep public claims tied to measured evidence in
docs/FIRST_RELEASE.md. - See
CONTRIBUTING.mdfor PR expectations.