Composable building blocks for code/document retrieval and reranking, in pure
Go with no cgo in the core (stdlib + golang.org/x/text only). Chunk text,
embed it, search it lexically and semantically, fuse the rankings, and rerank
with a transformer reranking model — each package is small, independently
importable, and parity-tested against a Python reference.
The dependency DAG is shallow: most packages are leaves; encoder requires
embed + linalg (+ sparse for the SPLADE expansion head). The one heavier dependency — gotreesitter (pure-Go, but a
large embedded-grammar payload) — is quarantined in the separate
chunk/treesitter submodule, so importing the core never pulls it in.
Generation lives in
goinfer. The decoder-only LLM runtime (Gemma 3 / Qwen / Llama …), its SentencePiece/ byte-level tokenizers, constrained decoding, and the optional WebGPU (cgo) backend were split out so aikit stays a small, cgo-free retrieval library. goinfer depends inward on aikit (embed,linalg).
| Package | Purpose | Deps (beyond stdlib) |
|---|---|---|
topk |
bounded min-heap top-K selector (generic) | — |
ann |
cosine ANN over a dense matrix — exact flat scan + approximate HNSW graph | linalg, topk |
bm25 |
identifier-aware BM25 lexical index (Lucene-variant); Tokenize (code) + TokenizePlain (general text) |
topk |
fuse |
rank fusion (RRF) + relative-score fusion (RSF) — blend lexical + dense rankings for hybrid search | — |
sparse |
learned-sparse (SPLADE) retrieval — inverted index + sparse-dot scoring over vectors from encoder.SPLADE (in-process) or precomputed |
topk |
bench |
reproducible recall + latency harness for the dense indexes (Flat / HNSW / FlatI8) — Experimental tooling | ann |
linalg |
SIMD f32 dot/matmul (NEON on arm64, AVX2/FMA on amd64) + int8/int4 quant kernels |
— |
embed |
Model2Vec inference: WordPiece tokenizer + safetensors loader + L2-norm | golang.org/x/text |
encoder |
CodeRankEmbed (NomicBert) + MiniLM-class BERT embedder + SPLADE expansion + cross-encoder reranker — transformer inference scored by cosine / sparse dot / relevance logit; pluggable matmul Backend |
embed, linalg, sparse |
chunk |
language-aware chunker registry + regex, markdown, line chunkers |
— |
chunk/treesitter (submodule) |
tree-sitter-backed syntactic chunker | gotreesitter, …/aikit |
chunk/treesitter is a separate Go module (…/aikit/chunk/treesitter) so the
gotreesitter dependency is opt-in: go get …/aikit/chunk/treesitter only when
you want syntactic chunking; the core stays dependency-light.
A runnable end-to-end pipeline (chunk → embed → ANN + BM25 → RRF fuse →
cross-encoder rerank → top-K) lives in examples/rag/. The shape:
// Lexical (BM25) and dense (ANN over embeddings) each rank the chunks…
lex := bm25Index.TopK(bm25.Tokenize(query), 50)
den := annIndex.Query(queryVec, 50)
// …fuse the two rankings (rank-based, no score-scale juggling)…
fused := fuse.RRF(fuse.DefaultK,
fuse.Keys(lex, func(r bm25.Result) int { return r.Doc }),
fuse.Keys(den, func(h ann.Hit) int { return h.Index }))
// …then rerank the fused shortlist with the encoder for final order.encoder's matmul routes through a Backend; the default is pure-Go SIMD CPU.
A WebGPU backend can be slotted in by importing goinfer/gpu under -tags gpu
— without aikit ever importing cgo.
For the zero-deploy story, examples/embedded-corpus/
is a single self-contained binary that //go:embeds the Model2Vec model, a prebuilt
int8 index, and the corpus, and answers Go/aikit questions over hybrid (dense +
lexical) search with no external files and ~50 ms startup — the
//go:embed-a-corpus lane no Python or ONNX stack reaches.
The core is pure Go (no cgo) and builds + tests on Linux, macOS, and Windows
(amd64 and arm64) — CI covers all three. SIMD acceleration in linalg uses NEON
on arm64 and AVX2/FMA on amd64 (runtime-detected, scalar fallback otherwise), on
every OS.
The mmap-backed loaders (embed.OpenSafetensorsMmap, OpenGGUFMmap) use real
memory-mapping on unix and fall back to a heap read on Windows — identical
API and results, just without OS-page-cache sharing (so a large checkpoint costs
heap RAM there). The non-mmap loaders (OpenSafetensors*) are heap-backed on
every platform.
The only cgo in the ecosystem is the optional WebGPU backend (goinfer/gpu,
webgpu), which needs a C toolchain. chunk/treesitter (gotreesitter) is
pure-Go too — it's a separate opt-in module only because of its large embedded
grammars, not cgo. The core pulls in neither.
Measured against pure-Go ANN libraries on real Model2Vec embeddings (N=8000,
dim 256, M=16, EfSearch=64, k=10; recall@10 vs exact cosine). Reproduce with
benchmarks/ — cd benchmarks && GOWORK=off go run . — which also
documents the methodology and why synthetic vectors can't measure recall@k.
| index | recall@10 | p50 latency | index memory |
|---|---|---|---|
| aikit HNSW | 0.995 | 0.085 ms | ~2 MB |
| aikit FlatI8 (int8) | 0.995 | 0.13 ms | ~2 MB |
| aikit Flat (exact) | 1.000 | 0.28 ms | ~0 MB (zero-copy) |
| coder/hnsw | 0.22 † | 0.058 ms | ~8 MB |
| chromem-go (exact) | 1.000 | 3.77 ms | ~4 MB |
FlatI8 is the standout — 0.995 recall at near-exact latency and ¼ the float32 memory. † coder/hnsw's recall is structurally construction-limited on clustered real embeddings (flat across search-ef 64→800; only ~0.4 even at M=64); it uses plain greedy neighbor selection, whereas aikit defaults to the Algorithm-4 diversity heuristic built for exactly this case. Verified fair (canonical API, correct distance, full k, finds the right region) — see the benchmark notes.
| cgo-free | model inference | exact | ANN graph | int8 | persistence | lexical + hybrid | learned-sparse | static binary | |
|---|---|---|---|---|---|---|---|---|---|
| aikit | ✅ | ✅ Model2Vec + CodeRankEmbed | ✅ Flat | ✅ HNSW (Alg-4) | ✅ FlatI8 | ✅ HNSW | ✅ BM25 + RRF/RSF | ✅ sparse | ✅ 1.8 MB |
| coder/hnsw | ✅ | — | — | ✅ | — | ✅ | — | — | ✅ |
| chromem-go | ✅ | via external API | ✅ | — | — | ✅ | — | — | ✅ |
| Bleve v2 | dense needs cgo (faiss) | — | — | ✅ vector | — | ✅ | ✅ full-text | — | dense: ✗ |
| hugot | ✗ (ONNX Runtime) | ✅ HF pipelines | — | — | — | — | — | — | ✗ |
aikit is the only one of these that ships the whole pipeline — local model
inference and dense + lexical + sparse retrieval and fusion — in a single
1.8 MB pure-Go static binary (CGO_ENABLED=0, the full ann+bm25+fuse+
embed surface). hugot covers inference but its fast backend needs the ONNX Runtime
native library (cgo); the vector DBs cover indexing but not inference. The //go:embed-a-corpus,
zero-deploy story is the lane no Python or ONNX stack reaches.
On the BeIR/scifact test set (a canonical BEIR task), aikit — potion-retrieval-32M
embeddings + exact Flat cosine — scores nDCG@10 0.638 (300 queries, 5183 docs).
That's a cross-referenceable number: SciFact + nDCG@10 is the standard MTEB/BEIR
protocol (the model's overall MTEB retrieval score is 35.06), and 0.638 is right where
a strong static retriever lands — near all-MiniLM-L6-v2's own SciFact nDCG@10, at a
fraction of the cost and pure-Go. Reproduce: scripts/prep_beir.py, then
cd benchmarks && GOWORK=off go run ./beir.
aikit runs the transformer paths — the MiniLM bi-encoder and the cross-encoder — in
pure Go. all-MiniLM-L6-v2 encodes at ~21 texts/sec (≈47 ms/text, single thread) on
CPU with no ONNX Runtime, no GPU, CGO_ENABLED=0; concurrent encoding scales that
~linearly across cores. (Primary dense retrieval uses Model2Vec static embeddings —
microseconds per text, the table above; the transformer path is the higher-fidelity
reranking/embedding step over a shortlist.) Measure it: cd benchmarks && GOWORK=off go run ./inference.
The contrast with hugot is a deployment tradeoff, not a raw-speed one. hugot's fast CPU backend is ONNX Runtime — a native shared library + cgo — and is faster than pure Go; it also ships a pure-Go GoMLX backend its docs scope to "simpler workloads / smaller models." aikit's bet runs the other way: no runtime to install, link, or version — one static binary that already holds the model. Same checkpoint on both sides, so it's apples-to-apples on quality; the difference is what you deploy.
These two tiers define what 1.0 promises. The split is frozen for v1.0, and
the Hard tier is verified backward-compatible across the 0.4.x and 0.5.x minors
(apidiff, zero incompatible changes).
From v1.0 these follow semver: no breaking change before a v2.0. This is the API to build on.
topk.Selector[T],topk.Newann.New,ann.Flat.Query,ann.Hitbm25.Build,bm25.Index,bm25.Result,bm25.Tokenizefuse.RRF,fuse.RRFWeighted,fuse.Keys,fuse.Resultembed.Load,embed.LoadFromFS,embed.StaticModelembed.LoadTokenizer,embed.Tokenizerembed.OpenSafetensors*encoder.Load,encoder.LoadFromFS,encoder.Model,encoder.Encoderinterfacechunk.Chunkerinterface;chunk.{Chunk, Register, Get, Names, ChunkFile, Language}- Concrete chunker names registered under
regex,markdown,treesitter
Young, tuning-driven surfaces that ship in 1.0 but are explicitly excluded from the compatibility promise: they may change in any release (minor or patch). Supported and useful — but pin a version, or prefer the Hard-tier equivalent, if you need stability. Each graduates to the Hard tier once it settles.
linalg— promoted to public in v0.4.0 (wasinternal/linalg).Dot,MatmulBTand the int8/int4 quant kernels are stable in shape but the surface is young and tuning-driven.encoder.Backend/encoder.RegisterBackend/encoder.NewBackend— the matmul-provider seam; new in v0.4.0.ann.HNSW/ann.NewHNSW/ann.BuildHNSW/ann.Config— theHit/Querysurface is stable, but graph internals andConfigdefaults may tune. Neighbor selection defaults to the diversity heuristic (Algorithm 4) for high recall on clustered data;Config.SimpleNeighborsopts back to plain M-nearest.ann.HNSW.MarshalBinary/ann.Load— index persistence (the//go:embed-an-index pattern). The serialized format is versioned from day one but stays Experimental until the graph internals settle.ann.FlatI8/ann.NewFlatI8— int8-quantized dense index (¼ the memory, scored via the W8A8 kernel). SameHit/Queryshape asFlat; new surface, so Experimental.ann.Config.Int8— int8-quantized HNSW: ¼ the vector memory, built + searched + persisted in the integer domain (useslinalg.DotI8). Recall is unchanged on real embeddings (measured Δ0 vs f32). New surface, settling.linalg.MatmulBTAcc64—MatmulBTwith float64 dot accumulation (bit-identical to a scalar f64 reference), for f32 reassociation error amplified downstream (attention → discrete MoE router). New surface.ann.FlatI8.MarshalBinary/ann.LoadFlatI8/ann.LoadFlatI8Mmap— int8-index persistence (the//go:embed-an-index pattern).LoadFlatI8Mmapis zero-copy (aliases the int8 codes from a read-only mapping for instant startup + page-cache sharing);FlatI8.Closereleases it. Versioned format, settling alongsideFlatI8.Flat/HNSW/FlatI8.QueryFilter(q, k, keep)— query-time logical-delete / live-set filter (the index stays immutable). New surface, settling.bm25.TokenizePlain— new general-text (Unicode word) analyzer alongside the code-tunedTokenize(which stays the default); pick whichever fits the corpus.fuse.RSF/fuse.RSFWeighted/fuse.Scored/fuse.Scores— new relative-score fusion alongside the rank-basedRRF; new surface, settling.embed.Truncate— new Matryoshka (MRL) embedding truncate + L2-renormalize helper; pairs withann.FlatI8for compounded memory reduction.sparse— the whole package is new (learned-sparse / SPLADE retrieval). TheSparseVec/Index/Queryshape is settled, but it ships only the index + scorer half (an in-process masked-LM expansion head is a planned follow-up that may extend the surface), so it stays Experimental until that lands.encoder.LoadQ8/encoder.ModelQ8(int8 quant) — alternate precision path.encoder.LoadBERT/encoder.BERT/BERT.Encode— MiniLM-class BERT encoder (learned positions + GELU FFN + mean pooling), cgo-free, parity-pinned to all-MiniLM-L6-v2 (cosine 1.0). New surface, settling.encoder.LoadSPLADE/encoder.SPLADE/SPLADE.Expand— in-process SPLADE learned-sparse expansion (BERT + masked-LM head →sparse.SparseVec), parity 1.0 vs the reference. Closes thesparseloop end-to-end. New surface.encoder.LoadCrossEncoder/encoder.CrossEncoder/CrossEncoder.Score— BERT cross-encoder reranker (scores a query/document pair → relevance logit), parity- pinned to ms-marco-MiniLM-L-6-v2. The cross-encoder half of reranking. New surface.- The mmap variant of
embed.OpenSafetensors. - The concrete chunker structs (
regex.Chunker,markdown.Chunker,treesitter.Chunker) and theirNew()— preferchunk.Get("regex"). chunk/treesitter— its own opt-in module, tagged in lockstep with the core whenever the submodule itself changes (chunk/treesitter/v1.0.0requiresaikit v1.0.0). When a core release doesn't touch the submodule it gets no new tag — the existing one keeps working, since the core'schunk.Chunkercontract is Hard-tier stable (e.g. nothing in 1.1.x or 1.2.0 changed it). Itstreesitter.ChunkerAPI is stable, but it stays Experimental because it depends on the pre-1.0, single-maintainergotreesitter— a break there could force a change here.
bm25's tokenizer is code-tuned (identifier splitting: camelCase / PascalCase / ACRONYM / digit splits, plus the lowercased run). A feature for code/RAG consumers; a hidden assumption for general NLP.encoder's CodeRankEmbed weights are code-tuned. Same caveat.annassumes L2-normalized input vectors. The normalization contract lives at theembedboundary, not inann.embedaccumulates in float64 during inference and indexes throughmapping[]— both correctness-critical (float32 silently fails the ≥1−1e-5 cosine bar on longer inputs; non-mapping access produces wrong embeddings).- Indexes are immutable after build (
ann,bm25,sparse) — a cornerstone that gives lock-free concurrentQueryand snapshot consistency. Changing corpora are handled by rebuild-and-swap, base+delta+fuse, or logical delete (QueryFilter), never by mutating an index. See architecture.md design rule 4.
Model-dependent tests skip cleanly when their per-machine assets aren't present,
so a fresh go test ./... is green with embed/encoder parity tests skipped.
Populate the assets with the Hugging Face CLI (pip install -U huggingface_hub)
— no aikit-specific tooling required:
# Model2Vec (embed parity tests) → testdata/model
huggingface-cli download minishlab/potion-code-16M \
tokenizer.json config.json model.safetensors --local-dir testdata/model
# CodeRankEmbed (encoder parity tests) → testdata/encoder-model
huggingface-cli download nomic-ai/CodeRankEmbed \
tokenizer.json config.json model.safetensors --local-dir testdata/encoder-modelembed.Load handles both Model2Vec on-disk formats: the vocabulary-quantized
potion-code-16M (with mapping/weights tensors) and the standard format
with only an embeddings tensor (direct token-id indexing, mean pooling). For
general (non-code) retrieval, prefer minishlab/potion-retrieval-32M — the
strongest static retrieval model — over the code-tuned potion-code-16M.
(If you also use ken, ken download-model [--rerank] --to <dir> fetches the same snapshots.)
Regenerate the committed golden fixtures:
.venv/bin/python scripts/pin_inference.py # Model2Vec → testdata/golden.json
.venv/bin/python scripts/pin_encoder.py # CodeRankEmbed → testdata/encoder_golden.jsonv0.x is pre-1.0; breaking changes can still land between 0.x minors when the
design requires it (the CHANGELOG records each). v0.4.0 split the LLM runtime
out to goinfer, promoted linalg to public, and added the encoder.Backend
seam — the last hard-tier-affecting break.
The Hard tier has held backward-compatible across 0.4.x and 0.5.x (verified
with apidiff — zero incompatible changes), meeting the two-consecutive-minors
bar, so it is frozen for v1.0. From v1.0 the Hard tier follows semver
(breaking changes only at a v2.0); the Experimental tier is excluded from that
promise and may change in any release until it graduates.
The persisted index blobs (ann.HNSW / ann.FlatI8 MarshalBinary) are
magic-tagged and versioned. Pre-1.0 policy: rebuild per minor — a blob is not a
stable cross-version interchange format; re-serialize your index after an aikit minor
upgrade. The safety net is loud, not silent: Load* rejects any version it doesn't
recognize with ann.ErrFormat (never a crash or a misread), so a stale blob fails
visibly and you regenerate. The format version is bumped freely within 0.x when the
layout improves. If you //go:embed blobs in your own releases, pin the aikit minor
or rebuild in your pipeline (a go generate step, as
examples/embedded-corpus does). At 1.0 this tightens to
a stronger guarantee (read N−1, or reserved-field forward-compatibility) — the next
format bump reserves header flag bytes as the mechanism for the latter.
MIT. See THIRD_PARTY_LICENSES.md for upstream
attributions (Model2Vec, semble, gotreesitter, golang.org/x/text).