feat(backend): add GPU FlashLib IVF backend (flashlib_ivf) + IVF-vs-I… by andy-yang-1 · Pull Request #348 · StarTrail-org/LEANN

andy-yang-1 · 2026-06-02T21:32:03Z

…VF benchmark

Add leann-backend-flashlib-ivf, a GPU IVF-Flat (inverted file) approximate-NN backend built on FlashLib's flash_ivf_flat (Triton/CuteDSL). It is the GPU counterpart of the FAISS ivf backend and shares its nlist/nprobe recall knobs, so the two are drop-in comparable. The built index (centroids, data, ids, CSR offsets) is persisted with torch.save and reloaded onto the GPU at search time, so k-means is not re-trained. For the mips and cosine metrics, vectors are L2-normalized, because FlashLib IVF ranks by squared L2.
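For readers new to IVF-Flat, the following is a minimal CPU sketch of what the nlist/nprobe knobs control and why ranking L2-normalized vectors by squared L2 matches a cosine ranking (for unit vectors, squared L2 equals 2 - 2·cosine). It is a toy in pure Python, not the FlashLib kernel or the backend's code; `build_ivf`, `assign`, and `search_ivf` are hypothetical names chosen for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda j: sq_l2(v, centroids[j]))
        lists[nearest].append(i)
    return lists


def build_ivf(vectors, nlist, iters=5, seed=0):
    """Toy k-means with nlist clusters; returns (centroids, inverted lists)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iters):
        lists = assign(vectors, centroids)
        for j, members in enumerate(lists):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def search_ivf(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda j: sq_l2(query, centroids[j]))
    candidates = [i for j in by_distance[:nprobe] for i in lists[j]]
    return sorted(candidates, key=lambda i: sq_l2(query, vectors[i]))[:k]


rng = random.Random(1)
data = [l2_normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
query = l2_normalize([rng.gauss(0, 1) for _ in range(8)])
centroids, lists = build_ivf(data, nlist=8)

# Exact cosine ranking (dot product of unit vectors), used as ground truth.
exact_cosine = sorted(
    range(len(data)), key=lambda i: -sum(x * y for x, y in zip(query, data[i]))
)[:5]

# Probing every list is exhaustive, so it must reproduce the cosine ranking.
full_probe = search_ivf(query, data, centroids, lists, nprobe=8, k=5)
# Probing fewer lists scans less data and may miss some true neighbours.
partial_probe = search_ivf(query, data, centroids, lists, nprobe=2, k=5)
```

With `nprobe` equal to `nlist` the search is exhaustive; lowering `nprobe` trades recall for less work, which is the knob the two backends share.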

Also add benchmarks/flashlib_ivf_vs_faiss_ivf.py (flashlib_ivf GPU vs ivf CPU at a matched nlist across an nprobe sweep: build, latency, throughput, recall@k vs exact GT), a CUDA-guarded correctness test, the flashlib-ivf extra + uv source wiring, and a flashlib_ivf section in the backend guide.
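As a rough illustration of the recall@k metric named above, here is a small sketch that compares approximate results against exact ground-truth neighbours. The definition used (mean overlap of the top-k id sets per query) is a common one and is an assumption here; the benchmark script may compute it differently, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k ids that the approximate top-k recovers.

    approx_ids and exact_ids are lists of per-query id lists, best match first.
    """
    if len(approx_ids) != len(exact_ids):
        raise ValueError("approx_ids and exact_ids must cover the same queries")
    if not exact_ids or k <= 0:
        raise ValueError("need at least one query and k >= 1")
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        # Order within the top-k does not matter, only membership.
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))


# Two queries: the first recovers 2 of 3 true neighbours, the second all 3.
exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 3, 9], [6, 5, 4]]
recall = recall_at_k(approx, exact, k=3)  # (2 + 3) / 6
```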

On an H200 at 1M x 768 (nlist=4096, 8 CPU threads, cosine): ~13x faster build and, at nprobe=32, ~6.5x lower single-query latency / ~75x higher batched throughput at comparable recall (GPU latency ~flat vs CPU linear in nprobe).
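To see why CPU latency would grow roughly linearly with nprobe, a back-of-the-envelope estimate of the vectors scanned per query helps. The sketch below assumes equally sized inverted lists, which real k-means clusters only approximate, and ignores the nlist centroid comparisons made before probing; `expected_scan` is a hypothetical helper, not part of the benchmark.

```python
def expected_scan(n_vectors, nlist, nprobe):
    """Average number of vectors compared per query, assuming balanced lists."""
    return nprobe * n_vectors / nlist


# Configuration from the description: 1M vectors, nlist=4096.
for nprobe in (1, 8, 32, 128):
    scanned = expected_scan(1_000_000, 4096, nprobe)
    share = scanned / 1_000_000
    print(f"nprobe={nprobe:>3}: ~{scanned:,.0f} vectors ({share:.2%} of the index)")
```

Under this assumption, nprobe=32 scans about 7,812 vectors per query, under 1% of the index, and the count scales linearly with nprobe.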

What does this PR do?

Related Issues

Fixes #

Checklist

Tests pass (uv run pytest)
Code formatted (ruff format and ruff check)
Pre-commit hooks pass (pre-commit run --all-files)

yichuan-w · 2026-06-02T21:45:12Z

Nice PR to support users who have an advanced GPU without recompute. Nice work, @andy-yang-1!

andy-yang-1 wants to merge 1 commit into StarTrail-org:main from andy-yang-1:feat/flashlib-ivf-backend
