
feat(backend): add GPU FlashLib IVF backend (flashlib_ivf) + IVF-vs-I…#348

Open
andy-yang-1 wants to merge 1 commit into
StarTrail-org:mainfrom
andy-yang-1:feat/flashlib-ivf-backend


@andy-yang-1

…VF benchmark

Add leann-backend-flashlib-ivf, a GPU IVF-Flat (inverted file) approximate nearest-neighbor backend built on FlashLib's flash_ivf_flat (Triton/CuteDSL). It is the GPU counterpart of the FAISS ivf backend and shares its nlist/nprobe recall knobs, so the two are drop-in comparable. The built index (centroids/data/ids/CSR offsets) is persisted with torch.save and reloaded to the GPU at search time, so k-means is not re-trained. For the mips and cosine metrics, vectors are L2-normalized, because FlashLib IVF ranks by squared L2.
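For readers less familiar with IVF-Flat, here is a minimal pure-Python sketch of the idea behind the nlist/nprobe knobs and the normalization step. It is not the FlashLib API or this backend's code: the helper names (`l2_normalize`, `build_ivf`, `ivf_search`) are hypothetical, the centroids are sampled at random instead of being trained with k-means, and everything runs on the CPU. It only illustrates that, on L2-normalized vectors, ranking by squared L2 gives the same order as ranking by cosine similarity, and that probing all nlist lists reduces to exact search.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sq_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(v, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: sq_l2(query, vectors[i]))
    return candidates[:k]


random.seed(0)
dim, n, nlist, k = 8, 200, 8, 5
vectors = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]
# Assumption: random samples stand in for trained k-means centroids.
centroids = random.sample(vectors, nlist)
lists = build_ivf(vectors, centroids)
query = l2_normalize([random.gauss(0, 1) for _ in range(dim)])

# Exact top-k by cosine similarity (a plain dot product on unit vectors).
exact_by_cosine = sorted(
    range(n), key=lambda i: -sum(a * b for a, b in zip(query, vectors[i]))
)[:k]

# Probing every list is an exhaustive scan ranked by squared L2.
full_probe = ivf_search(query, vectors, centroids, lists, k, nprobe=nlist)
# Probing a single list is cheaper but may miss true neighbors.
one_probe = ivf_search(query, vectors, centroids, lists, k, nprobe=1)
```

Raising nprobe scans more lists, which is the recall-versus-latency trade-off the two backends share.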

Also add benchmarks/flashlib_ivf_vs_faiss_ivf.py, which compares flashlib_ivf (GPU) with ivf (CPU) at a matched nlist across an nprobe sweep (build time, latency, throughput, and recall@k against exact ground truth), plus a CUDA-guarded correctness test, the flashlib-ivf extra and uv source wiring, and a flashlib_ivf section in the backend guide.
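As a reference for the recall@k column, the sketch below shows one common way to score approximate results against exact ground truth. The function name `recall_at_k` and the sample id lists are illustrative assumptions, and the benchmark script may compute the metric differently.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of each query's exact top-k ids found in its approximate top-k."""
    if len(approx_ids) != len(exact_ids):
        raise ValueError("approx_ids and exact_ids must cover the same queries")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        total += len(set(approx[:k]) & set(exact[:k])) / k
    return total / len(exact_ids)


# Two hypothetical queries: each approximate result misses one true neighbor.
exact = [[0, 1, 2, 3], [4, 5, 6, 7]]
approx = [[0, 1, 2, 9], [4, 5, 8, 7]]
score = recall_at_k(approx, exact, k=4)
```

Comparing as sets means the order within the top-k does not affect the score, only whether each true neighbor was retrieved.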

On an H200 with 1M x 768 vectors (nlist=4096, 8 CPU threads, cosine), the GPU backend builds ~13x faster and, at nprobe=32, has ~6.5x lower single-query latency and ~75x higher batched throughput at comparable recall. GPU latency stays roughly flat as nprobe grows, while CPU latency grows linearly with nprobe.

What does this PR do?

Related Issues

Fixes #

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

@yichuan-w
Collaborator

Nice PR to support users who have an advanced GPU and run without recompute. Nice work, @andy-yang-1!
