Seven reliability primitives for any LLM API. No new framework, no SDK lock-in -- just small modules that wrap whatever provider you already use (OpenAI, Anthropic, Gemini, Groq, vLLM, llama.cpp, or anything OpenAI-compatible).
Docs: https://NORTHTEKDevs.github.io/lemmas/
```bash
pip install lemmas  # once published to PyPI
# or until then:
pip install git+https://github.com/NORTHTEKDevs/lemmas.git
```

```python
from openai import OpenAI
from lemmas import cove
from lemmas.adapters import openai_complete

complete = openai_complete(OpenAI(), model="gpt-4o-mini")
result = cove(complete, query="Who invented the laser?")
print(result.final)      # revised, fact-checked answer
print(result.baseline)   # original first draft
print(result.questions)  # what the model asked itself
```

That's the whole API surface for one primitive. There are seven of them.
| Module | Paper | What it does |
|---|---|---|
| cove | Dhuliawala et al. 2023 (Meta) | Chain-of-Verification. Generate, plan verification questions, answer each independently, revise. Reduces hallucination on long-form factual answers. |
| self_consistency | Wang et al. 2022 | Sample N completions at temperature > 0, return the plurality answer. Beats greedy decoding on reasoning benchmarks by 10-20 points. |
| best_of_n | classic test-time compute pattern | Sample N, score each with a scorer fn (LLM-as-judge, length, keywords, or your own reward model), return the highest-scoring. The natural companion to self-consistency for open-ended tasks. |
| reflexion | Shinn et al. 2023 | Iterative try -> critique -> retry loop. Critic feedback is fed back into the next attempt, so the model learns from its mistakes within one conversation. Plug in an LLM critic or a programmatic test (unit tests, JSON schema, exact match). |
| debate | Du, Li, Mordatch 2023 | Multi-agent debate. N agents draft independent answers, then revise after seeing the others' drafts, for R rounds. A judge (or convergence) picks the winner. Same-model or cross-model. |
| DriftDetector | rolling embedding centroid + Welford's variance | Per-bucket drift detection. Flags traffic-shape changes (abuse, eval-set staleness, prompt drift) via z-score with absolute-distance fallback. |
| race | Dean & Barroso "The Tail at Scale" 2013 | Hedged execution. Race N callables in parallel, return whichever finishes first. Generic -- not LLM-specific. |
Async parity. Every primitive has an async sibling under lemmas.asyncio:
acove, aself_consistency, abest_of_n, arace. The N-sample primitives
(aself_consistency, abest_of_n) parallelize their LLM calls with
asyncio.gather -- so what would have been 5x sequential latency becomes
~1x concurrent.
All seven are backend-agnostic: each takes a callable, not a client. You can use them with any provider, in any combination, in any framework.
Modern LLM platforms (LangChain, LlamaIndex, LiteLLM, etc.) give you routing and abstractions. They don't give you the inference-time reliability methods from the research literature. You end up reimplementing CoVe and self-consistency by hand in every project.
Four of these primitives are the ones I've reached for most often. Together they cover:
- factuality at inference time (CoVe)
- reasoning robustness (self-consistency)
- observability of behavior change (drift)
- tail-latency control (hedged execution)
Each one is ~150 lines. The whole library is < 1 kLoC. No magic, no dependencies beyond numpy.
```python
from lemmas import cove
from lemmas.adapters import anthropic_complete
from anthropic import Anthropic

complete = anthropic_complete(Anthropic(), model="claude-haiku-4-5-20251001")
r = cove(complete,
         query="List the five highest mountains in North America.",
         n_questions=5)

print("BASELINE (may contain hallucinations):")
print(r.baseline)
print("\nVERIFICATION:")
for q, a in zip(r.questions, r.answers):
    print(f" Q: {q}\n A: {a}")
print(f"\nREVISED ANSWER (had {r.revisions} revision pass):")
print(r.final)
```

How it works:
1. Baseline. Model produces a first answer.
2. Plan. Model generates N verification questions about the kinds of claims a good answer would contain. It does this without seeing the baseline, so questions stay unbiased.
3. Execute. Each question is answered independently. Bad claims can't verify each other.
4. Revise. Model rewrites the baseline using the Q/A pairs, correcting or removing anything contradicted by its own verifications.
Cost: N+2 model calls. Wins on: TriviaQA, WikiData lists, biographies, multi-fact questions. Doesn't help on: math, code, single-fact lookups.
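The four steps can be sketched in a few lines. This is an illustration of the control flow only, not the library's implementation: `cove_sketch`, `ask`, and the prompt wording are all made up for the example, and the only assumption carried over from the text is that a completion function maps a list of messages to a string. Note that this naive version spends one call per step, so N+3 calls in total (baseline, plan, N answers, revise), one more than the figure quoted above; how the library arrives at its own count is not shown here.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
CompleteFn = Callable[[List[Message]], str]


def ask(complete: CompleteFn, prompt: str) -> str:
    """Send one user message in a fresh context and return the reply text."""
    return complete([{"role": "user", "content": prompt}])


def cove_sketch(complete: CompleteFn, query: str, n_questions: int = 3) -> dict:
    # 1. Baseline: a first answer to the query.
    baseline = ask(complete, query)

    # 2. Plan: the prompt contains only the query, never the baseline.
    plan = ask(
        complete,
        f"Write {n_questions} short fact-checking questions, one per line, "
        f"for an answer to: {query}",
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()][:n_questions]

    # 3. Execute: one call per question, each with no shared context.
    answers = [ask(complete, q) for q in questions]

    # 4. Revise: the only call that sees the baseline and the Q/A pairs together.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    final = ask(
        complete,
        f"Question: {query}\nDraft: {baseline}\nVerification:\n{evidence}\n"
        "Rewrite the draft, fixing anything the verification contradicts.",
    )
    return {"baseline": baseline, "questions": questions,
            "answers": answers, "final": final}
```

Keeping step 3 free of shared context is the point of the method: a verification answer cannot simply repeat a mistake it can see in the draft.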
```python
from lemmas import self_consistency
from lemmas.adapters import openai_complete
from openai import OpenAI

# IMPORTANT: bake temperature > 0 into your complete fn.
complete = openai_complete(OpenAI(), model="gpt-4o-mini", temperature=0.7)

prompt = (
    "Janet's ducks lay 16 eggs per day. She eats three for breakfast "
    "every morning and bakes muffins with four. She sells the remainder "
    "at the farmers' market for $2 per fresh duck egg. How much in "
    "dollars does she make every day at the farmers' market?\n"
    "Think step by step, then end with 'Answer: <number>'."
)
r = self_consistency(complete,
                     messages=[{"role": "user", "content": prompt}],
                     n=7,
                     extractor="last_number")
print(f"plurality answer: {r.answer} (confidence={r.confidence:.2f})")
print(f"vote counts: {r.vote_counts}")
```

Four extractors:
| Extractor | When to use |
|---|---|
| last_line | Default. Works for "The answer is X." patterns. |
| last_number | Math word problems. |
| regex | Custom -- you supply the pattern; group 1 is the answer. |
| similarity | Open-ended generation. Embeds all N samples, returns the one closest to the centroid. Requires an embed_fn. |
Cost: N model calls. The original paper recommends N=20-40 for hard reasoning benchmarks; N=5 is enough for most tasks.
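For intuition, the voting step on its own looks roughly like the sketch below. The names `last_number` and `plurality` are illustrative and the regular expression is a guess at what a last-number extractor might do; neither is taken from the lemmas source. Confidence here is simply the winner's share of the samples that yielded an answer.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def last_number(text: str) -> Optional[str]:
    """Return the last integer or decimal that appears in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def plurality(samples: List[str]) -> Tuple[Optional[str], float, Counter]:
    """Vote over extracted answers; samples with no number are ignored."""
    votes = Counter(a for a in map(last_number, samples) if a is not None)
    if not votes:
        return None, 0.0, votes
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values()), votes


samples = [
    "16 - 3 - 4 = 9 eggs, times 2. Answer: 18",
    "She sells 9 eggs at $2 each. Answer: 18",
    "16 - 3 = 13, and 13 * 2 is 26. Answer: 26",
    "She makes $18.",
    "I am not sure.",
]
answer, confidence, votes = plurality(samples)
print(answer, confidence, dict(votes))
```

Three of the four samples that contain a number agree on 18, so the vote returns "18" with confidence 0.75 even though one reasoning chain went wrong.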
The similarity extractor is lemmas-specific -- it lets you do
self-consistency on tasks where there's no discrete answer to vote on
(summaries, code generation, creative writing). It picks the sample
nearest the semantic centroid, which empirically picks the "median"
generation -- the one that best represents what the model would say on
average.
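A centroid pick of this kind can be sketched without any dependencies. The function below is an illustration, not the library's code: it assumes `embed_fn` takes one string and returns one vector (the real signature may differ), and `closest_to_centroid` is a made-up name.

```python
import math
from typing import Callable, List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def closest_to_centroid(samples: List[str],
                        embed_fn: Callable[[str], Sequence[float]]) -> str:
    """Return the sample whose embedding has the highest cosine similarity
    to the mean of all sample embeddings."""
    vectors = [unit(embed_fn(s)) for s in samples]
    dim = len(vectors[0])
    centroid = unit([sum(v[i] for v in vectors) / len(vectors)
                     for i in range(dim)])
    sims = [sum(a * b for a, b in zip(v, centroid)) for v in vectors]
    return samples[sims.index(max(sims))]
```

With toy two-dimensional embeddings `a=[1, 0]`, `b=[0.9, 0.1]`, `c=[0, 1]`, the centroid leans toward the `a`/`b` cluster and `b` is the sample nearest to it, so the outlier `c` loses.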
```python
from lemmas import DriftDetector
from lemmas.adapters import openai_embed
from openai import OpenAI

embed = openai_embed(OpenAI(), model="text-embedding-3-small")
detector = DriftDetector(embed_fn=embed, z_threshold=3.0, warmup_n=20)

# On every incoming prompt to a deployed feature:
for prompt in incoming_prompts:
    s = detector.observe(bucket="feature-search-v1", text=prompt)
    if s.is_drift:
        slack.post(f"prompt drift on search-v1: z={s.z_score:.2f} "
                   f"after {s.n} observations")
```

What it does:
- Maintains a unit-norm centroid of recent prompt embeddings per bucket.
- On each new prompt, computes cosine distance to the centroid.
- Updates centroid and variance via exponential moving average.
- Returns a z-score; flags drift when |z| > threshold and the bucket has passed warmup.
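The bookkeeping described above fits in a small class. The sketch below is one possible reading of those four bullets, not the DriftDetector source: the class name, the smoothing factor `alpha`, and the exact update formulas are assumptions, it tracks a single bucket, takes a precomputed embedding instead of text, and leaves out the absolute-distance fallback mentioned in the module table.

```python
import math
from typing import List, Optional, Sequence, Tuple


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class DriftSketch:
    """One bucket: EMA centroid plus EMA mean/variance of the cosine distance."""

    def __init__(self, alpha: float = 0.1, z_threshold: float = 3.0,
                 warmup_n: int = 20) -> None:
        self.alpha = alpha              # hypothetical smoothing factor
        self.z_threshold = z_threshold
        self.warmup_n = warmup_n
        self.centroid: Optional[List[float]] = None
        self.mean = 0.0                 # EMA of the distance
        self.var = 0.0                  # EMA variance of the distance
        self.n = 0

    def observe(self, embedding: Sequence[float]) -> Tuple[float, bool]:
        x = unit(embedding)
        self.n += 1
        if self.centroid is None:       # first observation seeds the centroid
            self.centroid = x
            return 0.0, False

        # Cosine distance to the current centroid (both are unit vectors).
        dist = 1.0 - sum(a * b for a, b in zip(x, self.centroid))

        # Score against the statistics gathered before this observation.
        std = math.sqrt(self.var)
        z = (dist - self.mean) / std if std > 1e-12 else 0.0
        is_drift = self.n > self.warmup_n and abs(z) > self.z_threshold

        # Exponentially weighted updates of mean, variance, and centroid.
        delta = dist - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.centroid = unit([(1.0 - self.alpha) * c + self.alpha * xi
                              for c, xi in zip(self.centroid, x)])
        return z, is_drift
```

Feeding it thirty embeddings clustered around `[1, 0]` and then a single `[0, 1]` produces a very large z-score on the last observation, because the distance jumps from around 1e-4 to about 1 while the tracked standard deviation is still tiny.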
State is in-memory by default. Pass persist_fn + load_fn to persist it across processes (SQLite, Redis, whatever):

```python
detector = DriftDetector(
    embed_fn=embed,
    persist_fn=lambda bucket, state: redis.set(f"drift:{bucket}", json.dumps(state)),
    load_fn=lambda bucket: json.loads(redis.get(f"drift:{bucket}") or "null"),
)
```

Use cases:
- Detect that a customer's traffic has shifted (potential abuse or new use case).
- Detect that a deployed prompt template is being used differently than tested.
- Detect when your eval set has gone stale relative to live traffic.
```python
from lemmas import race
from lemmas.adapters import openai_complete, anthropic_complete

primary = openai_complete(...)    # gpt-4o
backup = anthropic_complete(...)  # claude-haiku

messages = [{"role": "user", "content": "Summarize this article: ..."}]
result = race([
    ("openai", lambda: primary(messages)),
    ("anthropic", lambda: backup(messages)),
], timeout_secs=10.0)
print(f"winner: {result.winner} after {result.latency_ms:.0f}ms")
print(result.value)
```

What it does:
Runs all N callables in parallel in a ThreadPoolExecutor. Returns
whichever succeeds first; cancels the rest; records losers for
telemetry. Failures don't kill the race (set fail_fast=True to change
that).
This is generic. It's not LLM-specific -- any zero-arg callables work. Use it to race two embedding providers, two web fetches, two database shards, whatever.
Cost model: if all callables run to completion, you pay N × compute. In practice the losers get cancelled mid-flight (HTTP connections closed, provider calls aborted), so you pay closer to 1.2× to 1.5× in exchange for major tail-latency wins.
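The core of a hedged call is small enough to sketch with the standard library. The function below is illustrative only: `race_sketch` and its return shape are invented for the example, and unlike the behaviour described above it does not abort calls that are already in flight. A Python thread that has started cannot be interrupted from outside, so this version merely stops waiting for the losers; actually cutting them short would need timeouts or cancellation at the HTTP-client level.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Tuple


def race_sketch(entries: List[Tuple[str, Callable[[], Any]]],
                timeout_secs: float = 10.0) -> Tuple[str, Any]:
    """Run every callable in its own thread; return (name, value) of the
    first one that finishes without raising."""
    pool = ThreadPoolExecutor(max_workers=len(entries))
    futures = {pool.submit(fn): name for name, fn in entries}
    errors: Dict[str, Exception] = {}
    try:
        # as_completed yields futures in finish order and raises
        # TimeoutError if the deadline passes first.
        for fut in as_completed(futures, timeout=timeout_secs):
            name = futures[fut]
            try:
                return name, fut.result()
            except Exception as exc:   # a failure does not end the race
                errors[name] = exc
        raise RuntimeError(f"all callables failed: {errors}")
    finally:
        # Return without waiting for losers. cancel_futures (Python 3.9+)
        # only drops work that has not started yet.
        pool.shutdown(wait=False, cancel_futures=True)


def slow() -> str:
    time.sleep(0.3)
    return "slow"


def fast() -> str:
    time.sleep(0.01)
    return "fast"


print(race_sketch([("a", slow), ("b", fast)]))
```

Because failures are recorded and skipped, a provider that errors out immediately does not beat a slower provider that eventually succeeds.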
best_of_n is the natural companion to self-consistency. Where self-consistency uses voting to pick the answer, best_of_n uses a scorer function:
```python
from lemmas import best_of_n, llm_judge_scorer
from lemmas.adapters import openai_complete
from openai import OpenAI

complete = openai_complete(OpenAI(), model="gpt-4o-mini", temperature=0.7)
judge = openai_complete(OpenAI(), model="gpt-4o-mini", temperature=0.0)

r = best_of_n(
    complete,
    messages=[{"role": "user", "content": "Write a haiku about Anchorage."}],
    scorer=llm_judge_scorer(judge,
                            rubric="Rate this haiku 0-10 on imagery, meter, and surprise. "
                                   "Respond with only the number."),
    n=5,
)
print(r.answer)  # winning haiku
print(r.scores)  # all 5 scores
```

When to use which:
| | self_consistency | best_of_n |
|---|---|---|
| Task has a discrete answer | yes (vote on it) | overkill |
| Task is open-ended | no (no token to vote on) | yes (score each) |
| You have a reward model | not used | plug it in as scorer |
| You want LLM-as-judge | no | yes (use llm_judge_scorer) |
Three scorer factories are included: llm_judge_scorer, length_scorer,
keyword_scorer. You can also pass any Callable[[str], float].
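Since a scorer is just a `Callable[[str], float]`, the selection logic reduces to a sample-score-argmax loop. The sketch below shows that loop with a toy keyword scorer; `best_of_n_sketch` and `keyword_score` are illustrative names and are not the library's `best_of_n` or `keyword_scorer`.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
CompleteFn = Callable[[List[Message]], str]


def best_of_n_sketch(complete: CompleteFn, messages: List[Message],
                     scorer: Callable[[str], float],
                     n: int = 5) -> Tuple[str, List[float]]:
    """Sample n candidates, score each one, return the top candidate and all scores."""
    candidates = [complete(messages) for _ in range(n)]
    scores = [scorer(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)  # first index wins ties
    return candidates[best], scores


def keyword_score(keywords: List[str]) -> Callable[[str], float]:
    """Build a scorer that counts how many keywords appear (case-insensitive)."""
    def score(text: str) -> float:
        lowered = text.lower()
        return float(sum(1 for k in keywords if k.lower() in lowered))
    return score
```

Any function with the same shape works as the scorer, for example a reward model wrapped as `lambda text: reward_model(text)`, where `reward_model` stands for whatever model you already have.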
Every primitive has an async sibling:
```python
from lemmas.asyncio import acove, aself_consistency, abest_of_n, arace

# N samples run concurrently instead of sequentially:
result = await aself_consistency(async_complete, messages=[...], n=10)

# Race async coroutines:
result = await arace([
    ("openai", lambda: openai_complete_async(msgs)),
    ("anthropic", lambda: anthropic_complete_async(msgs)),
])
```

For aself_consistency and abest_of_n, this turns N × latency into ~1 × latency. For acove, the N verification answers fan out concurrently (steps 1, 2, and 4 stay sequential because each depends on the one before it).
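The latency claim is easy to check with a stand-in for the provider call. In the sketch below, `fake_complete`, `sample_concurrently`, and `timed_demo` are made-up names, and the 0.1 s sleep simulates network latency; nothing here calls lemmas itself.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Tuple

Message = Dict[str, str]
AsyncCompleteFn = Callable[[List[Message]], Awaitable[str]]


async def fake_complete(messages: List[Message]) -> str:
    """Stand-in for a provider call: 0.1 s of simulated latency."""
    await asyncio.sleep(0.1)
    return "Answer: 18"


async def sample_concurrently(acomplete: AsyncCompleteFn,
                              messages: List[Message], n: int) -> List[str]:
    # gather starts all n coroutines together and keeps results in input order.
    return list(await asyncio.gather(*(acomplete(messages) for _ in range(n))))


def timed_demo(n: int = 5) -> Tuple[List[str], float]:
    """Run n simulated calls concurrently and report the wall time."""
    start = time.perf_counter()
    samples = asyncio.run(sample_concurrently(
        fake_complete, [{"role": "user", "content": "2+2?"}], n))
    return samples, time.perf_counter() - start
```

Five simulated calls finish in roughly the time of one, where a sequential loop would take about five times as long. With a real provider the gain is capped by rate limits and connection-pool size.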
```python
from lemmas import reflexion, programmatic_critic
from lemmas.adapters import openai_complete
from openai import OpenAI

complete = openai_complete(OpenAI(), model="gpt-4o-mini", temperature=0.3)

def my_critic(candidate: str) -> tuple[bool, str]:
    # Run the candidate against unit tests, JSON schema, exact-match,
    # whatever you have a verifiable signal for.
    if "FizzBuzz" in candidate:
        return True, "PASS"
    return False, "missing FizzBuzz handling"

r = reflexion(
    complete,
    query="Write fizzbuzz(n: int) in Python.",
    critic=programmatic_critic(my_critic),
    max_iterations=4,
)
print(f"passed: {r.passed} iterations: {r.iterations}")
print(r.final)
```

The critic's feedback is concatenated into the next attempt's prompt so the model gets to "see" what went wrong. When you have a verifiable signal this is stronger than best-of-N -- each retry is conditioned on the previous failure instead of being sampled blind.
Two built-in critic factories:
- llm_critic(complete, rubric=...) -- LLM-as-judge; passes when the verdict contains "PASS".
- programmatic_critic(fn) -- wraps your own (candidate) -> (passed, feedback).
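The retry loop itself can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `reflexion_sketch` is a made-up name, the feedback wording is invented, and it assumes the critic has the `(candidate) -> (passed, feedback)` shape described above.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
CompleteFn = Callable[[List[Message]], str]
Critic = Callable[[str], Tuple[bool, str]]


def reflexion_sketch(complete: CompleteFn, query: str, critic: Critic,
                     max_iterations: int = 4) -> Tuple[str, bool, int]:
    """Try, critique, retry. Returns (last candidate, passed, iterations used)."""
    messages: List[Message] = [{"role": "user", "content": query}]
    candidate = ""
    for iteration in range(1, max_iterations + 1):
        candidate = complete(messages)
        passed, feedback = critic(candidate)
        if passed:
            return candidate, True, iteration
        # Append the failed attempt and the critique so the next call sees both.
        messages = messages + [
            {"role": "assistant", "content": candidate},
            {"role": "user",
             "content": f"That attempt failed: {feedback}\nPlease try again."},
        ]
    return candidate, False, max_iterations
```

The conversation grows by two messages per failed attempt, so the worst-case prompt length scales with `max_iterations`; that is the price of letting each retry see the earlier critiques.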
lemmas.adapters includes thin wrappers for the popular SDKs so you don't have to write the (messages) -> str glue yourself:

```python
from lemmas.adapters import (
    # Provider SDKs (no hard dep -- only loaded when you call it)
    openai_complete,     # openai.OpenAI() client -> CompleteFn
    openai_embed,
    anthropic_complete,  # anthropic.Anthropic() client
    gemini_complete,     # google.generativeai.GenerativeModel() or module
    gemini_embed,
    groq_complete,       # groq.Groq() client (OpenAI-shaped)
    # Zero-SDK HTTP path -- works with anything OpenAI-compatible
    openai_compatible_complete,  # vLLM, llama.cpp, Together, Fireworks, DeepSeek,
                                 # Anyscale, Perplexity, LM Studio, Ollama (/v1), ...
    openai_compatible_embed,
    # Test stubs (deterministic, no network)
    echo_complete,
    varying_echo_complete,
)
```

If you use a provider that isn't here, write your own three-line adapter:
```python
def my_complete(messages: list[dict]) -> str:
    resp = my_client.do_chat(messages=messages, ...)
    return resp.text
```

That's the whole interface.
These compose. A high-stakes RAG endpoint might do:
```python
# Hedge between two LLM providers, then verify the winner's output.
def cove_with_hedge(query: str) -> str:
    def run_a(): return primary(messages_for(query))
    def run_b(): return backup(messages_for(query))
    raw = race([("openai", run_a), ("anthropic", run_b)]).value
    # Now use CoVe to clean up that fastest-arriving answer.
    return cove(primary, query=query).final
```

Or self-consistency + drift together:

```python
# Sample N answers AND record the prompt for drift monitoring.
detector.observe(bucket=f"key:{api_key}", text=prompt)
result = self_consistency(complete, messages=[...], n=5)
return result.answer
```

What lemmas deliberately leaves out:

- No router. Use LiteLLM, Portkey, or your own.
- No observability backend. Pipe the result objects to Langfuse, Helicone, Datadog, or print them. Lemmas gives you the data; you decide where it goes.
- No retrieval / RAG. Use LlamaIndex, LangChain, or a real vector DB.
- No agent loop. Use the framework of your choice.
- No streaming. All primitives operate on complete responses. (CoVe needs the full baseline; self-consistency needs full samples; hedged execution returns the first complete response; drift is per-prompt.)
These are deliberate. Lemmas is a small library, not a framework.
Cost and performance notes:

- CoVe adds N+2 model calls. For 5 questions on Claude Haiku, that's ~$0.002 of additional spend per query and ~6× the wall time of greedy. Reserve it for high-stakes factual answers.
- self_consistency is embarrassingly parallel. The synchronous version doesn't parallelize internally (your complete is synchronous), but wrapping it in an async caller is straightforward.
- DriftDetector is O(d) per observation, where d is the embedding dimension. The bottleneck is the embedding API call, not the math.
- race uses concurrent.futures.ThreadPoolExecutor; safe for the blocking HTTP clients that every provider SDK currently uses.
For a quick spin without writing code, lemmas ships with a tiny CLI:
```bash
# Offline (uses the deterministic echo stub):
python -m lemmas cove "Who painted the Mona Lisa?"
python -m lemmas self_consistency "What is 2+2?" --n 3

# With a real provider (OpenAI):
OPENAI_API_KEY=sk-... python -m lemmas cove "Where was Marie Curie born?"
OPENAI_API_KEY=sk-... python -m lemmas self_consistency \
    "Janet has 16 eggs..." --n 5 --extractor last_number
```

Pass --help for the full list of subcommands and flags.
```bash
git clone https://github.com/NORTHTEKDevs/lemmas.git
cd lemmas
pip install -e ".[dev]"
pytest tests/  # ~6s; no network calls.
ruff check .
```

Tests use deterministic stub CompleteFn / EmbedFn callables. No API keys required.
Lemmas uses PyPI Trusted Publishing (OIDC, no manual token). Tag-driven release flow:
```bash
git tag v0.2.1
git push origin v0.2.1
# .github/workflows/release.yml builds + publishes automatically
```

One-time setup on PyPI (maintainers only):
- Sign in at https://pypi.org and go to Account settings → Publishing.
- Click Add a new pending publisher and fill in:
  - PyPI Project Name: lemmas
  - Owner: NORTHTEKDevs
  - Repository name: lemmas
  - Workflow filename: release.yml
  - Environment name: pypi
- The next tag push will publish automatically.
If you use lemmas in a paper or product writeup:
```bibtex
@software{lemmas,
  author = {Baer, Kristian},
  title = {Lemmas: reliability primitives for LLM APIs},
  year = {2026},
  url = {https://github.com/NORTHTEKDevs/lemmas}
}
```
For the underlying methods, cite the original papers:
- Dhuliawala et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv:2309.11495.
- Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Dean & Barroso (2013). The Tail at Scale. Communications of the ACM.
MIT. Use it however you like.