Loopsmith

Loopsmith is an eval and promotion harness for AI agents. It helps you improve agents the way you improve software: test changes, compare outputs, keep only what holds up.

Tagline: Improve agents the way you improve software: define the eval, test the candidate, keep only what survives evidence.

Loopsmith is harness-agnostic. It works with OpenClaw, Hermes, Codex, OpenCode, Claude Code, or any other agent setup that can produce baseline/candidate outputs and read/write repo files.

Use Cases

compare baseline and candidate agent behaviour with evidence
turn recurring failures into eval cases instead of complaints
promote prompt, policy, or evaluator changes only after review
keep a ledger of why an agent behaviour changed

Loopsmith vs Proof Loop

Loopsmith is not the sprint protocol itself. That is Proof Loop.

Proof Loop governs a single task: frozen acceptance criteria, separate verifier, durable verdict artifacts, and no self-certified done claims.

Loopsmith improves repeated agent behaviour over time: baseline vs candidate, eval packs, scoring, promotion/rejection, and a ledger.

Use Proof Loop inside a task. Use Loopsmith when the same failure pattern keeps coming back and the agent, prompt, policy, or evaluator itself needs measurable improvement. Both are intentionally file/protocol based, so they can travel across harnesses instead of depending on one vendor runtime. See docs/proof-loop-relationship.md.

What it is for

Use Loopsmith when an agent is producing output that is:

good enough to be dangerous
repetitive or sludgy
hard to trust
hard to review consistently
drifting after prompt or policy changes

Loopsmith is for cases where neither taste alone nor blind prompting is good enough.

It helps answer questions like:

Is this candidate actually better than the baseline?
Did we improve the output or just rewrite it differently?
Which failures should block promotion?
What is live right now, and why?

What it is not for

Loopsmith is not a chatbot wrapper, a benchmark vanity project, or a generic agent platform.

It is not trying to replace judgment. It is trying to make judgment more disciplined.

The loop

Each loop compares:

a baseline
a candidate
one or more eval cases
a verdict and promotion state

A candidate must improve evidence, not just sound clever.
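
The loop above can be sketched as a small decision function. This is a minimal illustration, not Loopsmith's actual schema: the names `CaseResult`, `PromotionState`, and `decide`, and the rule "any regression rejects", are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PromotionState(Enum):
    # Hypothetical state names; the real manifests may use different ones.
    PENDING_REVIEW = "pending_review"
    PROMOTED = "promoted"
    REJECTED = "rejected"


@dataclass
class CaseResult:
    case_id: str
    baseline_score: float
    candidate_score: float


def decide(results: List[CaseResult], approved_by: Optional[str] = None) -> PromotionState:
    """Reject without evidence or on any regression; otherwise wait for a human approver."""
    if not results:
        return PromotionState.REJECTED  # no eval cases means no evidence
    regressed = any(r.candidate_score < r.baseline_score for r in results)
    improved = any(r.candidate_score > r.baseline_score for r in results)
    if regressed or not improved:
        return PromotionState.REJECTED
    # Improvement alone is not enough: a named reviewer must approve.
    return PromotionState.PROMOTED if approved_by else PromotionState.PENDING_REVIEW
```

In this sketch a candidate that merely ties the baseline is rejected, which mirrors the rule that a candidate must improve evidence rather than just read differently.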

Credibility Artifact

See examples/README.md for the example index, or jump straight to examples/before-after-eval.md. It shows how a recurring research-brief quality problem becomes a baseline-vs-candidate eval with review artifacts.

One concrete example

A research agent can be factually competent but still painful to read. The brief may repeat the same thesis across sections, keep weak topics alive, and bury the useful signal under repetitive scaffolding.

Loopsmith can treat that as a bounded quality problem:

baseline = current research brief policy
candidate = shorter, sharper signal-density policy
eval = anti-sludge, anti-repetition, weak-topic-drop checks
promotion = only after the candidate clearly beats the baseline

See:

docs/research-brief-quality-pack.md
candidates/scout/research-policy-v3.md
candidates/scout/candidate-signal-density.json
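
As a rough illustration of what an anti-repetition check could look like, the sketch below scores a brief by the share of sentences that repeat an earlier one, then requires the candidate to beat the baseline by a margin. The function names, the sentence-splitting heuristic, and the 0.05 margin are all assumptions for the example; the actual pack may measure repetition quite differently.

```python
import re


def repeated_sentence_ratio(brief: str) -> float:
    """Share of sentences that repeat an earlier sentence after normalisation."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+\s*", brief) if s.strip()]
    if not sentences:
        return 0.0
    seen = set()
    repeats = 0
    for sentence in sentences:
        # Collapse punctuation and whitespace so near-identical sentences match.
        key = re.sub(r"\W+", " ", sentence).strip()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(sentences)


def candidate_beats_baseline(baseline: str, candidate: str, margin: float = 0.05) -> bool:
    """True only if the candidate is less repetitive by at least `margin` (assumed threshold)."""
    return repeated_sentence_ratio(candidate) + margin <= repeated_sentence_ratio(baseline)
```

The margin keeps a candidate from being promoted on noise: a brief that is only trivially less repetitive does not clearly beat the baseline.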

The kinds of failures Loopsmith can catch

Loopsmith is useful for recurring failure modes such as:

robotic direct-chat replies
generic research sludge
false completion claims
vague QA verdicts
proof claims without actual evidence
cumulative regression dishonesty
repetitive scaffolding that hides weak signal

When To Use Which Repo

Use this repo when a failure pattern keeps coming back and needs to become an eval instead of another complaint. Loopsmith compares baseline and candidate behaviour, writes review artifacts, and promotes only what survives evidence.

Use the neighbouring tools at different points in the workflow:

Need	Use
Turn a fuzzy request into an executable agent brief	Brief Master
Prove one coding task is actually done	Proof Loop
Improve repeated agent behaviour with evals	Loopsmith
Keep source-backed memory for long-running agents	Sovereign Brain
Stop frontend agents producing generic UI sludge	no-slop-ui

A practical chain looks like this: messy request -> Brief Master brief -> Proof Loop task -> Loopsmith eval if the same failure keeps recurring -> Sovereign Brain records the durable decision.

Related Tools

Proof Loop - task-level completion protocol. Loopsmith is the next step when the same proof failure keeps recurring.
Sovereign Brain - source-backed memory and review workflow for long-running agents; useful context for eval decisions and agent behaviour history.
Brief Master - improves the briefs that become eval inputs, candidate policies, or Proof Loop specs.

Repo layout

agents/ — agent profiles
evals/ — agent and shared eval pack definitions
baseline/ — current baseline outputs or fixtures
candidates/ — candidate variants under test
promoted/ — promoted candidate manifests
rejected/ — rejected candidate manifests
ledger/ — promotion history
policies/ — mutation boundaries and promotion rules
runs/ — generated run logs, summaries, review queue, promotion index, and provenance views
src/ — schemas, scoring, loaders, runner, CLI, summaries, operator views
docs/ — design notes, usage, review flow, artifact policy, evaluator strategy, Proof Loop relationship, shared-pack guidance, sanitisation notes, and pack patterns
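
To make the role of the ledger/ directory concrete, here is one way promotion history could be recorded: an append-only JSON Lines file with one entry per decision. The file name, field names, and helper functions are hypothetical; the repo's real ledger format is not described here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def append_ledger_entry(ledger_path: Path, agent: str, candidate: str,
                        decision: str, reason: str, approved_by: str) -> dict:
    """Append one decision to the ledger and return the entry that was written."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "candidate": candidate,
        "decision": decision,      # e.g. "promoted" or "rejected" (assumed values)
        "reason": reason,          # why the behaviour changed
        "approved_by": approved_by,
    }
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    with ledger_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def read_ledger(ledger_path: Path) -> List[dict]:
    """Return all ledger entries in the order they were appended."""
    if not ledger_path.exists():
        return []
    with ledger_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending rather than rewriting keeps earlier decisions intact, so the file can answer "what is live right now, and why?" by reading it top to bottom.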

CLI examples

python3 src/cli.py run --agent conductor
python3 src/cli.py run --agent scout --json
python3 src/cli.py run --agent iris
python3 src/cli.py run --agent rex
python3 src/cli.py run-shared --pack golden:anti-bullshit
python3 src/cli.py promote --agent conductor --candidate candidate-001 --approved-by reviewer
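
If you want to drive these commands from a script, a thin wrapper that builds the argument lists is usually enough. The sketch below uses only the subcommands and flags shown above; the helper names `run_command` and `promote_command` are made up for the example and are not part of the repo.

```python
import sys
from typing import List

CLI = "src/cli.py"  # path as used in the examples above


def run_command(agent: str, as_json: bool = False) -> List[str]:
    """Build the argv for `run --agent <agent>`, optionally with `--json`."""
    cmd = [sys.executable, CLI, "run", "--agent", agent]
    if as_json:
        cmd.append("--json")
    return cmd


def promote_command(agent: str, candidate: str, approved_by: str) -> List[str]:
    """Build the argv for `promote`; refuses an empty approver name."""
    if not approved_by.strip():
        raise ValueError("promotion needs a named approver")
    return [sys.executable, CLI, "promote", "--agent", agent,
            "--candidate", candidate, "--approved-by", approved_by]
```

From the repo root, `subprocess.run(run_command("scout", as_json=True), check=True)` would then execute the same command as the second example.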

Research brief quality packs

Loopsmith can improve research agents that are technically competent but operationally dull to read.

A research brief quality pack can encode recurring failure modes like:

repeated thesis inflation
template fatigue across sections
weak-topic retention
reader-specific scaffolding bloat
fake completeness instead of signal density

See docs/research-brief-quality-pack.md for the public pattern.

Operator views

Once a repo has multiple packs and promotion states, the human operator needs a clean control surface. Loopsmith generates:

pack summaries
a review queue
a promotion index
a baseline provenance view

so a reviewer can quickly see what is eligible, what needs review, what regressed, what is currently live, and where that live state came from.
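
A review queue of this kind can be thought of as a grouping of candidates by state. The sketch below is an assumption about how such a view might be derived from run records; the record keys and the four state labels are hypothetical, not the generator's real output format.

```python
from collections import defaultdict
from typing import Dict, List


def build_review_queue(runs: List[dict]) -> Dict[str, List[str]]:
    """Group candidate ids by the state a reviewer cares about."""
    queue = defaultdict(list)
    for run in runs:
        if run["candidate_score"] < run["baseline_score"]:
            state = "regressed"
        elif run.get("approved_by"):
            state = "live"           # approved and not regressed
        elif run["candidate_score"] > run["baseline_score"]:
            state = "eligible"       # better than baseline, awaiting approval
        else:
            state = "needs_review"   # tie: a human has to look
        queue[state].append(run["candidate"])
    return dict(queue)
```

Checking for regression first means an approved candidate that later scores below its baseline shows up as regressed rather than live.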

Shared packs

Some failure modes are not agent-local. Shared packs let Loopsmith express cross-agent behavioural families as first-class objects with explicit metadata, participating agents, and clearer operator-facing summaries.
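
One way to picture a shared pack as a first-class object is a small immutable record carrying its metadata and participating agents. The class and field names below are illustrative assumptions, and the agent and case ids in the usage note are placeholders rather than the pack's real contents.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SharedPack:
    # Hypothetical fields; the real pack definitions live under evals/.
    pack_id: str
    description: str
    agents: Tuple[str, ...]
    case_ids: Tuple[str, ...] = ()

    def summary(self) -> str:
        """One-line operator-facing summary of the pack."""
        return f"{self.pack_id}: {len(self.case_ids)} cases across {len(self.agents)} agents"
```

For example, `SharedPack("golden:anti-bullshit", "cross-agent honesty checks", ("iris", "rex"), ("case-1", "case-2")).summary()` yields a line an operator can scan in a pack summary.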

Evaluator-specific logic

Some cases are too important to judge with loose heuristics alone. Loopsmith supports case-specific evaluators for proof-heavy checks such as:

Forge proof-before-done
Iris AC verdict discipline
Iris review-vs-validation boundary
Rex cumulative regression honesty
Rex layered reporting honesty
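
The idea of case-specific evaluators can be sketched as a registry keyed by case id, with a loose fallback for everything else. The registry, the decorator, the case id string, and the keyword patterns below are all assumptions for illustration; real proof-heavy evaluators would need far stricter evidence checks than keyword matching.

```python
import re
from typing import Callable, Dict

EVALUATORS: Dict[str, Callable[[str], bool]] = {}


def evaluator(case_id: str):
    """Register a function as the dedicated evaluator for one case id."""
    def register(fn: Callable[[str], bool]) -> Callable[[str], bool]:
        EVALUATORS[case_id] = fn
        return fn
    return register


@evaluator("proof-before-done")  # hypothetical case id
def proof_before_done(output: str) -> bool:
    """Fail when the output claims completion without citing any evidence."""
    claims_done = re.search(r"\b(done|complete|completed|finished)\b", output, re.I) is not None
    cites_evidence = re.search(r"\b(test output|exit code|log|diff)\b", output, re.I) is not None
    return cites_evidence or not claims_done


def evaluate(case_id: str, output: str) -> bool:
    """Use the case-specific evaluator if one exists, else a loose non-empty check."""
    fallback = lambda text: bool(text.strip())
    return EVALUATORS.get(case_id, fallback)(output)
```

Keeping the fallback deliberately weak makes the design choice visible: only cases important enough to earn a dedicated evaluator get strict judgment.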

Current status

Loopsmith shipped a real v1 and is now moving through hardening passes. The public-share cleanup is documented in docs/recovery-pass.md.

V1 delivered

repo skeleton
initial eval schema
initial run logging schema
mutation boundaries
first loop runner
3 strong demo agents
starter packs for the rest
public sanitisation

V2 delivered so far

better scoring (pass_fail, rubric, composite)
promotion flow with human approval
file-driven runner + CLI
stronger Iris and Rex packs
anti-bullshit golden cases
pack-level review summaries
stronger shared-pack review flow
review queue and promotion index
case-specific evaluators for proof-heavy cases
documented evaluator strategy and selective expansion rules
artifact policy and baseline provenance views
shared packs as first-class objects with metadata
reusable research-brief quality pack pattern for anti-sludge and signal-density tuning
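
The three scoring modes named in the list above (pass_fail, rubric, composite) could be combined roughly as follows. This is a sketch under assumptions: the 0..1 score range, the 1..5 rating scale, and the weighted-average composite are choices made for the example, not a description of the repo's scoring code.

```python
from typing import Dict, List, Tuple


def pass_fail(passed: bool) -> float:
    """Binary score: 1.0 for a pass, 0.0 for a fail."""
    return 1.0 if passed else 0.0


def rubric(ratings: Dict[str, int], max_rating: int = 5) -> float:
    """Mean of per-criterion ratings, normalised to the 0..1 range."""
    if not ratings:
        raise ValueError("rubric needs at least one criterion")
    return sum(ratings.values()) / (max_rating * len(ratings))


def composite(parts: List[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in parts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weight for score, weight in parts) / total_weight
```

A composite such as `composite([(pass_fail(True), 3.0), (rubric({"clarity": 3, "density": 2}), 1.0)])` lets a hard check dominate while a rubric still contributes.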

Current deep areas

Loopsmith is currently deepest in these kinds of agent work:

direct response quality
research brief quality
proof-before-done implementation discipline
review verdict quality
acceptance and regression reporting honesty

The rest of the repo still ships with lighter starter packs while the core patterns are being hardened.

Design rules

No giant-file soup
Split by concern
Explicit mutation boundaries
Human promotion gate for meaningful changes
Public-safe structure from the start
