FreshArena

FreshArena is a dynamic, verifiable, and adversarial evaluation protocol for coding agents.

Instead of testing agents on a fixed set of public problems, FreshArena generates fresh task instances from versioned task families, evaluates submissions with deterministic verifier packages, and runs submit-then-test adversarial checks after the solver commits its answer.

The first release focuses on JSON transformation tasks, to study whether freshly generated tasks reveal overfitting that fixed benchmarks miss.
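The idea of generating fresh instances from a versioned task family and checking them with a deterministic verifier can be sketched as below. This is a minimal illustration under stated assumptions, not the FreshArena implementation: the `TaskInstance` shape, the `json-normalize-keys` family name, the version string, and the functions `makeRng`, `generateNormalizeTask`, and `verify` are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for illustration only; not the FreshArena API.
interface TaskInstance {
  family: string;
  familyVersion: string;
  seedHash: string;
  input: Record<string, number>;
  expected: Record<string, number>;
}

// Small deterministic PRNG (mulberry32 style): the same seed always
// produces the same sequence of numbers in [0, 1).
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generate one "normalize keys to lower case" instance from a seed.
function generateNormalizeTask(seed: number, familyVersion = "0.1.0"): TaskInstance {
  const family = "json-normalize-keys"; // assumed family name
  const rng = makeRng(seed);
  const input: Record<string, number> = {};
  const expected: Record<string, number> = {};
  const keyCount = 3 + Math.floor(rng() * 3); // 3 to 5 keys
  for (let i = 0; i < keyCount; i++) {
    const base = `field${i}`;
    const key = rng() < 0.5 ? base.toUpperCase() : base;
    const value = Math.floor(rng() * 1000);
    input[key] = value;
    expected[base] = value; // expected output uses the lower-case key
  }
  // Hash of family, version, and seed identifies the instance without exposing the seed.
  const seedHash = createHash("sha256")
    .update(`${family}@${familyVersion}:${seed}`)
    .digest("hex");
  return { family, familyVersion, seedHash, input, expected };
}

// Serialize key/value pairs in sorted key order so comparison ignores key order.
function canonical(obj: Record<string, number>): string {
  const entries = Object.entries(obj).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return JSON.stringify(entries);
}

// Deterministic verifier: the same task and output always give the same verdict.
function verify(task: TaskInstance, output: Record<string, number>): boolean {
  return canonical(output) === canonical(task.expected);
}
```

Because the generator is a pure function of the seed and family version, anyone holding the seed can regenerate the exact instance, while a solver that has only seen earlier instances cannot have memorized it.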

Core Research Question

Do the same solvers rank significantly differently on fixed public tasks than on freshly generated tasks?

If yes, the rank shift is evidence that fixed benchmarks hide real capability gaps and that dynamic evaluation is worth pursuing.
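One way to quantify how differently solvers rank on the two task sets is a rank correlation such as Kendall's tau. The sketch below is an assumption about how such a comparison could be done, not the analysis FreshArena prescribes; the `SolverScores` shape and the `kendallTau` function are hypothetical, and the function computes tau-a with no tie correction.

```typescript
// Hypothetical row shape: one solver's score on each task set.
interface SolverScores {
  solver: string;
  fixedScore: number;
  freshScore: number;
}

// Kendall tau-a over all solver pairs (no tie correction).
// +1 means identical ordering on both task sets, -1 means fully reversed.
function kendallTau(rows: SolverScores[]): number {
  let concordant = 0;
  let discordant = 0;
  rows.forEach((a, i) => {
    for (const b of rows.slice(i + 1)) {
      const sign =
        Math.sign(a.fixedScore - b.fixedScore) *
        Math.sign(a.freshScore - b.freshScore);
      if (sign > 0) concordant++; // pair ordered the same way on both sets
      else if (sign < 0) discordant++; // pair flips order between sets
    }
  });
  const pairs = (rows.length * (rows.length - 1)) / 2;
  return pairs === 0 ? Number.NaN : (concordant - discordant) / pairs;
}
```

A tau well below 1 indicates a reshuffled leaderboard, but tau alone does not establish significance; a permutation test or bootstrap over task instances would be needed for that.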

Quick Start

# Run with non-LLM baseline (no API key required)
fresharena run examples/non-llm-baseline

# Replay a recorded evaluation
fresharena replay records/samples/sample-run.jsonl

# Verify the JSON transform world
fresharena verify worlds/json-transform

Repository Layout

fresharena/
  packages/
    faep-schema/        # FAEP v0.1 record schema + Zod types
    core/               # Shared evaluation engine
    cli/                # fresharena CLI
    verifier-runtime/   # Deterministic verifier sandbox
    reporter/           # HTML / JSONL report generation
  worlds/
    json-transform/     # First task world: JSON normalize, diff, patch, merge
  solvers/
    non-llm/            # Reference, weak, and buggy baseline solvers
    llm/                # LLM solver adapters (OpenAI-compatible, Anthropic, local)
  records/samples/      # Example FAEP evaluation records
  reports/static/       # Pre-generated HTML reports
  docs/                 # Protocol spec, scoring, experiment design

Evaluation Tracks

Track	Purpose	LLM
Model-Fixed	Compare agent workflow / prompt strategies	Fixed
Model-Open	Compare full product capability	Any
Non-LLM Baseline	Reproducible lower bound, no API cost	None
Budget-Normalized	Compare cost efficiency	Any, fixed budget

Main paper conclusions must come from the Model-Fixed track, where holding the LLM constant isolates differences in agent workflow and prompt strategy.

FAEP: FreshArena Evaluation Protocol

Each evaluation run produces a faep_record — a signed, replayable JSONL artifact containing:

task spec + seed hash
solver metadata + workflow hash
generator + tester metadata
verifier version + result hash
score vector (canonical / hidden / adversarial / immunity pass)
replay command + log hash

See docs/protocol-faep.md for the full schema.
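To make the record contents listed above concrete, here is a rough TypeScript sketch of what one record line might look like. Every field name, the numeric score types, and the helpers `sha256Hex`, `toJsonlLine`, and `logMatches` are assumptions for illustration; the authoritative schema is the one in docs/protocol-faep.md, and signing is not shown.

```typescript
import { createHash } from "node:crypto";

// Hypothetical field names mirroring the listed contents; not the real schema.
interface ScoreVector {
  canonical: number;
  hidden: number;
  adversarial: number;
  immunity: number; // whether scores are rates or pass/fail flags is not specified here
}

interface FaepRecordSketch {
  taskSpec: string;
  seedHash: string;
  solver: { name: string; workflowHash: string };
  generator: string;
  tester: string;
  verifier: { version: string; resultHash: string };
  scores: ScoreVector;
  replayCommand: string;
  logHash: string;
}

// SHA-256 hex digest, used here for every hash field.
const sha256Hex = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

// JSONL stores one JSON object per line, so the serialized record must not
// contain raw newlines; JSON.stringify without indentation guarantees that.
function toJsonlLine(record: FaepRecordSketch): string {
  return JSON.stringify(record);
}

// Replay check: the log produced on replay should hash to the recorded value.
function logMatches(record: FaepRecordSketch, replayedLog: string): boolean {
  return sha256Hex(replayedLog) === record.logHash;
}
```

Storing hashes rather than full logs keeps each record small while still letting a replay detect any divergence from the original run.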

Relationship to WasmAgent Projects

Project	Role in FreshArena	Required in MVP
`wasmagent-js`	Sandbox / tool-use runtime reference	No
`open-agent-audit`	Evidence record enhancement layer	Optional
`trace-pipeline`	Export failure traces as training data	Phase 2
`bscode`	Coding task source / solver baseline	Phase 2

License

Apache-2.0 — see LICENSE.
