JaRVIS-Bench

A/B evaluation framework measuring whether JaRVIS reflective journaling improves Claude Code's performance on long-horizon repository generation tasks.

Hypothesis: Reflective journaling helps AI coding agents maintain coherence on long-horizon tasks — planning better, catching mistakes earlier, and producing higher-quality code.

Uses NL2Repo-Bench (104 Python library generation tasks) as the task and evaluation infrastructure.

Prerequisites

Claude Code CLI — installed and authenticated (claude on PATH)
Docker — running (used for pytest evaluation of generated repos)
Python 3.11+ — managed with uv
Anthropic API key — set as ANTHROPIC_API_KEY (used for LLM-as-judge scoring)

Quick Start

# 1. Clone and set up
git clone https://github.com/epicrunze/JaRVIS-Bench.git
cd JaRVIS-Bench
./scripts/setup.sh          # Clones NL2RepoBench + JaRVIS into vendor/, installs deps
source .venv/bin/activate

# 2. Run a smoke test (1 easy task, 1 run per condition)
./scripts/run-eval.sh --smoke

Running Evaluations

Shell entry point

# Smoke test — single easy task, 1 run
./scripts/run-eval.sh --smoke

# Single task — 3 runs per condition (baseline + jarvis-prompted)
./scripts/run-eval.sh --task <task_name>

# Full benchmark — all 104 tasks
./scripts/run-eval.sh --full

# Task subset from file
./scripts/select-tasks.sh --easy --sample 10 > my_tasks.txt
./scripts/run-eval.sh --tasks-from my_tasks.txt

# Options
./scripts/run-eval.sh --task <name> --condition baseline --runs 5 --timeout 1800

# Grade or report on an existing batch
./scripts/run-eval.sh --grade-only <batch_id>
./scripts/run-eval.sh --report-only <batch_id>

Python CLI (`jarvis-bench`)

# Run
jarvis-bench run --smoke
jarvis-bench run --task <name> --condition jarvis-prompted --runs 5
jarvis-bench run --full --timeout 1800
jarvis-bench run --tasks-from my_tasks.txt

# Grade
jarvis-bench grade --run-id <run_id>
jarvis-bench grade --batch-id <batch_id>

# Report
jarvis-bench report --batch-id <batch_id>

All subcommands accept --project-root (defaults to cwd) and -v for debug logging.

Selecting task subsets

./scripts/select-tasks.sh --easy           # ≤50 test cases
./scripts/select-tasks.sh --medium         # 51–299 test cases
./scripts/select-tasks.sh --hard           # ≥300 test cases
./scripts/select-tasks.sh --all            # All 104 tasks
./scripts/select-tasks.sh --hard --sample 5  # Random 5 hard tasks

Pipe into run-eval.sh:

./scripts/select-tasks.sh --easy --sample 3 > tasks.txt
./scripts/run-eval.sh --tasks-from tasks.txt

Interpreting Results

After a batch completes and is graded, generate a report:

jarvis-bench report --batch-id <batch_id>
# Report written to results/<batch_id>/report.md

The report contains:

Aggregate results — mean pass rate and quality score per condition, with standard deviations
Win/Tie/Loss — per-task comparison (JaRVIS vs baseline). A task is a "win" if JaRVIS's mean pass rate exceeds baseline by >1%, "loss" if below by >1%, "tie" otherwise
Per-task breakdown — pass rate and quality score for each task under each condition
Improvement analysis — top 5 most improved and most regressed tasks

Primary metric: Test pass rate (passed / total tests, via Docker pytest) Supplementary metric: LLM-as-judge quality scores (architectural coherence, code quality, completeness on 0–10 scale)

Project Structure

JaRVIS-Bench/
├── harness/                    # Python evaluation engine
│   ├── __init__.py
│   ├── __main__.py             # Click CLI (jarvis-bench command)
│   ├── config.py               # Dataclasses, enums, config, task discovery
│   ├── runner.py               # Claude Code invocation, workspace setup
│   ├── grader.py               # Docker pytest + LLM-as-judge scoring
│   └── reporter.py             # Aggregation, statistics, markdown reports
├── scripts/
│   ├── setup.sh                # One-time setup (clone deps, create venv)
│   ├── run-eval.sh             # Shell wrapper for python -m harness
│   └── select-tasks.sh         # Task subset selection by difficulty
├── docs/
│   ├── methodology.md          # A/B design, metrics, limitations
│   └── nl2repo-integration.md  # How we interface with NL2RepoBench
├── vendor/                     # Gitignored, created by setup.sh
│   ├── NL2RepoBench/           # Task specs + evaluation infra
│   └── JaRVIS/                 # Reflection skills (copied into workspaces)
├── workspaces/                 # Gitignored — per-run Claude Code workspaces
├── results/                    # Gitignored — run results, grades, reports
├── scaffold.md                 # Phased build plan
├── pyproject.toml
└── CLAUDE.md

Extending

Adding new tasks

Tasks come from NL2Repo-Bench. Each task lives in vendor/NL2RepoBench/test_files/{name}/ with:

File	Purpose
`start.md`	Natural-language specification given to the coding agent
`test_commands.json`	JSON array of shell commands (typically pytest invocations)
`test_files.json`	Test file references used during evaluation
`test_case_count.txt`	Expected number of test cases

To add custom tasks, create a directory following this structure. The grader also needs a Docker base image — see docs/nl2repo-integration.md for details.

Customizing evaluation

Docker grading (harness/grader.py): Stages workspace, removes package/test files, builds Docker image, runs test commands, parses pytest output
LLM judge (harness/grader.py): Reads workspace files, sends to Claude Sonnet for scoring on three dimensions
Reporter (harness/reporter.py): Aggregates grades, computes win/tie/loss, renders markdown

See docs/methodology.md for full details on metrics and statistical approach.

Development

source .venv/bin/activate
uv pip install -e ".[dev]"
ruff check harness/
mypy harness/

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.claude		.claude
configs		configs
docker		docker
docs		docs
harness		harness
results		results
scripts		scripts
workspaces		workspaces
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
pyproject.toml		pyproject.toml
scaffold.md		scaffold.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JaRVIS-Bench

Prerequisites

Quick Start

Running Evaluations

Shell entry point

Python CLI (`jarvis-bench`)

Selecting task subsets

Interpreting Results

Project Structure

Extending

Adding new tasks

Customizing evaluation

Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

JaRVIS-Bench

Prerequisites

Quick Start

Running Evaluations

Shell entry point

Python CLI (jarvis-bench)

Selecting task subsets

Interpreting Results

Project Structure

Extending

Adding new tasks

Customizing evaluation

Development

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Python CLI (`jarvis-bench`)

Packages