A/B evaluation framework measuring whether JaRVIS reflective journaling improves Claude Code's performance on long-horizon repository generation tasks.
Hypothesis: Reflective journaling helps AI coding agents maintain coherence on long-horizon tasks — planning better, catching mistakes earlier, and producing higher-quality code.
Uses NL2Repo-Bench (104 Python library generation tasks) as the task and evaluation infrastructure.
- Claude Code CLI — installed and authenticated (
claudeon PATH) - Docker — running (used for pytest evaluation of generated repos)
- Python 3.11+ — managed with
uv - Anthropic API key — set as
ANTHROPIC_API_KEY(used for LLM-as-judge scoring)
# 1. Clone and set up
git clone https://github.com/epicrunze/JaRVIS-Bench.git
cd JaRVIS-Bench
./scripts/setup.sh # Clones NL2RepoBench + JaRVIS into vendor/, installs deps
source .venv/bin/activate
# 2. Run a smoke test (1 easy task, 1 run per condition)
./scripts/run-eval.sh --smoke# Smoke test — single easy task, 1 run
./scripts/run-eval.sh --smoke
# Single task — 3 runs per condition (baseline + jarvis-prompted)
./scripts/run-eval.sh --task <task_name>
# Full benchmark — all 104 tasks
./scripts/run-eval.sh --full
# Task subset from file
./scripts/select-tasks.sh --easy --sample 10 > my_tasks.txt
./scripts/run-eval.sh --tasks-from my_tasks.txt
# Options
./scripts/run-eval.sh --task <name> --condition baseline --runs 5 --timeout 1800
# Grade or report on an existing batch
./scripts/run-eval.sh --grade-only <batch_id>
./scripts/run-eval.sh --report-only <batch_id># Run
jarvis-bench run --smoke
jarvis-bench run --task <name> --condition jarvis-prompted --runs 5
jarvis-bench run --full --timeout 1800
jarvis-bench run --tasks-from my_tasks.txt
# Grade
jarvis-bench grade --run-id <run_id>
jarvis-bench grade --batch-id <batch_id>
# Report
jarvis-bench report --batch-id <batch_id>All subcommands accept --project-root (defaults to cwd) and -v for debug logging.
./scripts/select-tasks.sh --easy # ≤50 test cases
./scripts/select-tasks.sh --medium # 51–299 test cases
./scripts/select-tasks.sh --hard # ≥300 test cases
./scripts/select-tasks.sh --all # All 104 tasks
./scripts/select-tasks.sh --hard --sample 5 # Random 5 hard tasksPipe into run-eval.sh:
./scripts/select-tasks.sh --easy --sample 3 > tasks.txt
./scripts/run-eval.sh --tasks-from tasks.txtAfter a batch completes and is graded, generate a report:
jarvis-bench report --batch-id <batch_id>
# Report written to results/<batch_id>/report.mdThe report contains:
- Aggregate results — mean pass rate and quality score per condition, with standard deviations
- Win/Tie/Loss — per-task comparison (JaRVIS vs baseline). A task is a "win" if JaRVIS's mean pass rate exceeds baseline by >1%, "loss" if below by >1%, "tie" otherwise
- Per-task breakdown — pass rate and quality score for each task under each condition
- Improvement analysis — top 5 most improved and most regressed tasks
Primary metric: Test pass rate (passed / total tests, via Docker pytest) Supplementary metric: LLM-as-judge quality scores (architectural coherence, code quality, completeness on 0–10 scale)
JaRVIS-Bench/
├── harness/ # Python evaluation engine
│ ├── __init__.py
│ ├── __main__.py # Click CLI (jarvis-bench command)
│ ├── config.py # Dataclasses, enums, config, task discovery
│ ├── runner.py # Claude Code invocation, workspace setup
│ ├── grader.py # Docker pytest + LLM-as-judge scoring
│ └── reporter.py # Aggregation, statistics, markdown reports
├── scripts/
│ ├── setup.sh # One-time setup (clone deps, create venv)
│ ├── run-eval.sh # Shell wrapper for python -m harness
│ └── select-tasks.sh # Task subset selection by difficulty
├── docs/
│ ├── methodology.md # A/B design, metrics, limitations
│ └── nl2repo-integration.md # How we interface with NL2RepoBench
├── vendor/ # Gitignored, created by setup.sh
│ ├── NL2RepoBench/ # Task specs + evaluation infra
│ └── JaRVIS/ # Reflection skills (copied into workspaces)
├── workspaces/ # Gitignored — per-run Claude Code workspaces
├── results/ # Gitignored — run results, grades, reports
├── scaffold.md # Phased build plan
├── pyproject.toml
└── CLAUDE.md
Tasks come from NL2Repo-Bench. Each task lives in vendor/NL2RepoBench/test_files/{name}/ with:
| File | Purpose |
|---|---|
start.md |
Natural-language specification given to the coding agent |
test_commands.json |
JSON array of shell commands (typically pytest invocations) |
test_files.json |
Test file references used during evaluation |
test_case_count.txt |
Expected number of test cases |
To add custom tasks, create a directory following this structure. The grader also needs a Docker base image — see docs/nl2repo-integration.md for details.
- Docker grading (
harness/grader.py): Stages workspace, removes package/test files, builds Docker image, runs test commands, parses pytest output - LLM judge (
harness/grader.py): Reads workspace files, sends to Claude Sonnet for scoring on three dimensions - Reporter (
harness/reporter.py): Aggregates grades, computes win/tie/loss, renders markdown
See docs/methodology.md for full details on metrics and statistical approach.
source .venv/bin/activate
uv pip install -e ".[dev]"
ruff check harness/
mypy harness/