MatrixArena 🏟️

A decentralised, peer-review evaluation framework where top-tier LLMs dynamically generate tasks, solve them, and judge each other.

Static benchmarks like MMLU are easily gamed. MatrixArena replaces them with a continuous, rotating evaluation loop: one model generates a coding problem, every model independently solves it, and all non-generator models cross-judge every solution. The result is an Elo-based leaderboard that reflects multi-dimensional coding ability and is hard to overfit.

Architecture

MatrixArena/
├── main.py                  # CLI entry point
├── requirements.txt         # Python dependencies
├── .env.example             # API key template
│
├── config/
│   ├── settings.py          # Config loader (YAML + env vars, auto-disables missing keys)
│   └── models.yaml          # Pool of participating models
│
├── prompts/                 # Prompt templates
│   ├── generator.txt        # Novel problem generation + Elo incentive rules
│   ├── solver.txt           # Solution output + originality warning + Elo incentive
│   └── judge.txt            # Scoring rubric + execution result + Judge Elo incentive
│
├── core/
│   ├── gateway.py           # litellm async wrapper (retries, backoff, empty/truncation recovery)
│   ├── orchestrator.py      # Full Gen → Solve × N → Execute × N → Judge × N² loop
│   └── elo_rating.py        # Elo for Solvers (K=32) + Generator calibration (K=16)
│
├── sandbox/
│   ├── Dockerfile           # Isolated execution image (non-root, no network)
│   └── executor.py          # SubprocessExecutor: syntax check + per-test harness
│
├── data/                    # Runtime outputs (git-ignored)
│   ├── battles.jsonl        # Append-only compact battle log
│   ├── leaderboard.json     # Latest Elo snapshot
│   └── cycles/              # Full per-cycle artifacts
│       └── <YYYYMMDD_HHMMSS>_c<N>/
│           ├── summary.json
│           ├── generator_problem.json
│           ├── solver_<model_slug>.json       # one per solver
│           ├── execution_<model_slug>.json    # one per solver
│           └── judgments_<model_slug>.json    # one per solver (all judges)
│
└── dashboard/
    └── app.py               # Streamlit UI (leaderboard, Elo history, cycle detail)

Evaluation Cycle

Each cycle runs 5 sequential phases:

### Phase 1 ── GENERATION ──────────────────────────────────────────
  **Modes:**
  - **Random:** A model is randomly chosen as the Generator.
  - **Sequential (NEW):** If the total number of cycles specified (`--cycles`) is a multiple of the number of available models, the system enter **Sequential Mode**. Each model takes turns being the Generator exactly once (or twice, if `cycles = 2 * N`, etc.). This ensures 100% fairness in problem generation opportunities.
  
  The Generator creates a novel coding problem including:
  (JSON: title, description, test_cases, evaluation_criteria)

### Phase 2 ── SOLVING (all N models, concurrent) ──────────────────
  Every model in the pool — including the Generator — independently
  writes a Python solution to the same problem.
  **Resilience:** If a solver times out (default 240s) or fails, it receives a score based on its actual performance (usually 0).

### Phase 3 ── SANDBOX EXECUTION ───────────────────────────────────
  Each of the N solutions is run locally in a subprocess sandbox.
  Syntax check → test harness injection → per-test pass/fail.
  **Hardened Sandbox:** Prevents infinite loops or hung processes using `stdin=null` and strict timeouts.

### Phase 4 ── JUDGING (concurrent) ────────────────────────────────
  All N models (including the Generator) act as Judges.
  Each Judge scores every solution except its own.
  → N × (N-1) total scoring calls, all concurrent.
  **Fairness & Reliability:** 
  - A model never judges its own solution.
  - **Score Aggregation:** If a judge fails to provide a score (timeout or API error), its "vote" is discarded. It no longer defaults to 5.0, ensuring that unreliable judges don't skew the leaderboard.

### Phase 5 ── ELO UPDATE ──────────────────────────────────────────
  Each Solver's Elo is updated from its aggregate judge score (K=32).
  The Generator's Elo is updated by a calibration formula:
    outcome = 1 − |normalized_score − 0.5| × 2
  Peaks at score=5/10 (good difficulty), penalises trivial or
  impossible problems.
---
## Key Features & Resilience

- **Sequential Generator Mode:** Automatically triggered when `cycles` is a multiple of the model count, ensuring every model serves as a generator equally.
- **Enhanced Timeouts:** Default 240s timeout for API calls to accommodate slow-reasoning models (e.g., Qwen, O1/O3).
- **Raw Output Logging:** Every raw LLM response is saved in `data/cycles/<timestamp>/raw/` for deep-dive analysis and debugging.
- **Hard Abort on Generation Failure:** If a generator fails to produce a valid problem, the cycle is aborted immediately (preventing "Two Sum" fallbacks) to save execution time and credits.
- **Robust JSON Parsing:** Advanced multi-stage parsing (Regex + Brace Balance) allows the system to extract JSON from models that provide conversational "fluff" or Markdown wrappers.

Fairness rule: a model never judges its own solution.

Role matrix for N=10 models

Role	Count	Who
Generator	1	Randomly chosen each cycle
Solvers	10	All models (inc. Generator)
Judges per solution	9	All models except the Solver being judged
Total judge API calls	~90	Fully concurrent

Quick Start

1. Clone & install dependencies

git clone https://github.com/Wes1eyyy/MatrixArena.git
cd MatrixArena
pip install -r requirements.txt

2. Configure API keys

cp .env.example .env
# Set OPENROUTER_API_KEY (required)
# Set ARK_API_KEY if you want Doubao Seed 2.0 Code (optional)

3. Run evaluation cycles

# Run 3 cycles (default from .env NUM_CYCLES or hardcoded default)
python main.py

# Run a specific number of cycles
python main.py --cycles 10

Progress is printed in real time as each phase completes:

--- Cycle 1/3 ---
  Generator  : openrouter/anthropic/claude-opus-4.6
  Solvers    : all 10 models
  Judges pool: 10 models (all, each skips own solution)
  [1/5] Generating problem...
        Problem : "Chrono-Gated Courier Network"
  [2/5] Solving (10 solvers in parallel)...
        [claude-opus-4.6] → 42 line(s) of code
        [gemini-3-pro-preview] → 38 line(s) of code
        ...
  [3/5] Running sandbox execution for all solutions...
        [claude-opus-4.6] 5/5 tests passed
        ...
  [4/5] Judging (90 calls: 10 solutions × ~9 judges each)...
        [grok-4.1-fast] judged [gemini-3-pro-preview] → 8.5/10
        ...
  [5/5] Updating Elo ratings...
  Overall avg : 7.42/10
  Best solver : claude-opus-4.6 (8.90/10)

4. View the dashboard

streamlit run dashboard/app.py

Opens a Streamlit web app with:

Leaderboard — Elo bar chart + ranked table
Elo History — per-cycle line chart with model filter
Battle Log — summary table of all cycles
Cycle Detail — per-solver code, execution results, judge radar chart, feedback

CLI Reference

Command	Description
`python main.py`	Run cycles (default or `NUM_CYCLES` from `.env`)
`python main.py --cycles N`	Run exactly N cycles
`python main.py --health-check`	Ping all models and exit
`python main.py --no-health-check`	Skip health check and run immediately
`python main.py --reset-last`	Remove the most recent cycle and revert Elo
`python main.py --reset-all`	Wipe all battles, leaderboard, and cycle artifacts

Health check behaviour

[OK ] — model responded within timeout
[429] — model is temporarily rate-limited upstream; auto-excluded from this run
[FAIL] — hard error (auth, not found, etc.); run aborts unless --no-health-check

Gateway Resilience

core/gateway.py handles production edge cases automatically:

Situation	Behaviour
Rate limit (429)	Waits `retry_after_seconds` from the error, then retries
Transient error	Exponential backoff: 1 s → 2 s → 4 s …
Empty response	Detected as failure, retried with backoff
Truncated response (`finish_reason=length`)	Retries with continuation prompt asking for complete JSON
Health check 429	Fast-fails immediately (no wait); model skipped for this run

Data Storage

`data/battles.jsonl` — compact append-only log

One JSON line per cycle. Fields:

{
  "cycle_num": 1,
  "timestamp": 1741696367,
  "generator": "openrouter/anthropic/claude-opus-4.6",
  "solvers": ["openrouter/anthropic/claude-opus-4.6", ...],   // all N models
  "judges_pool": ["openrouter/anthropic/claude-opus-4.6", ...],    // all N
  "problem_title": "Chrono-Gated Courier Network",
  "average_scores": {"openrouter/anthropic/claude-opus-4.6": 8.9, ...},
  "overall_average": 7.42,
  "elo_after": {"openrouter/anthropic/claude-opus-4.6": 1218.4, ...}
}

`data/cycles/<YYYYMMDD_HHMMSS>_c<N>/` — full per-cycle artifacts

File	Contents
`summary.json`	Roles, per-solver scores, overall average, Elo snapshot
`generator_problem.json`	Complete Generator output: title, description, test cases, criteria
`solver_<slug>.json`	Runnable Python code + explanation (one file per solver)
`execution_<slug>.json`	Sandbox result: per-test status, pass/fail count, stderr
`judgments_<slug>.json`	All judge scores + feedback for this solver (one file per solver)

Example layout after 1 cycle with 10 models:

data/cycles/20260311_143022_c1/
├── summary.json
├── generator_problem.json
├── solver_openrouter_anthropic_claude-opus-4_6.json
├── solver_openrouter_google_gemini-3-pro-preview.json
├── ... (×10 solvers)
├── execution_openrouter_anthropic_claude-opus-4_6.json
├── ... (×10)
├── judgments_openrouter_anthropic_claude-opus-4_6.json
└── ... (×10)

All files under data/ are git-ignored and never committed.

Configuration

Edit config/models.yaml to change the model pool. Any model with an api_key_env pointing to an unset environment variable is automatically disabled at startup.

models:
  - id: openrouter/anthropic/claude-opus-4.6
    provider: openrouter
    display_name: Claude Opus 4.6

  # Direct endpoint (non-OpenRouter)
  - id: openai/doubao-seed-2.0-code
    provider: ark
    display_name: Doubao Seed 2.0 Code
    api_base: https://ark.cn-beijing.volces.com/api/coding/v3
    api_key_env: ARK_API_KEY   # disabled if ARK_API_KEY is not set

initial_elo: 1200

Minimum pool size: 3 models (1 generator + 1 solver + 1 judge).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MatrixArena 🏟️

Architecture

Evaluation Cycle

Role matrix for N=10 models

Quick Start

1. Clone & install dependencies

2. Configure API keys

3. Run evaluation cycles

4. View the dashboard

CLI Reference

Health check behaviour

Gateway Resilience

Data Storage

`data/battles.jsonl` — compact append-only log

`data/cycles/<YYYYMMDD_HHMMSS>_c<N>/` — full per-cycle artifacts

Configuration

Roadmap

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
config		config
core		core
dashboard		dashboard
prompts		prompts
sandbox		sandbox
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

MatrixArena 🏟️

Architecture

Evaluation Cycle

Role matrix for N=10 models

Quick Start

1. Clone & install dependencies

2. Configure API keys

3. Run evaluation cycles

4. View the dashboard

CLI Reference

Health check behaviour

Gateway Resilience

Data Storage

data/battles.jsonl — compact append-only log

data/cycles/<YYYYMMDD_HHMMSS>_c<N>/ — full per-cycle artifacts

Configuration

Roadmap

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`data/battles.jsonl` — compact append-only log

`data/cycles/<YYYYMMDD_HHMMSS>_c<N>/` — full per-cycle artifacts

Packages