Paper2Run

Paper2Run takes a GitHub repository (and optionally a paper) and tries to reproduce it inside a Docker container. The CLI walks the repo, asks a model to draft an environment plan, builds the image, and runs a smoke command. If anything breaks, the model decides whether to write a new plan or stop with a failure report. A negative result is treated as a real outcome, not a failure of the tool.

Why

The repo for a paper rarely runs end-to-end on the first try. The usual culprits are version pinning, missing data, GPU assumptions, system packages, and broken entry points. The agent's job is to handle the boring layer of that: read the repository, synthesise an environment, build it, observe what blows up, and decide whether a fix is possible. It then either ships a working container with INSTRUCTIONS.md or hands back an actionable report explaining why reproduction is not feasible.

Architecture

                          ┌─────────────────┐
                          │   paper2run     │
                          │     run         │
                          └────────┬────────┘
                                   │
                  ┌────────────────▼─────────────────┐
                  │   clone_repo      (git, paper)   │
                  │   extract_facts   (analyze.py)   │
                  └────────────────┬─────────────────┘
                                   │
                  ┌────────────────▼─────────────────┐
                  │   plan_environment   (LLM)       │ ← EnvPlan JSON
                  │   render_dockerfile  (templates) │
                  └────────────────┬─────────────────┘
                                   │
              (skip if --dry-run)  │
                                   │
                  ┌────────────────▼─────────────────┐
                  │   preflight   (docker info)      │
                  │   build_image (docker build)     │
                  │   smoke_run   (docker run)       │
                  └─────┬────────────────────┬───────┘
                        │ smoke ok           │ build/smoke failed
                        │                    ▼
                        │           ┌─────────────────────┐
                        │           │  diagnose_failure   │  ← LLM gets logs
                        │           │                     │     and either
                        │           │  decision="repair"  │     emits a full
                        │           │  → new_plan         │     new EnvPlan
                        │           │  decision="stop"    │     or says
                        │           └────────┬────────────┘     "no fix"
                        │                    │
                        │      repair?       │
                        │ ┌──────────────────┘
                        │ │
                        │ ▼
                        │ apply_new_plan ──► (back to build, up to --max-attempts)
                        │
                        ▼
              ┌──────────────────────────┐
              │  write_success           │
              │  ├ Dockerfile            │
              │  ├ INSTRUCTIONS.md       │
              │  └ report.md             │
              └──────────────────────────┘                or write_failure

The whole loop is a LangGraph StateGraph. Determinism lives in analyze.py (facts) and templates.py (rendering). LLM use is confined to two nodes:

plan_environment — given the structured facts (and paper text if any), produce a single EnvPlan JSON.
diagnose_failure — given the failing plan, Dockerfile, and log tails, output decision = stop | repair. On repair, return a complete replacement EnvPlan. The agent never tries to "patch" the existing plan with surgical edits — the LLM owns the plan in full.
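The control flow described above can be sketched without LangGraph as a plain loop. This is a minimal illustration, not the code in agent.py: `run_loop`, `Diagnosis`, and the two callables are hypothetical stand-ins for the graph nodes, and the status strings are borrowed from the Outputs section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    """Hypothetical stand-in for the model's decision after a failed attempt."""
    decision: str                    # "stop" or "repair"
    reason: str
    new_plan: Optional[dict] = None  # full replacement plan when repairing


def run_loop(
    plan: dict,
    build_and_smoke: Callable[[dict], bool],
    diagnose: Callable[[dict], Diagnosis],
    max_attempts: int = 3,
) -> tuple[str, dict, int]:
    """Return (status, last_plan, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        if build_and_smoke(plan):
            return "success", plan, attempt
        diagnosis = diagnose(plan)
        # A "repair" without a plan is treated like "stop" in this sketch.
        if diagnosis.decision == "stop" or diagnosis.new_plan is None:
            return "stopped-by-llm", plan, attempt
        # Repair swaps in the whole plan; nothing is patched in place.
        plan = diagnosis.new_plan
    return "max-attempts-exhausted", plan, max_attempts
```

Keeping the model calls behind two callables mirrors the split described above: everything else in the loop stays deterministic and can be tested with fakes.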

Install

Python 3.11+ and Docker are required.

git clone https://github.com/HermanDp45/Paper2Run.git
cd Paper2Run
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Check the CLI:

paper2run run --help

Environment

Put NVIDIA credentials in .env (or export the same variables). The endpoint is OpenAI-compatible.

NVIDIA_API_KEY=...                                  # required for LLM use
PAPER2RUN_MODEL=deepseek-ai/deepseek-v4-flash       # optional override
PAPER2RUN_BASE_URL=https://integrate.api.nvidia.com/v1

Without NVIDIA_API_KEY the agent falls back to a heuristic plan (plain Python, deps from manifests, smoke command from README/manifests). The fallback is intentionally weak — it exists so the tool still produces something, not because it's a real replacement for the LLM path.
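As an illustration of how these variables could drive the LLM-or-fallback choice, here is a small sketch. It is not the project's config.py: `LlmSettings` and `load_llm_settings` are hypothetical names, the two defaults are simply copied from the example above, and the sketch reads an environment mapping only (it does not parse a .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults, copied from the example values above.
DEFAULT_MODEL = "deepseek-ai/deepseek-v4-flash"
DEFAULT_BASE_URL = "https://integrate.api.nvidia.com/v1"


@dataclass(frozen=True)
class LlmSettings:  # hypothetical name, not the project's AppConfig
    api_key: Optional[str]
    model: str
    base_url: str

    @property
    def has_llm(self) -> bool:
        # Without a key the caller should use the heuristic plan instead.
        return bool(self.api_key)


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LlmSettings:
    """Build settings from an environment-like mapping."""
    api_key = env.get("NVIDIA_API_KEY", "").strip() or None
    return LlmSettings(
        api_key=api_key,
        model=env.get("PAPER2RUN_MODEL") or DEFAULT_MODEL,
        base_url=env.get("PAPER2RUN_BASE_URL") or DEFAULT_BASE_URL,
    )
```

For example, `load_llm_settings({}).has_llm` is `False`, which is the case where the heuristic fallback would apply.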

Run

paper2run run \
  --repo https://github.com/karpathy/micrograd \
  --out runs \
  --timeout 600 \
  --max-attempts 3

Flags:

| Flag | Default | Meaning |
| --- | --- | --- |
| `--repo` | required | GitHub URL (`https://github.com/...` or `git@github.com:...`) |
| `--paper` | none | Local `.pdf` / `.md` / `.markdown` paper |
| `--out` | `runs/` | Directory to put per-run artifact folders into |
| `--timeout` | `900` | Smoke-run timeout per attempt, seconds |
| `--max-attempts` | `3` | Total tries (initial + LLM-driven repairs) |
| `--model` | `.env` default | Override LLM model name |
| `--base-url` | `.env` default | Override LLM API base URL |
| `--dry-run` | off | Skip docker build/run; only produce Dockerfile + INSTRUCTIONS |
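The flags above map naturally onto an argparse subcommand. The sketch below mirrors the table and is not the project's cli.py: `build_parser` is a hypothetical name, and using `None` to mean "take the `.env` default" for `--model` and `--base-url` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flag table above."""
    parser = argparse.ArgumentParser(prog="paper2run")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="try to reproduce a repository in Docker")
    run.add_argument("--repo", required=True, help="GitHub URL")
    run.add_argument("--paper", default=None, help="local .pdf/.md/.markdown paper")
    run.add_argument("--out", default="runs/", help="directory for per-run folders")
    run.add_argument("--timeout", type=int, default=900,
                     help="smoke-run timeout per attempt, seconds")
    run.add_argument("--max-attempts", type=int, default=3,
                     help="total tries (initial + repairs)")
    # None here means "fall back to the .env default" (an assumption).
    run.add_argument("--model", default=None, help="override LLM model name")
    run.add_argument("--base-url", default=None, help="override LLM API base URL")
    run.add_argument("--dry-run", action="store_true",
                     help="skip docker build/run")
    return parser
```

Parsing `["run", "--repo", "https://github.com/karpathy/micrograd", "--timeout", "600"]` with this parser yields `timeout=600` and leaves `max_attempts` at its default of 3.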

Outputs

Each run gets a timestamped folder:

runs/<repo>-<UTC>/
  repo/                  # the git clone
  paper.json             # parsed paper (if provided / built from repo md)
  repo.json              # discovered manifests, readmes, entrypoints
  repo_facts.json        # deterministic facts (analyze.py)
  plan.json              # latest EnvPlan
  Dockerfile             # latest rendered Dockerfile
  INSTRUCTIONS.md        # how to docker build / docker run
  image.txt              # tag of the built image (success only)
  report.md              # human-readable summary (success or failure)
  metadata.json          # machine-readable summary
  run.log                # chronological event log
  logs/                  # preflight stdout/stderr, LLM prompt+result dumps
  attempts/
    01/
      Dockerfile
      build_stdout.log
      build_stderr.log
      smoke_stdout.log
      smoke_stderr.log
      diagnosis.json     # LLM decision after this attempt
    02/...

Success means the image built and smoke_command either exited 0 or stayed alive until the timeout. You can rebuild the image at any time by running docker build -t <tag> . from the run folder (the steps are in INSTRUCTIONS.md).
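The "exit 0 or still alive at the timeout" rule can be expressed in a few lines with the standard library. This is a sketch of the rule only, run against a local process; `smoke_ok` is a hypothetical helper and not the project's run_container.

```python
import subprocess
import sys


def smoke_ok(argv: list[str], timeout: float) -> bool:
    """True if the command exits 0 or is still running when the timeout hits."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Still alive at the deadline (a server, a long training loop):
        # this counts as success. subprocess.run kills the child first.
        return True
    return result.returncode == 0


if __name__ == "__main__":
    # A command that sleeps past the timeout is treated as a pass.
    print(smoke_ok([sys.executable, "-c", "import time; time.sleep(5)"], timeout=1))
```

One caveat if this were pointed at `docker run`: killing the client process does not necessarily stop the container, so a real implementation would also need to stop or remove it.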

Failure means one of:

preflight-failed — Docker is not reachable.
stopped-by-llm — the model judged the failure unfixable and explained why.
max-attempts-exhausted — the model kept trying but ran out of attempts.
smoke-failed — fall-through for cases where no diagnosis was produced.
exception — uncaught error; full traceback in logs/exception.txt.
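When a run fails, the per-attempt diagnosis files are usually the quickest way to see what happened. The helper below walks the attempts/ folder from the layout above. It is a sketch: `summarise_attempts` is a hypothetical name, and it assumes each diagnosis.json carries the `decision` and `reason` keys of DiagnosisDecision.

```python
import json
from pathlib import Path


def summarise_attempts(run_dir: Path) -> list[dict]:
    """Collect the decision and reason recorded for each attempt folder."""
    rows = []
    for attempt_dir in sorted((run_dir / "attempts").iterdir()):
        if not attempt_dir.is_dir():
            continue
        row = {"attempt": attempt_dir.name, "decision": None, "reason": None}
        diagnosis_file = attempt_dir / "diagnosis.json"
        # An attempt with no diagnosis.json keeps decision=None.
        if diagnosis_file.exists():
            data = json.loads(diagnosis_file.read_text(encoding="utf-8"))
            row["decision"] = data.get("decision")
            row["reason"] = data.get("reason")
        rows.append(row)
    return rows
```

Calling `summarise_attempts(Path("runs/<repo>-<UTC>"))` on a real run folder returns one row per attempt, in order.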

Modules

src/paper2run/
  cli.py          argparse → AppConfig → agent.run_agent
  config.py       AppConfig: env loading, validation, has_llm
  models.py       TextArtifact, PaperInfo, RepoInfo (the kept primitives)
  repository.py   git clone + repo walk (readmes, manifests, entrypoints)
  paper.py        PDF (pypdf) and markdown ingestion; repo-md fallback
  analyze.py      RepoFacts extractor (languages, managers, deps, smoke
                  candidates, GPU markers, existing Dockerfile/Makefile)
  schemas.py      Pydantic: EnvPlan + DiagnosisDecision
  templates.py    render_dockerfile, render_instructions
  llm.py          ChatOpenAI client, prompts, JSON parsing, heuristic
                  fallbacks for both planning and diagnosis
  docker.py       check_docker_access, build_image, run_container,
                  read_log_tail (thin subprocess wrappers)
  agent.py        LangGraph StateGraph and node implementations
  report.py       Markdown report generator
  io.py           write_json, append_run_log

LLM contract

The model only sees structured input and is required to return one of two Pydantic shapes. See src/paper2run/schemas.py.

EnvPlan:

base_image: str
system_packages: list[str]
workdir: str = "/repo"
install_steps: list[str]
env: dict[str, str]
smoke_command: str
requires_gpu: bool
notes: str
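
To make the fields concrete, here is one way a plan like this could be turned into a Dockerfile. This is a sketch, not the project's templates.py: the name `render_dockerfile_sketch`, the use of apt-get for system_packages (which assumes a Debian-based base_image), the `COPY repo/` line, and running smoke_command as the default CMD are all assumptions. requires_gpu and notes are ignored here.

```python
import json


def render_dockerfile_sketch(plan: dict) -> str:
    """Render a Dockerfile from an EnvPlan-shaped dict (illustrative only)."""
    lines = [f"FROM {plan['base_image']}"]
    if plan.get("system_packages"):
        pkgs = " ".join(plan["system_packages"])
        # Assumes a Debian/Ubuntu base image.
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{pkgs} && rm -rf /var/lib/apt/lists/*"
        )
    workdir = plan.get("workdir", "/repo")
    lines.append(f"WORKDIR {workdir}")
    # Assumes the clone sits in repo/ next to the Dockerfile.
    lines.append(f"COPY repo/ {workdir}/")
    for key, value in plan.get("env", {}).items():
        lines.append(f"ENV {key}={json.dumps(value)}")
    for step in plan.get("install_steps", []):
        lines.append(f"RUN {step}")
    # Exec-form CMD is a JSON array, so json.dumps handles the quoting.
    lines.append("CMD " + json.dumps(["/bin/sh", "-c", plan["smoke_command"]]))
    return "\n".join(lines) + "\n"
```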

DiagnosisDecision:

decision: Literal["stop", "repair"]
reason: str
confidence: Literal["low", "medium", "high"]
new_plan: EnvPlan | None    # required when decision="repair"

Repair always replaces the plan in full. There are no typed micro-patches ("install pkg X"); the model decides what the next Dockerfile looks like.
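
The rule that a repair must carry a complete plan can be checked at parse time. The project does this with Pydantic (see schemas.py); the sketch below is a dependency-free stand-in using dataclasses, and `parse_diagnosis` is a hypothetical helper. The only change to the field lists above is that workdir is moved last, because dataclass fields with defaults must come after those without.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvPlan:
    base_image: str
    system_packages: list
    install_steps: list
    env: dict
    smoke_command: str
    requires_gpu: bool
    notes: str
    workdir: str = "/repo"  # moved last: defaults must trail in a dataclass


@dataclass
class DiagnosisDecision:
    decision: str
    reason: str
    confidence: str
    new_plan: Optional[EnvPlan] = None

    def __post_init__(self) -> None:
        if self.decision not in ("stop", "repair"):
            raise ValueError(f"unknown decision: {self.decision!r}")
        if self.confidence not in ("low", "medium", "high"):
            raise ValueError(f"unknown confidence: {self.confidence!r}")
        if self.decision == "repair" and self.new_plan is None:
            raise ValueError("decision='repair' requires a complete new_plan")


def parse_diagnosis(raw: str) -> DiagnosisDecision:
    """Parse model output; a partial plan fails with TypeError from EnvPlan."""
    data = json.loads(raw)
    plan = data.get("new_plan")
    return DiagnosisDecision(
        decision=data["decision"],
        reason=data["reason"],
        confidence=data["confidence"],
        new_plan=EnvPlan(**plan) if plan is not None else None,
    )
```

Because EnvPlan has no optional fields other than workdir, a model response that tries to send only the changed fields is rejected rather than merged.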

Tests

PYTHONPATH=src python -m unittest discover -s tests

Covered:

tests/test_config.py — CLI/env config parsing
tests/test_io.py — log writer
tests/test_ingestion.py — repo inspection, paper parsing
tests/test_analyze.py — language/manager/deps/smoke detection
tests/test_templates.py — Dockerfile + INSTRUCTIONS rendering
tests/test_agent_routing.py — agent state graph with Docker and LLM mocked (success first try, repair-then-success, llm-stop, max-attempts)

Troubleshooting

Repository must be a GitHub URL — use https://github.com/... or git@github.com:.... Other forges are not supported yet.
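
For reference, a check that accepts exactly the two URL forms named above could look like the following. It is a sketch under assumptions: `parse_github_url` and the regular expression are illustrative, and the tool's own validation may be stricter or looser.

```python
import re

# Accepts https://github.com/<owner>/<name> and git@github.com:<owner>/<name>,
# with an optional ".git" suffix or trailing slash.
_GITHUB_URL = re.compile(
    r"^(?:https://github\.com/|git@github\.com:)"
    r"(?P<owner>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_.-]+?)(?:\.git)?/?$"
)


def parse_github_url(url: str) -> tuple[str, str]:
    """Return (owner, name) or raise ValueError for anything else."""
    match = _GITHUB_URL.match(url.strip())
    if match is None:
        raise ValueError("Repository must be a GitHub URL")
    return match["owner"], match["name"]
```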

docker info fails / preflight-failed — start Docker Desktop or the daemon. On Linux, add yourself to the docker group and re-login.
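
To reproduce the preflight check by hand from Python, something like the following works. It is a sketch, not the project's check_docker_access: `docker_reachable` is a hypothetical name and the 20-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def docker_reachable(timeout: float = 20.0) -> tuple[bool, str]:
    """Run `docker info` and report whether the daemon answered."""
    if shutil.which("docker") is None:
        return False, "docker executable not found on PATH"
    try:
        result = subprocess.run(
            ["docker", "info"], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, "docker info timed out"
    except OSError as exc:
        return False, f"could not run docker: {exc}"
    if result.returncode != 0:
        # stderr usually says whether the daemon is down or access is denied.
        return False, result.stderr.strip() or "docker info failed"
    return True, "ok"


if __name__ == "__main__":
    print(docker_reachable())
```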

Temporary failure in name resolution during pip install — Docker Desktop's DNS is broken (often a VPN/firewall side-effect). Test with docker run --rm alpine nslookup pypi.org. Toggle the VPN, restart Docker Desktop, or change DNS in Settings → Resources → Network.

Run ends with stopped-by-llm immediately — the model judged the failure unfixable. Open attempts/01/diagnosis.json for the model's reason. Stopping is often the right call; if you disagree, rerun with a higher --max-attempts or improve the repository's README.

No LLM available — the agent will fall back to a heuristic plan and to a one-shot heuristic diagnosis that can only say "stop". Set NVIDIA_API_KEY to get the real loop.

License

MIT (see LICENSE).
