Paper2Run takes a GitHub repository (and optionally a paper) and tries to reproduce it inside a Docker container. The CLI walks the repo, asks a model to draft an environment plan, builds the image, runs a smoke command, and — if anything breaks — lets the model decide whether to write a new plan or stop with a failure report. The negative result is treated as a real outcome, not a failure of the tool.
The repo for a paper rarely runs end-to-end on the first try. The usual
culprits are dependency pinning, missing data, GPU assumptions, system
packages, and broken entry points. The
agent's job is to handle the boring layer of that: read the repository,
synthesise an environment, build it, observe what blows up, decide whether
a fix is possible, and either ship a working container with
INSTRUCTIONS.md or hand back an actionable report explaining why
reproduction is not feasible.
            ┌─────────────────┐
            │    paper2run    │
            │       run       │
            └────────┬────────┘
                     │
    ┌────────────────▼─────────────────┐
    │  clone_repo      (git, paper)    │
    │  extract_facts   (analyze.py)    │
    └────────────────┬─────────────────┘
                     │
    ┌────────────────▼─────────────────┐
    │  plan_environment   (LLM)        │  ← EnvPlan JSON
    │  render_dockerfile  (templates)  │
    └────────────────┬─────────────────┘
                     │
(skip if --dry-run)  │
                     │
    ┌────────────────▼─────────────────┐
    │  preflight     (docker info)     │
    │  build_image   (docker build)    │
    │  smoke_run     (docker run)      │
    └─────┬────────────────────┬───────┘
          │ smoke ok           │ build/smoke failed
          │                    ▼
          │           ┌─────────────────────┐
          │           │  diagnose_failure   │  ← LLM gets logs
          │           │                     │    and either
          │           │  decision="repair"  │    emits a full
          │           │   → new_plan        │    new EnvPlan
          │           │  decision="stop"    │    or says
          │           └────────┬────────────┘    "no fix"
          │                    │
          │      repair?       │
          │ ┌──────────────────┘
          │ │
          │ ▼
          │ apply_new_plan ──► (back to build, up to --max-attempts)
          │
          ▼
┌──────────────────────────┐
│  write_success           │
│   ├ Dockerfile           │
│   ├ INSTRUCTIONS.md      │
│   └ report.md            │
└──────────────────────────┘   or write_failure
The whole loop is a LangGraph StateGraph. Determinism lives in
analyze.py (facts) and templates.py (rendering). LLM use is confined
to two nodes:
- `plan_environment`: given the structured facts (and paper text if any), produce a single `EnvPlan` JSON.
- `diagnose_failure`: given the failing plan, Dockerfile, and log tails, output `decision = stop | repair`. On `repair`, return a complete replacement `EnvPlan`. The agent never tries to "patch" the existing plan with surgical edits; the LLM owns the plan in full.
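The control flow described above can be sketched in plain Python, without LangGraph. This is only an illustration under stated assumptions: `run_loop`, `Diagnosis`, and the two callables are hypothetical names, plans are plain dicts, and the choice to diagnose after every failed attempt (including the last one) is a guess, not confirmed behaviour of `agent.py`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    """Hypothetical stand-in for the model's verdict on a failed attempt."""
    decision: str                    # "stop" or "repair"
    reason: str
    new_plan: Optional[dict] = None  # full replacement plan on "repair"


def run_loop(
    plan: dict,
    build_and_smoke: Callable[[dict], bool],
    diagnose: Callable[[dict], Diagnosis],
    max_attempts: int = 3,
) -> tuple[str, dict]:
    """Build, smoke-test, and repair until success, stop, or exhaustion."""
    for _ in range(max_attempts):
        if build_and_smoke(plan):
            return "success", plan
        verdict = diagnose(plan)
        if verdict.decision == "stop" or verdict.new_plan is None:
            return "stopped-by-llm", plan
        # Repair replaces the plan wholesale; nothing is patched in place.
        plan = verdict.new_plan
    return "max-attempts-exhausted", plan
```

With `max_attempts=3` the build runs at most three times: the initial plan plus up to two model-written replacements.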
Python 3.11+ and Docker are required.
git clone https://github.com/HermanDp45/Paper2Run.git
cd Paper2Run
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Check the CLI:

paper2run run --help

Put NVIDIA credentials in `.env` (or export the same variables). The
endpoint is OpenAI-compatible.
NVIDIA_API_KEY=... # required for LLM use
PAPER2RUN_MODEL=deepseek-ai/deepseek-v4-flash # optional override
PAPER2RUN_BASE_URL=https://integrate.api.nvidia.com/v1

Without `NVIDIA_API_KEY` the agent falls back to a heuristic plan
(plain Python, deps from manifests, smoke command from README/manifests).
The fallback is intentionally weak — it exists so the tool still produces
something, not because it's a real replacement for the LLM path.
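A minimal sketch of how the three variables might be resolved, and how the presence of a key could decide between the LLM path and the heuristic fallback. `LLMSettings` and `load_llm_settings` are hypothetical names; treating the values shown above as built-in defaults is an assumption; and the sketch reads an already-populated environment mapping rather than parsing a `.env` file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults, copied from the example values above.
DEFAULT_MODEL = "deepseek-ai/deepseek-v4-flash"
DEFAULT_BASE_URL = "https://integrate.api.nvidia.com/v1"


@dataclass(frozen=True)
class LLMSettings:
    api_key: Optional[str]
    model: str
    base_url: str

    @property
    def has_llm(self) -> bool:
        # No key means the heuristic fallback is used instead of the model.
        return bool(self.api_key)


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Resolve LLM settings from an environment-like mapping."""
    api_key = env.get("NVIDIA_API_KEY", "").strip() or None
    return LLMSettings(
        api_key=api_key,
        model=env.get("PAPER2RUN_MODEL") or DEFAULT_MODEL,
        base_url=env.get("PAPER2RUN_BASE_URL") or DEFAULT_BASE_URL,
    )
```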
paper2run run \
--repo https://github.com/karpathy/micrograd \
--out runs \
--timeout 600 \
--max-attempts 3

Flags:
| Flag | Default | Meaning |
|---|---|---|
| `--repo` | required | GitHub URL (`https://github.com/...` or `git@github.com:...`) |
| `--paper` | none | Local `.pdf` / `.md` / `.markdown` paper |
| `--out` | `runs/` | Directory to put per-run artifact folders into |
| `--timeout` | `900` | Smoke-run timeout per attempt, seconds |
| `--max-attempts` | `3` | Total tries (initial + LLM-driven repairs) |
| `--model` | `.env` default | Override LLM model name |
| `--base-url` | `.env` default | Override LLM API base URL |
| `--dry-run` | off | Skip docker build/run; only produce Dockerfile + INSTRUCTIONS |
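For reference, the flag table maps onto `argparse` roughly as follows. This is a sketch that mirrors the documented flags and defaults, not the contents of `cli.py`; `build_parser` is a hypothetical name, and using `None` to mean "take the value from `.env`" is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented `paper2run run` flags."""
    parser = argparse.ArgumentParser(prog="paper2run")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="try to reproduce a repository in Docker")
    run.add_argument("--repo", required=True, help="GitHub URL")
    run.add_argument("--paper", default=None, help="local .pdf/.md/.markdown paper")
    run.add_argument("--out", default="runs", help="directory for per-run folders")
    run.add_argument("--timeout", type=int, default=900,
                     help="smoke-run timeout per attempt, seconds")
    run.add_argument("--max-attempts", type=int, default=3,
                     help="total tries (initial + repairs)")
    # None here stands for "fall back to the .env value" (assumption).
    run.add_argument("--model", default=None, help="override LLM model name")
    run.add_argument("--base-url", default=None, help="override LLM API base URL")
    run.add_argument("--dry-run", action="store_true",
                     help="skip docker build/run")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```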
Each run gets a timestamped folder:
runs/<repo>-<UTC>/
  repo/                 # the git clone
  paper.json            # parsed paper (if provided / built from repo md)
  repo.json             # discovered manifests, readmes, entrypoints
  repo_facts.json       # deterministic facts (analyze.py)
  plan.json             # latest EnvPlan
  Dockerfile            # latest rendered Dockerfile
  INSTRUCTIONS.md       # how to docker build / docker run
  image.txt             # tag of the built image (success only)
  report.md             # human-readable summary (success or failure)
  metadata.json         # machine-readable summary
  run.log               # chronological event log
  logs/                 # preflight stdout/stderr, LLM prompt+result dumps
  attempts/
    01/
      Dockerfile
      build_stdout.log
      build_stderr.log
      smoke_stdout.log
      smoke_stderr.log
      diagnosis.json    # LLM decision after this attempt
    02/...
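The folder naming above could be produced along these lines. The helper names are hypothetical, and the exact timestamp format is an assumption (the layout only says `<UTC>`); the two-digit attempt folders follow the `01/`, `02/` pattern shown.

```python
from datetime import datetime, timezone
from pathlib import Path


def run_dir(out: Path, repo_url: str, now: datetime) -> Path:
    """Return <out>/<repo>-<UTC> for an HTTPS or SSH GitHub URL."""
    name = repo_url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]
    # Timestamp format is a guess; any sortable UTC stamp would do.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return out / f"{name}-{stamp}"


def attempt_dir(run: Path, attempt: int) -> Path:
    """Return the zero-padded per-attempt folder, e.g. attempts/01."""
    return run / "attempts" / f"{attempt:02d}"
```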
Success means: image is built and smoke_command either exited 0 or
stayed alive until the timeout. The user can rebuild the image any time
via docker build -t <tag> . from the run folder (instructions are in
INSTRUCTIONS.md).
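The success rule can be expressed as a small predicate. This sketch runs a local command rather than `docker run`, and `smoke_ok` is a hypothetical helper; it only illustrates the "exit 0 or still alive at the timeout" criterion.

```python
import subprocess
import sys


def smoke_ok(command: list[str], timeout: float) -> bool:
    """True if the command exits 0 or is still running at the deadline."""
    try:
        completed = subprocess.run(command, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        # Long-running entry points (servers, training loops) count as alive.
        return True
    return completed.returncode == 0


if __name__ == "__main__":
    print(smoke_ok([sys.executable, "-c", "pass"], timeout=30))
```

Treating a timeout as success is what lets servers and long training scripts pass the smoke test; the cost is that a process which hangs without doing useful work also passes.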
Failure means one of:
- `preflight-failed`: Docker is not reachable.
- `stopped-by-llm`: the model judged the failure unfixable and explained why.
- `max-attempts-exhausted`: the model kept trying but ran out of attempts.
- `smoke-failed`: fall-through for cases where no diagnosis was produced.
- `exception`: uncaught error; full traceback in `logs/exception.txt`.
src/paper2run/
  cli.py          argparse → AppConfig → agent.run_agent
  config.py       AppConfig: env loading, validation, has_llm
  models.py       TextArtifact, PaperInfo, RepoInfo (the kept primitives)
  repository.py   git clone + repo walk (readmes, manifests, entrypoints)
  paper.py        PDF (pypdf) and markdown ingestion; repo-md fallback
  analyze.py      RepoFacts extractor (languages, managers, deps, smoke
                  candidates, GPU markers, existing Dockerfile/Makefile)
  schemas.py      Pydantic: EnvPlan + DiagnosisDecision
  templates.py    render_dockerfile, render_instructions
  llm.py          ChatOpenAI client, prompts, JSON parsing, heuristic
                  fallbacks for both planning and diagnosis
  docker.py       check_docker_access, build_image, run_container,
                  read_log_tail (thin subprocess wrappers)
  agent.py        LangGraph StateGraph and node implementations
  report.py       Markdown report generator
  io.py           write_json, append_run_log
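To make the deterministic rendering step concrete, here is a rough sketch of turning a plan into Dockerfile text. It is not the code in `templates.py`: `PlanSketch` is a hypothetical stand-in, the instruction order is a guess, and the `apt-get` line assumes a Debian-based base image.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PlanSketch:
    """Hypothetical subset of the plan fields needed for rendering."""
    base_image: str
    smoke_command: str
    system_packages: list[str] = field(default_factory=list)
    workdir: str = "/repo"
    install_steps: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)


def render_dockerfile(plan: PlanSketch) -> str:
    """Render the plan as Dockerfile text; same plan in, same text out."""
    lines = [f"FROM {plan.base_image}"]
    if plan.system_packages:
        pkgs = " ".join(sorted(plan.system_packages))
        # Assumes a Debian/Ubuntu base image.
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{pkgs} && rm -rf /var/lib/apt/lists/*"
        )
    for key, value in sorted(plan.env.items()):
        lines.append(f"ENV {key}={json.dumps(value, ensure_ascii=False)}")
    lines.append(f"WORKDIR {plan.workdir}")
    lines.append("COPY . .")
    lines.extend(f"RUN {step}" for step in plan.install_steps)
    # Exec-form CMD; json.dumps yields a valid JSON string for the command.
    lines.append(f'CMD ["sh", "-c", {json.dumps(plan.smoke_command)}]')
    return "\n".join(lines) + "\n"
```

Sorting packages and environment keys keeps the output stable across runs, which is the property that matters for a rendering step meant to be deterministic.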
The model only sees structured input and is required to return one of two
Pydantic shapes. See src/paper2run/schemas.py.
`EnvPlan`:
  base_image: str
  system_packages: list[str]
  workdir: str = "/repo"
  install_steps: list[str]
  env: dict[str, str]
  smoke_command: str
  requires_gpu: bool
  notes: str

`DiagnosisDecision`:
  decision: Literal["stop", "repair"]
  reason: str
  confidence: Literal["low", "medium", "high"]
  new_plan: EnvPlan | None   # required when decision="repair"

Repair always replaces the plan in full. There are no typed micro-patches ("install pkg X"); the model decides what the next Dockerfile looks like.
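The "repair requires a full `new_plan`" rule can be illustrated with standard-library dataclasses. The project itself uses Pydantic; this stand-in only shows the validation logic. `parse_diagnosis` is a hypothetical helper, and every field default other than `workdir` is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EnvPlan:
    base_image: str
    smoke_command: str
    system_packages: list[str] = field(default_factory=list)
    workdir: str = "/repo"
    install_steps: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)
    requires_gpu: bool = False
    notes: str = ""


@dataclass
class DiagnosisDecision:
    decision: str
    reason: str
    confidence: str
    new_plan: Optional[EnvPlan] = None

    def __post_init__(self) -> None:
        if self.decision not in ("stop", "repair"):
            raise ValueError(f"unknown decision: {self.decision!r}")
        if self.confidence not in ("low", "medium", "high"):
            raise ValueError(f"unknown confidence: {self.confidence!r}")
        if self.decision == "repair" and self.new_plan is None:
            raise ValueError("decision='repair' requires a complete new_plan")


def parse_diagnosis(raw: str) -> DiagnosisDecision:
    """Parse the model's JSON reply into a validated decision."""
    data = json.loads(raw)
    plan = data.get("new_plan")
    return DiagnosisDecision(
        decision=data["decision"],
        reason=data["reason"],
        confidence=data["confidence"],
        new_plan=EnvPlan(**plan) if plan else None,
    )
```

Rejecting a `repair` without a plan at parse time means the rest of the loop never has to handle a half-formed repair.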
PYTHONPATH=src python -m unittest discover -s tests

Covered:

- `tests/test_config.py`: CLI/env config parsing
- `tests/test_io.py`: log writer
- `tests/test_ingestion.py`: repo inspection, paper parsing
- `tests/test_analyze.py`: language/manager/deps/smoke detection
- `tests/test_templates.py`: Dockerfile + INSTRUCTIONS rendering
- `tests/test_agent_routing.py`: agent state graph with Docker and LLM mocked (success first try, repair-then-success, llm-stop, max-attempts)
Repository must be a GitHub URL — use https://github.com/... or
git@github.com:.... Other forges are not supported yet.
docker info fails / preflight-failed — start Docker Desktop or the
daemon. On Linux, add yourself to the docker group and re-login.
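To check Docker reachability by hand, the preflight amounts to running `docker info` and inspecting the result. The sketch below is one way to do that; the function name matches the one listed for `docker.py`, but the signature, return shape, and messages here are assumptions.

```python
import subprocess


def check_docker_access(docker_bin: str = "docker", timeout: float = 20.0) -> tuple[bool, str]:
    """Run `docker info` and report whether the daemon is reachable."""
    try:
        done = subprocess.run(
            [docker_bin, "info"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return False, f"{docker_bin!r} not found on PATH"
    except subprocess.TimeoutExpired:
        return False, "docker info timed out"
    if done.returncode != 0:
        # Typical causes: daemon not running, or no permission on the socket.
        return False, done.stderr.strip() or "docker info failed"
    return True, "ok"


if __name__ == "__main__":
    print(check_docker_access())
```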
Temporary failure in name resolution during pip install — Docker
Desktop's DNS is broken (often a VPN/firewall side-effect). Test with
docker run --rm alpine nslookup pypi.org. Toggle the VPN, restart
Docker Desktop, or change DNS in Settings → Resources → Network.
Run ends with stopped-by-llm immediately: the model judged the
failure unfixable. Open attempts/01/diagnosis.json for the model's
reason. This is often the right call; otherwise rerun with a higher
--max-attempts or improve the repository's README.
No LLM available — the agent will fall back to a heuristic plan and
to a one-shot heuristic diagnosis that can only say "stop". Set
NVIDIA_API_KEY to get the real loop.
MIT (see LICENSE).