Paper2Run

Paper2Run takes a GitHub repository (and optionally a paper) and tries to reproduce it inside a Docker container. The CLI walks the repo, asks a model to draft an environment plan, builds the image, runs a smoke command, and — if anything breaks — lets the model decide whether to write a new plan or stop with a failure report. The negative result is treated as a real outcome, not a failure of the tool.

Why

The repo for a paper rarely runs end-to-end on the first try. The usual obstacles are dependency pinning, missing data, GPU assumptions, system packages, and broken entry points. The agent's job is to handle the boring layer of that: read the repository, synthesise an environment, build it, observe what blows up, decide whether a fix is possible, and either ship a working container with INSTRUCTIONS.md or hand back an actionable report explaining why reproduction is not feasible.

Architecture

                          ┌─────────────────┐
                          │   paper2run     │
                          │     run         │
                          └────────┬────────┘
                                   │
                  ┌────────────────▼─────────────────┐
                  │   clone_repo      (git, paper)   │
                  │   extract_facts   (analyze.py)   │
                  └────────────────┬─────────────────┘
                                   │
                  ┌────────────────▼─────────────────┐
                  │   plan_environment   (LLM)       │ ← EnvPlan JSON
                  │   render_dockerfile  (templates) │
                  └────────────────┬─────────────────┘
                                   │
              (skip if --dry-run)  │
                                   │
                  ┌────────────────▼─────────────────┐
                  │   preflight   (docker info)      │
                  │   build_image (docker build)     │
                  │   smoke_run   (docker run)       │
                  └─────┬────────────────────┬───────┘
                        │ smoke ok           │ build/smoke failed
                        │                    ▼
                        │           ┌─────────────────────┐
                        │           │  diagnose_failure   │  ← LLM gets logs
                        │           │                     │     and either
                        │           │  decision="repair"  │     emits a full
                        │           │  → new_plan         │     new EnvPlan
                        │           │  decision="stop"    │     or says
                        │           └────────┬────────────┘     "no fix"
                        │                    │
                        │      repair?       │
                        │ ┌──────────────────┘
                        │ │
                        │ ▼
                        │ apply_new_plan ──► (back to build, up to --max-attempts)
                        │
                        ▼
              ┌──────────────────────────┐
              │  write_success           │
              │  ├ Dockerfile            │
              │  ├ INSTRUCTIONS.md       │
              │  └ report.md             │
              └──────────────────────────┘                or write_failure

The whole loop is a LangGraph StateGraph. Determinism lives in analyze.py (facts) and templates.py (rendering). LLM use is confined to two nodes:

  • plan_environment — given the structured facts (and paper text if any), produce a single EnvPlan JSON.
  • diagnose_failure — given the failing plan, Dockerfile, and log tails, output decision = stop | repair. On repair, return a complete replacement EnvPlan. The agent never tries to "patch" the existing plan with surgical edits — the LLM owns the plan in full.
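
The control flow in the diagram can be sketched in plain Python, without LangGraph. This is a minimal sketch and not the project's agent.py: the names run_loop, build_and_smoke and diagnose, and the dict-shaped plan and diagnosis, are assumptions made for illustration. Only the stop/repair decision, the full plan replacement, and the attempt cap come from the description above.

```python
from typing import Callable

Plan = dict  # hypothetical stand-in for the EnvPlan object


def run_loop(
    plan: Plan,
    build_and_smoke: Callable[[Plan], tuple[bool, str]],
    diagnose: Callable[[Plan, str], dict],
    max_attempts: int = 3,
) -> tuple[str, Plan, int]:
    """Return (status, last plan that was built, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        ok, logs = build_and_smoke(plan)
        if ok:
            return "success", plan, attempt
        diagnosis = diagnose(plan, logs)
        if diagnosis["decision"] == "stop":
            return "stopped-by-llm", plan, attempt
        if attempt < max_attempts:
            # Repair replaces the whole plan; nothing is patched in place.
            plan = diagnosis["new_plan"]
    return "max-attempts-exhausted", plan, max_attempts


if __name__ == "__main__":
    # Fake nodes: the build only succeeds once the plan carries "fixed".
    fake_build = lambda p: (p.get("fixed", False), "pip: no matching version")
    fake_diag = lambda p, logs: {"decision": "repair", "new_plan": {"fixed": True}}
    print(run_loop({}, fake_build, fake_diag))
```

Passing the build and diagnosis steps in as callables mirrors how the routing can be exercised with Docker and the LLM mocked.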

Install

Python 3.11+ and Docker are required.

git clone https://github.com/HermanDp45/Paper2Run.git
cd Paper2Run
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Check the CLI:

paper2run run --help

Environment

Put NVIDIA credentials in .env (or export the same variables). The endpoint is OpenAI-compatible.

NVIDIA_API_KEY=...                                  # required for LLM use
PAPER2RUN_MODEL=deepseek-ai/deepseek-v4-flash       # optional override
PAPER2RUN_BASE_URL=https://integrate.api.nvidia.com/v1

Without NVIDIA_API_KEY the agent falls back to a heuristic plan (plain Python, deps from manifests, smoke command from README/manifests). The fallback is intentionally weak — it exists so the tool still produces something, not because it's a real replacement for the LLM path.
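
As an illustration of how these variables could decide between the LLM path and the heuristic fallback, here is a small sketch. It is not the project's config.py: LLMSettings and load_llm_settings are hypothetical names, and treating the model and base URL shown above as defaults is an assumption. Only the variable names and the has_llm idea come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the sample values above double as defaults.
DEFAULT_MODEL = "deepseek-ai/deepseek-v4-flash"
DEFAULT_BASE_URL = "https://integrate.api.nvidia.com/v1"


@dataclass(frozen=True)
class LLMSettings:
    api_key: Optional[str]
    model: str
    base_url: str

    @property
    def has_llm(self) -> bool:
        # Without a key the agent would take the heuristic path.
        return bool(self.api_key)


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read the three variables, treating empty strings as unset."""
    return LLMSettings(
        api_key=env.get("NVIDIA_API_KEY") or None,
        model=env.get("PAPER2RUN_MODEL") or DEFAULT_MODEL,
        base_url=env.get("PAPER2RUN_BASE_URL") or DEFAULT_BASE_URL,
    )


if __name__ == "__main__":
    settings = load_llm_settings()
    print("LLM path" if settings.has_llm else "heuristic fallback")
```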

Run

paper2run run \
  --repo https://github.com/karpathy/micrograd \
  --out runs \
  --timeout 600 \
  --max-attempts 3

Flags:

Flag            Default        Meaning
--repo          required       GitHub URL (https://github.com/... or git@github.com:...)
--paper         none           Local .pdf / .md / .markdown paper
--out           runs/          Directory to put per-run artifact folders into
--timeout       900            Smoke-run timeout per attempt, seconds
--max-attempts  3              Total tries (initial + LLM-driven repairs)
--model         .env default   Override LLM model name
--base-url      .env default   Override LLM API base URL
--dry-run       off            Skip docker build/run; only produce Dockerfile + INSTRUCTIONS
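
For readers who want to see the flag table as code, the sketch below rebuilds it with argparse. It is an approximation rather than the project's cli.py: build_parser and github_url are hypothetical names, and the URL pattern is an assumption about what counts as a GitHub URL. Flag names and defaults are taken from the table.

```python
import argparse
import re

# Assumed pattern: https or SSH form, owner/name, optional ".git" suffix.
_GITHUB_URL = re.compile(
    r"^(https://github\.com/|git@github\.com:)[\w.-]+/[\w.-]+?(\.git)?/?$"
)


def github_url(value: str) -> str:
    """argparse type hook that rejects anything other than a GitHub URL."""
    if not _GITHUB_URL.match(value):
        raise argparse.ArgumentTypeError("Repository must be a GitHub URL")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="paper2run")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="reproduce a repository in Docker")
    run.add_argument("--repo", required=True, type=github_url)
    run.add_argument("--paper", default=None)
    run.add_argument("--out", default="runs/")
    run.add_argument("--timeout", type=int, default=900)
    run.add_argument("--max-attempts", type=int, default=3)
    run.add_argument("--model", default=None)      # None: use the .env value
    run.add_argument("--base-url", default=None)   # None: use the .env value
    run.add_argument("--dry-run", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["run", "--repo", "https://github.com/karpathy/micrograd", "--dry-run"]
    )
    print(args.repo, args.timeout, args.max_attempts, args.dry_run)
```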

Outputs

Each run gets a timestamped folder:

runs/<repo>-<UTC>/
  repo/                  # the git clone
  paper.json             # parsed paper (if provided / built from repo md)
  repo.json              # discovered manifests, readmes, entrypoints
  repo_facts.json        # deterministic facts (analyze.py)
  plan.json              # latest EnvPlan
  Dockerfile             # latest rendered Dockerfile
  INSTRUCTIONS.md        # how to docker build / docker run
  image.txt              # tag of the built image (success only)
  report.md              # human-readable summary (success or failure)
  metadata.json          # machine-readable summary
  run.log                # chronological event log
  logs/                  # preflight stdout/stderr, LLM prompt+result dumps
  attempts/
    01/
      Dockerfile
      build_stdout.log
      build_stderr.log
      smoke_stdout.log
      smoke_stderr.log
      diagnosis.json     # LLM decision after this attempt
    02/...
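
The layout above can be addressed with a couple of path helpers. This is a sketch under stated assumptions: run_dir and attempt_dir are hypothetical helpers, and the timestamp format is a guess, since the README only says the folder name is <repo>-<UTC>.

```python
from datetime import datetime, timezone
from pathlib import Path


def run_dir(out: Path, repo_url: str, now: datetime) -> Path:
    """Build runs/<repo>-<UTC> from the repository URL and a timestamp."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
    # Assumed format; the real tool may render the UTC stamp differently.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return out / f"{name}-{stamp}"


def attempt_dir(run: Path, attempt: int) -> Path:
    """Attempts are numbered from 1 and zero-padded: attempts/01, attempts/02."""
    return run / "attempts" / f"{attempt:02d}"


if __name__ == "__main__":
    root = run_dir(
        Path("runs"),
        "https://github.com/karpathy/micrograd",
        datetime.now(timezone.utc),
    )
    print(root)
    print(attempt_dir(root, 1) / "build_stderr.log")
```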

Success means: image is built and smoke_command either exited 0 or stayed alive until the timeout. The user can rebuild the image any time via docker build -t <tag> . from the run folder (instructions are in INSTRUCTIONS.md).
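
The success rule (exit code 0, or still running when the timeout hits) maps naturally onto subprocess.run. The sketch below shows that rule for an arbitrary command; smoke_ok is a hypothetical name, and the real tool applies the rule to a docker run invocation rather than a local process.

```python
import subprocess
import sys


def smoke_ok(cmd: list[str], timeout: float) -> bool:
    """True if cmd exits 0, or is still alive when the timeout expires."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A long-running process (a server, a training loop) that survives
        # until the deadline is counted as a pass.
        return True
    return result.returncode == 0


if __name__ == "__main__":
    print(smoke_ok([sys.executable, "-c", "print('hello')"], timeout=30))
    print(smoke_ok([sys.executable, "-c", "raise SystemExit(2)"], timeout=30))
```

Treating a timeout as success keeps servers and training scripts from being reported as failures, at the cost of not catching a process that hangs without doing useful work.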

Failure means one of:

  • preflight-failed — Docker is not reachable.
  • stopped-by-llm — the model judged the failure unfixable and explained why.
  • max-attempts-exhausted — the model kept trying but ran out of attempts.
  • smoke-failed — fall-through for cases where no diagnosis was produced.
  • exception — uncaught error; full traceback in logs/exception.txt.
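
The exception outcome can be pictured as a guard around the whole run. This is a hedged sketch: run_guarded is a hypothetical wrapper, and only the status name and the logs/exception.txt location come from the list above.

```python
import traceback
from pathlib import Path
from typing import Callable


def run_guarded(run_dir: Path, body: Callable[[], str]) -> str:
    """Run body(); on an uncaught error, save the traceback and report it."""
    try:
        return body()
    except Exception:
        logs = run_dir / "logs"
        logs.mkdir(parents=True, exist_ok=True)
        (logs / "exception.txt").write_text(
            traceback.format_exc(), encoding="utf-8"
        )
        return "exception"


if __name__ == "__main__":
    import tempfile

    def broken() -> str:
        raise RuntimeError("docker socket vanished")

    with tempfile.TemporaryDirectory() as tmp:
        print(run_guarded(Path(tmp), broken))
        print((Path(tmp) / "logs" / "exception.txt").read_text(encoding="utf-8"))
```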

Modules

src/paper2run/
  cli.py          argparse → AppConfig → agent.run_agent
  config.py       AppConfig: env loading, validation, has_llm
  models.py       TextArtifact, PaperInfo, RepoInfo (the kept primitives)
  repository.py   git clone + repo walk (readmes, manifests, entrypoints)
  paper.py        PDF (pypdf) and markdown ingestion; repo-md fallback
  analyze.py      RepoFacts extractor (languages, managers, deps, smoke
                  candidates, GPU markers, existing Dockerfile/Makefile)
  schemas.py      Pydantic: EnvPlan + DiagnosisDecision
  templates.py    render_dockerfile, render_instructions
  llm.py          ChatOpenAI client, prompts, JSON parsing, heuristic
                  fallbacks for both planning and diagnosis
  docker.py       check_docker_access, build_image, run_container,
                  read_log_tail (thin subprocess wrappers)
  agent.py        LangGraph StateGraph and node implementations
  report.py       Markdown report generator
  io.py           write_json, append_run_log
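
To give a feel for how thin the helpers in docker.py can be, here is one possible shape for read_log_tail. The signature and the default of 50 lines are assumptions; the README only names the function and says that log tails are passed to the diagnosis step.

```python
from collections import deque
from pathlib import Path


def read_log_tail(path: Path, max_lines: int = 50) -> str:
    """Return the last max_lines lines of a log file, or "" if it is missing."""
    if not path.exists():
        return ""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the newest lines while streaming.
        return "".join(deque(fh, maxlen=max_lines))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "build_stderr.log"
        log.write_text("step 1\nstep 2\nerror: no space left\n", encoding="utf-8")
        print(read_log_tail(log, max_lines=1))
```

Streaming through a bounded deque avoids loading a large build log into memory just to keep its last few lines.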

LLM contract

The model only sees structured input and is required to return one of two Pydantic shapes. See src/paper2run/schemas.py.

EnvPlan:

base_image: str
system_packages: list[str]
workdir: str = "/repo"
install_steps: list[str]
env: dict[str, str]
smoke_command: str
requires_gpu: bool
notes: str
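
One way these fields could map onto a Dockerfile is sketched below. This is not the project's templates.py: the dataclass is a stdlib stand-in for the Pydantic model, and the apt-get line (which assumes a Debian-based base image), the COPY source path, and the sh -c wrapper around the smoke command are all assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EnvPlan:
    # Stdlib stand-in for the Pydantic EnvPlan; field names follow the README.
    base_image: str
    smoke_command: str
    system_packages: list[str] = field(default_factory=list)
    workdir: str = "/repo"
    install_steps: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)
    requires_gpu: bool = False
    notes: str = ""


def render_dockerfile(plan: EnvPlan) -> str:
    lines = [f"FROM {plan.base_image}"]
    if plan.system_packages:
        # Assumption: Debian/Ubuntu base image, so apt-get is available.
        pkgs = " ".join(plan.system_packages)
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{pkgs} && rm -rf /var/lib/apt/lists/*"
        )
    lines.append(f"WORKDIR {plan.workdir}")
    # Assumption: the build context is the run folder, with the clone in repo/.
    lines.append("COPY repo/ ./")
    for key, value in plan.env.items():
        lines.append(f"ENV {key}={json.dumps(value)}")
    for step in plan.install_steps:
        lines.append(f"RUN {step}")
    # Exec-form CMD is a JSON array, so json.dumps handles the quoting.
    lines.append("CMD " + json.dumps(["sh", "-c", plan.smoke_command]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    plan = EnvPlan(
        base_image="python:3.11-slim",
        smoke_command="python -c 'import micrograd'",
        install_steps=["pip install -e ."],
    )
    print(render_dockerfile(plan))
```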

DiagnosisDecision:

decision: Literal["stop", "repair"]
reason: str
confidence: Literal["low", "medium", "high"]
new_plan: EnvPlan | None    # required when decision="repair"

Repair always replaces the plan in full. There are no typed micro-patches ("install pkg X"); the model decides what the next Dockerfile looks like.
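
The rule that new_plan is required when decision="repair" can be enforced at parse time. The sketch below uses a stdlib dataclass as a stand-in for the Pydantic model in schemas.py; parse_diagnosis is a hypothetical helper, and keeping new_plan as a plain dict is a simplification.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiagnosisDecision:
    decision: str               # "stop" or "repair"
    reason: str
    confidence: str             # "low", "medium" or "high"
    new_plan: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.decision not in ("stop", "repair"):
            raise ValueError(f"unknown decision: {self.decision!r}")
        if self.confidence not in ("low", "medium", "high"):
            raise ValueError(f"unknown confidence: {self.confidence!r}")
        if self.decision == "repair" and self.new_plan is None:
            # A repair must carry a complete replacement plan.
            raise ValueError("decision='repair' requires new_plan")


def parse_diagnosis(raw: str) -> DiagnosisDecision:
    """Parse the model's JSON reply into a validated decision."""
    data = json.loads(raw)
    return DiagnosisDecision(
        decision=data["decision"],
        reason=data["reason"],
        confidence=data["confidence"],
        new_plan=data.get("new_plan"),
    )


if __name__ == "__main__":
    reply = '{"decision": "stop", "reason": "dataset is private", "confidence": "high"}'
    print(parse_diagnosis(reply))
```

Rejecting a repair without a plan at the boundary means the rest of the loop never has to handle a half-specified decision.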

Tests

PYTHONPATH=src python -m unittest discover -s tests

Covered:

  • tests/test_config.py — CLI/env config parsing
  • tests/test_io.py — log writer
  • tests/test_ingestion.py — repo inspection, paper parsing
  • tests/test_analyze.py — language/manager/deps/smoke detection
  • tests/test_templates.py — Dockerfile + INSTRUCTIONS rendering
  • tests/test_agent_routing.py — agent state graph with Docker and LLM mocked (success first try, repair-then-success, llm-stop, max-attempts)

Troubleshooting

Repository must be a GitHub URL — use https://github.com/... or git@github.com:.... Other forges are not supported yet.

docker info fails / preflight-failed — start Docker Desktop or the daemon. On Linux, add yourself to the docker group and re-login.

Temporary failure in name resolution during pip install — Docker Desktop's DNS is broken (often a VPN/firewall side-effect). Test with docker run --rm alpine nslookup pypi.org. Toggle the VPN, restart Docker Desktop, or change DNS in Settings → Resources → Network.

Run ends with stopped-by-llm immediately — the model judged the failure unfixable. Open attempts/01/diagnosis.json for the model's reason. Often the right move; otherwise rerun with --max-attempts higher or improve the repository's README.

No LLM available — the agent will fall back to a heuristic plan and to a one-shot heuristic diagnosis that can only say "stop". Set NVIDIA_API_KEY to get the real loop.

License

MIT (see LICENSE).

