diff --git a/.claude/commands/e2e.md b/.claude/commands/e2e.md
new file mode 100644
index 00000000..a3da6ea2
--- /dev/null
+++ b/.claude/commands/e2e.md
@@ -0,0 +1,90 @@
+---
+description: Run a Docker-only end-to-end smoke test of Buttercup against example-libpng with a low LLM budget, and monitor the pipeline.
+argument-hint: "[--budget N] [--task-duration SEC] [--image-tag TAG] [--no-pull]"
+allowed-tools: Bash(./scripts/e2e.sh:*), Bash(make e2e*), Bash(docker compose:*), Bash(cd dev/docker-compose && docker compose:*), Read
+---
+
+# /e2e — Docker-only end-to-end Buttercup run (example-libpng)
+
+This command exercises the full Buttercup pipeline on the [example-libpng](https://github.com/tob-challenges/example-libpng) challenge **using Docker only — no Kubernetes/minikube**. It uses the `dev/docker-compose/` stack with the **`compose.prebuilt.yaml` overlay** — every component runs from its prebuilt GHCR image (`ghcr.io/trailofbits/buttercup/*`, tag `main` by default), so **nothing is built locally**. A LiteLLM budget cap (default **$10**) bounds the spend — a full run through patch generation costs roughly that; a lower cap stops the pipeline before patch/bundle, so `--budget 3` only exercises up to seed-gen.
+
+> **Image tag:** defaults to `main`. Override with `--image-tag TAG` or `BUTTERCUP_IMAGE_TAG=...` to test a specific build. Private images require `docker login ghcr.io` first.
+>
+> **Host requirement:** x86_64. The prebuilt fuzzer / patcher / seed-gen images are based on `gcr.io/oss-fuzz-base/base-runner`, which is amd64-only. On aarch64 they only run under `qemu-user-static` + `binfmt` with `DOCKER_DEFAULT_PLATFORM=linux/amd64` (and run ~10× slower).
+
+Mirrors the milestones in `.github/workflows/system-integration.yml`, but tails `docker compose logs` instead of `kubectl logs`.
+
+## What it does
+
+1. Checks for `docker`, `docker compose`, `curl`, and at least one LLM provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GEMINI_API_KEY`) in your env (or already saved in `dev/docker-compose/.env`).
+2. Writes `dev/docker-compose/.env` with the provider keys and `LITELLM_MAX_BUDGET=$BUDGET` (default `10`). The submitted task's `duration` defaults to `7200`s (2h) — the CRS discards a task's work once its deadline passes, and the full pipeline can exceed 30 min, so a short duration would expire mid-patch.
+3. Pulls the prebuilt component images (`docker compose -f compose.yaml -f compose.prebuilt.yaml pull`, skippable with `--no-pull`) and starts every service (redis, dind, litellm, task-server, task-downloader, scheduler, program-model, build-bot, fuzzer-bot, coverage-bot, tracer-bot, seed-gen, patcher, buttercup-ui). No local image build.
+4. POSTs the canned libpng `trigger_task` payload to `http://localhost:31323/webhook/trigger_task`.
+5. Waits, in order, for these scheduler/seed-gen log markers:
+   - `Processing build output for type FUZZER` — fuzzer build done
+   - `pov_id=` — vulnerability found and POV submitted
+   - `Updated POV status. New status PASSED` — POV accepted by competition API
+   - `Copied N files to corpus` — seed-gen produced seeds
+   - `Appending patch for task` — patch generated
+   - polls for the `competition_patch_id=` summary line (logged only after the scheduler builds, verifies and submits the patch — minutes after the patch is generated), then approves via `POST /v1/task/<task_id>/patch/<patch_id>/approve`
+   - `Patch passed` — patch accepted
+   - `bundle_id=` — bundle submitted
+6. Prints a colored summary and tears the stack down with `docker compose down -v`.
+
+## Run it
+
+The driver is `scripts/e2e.sh`. The `Makefile` exposes `make e2e`.
+
+```bash
+# Plain run with the $10 budget / 7200s task-duration defaults
+make e2e
+
+# Pass flags through the Makefile
+make e2e E2E_ARGS="--budget 15 --no-pull"
+
+# Or call the script directly
+./scripts/e2e.sh --budget 10 --task-duration 7200
+./scripts/e2e.sh --image-tag my-branch --no-pull # run already-present images
+./scripts/e2e.sh --budget 3 # cheap: only reaches ~seed-gen
+```
+
+The script writes/overwrites `dev/docker-compose/.env` on each run.
+
+## Monitoring while it's running
+
+The script already streams milestone progress to its own stdout. For finer-grained visibility while it runs:
+
+```bash
+# All services, follow
+cd dev/docker-compose && docker compose logs -f
+
+# Just the scheduler (most milestones live here)
+cd dev/docker-compose && docker compose logs -f scheduler
+
+# Patcher, seed-gen, fuzzer-bot, program-model
+cd dev/docker-compose && docker compose logs -f patcher seed-gen fuzzer-bot program-model
+
+# LiteLLM spend tracking
+cd dev/docker-compose && docker compose logs -f litellm | grep -i 'spend\|budget'
+```
+
+The web UI is at `http://localhost:31323` (no port-forward needed — it's published on the host).
+
+## Tearing down
+
+```bash
+cd dev/docker-compose && docker compose down -v --remove-orphans
+```
+
+`scripts/e2e.sh` does this automatically on exit.
+
+## When you invoke /e2e
+
+When the user runs `/e2e`, default behavior:
+
+1. Run `./scripts/e2e.sh $ARGUMENTS` (forwarding any flags the user passed).
+2. While it runs, surface key transitions to the user. The script's own output already prints `[e2e] Reached: …` for each milestone — relay those as they arrive.
+3. If the run fails on a milestone, fetch the last ~50 lines of the relevant service:
+   - `cd dev/docker-compose && docker compose logs --tail=50 <service>`
+4. If the user asks to keep digging, expand the watch with `docker compose logs -f <service>` until the user is satisfied.
+5. On success, summarize the milestones reached and remind the user the stack is already torn down.
diff --git a/Makefile b/Makefile
index fbbd49e6..a5f0d445 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 # Makefile for Trail of Bits AIxCC Finals CRS
 
-.PHONY: help setup-local setup-azure validate deploy test undeploy install-cscope lint lint-component clean-local wait-crs check-crs crs-instance-id status send-integration-task
+.PHONY: help setup-local setup-azure validate deploy test undeploy install-cscope lint lint-component clean-local wait-crs check-crs crs-instance-id status send-integration-task e2e
 
 # Default target
 help:
@@ -23,6 +23,7 @@ help:
 	@echo "Testing:"
 	@echo " send-integration-task - Run integration-test task"
 	@echo " send-libpng-task - Run libpng task"
+	@echo " e2e - Docker-only end-to-end smoke test against example-libpng (low LLM budget)"
 	@echo ""
 	@echo "Development:"
 	@echo " install-cscope - Install cscope tool"
@@ -150,6 +151,11 @@ send-libpng-task:
 	./orchestrator/scripts/task_crs.sh; \
 	kill $$PORT_FORWARD_PID 2>/dev/null || true
 
+# Docker-only end-to-end run against example-libpng. No Kubernetes required.
+# Pass extra flags via E2E_ARGS, e.g.: make e2e E2E_ARGS="--budget 5 --no-pull"
+e2e:
+	@./scripts/e2e.sh $(E2E_ARGS)
+
 # Development targets
 lint:
 	@echo "Linting all Python code..."
diff --git a/scripts/e2e.sh b/scripts/e2e.sh
new file mode 100755
index 00000000..84f93799
--- /dev/null
+++ b/scripts/e2e.sh
@@ -0,0 +1,480 @@
+#!/usr/bin/env bash
+# scripts/e2e.sh — Run the full Buttercup pipeline against example-libpng using
+# the dev docker-compose stack (no Kubernetes required).
+#
+# Uses the prebuilt component images published to GHCR (via the
+# compose.prebuilt.yaml overlay) instead of building them locally, so a run
+# does not depend on a working local image build.
+#
+# This mirrors the milestones checked by .github/workflows/system-integration.yml
+# but reads docker-compose logs instead of `kubectl logs`.
+
+set -u
+set -o pipefail
+
+###############################################################################
+# Config & defaults
+###############################################################################
+
+# Resolve repo root from this script's location.
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+COMPOSE_DIR="${REPO_ROOT}/dev/docker-compose"
+ENV_FILE="${COMPOSE_DIR}/.env"
+
+# Defaults — overridable via flags or environment.
+#
+# BUDGET: a full run through patch generation costs ~$10 of LLM spend; $3 is
+# exhausted during/just after POV, so anything past seed-gen would always time
+# out. Default to 10 so the whole pipeline (incl. patch+bundle) is reachable.
+#
+# TASK_DURATION: the CRS discards a task's work once its deadline passes. On
+# normal hardware build->POV->seed-gen->patch exceeds 30 min, so an 1800s task
+# expires mid-patch ("task expired/cancelled? Will discard") and never reaches
+# patch/bundle. Default to 7200 (2h) so the task outlives the pipeline.
+BUDGET="${LITELLM_MAX_BUDGET:-10}"
+TASK_DURATION="${E2E_TASK_DURATION:-7200}"
+
+# Prebuilt GHCR images instead of local builds (compose.prebuilt.yaml overlay).
+IMAGE_TAG="${BUTTERCUP_IMAGE_TAG:-main}"
+
+DO_PULL=1
+
+# Internal milestone timeouts (seconds). Bundle submission is quick; the rest
+# (build, vuln, seed-gen, patch) can each take a while on a low-budget run.
+MILESTONE_TIMEOUT=1800
+BUNDLE_TIMEOUT=300
+
+# Temp file for the trigger_task HTTP response; cleaned up on exit.
+TASK_RESP=""
+
+###############################################################################
+# Logging
+###############################################################################
+
+if [[ -t 1 ]]; then
+  C_RST=$'\033[0m'; C_RED=$'\033[1;31m'; C_GRN=$'\033[1;32m'
+  C_YLW=$'\033[1;33m'; C_BLU=$'\033[1;36m'; C_DIM=$'\033[2m'
+else
+  C_RST=""; C_RED=""; C_GRN=""; C_YLW=""; C_BLU=""; C_DIM=""
+fi
+
+log() { printf '%s[e2e]%s %s\n' "$C_BLU" "$C_RST" "$*"; }
+ok() { printf '%s[e2e]%s %s\n' "$C_GRN" "$C_RST" "$*"; }
+warn() { printf '%s[e2e]%s %s\n' "$C_YLW" "$C_RST" "$*" >&2; }
+err() { printf '%s[e2e]%s %s\n' "$C_RED" "$C_RST" "$*" >&2; }
+dim() { printf '%s[e2e]%s %s%s%s\n' "$C_BLU" "$C_RST" "$C_DIM" "$*" "$C_RST"; }
+
+###############################################################################
+# Usage
+###############################################################################
+
+usage() {
+  cat </dev/null 2>&1; then
+  err "docker is required but not installed."
+  exit 1
+fi
+if ! docker compose version >/dev/null 2>&1; then
+  err "'docker compose' v2 is required (not 'docker-compose')."
+  exit 1
+fi
+if ! command -v curl >/dev/null 2>&1; then
+  err "curl is required but not installed."
+  exit 1
+fi
+
+# Read a value already present in the existing .env. Used so that variables
+# not provided via the environment (e.g. LANGFUSE_*) are preserved across runs
+# instead of being clobbered, since this script regenerates .env from scratch
+# on every run.
+prev_env() {
+  [[ -f "$ENV_FILE" ]] || return 0
+  sed -n "s/^$1=//p" "$ENV_FILE" | head -n1
+}
+
+# 1) Prefer the environment; 2) fall back to whatever is already in .env.
+: "${ANTHROPIC_API_KEY:=$(prev_env ANTHROPIC_API_KEY)}"
+: "${OPENAI_API_KEY:=$(prev_env OPENAI_API_KEY)}"
+: "${GEMINI_API_KEY:=$(prev_env GEMINI_API_KEY)}"
+: "${AZURE_API_BASE:=$(prev_env AZURE_API_BASE)}"
+: "${AZURE_API_KEY:=$(prev_env AZURE_API_KEY)}"
+: "${LANGFUSE_HOST:=$(prev_env LANGFUSE_HOST)}"
+: "${LANGFUSE_PUBLIC_KEY:=$(prev_env LANGFUSE_PUBLIC_KEY)}"
+: "${LANGFUSE_SECRET_KEY:=$(prev_env LANGFUSE_SECRET_KEY)}"
+
+# Require at least one usable provider key. Checked *after* the .env fallback
+# above so a key saved to .env on a prior run still counts.
+provider_keys_set=0
+for v in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY; do
+  val="${!v:-}"
+  if [[ -n "$val" && "$val" != "" ]]; then
+    provider_keys_set=1
+  fi
+done
+if [[ "$provider_keys_set" -eq 0 ]]; then
+  err "No LLM provider key found. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GEMINI_API_KEY."
+  err "Tip: 'export ANTHROPIC_API_KEY=...; scripts/e2e.sh' or add it to ${ENV_FILE} first."
+  exit 1
+fi
+
+# 3) Final placeholders if still unset after both env and .env. Keys left at
+# the placeholder so litellm still loads its config (some models will fail at
+# request time, others will succeed). LANGFUSE_* are intentionally left unset
+# here: empty lines are NOT written to .env below, so a run without them set
+# never clobbers LANGFUSE_* the user previously had in .env.
+: "${ANTHROPIC_API_KEY:=}"
+: "${OPENAI_API_KEY:=}"
+: "${GEMINI_API_KEY:=}"
+: "${AZURE_API_BASE:=}"
+: "${AZURE_API_KEY:=}"
+: "${LANGFUSE_HOST:=}"
+: "${LANGFUSE_PUBLIC_KEY:=}"
+: "${LANGFUSE_SECRET_KEY:=}"
+
+###############################################################################
+# .env generation
+###############################################################################
+
+log "Writing ${ENV_FILE} (LITELLM_MAX_BUDGET=\$${BUDGET})"
+{
+  echo "# Generated by scripts/e2e.sh on $(date -Is)"
+  # litellm master key — internal to the local stack, not user-facing.
+  echo "BUTTERCUP_LITELLM_KEY=sk-1234"
+  echo "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}"
+  echo "OPENAI_API_KEY=${OPENAI_API_KEY}"
+  echo "GEMINI_API_KEY=${GEMINI_API_KEY}"
+  echo "AZURE_API_BASE=${AZURE_API_BASE}"
+  echo "AZURE_API_KEY=${AZURE_API_KEY}"
+  echo "LITELLM_MAX_BUDGET=${BUDGET}"
+  # Only emit LANGFUSE_* when we actually have a value, so a run without
+  # them set leaves no empty LANGFUSE_HOST= behind to disable telemetry.
+  [[ -n "$LANGFUSE_HOST" ]] && echo "LANGFUSE_HOST=${LANGFUSE_HOST}"
+  [[ -n "$LANGFUSE_PUBLIC_KEY" ]] && echo "LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}"
+  [[ -n "$LANGFUSE_SECRET_KEY" ]] && echo "LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}"
+  true
+} > "$ENV_FILE"
+
+###############################################################################
+# docker compose helpers
+###############################################################################
+
+# Always run compose from the compose dir so relative includes resolve.
+# The compose.prebuilt.yaml overlay swaps every locally-built service for its
+# prebuilt GHCR image, so nothing is built locally.
+dc() {
+  (cd "$COMPOSE_DIR" \
+    && BUTTERCUP_IMAGE_TAG="$IMAGE_TAG" \
+    docker compose -f compose.yaml -f compose.prebuilt.yaml "$@")
+}
+
+on_exit() {
+  rc=$?
+  [[ -n "$TASK_RESP" ]] && rm -f "$TASK_RESP"
+  log "Tearing the stack down (docker compose down -v)"
+  dc down -v --remove-orphans || true
+  if [[ $rc -ne 0 ]]; then
+    err "e2e run finished with exit code $rc"
+  fi
+  exit $rc
+}
+trap on_exit EXIT; trap 'exit 130' INT; trap 'exit 143' TERM
+
+###############################################################################
+# Bring the stack up
+###############################################################################
+
+if [[ "$DO_PULL" -eq 1 ]]; then
+  log "Pulling prebuilt component images from GHCR (tag: ${IMAGE_TAG})"
+  if ! dc pull; then
+    err "docker compose pull failed for tag '${IMAGE_TAG}'."
+    err "Check that the tag exists at ghcr.io/trailofbits/buttercup/* and that"
+    err "you can reach GHCR (private images need 'docker login ghcr.io')."
+    err "Override with --image-tag TAG or BUTTERCUP_IMAGE_TAG=..."
+    exit 1
+  fi
+fi
+
+log "Starting services"
+if ! dc up -d; then
+  err "docker compose up failed. Check 'docker compose ps' / logs."
+  exit 1
+fi
+
+# Wait for the buttercup-ui task webhook to be reachable.
+log "Waiting for buttercup-ui to accept connections on http://localhost:31323"
+ui_up=0
+for _ in $(seq 1 120); do
+  if curl -sf -o /dev/null -m 2 http://localhost:31323/v1/ping/ 2>/dev/null \
+    || curl -sf -o /dev/null -m 2 http://localhost:31323/ 2>/dev/null; then
+    ui_up=1; break
+  fi
+  sleep 2
+done
+if [[ "$ui_up" -ne 1 ]]; then
+  err "buttercup-ui did not come up on port 31323. Check 'docker compose logs buttercup-ui'."
+  exit 1
+fi
+ok "buttercup-ui is up."
+
+###############################################################################
+# Submit the task
+###############################################################################
+
+TASK_JSON=$(cat <` until a line matching PATTERN appears
+# or TIMEOUT_SEC elapses. Returns 0 on success, non-zero on timeout.
+wait_for() {
+  local service="$1" pattern="$2" timeout="$3" label="$4"
+  local deadline=$(( $(date +%s) + timeout ))
+  log "Waiting for milestone: ${label} ${C_DIM}(service=${service}, timeout=${timeout}s)${C_RST}"
+
+  while [[ $(date +%s) -lt $deadline ]]; do
+    # --no-color so the grep matches plain text; --tail=all replays history.
+    # NOTE: capture into a var with `|| true` instead of `if cmd | grep`.
+    # Under `set -o pipefail`, `grep -m1` exits on the first match and the
+    # upstream `docker compose logs` then dies with SIGPIPE (rc 141), which
+    # would make the whole pipeline "fail" and the milestone never register
+    # for high-volume services whose match is early in the stream.
+    local match
+    match="$(dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+      | grep -m1 -E "$pattern" || true)"
+    if [[ -n "$match" ]]; then
+      ok "Reached: ${label}"
+      return 0
+    fi
+    sleep 15
+  done
+
+  err "Timed out after ${timeout}s waiting for: ${label}"
+  err "Recent logs from ${service}:"
+  dc logs --no-color --tail=50 "$service" >&2 || true
+  return 1
+}
+
+# Capture a single matching log line (returns it on stdout, empty on miss).
+capture_line() {
+  local service="$1" pattern="$2"
+  dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+    | grep -E "$pattern" | head -n1 || true
+}
+
+# wait_capture SERVICE PATTERN TIMEOUT_SEC LABEL
+#
+# Like capture_line, but polls until the pattern appears or TIMEOUT_SEC
+# elapses, echoing the first matching line on stdout (empty on timeout).
+# Progress goes to stderr so stdout stays just the captured line.
+#
+# Needed because `competition_patch_id=` is logged by the scheduler only
+# *after* it builds, verifies and submits the patch — minutes after the
+# "Appending patch for task" milestone. A one-shot capture right after that
+# milestone always races and loses, so approval would always be skipped.
+wait_capture() {
+  local service="$1" pattern="$2" timeout="$3" label="$4"
+  local deadline=$(( $(date +%s) + timeout ))
+  log "Waiting to capture: ${label} ${C_DIM}(service=${service}, timeout=${timeout}s)${C_RST}" >&2
+  while [[ $(date +%s) -lt $deadline ]]; do
+    local match
+    match="$(dc logs --no-color --no-log-prefix --tail=all "$service" 2>/dev/null \
+      | grep -m1 -E "$pattern" || true)"
+    if [[ -n "$match" ]]; then
+      printf '%s\n' "$match"
+      return 0
+    fi
+    sleep 15
+  done
+  return 1
+}
+
+###############################################################################
+# Walk through the pipeline
+###############################################################################
+
+declare -a SUMMARY=()
+record() { SUMMARY+=("$1"); }
+
+if wait_for scheduler \
+  "Processing build output for type FUZZER" \
+  "$MILESTONE_TIMEOUT" "fuzzer build processed"; then
+  record "fuzzer-build: ok"
+else
+  record "fuzzer-build: TIMEOUT"
+fi
+
+# NOTE: match the structured summary line (`[i:task] pov_id= ...`,
+# logger.info), NOT the "POV submission response:" debug line whose payload is
+# an API object repr that never contains a literal `pov_id=`.
+if wait_for scheduler \
+  "pov_id=" \
+  "$MILESTONE_TIMEOUT" "vulnerability (POV) submitted"; then
+  record "pov-submit: ok"
+else
+  record "pov-submit: TIMEOUT"
+fi
+
+if wait_for scheduler \
+  "Updated POV status. New status PASSED" \
+  "$MILESTONE_TIMEOUT" "POV accepted by competition API"; then
+  record "pov-passed: ok"
+else
+  record "pov-passed: TIMEOUT"
+fi
+
+if wait_for seed-gen \
+  "Copied [1-9][0-9]* files to corpus" \
+  "$MILESTONE_TIMEOUT" "seed-gen produced seeds"; then
+  record "seed-gen: ok"
+else
+  record "seed-gen: TIMEOUT"
+fi
+
+if wait_for scheduler \
+  "Appending patch for task" \
+  "$MILESTONE_TIMEOUT" "patch generated"; then
+  record "patch-generated: ok"
+else
+  record "patch-generated: TIMEOUT"
+fi
+
+# Approve the patch (the local UI requires explicit approval, unlike scored
+# rounds where it is automatic). competition_patch_id= only appears once the
+# scheduler has built+verified+submitted the patch, well after the patch was
+# generated, so poll for it rather than capturing once (which always races).
+PATCH_LINE="$(wait_capture scheduler 'competition_patch_id=[0-9a-fA-F-]' \
+  "$MILESTONE_TIMEOUT" "competition_patch_id (for approval)" || true)"
+if [[ -n "$PATCH_LINE" ]]; then
+  PATCH_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*competition_patch_id=\([^ ]*\).*/\1/p')
+  # Task id is inside the last [...] block on the line, after its first ':'.
+  TASK_ID=$(printf '%s' "$PATCH_LINE" | sed -n 's/.*\[\([^]]*\)\].*/\1/p' | sed 's/^[^:]*://')
+  if [[ -n "$PATCH_ID" && -n "$TASK_ID" ]]; then
+    log "Approving patch ${C_DIM}task=${TASK_ID} patch=${PATCH_ID}${C_RST}"
+    if curl -fsS -X POST \
+      "http://127.0.0.1:31323/v1/task/${TASK_ID}/patch/${PATCH_ID}/approve" \
+      >/dev/null; then
+      record "patch-approve: ok"
+    else
+      record "patch-approve: HTTP fail"
+    fi
+  else
+    warn "Could not extract patch/task ids from: $PATCH_LINE"
+    record "patch-approve: skipped (parse fail)"
+  fi
+else
+  warn "No competition_patch_id= line seen; skipping approval"
+  record "patch-approve: skipped (no patch line)"
+fi
+
+if wait_for scheduler \
+  "Patch passed" \
+  "$MILESTONE_TIMEOUT" "patch accepted by competition API"; then
+  record "patch-passed: ok"
+else
+  record "patch-passed: TIMEOUT"
+fi
+
+# NOTE: same as POV above — match the structured summary `bundle_id=`
+# (logger.info), not the "Bundle submission response:" debug object repr.
+if wait_for scheduler \
+  "bundle_id=" \
+  "$BUNDLE_TIMEOUT" "bundle submitted"; then
+  record "bundle-submit: ok"
+else
+  record "bundle-submit: TIMEOUT"
+fi
+
+###############################################################################
+# Summary
+###############################################################################
+
+printf '\n%s===================== e2e summary =====================%s\n' "$C_BLU" "$C_RST"
+for line in "${SUMMARY[@]}"; do
+  if [[ "$line" == *": ok" ]]; then
+    printf ' %s✓%s %s\n' "$C_GRN" "$C_RST" "$line"
+  elif [[ "$line" == *": TIMEOUT" || "$line" == *"fail"* ]]; then
+    printf ' %s✗%s %s\n' "$C_RED" "$C_RST" "$line"
+  else
+    printf ' %s•%s %s\n' "$C_YLW" "$C_RST" "$line"
+  fi
+done
+printf '%s=======================================================%s\n' "$C_BLU" "$C_RST"
+
+# Exit non-zero if any milestone failed.
+for line in "${SUMMARY[@]}"; do
+  if [[ "$line" == *": TIMEOUT" || "$line" == *"fail"* ]]; then
+    exit 1
+  fi
+done
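
The SIGPIPE note in `wait_for` can be tried outside the compose stack. The following is a minimal sketch, not code from this PR: `first_match` is a hypothetical helper name, and `seq` is an assumed stand-in for a high-volume `docker compose logs` stream whose match appears early.

```shell
#!/usr/bin/env bash
set -u
set -o pipefail

# Hypothetical helper (not part of scripts/e2e.sh): run a producer command
# and print the first line matching an extended regex, or nothing on a miss.
first_match() {
  local pattern="$1"; shift
  # grep -m1 exits after its first hit; a producer that is still writing may
  # then be killed by SIGPIPE, which pipefail reports as a pipeline failure.
  # `|| true` keeps that failure from being read as "no match".
  "$@" 2>/dev/null | grep -m1 -E "$pattern" || true
}

# seq stands in for the log stream; the match is on the third line.
match="$(first_match 'line 3$' seq -f 'line %g' 1 200000)"

# Decide on the captured text, never on the pipeline's exit status.
if [[ -n "$match" ]]; then
  echo "matched: $match"
else
  echo "no match"
fi
```

Testing the captured string rather than the pipeline status mirrors what `wait_for` and `wait_capture` do: an early match still registers even when the upstream command exits non-zero.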