From b4cd09c35daf6fcc2e9adc8855f16cd1b641f70e Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 11:42:45 -0800
Subject: [PATCH 01/11] Add E2E triage automation: skill, CI workflow, and
 artifact download script

Automates E2E failure triage with three new components:
- scripts/download-e2e-artifacts.sh: reusable script to download CI artifacts
- .claude/skills/e2e-triage/SKILL.md: 7-step triage skill (classify flaky vs real bug, create PRs or issues)
- .github/workflows/e2e-triage.yml: workflow_run trigger that auto-runs Claude Opus on E2E failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 1aa72dcd8a2b
---
 .claude/skills/e2e-triage/SKILL.md | 140 +++++++++++++++++++++++++++++
 .github/workflows/e2e-triage.yml   |  64 +++++++++++++
 e2e/README.md                      |  15 +++-
 scripts/download-e2e-artifacts.sh  | 106 ++++++++++++++++++++++
 4 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 .claude/skills/e2e-triage/SKILL.md
 create mode 100644 .github/workflows/e2e-triage.yml
 create mode 100755 scripts/download-e2e-artifacts.sh
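A minimal sketch of how the three input forms accepted by scripts/download-e2e-artifacts.sh collapse to a run ID. `parse_run_input` is a hypothetical helper named only for illustration. It mirrors the script's `case` statement but never calls `gh`, so `latest` is passed through unresolved. The sketch assumes nothing beyond bash and grep.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the patch): normalize a run ID or run URL
# to a numeric run ID. "latest" is passed through because resolving it needs gh.
parse_run_input() {
  local input="$1" id
  case "$input" in
    latest)
      echo "latest"
      ;;
    http*)
      # `|| true` keeps a non-matching grep from aborting under set -e / pipefail.
      id=$(echo "$input" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+' || true)
      [ -n "$id" ] || { echo "ERROR: no run ID in URL: $input" >&2; return 1; }
      echo "$id"
      ;;
    ''|*[!0-9]*)
      echo "ERROR: invalid input: '$input'" >&2
      return 1
      ;;
    *)
      echo "$input"
      ;;
  esac
}

parse_run_input 12345
parse_run_input https://github.com/entireio/cli/actions/runs/12345
```

Both calls print `12345`. Errors go to stderr, so a caller that captures stdout sees only the run ID, which is the same convention the script uses for its final path line.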
diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
new file mode 100644
index 000000000..34d54d895
--- /dev/null
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -0,0 +1,140 @@
+---
+name: e2e-triage
+description: Triage E2E test failures — download CI artifacts, classify flaky vs real bug, create PRs for flaky fixes and GitHub issues for real bugs
+---
+
+# E2E Triage
+
+Automate triage of E2E test failures. Analyze artifacts, classify each failure as **flaky** (agent non-determinism) or **real-bug** (CLI defect), then take action: batched PR for flaky fixes, GitHub issues for real bugs.
+
+## Inputs
+
+The user provides one of:
+- **`latest`** — download artifacts from the most recent failed E2E run on main
+- **A run ID or URL** — download artifacts from that specific run
+- **A local path** — use existing artifact directory (skip download)
+
+## Step 1: Download Artifacts
+
+**If given a run ID, URL, or "latest":**
+```bash
+artifact_dir=$(scripts/download-e2e-artifacts.sh <input>)
+```
+
+**If given a local path:** Use directly, skip download.
+
+## Step 2: Identify Failures
+
+For each agent subdirectory in the artifact root:
+1. Read `report.nocolor.txt` — list failed tests with error messages, file:line references
+2. Skip agents with zero failures
+3. Build failure list: `[(test_name, agent, error_line, duration, file:line)]`
+
+## Step 3: Analyze Each Failure
+
+For each failure, follow the debug-e2e methodology:
+
+1. **Read `console.log`** — what did the agent actually do? Full chronological transcript.
+2. **Read test source at file:line** — what was expected?
+3. **Read `entire-logs/entire.log`** — any CLI errors, panics, unexpected behavior?
+4. **Read `git-log.txt` / `git-tree.txt`** — repo state at failure time
+5. **Read `checkpoint-metadata/`** — corrupt or missing metadata?
+
+## Step 4: Classify Each Failure
+
+### Strong `real-bug` signals (any one is sufficient):
+
+- `entire.log` contains `"level":"ERROR"` or panic/stack traces
+- Checkpoint metadata structurally corrupt (malformed JSON, missing `checkpoint_id`/`strategy`)
+- Session state file missing or malformed when expected
+- Hooks did not fire at all (no `hook invoked` log entries)
+- Shadow/metadata branch has wrong tree structure
+- Same test fails across 3+ agents with same non-timeout symptom
+- Error references CLI code (panic in `cmd/entire/cli/`)
+
+### Strong `flaky` signals (unless overridden by real-bug):
+
+- `signal: killed` (timeout)
+- `context deadline exceeded` or `WaitForCheckpoint.*exceeded deadline`
+- Agent asked for confirmation instead of acting
+- Agent created file at wrong path / wrong name
+- Agent produced no output
+- Agent committed when it shouldn't have (or vice versa)
+- Test fails for only one agent, passes for others
+- Duration near timeout limit
+
+### Ambiguous cases:
+
+Read `entire.log` carefully:
+- If hooks fired correctly and metadata is valid -> lean **flaky**
+- If hooks fired but produced wrong results -> lean **real-bug**
+
+## Step 5: Cross-Agent Correlation
+
+Before acting, check correlations:
+- Same test fails for 3+ agents with similar errors -> override to `real-bug`
+- Same test fails for only 1 agent -> lean `flaky`
+
+## Step 6: Take Action
+
+### For `flaky` failures: Batched PR
+
+1. Create branch `fix/e2e-flaky-<run-id>`
+2. Apply fixes to ALL flaky test files (one branch, one PR):
+   - Agent asked for confirmation -> append "Do not ask for confirmation" to prompt
+   - Agent wrote to wrong path -> be more explicit about paths in prompt
+   - Agent committed when shouldn't -> add "Do not commit" to prompt
+   - Checkpoint wait timeout -> increase timeout argument
+   - Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
+3. Run verification:
+   ```bash
+   mise run test:e2e:canary   # Must pass
+   mise run fmt && mise run lint
+   ```
+4. If canary fails, investigate and adjust. If unfixable, fall back to issue creation.
+5. Commit and create PR:
+   ```bash
+   gh pr create \
+     --title "fix(e2e): make flaky tests more resilient (run <run-id>)" \
+     --body "<structured body with per-test changes, evidence, run link>"
+   ```
+
+### For `real-bug` failures: Issue (with dedup)
+
+1. **Search existing issues first:**
+   ```bash
+   gh issue list --search "is:open label:e2e <TestName>" --json number,title,url
+   ```
+
+2. **If matching issue exists:** Add a comment with new evidence:
+   ```bash
+   gh issue comment <number> --body "<verification details, run link, new evidence>"
+   ```
+   Note "Verified still failing in CI run <URL>" plus any new diagnostic details.
+
+3. **If no matching issue:** Create new:
+   ```bash
+   gh issue create \
+     --title "E2E: <TestName> fails — <brief symptom>" \
+     --label "bug,e2e" \
+     --body "<structured body>"
+   ```
+
+   Issue body includes:
+   - Test name, agent(s), CI run link, frequency
+   - Failure summary (expected vs actual)
+   - Root cause analysis (which CLI component: hooks, session, checkpoint, attribution, strategy)
+   - Key evidence: `entire.log` excerpts, `console.log` excerpts, git state
+   - Reproduction steps
+   - Suspected fix location (file, function, reason)
+
+## Step 7: Summary Report
+
+Print a summary table:
+```
+| Test | Agent(s) | Classification | Action | Link |
+|------|----------|---------------|--------|------|
+| TestFoo | claude-code | flaky | PR #123 | url |
+| TestBar | all agents | real-bug | Issue #456 (existing, commented) | url |
+| TestBaz | opencode | real-bug | Issue #457 (new) | url |
+```
diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
new file mode 100644
index 000000000..db2a73126
--- /dev/null
+++ b/.github/workflows/e2e-triage.yml
@@ -0,0 +1,64 @@
+name: E2E Triage
+
+on:
+  workflow_run:
+    workflows: ["E2E Tests"]
+    types: [completed]
+
+permissions:
+  contents: write
+  pull-requests: write
+  issues: write
+  actions: read
+
+concurrency:
+  group: e2e-triage
+  cancel-in-progress: true
+
+jobs:
+  triage:
+    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v6
+
+      - name: Setup mise
+        uses: jdx/mise-action@v3
+
+      - name: Install system dependencies
+        run: sudo apt-get update && sudo apt-get install -y tmux
+
+      - name: Install Claude Code CLI
+        run: |
+          curl -fsSL https://claude.ai/install.sh | bash
+          echo "$HOME/.local/bin" >> $GITHUB_PATH
+
+      - name: Download E2E artifacts
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
+        run: |
+          chmod +x scripts/download-e2e-artifacts.sh
+          scripts/download-e2e-artifacts.sh "$WORKFLOW_RUN_ID"
+
+      - name: Run triage
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
+          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url }}
+        run: |
+          ARTIFACT_DIR="e2e/artifacts/ci-${WORKFLOW_RUN_ID}"
+          claude -p \
+            --model claude-opus-4-6 \
+            --allowedTools "Bash(mise run *),Bash(gh *),Bash(git *),Read,Write,Edit,Glob,Grep" \
+            "You are triaging E2E test failures from CI run ${WORKFLOW_RUN_ID}.
+             Follow the /e2e-triage skill instructions in .claude/skills/e2e-triage/SKILL.md.
+             Artifact directory: ${ARTIFACT_DIR}
+             CI run URL: ${WORKFLOW_RUN_URL}
+             Do NOT ask questions — act autonomously.
+             For flaky fixes: create a PR.
+             For real bugs: create or comment on GitHub issues.
+             Print a summary table at the end."
diff --git a/e2e/README.md b/e2e/README.md
index f60017c6e..d8fe7fef4 100644
--- a/e2e/README.md
+++ b/e2e/README.md
@@ -60,6 +60,8 @@ Artifacts are captured to `e2e/artifacts/` on every run (git-log, git-tree, cons
 
 Use the `debug-e2e` skill (`.claude/skills/debug-e2e/`) for a structured workflow when investigating failures.
 
+Use the `e2e-triage` skill (`.claude/skills/e2e-triage/`) to automate full triage: download CI artifacts, classify failures as flaky vs real bug, and create PRs or GitHub issues. Run locally with `/e2e-triage` or see the automated CI workflow below.
+
 ### Reading artifacts
 
 - `console.log` — full operation transcript including agent stdout/stderr
@@ -82,5 +84,16 @@ To diagnose: read `console.log` in the failing test's artifact directory. Compar
 
 - **`.github/workflows/e2e.yml`** — Runs full suite on push to main. Matrix: `[claude, opencode, gemini]`.
 - **`.github/workflows/e2e-isolated.yml`** — Manual dispatch for debugging a single test. Inputs: agent + test name filter.
+- **`.github/workflows/e2e-triage.yml`** — Auto-triggers on E2E failure via `workflow_run`. Runs Claude Code (Opus) to download artifacts, classify failures, and create PRs (flaky) or issues (real bugs).
+
+Both E2E workflows run `go run ./e2e/bootstrap` before tests to handle agent-specific CI setup (auth config, warmup).
+
+### Downloading CI Artifacts Locally
+
+```bash
+scripts/download-e2e-artifacts.sh latest              # Most recent failed run
+scripts/download-e2e-artifacts.sh 12345               # Specific run ID
+scripts/download-e2e-artifacts.sh https://github.com/entireio/cli/actions/runs/12345  # URL
+```
 
-Both workflows run `go run ./e2e/bootstrap` before tests to handle agent-specific CI setup (auth config, warmup).
+Downloads to `e2e/artifacts/ci-<run-id>/` with per-agent subdirectories and a `.run-info.json` metadata file.
diff --git a/scripts/download-e2e-artifacts.sh b/scripts/download-e2e-artifacts.sh
new file mode 100755
index 000000000..6287c5ff5
--- /dev/null
+++ b/scripts/download-e2e-artifacts.sh
@@ -0,0 +1,106 @@
+#!/usr/bin/env bash
+#
+# Download E2E test artifacts from GitHub Actions.
+#
+# Usage: scripts/download-e2e-artifacts.sh [RUN_ID | RUN_URL | "latest"]
+#   RUN_ID:  numeric GitHub Actions run ID
+#   RUN_URL: full URL like https://github.com/entireio/cli/actions/runs/12345
+#   "latest": most recent failed E2E run on main
+#
+# Outputs the absolute path of the download directory as the last line of stdout.
+# All diagnostic messages go to stderr.
+
+set -euo pipefail
+
+log() { echo "$@" >&2; }
+die() { log "ERROR: $1"; exit 1; }
+
+# --- Validate prerequisites ---
+
+command -v gh >/dev/null 2>&1 || die "'gh' CLI is not installed. Install from https://cli.github.com/"
+gh auth status >/dev/null 2>&1 || die "'gh' is not authenticated. Run 'gh auth login' first."
+
+# --- Parse input ---
+
+input="${1:-}"
+[ -z "$input" ] && die "Usage: $0 [RUN_ID | RUN_URL | \"latest\"]"
+
+run_id=""
+
+case "$input" in
+  latest)
+    log "Finding most recent failed E2E run on main..."
+    run_id=$(gh run list -w e2e.yml --branch main --status=failure -L1 --json databaseId -q '.[0].databaseId' 2>/dev/null || true)
+    [ -z "$run_id" ] && die "No failed E2E runs found."
+    log "Found run: $run_id"
+    ;;
+  http*)
+    # Extract run ID from URL: https://github.com/<owner>/<repo>/actions/runs/<id>
+    run_id=$(echo "$input" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+' || true)
+    [ -z "$run_id" ] && die "Could not extract run ID from URL: $input"
+    log "Extracted run ID: $run_id"
+    ;;
+  *[!0-9]*)
+    die "Invalid input: '$input'. Provide a numeric run ID, a GitHub Actions URL, or 'latest'."
+    ;;
+  *)
+    run_id="$input"
+    ;;
+esac
+
+# --- Fetch run metadata ---
+
+log "Fetching run metadata..."
+run_url=$(gh run view "$run_id" --json url -q '.url' 2>/dev/null) || die "Run $run_id not found."
+commit=$(gh run view "$run_id" --json headSha -q '.headSha' 2>/dev/null) || commit="unknown"
+
+log "Run URL: $run_url"
+log "Commit:  $commit"
+
+# --- Download artifacts ---
+
+dest="e2e/artifacts/ci-${run_id}"
+mkdir -p "$dest"
+
+log "Downloading artifacts to $dest/ ..."
+gh run download "$run_id" --dir "$dest" 2>/dev/null || die "Failed to download artifacts. They may have expired (retention: 7 days)."
+
+# --- Restructure: flatten e2e-artifacts-<agent>/ wrapper dirs ---
+
+cd "$dest"
+for wrapper in e2e-artifacts-*/; do
+  [ -d "$wrapper" ] || continue
+  agent="${wrapper#e2e-artifacts-}"
+  agent="${agent%/}"
+  # Move contents up: e2e-artifacts-claude-code/* -> claude-code/
+  if [ -d "$agent" ]; then
+    # Agent dir already exists (shouldn't happen, but be safe)
+    cp -r "$wrapper"/* "$agent"/ 2>/dev/null || true
+  else
+    mv "$wrapper" "$agent"
+  fi
+done
+cd - >/dev/null
+
+# --- Write run metadata ---
+
+agents_found=$(cd "$dest" && ls -d */ 2>/dev/null | tr -d '/' | tr '\n' ',' | sed 's/,$//' || true)
+
+cat > "$dest/.run-info.json" <<EOF
+{
+  "run_id": "$run_id",
+  "run_url": "$run_url",
+  "commit": "$commit",
+  "downloaded_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
+  "agents": "$(echo "$agents_found" | sed 's/"/\\"/g')"
+}
+EOF
+
+log ""
+log "Downloaded artifacts for: $agents_found"
+log "Run info: $dest/.run-info.json"
+log ""
+
+# Last line of stdout: absolute path for callers to capture
+abs_dest="$(cd "$dest" && pwd)"
+echo "$abs_dest"

From 7121885627ef5993563d46bde83e80325cc6f756 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 12:01:57 -0800
Subject: [PATCH 02/11] Add Slack notifications to E2E triage workflow

Post "Claude is triaging..." when triage starts and a structured
summary with PR/issue links when it completes. The skill now writes
triage-summary.json, which the workflow parses with jq to build the Slack
message. The workflow falls back to a warning if no summary is produced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 8e5dcc6ef8ab
---
 .claude/skills/e2e-triage/SKILL.md | 21 +++++++++
 .github/workflows/e2e-triage.yml   | 70 ++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)
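The "Build Slack summary" step passes a multiline message through the step output file using the `name<<DELIMITER` syntax. The following is a rough sketch of that mechanism in isolation. `write_multiline_output` is a hypothetical helper, and a temp file stands in for `$GITHUB_OUTPUT` so the snippet can run outside a workflow.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the patch): append a multiline value to a
# GitHub Actions output file using the `name<<DELIMITER` syntax.
write_multiline_output() {
  local name="$1" value="$2" file="$3" delim
  # Random delimiter so the value is very unlikely to contain it.
  delim="EOF_$(head -c 15 /dev/urandom | base64 | tr -d '/+=')"
  {
    echo "${name}<<${delim}"
    printf '%s\n' "$value"
    echo "$delim"
  } >> "$file"
}

# In a workflow step the target would be "$GITHUB_OUTPUT"; a temp file stands in here.
out="$(mktemp)"
write_multiline_output message $'line one\nline two' "$out"
cat "$out"
```

The file ends up with four lines: the `message<<EOF_...` opener, the two value lines, and the closing delimiter. The delimiter protects only the output file format; it does not escape the value for later use inside JSON.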

diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
index 34d54d895..7d0f2b766 100644
--- a/.claude/skills/e2e-triage/SKILL.md
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -138,3 +138,24 @@ Print a summary table:
 | TestBar | all agents | real-bug | Issue #456 (existing, commented) | url |
 | TestBaz | opencode | real-bug | Issue #457 (new) | url |
 ```
+
+**Write `triage-summary.json`** in the artifact directory for Slack notifications:
+
+```bash
+cat > "${ARTIFACT_DIR}/triage-summary.json" << 'TEMPLATE'
+{
+  "actions": [
+    {
+      "test": "TestName",
+      "agents": ["claude-code", "opencode"],
+      "classification": "flaky|real-bug",
+      "action_type": "pr|issue|comment",
+      "action_description": "PR #123|Issue #456 (new)|Issue #456 (commented)",
+      "link": "https://github.com/entireio/cli/pull/123"
+    }
+  ]
+}
+TEMPLATE
+```
+
+Each entry in `actions` corresponds to one row in the summary table. Include all failures that had an action taken. The `link` field must be the full URL to the PR or issue.
diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index db2a73126..ccec896b1 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -43,6 +43,28 @@ jobs:
           chmod +x scripts/download-e2e-artifacts.sh
           scripts/download-e2e-artifacts.sh "$WORKFLOW_RUN_ID"
 
+      - name: Notify Slack - triage started
+        if: ${{ secrets.E2E_SLACK_WEBHOOK_URL != '' }}
+        uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
+        env:
+          RUN_URL: ${{ github.event.workflow_run.html_url }}
+          RUN_ID: ${{ github.event.workflow_run.id }}
+        with:
+          webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
+          webhook-type: incoming-webhook
+          payload: |
+            {
+              "blocks": [
+                {
+                  "type": "section",
+                  "text": {
+                    "type": "mrkdwn",
+                    "text": ":mag: *Claude is triaging E2E failures* from <${{ env.RUN_URL }}|run #${{ env.RUN_ID }}>..."
+                  }
+                }
+              ]
+            }
+
       - name: Run triage
         env:
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
@@ -62,3 +84,51 @@ jobs:
              For flaky fixes: create a PR.
              For real bugs: create or comment on GitHub issues.
              Print a summary table at the end."
+
+      - name: Build Slack summary
+        if: always()
+        id: slack-summary
+        env:
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
+          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url }}
+          TRIAGE_LOG_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+        run: |
+          SUMMARY_FILE="e2e/artifacts/ci-${WORKFLOW_RUN_ID}/triage-summary.json"
+          if [ -f "$SUMMARY_FILE" ]; then
+            # Build a formatted summary from triage-summary.json
+            LINES=$(jq -r '.actions[] | "• *\(.test)* (\(.agents | join(", "))): \(.classification) → <\(.link)|\(.action_description)>"' "$SUMMARY_FILE" 2>/dev/null)
+            if [ -n "$LINES" ]; then
+              ACTION_COUNT=$(jq '.actions | length' "$SUMMARY_FILE")
+              FLAKY_COUNT=$(jq '[.actions[] | select(.classification == "flaky")] | length' "$SUMMARY_FILE")
+              BUG_COUNT=$(jq '[.actions[] | select(.classification == "real-bug")] | length' "$SUMMARY_FILE")
+              SUMMARY=":white_check_mark: *E2E triage complete* for <${WORKFLOW_RUN_URL}|run #${WORKFLOW_RUN_ID}>\n${ACTION_COUNT} failures triaged: ${FLAKY_COUNT} flaky, ${BUG_COUNT} real bugs\n\n${LINES}"
+            else
+              SUMMARY=":white_check_mark: *E2E triage complete* for <${WORKFLOW_RUN_URL}|run #${WORKFLOW_RUN_ID}> — no actions in summary"
+            fi
+          else
+            SUMMARY=":warning: *E2E triage finished* for <${WORKFLOW_RUN_URL}|run #${WORKFLOW_RUN_ID}> but no summary was produced. Check the <${TRIAGE_LOG_URL}|triage workflow logs>."
+          fi
+          # Escape for GitHub Actions multiline output
+          EOF=$(dd if=/dev/urandom bs=15 count=1 status=none | base64)
+          echo "message<<$EOF" >> "$GITHUB_OUTPUT"
+          echo -e "$SUMMARY" >> "$GITHUB_OUTPUT"
+          echo "$EOF" >> "$GITHUB_OUTPUT"
+
+      - name: Notify Slack - triage complete
+        if: always()
+        uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
+        with:
+          webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
+          webhook-type: incoming-webhook
+          payload: |
+            {
+              "blocks": [
+                {
+                  "type": "section",
+                  "text": {
+                    "type": "mrkdwn",
+                    "text": "${{ steps.slack-summary.outputs.message }}"
+                  }
+                }
+              ]
+            }

From 3edf654fd228e8bb6c06b73f7f92847e21311da6 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 12:09:18 -0800
Subject: [PATCH 03/11] Fix JSON injection bug and add webhook guards in E2E
 triage Slack notifications

- Build Slack payload via jq (payload-file-path) instead of interpolating
  raw text into inline JSON, which broke on quotes/newlines in summaries
- Add secrets.E2E_SLACK_WEBHOOK_URL guard to "Build Slack summary" and
  "Notify Slack - triage complete" steps (matching the "started" step)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7c1914052967
---
 .github/workflows/e2e-triage.yml | 27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)
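A small sketch of why building the payload with jq avoids the injection problem. The sample summary text is made up for illustration and contains double quotes and a newline, both of which would break a hand-written JSON string. The snippet assumes jq is installed and exits early if it is not.

```bash
#!/usr/bin/env bash
set -euo pipefail

command -v jq >/dev/null 2>&1 || { echo "jq is required for this sketch" >&2; exit 0; }

# Sample summary (illustrative) with characters that break naive interpolation.
summary=$'TestFoo: expected "a" got "b"\nsecond line'

# Let jq do the escaping: --arg always passes the value as a JSON string.
payload=$(jq -n --arg text "$summary" \
  '{blocks: [{type: "section", text: {type: "mrkdwn", text: $text}}]}')

echo "$payload"

# Round-trip check: decoding the payload returns the original text unchanged.
decoded=$(jq -r '.blocks[0].text.text' <<<"$payload")
[ "$decoded" = "$summary" ] && echo "round-trip ok"
```

Because the text travels as a jq variable rather than as part of the filter or of a JSON template, no character in the summary can change the structure of the payload.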

diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index ccec896b1..3fcf7778d 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -86,8 +86,7 @@ jobs:
              Print a summary table at the end."
 
       - name: Build Slack summary
-        if: always()
-        id: slack-summary
+        if: ${{ always() && secrets.E2E_SLACK_WEBHOOK_URL != '' }}
         env:
           WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
           WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url }}
@@ -108,27 +107,15 @@ jobs:
           else
             SUMMARY=":warning: *E2E triage finished* for <${WORKFLOW_RUN_URL}|run #${WORKFLOW_RUN_ID}> but no summary was produced. Check the <${TRIAGE_LOG_URL}|triage workflow logs>."
           fi
-          # Escape for GitHub Actions multiline output
-          EOF=$(dd if=/dev/urandom bs=15 count=1 status=none | base64)
-          echo "message<<$EOF" >> "$GITHUB_OUTPUT"
-          echo -e "$SUMMARY" >> "$GITHUB_OUTPUT"
-          echo "$EOF" >> "$GITHUB_OUTPUT"
+          # Use jq to build the payload with proper JSON escaping
+          jq -n --arg text "$(echo -e "$SUMMARY")" \
+            '{blocks: [{type: "section", text: {type: "mrkdwn", text: $text}}]}' \
+            > slack-payload.json
 
       - name: Notify Slack - triage complete
-        if: always()
+        if: ${{ always() && secrets.E2E_SLACK_WEBHOOK_URL != '' }}
         uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
         with:
           webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
-          payload: |
-            {
-              "blocks": [
-                {
-                  "type": "section",
-                  "text": {
-                    "type": "mrkdwn",
-                    "text": "${{ steps.slack-summary.outputs.message }}"
-                  }
-                }
-              ]
-            }
+          payload-file-path: slack-payload.json

From 1e50aba53dfd181e0e89e581b4e99e03febfde41 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 12:47:15 -0800
Subject: [PATCH 04/11] Add local mode and CI re-run verification to E2E triage
 skill

Rewrite SKILL.md with dual-mode support (auto-detected via WORKFLOW_RUN_ID
env var): local mode runs tests with mise and runs a failing test up to 3
times in total, CI mode triggers e2e-isolated.yml workflows for re-run
verification. Classification now uses re-run results as the primary signal
(all fail = real-bug, mixed results = flaky).

Workflow changes: actions permission upgraded to write for gh workflow run,
timeout increased to 60m for re-run polling, Claude prompt updated with
CI mode hint and re-run instructions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 9f75c3effd9b
---
 .claude/skills/e2e-triage/SKILL.md | 167 ++++++++++++++++++++++-------
 .github/workflows/e2e-triage.yml   |   6 +-
 2 files changed, 135 insertions(+), 38 deletions(-)
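The re-run table in the rewritten skill can be read as a small decision function. The sketch below is one possible encoding, not something the skill itself runs. `classify_reruns` and the `needs-analysis` label are hypothetical names. The original run is assumed to have failed, and the third argument states whether the failing re-runs show the same error as the original.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: classify a failure from its two re-run results.
#   $1, $2  re-run results, "PASS" or "FAIL"
#   $3      "yes" if failing re-runs show the original error (default), else "no"
classify_reruns() {
  local rerun1="$1" rerun2="$2" same_error="${3:-yes}"
  if [ "$rerun1" = "PASS" ] || [ "$rerun2" = "PASS" ]; then
    echo "flaky"            # any pass after a failure means non-deterministic
  elif [ "$same_error" = "yes" ]; then
    echo "real-bug"         # reproduced every time with the same error
  else
    echo "needs-analysis"   # always fails, but differently: examine artifacts
  fi
}

classify_reruns FAIL FAIL yes   # prints: real-bug
classify_reruns PASS FAIL       # prints: flaky
classify_reruns FAIL FAIL no    # prints: needs-analysis
```

Artifact signals such as ERROR lines in `entire.log` still supplement this result; the function covers only the re-run column of the decision.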

diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
index 7d0f2b766..5fd059f88 100644
--- a/.claude/skills/e2e-triage/SKILL.md
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -1,38 +1,118 @@
 ---
 name: e2e-triage
-description: Triage E2E test failures — download CI artifacts, classify flaky vs real bug, create PRs for flaky fixes and GitHub issues for real bugs
+description: Triage E2E test failures — run locally with mise or via CI re-runs, classify flaky vs real bug, create PRs for flaky fixes and GitHub issues for real bugs
 ---
 
 # E2E Triage
 
-Automate triage of E2E test failures. Analyze artifacts, classify each failure as **flaky** (agent non-determinism) or **real-bug** (CLI defect), then take action: batched PR for flaky fixes, GitHub issues for real bugs.
+Triage E2E test failures with **re-run verification**. Operates in two modes (auto-detected), analyzes artifacts, re-runs failing tests to distinguish flaky from real bugs, then takes action: batched PR for flaky fixes, GitHub issues for real bugs.
 
-## Inputs
+## Mode Detection
 
-The user provides one of:
-- **`latest`** — download artifacts from the most recent failed E2E run on main
-- **A run ID or URL** — download artifacts from that specific run
-- **A local path** — use existing artifact directory (skip download)
+- **CI mode**: `WORKFLOW_RUN_ID` env var is set (injected by `e2e-triage.yml`)
+- **Local mode**: No `WORKFLOW_RUN_ID` — user invokes `/e2e-triage` manually
 
-## Step 1: Download Artifacts
+---
+
+## Local Mode
+
+### Step L1: Parse User Input
+
+The user provides one or more of:
+- **Test name(s)** — e.g., `TestInteractiveMultiStep`
+- **`--agent <agent>`** — optional, defaults to all agents that previously failed
+- **A local artifact path** — skip straight to analysis of existing artifacts
+
+**Cost warning:** Real E2E tests consume API tokens. Before running, confirm with the user unless they provided specific test names (implying intent to run).
+
+### Step L2: First Run
+
+```bash
+mise run test:e2e --agent <agent> <TestName>
+```
+
+Capture the artifact directory from the `artifacts: <path>` output line.
+
+### Step L3: Re-run on Failure
+
+If the test **passes** on first run: report as passing, done for this test.
+
+If the test **fails**: run a **second time** with the same parameters.
+
+### Step L4: Tiebreaker (if needed)
+
+If results are **split** (1 pass, 1 fail): run a **third time** as tiebreaker.
+
+### Step L5: Collect Results
+
+For each test+agent pair, record: `(test, agent, run_1_result, run_2_result, [run_3_result])`
+
+Proceed to **Shared Analysis** (Step 1 below).
+
+---
+
+## CI Mode
+
+### Step C1: Download Artifacts
 
 **If given a run ID, URL, or "latest":**
 ```bash
 artifact_dir=$(scripts/download-e2e-artifacts.sh <input>)
 ```
 
-**If given a local path:** Use directly, skip download.
+**If `WORKFLOW_RUN_ID` is set (automated trigger):**
+```bash
+artifact_dir="e2e/artifacts/ci-${WORKFLOW_RUN_ID}"
+```
+
+The download step in the workflow has already placed artifacts there.
 
-## Step 2: Identify Failures
+### Step C2: Identify Failures
 
 For each agent subdirectory in the artifact root:
 1. Read `report.nocolor.txt` — list failed tests with error messages, file:line references
 2. Skip agents with zero failures
 3. Build failure list: `[(test_name, agent, error_line, duration, file:line)]`
 
-## Step 3: Analyze Each Failure
+### Step C3: Re-run Failing Tests via CI
 
-For each failure, follow the debug-e2e methodology:
+For each failing test+agent pair, **sequentially**:
+
+1. **Trigger re-run:**
+   ```bash
+   gh workflow run e2e-isolated.yml -f agent=<agent> -f test=<TestName>
+   ```
+
+2. **Wait for run to register** (~5s), then find the run ID:
+   ```bash
+   gh run list -w e2e-isolated.yml -L 1 --json databaseId -q '.[0].databaseId'
+   ```
+
+3. **Poll until complete** (check every 30s, timeout after 25 minutes):
+   ```bash
+   gh run view <run-id> --json status,conclusion
+   ```
+
+4. **Download re-run artifacts:**
+   ```bash
+   gh run download <run-id> --dir <rerun-artifact-dir>
+   ```
+
+5. **Repeat for a second re-run** (same test+agent).
+
+### Step C4: Collect Results
+
+For each test+agent pair, record: `(test, agent, original_result, rerun_1_result, rerun_2_result)`
+
+Proceed to **Shared Analysis** (Step 1 below).
+
+---
+
+## Shared Analysis & Classification
+
+### Step 1: Analyze Each Failure
+
+For each failure, examine available artifacts:
 
 1. **Read `console.log`** — what did the agent actually do? Full chronological transcript.
 2. **Read test source at file:line** — what was expected?
@@ -40,9 +120,21 @@ For each failure, follow the debug-e2e methodology:
 4. **Read `git-log.txt` / `git-tree.txt`** — repo state at failure time
 5. **Read `checkpoint-metadata/`** — corrupt or missing metadata?
 
-## Step 4: Classify Each Failure
+### Step 2: Classify Each Failure
+
+Use **re-run results as the primary signal**, supplemented by artifact analysis.
 
-### Strong `real-bug` signals (any one is sufficient):
+#### Re-run signals (strongest):
+
+| Original | Re-run 1 | Re-run 2 | Classification |
+|----------|----------|----------|----------------|
+| FAIL | FAIL (same error) | FAIL (same error) | **real-bug** |
+| FAIL | PASS | PASS | **flaky** |
+| FAIL | PASS | FAIL | **flaky** (non-deterministic) |
+| FAIL | FAIL | PASS | **flaky** (non-deterministic) |
+| FAIL | FAIL (different error) | FAIL (different error) | **needs deeper analysis** — examine artifacts |
+
+#### Strong `real-bug` signals (supplement re-runs):
 
 - `entire.log` contains `"level":"ERROR"` or panic/stack traces
 - Checkpoint metadata structurally corrupt (malformed JSON, missing `checkpoint_id`/`strategy`)
@@ -52,7 +144,7 @@ For each failure, follow the debug-e2e methodology:
 - Same test fails across 3+ agents with same non-timeout symptom
 - Error references CLI code (panic in `cmd/entire/cli/`)
 
-### Strong `flaky` signals (unless overridden by real-bug):
+#### Strong `flaky` signals (unless overridden by real-bug):
 
 - `signal: killed` (timeout)
 - `context deadline exceeded` or `WaitForCheckpoint.*exceeded deadline`
@@ -60,26 +152,26 @@ For each failure, follow the debug-e2e methodology:
 - Agent created file at wrong path / wrong name
 - Agent produced no output
 - Agent committed when it shouldn't have (or vice versa)
-- Test fails for only one agent, passes for others
 - Duration near timeout limit
 
-### Ambiguous cases:
+#### Ambiguous cases:
 
 Read `entire.log` carefully:
 - If hooks fired correctly and metadata is valid -> lean **flaky**
 - If hooks fired but produced wrong results -> lean **real-bug**
 
-## Step 5: Cross-Agent Correlation
+### Step 3: Cross-Agent Correlation
 
-Before acting, check correlations:
-- Same test fails for 3+ agents with similar errors -> override to `real-bug`
-- Same test fails for only 1 agent -> lean `flaky`
+Before acting, check correlations using re-run data:
+- Same test fails for 3+ agents, all re-runs also fail -> strong **real-bug**
+- Same test fails for multiple agents, but re-runs pass -> **flaky** (shared prompt issue)
+- One agent fails consistently, others pass -> agent-specific issue (still **real-bug** if re-runs confirm)
 
-## Step 6: Take Action
+### Step 4: Take Action
 
-### For `flaky` failures: Batched PR
+#### For `flaky` failures: Batched PR
 
-1. Create branch `fix/e2e-flaky-<run-id>`
+1. Create branch `fix/e2e-flaky-<run-id-or-date>`
 2. Apply fixes to ALL flaky test files (one branch, one PR):
    - Agent asked for confirmation -> append "Do not ask for confirmation" to prompt
    - Agent wrote to wrong path -> be more explicit about paths in prompt
@@ -95,11 +187,11 @@ Before acting, check correlations:
 5. Commit and create PR:
    ```bash
    gh pr create \
-     --title "fix(e2e): make flaky tests more resilient (run <run-id>)" \
-     --body "<structured body with per-test changes, evidence, run link>"
+     --title "fix(e2e): make flaky tests more resilient" \
+     --body "<structured body with per-test changes, re-run evidence, run link>"
    ```
 
-### For `real-bug` failures: Issue (with dedup)
+#### For `real-bug` failures: Issue (with dedup)
 
 1. **Search existing issues first:**
    ```bash
@@ -108,9 +200,9 @@ Before acting, check correlations:
 
 2. **If matching issue exists:** Add a comment with new evidence:
    ```bash
-   gh issue comment <number> --body "<verification details, run link, new evidence>"
+   gh issue comment <number> --body "<re-run results, verification details, run link, new evidence>"
    ```
-   Note "Verified still failing in CI run <URL>" plus any new diagnostic details.
+   Note "Verified still failing — reproduced in N/N re-runs" plus any new diagnostic details.
 
 3. **If no matching issue:** Create new:
    ```bash
@@ -121,24 +213,26 @@ Before acting, check correlations:
    ```
 
    Issue body includes:
-   - Test name, agent(s), CI run link, frequency
+   - Test name, agent(s), CI run link (if available), re-run results
    - Failure summary (expected vs actual)
    - Root cause analysis (which CLI component: hooks, session, checkpoint, attribution, strategy)
    - Key evidence: `entire.log` excerpts, `console.log` excerpts, git state
    - Reproduction steps
    - Suspected fix location (file, function, reason)
 
-## Step 7: Summary Report
+### Step 5: Summary Report
 
 Print a summary table:
 ```
-| Test | Agent(s) | Classification | Action | Link |
-|------|----------|---------------|--------|------|
-| TestFoo | claude-code | flaky | PR #123 | url |
-| TestBar | all agents | real-bug | Issue #456 (existing, commented) | url |
-| TestBaz | opencode | real-bug | Issue #457 (new) | url |
+| Test | Agent(s) | Re-runs | Classification | Action | Link |
+|------|----------|---------|---------------|--------|------|
+| TestFoo | claude-code | FAIL/PASS/PASS | flaky | PR #123 | url |
+| TestBar | all agents | FAIL/FAIL/FAIL | real-bug | Issue #456 (existing, commented) | url |
+| TestBaz | opencode | FAIL/PASS/FAIL | flaky | PR #123 | url |
 ```
 
+The "Re-runs" column shows original/rerun1/rerun2 results.
+
 **Write `triage-summary.json`** in the artifact directory for Slack notifications:
 
 ```bash
@@ -149,6 +243,7 @@ cat > "${ARTIFACT_DIR}/triage-summary.json" << 'TEMPLATE'
       "test": "TestName",
       "agents": ["claude-code", "opencode"],
       "classification": "flaky|real-bug",
+      "rerun_results": ["FAIL", "PASS", "PASS"],
       "action_type": "pr|issue|comment",
       "action_description": "PR #123|Issue #456 (new)|Issue #456 (commented)",
       "link": "https://github.com/entireio/cli/pull/123"
diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index 3fcf7778d..8574d545e 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -9,7 +9,7 @@ permissions:
   contents: write
   pull-requests: write
   issues: write
-  actions: read
+  actions: write
 
 concurrency:
   group: e2e-triage
@@ -19,7 +19,7 @@ jobs:
   triage:
     if: ${{ github.event.workflow_run.conclusion == 'failure' }}
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 60
     steps:
       - name: Checkout repository
         uses: actions/checkout@v6
@@ -78,8 +78,10 @@ jobs:
             --allowedTools "Bash(mise run *),Bash(gh *),Bash(git *),Read,Write,Edit,Glob,Grep" \
             "You are triaging E2E test failures from CI run ${WORKFLOW_RUN_ID}.
              Follow the /e2e-triage skill instructions in .claude/skills/e2e-triage/SKILL.md.
+             You are in CI mode (WORKFLOW_RUN_ID is set). Use the CI mode steps.
              Artifact directory: ${ARTIFACT_DIR}
              CI run URL: ${WORKFLOW_RUN_URL}
+             Re-run failing tests using the e2e-isolated.yml workflow via gh workflow run.
              Do NOT ask questions — act autonomously.
              For flaky fixes: create a PR.
              For real bugs: create or comment on GitHub issues.

From 1ca90ee4f3004db7a9ede5c545bdf9db8874012b Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 14:06:06 -0800
Subject: [PATCH 05/11] Split E2E triage Take Action and Summary steps by local
 vs CI mode

Local mode now presents findings interactively and applies fixes
directly in the working tree instead of creating branches/PRs/issues:
- Step 4a: findings report, proposed fixes, user approval gate, in-place fixes
- Step 4b: unchanged CI behavior (batched PR for flaky, issues for real bugs)
- Step 5: local mode gets simpler summary table, no triage-summary.json

Entire-Checkpoint: 4e1d9cf59d52
---
 .claude/skills/e2e-triage/SKILL.md | 95 +++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 3 deletions(-)
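
Note (not part of the commit): the mode detection that this split relies on can be sketched in shell. This is a minimal illustration, assuming only what SKILL.md states: CI mode means `WORKFLOW_RUN_ID` is set, local mode means it is not. The function name `detect_mode` is hypothetical and does not exist in the repository.

```bash
# Hypothetical helper: print "ci" when WORKFLOW_RUN_ID is set and non-empty,
# otherwise "local". Mirrors the "Mode Detection" rule in SKILL.md.
detect_mode() {
  if [ -n "${WORKFLOW_RUN_ID:-}" ]; then
    echo "ci"
  else
    echo "local"
  fi
}
```

A caller could branch on the result, for example running Step 4a when `detect_mode` prints `local` and Step 4b when it prints `ci`.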

diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
index 5fd059f88..b8e06ded4 100644
--- a/.claude/skills/e2e-triage/SKILL.md
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -1,14 +1,16 @@
 ---
 name: e2e-triage
-description: Triage E2E test failures — run locally with mise or via CI re-runs, classify flaky vs real bug, create PRs for flaky fixes and GitHub issues for real bugs
+description: Triage E2E test failures — run locally with mise or via CI re-runs, classify flaky vs real bug. Local mode presents findings and applies fixes in-place; CI mode creates PRs for flaky fixes and GitHub issues for real bugs.
 ---
 
 # E2E Triage
 
-Triage E2E test failures with **re-run verification**. Operates in two modes (auto-detected), analyzes artifacts, re-runs failing tests to distinguish flaky from real bugs, then takes action: batched PR for flaky fixes, GitHub issues for real bugs.
+Triage E2E test failures with **re-run verification**. Operates in two modes (auto-detected), analyzes artifacts, and re-runs failing tests to distinguish flaky from real bugs. **Local mode** presents findings interactively and applies fixes directly in the working tree. **CI mode** creates batched PRs for flaky fixes and GitHub issues for real bugs.
 
 ## Mode Detection
 
+The two modes share the same analysis and classification logic but differ in how results are presented and acted upon. Local mode is interactive (user reviews findings, chooses what to fix); CI mode is automated (PRs and issues created directly).
+
 - **CI mode**: `WORKFLOW_RUN_ID` env var is set (injected by `e2e-triage.yml`)
 - **Local mode**: No `WORKFLOW_RUN_ID` — user invokes `/e2e-triage` manually
 
@@ -167,7 +169,79 @@ Before acting, check correlations using re-run data:
 - Same test fails for multiple agents, but re-runs pass -> **flaky** (shared prompt issue)
 - One agent fails consistently, others pass -> agent-specific issue (still **real-bug** if re-runs confirm)
 
-### Step 4: Take Action
+### Step 4a: Take Action — Local Mode
+
+In local mode, present findings interactively. **Do not** create branches, PRs, or GitHub issues.
+
+#### Present Findings Report
+
+For each test+agent pair, print a findings block:
+
+```
+## <TestName> (<agent>) — <classification>
+
+**Re-run results:** original=FAIL, rerun1=PASS, rerun2=PASS
+**Evidence:**
+- <1-2 sentence summary of what went wrong>
+- <key artifact evidence: entire.log excerpt, console.log excerpt, etc.>
+```
+
+#### For `flaky` failures: describe the proposed fix
+
+```
+**Proposed fix:** <description>
+  - File: <path to test file>
+  - Change: <what will be modified — e.g., append "Do not ask for confirmation" to prompt>
+```
+
+Common flaky fixes (same as CI mode):
+- Agent asked for confirmation -> append "Do not ask for confirmation" to prompt
+- Agent wrote to wrong path -> be more explicit about paths in prompt
+- Agent committed when shouldn't -> add "Do not commit" to prompt
+- Checkpoint wait timeout -> increase timeout argument
+- Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
+
+#### For `real-bug` failures: describe root cause analysis
+
+```
+**Root cause analysis:**
+  - Component: <hooks | session | checkpoint | strategy | agent>
+  - Suspected location: <file:function>
+  - Description: <what's wrong and why>
+  - Proposed fix: <what code change would address it>
+```
+
+#### Ask the user
+
+Prompt the user:
+
+> **Should I fix these?**
+> - [list of tests with classifications]
+> - You can select all, specific tests, or skip.
+
+Wait for user response before proceeding.
+
+#### Apply fixes (if user approves)
+
+For **flaky** fixes the user approved:
+1. Apply fixes directly in the working tree (no branch creation)
+2. Run verification:
+   ```bash
+   mise run test:e2e:canary   # Must pass
+   mise run fmt && mise run lint
+   ```
+3. If canary fails, investigate and adjust. Report what happened to the user.
+
+For **real-bug** fixes the user approved:
+1. Apply the fix directly in the working tree (no branch creation)
+2. Run relevant tests to verify:
+   ```bash
+   mise run test        # Unit tests
+   mise run test:e2e:canary  # Canary tests
+   ```
+3. Report results to the user.
+
+### Step 4b: Take Action — CI Mode
 
 #### For `flaky` failures: Batched PR
 
@@ -222,6 +296,21 @@ Before acting, check correlations using re-run data:
 
 ### Step 5: Summary Report
 
+#### Local mode
+
+Print a summary table:
+```
+| Test | Agent(s) | Re-runs | Classification | Action Taken |
+|------|----------|---------|---------------|-------------|
+| TestFoo | claude-code | FAIL/PASS/PASS | flaky | Fixed in working tree |
+| TestBar | all agents | FAIL/FAIL/FAIL | real-bug | Fix applied, tests passing |
+| TestBaz | opencode | FAIL/PASS/FAIL | flaky | Skipped (user declined) |
+```
+
+No `triage-summary.json` is written in local mode.
+
+#### CI mode
+
 Print a summary table:
 ```
 | Test | Agent(s) | Re-runs | Classification | Action | Link |

From 76ebb5d2ca9b7d799b3df4e95233e2de04b469c0 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 14:30:17 -0800
Subject: [PATCH 06/11] Clarify real-bug vs flaky/test-bug classification in
 E2E triage skill

Consistent test failures can be test infrastructure bugs (e2e/ code),
not product bugs (cmd/entire/cli/). Update classification signals,
fix lists, and action sections to distinguish the two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 28c90fcc7266
---
 .claude/skills/e2e-triage/SKILL.md | 33 +++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)
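
Note (not part of the commit): the classification rule this patch clarifies can be sketched as a small shell function. It is an illustration only, under these assumptions: results are the literal strings `FAIL` and `PASS`, the root cause location is passed as a path, and the "same error vs different error" comparison from the table is left out, so three failures with no known root cause fall through to "needs deeper analysis". The name `classify` and the example file paths are hypothetical.

```bash
# Hypothetical helper: classify ORIGINAL RERUN1 RERUN2 [ROOT_CAUSE_PATH]
# Any passing re-run means flaky. Three failures are decided by where the
# root cause lives: product code (cmd/entire/cli/) or test code (e2e/).
classify() {
  local original="$1" rerun1="$2" rerun2="$3" root_cause="${4:-}"

  # Only failing tests are triaged.
  if [ "$original" != "FAIL" ]; then
    echo "not-a-failure"
    return 0
  fi

  # At least one re-run passed: non-deterministic, so flaky.
  if [ "$rerun1" = "PASS" ] || [ "$rerun2" = "PASS" ]; then
    echo "flaky"
    return 0
  fi

  # Consistent failure: the root cause location decides.
  case "$root_cause" in
    cmd/entire/cli/*) echo "real-bug" ;;
    e2e/*)            echo "flaky (test-bug)" ;;
    *)                echo "needs deeper analysis" ;;
  esac
}
```

For instance, `classify FAIL FAIL FAIL e2e/some_helper.go` (an invented path) prints `flaky (test-bug)`, while the same three failures traced to a file under `cmd/entire/cli/` print `real-bug`.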

diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
index b8e06ded4..8b69d67d6 100644
--- a/.claude/skills/e2e-triage/SKILL.md
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -130,15 +130,19 @@ Use **re-run results as the primary signal**, supplemented by artifact analysis.
 
 | Original | Re-run 1 | Re-run 2 | Classification |
 |----------|----------|----------|----------------|
-| FAIL | FAIL (same error) | FAIL (same error) | **real-bug** |
+| FAIL | FAIL (same error) | FAIL (same error) | **real-bug** OR **flaky (test-bug)** — see below |
 | FAIL | PASS | PASS | **flaky** |
 | FAIL | PASS | FAIL | **flaky** (non-deterministic) |
 | FAIL | FAIL | PASS | **flaky** (non-deterministic) |
 | FAIL | FAIL (different error) | FAIL (different error) | **needs deeper analysis** — examine artifacts |
 
-#### Strong `real-bug` signals (supplement re-runs):
+**Important: Consistent failures can still be `flaky` (test-bug).** When all re-runs fail, check *where* the root cause is:
+- Root cause in `cmd/entire/cli/` → **real-bug** (product code is broken)
+- Root cause in `e2e/` (test infra, test helpers, tmux setup, env propagation) → **flaky (test-bug)** — the CLI works fine, the test is broken
 
-- `entire.log` contains `"level":"ERROR"` or panic/stack traces
+#### Strong `real-bug` signals (root cause must be in `cmd/entire/cli/`, not `e2e/`):
+
+- `entire.log` contains `"level":"ERROR"` or panic/stack traces from CLI code
 - Checkpoint metadata structurally corrupt (malformed JSON, missing `checkpoint_id`/`strategy`)
 - Session state file missing or malformed when expected
 - Hooks did not fire at all (no `hook invoked` log entries)
@@ -146,8 +150,11 @@ Use **re-run results as the primary signal**, supplemented by artifact analysis.
 - Same test fails across 3+ agents with same non-timeout symptom
 - Error references CLI code (panic in `cmd/entire/cli/`)
 
+**Key question:** Is the bug in `cmd/entire/cli/` (product code) or in `e2e/` (test code)? Only the former is a `real-bug`.
+
 #### Strong `flaky` signals (unless overridden by real-bug):
 
+**Agent behavior (non-deterministic):**
 - `signal: killed` (timeout)
 - `context deadline exceeded` or `WaitForCheckpoint.*exceeded deadline`
 - Agent asked for confirmation instead of acting
@@ -156,6 +163,14 @@ Use **re-run results as the primary signal**, supplemented by artifact analysis.
 - Agent committed when it shouldn't have (or vice versa)
 - Duration near timeout limit
 
+**Test-bug (consistent failure, but root cause is in `e2e/` not `cmd/entire/cli/`):**
+- Agent "Not logged in" / auth errors → test env setup doesn't propagate auth credentials
+- Env vars not propagated to agent session → tmux/test harness bug
+- Error references test code (`e2e/`) not CLI code (`cmd/entire/cli/`)
+- Test helper logic errors (wrong assertions, bad globs, incorrect expected values)
+- Consistent failure BUT root cause traced to `e2e/` code, not `cmd/entire/cli/`
+- Test setup/teardown issues (missing git config, temp dir cleanup, port conflicts)
+
 #### Ambiguous cases:
 
 Read `entire.log` carefully:
@@ -188,10 +203,12 @@ For each test+agent pair, print a findings block:
 
 #### For `flaky` failures: describe the proposed fix
 
+For agent-behavior flaky issues, fixes typically modify test prompts. For test-bug flaky issues, fixes target `e2e/` infrastructure code (harness setup, helpers, env propagation).
+
 ```
 **Proposed fix:** <description>
-  - File: <path to test file>
-  - Change: <what will be modified — e.g., append "Do not ask for confirmation" to prompt>
+  - File: <path to test file or e2e infrastructure file>
+  - Change: <what will be modified — e.g., append "Do not ask for confirmation" to prompt, or fix env propagation in NewTmuxSession>
 ```
 
 Common flaky fixes (same as CI mode):
@@ -200,6 +217,9 @@ Common flaky fixes (same as CI mode):
 - Agent committed when shouldn't -> add "Do not commit" to prompt
 - Checkpoint wait timeout -> increase timeout argument
 - Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
+- Auth/env not propagated -> fix test harness env setup in `e2e/` code
+- Test helper bug (wrong assertion, bad glob) -> fix test helper in `e2e/`
+- tmux session setup issue -> fix `NewTmuxSession` or session config in `e2e/`
 
 #### For `real-bug` failures: describe root cause analysis
 
@@ -252,6 +272,9 @@ For **real-bug** fixes the user approved:
    - Agent committed when shouldn't -> add "Do not commit" to prompt
    - Checkpoint wait timeout -> increase timeout argument
    - Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
+   - Auth/env not propagated -> fix test harness env setup in `e2e/` code
+   - Test helper bug (wrong assertion, bad glob) -> fix test helper in `e2e/`
+   - tmux session setup issue -> fix `NewTmuxSession` or session config in `e2e/`
 3. Run verification:
    ```bash
    mise run test:e2e:canary   # Must pass

From 6bdf24de262d291e74c772d7e5d137de66832752 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 14:39:48 -0800
Subject: [PATCH 07/11] Add README to E2E triage skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: ba6877944a6c
---
 .claude/skills/e2e-triage/README.md | 58 +++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100644 .claude/skills/e2e-triage/README.md

diff --git a/.claude/skills/e2e-triage/README.md b/.claude/skills/e2e-triage/README.md
new file mode 100644
index 000000000..f0b7cd5ea
--- /dev/null
+++ b/.claude/skills/e2e-triage/README.md
@@ -0,0 +1,58 @@
+# E2E Triage Skill
+
+Triage E2E test failures by re-running tests, classifying failures as flaky vs real-bug, and taking appropriate action (fixes in local mode, PRs/issues in CI mode).
+
+## Two Modes
+
+| Mode | Trigger | Action on flaky | Action on real-bug |
+|------|---------|----------------|--------------------|
+| **Local** | User invokes `/e2e-triage` | Presents findings, applies fixes in working tree | Presents root cause analysis, applies fix if approved |
+| **CI** | `WORKFLOW_RUN_ID` env var set (via `e2e-triage.yml`) | Creates batched PR | Creates or comments on GitHub issue |
+
+## Local Usage
+
+```
+# Triage a specific test
+/e2e-triage TestInteractiveMultiStep
+
+# Triage a specific test for one agent
+/e2e-triage TestInteractiveMultiStep --agent claude-code
+
+# Triage multiple tests
+/e2e-triage TestInteractiveMultiStep TestCheckpointRewind
+
+# Analyze existing artifacts (skip re-running)
+/e2e-triage /path/to/artifact/dir
+```
+
+The skill will:
+1. Run the test(s) up to 3 times (first run, re-run on failure, tiebreaker if split)
+2. Analyze artifacts (`console.log`, `entire.log`, `git-log.txt`, checkpoint metadata)
+3. Classify each failure and present findings
+4. Ask before applying any fixes
+
+## CI Mode
+
+Triggered automatically by the `e2e-triage.yml` workflow. Downloads artifacts, re-runs failures via `e2e-isolated.yml`, then:
+- **Flaky fixes** — batched into a single PR (`fix/e2e-flaky-<id>`)
+- **Real bugs** — filed as GitHub issues (with dedup check)
+- **Summary** — writes `triage-summary.json` for Slack notifications
+
+## Classification Logic
+
+Re-run results are the primary signal:
+
+| Pattern | Classification |
+|---------|---------------|
+| FAIL / PASS / PASS | Flaky |
+| FAIL / PASS / FAIL | Flaky (non-deterministic) |
+| FAIL / FAIL / PASS | Flaky (non-deterministic) |
+| FAIL / FAIL / FAIL | Real-bug OR flaky (test-bug) — depends on root cause location |
+
+**Key distinction for consistent failures:** if the root cause is in `cmd/entire/cli/` (product code), it's a **real-bug**. If it's in `e2e/` (test infra), it's **flaky (test-bug)**.
+
+## Key Files
+
+- `SKILL.md` — Full skill definition with all steps, classification rules, and action templates
+- `../../.github/workflows/e2e-triage.yml` — CI workflow that triggers CI mode
+- `../../scripts/download-e2e-artifacts.sh` — Downloads artifacts from CI runs

From fad002b6efa66622e500b77819cee19c7774f932 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 14:50:36 -0800
Subject: [PATCH 08/11] Delegate artifact analysis from e2e-triage to debug-e2e
 skill

Replace duplicated artifact-reading steps in e2e-triage Step 1 with a
reference to debug-e2e's Debugging Workflow (steps 2-5), keeping the
collect list so classification inputs remain clear. Add Related Skills
section to README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: eb14496bde1e
---
 .claude/skills/e2e-triage/README.md |  8 ++++++++
 .claude/skills/e2e-triage/SKILL.md  | 12 +++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/.claude/skills/e2e-triage/README.md b/.claude/skills/e2e-triage/README.md
index f0b7cd5ea..00d101aad 100644
--- a/.claude/skills/e2e-triage/README.md
+++ b/.claude/skills/e2e-triage/README.md
@@ -23,6 +23,9 @@ Triage E2E test failures by re-running tests, classifying failures as flaky vs r
 
 # Analyze existing artifacts (skip re-running)
 /e2e-triage /path/to/artifact/dir
+
+# Download latest CI run artifacts and triage
+/e2e-triage get latest CI run
 ```
 
 The skill will:
@@ -51,6 +54,11 @@ Re-run results are the primary signal:
 
 **Key distinction for consistent failures:** if the root cause is in `cmd/entire/cli/` (product code), it's a **real-bug**. If it's in `e2e/` (test infra), it's **flaky (test-bug)**.
 
+## Related Skills
+
+- `/debug-e2e` — Standalone artifact analysis for diagnosing a specific failure. Use when you already have artifacts and want to understand *what went wrong* without re-running or classifying.
+- `/e2e-triage` uses debug-e2e's workflow internally for its analysis step, then adds classification (flaky vs real-bug) and automated action (fixes/PRs/issues).
+
 ## Key Files
 
 - `SKILL.md` — Full skill definition with all steps, classification rules, and action templates
diff --git a/.claude/skills/e2e-triage/SKILL.md b/.claude/skills/e2e-triage/SKILL.md
index 8b69d67d6..a1616db0c 100644
--- a/.claude/skills/e2e-triage/SKILL.md
+++ b/.claude/skills/e2e-triage/SKILL.md
@@ -114,13 +114,11 @@ Proceed to **Shared Analysis** (Step 1 below).
 
 ### Step 1: Analyze Each Failure
 
-For each failure, examine available artifacts:
-
-1. **Read `console.log`** — what did the agent actually do? Full chronological transcript.
-2. **Read test source at file:line** — what was expected?
-3. **Read `entire-logs/entire.log`** — any CLI errors, panics, unexpected behavior?
-4. **Read `git-log.txt` / `git-tree.txt`** — repo state at failure time
-5. **Read `checkpoint-metadata/`** — corrupt or missing metadata?
+For each failure, follow the **Debugging Workflow** in `.claude/skills/debug-e2e/SKILL.md` (steps 2-5: console.log → test source → entire.log → deep dive). Collect:
+- What the agent actually did (from console.log)
+- What was expected (from test source)
+- CLI-level errors or anomalies (from entire.log)
+- Repo/checkpoint state (from git-log.txt, git-tree.txt, checkpoint-metadata/)
 
 ### Step 2: Classify Each Failure
 

From 39d1d8f8c29f6da9cdce9b057d6c76d487bc0a8e Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 14:55:36 -0800
Subject: [PATCH 09/11] For testing purposes only

---
 .github/workflows/e2e-triage.yml | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)
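
Note (not part of the commit): the run-URL fallback added here (use the `workflow_run` URL when the event provides one, otherwise build a URL from the repository and the dispatched `run_id`) can be sketched in shell. The function name `resolve_run_url` and its arguments are hypothetical; the workflow itself uses the `||` and `format()` expressions shown in the diff.

```bash
# Hypothetical helper mirroring the workflow's fallback expression:
# prefer the event's html_url, else build one from repository and run ID.
resolve_run_url() {
  local event_url="$1" repository="$2" run_id="$3"
  if [ -n "$event_url" ]; then
    printf '%s\n' "$event_url"
  else
    printf 'https://github.com/%s/actions/runs/%s\n' "$repository" "$run_id"
  fi
}
```

With an empty first argument, which corresponds to a manual `workflow_dispatch` run, the function prints the constructed URL; with a non-empty one it prints that URL unchanged.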

diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index 8574d545e..3389544bd 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -4,6 +4,11 @@ on:
   workflow_run:
     workflows: ["E2E Tests"]
     types: [completed]
+  workflow_dispatch:
+    inputs:
+      run_id:
+        description: "E2E workflow run ID to triage"
+        required: true
 
 permissions:
   contents: write
@@ -17,7 +22,7 @@ concurrency:
 
 jobs:
   triage:
-    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
+    if: ${{ github.event.workflow_run.conclusion == 'failure' || github.event_name == 'workflow_dispatch' }}
     runs-on: ubuntu-latest
     timeout-minutes: 60
     steps:
@@ -38,7 +43,7 @@ jobs:
       - name: Download E2E artifacts
         env:
           GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
         run: |
           chmod +x scripts/download-e2e-artifacts.sh
           scripts/download-e2e-artifacts.sh "$WORKFLOW_RUN_ID"
@@ -47,8 +52,8 @@ jobs:
         if: ${{ secrets.E2E_SLACK_WEBHOOK_URL != '' }}
         uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
         env:
-          RUN_URL: ${{ github.event.workflow_run.html_url }}
-          RUN_ID: ${{ github.event.workflow_run.id }}
+          RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
+          RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
         with:
           webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
@@ -69,8 +74,8 @@ jobs:
         env:
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
           GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
-          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url }}
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
+          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
         run: |
           ARTIFACT_DIR="e2e/artifacts/ci-${WORKFLOW_RUN_ID}"
           claude -p \
@@ -90,8 +95,8 @@ jobs:
       - name: Build Slack summary
         if: ${{ always() && secrets.E2E_SLACK_WEBHOOK_URL != '' }}
         env:
-          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
-          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url }}
+          WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
+          WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
           TRIAGE_LOG_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
         run: |
           SUMMARY_FILE="e2e/artifacts/ci-${WORKFLOW_RUN_ID}/triage-summary.json"

From 4f5494703285039e0100df31741aff5d7dc78c50 Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 15:13:37 -0800
Subject: [PATCH 10/11] For testing purposes only

---
 .github/workflows/e2e-triage.yml | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index 3389544bd..a506d27cf 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -25,6 +25,8 @@ jobs:
     if: ${{ github.event.workflow_run.conclusion == 'failure' || github.event_name == 'workflow_dispatch' }}
     runs-on: ubuntu-latest
     timeout-minutes: 60
+    env:
+      SLACK_WEBHOOK_URL: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
     steps:
       - name: Checkout repository
         uses: actions/checkout@v6
@@ -49,13 +51,13 @@ jobs:
           scripts/download-e2e-artifacts.sh "$WORKFLOW_RUN_ID"
 
       - name: Notify Slack - triage started
-        if: ${{ secrets.E2E_SLACK_WEBHOOK_URL != '' }}
+        if: ${{ env.SLACK_WEBHOOK_URL != '' }}
         uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
         env:
           RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
           RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
         with:
-          webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
+          webhook: ${{ env.SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
           payload: |
             {
@@ -93,7 +95,7 @@ jobs:
              Print a summary table at the end."
 
       - name: Build Slack summary
-        if: ${{ always() && secrets.E2E_SLACK_WEBHOOK_URL != '' }}
+        if: ${{ always() && env.SLACK_WEBHOOK_URL != '' }}
         env:
           WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
           WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
@@ -120,9 +122,9 @@ jobs:
             > slack-payload.json
 
       - name: Notify Slack - triage complete
-        if: ${{ always() && secrets.E2E_SLACK_WEBHOOK_URL != '' }}
+        if: ${{ always() && env.SLACK_WEBHOOK_URL != '' }}
         uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
         with:
-          webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }}
+          webhook: ${{ env.SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
           payload-file-path: slack-payload.json

From 270d48dc7ca81c458ab5b6a066c1c5f5c6943dcb Mon Sep 17 00:00:00 2001
From: Alisha Kawaguchi <alisha@entire.io>
Date: Fri, 6 Mar 2026 15:41:42 -0800
Subject: [PATCH 11/11] Fix allowed tools

Entire-Checkpoint: bb778fbab533
---
 .github/workflows/e2e-triage.yml | 81 ++++++++++++++++----------------
 1 file changed, 41 insertions(+), 40 deletions(-)

diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
index a506d27cf..b6a0c75f0 100644
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -50,27 +50,27 @@ jobs:
           chmod +x scripts/download-e2e-artifacts.sh
           scripts/download-e2e-artifacts.sh "$WORKFLOW_RUN_ID"
 
-      - name: Notify Slack - triage started
-        if: ${{ env.SLACK_WEBHOOK_URL != '' }}
-        uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
-        env:
-          RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
-          RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
-        with:
-          webhook: ${{ env.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload: |
-            {
-              "blocks": [
-                {
-                  "type": "section",
-                  "text": {
-                    "type": "mrkdwn",
-                    "text": ":mag: *Claude is triaging E2E failures* from <${{ env.RUN_URL }}|run #${{ env.RUN_ID }}>..."
-                  }
-                }
-              ]
-            }
+      # - name: Notify Slack - triage started
+      #   if: ${{ env.SLACK_WEBHOOK_URL != '' }}
+      #   uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
+      #   env:
+      #     RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
+      #     RUN_ID: ${{ github.event.workflow_run.id || inputs.run_id }}
+      #   with:
+      #     webhook: ${{ env.SLACK_WEBHOOK_URL }}
+      #     webhook-type: incoming-webhook
+      #     payload: |
+      #       {
+      #         "blocks": [
+      #           {
+      #             "type": "section",
+      #             "text": {
+      #               "type": "mrkdwn",
+      #               "text": ":mag: *Claude is triaging E2E failures* from <${{ env.RUN_URL }}|run #${{ env.RUN_ID }}>..."
+      #             }
+      #           }
+      #         ]
+      #       }
 
       - name: Run triage
         env:
@@ -80,19 +80,20 @@ jobs:
           WORKFLOW_RUN_URL: ${{ github.event.workflow_run.html_url || format('https://github.com/{0}/actions/runs/{1}', github.repository, inputs.run_id) }}
         run: |
           ARTIFACT_DIR="e2e/artifacts/ci-${WORKFLOW_RUN_ID}"
-          claude -p \
+          cat <<PROMPT | claude -p \
             --model claude-opus-4-6 \
-            --allowedTools "Bash(mise run *),Bash(gh *),Bash(git *),Read,Write,Edit,Glob,Grep" \
-            "You are triaging E2E test failures from CI run ${WORKFLOW_RUN_ID}.
-             Follow the /e2e-triage skill instructions in .claude/skills/e2e-triage/SKILL.md.
-             You are in CI mode (WORKFLOW_RUN_ID is set). Use the CI mode steps.
-             Artifact directory: ${ARTIFACT_DIR}
-             CI run URL: ${WORKFLOW_RUN_URL}
-             Re-run failing tests using the e2e-isolated.yml workflow via gh workflow run.
-             Do NOT ask questions — act autonomously.
-             For flaky fixes: create a PR.
-             For real bugs: create or comment on GitHub issues.
-             Print a summary table at the end."
+            --allowedTools "Read,Glob,Grep,Write,Edit,Bash(mise run test:e2e*),Bash(mise run fmt*),Bash(mise run lint*),Bash(gh workflow run *),Bash(gh run list *),Bash(gh run view *),Bash(gh run download *),Bash(gh pr create *),Bash(gh issue list *),Bash(gh issue create *),Bash(gh issue comment *),Bash(git checkout *),Bash(git add *),Bash(git commit *),Bash(git push *),Bash(git diff *),Bash(git log *),Bash(git status*)"
+          You are triaging E2E test failures from CI run ${WORKFLOW_RUN_ID}.
+          Follow the /e2e-triage skill instructions in .claude/skills/e2e-triage/SKILL.md.
+          You are in CI mode (WORKFLOW_RUN_ID is set). Use the CI mode steps.
+          Artifact directory: ${ARTIFACT_DIR}
+          CI run URL: ${WORKFLOW_RUN_URL}
+          Re-run failing tests using the e2e-isolated.yml workflow via gh workflow run.
+          Do NOT ask questions — act autonomously.
+          For flaky fixes: create a PR.
+          For real bugs: create or comment on GitHub issues.
+          Print a summary table at the end.
+          PROMPT
 
       - name: Build Slack summary
         if: ${{ always() && env.SLACK_WEBHOOK_URL != '' }}
@@ -121,10 +122,10 @@ jobs:
             '{blocks: [{type: "section", text: {type: "mrkdwn", text: $text}}]}' \
             > slack-payload.json
 
-      - name: Notify Slack - triage complete
-        if: ${{ always() && env.SLACK_WEBHOOK_URL != '' }}
-        uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
-        with:
-          webhook: ${{ env.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload-file-path: slack-payload.json
+      # - name: Notify Slack - triage complete
+      #   if: ${{ always() && env.SLACK_WEBHOOK_URL != '' }}
+      #   uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a
+      #   with:
+      #     webhook: ${{ env.SLACK_WEBHOOK_URL }}
+      #     webhook-type: incoming-webhook
+      #     payload-file-path: slack-payload.json