Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ Single HTTP server (`src/app.ts`) using `octokit` App class. Webhook events arri

**Event handler** (`src/webhook/events/`): parse event → unified `BotContext` → check for `@chrisleekr-bot` trigger → fire-and-forget `processRequest()`

**Router** (`src/webhook/router.ts`): idempotency (in-memory `Map` + durable tracking-comment check), owner allowlist, concurrency guard, triage, and scale-up decision. On heavy/overflow it spawns an ephemeral daemon K8s Pod (`src/k8s/ephemeral-daemon-spawner.ts`). The job is then enqueued for any daemon in the fleet to claim over WebSocket. The webhook server never executes the pipeline in-process.
**Router** (`src/webhook/router.ts`): owner allowlist, concurrency guard, triage, and scale-up decision (the `processRequest` entry point is dev-test-only; production handlers in `events/` own dispatch + idempotency). On heavy/overflow it spawns an ephemeral daemon K8s Pod (`src/k8s/ephemeral-daemon-spawner.ts`). The job is then enqueued for any daemon in the fleet to claim over WebSocket. The webhook server never executes the pipeline in-process.

**Pipeline** (`src/core/pipeline.ts`, executed by the daemon):

Expand Down Expand Up @@ -62,7 +62,7 @@ Single HTTP server (`src/app.ts`) using `octokit` App class. Webhook events arri
## Key Concepts

- **Async processing**: Webhook must respond within 10 seconds. All heavy work runs asynchronously after 200 OK.
- **Idempotency**: GitHub webhooks are at-least-once, a delivery (auto-retry or operator redelivery) replays with the SAME `X-GitHub-Delivery` header for up to 3 days. The four side-effecting event handlers (`events/issue-comment.ts`, `events/review-comment.ts`, and the label branches of `events/issues.ts` + `events/pull-request.ts`) call `claimDelivery(deliveryId)` (`src/webhook/idempotency.ts`) at the very top of their dispatch path, before any LLM call, `workflow_runs` insert, or GitHub write. `claimDelivery` is a Valkey `SET key 1 NX EX 259200` claim: it returns `true` exactly once per `deliveryId` within the 3-day window (the redelivery gets `false` and the handler returns early). It is **fail-OPEN**, when Valkey is unconfigured or errors it returns `true`, degrading to at-least-once rather than dropping webhooks. `events/review.ts` is intentionally NOT gated: it fires only an idempotent reactor wake (no dispatch/write). Durable backstop behind the best-effort Valkey layer: the `idx_workflow_runs_inflight` partial-unique index, which makes the dispatcher reject a second in-flight run for the same workflow+target even if the Valkey claim was skipped (fail-open). `claimDelivery` also fails open when Valkey is configured-but-disconnected (gated on `isValkeyHealthy()` so a down connection skips the SET rather than blocking on Bun's offline queue). (The legacy in-memory `Map` + `isAlreadyProcessed()` tracking-comment scan lives on the `router.ts` `processRequest` path, which the production handlers bypass, issue #202.)
- **Idempotency**: GitHub webhooks are at-least-once, a delivery (auto-retry or operator redelivery) replays with the SAME `X-GitHub-Delivery` header for up to 3 days. The four side-effecting event handlers (`events/issue-comment.ts`, `events/review-comment.ts`, and the label branches of `events/issues.ts` + `events/pull-request.ts`) call `claimDelivery(deliveryId)` (`src/webhook/idempotency.ts`) at the very top of their dispatch path, before any LLM call, `workflow_runs` insert, or GitHub write. `claimDelivery` is a Valkey `SET key 1 NX EX 259200` claim: it returns `true` exactly once per `deliveryId` within the 3-day window (the redelivery gets `false` and the handler returns early). It is **fail-OPEN**, when Valkey is unconfigured or errors it returns `true`, degrading to at-least-once rather than dropping webhooks. `events/review.ts` is intentionally NOT gated: it fires only an idempotent reactor wake (no dispatch/write). Durable backstop behind the best-effort Valkey layer: the `idx_workflow_runs_inflight` partial-unique index, which makes the dispatcher reject a second in-flight run for the same workflow+target even if the Valkey claim was skipped (fail-open). `claimDelivery` also fails open when Valkey is configured-but-disconnected (gated on `isValkeyHealthy()` so a down connection skips the SET rather than blocking on Bun's offline queue). (The legacy in-memory `Map` + `isAlreadyProcessed()` tracking-comment scan was retired in issue #211; it only ever guarded the dev-test-only `router.ts` `processRequest` path, which production handlers bypass.)
- **Repo checkout**: Each request clones the repo to a unique temp dir. Claude operates on local files via `cwd`.
- **MCP servers**: Comment updates, inline reviews, and Context7 for library docs. Git changes are made via git CLI (Bash tool) on the cloned repo.
- **Scheduled actions**: a repo may ship a `.github-app.yaml` at its default-branch root declaring prompt-based actions on a cron schedule. The internal scheduler (`src/scheduler/`, gated by `SCHEDULER_ENABLED` + `DATABASE_URL` + non-empty `ALLOWED_OWNERS`) enqueues a `scheduled-action` job, a new job kind on the scoped-job rail, that the daemon runs as one agent session via `src/daemon/scheduled-action-executor.ts`. Missed cron slots are skipped, not backfilled. The prompt is owner-trusted config. Cron parsing uses the `cron-parser` dependency. The bot-provided `merge_readiness` MCP tool is exposed only when `SCHEDULER_ALLOW_AUTO_MERGE` env AND per-action `auto_merge` are both true; `allowed_tools` is owner-trusted config, so an action granted a merge-capable Bash tool can still merge regardless. `resolve.ts` FR-017 is untouched.
Expand Down
4 changes: 2 additions & 2 deletions docs/build/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@ flowchart TD

## Key concepts

- **Async processing.** The webhook handler responds within ten seconds, so the router fires `processRequest` with fire-and-forget semantics after the 200 OK is queued. Every box downstream of `ACK` runs after the HTTP response is on the wire.
- **Webhook delivery idempotency (issue #202).** GitHub is at-least-once: a delivery (auto-retry or operator redelivery) replays with the same `X-GitHub-Delivery` for up to 3 days. The four side-effecting handlers (`events/issue-comment.ts`, `events/review-comment.ts`, the label branches of `events/issues.ts` + `events/pull-request.ts`) call `claimDelivery(deliveryId)` (`src/webhook/idempotency.ts`) at the top of their dispatch path, before any LLM call, `workflow_runs` insert, or GitHub write. It is a Valkey `SET key 1 NX EX 259200` claim: `true` exactly once per delivery, `false` (and an early return) on a redelivery. It is **fail-open**, when Valkey is unconfigured or disconnected (gated on `isValkeyHealthy()`) it returns `true`, degrading to at-least-once rather than dropping or blocking webhooks. `events/review.ts` is exempt (idempotent reactor wake only). The durable backstop behind the best-effort Valkey layer is the `idx_workflow_runs_inflight` partial-unique index: the dispatcher rejects a second in-flight run for the same workflow+target even when the Valkey claim was skipped. The legacy in-memory `Map` + `isAlreadyProcessed` tracking-comment scan (`src/core/tracking-comment.ts`) now runs only on the `router.ts processRequest` path, which production handlers bypass. `DATABASE_URL` is required to persist execution / dispatch history and the in-flight guard across restarts.
- **Async processing.** The webhook handler responds within ten seconds, so the side-effecting `events/*` handlers fire their dispatch with fire-and-forget semantics after the 200 OK is queued. Every box downstream of `ACK` runs after the HTTP response is on the wire. (`router.ts processRequest` is the equivalent path for the dev-only `/api/test/webhook` endpoint, not production.)
- **Webhook delivery idempotency (issue #202).** GitHub is at-least-once: a delivery (auto-retry or operator redelivery) replays with the same `X-GitHub-Delivery` for up to 3 days. The four side-effecting handlers (`events/issue-comment.ts`, `events/review-comment.ts`, the label branches of `events/issues.ts` + `events/pull-request.ts`) call `claimDelivery(deliveryId)` (`src/webhook/idempotency.ts`) at the top of their dispatch path, before any LLM call, `workflow_runs` insert, or GitHub write. It is a Valkey `SET key 1 NX EX 259200` claim: `true` exactly once per delivery, `false` (and an early return) on a redelivery. It is **fail-open**, when Valkey is unconfigured or disconnected (gated on `isValkeyHealthy()`) it returns `true`, degrading to at-least-once rather than dropping or blocking webhooks. `events/review.ts` is exempt (idempotent reactor wake only). The durable backstop behind the best-effort Valkey layer is the `idx_workflow_runs_inflight` partial-unique index: the dispatcher rejects a second in-flight run for the same workflow+target even when the Valkey claim was skipped. The legacy in-memory `Map` + `isAlreadyProcessed` tracking-comment scan was retired in issue #211 (it only ever guarded the dev-test-only `router.ts processRequest` path, which production handlers bypass). `DATABASE_URL` is required to persist execution / dispatch history and the in-flight guard across restarts.
- **One request, one clone.** Each delivery clones the repo into a unique temp directory under `CLONE_BASE_DIR` **on the daemon host**. Claude operates on local files via `cwd`. On PR events the checkout supplementally fetches `origin/<baseBranch>` (when it differs from the head ref) so the agent's `git diff origin/<baseBranch>...HEAD` and `git rebase origin/<baseBranch>` directives resolve first try. A sibling `${workDir}-artifacts` directory is created outside the checkout and exposed to the agent as `BOT_ARTIFACT_DIR`: workflow summary files (IMPLEMENT.md / REVIEW.md / RESOLVE.md) are written there so they can never be picked up by a `git add` inside the clone. Both directories are removed in the pipeline's `finally` block regardless of outcome.
- **GitHub credential resolution.** `src/core/github-token.ts:resolveGithubToken()` is the single source of the GitHub credential the daemon uses. Default is an App installation token minted just-in-time from the cached `App` singleton in `src/orchestrator/connection-handler.ts`. When `GITHUB_PERSONAL_ACCESS_TOKEN` is set, the helper short-circuits and returns the PAT instead, API/git authentication runs as the PAT owner. Commit author/committer metadata is **not** affected; `src/core/checkout.ts` hard-pins git `user.name`/`user.email` to `chrisleekr-bot[bot]` so commit objects still carry the bot identity. The git credential helper, executor `GH_TOKEN`/`GITHUB_TOKEN` env vars, and MCP server env all consume the resolved string without caring about its source.
- **The webhook server never runs the pipeline.** Only daemons execute `runPipeline`. The webhook server is the orchestrator: it enqueues jobs and optionally spawns ephemeral daemons.
Expand Down
8 changes: 4 additions & 4 deletions docs/use/invoking.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,12 +51,12 @@ flowchart TD

## Idempotency

A duplicate webhook delivery never spawns a duplicate job. The router checks two layers:
A duplicate webhook delivery never spawns a duplicate job (issue #202). Two layers:

1. **Fast in-memory**: a `Map` keyed by the `X-GitHub-Delivery` header, swept every 60 minutes (`src/webhook/router.ts`). Lost on restart.
2. **Durable**, `isAlreadyProcessed` looks for the hidden delivery marker that the bot embeds in its tracking comment. Survives pod restarts, OOM kills, and crash loops; works without `DATABASE_URL`.
1. **Best-effort claim**: the side-effecting event handlers call `claimDelivery(deliveryId)` (`src/webhook/idempotency.ts`), a Valkey `SET key 1 NX EX 259200` that returns `true` exactly once per `X-GitHub-Delivery` within GitHub's 3-day redelivery window; a redelivery gets `false` and returns early. Fail-open: a Valkey outage degrades to at-least-once rather than dropping webhooks.
2. **Durable backstop**: the `idx_workflow_runs_inflight` partial-unique index makes the dispatcher reject a second in-flight run for the same workflow+target even if the Valkey claim was skipped.

If both miss, the request proceeds to the allowlist + concurrency guard.
The legacy in-memory `Map` + tracking-comment marker scan was retired in issue #211.

## What you see while it runs

Expand Down
2 changes: 1 addition & 1 deletion docs/use/safety.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ The shepherding probe detects when a non-bot principal has pushed to the PR's he

## Idempotency

A duplicate webhook delivery, same `X-GitHub-Delivery` header or same tracking-comment marker, is dropped before any work runs. The fast in-memory `Map` is lost on restart; the durable check (looking for the bot's hidden delivery marker in existing tracking comments) survives crash loops.
A duplicate webhook delivery (same `X-GitHub-Delivery` header) is dropped before any work runs: the side-effecting handlers claim each delivery once via a Valkey `SET NX` with a 3-day TTL (`claimDelivery`), and the `idx_workflow_runs_inflight` index is the durable backstop that rejects a second in-flight run for the same workflow+target if the claim was skipped (issue #202).

## Fork PRs

Expand Down
2 changes: 1 addition & 1 deletion src/app.ts
Original file line number Diff line number Diff line change
Expand Up @@ -290,7 +290,7 @@ async function handleTestWebhook(
const deliveryId = `test-${randomUUID()}`;

// Mock Octokit that logs instead of calling GitHub API.
// Only used by the router path (isAlreadyProcessed, concurrency comment).
// Only used by the router path (capacity / spawn-failed comments).
// The daemon creates its own real Octokit from the installation token.
const mockOctokit = buildMockOctokit();

Expand Down
51 changes: 12 additions & 39 deletions src/core/tracking-comment.ts
Original file line number Diff line number Diff line change
Expand Up @@ -75,45 +75,16 @@ export function renderTriageSection(triage: TriageCommentSection): string {
}

/**
* Build the hidden HTML marker used for durable idempotency.
* Embedded in the tracking comment body so the marker survives pod restarts
* and can be detected on webhook retries.
* Build the hidden HTML marker embedded in the tracking comment body so the
* bot's own tracking comment can be located and updated in place (see the
* `comment` MCP server). Redelivery idempotency lives elsewhere now
* (`claimDelivery` + `idx_workflow_runs_inflight`, issue #202); the in-memory
* Map + marker-scan idempotency check were retired in issue #211.
*/
export function deliveryMarker(deliveryId: string): string {
return `<!-- delivery:${deliveryId} -->`;
}

/**
* Check if a bot comment for this delivery already exists.
* Used as a durable idempotency check that survives pod restarts, complementing
* the in-memory processed Map which only covers the current process lifetime.
*
* Per: https://docs.github.com/en/webhooks/using-webhooks/best-practices-for-using-webhooks
*/
export async function isAlreadyProcessed(ctx: BotContext): Promise<boolean> {
const { octokit, owner, repo, entityNumber, deliveryId, triggerTimestamp } = ctx;

const marker = deliveryMarker(deliveryId);

// Scope the scan with `since=triggerTimestamp` so we only see comments at-or-after the
// webhook trigger. The per-issue listComments endpoint orders strictly by ascending ID
// and accepts only `since`, `per_page`, `page`, `direction`/`sort` are silently dropped
// (only the repo-level sibling endpoint honours them). Without `since`, page 1 would be
// the OLDEST 100 comments, leaving the tracking marker stranded on page 2+ for any
// PR/issue with >100 prior comments and breaking the durable idempotency check on retry.
// The tracking comment is posted seconds after the trigger, so even ~24h of retries land
// inside the `since` window and the marker is guaranteed to be on page 1.
const comments = await octokit.rest.issues.listComments({
owner,
repo,
issue_number: entityNumber,
per_page: 100,
since: triggerTimestamp,
});

return comments.data.some((c) => c.body?.includes(marker) === true);
}

/**
* Create the initial tracking comment ("Working...").
* Returns the comment ID for future updates.
Expand All @@ -123,9 +94,10 @@ export async function isAlreadyProcessed(ctx: BotContext): Promise<boolean> {
export async function createTrackingComment(ctx: BotContext): Promise<number> {
const { octokit, owner, repo, entityNumber, log } = ctx;

// Embed the deliveryId marker for durable idempotency, survives pod restarts.
// The in-memory processed Map in router.ts is the fast-path check;
// this marker is the durable fallback.
// Embed the deliveryId marker so the bot can locate and update its own tracking
// comment in place (see the `comment` MCP server). Not an idempotency mechanism
// anymore (claimDelivery + idx_workflow_runs_inflight own that, #202; the Map +
// marker-scan check were retired in #211).
const body = `${deliveryMarker(ctx.deliveryId)}\n${SPINNER_HTML} **${config.triggerPhrase}** is working on this...\n\n_Analyzing your request..._`;

const guarded = await safePostToGitHub({
Expand Down Expand Up @@ -245,8 +217,9 @@ export async function finalizeTrackingComment(

const errorSection = error !== undefined && error !== "" ? `\n\n---\n**Error:** ${error}` : "";

// Re-prepend the delivery marker so the durable idempotency check survives even if
// Claude's update_claude_comment call (which runs sanitizeContent) previously stripped it.
// Re-prepend the delivery marker so the tracking comment keeps its stable hidden marker
// even if Claude's update_claude_comment call (which runs sanitizeContent) previously
// stripped it. The marker locates the bot's comment, not idempotency (#202/#211).
const finalBody = `${deliveryMarker(ctx.deliveryId)}\n${header}\n\n---\n${cleanedBody}${errorSection}`;

await updateTrackingComment(ctx, trackingCommentId, finalBody);
Expand Down
6 changes: 4 additions & 2 deletions src/mcp/servers/comment.ts
Original file line number Diff line number Diff line change
Expand Up @@ -94,8 +94,10 @@ server.tool(
"secret redacted from comment body",
);
}
// Re-prepend the delivery marker after sanitizeContent strips it (stripHtmlComments).
// The marker is required for the durable idempotency check in isAlreadyProcessed().
// Re-prepend the delivery marker after sanitizeContent strips it (stripHtmlComments)
// so the tracking comment keeps the stable hidden marker createTrackingComment wrote
// (see `deliveryMarker`). The idempotency check that used to read it was retired in
// issue #211; the marker is preserved for tracking-comment format consistency.
// DELIVERY_ID is validated non-empty at process startup so the cast is safe.
const markerPrefix = `<!-- delivery:${DELIVERY_ID} -->`;
const bodyWithMarker = `${markerPrefix}\n${guarded.body}`;
Expand Down
8 changes: 4 additions & 4 deletions src/webhook/idempotency.ts
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,10 @@ const KEY_PREFIX = "idemp:webhook:";
* degrades to at-least-once processing rather than dropping every webhook. The
* `idx_workflow_runs_inflight` partial-unique index remains the durable backstop
* against duplicate work when this best-effort layer is skipped: the dispatcher
* rejects a second in-flight run for the same workflow+target. (The
* tracking-comment marker scan via `isAlreadyProcessed` is NOT a backstop here:
* it runs only on the legacy `router.ts processRequest` path that production
* handlers bypass, issue #202.)
* rejects a second in-flight run for the same workflow+target. (The legacy
* tracking-comment marker scan that used to guard the `router.ts processRequest`
* path was retired in issue #211; that path is dev-test-only and production
* handlers never used it, issue #202.)
*/
export async function claimDelivery(deliveryId: string, log: Logger): Promise<boolean> {
const client = getValkeyClient();
Expand Down
Loading
Loading