panel-review skill v0.1 with braintrust integration by mollyretter · Pull Request #1 · mollyretter/forward-deployed-engineer-toolkit

mollyretter · 2026-05-11T00:36:59Z

Summary

New panel-review skill: multi-agent code review (Opus + Sonnet on regression/security/bugs, Haiku on convention/naming/dead-code/comment-quality), discrete severity levels, Sonnet confidence recheck and comment polish merged into one pass, dev approval gate before posting via gh pr review. Worktree-isolated. Build log committed alongside the PR.
Real braintrust SDK integration at skills/panel-review/traces.ts. Env-var opt-in (BRAINTRUST_API_KEY), lazy init, silent no-op when unset, graceful degrade on failure.
New CI job Braintrust integration smoke-tests the SDK end-to-end (scripts/braintrust-smoke-test.ts).
Fixes a same-input-twice bug in parseNorthStar and parseReviewLog (gray-matter cache drops non-enumerable fields on the shallow copy); switched check to Object.keys(data).length. Includes regression test.
Dogfooded on itself. /panel-review was run against this PR with a 3-agent fan-out. The panel surfaced 1 blocker (precision.eval.ts:148 leaking raw SDK error text to public CI stdout) and 2 questions (promote-to-dataset.ts:120 single-slash replace; traces.ts:112-113 flushTraces race past !logger guard). All three were resolved in-branch with a new regression test for the flushTraces race. Full audit trail at docs/build-logs/panel-review-pr-1.md (294,979 tokens across 5 agent calls).

What you need to do before merging

Sign up at braintrust.dev and create a project named fdet-panel-review. Grab the API key.
Set the local env var (e.g. in ~/.bashrc): export BRAINTRUST_API_KEY=...
Set the GitHub Actions secret: gh secret set BRAINTRUST_API_KEY --repo mollyretter/forward-deployed-engineer-toolkit

Until step 3 is done, the Braintrust integration CI job will fail with a clear error message instructing how to set the secret.

Test plan

npm run ci passes locally (skill-shape validator + 48 unit tests across 3 files)
parseNorthStar regression test for the same-input-twice gray-matter cache bug
parseReviewLog and validateReviewLog cover happy path, missing fields, wrong-typed fields, allowed verdict set, total_tokens edge cases, schema id mismatch
logTrace covers: silent no-op without key, lazy init, no re-init, metadata merge, error handling at init and log
flushTraces covers: no-op without init, calls braintrust.flush after init, short-circuits on initFailed after a logged async failure (F3 regression)
CI passes on this PR (after BRAINTRUST_API_KEY secret is set)
First dogfood run completed on this PR. Findings + resolutions in docs/build-logs/panel-review-pr-1.md.

Open follow-ups (from the dogfood run)

Tracked in the build log's Retro section:

Lint rule flagging (err as Error).message interpolation in console.warn/console.error under scripts/ and skills/*/evals/ (F1 class is easy to reintroduce).
Document the replace vs replaceAll gotcha for stable-id construction in CLAUDE.md.
Formalize a fix-review phase in panel-review v0.2 (this run added one ad-hoc).
Refine Phase 4 cull prompts if reruns keep generating debuggability/error-message nits (F4/F5 class).
Pin Node ≥ 20 in package.json engines + .nvmrc/.tool-versions. flush-traces and promote-to-dataset failed under Node 19.0.0 with TypeError: tracingChannelFn is not a function because the braintrust SDK uses node:diagnostics_channel.tracingChannel (added in 19.9, shipped in 20+).

Honest caveats

v0.1 was designed, cross-model reviewed, and now exercised on itself: one blocker, two questions, two nits surfaced, three of those resolved in-branch with the fix-review pass verifying no new issues. Tracking-quality questions documented in SKILL.md ## Status; next dogfood will likely sharpen the Phase 4 cull rubric further.
Braintrust integration is real and has now flushed 6 real traces from this run into the fdet-panel-review project, with 1 labeled trace promoted to the panel-review-labeled Dataset. No badge yet. v0.2+ will add eval suites once there's enough labeled data (~10-20 reviews) to grade against; badge added then if it means something.

🤖 Generated with Claude Code

A multi-agent code review skill plus the trace pipeline, eval framework, and developer guardrails that make it operationally honest. **Core skill (panel-review):** - Three-model fan-out: Opus + Sonnet on regression / security / bugs; Haiku on convention drift, naming, dead code, comment quality - Discrete severity (blocker | question | nit); blocker reserved for actual bad bugs - Fresh-context Sonnet does confidence recheck plus aggressive cull plus comment polish in one pass - Posted comment is one of two shapes (approve-with-summary or request-changes-with-blockers); questions and nits stay in the build log unless dev opts in - Worktree-isolated; build log committed alongside the PR **Trace pipeline (real-run logging):** - Phase-by-phase JSONL accumulation in /tmp/, written via scripts/append-trace.ts wrapper (with shape validation; replaces inline JSON construction so the agent cannot silently skip a trace) - Phase 6 pre-flight verification: file must have exactly 5 lines (3 fan-out + 1 recheck + 1 dev-gate) before flush; abort and reconstruct otherwise - scripts/flush-traces.ts pushes JSONL to braintrust Logs - scripts/promote-to-dataset.ts inserts labeled dev-gate trace into panel-review-labeled Dataset (stable id keys; upserts on rerun) - Skill version sourced from SKILL.md frontmatter; every trace tagged with skill_version **Evals via braintrustdata/eval-action:** - skills/panel-review/evals/precision.eval.ts: real Eval() with per-severity precision scorers; reads synthetic baseline fixtures plus the panel-review-labeled Dataset - .github/workflows/braintrust-evals.yml: runs eval-action on every PR; results post as PR comment with regression detection vs prior runs **Guardrails:** - CLAUDE.md privacy rule (CI logs public, braintrust private; aggregate counts only in CI scripts) - Tightened agent prompts: no stylistic preferences, no hypothetical SDK behavior; only verifiable findings - North-star pre-flight as Phase 0 of panel-review - Schema regex format checks on panel-review build log frontmatter - traces.ts hardens against sync and async log failures - vitest.config.ts excludes .claude/skills symlinks - Lightweight usage counters at docs/eval-runs/<skill>.md - gray-matter cache bug fix in parseNorthStar and parseReviewLog (same-input-twice returns shallow copy dropping non-enumerable fields; switched check to Object.keys(data).length) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings surfaced by the panel review of this PR, all resolved in-branch with a regression test for the most subtle of the three. F1 (blocker, precision.eval.ts:148) — raw SDK error message was being interpolated into a console.warn that runs in public GitHub Actions logs. Drops the message entirely; uses a bare catch; pins the rationale to the CLAUDE.md privacy rule so future maintainers don't reintroduce it. F2 (question, promote-to-dataset.ts:120) — stable-id construction used String.replace, which only replaces the first occurrence. A repo value with more than one slash would silently collide ids across runs. Switched to replaceAll with an inline note on the collision risk. F3 (question, traces.ts:112-113) — flushTraces guarded on !logger but not on initFailed, so a synchronous flushTraces call after a logTrace whose async .catch microtask had not yet drained would proceed past the guard and flush a known-broken logger. Added initFailed to the guard, matching the existing initFailed-first check in getLogger. New vitest regression case "short-circuits after a logged async failure before the .catch microtask drains" exercises the race window with setImmediate. Tests: 48/48 pass (was 47). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Full audit trail of the v0.1 dogfood panel run against this PR. Six phases captured: pre-flight (no north-star, toolkit-legitimate), raw findings from Opus/Sonnet/Haiku in parallel, deduped aggregate, fresh- context Sonnet recheck + cull + polish, dev triage + in-branch fixes, fresh-context fix-review verifying F1/F2/F3 resolved with no new issues. Verdict: approved (Shape A — no outstanding blockers after fixes). Posted: false (gh pr review deferred; build log is the durable artifact). Token usage across the run: 294,979 across 5 agent calls (3 fan-out + 1 recheck + 1 fix-review), 89 tool uses, 492s wall time. Counter (docs/eval-runs/panel-review.md): runs_total 0 → 1, runs_labeled 0 → 1 (dev decision had 3 confirmed, 0 rejected; passes the confirmed.length + rejected.length >= 1 predicate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-14T02:28:59Z

Braintrust eval report

fdet-panel-review (panel-review-precision-b3d7ac23)

Score	Average	Improvements	Regressions
Precision_at_blocker	75% (+0pp)	-	-
Precision_at_nit	0% (+0pp)	-	-
Precision_at_question	100% (+0pp)	-	-
Llm_calls	0 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	0tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	0tok (+0tok)	-	-
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	0tok (+0tok)	-	-
Duration	0s (0s)	3 🟢	1 🔴

mollyretter

Panel review (3-agent fan-out)

Verdict: Looks good. No outstanding blockers — all 3 surfaced findings resolved in this branch.

Findings surfaced and resolved (in-branch)

ID	Sev (initial)	Location	Resolution
F1	blocker	[`skills/panel-review/evals/precision.eval.ts:148`]	Sanitized: dropped raw SDK error from `console.warn`; bare `catch`; comment pinning the CLAUDE.md privacy rule.
F2	question	[`scripts/promote-to-dataset.ts:120`]	`replace` → `replaceAll` so multi-slash repo values can't silently collide stable IDs.
F3	question	[`skills/panel-review/traces.ts:112-113`]	Added `initFailed` to the `flushTraces` guard + new vitest regression for the async-failure-before-microtask-drain race.

What we checked

Regression risk, security, bugs (Opus + Sonnet, parallel): CI privacy rule compliance across scripts and eval files; braintrust SDK init/logging error paths in traces.ts; stable ID construction in promote-to-dataset.ts; async microtask ordering in logTrace/flushTraces; fork-PR secret gating in both workflow files; smoke-test coverage for the braintrust integration job.
Convention drift, naming, dead code, comment quality (Haiku): error message accuracy in skills/north-star/schema.ts; silent-skip behavior in scripts/flush-traces.ts; inline comment quality across new scripts.
Confidence and polish (fresh-context Sonnet): blocker recheck, cull of low-signal items, comment polish.
Fix-review (fresh-context Sonnet, after fixes applied): each of F1/F2/F3 verified resolved against the live code; no new issues introduced.

Summary

One blocker and two questions surfaced across the three-agent fan-out — all three resolved in this branch with a new regression test for the flushTraces race condition. Test suite: 48/48 pass. Two nits were culled as not worth surfacing on this PR. No false positives.

Build log: docs/build-logs/panel-review-pr-1.md

🤖 Sent by Claude Code via panel-review v0.1.0.

Comment posted at 2026-05-14T02:33:14Z; closes the loop on the Phase 6 ops log so the build log accurately reflects final state. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings surfaced by the panel review of this PR, all resolved in-branch with a regression test for the most subtle of the three. F1 (blocker, precision.eval.ts:148) — raw SDK error message was being interpolated into a console.warn that runs in public GitHub Actions logs. Drops the message entirely; uses a bare catch; pins the rationale to the CLAUDE.md privacy rule so future maintainers don't reintroduce it. F2 (question, promote-to-dataset.ts:120) — stable-id construction used String.replace, which only replaces the first occurrence. A repo value with more than one slash would silently collide ids across runs. Switched to replaceAll with an inline note on the collision risk. F3 (question, traces.ts:112-113) — flushTraces guarded on !logger but not on initFailed, so a synchronous flushTraces call after a logTrace whose async .catch microtask had not yet drained would proceed past the guard and flush a known-broken logger. Added initFailed to the guard, matching the existing initFailed-first check in getLogger. New vitest regression case "short-circuits after a logged async failure before the .catch microtask drains" exercises the race window with setImmediate. Tests: 48/48 pass (was 47). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Full audit trail of the v0.1 dogfood panel run against this PR. Six phases captured: pre-flight (no north-star, toolkit-legitimate), raw findings from Opus/Sonnet/Haiku in parallel, deduped aggregate, fresh- context Sonnet recheck + cull + polish, dev triage + in-branch fixes, fresh-context fix-review verifying F1/F2/F3 resolved with no new issues. Verdict: approved (Shape A — no outstanding blockers after fixes). Posted: false (gh pr review deferred; build log is the durable artifact). Token usage across the run: 294,979 across 5 agent calls (3 fan-out + 1 recheck + 1 fix-review), 89 tool uses, 492s wall time. Counter (docs/eval-runs/panel-review.md): runs_total 0 → 1, runs_labeled 0 → 1 (dev decision had 3 confirmed, 0 rejected; passes the confirmed.length + rejected.length >= 1 predicate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-14T02:44:13Z

Braintrust eval report

fdet-panel-review (panel-review-precision-b701a6af)

Score	Average	Improvements	Regressions
Precision_at_blocker	75% (+0pp)	-	-
Precision_at_nit	0% (+0pp)	-	-
Precision_at_question	100% (+0pp)	-	-
Llm_calls	0 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	0tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	0tok (+0tok)	-	-
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	0tok (+0tok)	-	-
Duration	0s (0s)	-	-

mollyretter and others added 3 commits May 13, 2026 21:38

mollyretter force-pushed the add-panel-review-skill branch from e648e9d to 989e50b Compare May 14, 2026 02:28

mollyretter commented May 14, 2026

View reviewed changes

📋 panel-review build log: mark posted=true after gh pr review

dfe6b29

Comment posted at 2026-05-14T02:33:14Z; closes the loop on the Phase 6 ops log so the build log accurately reflects final state. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mollyretter merged commit 79ff51d into main May 14, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panel-review skill v0.1 with braintrust integration#1

panel-review skill v0.1 with braintrust integration#1
mollyretter merged 4 commits into
mainfrom
add-panel-review-skill

mollyretter commented May 11, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

mollyretter left a comment

Uh oh!

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mollyretter commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What you need to do before merging

Test plan

Open follow-ups (from the dogfood run)

Honest caveats

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Braintrust eval report

Uh oh!

mollyretter left a comment

Choose a reason for hiding this comment

Panel review (3-agent fan-out)

Findings surfaced and resolved (in-branch)

What we checked

Summary

Uh oh!

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Braintrust eval report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mollyretter commented May 11, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading