panel-review skill v0.1 with braintrust integration #1
Merged
A multi-agent code review skill plus the trace pipeline, eval framework, and developer guardrails that make it operationally honest.

**Core skill (panel-review):**
- Three-model fan-out: Opus + Sonnet on regression / security / bugs; Haiku on convention drift, naming, dead code, comment quality
- Discrete severity (blocker | question | nit); blocker reserved for actual bad bugs
- Fresh-context Sonnet does confidence recheck plus aggressive cull plus comment polish in one pass
- Posted comment is one of two shapes (approve-with-summary or request-changes-with-blockers); questions and nits stay in the build log unless dev opts in
- Worktree-isolated; build log committed alongside the PR

**Trace pipeline (real-run logging):**
- Phase-by-phase JSONL accumulation in /tmp/, written via scripts/append-trace.ts wrapper (with shape validation; replaces inline JSON construction so the agent cannot silently skip a trace)
- Phase 6 pre-flight verification: file must have exactly 5 lines (3 fan-out + 1 recheck + 1 dev-gate) before flush; abort and reconstruct otherwise
- scripts/flush-traces.ts pushes JSONL to braintrust Logs
- scripts/promote-to-dataset.ts inserts labeled dev-gate trace into panel-review-labeled Dataset (stable id keys; upserts on rerun)
- Skill version sourced from SKILL.md frontmatter; every trace tagged with skill_version

**Evals via braintrustdata/eval-action:**
- skills/panel-review/evals/precision.eval.ts: real Eval() with per-severity precision scorers; reads synthetic baseline fixtures plus the panel-review-labeled Dataset
- .github/workflows/braintrust-evals.yml: runs eval-action on every PR; results post as PR comment with regression detection vs prior runs

**Guardrails:**
- CLAUDE.md privacy rule (CI logs public, braintrust private; aggregate counts only in CI scripts)
- Tightened agent prompts: no stylistic preferences, no hypothetical SDK behavior; only verifiable findings
- North-star pre-flight as Phase 0 of panel-review
- Schema regex format checks on panel-review build log frontmatter
- traces.ts hardens against sync and async log failures
- vitest.config.ts excludes .claude/skills symlinks
- Lightweight usage counters at docs/eval-runs/<skill>.md
- gray-matter cache bug fix in parseNorthStar and parseReviewLog (same-input-twice returns shallow copy dropping non-enumerable fields; switched check to Object.keys(data).length)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
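The Phase 6 pre-flight rule in the trace pipeline (exactly 5 lines: 3 fan-out + 1 recheck + 1 dev-gate) can be sketched as a small pure function. This is an illustration only, not the code in this PR: the `phase` field name, the phase labels, and `preflightTraceFile` are assumptions, since the commit message gives the expected counts but not the trace shape.

```typescript
// Hypothetical phase labels; the source only states the expected counts.
type Phase = "fan-out" | "recheck" | "dev-gate";

// Expected composition of the trace file before flush.
const EXPECTED: Record<Phase, number> = { "fan-out": 3, recheck: 1, "dev-gate": 1 };

interface PreflightResult {
  ok: boolean;
  reason?: string;
}

function preflightTraceFile(jsonl: string): PreflightResult {
  // Ignore blank lines (such as a trailing newline) when counting records.
  const lines = jsonl.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length !== 5) {
    return { ok: false, reason: `expected 5 lines, found ${lines.length}` };
  }
  const counts: Record<Phase, number> = { "fan-out": 0, recheck: 0, "dev-gate": 0 };
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return { ok: false, reason: "line is not valid JSON" };
    }
    // Assumed field: each trace line carries a `phase` string.
    const phase = (parsed as { phase?: Phase } | null)?.phase;
    if (typeof phase !== "string" || !Object.prototype.hasOwnProperty.call(counts, phase)) {
      return { ok: false, reason: "line has a missing or unknown phase" };
    }
    counts[phase] += 1;
  }
  for (const phase of Object.keys(EXPECTED) as Phase[]) {
    if (counts[phase] !== EXPECTED[phase]) {
      return {
        ok: false,
        reason: `expected ${EXPECTED[phase]} ${phase} line(s), found ${counts[phase]}`,
      };
    }
  }
  return { ok: true };
}
```

A caller would run this on the accumulated file contents and abort the flush when `ok` is false, which matches the "abort and reconstruct otherwise" behavior described above.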
Three findings surfaced by the panel review of this PR, all resolved in-branch with a regression test for the most subtle of the three.

F1 (blocker, precision.eval.ts:148): raw SDK error message was being interpolated into a console.warn that runs in public GitHub Actions logs. Drops the message entirely; uses a bare catch; pins the rationale to the CLAUDE.md privacy rule so future maintainers don't reintroduce it.

F2 (question, promote-to-dataset.ts:120): stable-id construction used String.replace, which only replaces the first occurrence. A repo value with more than one slash would silently collide ids across runs. Switched to replaceAll with an inline note on the collision risk.

F3 (question, traces.ts:112-113): flushTraces guarded on !logger but not on initFailed, so a synchronous flushTraces call after a logTrace whose async .catch microtask had not yet drained would proceed past the guard and flush a known-broken logger. Added initFailed to the guard, matching the existing initFailed-first check in getLogger. New vitest regression case "short-circuits after a logged async failure before the .catch microtask drains" exercises the race window with setImmediate.

Tests: 48/48 pass (was 47).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
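The F2 behavior is easy to reproduce in isolation. The sketch below assumes a hypothetical helper that turns a repo value into an id segment; the real key layout in promote-to-dataset.ts is not shown in this PR text. It uses a global regex, which gives the same result as `replaceAll("/", "-")` and also compiles under older TypeScript `lib` targets.

```typescript
// Hypothetical helpers; the actual stable-id format is not shown in the source.

// String.prototype.replace with a string pattern replaces only the first match.
function idSegmentFirstOnly(repo: string): string {
  return repo.replace("/", "-");
}

// A global regex replaces every slash (same result as repo.replaceAll("/", "-")).
function idSegmentAll(repo: string): string {
  return repo.replace(/\//g, "-");
}

const repo = "org/team/repo";
console.log(idSegmentFirstOnly(repo)); // org-team/repo
console.log(idSegmentAll(repo)); // org-team-repo
```

With a single-slash value such as `owner/repo` both helpers agree, which is why the first-occurrence behavior is easy to miss in testing.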
Full audit trail of the v0.1 dogfood panel run against this PR. Six phases captured: pre-flight (no north-star, toolkit-legitimate), raw findings from Opus/Sonnet/Haiku in parallel, deduped aggregate, fresh-context Sonnet recheck + cull + polish, dev triage + in-branch fixes, fresh-context fix-review verifying F1/F2/F3 resolved with no new issues.

Verdict: approved (Shape A: no outstanding blockers after fixes).
Posted: false (gh pr review deferred; build log is the durable artifact).

Token usage across the run: 294,979 across 5 agent calls (3 fan-out + 1 recheck + 1 fix-review), 89 tool uses, 492s wall time.

Counter (docs/eval-runs/panel-review.md): runs_total 0 → 1, runs_labeled 0 → 1 (dev decision had 3 confirmed, 0 rejected; passes the confirmed.length + rejected.length >= 1 predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
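The counter update described in this commit can be sketched as follows. The `confirmed`/`rejected` fields and the `runs_total`/`runs_labeled` names come from the message above; the `DevDecision` and `RunCounters` types and the `recordRun` helper are assumptions made for illustration.

```typescript
// Assumed shape of the dev-gate decision: ids of confirmed and rejected findings.
interface DevDecision {
  confirmed: string[];
  rejected: string[];
}

interface RunCounters {
  runs_total: number;
  runs_labeled: number;
}

// A run counts as labeled when the dev ruled on at least one finding.
function isLabeled(decision: DevDecision): boolean {
  return decision.confirmed.length + decision.rejected.length >= 1;
}

// Returns updated counters without mutating the input.
function recordRun(counters: RunCounters, decision: DevDecision): RunCounters {
  return {
    runs_total: counters.runs_total + 1,
    runs_labeled: counters.runs_labeled + (isLabeled(decision) ? 1 : 0),
  };
}

// This PR's run: 3 confirmed, 0 rejected.
const after = recordRun(
  { runs_total: 0, runs_labeled: 0 },
  { confirmed: ["F1", "F2", "F3"], rejected: [] },
);
console.log(after); // { runs_total: 1, runs_labeled: 1 }
```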
e648e9d to 989e50b
Braintrust eval report: fdet-panel-review (panel-review-precision-b3d7ac23)
mollyretter commented May 14, 2026
mollyretter (Owner, Author) left a comment
Panel review (3-agent fan-out)
Verdict: Looks good. No outstanding blockers — all 3 surfaced findings resolved in this branch.
Findings surfaced and resolved (in-branch)
| ID | Sev (initial) | Location | Resolution |
|---|---|---|---|
| F1 | blocker | skills/panel-review/evals/precision.eval.ts:148 | Sanitized: dropped raw SDK error from console.warn; bare catch; comment pinning the CLAUDE.md privacy rule. |
| F2 | question | scripts/promote-to-dataset.ts:120 | replace → replaceAll so multi-slash repo values can't silently collide stable IDs. |
| F3 | question | skills/panel-review/traces.ts:112-113 | Added initFailed to the flushTraces guard + new vitest regression for the async-failure-before-microtask-drain race. |
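The guard shape described for F3 can be illustrated with a simplified module. This is a sketch under stated assumptions, not the actual traces.ts: `logger`, `initFailed`, `logTrace`, and `flushTraces` are names from the PR, while the `Logger` interface, `makeFailingLogger`, and `flushCalls` are hypothetical stand-ins and no braintrust SDK call is used.

```typescript
// Stand-in logger interface (hypothetical; not the braintrust SDK).
interface Logger {
  log(event: object): Promise<void>;
  flush(): Promise<void>;
}

let logger: Logger | null = null;
let initFailed = false;
let flushCalls = 0;

// A logger whose log() always rejects asynchronously, to simulate a broken backend.
function makeFailingLogger(): Logger {
  return {
    log: () => Promise.reject(new Error("simulated failure")),
    flush: async () => {
      flushCalls += 1;
    },
  };
}

function logTrace(event: object): void {
  if (initFailed) return;
  if (!logger) logger = makeFailingLogger();
  // The failure is only recorded later, when the rejection handler runs.
  logger.log(event).catch(() => {
    initFailed = true;
  });
}

// Resolves to true only if a flush was actually attempted.
async function flushTraces(): Promise<boolean> {
  // Guard on both conditions: no logger yet, or a logger already known to be broken.
  if (initFailed || !logger) return false;
  await logger.flush();
  return true;
}
```

With only the `!logger` check, a `flushTraces` call made after the failure has been recorded would still reach `logger.flush()`; checking `initFailed` first makes it a no-op.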
What we checked
- Regression risk, security, bugs (Opus + Sonnet, parallel): CI privacy rule compliance across scripts and eval files; braintrust SDK init/logging error paths in `traces.ts`; stable ID construction in `promote-to-dataset.ts`; async microtask ordering in `logTrace`/`flushTraces`; fork-PR secret gating in both workflow files; smoke-test coverage for the braintrust integration job.
- Convention drift, naming, dead code, comment quality (Haiku): error message accuracy in `skills/north-star/schema.ts`; silent-skip behavior in `scripts/flush-traces.ts`; inline comment quality across new scripts.
- Confidence and polish (fresh-context Sonnet): blocker recheck, cull of low-signal items, comment polish.
- Fix-review (fresh-context Sonnet, after fixes applied): each of F1/F2/F3 verified resolved against the live code; no new issues introduced.
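The CI privacy rule checked above (F1) comes down to never printing caught error text in public logs. The sketch below shows that pattern with a hypothetical `loadOrEmpty` helper; returning an empty list on failure and the wording of the fixed message are assumptions, not the behavior of precision.eval.ts.

```typescript
// Hypothetical helper: run a loader and fall back to an empty list on failure.
function loadOrEmpty<T>(
  load: () => T[],
  warn: (msg: string) => void = console.warn,
): T[] {
  try {
    return load();
  } catch {
    // Bare catch on purpose: the error text may contain private details and
    // CI logs are public, so only a fixed message is printed.
    warn("labeled dataset unavailable; continuing without it");
    return [];
  }
}

// Usage: the thrown message never reaches the warning output.
const warnings: string[] = [];
const rows = loadOrEmpty<string>(
  () => {
    throw new Error("request failed for key sk-example-secret");
  },
  (msg) => warnings.push(msg),
);
console.log(rows.length, warnings);
```

Injecting `warn` keeps the helper testable; a regression test can assert that no caught message text ever appears in the captured output.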
Summary
One blocker and two questions surfaced across the three-agent fan-out — all three resolved in this branch with a new regression test for the flushTraces race condition. Test suite: 48/48 pass. Two nits were culled as not worth surfacing on this PR. No false positives.
Build log: docs/build-logs/panel-review-pr-1.md
🤖 Sent by Claude Code via panel-review v0.1.0.
Comment posted at 2026-05-14T02:33:14Z; closes the loop on the Phase 6 ops log so the build log accurately reflects final state. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Braintrust eval report: fdet-panel-review (panel-review-precision-b701a6af)
Summary
- `panel-review` skill: multi-agent code review (Opus + Sonnet on regression/security/bugs, Haiku on convention/naming/dead-code/comment-quality), discrete severity levels, Sonnet confidence recheck and comment polish merged into one pass, dev approval gate before posting via `gh pr review`. Worktree-isolated. Build log committed alongside the PR.
- `skills/panel-review/traces.ts`. Env-var opt-in (`BRAINTRUST_API_KEY`), lazy init, silent no-op when unset, graceful degrade on failure.
- `Braintrust integration` smoke-tests the SDK end-to-end (`scripts/braintrust-smoke-test.ts`).
- `parseNorthStar` and `parseReviewLog` (gray-matter cache drops non-enumerable fields on the shallow copy); switched check to `Object.keys(data).length`. Includes regression test.
- `/panel-review` was run against this PR with a 3-agent fan-out. The panel surfaced 1 blocker (`precision.eval.ts:148` leaking raw SDK error text to public CI stdout) and 2 questions (`promote-to-dataset.ts:120` single-slash replace; `traces.ts:112-113` `flushTraces` race past `!logger` guard). All three were resolved in-branch with a new regression test for the `flushTraces` race. Full audit trail at `docs/build-logs/panel-review-pr-1.md` (294,979 tokens across 5 agent calls).

What you need to do before merging
1. `fdet-panel-review`. Grab the API key.
2. `~/.bashrc`): `export BRAINTRUST_API_KEY=...`
3. `gh secret set BRAINTRUST_API_KEY --repo mollyretter/forward-deployed-engineer-toolkit`

Until step 3 is done, the `Braintrust integration` CI job will fail with a clear error message instructing how to set the secret.

Test plan
- `npm run ci` passes locally (skill-shape validator + 48 unit tests across 3 files)
- `parseNorthStar` regression test for the same-input-twice gray-matter cache bug
- `parseReviewLog` and `validateReviewLog` cover happy path, missing fields, wrong-typed fields, allowed verdict set, total_tokens edge cases, schema id mismatch
- `logTrace` covers: silent no-op without key, lazy init, no re-init, metadata merge, error handling at init and log
- `flushTraces` covers: no-op without init, calls braintrust.flush after init, short-circuits on `initFailed` after a logged async failure (F3 regression)
- `BRAINTRUST_API_KEY` secret is set)
- `docs/build-logs/panel-review-pr-1.md`.

Open follow-ups (from the dogfood run)
Tracked in the build log's Retro section:
- `(err as Error).message` interpolation in `console.warn`/`console.error` under `scripts/` and `skills/*/evals/` (F1 class is easy to reintroduce).
- `replace` vs `replaceAll` gotcha for stable-id construction in `CLAUDE.md`.
- `fix-review` phase in `panel-review` v0.2 (this run added one ad-hoc).
- `package.json` `engines` + `.nvmrc`/`.tool-versions`. `flush-traces` and `promote-to-dataset` failed under Node 19.0.0 with `TypeError: tracingChannelFn is not a function` because the braintrust SDK uses `node:diagnostics_channel.tracingChannel` (added in 19.9, shipped in 20+).

Honest caveats
- `## Status`; next dogfood will likely sharpen the Phase 4 cull rubric further.
- `fdet-panel-review` project, with 1 labeled trace promoted to the `panel-review-labeled` Dataset. No badge yet. v0.2+ will add eval suites once there's enough labeled data (~10-20 reviews) to grade against; badge added then if it means something.

🤖 Generated with Claude Code