Skip to content

panel-review skill v0.1 with braintrust integration#1

Merged
mollyretter merged 4 commits into
mainfrom
add-panel-review-skill
May 14, 2026
Merged

panel-review skill v0.1 with braintrust integration#1
mollyretter merged 4 commits into
mainfrom
add-panel-review-skill

Conversation

@mollyretter

@mollyretter mollyretter commented May 11, 2026

Copy link
Copy Markdown
Owner

Summary

  • New panel-review skill: multi-agent code review (Opus + Sonnet on regression/security/bugs, Haiku on convention/naming/dead-code/comment-quality), discrete severity levels, Sonnet confidence recheck and comment polish merged into one pass, dev approval gate before posting via gh pr review. Worktree-isolated. Build log committed alongside the PR.
  • Real braintrust SDK integration at skills/panel-review/traces.ts. Env-var opt-in (BRAINTRUST_API_KEY), lazy init, silent no-op when unset, graceful degrade on failure.
  • New CI job Braintrust integration smoke-tests the SDK end-to-end (scripts/braintrust-smoke-test.ts).
  • Fixes a same-input-twice bug in parseNorthStar and parseReviewLog (gray-matter cache drops non-enumerable fields on the shallow copy); switched check to Object.keys(data).length. Includes regression test.
  • Dogfooded on itself. /panel-review was run against this PR with a 3-agent fan-out. The panel surfaced 1 blocker (precision.eval.ts:148 leaking raw SDK error text to public CI stdout) and 2 questions (promote-to-dataset.ts:120 single-slash replace; traces.ts:112-113 flushTraces race past !logger guard). All three were resolved in-branch with a new regression test for the flushTraces race. Full audit trail at docs/build-logs/panel-review-pr-1.md (294,979 tokens across 5 agent calls).

What you need to do before merging

  1. Sign up at braintrust.dev and create a project named fdet-panel-review. Grab the API key.
  2. Set the local env var (e.g. in ~/.bashrc): export BRAINTRUST_API_KEY=...
  3. Set the GitHub Actions secret: gh secret set BRAINTRUST_API_KEY --repo mollyretter/forward-deployed-engineer-toolkit

Until step 3 is done, the Braintrust integration CI job will fail with a clear error message instructing how to set the secret.

Test plan

  • npm run ci passes locally (skill-shape validator + 48 unit tests across 3 files)
  • parseNorthStar regression test for the same-input-twice gray-matter cache bug
  • parseReviewLog and validateReviewLog cover happy path, missing fields, wrong-typed fields, allowed verdict set, total_tokens edge cases, schema id mismatch
  • logTrace covers: silent no-op without key, lazy init, no re-init, metadata merge, error handling at init and log
  • flushTraces covers: no-op without init, calls braintrust.flush after init, short-circuits on initFailed after a logged async failure (F3 regression)
  • CI passes on this PR (after BRAINTRUST_API_KEY secret is set)
  • First dogfood run completed on this PR. Findings + resolutions in docs/build-logs/panel-review-pr-1.md.

Open follow-ups (from the dogfood run)

Tracked in the build log's Retro section:

  • Lint rule flagging (err as Error).message interpolation in console.warn/console.error under scripts/ and skills/*/evals/ (F1 class is easy to reintroduce).
  • Document the replace vs replaceAll gotcha for stable-id construction in CLAUDE.md.
  • Formalize a fix-review phase in panel-review v0.2 (this run added one ad-hoc).
  • Refine Phase 4 cull prompts if reruns keep generating debuggability/error-message nits (F4/F5 class).
  • Pin Node ≥ 20 in package.json engines + .nvmrc/.tool-versions. flush-traces and promote-to-dataset failed under Node 19.0.0 with TypeError: tracingChannelFn is not a function because the braintrust SDK uses node:diagnostics_channel.tracingChannel (added in 19.9, shipped in 20+).

Honest caveats

  • v0.1 was designed, cross-model reviewed, and now exercised on itself: one blocker, two questions, two nits surfaced, three of those resolved in-branch with the fix-review pass verifying no new issues. Tracking-quality questions documented in SKILL.md ## Status; next dogfood will likely sharpen the Phase 4 cull rubric further.
  • Braintrust integration is real and has now flushed 6 real traces from this run into the fdet-panel-review project, with 1 labeled trace promoted to the panel-review-labeled Dataset. No badge yet. v0.2+ will add eval suites once there's enough labeled data (~10-20 reviews) to grade against; badge added then if it means something.

🤖 Generated with Claude Code

mollyretter and others added 3 commits May 13, 2026 21:38
A multi-agent code review skill plus the trace pipeline, eval framework, and developer guardrails that make it operationally honest.

**Core skill (panel-review):**
- Three-model fan-out: Opus + Sonnet on regression / security / bugs; Haiku on convention drift, naming, dead code, comment quality
- Discrete severity (blocker | question | nit); blocker reserved for actual bad bugs
- Fresh-context Sonnet does confidence recheck plus aggressive cull plus comment polish in one pass
- Posted comment is one of two shapes (approve-with-summary or request-changes-with-blockers); questions and nits stay in the build log unless dev opts in
- Worktree-isolated; build log committed alongside the PR

**Trace pipeline (real-run logging):**
- Phase-by-phase JSONL accumulation in /tmp/, written via scripts/append-trace.ts wrapper (with shape validation; replaces inline JSON construction so the agent cannot silently skip a trace)
- Phase 6 pre-flight verification: file must have exactly 5 lines (3 fan-out + 1 recheck + 1 dev-gate) before flush; abort and reconstruct otherwise
- scripts/flush-traces.ts pushes JSONL to braintrust Logs
- scripts/promote-to-dataset.ts inserts labeled dev-gate trace into panel-review-labeled Dataset (stable id keys; upserts on rerun)
- Skill version sourced from SKILL.md frontmatter; every trace tagged with skill_version

**Evals via braintrustdata/eval-action:**
- skills/panel-review/evals/precision.eval.ts: real Eval() with per-severity precision scorers; reads synthetic baseline fixtures plus the panel-review-labeled Dataset
- .github/workflows/braintrust-evals.yml: runs eval-action on every PR; results post as PR comment with regression detection vs prior runs

**Guardrails:**
- CLAUDE.md privacy rule (CI logs public, braintrust private; aggregate counts only in CI scripts)
- Tightened agent prompts: no stylistic preferences, no hypothetical SDK behavior; only verifiable findings
- North-star pre-flight as Phase 0 of panel-review
- Schema regex format checks on panel-review build log frontmatter
- traces.ts hardens against sync and async log failures
- vitest.config.ts excludes .claude/skills symlinks
- Lightweight usage counters at docs/eval-runs/<skill>.md
- gray-matter cache bug fix in parseNorthStar and parseReviewLog (same-input-twice returns shallow copy dropping non-enumerable fields; switched check to Object.keys(data).length)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings surfaced by the panel review of this PR, all resolved
in-branch with a regression test for the most subtle of the three.

F1 (blocker, precision.eval.ts:148) — raw SDK error message was being
interpolated into a console.warn that runs in public GitHub Actions
logs. Drops the message entirely; uses a bare catch; pins the rationale
to the CLAUDE.md privacy rule so future maintainers don't reintroduce it.

F2 (question, promote-to-dataset.ts:120) — stable-id construction used
String.replace, which only replaces the first occurrence. A repo value
with more than one slash would silently collide ids across runs.
Switched to replaceAll with an inline note on the collision risk.

F3 (question, traces.ts:112-113) — flushTraces guarded on !logger but
not on initFailed, so a synchronous flushTraces call after a logTrace
whose async .catch microtask had not yet drained would proceed past the
guard and flush a known-broken logger. Added initFailed to the guard,
matching the existing initFailed-first check in getLogger. New vitest
regression case "short-circuits after a logged async failure before the
.catch microtask drains" exercises the race window with setImmediate.

Tests: 48/48 pass (was 47).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full audit trail of the v0.1 dogfood panel run against this PR. Six
phases captured: pre-flight (no north-star, toolkit-legitimate), raw
findings from Opus/Sonnet/Haiku in parallel, deduped aggregate, fresh-
context Sonnet recheck + cull + polish, dev triage + in-branch fixes,
fresh-context fix-review verifying F1/F2/F3 resolved with no new issues.

Verdict: approved (Shape A — no outstanding blockers after fixes).
Posted: false (gh pr review deferred; build log is the durable artifact).

Token usage across the run: 294,979 across 5 agent calls (3 fan-out + 1
recheck + 1 fix-review), 89 tool uses, 492s wall time.

Counter (docs/eval-runs/panel-review.md): runs_total 0 → 1,
runs_labeled 0 → 1 (dev decision had 3 confirmed, 0 rejected; passes
the confirmed.length + rejected.length >= 1 predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mollyretter mollyretter force-pushed the add-panel-review-skill branch from e648e9d to 989e50b Compare May 14, 2026 02:28
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Braintrust eval report

fdet-panel-review (panel-review-precision-b3d7ac23)

Score Average Improvements Regressions
Precision_at_blocker 75% (+0pp) - -
Precision_at_nit 0% (+0pp) - -
Precision_at_question 100% (+0pp) - -
Llm_calls 0 (+0) - -
Tool_calls 0 (+0) - -
Errors 0 (+0) - -
Llm_errors 0 (+0) - -
Tool_errors 0 (+0) - -
Prompt_tokens 0tok (+0tok) - -
Prompt_cached_tokens 0tok (+0tok) - -
Prompt_cache_creation_tokens 0tok (+0tok) - -
Prompt_cache_creation_5m_tokens 0tok (+0tok) - -
Prompt_cache_creation_1h_tokens 0tok (+0tok) - -
Completion_tokens 0tok (+0tok) - -
Completion_reasoning_tokens 0tok (+0tok) - -
Total_tokens 0tok (+0tok) - -
Duration 0s (0s) 3 🟢 1 🔴

@mollyretter mollyretter left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panel review (3-agent fan-out)

Verdict: Looks good. No outstanding blockers — all 3 surfaced findings resolved in this branch.

Findings surfaced and resolved (in-branch)

ID Sev (initial) Location Resolution
F1 blocker [skills/panel-review/evals/precision.eval.ts:148] Sanitized: dropped raw SDK error from console.warn; bare catch; comment pinning the CLAUDE.md privacy rule.
F2 question [scripts/promote-to-dataset.ts:120] replacereplaceAll so multi-slash repo values can't silently collide stable IDs.
F3 question [skills/panel-review/traces.ts:112-113] Added initFailed to the flushTraces guard + new vitest regression for the async-failure-before-microtask-drain race.

What we checked

  • Regression risk, security, bugs (Opus + Sonnet, parallel): CI privacy rule compliance across scripts and eval files; braintrust SDK init/logging error paths in traces.ts; stable ID construction in promote-to-dataset.ts; async microtask ordering in logTrace/flushTraces; fork-PR secret gating in both workflow files; smoke-test coverage for the braintrust integration job.
  • Convention drift, naming, dead code, comment quality (Haiku): error message accuracy in skills/north-star/schema.ts; silent-skip behavior in scripts/flush-traces.ts; inline comment quality across new scripts.
  • Confidence and polish (fresh-context Sonnet): blocker recheck, cull of low-signal items, comment polish.
  • Fix-review (fresh-context Sonnet, after fixes applied): each of F1/F2/F3 verified resolved against the live code; no new issues introduced.

Summary

One blocker and two questions surfaced across the three-agent fan-out — all three resolved in this branch with a new regression test for the flushTraces race condition. Test suite: 48/48 pass. Two nits were culled as not worth surfacing on this PR. No false positives.

Build log: docs/build-logs/panel-review-pr-1.md


🤖 Sent by Claude Code via panel-review v0.1.0.

Comment posted at 2026-05-14T02:33:14Z; closes the loop on the Phase 6
ops log so the build log accurately reflects final state. No code
changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mollyretter mollyretter merged commit 79ff51d into main May 14, 2026
4 checks passed
mollyretter added a commit that referenced this pull request May 14, 2026
Three findings surfaced by the panel review of this PR, all resolved
in-branch with a regression test for the most subtle of the three.

F1 (blocker, precision.eval.ts:148) — raw SDK error message was being
interpolated into a console.warn that runs in public GitHub Actions
logs. Drops the message entirely; uses a bare catch; pins the rationale
to the CLAUDE.md privacy rule so future maintainers don't reintroduce it.

F2 (question, promote-to-dataset.ts:120) — stable-id construction used
String.replace, which only replaces the first occurrence. A repo value
with more than one slash would silently collide ids across runs.
Switched to replaceAll with an inline note on the collision risk.

F3 (question, traces.ts:112-113) — flushTraces guarded on !logger but
not on initFailed, so a synchronous flushTraces call after a logTrace
whose async .catch microtask had not yet drained would proceed past the
guard and flush a known-broken logger. Added initFailed to the guard,
matching the existing initFailed-first check in getLogger. New vitest
regression case "short-circuits after a logged async failure before the
.catch microtask drains" exercises the race window with setImmediate.

Tests: 48/48 pass (was 47).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mollyretter added a commit that referenced this pull request May 14, 2026
Full audit trail of the v0.1 dogfood panel run against this PR. Six
phases captured: pre-flight (no north-star, toolkit-legitimate), raw
findings from Opus/Sonnet/Haiku in parallel, deduped aggregate, fresh-
context Sonnet recheck + cull + polish, dev triage + in-branch fixes,
fresh-context fix-review verifying F1/F2/F3 resolved with no new issues.

Verdict: approved (Shape A — no outstanding blockers after fixes).
Posted: false (gh pr review deferred; build log is the durable artifact).

Token usage across the run: 294,979 across 5 agent calls (3 fan-out + 1
recheck + 1 fix-review), 89 tool uses, 492s wall time.

Counter (docs/eval-runs/panel-review.md): runs_total 0 → 1,
runs_labeled 0 → 1 (dev decision had 3 confirmed, 0 rejected; passes
the confirmed.length + rejected.length >= 1 predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Braintrust eval report

fdet-panel-review (panel-review-precision-b701a6af)

Score Average Improvements Regressions
Precision_at_blocker 75% (+0pp) - -
Precision_at_nit 0% (+0pp) - -
Precision_at_question 100% (+0pp) - -
Llm_calls 0 (+0) - -
Tool_calls 0 (+0) - -
Errors 0 (+0) - -
Llm_errors 0 (+0) - -
Tool_errors 0 (+0) - -
Prompt_tokens 0tok (+0tok) - -
Prompt_cached_tokens 0tok (+0tok) - -
Prompt_cache_creation_tokens 0tok (+0tok) - -
Prompt_cache_creation_5m_tokens 0tok (+0tok) - -
Prompt_cache_creation_1h_tokens 0tok (+0tok) - -
Completion_tokens 0tok (+0tok) - -
Completion_reasoning_tokens 0tok (+0tok) - -
Total_tokens 0tok (+0tok) - -
Duration 0s (0s) - -

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant