Harden review prompts for consistency and noise reduction#579
Draft
mariusvniekerk wants to merge 6 commits into main from
Conversation
Bare "high/medium/low" labels give agents no shared calibration standard, leading to inconsistent severity across reviews. Defining each level in terms of real-world impact (data loss, exploitability, blast radius) aligns all agents on the same scale and naturally prevents low-value findings from being over-rated. Inspired by the impact × breadth scoring pattern from research-oriented analysis skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
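A minimal sketch of what such an impact-tied severity rubric might look like in code. The label wording and the keyword heuristic are illustrative, not taken from the PR:

```python
# Hypothetical severity rubric tying each label to concrete real-world impact.
# The label names mirror the PR's high/medium/low scale; the wording is illustrative.
SEVERITY_RUBRIC = {
    "high": "Data loss, remotely exploitable vulnerability, or breakage affecting all users",
    "medium": "Incorrect behavior with a limited blast radius, or an exploit requiring local access",
    "low": "Minor correctness issue with an easy workaround and no data or security impact",
}

def calibrate(impact_description: str) -> str:
    """Map a described impact onto the shared scale (keyword heuristic, illustrative only)."""
    text = impact_description.lower()
    if any(k in text for k in ("data loss", "exploit", "all users")):
        return "high"
    if any(k in text for k in ("incorrect", "limited blast radius", "local access")):
        return "medium"
    return "low"
```

The point of the rubric is that every agent rates the same described impact the same way, rather than each inventing its own scale.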
Replace "brief explanation of the problem" with "what specifically goes wrong if this is not fixed." This is the articulation test pattern from research-oriented analysis skills — every finding must justify itself with concrete impact reasoning, not just pattern-matching against a checklist. Findings like "this violates best practices" become impossible to write when the prompt demands specific harm. This is the single most effective noise reduction technique across the mop-mapping skill set. Applied to all review types: standard, dirty, range, security, and design. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
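The articulation test could be sketched as a filter over draft findings. The field name and the list of vague phrases are hypothetical, chosen only to illustrate the pattern:

```python
# Illustrative "articulation test": a finding survives only if it states what
# concretely goes wrong, not merely that a rule was violated.
VAGUE_PHRASES = ("violates best practices", "not idiomatic", "could be cleaner")

def passes_articulation_test(finding: dict) -> bool:
    """Reject findings that pattern-match a checklist instead of naming specific harm."""
    rationale = finding.get("what_goes_wrong", "").strip()
    if not rationale:
        return False  # no concrete harm articulated at all
    return not any(p in rationale.lower() for p in VAGUE_PHRASES)
```

In the actual prompts this gate is enforced by the wording of the instruction rather than by code, but the effect is the same: findings without a concrete failure mode cannot be written.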
The /rethink skill uses explicit evidence thresholds — "1 observation is a data point, 3+ is a pattern worth investigating." The /verify skill grounds every check in specific data. Applied here as negative prompt instructions that suppress the most common false positive categories: hypothetical issues in unseen code, style preferences, unfounded "missing tests" claims, and flagging patterns that match existing codebase conventions. Security reviews get a lighter version — they should still err toward reporting, but not flag theoretical vulnerabilities in untouched code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
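A sketch of how the negative instructions might be assembled per review type. The rule wording and the `review_type` values are assumptions for illustration; the PR only specifies that security reviews get a lighter version:

```python
# Hypothetical negative-instruction block appended to review prompts to suppress
# common false-positive categories. Security reviews keep only the first rule,
# since they should still err toward reporting issues inside the diff.
SUPPRESSION_RULES = [
    "Do not flag theoretical issues in code that is not part of this diff.",
    "Do not report pure style preferences.",
    "Do not claim tests are missing without evidence the diff changes untested behavior.",
    "Do not flag patterns that match existing conventions in this codebase.",
]

def build_suppression_block(review_type: str) -> str:
    rules = SUPPRESSION_RULES[:1] if review_type == "security" else SUPPRESSION_RULES
    return "\n".join(f"- {r}" for r in rules)
```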
The /verify skill's "recite" phase is its most powerful technique: read only the title, predict what the content should be, then check alignment. Applied here by reversing the old instruction "Do not review the commit message" — the commit message now becomes the primary lens for evaluating the diff. When a commit says "fix race condition" but the diff adds a mutex on the wrong resource, that's a high-value finding that pure diff-scanning misses. Intent-implementation gaps are now the first check category, above bugs and security, because they catch the class of errors where the code is internally consistent but doesn't do what the developer intended. The dirty-changes prompt is unchanged since uncommitted changes have no commit message to analyze. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
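The reordering could be sketched as follows; the category names are hypothetical, but the structure matches the description (intent gaps first, and dirty changes skip the intent check because there is no commit message):

```python
# Illustrative check ordering: the commit message is the primary lens, so
# intent-implementation gaps are evaluated before bugs and security.
CHECK_ORDER = [
    "intent_gap",   # does the diff actually do what the commit message claims?
    "bugs",
    "security",
]

def first_check(has_commit_message: bool) -> str:
    """Uncommitted (dirty) changes have no message, so the intent check is skipped."""
    order = CHECK_ORDER if has_commit_message else CHECK_ORDER[1:]
    return order[0]
```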
The /verify and /synthesize skills both enforce quality gates — checks that must pass before output is considered complete. Applied here as a final self-verification instruction: every finding must reference a specific diff location, severity must match the described impact, and no two findings may contradict each other. Findings that fail these checks are dropped. This catches the most embarrassing review failures (high-severity verdict with no actual line references, "pass" with critical findings listed) at near-zero cost since the model performs the check during the same generation. Applied to all review types: standard, dirty, range, and security. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
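A minimal sketch of the quality gate as a post-processing step. The field names (`diff_location`, `severity`) and verdict values are assumptions; in the PR the check is performed by the model itself during generation, not by separate code:

```python
# Hypothetical self-verification gate over draft findings before output:
# each finding must cite a diff location, and a "pass" verdict must not
# coexist with high-severity findings.
def quality_gate(verdict: str, findings: list[dict]) -> tuple[str, list[dict]]:
    kept = [f for f in findings if f.get("diff_location")]  # drop unanchored findings
    if verdict == "pass" and any(f.get("severity") == "high" for f in kept):
        verdict = "fail"  # a pass verdict contradicting critical findings is rejected
    return verdict, kept
```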
The /rethink skill's evidence accumulation pattern — "1 observation is a data point, 3+ is a pattern worth investigating" — directly applies to the insights system. Without explicit thresholds, the insights agent may recommend guideline changes from 1-2 occurrences (noise) or hesitate on strong 6+ patterns. Added tiered thresholds to the recurring patterns section and gated guideline suggestions on minimum 3 occurrences. This helps close the feedback loop between review noise and guideline refinement with appropriate confidence levels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
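The tiered thresholds can be sketched directly from the numbers in the commit message (1-2 occurrences are noise, 3+ a pattern, with guideline suggestions gated on 3). The tier names are hypothetical:

```python
# Illustrative tiered thresholds mirroring the /rethink evidence pattern:
# 1-2 occurrences are noise, 3-5 a pattern worth investigating, 6+ strong.
def pattern_tier(occurrences: int) -> str:
    if occurrences >= 6:
        return "strong"
    if occurrences >= 3:
        return "pattern"
    return "noise"

def may_suggest_guideline(occurrences: int) -> bool:
    """Guideline suggestions are gated on a minimum of 3 occurrences."""
    return occurrences >= 3
```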
roborev: Combined Review
Summary
Replaced bare high/medium/low labels with concrete definitions tied to real-world impact (data loss, exploitability, blast radius). Gives all agents a shared calibration standard so severity is consistent across reviews.

🤖 Generated with Claude Code