Add adversarial-verification plugin by wizardengineer · Pull Request #148 · trailofbits/skills

wizardengineer · 2026-04-17T07:34:15Z

Summary

Adds a new adversarial-verification plugin that stress-tests claims, designs, and bug findings by dispatching two isolated sub-agents (advocate + skeptic) and synthesizing their arguments into a structured verdict. Counters sycophancy and single-agent agreement bias by forcing maximal disagreement before the caller commits.

What it does

Two modes:

Decision mode — free-form arguments organized by evaluation dimensions (for approach/design choices)
Proof mode — N null hypotheses the skeptic tries to prove and the advocate tries to refute (for verifying bug findings and security claims)

Core principle: isolated sub-agent contexts are non-negotiable. An agent that sees the other side's arguments will soften to accommodate them. The adversarial value comes from each agent arguing without knowledge of the counter-argument, with synthesis happening separately.
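The isolation requirement can be illustrated with a minimal Python sketch. It is not the plugin's implementation: `run_agent`, `Debate`, and `run_isolated_debate` are hypothetical names, and the prompt wording is illustrative apart from the "Do not hedge" phrasing quoted later in the review. The property it shows is that each prompt is built from the claim alone, so neither agent ever sees the other's output.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical: any callable that sends one prompt to a fresh agent context
# and returns that agent's reply as text.
AgentFn = Callable[[str], str]


@dataclass(frozen=True)
class Debate:
    advocate: str
    skeptic: str


def run_isolated_debate(claim: str, run_agent: AgentFn) -> Debate:
    """Dispatch advocate and skeptic with prompts built only from the claim."""
    no_hedge = "Do not acknowledge merit in the opposing position. Do not hedge."
    advocate_prompt = f"Argue that the claim below is correct. {no_hedge}\n\nClaim: {claim}"
    skeptic_prompt = f"Argue that the claim below is wrong. {no_hedge}\n\nClaim: {claim}"

    # Neither prompt contains the other agent's reply, so neither side can
    # soften its position to accommodate the counter-argument.
    advocate_reply = run_agent(advocate_prompt)
    skeptic_reply = run_agent(skeptic_prompt)

    # Synthesis is a separate step that runs on the returned Debate.
    return Debate(advocate=advocate_reply, skeptic=skeptic_reply)
```

Passing the agent runner in as a parameter keeps the sketch independent of any particular sub-agent API and makes the isolation property easy to check with a stub.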

Structure

SKILL.md (main entry point with decision tree for mode selection)
references/decision-mode.md (structure for approach selection)
references/proof-mode.md (N-null-hypothesis structure for finding verification)
references/prompt-templates.md (advocate/skeptic templates enforcing anti-balance)
references/synthesis.md (verdict table format + recommendation structure)
references/anti-patterns.md (10 common failure modes with diagnoses)

When to use

Choosing between competing technical approaches
Verifying a bug finding is real (not a false positive)
Reviewing a design decision before commit
Any claim the caller is inclined to agree with by default

Test plan

python3 .github/scripts/validate_codex_skills.py passes (verified locally)
Plugin registered in .claude-plugin/marketplace.json
CODEOWNERS entry added
README table entry added under Verification section
Codex symlink created at .codex/skills/adversarial-verification
Install the plugin locally and invoke the skill on a real claim
Verify both decision-mode and proof-mode paths produce useful verdicts

🤖 Generated with Claude Code

CLAassistant · 2026-04-17T07:34:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.


Stress-tests claims, designs, and bug findings by dispatching two isolated sub-agents (advocate + skeptic) and synthesizing their arguments into a structured verdict. Counters sycophancy and single-agent agreement bias by forcing maximal disagreement before commit.

Two modes:
- Decision mode: free-form arguments organized by evaluation dimensions for approach/design choices
- Proof mode: N null hypotheses the skeptic proves and advocate refutes, for verifying bug findings and security claims

Includes SKILL.md + 5 reference docs:
- anti-patterns.md (10 common failure modes with diagnoses)
- decision-mode.md (structure for approach selection)
- proof-mode.md (N-null-hypothesis structure for finding verification)
- prompt-templates.md (advocate/skeptic templates enforcing anti-balance)
- synthesis.md (verdict table format and recommendation structure)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dguido · 2026-05-08T17:01:46Z

@claude review once

claude · 2026-05-08T17:09:49Z

+
+**Why it happens:** Weak prompt. The agent defaulted to balanced reasoning because nothing stopped it.
+
+**Fix:** Re-dispatch with a stronger prompt. See [prompt-templates.md](prompt-templates.md#re-dispatching-on-hedged-agents). The key phrase: "Do not acknowledge merit in the opposing position. Do not hedge." Do not accept a hedged response as valid output.


🟡 Nit: the four reference files chain to each other, which CLAUDE.md:178 explicitly disallows ("SKILL.md links to files, files don't chain to more files"). Specifically: anti-patterns.md:19 and synthesis.md:87 link to prompt-templates.md, and decision-mode.md:13 ↔ proof-mode.md:13 cross-link. Either inline the pointed-to content or have SKILL.md own the cross-doc nav. (Up-link to ../SKILL.md is fine.)

Extended reasoning...

What the bug is

CLAUDE.md:174-180 documents the progressive-disclosure rule for skills:

One level deep — SKILL.md links to files, files don't chain to more files.

Note: Directory depth is fine (references/guides/topic.md). Reference chains are not (SKILL.md → file1.md → file2.md where file1 references file2). The problem is chained references, not nested folders.

This new plugin has four sibling-to-sibling links inside references/ that form exactly the prohibited shape:

references/anti-patterns.md:19 → prompt-templates.md#re-dispatching-on-hedged-agents (chain: SKILL.md → anti-patterns.md → prompt-templates.md)

references/synthesis.md:87 → prompt-templates.md#re-dispatching-on-hedged-agents (chain: SKILL.md → synthesis.md → prompt-templates.md)

references/decision-mode.md:13 → proof-mode.md (chain: SKILL.md → decision-mode.md → proof-mode.md)

references/proof-mode.md:13 → decision-mode.md (chain: SKILL.md → proof-mode.md → decision-mode.md)

The fifth link, anti-patterns.md:49 → ../SKILL.md, is upward navigation back to the root and is fine.
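The link audit above can be approximated with a short Python sketch. This is an illustration, not a script from the repository: `sibling_links` is a hypothetical helper, and it assumes chained references appear as inline Markdown links to `.md` files, with links to `SKILL.md` treated as allowed up-navigation.

```python
import re

# Inline Markdown links of the form [text](target); reference-style links
# are not handled in this sketch.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def sibling_links(markdown: str) -> list[str]:
    """Return link targets in a reference file that point at another .md file."""
    hits = []
    for target in LINK_RE.findall(markdown):
        path = target.split("#", 1)[0]  # drop any #anchor
        if not path.endswith(".md"):
            continue  # anchors-only links, images, other file types
        if path.startswith(("http://", "https://")):
            continue  # external documentation is out of scope here
        if path.split("/")[-1] == "SKILL.md":
            continue  # up-link to the root is fine
        hits.append(target)
    return hits
```

Run over each file in `references/`, a non-empty result marks a file that chains to another file.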

Why this matters

A reader pulled into anti-patterns.md for a specific failure mode now has to follow another link to get the actionable prompt-fix in prompt-templates.md. Same for a reader in either mode doc who needs to switch modes. This is the second-level descent the rule is meant to prevent: the reader can't tell from SKILL.md alone what the full graph of mandatory navigation looks like.

Addressing the refutation

A counter-argument is that this pattern is already used in plugins/agentic-actions-auditor/ (multiple vector-*.md files reference foundations.md) and plugins/semgrep-rule-creator/ (workflow.md references quick-reference.md), and that CLAUDE.md's intent is really about progressive-disclosure depth (deeper-and-deeper learning chains), not navigational sideways links.

That argument has some merit — the cited examples are real, and the mutual decision-mode ↔ proof-mode link is genuinely "sideways" rather than "deeper." But it doesn't fully overcome the rule as written. CLAUDE.md doesn't say "don't make the reader descend to learn more" — it says "files don't chain to more files," with the explicit canonical example being "SKILL.md → file1.md → file2.md where file1 references file2." That is exactly what is happening in all four cases here. A new skill is the right place to apply the rule cleanly rather than perpetuate a pattern that drifted from spec elsewhere.

Step-by-step proof of the chain

Concrete walkthrough for chain #1:

User invokes the skill. SKILL.md:106-110 says: see anti-patterns.md for full failure modes.

User opens references/anti-patterns.md. Reads section "2. Hedged agents" at lines 13-19.

The "Fix" line at anti-patterns.md:19 points the user to prompt-templates.md#re-dispatching-on-hedged-agents for the actual remedial prompt phrasing.

User now has to open a third file (a sibling of anti-patterns.md) to complete the task that started at SKILL.md.

That's the SKILL.md → file1.md → file2.md chain CLAUDE.md prohibits.

How to fix

Two clean options:

Inline the cross-references. The mode-switching nudge in decision-mode.md:13/proof-mode.md:13 ("if claim type is X, use the other mode") is a one-liner; SKILL.md's Step 2 (lines 35-43) already has the mode-selection table, so the inline note in each mode doc is redundant and can be deleted. The "see prompt-templates.md for the re-dispatch prompt" pointers can either inline a one-sentence summary of the re-dispatch phrasing, or be removed (the prompt-templates file is already linked from SKILL.md:55).

Route nav through SKILL.md. Add a "Related references" sub-section in SKILL.md that maps situations to the right reference doc, so the references themselves never need to point at siblings.

Severity

Nit / convention. No functional impact; the skill works fine. Worth fixing on a fresh skill where the cleanup is cheap.

claude · 2026-05-08T17:09:49Z

+## When to use
+
+| Situation | Use this skill? |
+|-----------|----------------|
+| Choosing between 2+ technical approaches | YES |
+| Verifying a bug finding is real (not false positive) | YES |
+| Reviewing a design decision before commit | YES |
+| User asks "is this correct?" on non-trivial claim | YES |
+| Any claim you're inclined to agree with by default | YES — that's the tell |
+| Simple factual lookup ("what version is X?") | NO |
+| Obvious syntax error fix | NO |
+| User has already made the decision and is executing | NO |


🟡 The new SKILL.md has a single combined ## When to use section (line 16) with a YES/NO table instead of the two separate ## When to Use and ## When NOT to Use sections that CLAUDE.md (lines 153-163) requires for every SKILL.md. This is a pre-existing convention that ~60% of existing skills follow; the YES/NO table covers the same ground informationally, so this is a style/convention nit only — split it into two sections to match the house standard.

Extended reasoning...

The convention. CLAUDE.md lines 153-163 has a ### Required Sections subsection that explicitly states "Every SKILL.md must include" both ## When to Use and ## When NOT to Use as separate top-level headings. The new SKILL.md at plugins/adversarial-verification/skills/adversarial-verification/SKILL.md line 16 has ## When to use (note the lowercase "to use") containing a single YES/NO table — there is no separate ## When NOT to Use heading.

The content is present, just structured differently. The NO rows in the table ("Simple factual lookup", "Obvious syntax error fix", "User has already made the decision and is executing") cover the When-NOT-to-Use scenarios informationally. So this is purely a structural/heading convention violation, not a missing-information issue.

Why it matters anyway. The two-heading convention is the house standard documented in the contributor guide. Splitting them out makes it easier to grep/skim across the marketplace's many skills (grep -A5 "When NOT to Use" works uniformly), and roughly 44 of the 74 existing SKILL.md files (~60%) follow the explicit two-heading convention — including recent additions like zeroize-audit, sharp-edges, and mutation-testing. New additions should match.

Why this is a nit, not a blocker. (1) The information is fully present, just in table form. (2) The CLAUDE.md convention is not validated by CI — validate_codex_skills.py does not enforce section structure, so this slipped past the test plan's validation step. (3) About 30 of the 74 existing SKILL.md files (~40%) also lack a separate When-NOT-to-Use section, so the convention is unevenly enforced in the existing tree. (4) The submitter themselves marked the severity as nit.

Step-by-step proof.

Open CLAUDE.md and read lines 153-163. The ### Required Sections block contains a markdown code block showing both ## When to Use and ## When NOT to Use as separate sections preceded by "Every SKILL.md must include:".

Open plugins/adversarial-verification/skills/adversarial-verification/SKILL.md. Search for ## When NOT to Use — no match.

Search for ## When — only one match: line 16, ## When to use (lowercase "to use").

Lines 17-27 are a single table containing a YES/NO column. The NO rows are the When-NOT-to-Use content but live under the same heading as the YES rows.

Conclusion: the convention requires two separate headings; the file has one combined heading. Convention violated.
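Since the review notes that validate_codex_skills.py does not enforce section structure, the manual search above could be expressed as a small check. The following Python sketch is an assumption about how such a check might look, not part of the repository; `missing_required_headings` is a hypothetical name, and the match is deliberately case-sensitive so that `## When to use` does not satisfy `## When to Use`.

```python
# Headings quoted in the review as required by CLAUDE.md.
REQUIRED_HEADINGS = ("## When to Use", "## When NOT to Use")


def missing_required_headings(skill_md: str) -> list[str]:
    """Return the required level-2 headings that do not appear in the text."""
    # Collect level-2 headings exactly as written (trailing whitespace trimmed).
    # Headings inside fenced code blocks are also counted; a stricter check
    # would skip fenced regions.
    headings = {
        line.rstrip() for line in skill_md.splitlines() if line.startswith("## ")
    }
    return [h for h in REQUIRED_HEADINGS if h not in headings]
```

For the SKILL.md described here, both headings would be reported missing: the only `## When` heading uses different casing, and there is no separate NOT section.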

How to fix. Cheap edit. Rename ## When to use → ## When to Use (also fixes a casing inconsistency with the convention), keep the YES rows under it, then add a sibling ## When NOT to Use section with the NO rows (or convert them to a short bulleted list / brief prose). Approximately a 5-line restructure with no semantic change.

claude · 2026-05-08T17:09:49Z

+## Verdict rule for proof mode
+
+| All Ps REFUTED by advocate, NOT PROVED by skeptic | **CONFIRMED** — the finding is real |
+| Any P clearly PROVED by skeptic | **DISMISSED** — the finding is a false positive |


🟡 The 'Verdict rule for proof mode' block at proof-mode.md:47-51 uses GFM table syntax but is missing the required header row and |---|---| delimiter row, so it will render as plain text with literal pipe characters instead of a table. The fix is trivial: add a header line like | Condition | Verdict | followed by |-----------|---------| above the three existing rows. Severity nit — purely a documentation rendering issue, no functional impact.

Extended reasoning...

What the bug is

In plugins/adversarial-verification/skills/adversarial-verification/references/proof-mode.md, the section ## Verdict rule for proof mode (lines 47-51) contains three rows that begin with pipe characters but has no header row and no |---|---| delimiter row. Per the GFM spec, a valid table requires both a header line and a delimiter line — without them, the content is parsed as plain text and the pipes render literally.

How it manifests

When viewed on GitHub, in any standard CommonMark+GFM renderer, or in a Markdown previewer, the section displays as three lines of raw text with visible | characters rather than as a table with two columns. This makes the verdict rule visually broken and inconsistent with the well-formed tables elsewhere in the same file.

Why existing code does not prevent it

There is no Markdown linter wired into the repo to catch malformed GFM table syntax, and the file passed review because the content is still semantically parseable by an LLM consumer (which is the primary audience for skill reference files). The issue only surfaces on rendered output.

Step-by-step proof

Open plugins/adversarial-verification/skills/adversarial-verification/references/proof-mode.md on GitHub (or any GFM renderer).

Scroll to line 47, the heading ## Verdict rule for proof mode.

The next non-blank lines are:
| All Ps REFUTED by advocate, NOT PROVED by skeptic | **CONFIRMED** — the finding is real |
| Any P clearly PROVED by skeptic | **DISMISSED** — the finding is a false positive |
| Any P in dispute (both sides plausible) | **UNCERTAIN** — gather more evidence for that specific P before committing |

There is no preceding header row (e.g. | Condition | Verdict |) and no delimiter row (e.g. |-----------|---------|).

Per GFM, this fails the table production rule, so the renderer treats the three lines as a paragraph and shows the pipes literally.

Compare to the well-formed table at lines 21-22 of the same file — | # | Null hypothesis | What proves it | followed by |---|----------------|----------------| — and at lines 57-58 — | Null hypothesis | Skeptic says | Advocate says | Outcome | followed by |----------------|--------------|---------------|---------|. Both render correctly because they include both required rows.
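The review observes that no Markdown linter is wired in to catch this. As a rough illustration of what such a check could do, the Python sketch below flags runs of pipe-prefixed lines whose second line is not a delimiter row. It is a simplification of the GFM table rule, not a full parser: `headerless_pipe_blocks` is a hypothetical name, only rows that start with `|` are considered, and column counts are not compared.

```python
import re

# A delimiter row such as |---|:---:|---| (leading/trailing pipes optional).
DELIM_RE = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def headerless_pipe_blocks(markdown: str) -> list[int]:
    """Return 1-based line numbers where a run of pipe rows lacks a delimiter row."""
    lines = markdown.splitlines()
    bad = []
    i = 0
    while i < len(lines):
        if lines[i].lstrip().startswith("|"):
            start = i
            # Consume the whole run of consecutive pipe-prefixed lines.
            while i < len(lines) and lines[i].lstrip().startswith("|"):
                i += 1
            block = lines[start:i]
            # A table needs a header line followed immediately by a delimiter line.
            if len(block) < 2 or not DELIM_RE.match(block[1]):
                bad.append(start + 1)
        else:
            i += 1
    return bad
```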

Impact

Low — purely cosmetic. The reference file is consumed primarily by an LLM that can extract the verdict rule regardless of rendering, so skill behavior is unaffected. However, when a human reads the file on GitHub (e.g. during PR review or when authoring a related skill) the section appears broken and inconsistent with the rest of the document.

Fix

Add the missing header and delimiter rows immediately after the heading on line 47:

## Verdict rule for proof mode

| Condition | Verdict |
|-----------|---------|
| All Ps REFUTED by advocate, NOT PROVED by skeptic | **CONFIRMED** — the finding is real |
| Any P clearly PROVED by skeptic | **DISMISSED** — the finding is a false positive |
| Any P in dispute (both sides plausible) | **UNCERTAIN** — gather more evidence for that specific P before committing |
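Read as logic, the three-row verdict rule maps per-hypothesis outcomes to a single verdict. The Python sketch below is one possible reading, with hypothetical names (`Outcome`, `proof_mode_verdict`). The table does not say which row wins when one hypothesis is proved and another is in dispute; the sketch assumes DISMISSED takes precedence, which is an assumption rather than something the skill specifies.

```python
from enum import Enum


class Outcome(Enum):
    REFUTED = "refuted"    # advocate refuted it, skeptic did not prove it
    PROVED = "proved"      # skeptic clearly proved the null hypothesis
    DISPUTED = "disputed"  # both sides plausible


def proof_mode_verdict(outcomes: list[Outcome]) -> str:
    """Combine per-null-hypothesis outcomes into CONFIRMED, DISMISSED, or UNCERTAIN."""
    if not outcomes:
        raise ValueError("proof mode needs at least one null hypothesis")
    # Assumption: a single proved null hypothesis dismisses the finding,
    # even if other hypotheses are still in dispute.
    if any(o is Outcome.PROVED for o in outcomes):
        return "DISMISSED"
    if any(o is Outcome.DISPUTED for o in outcomes):
        return "UNCERTAIN"
    return "CONFIRMED"
```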

wizardengineer marked this pull request as ready for review April 17, 2026 07:35

wizardengineer requested a review from dguido as a code owner April 17, 2026 07:35

claude Bot reviewed Apr 17, 2026

View reviewed changes

wizardengineer marked this pull request as draft April 17, 2026 17:31

wizardengineer force-pushed the adversarial-verification-plugin branch from 7f05877 to 833dbef Compare April 17, 2026 17:46

Merge branch 'main' into adversarial-verification-plugin

593b27a

claude Bot reviewed May 8, 2026

View reviewed changes

wizardengineer wants to merge 2 commits into main from adversarial-verification-plugin
