Skip to content

Harden security + add validator, critique, chunking, caching#2

Merged
starc007 merged 29 commits into
mainfrom
feat/harden-and-improve
May 12, 2026
Merged

Harden security + add validator, critique, chunking, caching#2
starc007 merged 29 commits into
mainfrom
feat/harden-and-improve

Conversation

@starc007

Copy link
Copy Markdown
Owner

Summary

Hardens crev against common LLM-reviewer failure modes and 5 concrete security/correctness issues. 22 commits, one per change.

Security + correctness fixes

  • Gemini API key moved from URL query to x-goog-api-key header
  • OPENAI_BASE_URL validated (https only, rejects loopback/private hosts; override via CREV_ALLOW_INSECURE_BASE_URL)
  • crev update prompts before piping install script to sh, downloads fully first; CREV_UPDATE_YES=1 for non-interactive
  • Dead --file flag removed (was silently no-op)
  • Diff truncation now warns which files were dropped
  • Linter paths canonicalized before diff-match (no cross-matching src/main.rs vs tests/main.rs)
  • Ignore-glob patterns reject .. and absolute paths
  • Context walker capped at 1 MiB/file and 5,000 files
  • UTF-8-safe line-ref strip in history
  • API error bodies surface read-failures instead of empty strings
  • find_repo_root error includes searched path
  • Removed dead code paths

Review-quality features

  • validate::DiffIndex: drops or re-anchors findings whose line/file isn't in the diff; prefers Added over Context when re-anchoring; flags context-only findings
  • critique::run_critique: second LLM pass marks each finding KEEP/DROP and removes low-signal ones; opt-out --no-critique
  • Anthropic prompt caching: system + stable user-block marked cache_control: ephemeral; ~90% cost discount on repeated reviews of the same diff
  • prompt::chunk_diff_by_files: splits oversized diffs by file and reviews each chunk separately instead of dropping the biggest files
  • Source quotes on every accepted finding (terminal + JSON)
  • Prompt gets explicit grounding rules and severity rubric

Test plan

  • cargo build --release succeeds
  • crev review --staged on a small change works end-to-end with Ollama
  • crev review --staged --model claude-sonnet-4-6 exercises the cached Anthropic path
  • crev review --staged --no-critique skips the second pass and resumes live-streaming output
  • OPENAI_BASE_URL=http://localhost:8080 crev review --staged errors out; setting CREV_ALLOW_INSECURE_BASE_URL=1 accepts it
  • A diff larger than max_tokens triggers "diff too large for a single review pass; splitting into N chunks" and every file in the diff appears in the merged finding set
  • A finding citing a non-existent line gets dropped with a "line not in diff" note instead of being shown

🤖 Generated with Claude Code

starc007 and others added 22 commits May 12, 2026 11:38
Why: the API key was embedded in the URL query string, which leaks
into proxy logs, error traces, and request middleware. Google's API
supports header-based auth via x-goog-api-key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: OPENAI_BASE_URL was read from env with no validation, so a
compromised env could redirect prompts (containing source code) to
a malicious endpoint. Now reject non-https URLs and loopback/private
hosts. Override with CREV_ALLOW_INSECURE_BASE_URL=1 for local proxies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when an API call returned non-2xx and the body read also failed,
the error message was "Anthropic API returned 401: " with no clue
why. Now include the body-read error so transient network issues
during error reporting don't swallow the real cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: --file was accepted by the CLI but the value was discarded
(let _file = ...) and never passed to run_review. Users may have
assumed per-file review worked when it silently didn't. Drop the
flag rather than ship a misleading no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: 'crev update' piped a curl response straight into sh with no
confirmation and no pinning to a release tag. A compromised main
branch (or a transient curl failure mid-download) could execute
arbitrary code. Now: download fully first, prompt unless stdin is
non-interactive or CREV_UPDATE_YES is set, and show the URL up
front so users can inspect it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when a diff exceeded max_tokens, the truncation pass dropped
the largest files silently. A user could ship a PR believing the
whole change was reviewed when an important file was never sent to
the LLM. Now print the dropped file list to stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: two adjacent sections in build_review_prompt_ctx were both
labelled "7." — minor, but the comments are the only structural
documentation of how the prompt is assembled, and the typo invites
mistakes when adding new sections later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: filter_to_diff used ends_with/substring fallbacks to match
linter output paths against diff file paths. A file named main.rs
appearing in both src/main.rs and tests/main.rs could cross-match.
Now resolve both to repo-relative canonical paths before equality
comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: glob patterns in .reviewrc were applied without validation, so
a pattern like '../../.ssh/*' or '/etc/*' would match outside the
repo. This is committed-to-disk team config, so the blast radius is
limited, but a single mistake shouldn't be able to traverse out of
the repo at all. Now patterns must be repo-relative and free of '..'
segments — invalid patterns are skipped with a warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: build() walked the entire repo to find called-function defs
and related tests, with no upper bound. A pathological repo (huge
generated files, hundreds of thousands of source files) could make
a single review run for minutes. Now skip files > 1 MiB and stop
walking after 5,000 source files have been considered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: regex_strip_line_refs indexed bytes directly and cast each
byte to a char, which corrupted any multi-byte UTF-8 in a finding
message before it was stored as a pattern. The loop was bounded so
it didn't panic, but the result was garbled, which made pattern
dedup miss recurrences whose only difference was a non-ASCII glyph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: find_repo_root canonicalizes start then walks up looking for
a .git entry. The fallback used start.to_path_buf() when
canonicalize failed, but the loop's .git check is still what
ultimately decides — without that check the fallback would still
be safe today, but the error message gave no hint where it looked.
Include the original path in the error so users know what was
searched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: parse_findings, print_findings, and compress_function were
never reached at runtime — try_parse_finding_line and print_finding
handle streaming output, and the prompt builder no longer compresses
function bodies through a separate helper. Carrying dead branches
makes the file harder to read and tempts future edits to revive
half-supported paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: LLM reviewers regularly cite line numbers that don't exist in
the diff — off-by-one from header lines, completely fabricated, or
on the wrong file entirely. A single bad citation makes a finding
useless because the user clicks through to unrelated code, distrust
grows, and the recurring-pattern detector logs phantom entries.

How: a new validate::DiffIndex builds a {file -> {visible lines}}
map from the parsed diff (added + context lines, since both are
shown to the model). Each parsed finding goes through the index:
exact match accepted, within ±3 lines re-anchored to the nearest
real line, anything further away dropped. Drop and re-anchor counts
appear in the run summary so users can see how noisy a model is on
their codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the diff carries both Added lines (the actual change) and
Context lines (surrounding code shown for orientation). A finding
that lands on a Context line usually describes pre-existing code
that the change merely touches — useful, but not what the user is
asking to review. The old re-anchor logic treated both equally and
could "fix" a hallucinated line by snapping it to neighbouring
unchanged code, hiding the fact that the model missed the change.

How: split the diff index into added/context sets per file. Re-
anchor preference is now nearest Added line first, fall back to
visible Context only if no Added line is in tolerance. Findings
that finally land on Context are counted separately and surfaced
in the run summary so the user knows they aren't about the change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: even strong models invent line numbers and reference symbols
that aren't in the diff. The validator now catches the worst of
these at runtime, but prevention is cheaper than cleanup — telling
the model up front that fabricated citations are worse than silence
shifts the distribution of its output. The severity rubric also
gives the model concrete examples of what counts as HIGH vs MED so
findings calibrate more consistently across runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: even with a tight system prompt and diff-aware validation, the
first review pass still produces vague, generic, or duplicate
findings — especially on Ollama models. Running the findings back
through the model with a sharper "is each one specific, grounded,
and actionable?" prompt cheaply trims that noise.

How: a new critique module builds a follow-up prompt listing the
parsed findings (skipping LGTM) and asks the same backend to emit
KEEP/DROP per line. Unparsed decisions default to KEEP so a broken
critique response can't hide a real finding. With critique enabled,
streaming live findings is suppressed in favour of a single batch
print at the end — otherwise the user would see a finding scroll by
that gets retroactively dropped, which is more confusing than
helpful. --no-critique restores the previous live-stream UX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: every review on the same repo replays the same system prompt,
team rules, and output-format trailer — currently ~2k tokens that
the cloud provider re-bills on every call. Pre-commit hooks
amplify this: a developer who staged, reviewed, fixed, and
re-staged a few times in the same session paid full price for the
identical prefix every time.

How: introduce PromptParts { system, cacheable, dynamic }. Default
LlmBackend impl concatenates and falls back to the existing complete
path, so Ollama/OpenAI/Gemini keep working unchanged. Anthropic
overrides complete_parts to emit a structured request with
cache_control: { type: "ephemeral" } on the system prompt and the
cacheable user block. The dynamic block (diff + context + linter
findings) stays uncached. Provider returns 90% cost discount on
the cached portion within a 5-minute window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when a diff is larger than max_tokens the truncator drops the
biggest files entirely — they're silently never reviewed. For a
sprawling PR that's the opposite of what the user wants. Splitting
the diff into per-file chunks and reviewing each chunk separately
keeps every file covered at the cost of one extra LLM call per
chunk.

How: prompt::chunk_diff_by_files greedily packs files into chunks
that each render under budget; files that exceed budget on their
own become their own chunk and the existing context truncator trims
them further. run_review now loops over chunks, builds a context
and prompt for each, streams the completion, parses findings, and
merges them into a single validated + critiqued list. Linters run
once on the whole diff and their findings are filtered to each
chunk's files. Live streaming is suppressed in chunked mode for the
same reason as critique-on mode: a finding from chunk 1 that gets
dropped by critique after chunk 4 finishes shouldn't already be
visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: a finding like "[HIGH] src/foo.rs:42 — overflow risk" forces
the user to alt-tab to the editor just to see what line 42 is. With
the diff already in memory, attaching the line content costs almost
nothing and makes the report self-contained — the user can judge
correctness without leaving the terminal. JSON output picks up the
same field for downstream tools (the GitHub Actions workflow can
quote the line in its PR comments).

How: the validator now indexes line content alongside line numbers
and writes it onto each accepted/re-anchored finding. print_finding
renders the quote dimmed under the message with a left bar so it
reads as supporting evidence, not as part of the description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: build_review_prompt_parts_ctx superseded it once chunking and
caching landed — every call site now goes through the parts builder
and concatenates only when a backend without cache support needs a
plain string. The old function was a stale copy of the same logic
and would drift out of sync if either path were edited.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update docs/context.md and README.md to cover the new pipeline:
- module map adds validate.rs and critique.rs
- data flow shows the ignore filter, linters, chunking, validator,
  critique, and final print steps
- Anthropic prompt-caching strategy explained alongside PromptParts
- README gets a "Review quality" section covering diff-aware
  validation, self-critique, and source quotes, plus a
  "Performance" section for chunking + prompt caching, and the new
  --no-critique flag in the options listing
- env vars (CREV_ALLOW_INSECURE_BASE_URL, CREV_UPDATE_YES) and the
  Gemini header-auth change are noted in the relevant sections
- example output shows the new quote line and dropped-finding notes
- JSON example includes the quote field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown

crev ✅ — no issues found

starc007 and others added 7 commits May 12, 2026 14:40
Why: with chunking, ContextBuilder::build runs once per chunk, and
each call re-walked the entire src/lib/pkg/internal/cmd tree to
resolve called-function definitions and find related tests. A
5-chunk review on a moderate repo paid 5x the parsing cost for the
same data.

How: RepoIndex walks the repo once (bounded by the same 1MiB/file
and 5,000-file caps), groups every defined function by name, and
stores an indexed list with file paths. ContextBuilder::build now
takes &RepoIndex and resolves called fns and related tests via
HashMap lookups — no further IO. The index is built once at the
top of run_review and shared by reference across every chunk's
context build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: lookup_called_function_defs returned the first definition of a
name found anywhere in the repo. For ubiquitous names like new,
default, or build, that's a coin flip — the model often saw a
totally unrelated implementation as "the function being called",
which made findings about the change harder to ground.

How: every recorded call now carries the repo-relative path of the
file it was called from. When a name has multiple definitions, the
picker prefers same-file over same-directory over first-found. This
is not real type resolution, but it kills the most common
mis-attribution without paying for a full symbol table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: every called function was shipped to the model as its full
body, even if it was 200 lines long. For most review questions the
signature and the first ~12 lines plus the return are enough to
answer "what does this called function do?" — the rest just burns
budget and pushes the diff further down the prompt where the model
attends less.

How: ast::distill_function keeps short bodies as-is and replaces
the middle of long ones with a "// ... N lines omitted ..."
marker. The cost estimator in fit_to_budget uses the same distilled
form so the budget pass agrees with what the prompt actually sends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: types_used was rendered with every field of every struct,
even when the diff only touched two of them. For type-heavy
codebases this dominated the dynamic block — a 30-field DTO ate
hundreds of tokens to tell the reviewer about fields it would
never look at.

How: collect the set of identifier-like tokens from the diff text,
then keep only fields whose names appear in that set. Fully unused
types collapse to `name { … }` so the model still knows the type
exists; partially-used types render `kept, kept, /* +N more */`
to signal there's more without listing it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the old fit_to_budget filled greedily: types first, then
called fns, then tests, with no per-category cap. On a diff that
touched many types the budget was exhausted before any called
function bodies were included, leaving the model without the
single highest-signal piece of context. The reverse failure was
also possible — many called fns starved out the type info.

How: split the post-diff budget into 25/60/15 caps for types /
called fns / tests. Each tier first spends from its own cap;
whatever it doesn't use stays in a shared pool that the next tier
can draw from. No category can starve the others, and underused
tiers don't waste budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: estimate_tokens used `len/4`, but OpenAI/Anthropic tokenizers
sit closer to 3.3-3.5 chars/token on source code. The under-
estimate let the prompt builder pack ~14% more context than the
model actually accepted, occasionally tripping the API's context
limit and getting the response truncated.

How: switch to `chars * 2 / 7` (≈3.5). Add a helper for callers
that already have a char count so context.rs doesn't have to fake
a string just to share the formula.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two findings from the PR's own crev review of itself:

1. Spinner task could outlive its parent on the error path. When
   backend.complete*()? returned an Err, run_review unwound via
   ? without stopping the spinner — the JoinHandle was dropped
   without abort and the task kept printing frames until process
   exit. Wrap spinner state in a Spinner struct whose Drop calls
   abort() so any early return reaps it.

2. validate::DiffIndex::line_content fell back to a linear scan
   over every file when the LLM's path didn't match exactly. For
   diffs with many files this was O(n) per finding. Add a
   basename → key index so the common "model stripped or added a
   leading directory" case resolves in O(1); keep the linear
   suffix scan only for deeper mismatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@starc007

Copy link
Copy Markdown
Owner Author

/crev

1 similar comment
@starc007

Copy link
Copy Markdown
Owner Author

/crev

@starc007 starc007 merged commit 00c6651 into main May 12, 2026
@starc007 starc007 deleted the feat/harden-and-improve branch May 12, 2026 09:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant