Harden security + add validator, critique, chunking, caching by starc007 · Pull Request #2 · starc007/crev

starc007 · 2026-05-12T08:53:00Z

Summary

Hardens crev against common LLM-reviewer failure modes and 5 concrete security/correctness issues. 22 commits, one per change.

Security + correctness fixes

Gemini API key moved from URL query to x-goog-api-key header
OPENAI_BASE_URL validated (https only, rejects loopback/private hosts; override via CREV_ALLOW_INSECURE_BASE_URL)
crev update prompts before piping install script to sh, downloads fully first; CREV_UPDATE_YES=1 for non-interactive
Dead --file flag removed (was silently no-op)
Diff truncation now warns which files were dropped
Linter paths canonicalized before diff-match (no cross-matching src/main.rs vs tests/main.rs)
Ignore-glob patterns reject .. and absolute paths
Context walker capped at 1 MiB/file and 5,000 files
UTF-8-safe line-ref strip in history
API error bodies surface read-failures instead of empty strings
find_repo_root error includes searched path
Removed dead code paths

Review-quality features

validate::DiffIndex: drops or re-anchors findings whose line/file isn't in the diff; prefers Added over Context when re-anchoring; flags context-only findings
critique::run_critique: second LLM pass marks each finding KEEP/DROP and removes low-signal ones; opt-out --no-critique
Anthropic prompt caching: system + stable user-block marked cache_control: ephemeral; ~90% cost discount on repeated reviews of the same diff
prompt::chunk_diff_by_files: splits oversized diffs by file and reviews each chunk separately instead of dropping the biggest files
Source quotes on every accepted finding (terminal + JSON)
Prompt gets explicit grounding rules and severity rubric

Test plan

cargo build --release succeeds
crev review --staged on a small change works end-to-end with Ollama
crev review --staged --model claude-sonnet-4-6 exercises the cached Anthropic path
crev review --staged --no-critique skips the second pass and resumes live-streaming output
OPENAI_BASE_URL=http://localhost:8080 crev review --staged errors out; setting CREV_ALLOW_INSECURE_BASE_URL=1 accepts it
A diff larger than max_tokens triggers "diff too large for a single review pass; splitting into N chunks" and every file in the diff appears in the merged finding set
A finding citing a non-existent line gets dropped with a "line not in diff" note instead of being shown

🤖 Generated with Claude Code

Why: the API key was embedded in the URL query string, which leaks into proxy logs, error traces, and request middleware. Google's API supports header-based auth via x-goog-api-key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: OPENAI_BASE_URL was read from env with no validation, so a compromised env could redirect prompts (containing source code) to a malicious endpoint. Now reject non-https URLs and loopback/private hosts. Override with CREV_ALLOW_INSECURE_BASE_URL=1 for local proxies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: when an API call returned non-2xx and the body read also failed, the error message was "Anthropic API returned 401: " with no clue why. Now include the body-read error so transient network issues during error reporting don't swallow the real cause. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: --file was accepted by the CLI but the value was discarded (let _file = ...) and never passed to run_review. Users may have assumed per-file review worked when it silently didn't. Drop the flag rather than ship a misleading no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: 'crev update' piped a curl response straight into sh with no confirmation and no pinning to a release tag. A compromised main branch (or a transient curl failure mid-download) could execute arbitrary code. Now: download fully first, prompt unless stdin is non-interactive or CREV_UPDATE_YES is set, and show the URL up front so users can inspect it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: when a diff exceeded max_tokens, the truncation pass dropped the largest files silently. A user could ship a PR believing the whole change was reviewed when an important file was never sent to the LLM. Now print the dropped file list to stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: two adjacent sections in build_review_prompt_ctx were both labelled "7." — minor, but the comments are the only structural documentation of how the prompt is assembled, and the typo invites mistakes when adding new sections later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: filter_to_diff used ends_with/substring fallbacks to match linter output paths against diff file paths. A file named main.rs appearing in both src/main.rs and tests/main.rs could cross-match. Now resolve both to repo-relative canonical paths before equality comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: glob patterns in .reviewrc were applied without validation, so a pattern like '../../.ssh/*' or '/etc/*' would match outside the repo. This is committed-to-disk team config, so the blast radius is limited, but a single mistake shouldn't be able to traverse out of the repo at all. Now patterns must be repo-relative and free of '..' segments — invalid patterns are skipped with a warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: build() walked the entire repo to find called-function defs and related tests, with no upper bound. A pathological repo (huge generated files, hundreds of thousands of source files) could make a single review run for minutes. Now skip files > 1 MiB and stop walking after 5,000 source files have been considered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: regex_strip_line_refs indexed bytes directly and cast each byte to a char, which corrupted any multi-byte UTF-8 in a finding message before it was stored as a pattern. The loop was bounded so it didn't panic, but the result was garbled, which made pattern dedup miss recurrences whose only difference was a non-ASCII glyph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: find_repo_root canonicalizes start then walks up looking for a .git entry. The fallback used start.to_path_buf() when canonicalize failed, but the loop's .git check is still what ultimately decides — without that check the fallback would still be safe today, but the error message gave no hint where it looked. Include the original path in the error so users know what was searched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: parse_findings, print_findings, and compress_function were never reached at runtime — try_parse_finding_line and print_finding handle streaming output, and the prompt builder no longer compresses function bodies through a separate helper. Carrying dead branches makes the file harder to read and tempts future edits to revive half-supported paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: LLM reviewers regularly cite line numbers that don't exist in the diff — off-by-one from header lines, completely fabricated, or on the wrong file entirely. A single bad citation makes a finding useless because the user clicks through to unrelated code, distrust grows, and the recurring-pattern detector logs phantom entries. How: a new validate::DiffIndex builds a {file -> {visible lines}} map from the parsed diff (added + context lines, since both are shown to the model). Each parsed finding goes through the index: exact match accepted, within ±3 lines re-anchored to the nearest real line, anything further away dropped. Drop and re-anchor counts appear in the run summary so users can see how noisy a model is on their codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: the diff carries both Added lines (the actual change) and Context lines (surrounding code shown for orientation). A finding that lands on a Context line usually describes pre-existing code that the change merely touches — useful, but not what the user is asking to review. The old re-anchor logic treated both equally and could "fix" a hallucinated line by snapping it to neighbouring unchanged code, hiding the fact that the model missed the change. How: split the diff index into added/context sets per file. Re- anchor preference is now nearest Added line first, fall back to visible Context only if no Added line is in tolerance. Findings that finally land on Context are counted separately and surfaced in the run summary so the user knows they aren't about the change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: even strong models invent line numbers and reference symbols that aren't in the diff. The validator now catches the worst of these at runtime, but prevention is cheaper than cleanup — telling the model up front that fabricated citations are worse than silence shifts the distribution of its output. The severity rubric also gives the model concrete examples of what counts as HIGH vs MED so findings calibrate more consistently across runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: even with a tight system prompt and diff-aware validation, the first review pass still produces vague, generic, or duplicate findings — especially on Ollama models. Running the findings back through the model with a sharper "is each one specific, grounded, and actionable?" prompt cheaply trims that noise. How: a new critique module builds a follow-up prompt listing the parsed findings (skipping LGTM) and asks the same backend to emit KEEP/DROP per line. Unparsed decisions default to KEEP so a broken critique response can't hide a real finding. With critique enabled, streaming live findings is suppressed in favour of a single batch print at the end — otherwise the user would see a finding scroll by that gets retroactively dropped, which is more confusing than helpful. --no-critique restores the previous live-stream UX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: every review on the same repo replays the same system prompt, team rules, and output-format trailer — currently ~2k tokens that the cloud provider re-bills on every call. Pre-commit hooks amplify this: a developer who staged, reviewed, fixed, and re-staged a few times in the same session paid full price for the identical prefix every time. How: introduce PromptParts { system, cacheable, dynamic }. Default LlmBackend impl concatenates and falls back to the existing complete path, so Ollama/OpenAI/Gemini keep working unchanged. Anthropic overrides complete_parts to emit a structured request with cache_control: { type: "ephemeral" } on the system prompt and the cacheable user block. The dynamic block (diff + context + linter findings) stays uncached. Provider returns 90% cost discount on the cached portion within a 5-minute window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: when a diff is larger than max_tokens the truncator drops the biggest files entirely — they're silently never reviewed. For a sprawling PR that's the opposite of what the user wants. Splitting the diff into per-file chunks and reviewing each chunk separately keeps every file covered at the cost of one extra LLM call per chunk. How: prompt::chunk_diff_by_files greedily packs files into chunks that each render under budget; files that exceed budget on their own become their own chunk and the existing context truncator trims them further. run_review now loops over chunks, builds a context and prompt for each, streams the completion, parses findings, and merges them into a single validated + critiqued list. Linters run once on the whole diff and their findings are filtered to each chunk's files. Live streaming is suppressed in chunked mode for the same reason as critique-on mode: a finding from chunk 1 that gets dropped by critique after chunk 4 finishes shouldn't already be visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: a finding like "[HIGH] src/foo.rs:42 — overflow risk" forces the user to alt-tab to the editor just to see what line 42 is. With the diff already in memory, attaching the line content costs almost nothing and makes the report self-contained — the user can judge correctness without leaving the terminal. JSON output picks up the same field for downstream tools (the GitHub Actions workflow can quote the line in its PR comments). How: the validator now indexes line content alongside line numbers and writes it onto each accepted/re-anchored finding. print_finding renders the quote dimmed under the message with a left bar so it reads as supporting evidence, not as part of the description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: build_review_prompt_parts_ctx superseded it once chunking and caching landed — every call site now goes through the parts builder and concatenates only when a backend without cache support needs a plain string. The old function was a stale copy of the same logic and would drift out of sync if either path were edited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update docs/context.md and README.md to cover the new pipeline: - module map adds validate.rs and critique.rs - data flow shows the ignore filter, linters, chunking, validator, critique, and final print steps - Anthropic prompt-caching strategy explained alongside PromptParts - README gets a "Review quality" section covering diff-aware validation, self-critique, and source quotes, plus a "Performance" section for chunking + prompt caching, and the new --no-critique flag in the options listing - env vars (CREV_ALLOW_INSECURE_BASE_URL, CREV_UPDATE_YES) and the Gemini header-auth change are noted in the relevant sections - example output shows the new quote line and dropped-finding notes - JSON example includes the quote field Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-12T08:54:38Z

crev ✅ — no issues found

Why: with chunking, ContextBuilder::build runs once per chunk, and each call re-walked the entire src/lib/pkg/internal/cmd tree to resolve called-function definitions and find related tests. A 5-chunk review on a moderate repo paid 5x the parsing cost for the same data. How: RepoIndex walks the repo once (bounded by the same 1MiB/file and 5,000-file caps), groups every defined function by name, and stores an indexed list with file paths. ContextBuilder::build now takes &RepoIndex and resolves called fns and related tests via HashMap lookups — no further IO. The index is built once at the top of run_review and shared by reference across every chunk's context build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: lookup_called_function_defs returned the first definition of a name found anywhere in the repo. For ubiquitous names like new, default, or build, that's a coin flip — the model often saw a totally unrelated implementation as "the function being called", which made findings about the change harder to ground. How: every recorded call now carries the repo-relative path of the file it was called from. When a name has multiple definitions, the picker prefers same-file over same-directory over first-found. This is not real type resolution, but it kills the most common mis-attribution without paying for a full symbol table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: every called function was shipped to the model as its full body, even if it was 200 lines long. For most review questions the signature and the first ~12 lines plus the return are enough to answer "what does this called function do?" — the rest just burns budget and pushes the diff further down the prompt where the model attends less. How: ast::distill_function keeps short bodies as-is and replaces the middle of long ones with a "// ... N lines omitted ..." marker. The cost estimator in fit_to_budget uses the same distilled form so the budget pass agrees with what the prompt actually sends. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: types_used was rendered with every field of every struct, even when the diff only touched two of them. For type-heavy codebases this dominated the dynamic block — a 30-field DTO ate hundreds of tokens to tell the reviewer about fields it would never look at. How: collect the set of identifier-like tokens from the diff text, then keep only fields whose names appear in that set. Fully unused types collapse to `name { … }` so the model still knows the type exists; partially-used types render `kept, kept, /* +N more */` to signal there's more without listing it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: the old fit_to_budget filled greedily: types first, then called fns, then tests, with no per-category cap. On a diff that touched many types the budget was exhausted before any called function bodies were included, leaving the model without the single highest-signal piece of context. The reverse failure was also possible — many called fns starved out the type info. How: split the post-diff budget into 25/60/15 caps for types / called fns / tests. Each tier first spends from its own cap; whatever it doesn't use stays in a shared pool that the next tier can draw from. No category can starve the others, and underused tiers don't waste budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Why: estimate_tokens used `len/4`, but OpenAI/Anthropic tokenizers sit closer to 3.3-3.5 chars/token on source code. The under- estimate let the prompt builder pack ~14% more context than the model actually accepted, occasionally tripping the API's context limit and getting the response truncated. How: switch to `chars * 2 / 7` (≈3.5). Add a helper for callers that already have a char count so context.rs doesn't have to fake a string just to share the formula. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses two findings from the PR's own crev review of itself: 1. Spinner task could outlive its parent on the error path. When backend.complete*()? returned an Err, run_review unwound via ? without stopping the spinner — the JoinHandle was dropped without abort and the task kept printing frames until process exit. Wrap spinner state in a Spinner struct whose Drop calls abort() so any early return reaps it. 2. validate::DiffIndex::line_content fell back to a linear scan over every file when the LLM's path didn't match exactly. For diffs with many files this was O(n) per finding. Add a basename → key index so the common "model stripped or added a leading directory" case resolves in O(1); keep the linear suffix scan only for deeper mismatches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

starc007 · 2026-05-12T09:22:03Z

/crev

starc007 · 2026-05-12T09:22:37Z

/crev

starc007 and others added 22 commits May 12, 2026 11:38

starc007 and others added 7 commits May 12, 2026 14:40

starc007 merged commit 00c6651 into main May 12, 2026

starc007 deleted the feat/harden-and-improve branch May 12, 2026 09:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden security + add validator, critique, chunking, caching#2

Harden security + add validator, critique, chunking, caching#2
starc007 merged 29 commits into
mainfrom
feat/harden-and-improve

starc007 commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

starc007 commented May 12, 2026

Uh oh!

starc007 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

starc007 commented May 12, 2026

Summary

Security + correctness fixes

Review-quality features

Test plan

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

starc007 commented May 12, 2026

Uh oh!

starc007 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 12, 2026 •

edited

Loading