Harden security + add validator, critique, chunking, caching#2
Merged
Conversation
Why: the API key was embedded in the URL query string, which leaks into proxy logs, error traces, and request middleware. Google's API supports header-based auth via x-goog-api-key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: OPENAI_BASE_URL was read from env with no validation, so a compromised env could redirect prompts (containing source code) to a malicious endpoint. Now reject non-https URLs and loopback/private hosts. Override with CREV_ALLOW_INSECURE_BASE_URL=1 for local proxies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when an API call returned non-2xx and the body read also failed, the error message was "Anthropic API returned 401: " with no clue why. Now include the body-read error so transient network issues during error reporting don't swallow the real cause. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: --file was accepted by the CLI but the value was discarded (let _file = ...) and never passed to run_review. Users may have assumed per-file review worked when it silently didn't. Drop the flag rather than ship a misleading no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: 'crev update' piped a curl response straight into sh with no confirmation and no pinning to a release tag. A compromised main branch (or a transient curl failure mid-download) could execute arbitrary code. Now: download fully first, prompt unless stdin is non-interactive or CREV_UPDATE_YES is set, and show the URL up front so users can inspect it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when a diff exceeded max_tokens, the truncation pass dropped the largest files silently. A user could ship a PR believing the whole change was reviewed when an important file was never sent to the LLM. Now print the dropped file list to stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: two adjacent sections in build_review_prompt_ctx were both labelled "7." — minor, but the comments are the only structural documentation of how the prompt is assembled, and the typo invites mistakes when adding new sections later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: filter_to_diff used ends_with/substring fallbacks to match linter output paths against diff file paths. A file named main.rs appearing in both src/main.rs and tests/main.rs could cross-match. Now resolve both to repo-relative canonical paths before equality comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: glob patterns in .reviewrc were applied without validation, so a pattern like '../../.ssh/*' or '/etc/*' would match outside the repo. This is committed-to-disk team config, so the blast radius is limited, but a single mistake shouldn't be able to traverse out of the repo at all. Now patterns must be repo-relative and free of '..' segments — invalid patterns are skipped with a warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: build() walked the entire repo to find called-function defs and related tests, with no upper bound. A pathological repo (huge generated files, hundreds of thousands of source files) could make a single review run for minutes. Now skip files > 1 MiB and stop walking after 5,000 source files have been considered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: regex_strip_line_refs indexed bytes directly and cast each byte to a char, which corrupted any multi-byte UTF-8 in a finding message before it was stored as a pattern. The loop was bounded so it didn't panic, but the result was garbled, which made pattern dedup miss recurrences whose only difference was a non-ASCII glyph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: find_repo_root canonicalizes start then walks up looking for a .git entry. The fallback used start.to_path_buf() when canonicalize failed, but the loop's .git check is still what ultimately decides — without that check the fallback would still be safe today, but the error message gave no hint where it looked. Include the original path in the error so users know what was searched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: parse_findings, print_findings, and compress_function were never reached at runtime — try_parse_finding_line and print_finding handle streaming output, and the prompt builder no longer compresses function bodies through a separate helper. Carrying dead branches makes the file harder to read and tempts future edits to revive half-supported paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: LLM reviewers regularly cite line numbers that don't exist in
the diff — off-by-one from header lines, completely fabricated, or
on the wrong file entirely. A single bad citation makes a finding
useless because the user clicks through to unrelated code, distrust
grows, and the recurring-pattern detector logs phantom entries.
How: a new validate::DiffIndex builds a {file -> {visible lines}}
map from the parsed diff (added + context lines, since both are
shown to the model). Each parsed finding goes through the index:
exact match accepted, within ±3 lines re-anchored to the nearest
real line, anything further away dropped. Drop and re-anchor counts
appear in the run summary so users can see how noisy a model is on
their codebase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the diff carries both Added lines (the actual change) and Context lines (surrounding code shown for orientation). A finding that lands on a Context line usually describes pre-existing code that the change merely touches — useful, but not what the user is asking to review. The old re-anchor logic treated both equally and could "fix" a hallucinated line by snapping it to neighbouring unchanged code, hiding the fact that the model missed the change. How: split the diff index into added/context sets per file. Re- anchor preference is now nearest Added line first, fall back to visible Context only if no Added line is in tolerance. Findings that finally land on Context are counted separately and surfaced in the run summary so the user knows they aren't about the change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: even strong models invent line numbers and reference symbols that aren't in the diff. The validator now catches the worst of these at runtime, but prevention is cheaper than cleanup — telling the model up front that fabricated citations are worse than silence shifts the distribution of its output. The severity rubric also gives the model concrete examples of what counts as HIGH vs MED so findings calibrate more consistently across runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: even with a tight system prompt and diff-aware validation, the first review pass still produces vague, generic, or duplicate findings — especially on Ollama models. Running the findings back through the model with a sharper "is each one specific, grounded, and actionable?" prompt cheaply trims that noise. How: a new critique module builds a follow-up prompt listing the parsed findings (skipping LGTM) and asks the same backend to emit KEEP/DROP per line. Unparsed decisions default to KEEP so a broken critique response can't hide a real finding. With critique enabled, streaming live findings is suppressed in favour of a single batch print at the end — otherwise the user would see a finding scroll by that gets retroactively dropped, which is more confusing than helpful. --no-critique restores the previous live-stream UX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: every review on the same repo replays the same system prompt,
team rules, and output-format trailer — currently ~2k tokens that
the cloud provider re-bills on every call. Pre-commit hooks
amplify this: a developer who staged, reviewed, fixed, and
re-staged a few times in the same session paid full price for the
identical prefix every time.
How: introduce PromptParts { system, cacheable, dynamic }. Default
LlmBackend impl concatenates and falls back to the existing complete
path, so Ollama/OpenAI/Gemini keep working unchanged. Anthropic
overrides complete_parts to emit a structured request with
cache_control: { type: "ephemeral" } on the system prompt and the
cacheable user block. The dynamic block (diff + context + linter
findings) stays uncached. Provider returns 90% cost discount on
the cached portion within a 5-minute window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: when a diff is larger than max_tokens the truncator drops the biggest files entirely — they're silently never reviewed. For a sprawling PR that's the opposite of what the user wants. Splitting the diff into per-file chunks and reviewing each chunk separately keeps every file covered at the cost of one extra LLM call per chunk. How: prompt::chunk_diff_by_files greedily packs files into chunks that each render under budget; files that exceed budget on their own become their own chunk and the existing context truncator trims them further. run_review now loops over chunks, builds a context and prompt for each, streams the completion, parses findings, and merges them into a single validated + critiqued list. Linters run once on the whole diff and their findings are filtered to each chunk's files. Live streaming is suppressed in chunked mode for the same reason as critique-on mode: a finding from chunk 1 that gets dropped by critique after chunk 4 finishes shouldn't already be visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: a finding like "[HIGH] src/foo.rs:42 — overflow risk" forces the user to alt-tab to the editor just to see what line 42 is. With the diff already in memory, attaching the line content costs almost nothing and makes the report self-contained — the user can judge correctness without leaving the terminal. JSON output picks up the same field for downstream tools (the GitHub Actions workflow can quote the line in its PR comments). How: the validator now indexes line content alongside line numbers and writes it onto each accepted/re-anchored finding. print_finding renders the quote dimmed under the message with a left bar so it reads as supporting evidence, not as part of the description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: build_review_prompt_parts_ctx superseded it once chunking and caching landed — every call site now goes through the parts builder and concatenates only when a backend without cache support needs a plain string. The old function was a stale copy of the same logic and would drift out of sync if either path were edited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update docs/context.md and README.md to cover the new pipeline: - module map adds validate.rs and critique.rs - data flow shows the ignore filter, linters, chunking, validator, critique, and final print steps - Anthropic prompt-caching strategy explained alongside PromptParts - README gets a "Review quality" section covering diff-aware validation, self-critique, and source quotes, plus a "Performance" section for chunking + prompt caching, and the new --no-critique flag in the options listing - env vars (CREV_ALLOW_INSECURE_BASE_URL, CREV_UPDATE_YES) and the Gemini header-auth change are noted in the relevant sections - example output shows the new quote line and dropped-finding notes - JSON example includes the quote field Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
crev ✅ — no issues found |
Why: with chunking, ContextBuilder::build runs once per chunk, and each call re-walked the entire src/lib/pkg/internal/cmd tree to resolve called-function definitions and find related tests. A 5-chunk review on a moderate repo paid 5x the parsing cost for the same data. How: RepoIndex walks the repo once (bounded by the same 1MiB/file and 5,000-file caps), groups every defined function by name, and stores an indexed list with file paths. ContextBuilder::build now takes &RepoIndex and resolves called fns and related tests via HashMap lookups — no further IO. The index is built once at the top of run_review and shared by reference across every chunk's context build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: lookup_called_function_defs returned the first definition of a name found anywhere in the repo. For ubiquitous names like new, default, or build, that's a coin flip — the model often saw a totally unrelated implementation as "the function being called", which made findings about the change harder to ground. How: every recorded call now carries the repo-relative path of the file it was called from. When a name has multiple definitions, the picker prefers same-file over same-directory over first-found. This is not real type resolution, but it kills the most common mis-attribution without paying for a full symbol table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: every called function was shipped to the model as its full body, even if it was 200 lines long. For most review questions the signature and the first ~12 lines plus the return are enough to answer "what does this called function do?" — the rest just burns budget and pushes the diff further down the prompt where the model attends less. How: ast::distill_function keeps short bodies as-is and replaces the middle of long ones with a "// ... N lines omitted ..." marker. The cost estimator in fit_to_budget uses the same distilled form so the budget pass agrees with what the prompt actually sends. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: types_used was rendered with every field of every struct,
even when the diff only touched two of them. For type-heavy
codebases this dominated the dynamic block — a 30-field DTO ate
hundreds of tokens to tell the reviewer about fields it would
never look at.
How: collect the set of identifier-like tokens from the diff text,
then keep only fields whose names appear in that set. Fully unused
types collapse to `name { … }` so the model still knows the type
exists; partially-used types render `kept, kept, /* +N more */`
to signal there's more without listing it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the old fit_to_budget filled greedily: types first, then called fns, then tests, with no per-category cap. On a diff that touched many types the budget was exhausted before any called function bodies were included, leaving the model without the single highest-signal piece of context. The reverse failure was also possible — many called fns starved out the type info. How: split the post-diff budget into 25/60/15 caps for types / called fns / tests. Each tier first spends from its own cap; whatever it doesn't use stays in a shared pool that the next tier can draw from. No category can starve the others, and underused tiers don't waste budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: estimate_tokens used `len/4`, but OpenAI/Anthropic tokenizers sit closer to 3.3-3.5 chars/token on source code. The under- estimate let the prompt builder pack ~14% more context than the model actually accepted, occasionally tripping the API's context limit and getting the response truncated. How: switch to `chars * 2 / 7` (≈3.5). Add a helper for callers that already have a char count so context.rs doesn't have to fake a string just to share the formula. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two findings from the PR's own crev review of itself: 1. Spinner task could outlive its parent on the error path. When backend.complete*()? returned an Err, run_review unwound via ? without stopping the spinner — the JoinHandle was dropped without abort and the task kept printing frames until process exit. Wrap spinner state in a Spinner struct whose Drop calls abort() so any early return reaps it. 2. validate::DiffIndex::line_content fell back to a linear scan over every file when the LLM's path didn't match exactly. For diffs with many files this was O(n) per finding. Add a basename → key index so the common "model stripped or added a leading directory" case resolves in O(1); keep the linear suffix scan only for deeper mismatches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
|
/crev |
1 similar comment
Owner
Author
|
/crev |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens crev against common LLM-reviewer failure modes and 5 concrete security/correctness issues. 22 commits, one per change.
Security + correctness fixes
x-goog-api-keyheaderOPENAI_BASE_URLvalidated (https only, rejects loopback/private hosts; override viaCREV_ALLOW_INSECURE_BASE_URL)crev updateprompts before piping install script tosh, downloads fully first;CREV_UPDATE_YES=1for non-interactive--fileflag removed (was silently no-op)src/main.rsvstests/main.rs)..and absolute pathsfind_repo_rooterror includes searched pathReview-quality features
validate::DiffIndex: drops or re-anchors findings whose line/file isn't in the diff; prefers Added over Context when re-anchoring; flags context-only findingscritique::run_critique: second LLM pass marks each finding KEEP/DROP and removes low-signal ones; opt-out--no-critiquecache_control: ephemeral; ~90% cost discount on repeated reviews of the same diffprompt::chunk_diff_by_files: splits oversized diffs by file and reviews each chunk separately instead of dropping the biggest filesTest plan
cargo build --releasesucceedscrev review --stagedon a small change works end-to-end with Ollamacrev review --staged --model claude-sonnet-4-6exercises the cached Anthropic pathcrev review --staged --no-critiqueskips the second pass and resumes live-streaming outputOPENAI_BASE_URL=http://localhost:8080 crev review --stagederrors out; settingCREV_ALLOW_INSECURE_BASE_URL=1accepts itmax_tokenstriggers"diff too large for a single review pass; splitting into N chunks"and every file in the diff appears in the merged finding set"line not in diff"note instead of being shown🤖 Generated with Claude Code