Evaluating the most popular AI Code Review Tools

In this organization, we evaluate the review quality of the seven most popular tools, each in its default settings as of Nov 14, 2025: Augment Code Review, Cursor Bugbot, OpenAI Codex Code Review, CodeRabbit, Greptile, GitHub Copilot, and Graphite.

Dataset

For this comparison we use the Greptile AI Code Review benchmark dataset (https://www.greptile.com/benchmarks), the only public dataset for code reviews. It consists of 5 mid-to-large open source repositories with 10 pull requests (PRs) each and covers a variety of languages. For details on how the dataset was constructed, refer to https://www.greptile.com/benchmarks.

Benchmark	LOC	Files	Primary Language	PRs Chosen	Golden Comments
Sentry	1,037,299	6,697	Python	10	35
Cal.com	710,499	6,905	TypeScript	10	32
Grafana	2,657,238	16,722	Go	10	24
Discourse	331,960	3,199	Ruby	10	29
Keycloak	905,337	9,319	Java	10	25

Updating the Dataset

We fix one major problem with the original Greptile benchmark set: it has exactly 1 golden comment per PR. However, each PR contains many valid bugs that are missing from the golden comment set, and without a full set of golden comments one cannot compute an accurate precision or recall score. To fix this, we manually reviewed the PRs, manually reviewed the issues pointed out by the 7 tools, used Auggie to improve our understanding of the codebase, and added the missing valid golden comments. One major contribution of this work is a more complete set of ground-truth comments for the benchmark.

Computing Scores

The standard way to measure the quality (or accuracy) of an AI Code Review tool (or any automated tool that catches bugs, security vulnerabilities, etc.) is to compare its output against a set of known or expected comments for a given PR, called golden comments. If a tool's comment matches a golden comment, it is a true positive; if it matches none, it is a false positive. If the tool misses a golden comment, that is a false negative. We compare two comments simply by asking an LLM whether they refer to the same underlying issue; this comparison works well in practice. It is better than comparing by file or line number, which can be problematic for multi-line or multi-file issues.
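
The matching step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `match_comments` and the `same_issue` callable are hypothetical, and `same_issue` stands in for the LLM judge. The sketch also assumes that each golden comment can be matched at most once, so a second tool comment about an already-matched issue counts as a false positive; the source does not specify how such duplicates are handled.

```python
from typing import Callable, List, Tuple


def match_comments(
    tool_comments: List[str],
    golden_comments: List[str],
    same_issue: Callable[[str, str], bool],
) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    `same_issue` is a hypothetical stand-in for the LLM judge that decides
    whether two comments refer to the same underlying issue.
    """
    matched_golden = set()  # indices of golden comments already caught
    tp = 0
    fp = 0
    for comment in tool_comments:
        # Find the first golden comment not yet matched that the judge accepts.
        hit = next(
            (
                i
                for i, golden in enumerate(golden_comments)
                if i not in matched_golden and same_issue(comment, golden)
            ),
            None,
        )
        if hit is None:
            fp += 1  # assumption: unmatched (or duplicate) comments are false positives
        else:
            matched_golden.add(hit)
            tp += 1
    # Golden comments that no tool comment matched are false negatives.
    fn = len(golden_comments) - len(matched_golden)
    return tp, fp, fn


# Toy judge for illustration only: case-insensitive equality instead of an LLM.
def toy_judge(a: str, b: str) -> bool:
    return a.lower() == b.lower()
```

In practice `same_issue` would wrap a call to whichever LLM is used as the judge; passing it in as a callable keeps the counting logic testable without network access.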

There are three main metrics we use here (higher score is better):

Precision: out of the comments posted, the percentage that were true positives (true-positives / (true-positives + false-positives)). It is one minus the false discovery rate. Research shows that automated tools with low precision are seen as untrustworthy by developers and ignored entirely.
Recall: out of the golden comments, the percentage that were caught by the tool (true-positives / (true-positives + false-negatives)). A high recall is necessary to match the breadth of issues reviewed by a human developer.
F-score: the harmonic mean of Precision and Recall. It gives an overall quality score for the tool.
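
As a worked illustration of the three metrics above, here is a small sketch. It assumes the F-score is the balanced F1 (the plain harmonic mean, as described), and it returns 0.0 when a denominator is zero, which is a convention chosen for this example rather than something the source specifies. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    # Precision: share of posted comments that were true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of golden comments that the tool caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Example with made-up counts: 6 true positives, 2 false positives, 4 false negatives.
p, r, f = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# precision=0.750 recall=0.600 f1=0.667
```

The harmonic mean penalizes imbalance: a tool with very high precision but very low recall (or the reverse) still ends up with a low F-score.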

One issue with the original Greptile eval (https://www.greptile.com/benchmarks) is that each golden comment has the same score irrespective of its severity - a test-file typo gets the same weight as an outage-causing bug. To solve this, we count low-severity comments as neither true positives, false negatives, nor false positives. Low-severity comments add very little value compared to High or Critical severity bugs and hence shouldn't contribute the same +1 true-positive score, but they are not incorrect either, so it would be unfair to count them as a +1 false positive.
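
The severity rule above can be sketched as a filter applied before counting. This is an assumption-laden illustration: the `Outcome` type, the `count_scored` function, and the severity labels other than "low", "high", and "critical" are hypothetical, and the source does not say how severity is assigned to each comment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    kind: str      # "tp", "fp", or "fn"
    severity: str  # hypothetical labels: "low", "medium", "high", "critical"


def count_scored(outcomes: List[Outcome]) -> Dict[str, int]:
    """Count outcomes, skipping low-severity ones entirely."""
    counts = {"tp": 0, "fp": 0, "fn": 0}
    for outcome in outcomes:
        if outcome.severity == "low":
            continue  # neither rewarded nor penalized
        counts[outcome.kind] += 1
    return counts


outcomes = [
    Outcome("tp", "high"),
    Outcome("tp", "low"),      # ignored
    Outcome("fp", "low"),      # ignored
    Outcome("fp", "medium"),
    Outcome("fn", "critical"),
    Outcome("fn", "low"),      # ignored
]
print(count_scored(outcomes))  # {'tp': 1, 'fp': 1, 'fn': 1}
```

Dropping low-severity items from all three counts means they change neither precision nor recall.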

We observe that this modification to the dataset does not significantly affect the relative rankings of the tools.

Adjustments for Claude Code

Claude Code outputs a single monolithic comment with all suggestions. Hence, to compare it against golden comments, we use an LLM to split the monolithic comment into a set of individual comments (while leaving out the comments it marks as optional) before doing the comparison.
