ai-code-review-evaluations/golden_comments

Evaluating the most popular AI Code Review Tools

In this organization, we evaluate the review quality of the seven most popular tools: Augment Code Review, Cursor Bugbot, OpenAI Codex Code Review, CodeRabbit, Greptile, GitHub Copilot, and Graphite, each in its default settings as of Nov 14, 2025.

Dataset

We use the Greptile AI Code Review benchmark dataset (https://www.greptile.com/benchmarks), the only public dataset for code reviews, for this comparison. It consists of 5 mid-to-large open source repositories with 10 pull requests (PRs) each and covers a variety of languages. For details on how the dataset was constructed, see the benchmark page linked above.

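| Benchmark | LOC | Files | Primary Language | PRs Chosen | Golden Comments |
|-----------|-----------|--------|------------|----|----|
| Sentry    | 1,037,299 | 6,697  | Python     | 10 | 35 |
| Cal.com   | 710,499   | 6,905  | TypeScript | 10 | 32 |
| Grafana   | 2,657,238 | 16,722 | Go         | 10 | 24 |
| Discourse | 331,960   | 3,199  | Ruby       | 10 | 29 |
| Keycloak  | 905,337   | 9,319  | Java       | 10 | 25 |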

Updating the Dataset

The original Greptile benchmark set has one major problem, which we fix: it has exactly 1 golden comment per PR. Each PR contains many valid bugs that are missing from the golden comment set, and without a full set of golden comments one cannot compute an accurate precision or recall score. To fix this, we manually reviewed the PRs, manually reviewed the issues pointed out by the 7 tools, used Auggie to improve our understanding of the codebase, and added the missing valid golden comments. One major contribution of this work is a more complete ground truth for the benchmark set.

Computing Scores

The standard way to measure the quality (or accuracy) of an AI Code Review tool (or any automated tool that catches bugs, security vulnerabilities, etc.) is to compare its output against a set of known or expected comments for a given PR, called golden comments. If a tool's comment matches a golden comment, it is a true positive; if it matches none, it is a false positive. If the tool misses a golden comment, that is a false negative. We compare two comments simply by asking an LLM whether they refer to the same underlying issue; this comparison works well in practice. It is more reliable than comparing by file or line number, which breaks down for multi-line or multi-file issues.
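The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation's actual code: `classify_comments` and `toy_same_issue` are hypothetical names, the LLM judge is replaced by a pluggable `same_issue` callable, and counting each golden comment at most once (so duplicates from a tool do not inflate the score) is an assumption the source does not spell out.

```python
from typing import Callable, List, Tuple


def classify_comments(
    tool_comments: List[str],
    golden_comments: List[str],
    same_issue: Callable[[str, str], bool],
) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    `same_issue` stands in for the LLM judge that decides whether two
    comments describe the same underlying issue (hypothetical interface).
    """
    matched_golden = set()  # indices of golden comments the tool caught
    false_positives = 0
    for comment in tool_comments:
        hits = [i for i, g in enumerate(golden_comments) if same_issue(comment, g)]
        if hits:
            matched_golden.update(hits)
        else:
            # The comment matches no golden comment at all.
            false_positives += 1
    # Assumption: each golden comment counts once, even if matched repeatedly.
    true_positives = len(matched_golden)
    false_negatives = len(golden_comments) - true_positives
    return true_positives, false_positives, false_negatives


def toy_same_issue(a: str, b: str) -> bool:
    """Toy keyword matcher used only to make the example runnable."""
    return "null" in a and "null" in b


tool = ["null check missing in parse()", "rename variable x", "off-by-one in loop"]
golden = ["parse() lacks a null check", "race condition in cache"]
counts = classify_comments(tool, golden, toy_same_issue)
print(counts)
```

In a real run, `same_issue` would wrap a call to whichever LLM is used as the judge; the toy keyword matcher only keeps the sketch self-contained.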

There are three main metrics we use here (higher score is better):

  1. Precision: Out of the comments posted, what percentage were true positives (true-positives / (true-positives + false-positives)). It is one minus the false discovery rate, i.e. the fraction of posted comments that are false positives. Research shows that automated tools with low precision are seen as untrustworthy by developers and ignored entirely.
  2. Recall: Out of the golden comments, what percentage were caught by the tool (true-positives / (true-positives + false-negatives)). A high recall is necessary to match the breadth of issues reviewed by a human developer.
  3. F-score: The harmonic mean of Precision and Recall. It gives an overall quality score for the tool.
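As a worked illustration of the three metrics, the sketch below computes them from raw counts. The function name `precision_recall_fscore` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the evaluation specifies.

```python
from typing import Tuple


def precision_recall_fscore(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and their harmonic mean from raw counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up counts: 6 true positives, 2 false positives, 4 false negatives.
p, r, f = precision_recall_fscore(tp=6, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

With these counts, precision is 6/8 = 0.75, recall is 6/10 = 0.60, and the F-score is about 0.67, which sits closer to the lower of the two values, as a harmonic mean does.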

One issue with the original Greptile eval (https://www.greptile.com/benchmarks) is that each golden comment has the same score irrespective of its severity: a test-file typo gets the same weight as an outage-causing bug. To solve this, we count low-severity comments as neither true positives, false negatives, nor false positives. Low-severity comments add very little value compared to High or Critical severity bugs and hence shouldn't contribute the same +1 true positive score, but they are not incorrect either, so it would be unfair to count them as a +1 false positive.
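One way to read the low-severity rule is sketched below. It is an interpretation under stated assumptions rather than the evaluation's actual implementation: the `GoldenComment` type, the severity labels, and the `toy_same_issue` matcher are all hypothetical, and the sketch assumes a tool comment is neutral whenever it matches a low-severity golden comment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GoldenComment:
    text: str
    severity: str  # assumed labels: "low", "medium", "high", "critical"


def severity_aware_counts(
    tool_comments: List[str],
    golden: List[GoldenComment],
    same_issue: Callable[[str, str], bool],
) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) with low-severity golden comments left out of all counts."""
    matched = set()
    fp = 0
    for comment in tool_comments:
        hits = [i for i, g in enumerate(golden) if same_issue(comment, g.text)]
        if hits:
            matched.update(hits)
        else:
            fp += 1  # matches nothing, not even a low-severity golden comment
    # Only non-low golden comments can become true positives or false negatives.
    scored = [i for i, g in enumerate(golden) if g.severity != "low"]
    tp = sum(1 for i in scored if i in matched)
    fn = len(scored) - tp
    return tp, fp, fn


def toy_same_issue(a: str, b: str) -> bool:
    """Toy keyword matcher standing in for the LLM judge."""
    return any(k in a and k in b for k in ("injection", "typo", "deadlock"))


golden_set = [
    GoldenComment("sql injection in login", "critical"),
    GoldenComment("typo in test name", "low"),
    GoldenComment("deadlock in worker", "high"),
]
tool_output = [
    "login is vulnerable to sql injection",
    "test name has a typo",
    "consider adding a docstring",
]
result = severity_aware_counts(tool_output, golden_set, toy_same_issue)
print(result)
```

Here the typo comment matches a low-severity golden comment, so it adds nothing to any count, while the unmatched docstring suggestion still counts as a false positive.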

We observe that this modification to the dataset does not significantly affect the relative rankings of the tools.

Adjustments for Claude Code

Claude Code outputs a single monolithic comment with all suggestions. Hence, to compare it against golden comments, we use an LLM to split the monolithic comment into a set of individual comments (while leaving out the comments it marks as optional) before doing the comparison.

About

The Golden (expected) comments against which the AI Code Review tools are compared
