codoki-ai/benchmarks
Codoki AI Benchmarks (2025)

This repository serves as the index for Codoki's reproducible AI code-review benchmarks.
We recreated real-world bugs from multiple open-source projects, opened fresh PRs, and measured how well Codoki catches them. Published results for other tools are listed alongside for comparison.


Methodology

Dataset.
50 real bugs across Sentry (Python), Grafana (Go), Cal.com (TypeScript), Keycloak (Java), and Discourse (Ruby).
Each bug maps to an upstream PR that fixed a production defect.

Procedure.
We recreated the original bug PRs and ran Codoki on the same diffs and repository context.
A bug counted as “caught” only when Codoki identified the fault in a line-level PR comment with actionable guidance.
Mentions only in summaries did not count.
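
The counting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Codoki's actual scoring code: the `ReviewComment` record, its field names, and the bug ids are hypothetical, and deciding whether a comment is "actionable" is assumed to be a human judgment recorded as a boolean.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    """One review comment, labeled by a human grader (hypothetical schema)."""
    bug_id: str          # hypothetical id of the recreated bug the comment refers to
    is_line_level: bool  # True for an inline PR comment, False for a summary mention
    is_actionable: bool  # True if the comment gives guidance on fixing the fault


def caught_bugs(comments):
    """Return the ids of bugs that meet the 'caught' rule.

    A bug is caught only if at least one comment on it is both
    line-level and actionable; summary-only mentions are ignored.
    """
    return {c.bug_id for c in comments if c.is_line_level and c.is_actionable}


def catch_rate(comments, total_bugs):
    """Fraction of the dataset's bugs that were caught."""
    if total_bugs <= 0:
        raise ValueError("total_bugs must be positive")
    return len(caught_bugs(comments)) / total_bugs


if __name__ == "__main__":
    # Illustrative labels only; these are not benchmark results.
    sample = [
        ReviewComment("bug-1", is_line_level=True, is_actionable=True),
        ReviewComment("bug-2", is_line_level=False, is_actionable=True),  # summary only
        ReviewComment("bug-3", is_line_level=True, is_actionable=False),  # no guidance
    ]
    print(sorted(caught_bugs(sample)))
    print(catch_rate(sample, total_bugs=4))
```

Using a set of bug ids means several qualifying comments on the same bug still count once, which matches a per-bug catch rate rather than a per-comment one.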

Competitor scores.
Results for Greptile, Cursor, GitHub Copilot, CodeRabbit, and Graphite are reproduced from Greptile’s public benchmark dataset (accessed September 2025).
We did not re-run competitor tools.


Dataset Index

Repository   Language     Dataset / Recreated PRs
Sentry       Python       sentry-codoki
Cal.com      TypeScript   calcom-codoki
Grafana      Go           grafana-codoki
Keycloak     Java         keycloak-codoki
Discourse    Ruby         discourse-codoki

Why This Matters

Engineering leaders need transparent, reproducible benchmarks to make informed adoption decisions.
This dataset helps teams compare tools on realistic production bugs rather than trivial synthetic demos.


Contributing

  • Found a bug we should add to the dataset? Open an issue or PR.
  • Want to replicate this benchmark with your own tool? Fork this repo and run the same PR set.


Last updated: September 2025
