Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations
Authors (* = equal contribution):
- Kevin L. Wei* (RAND, Harvard)
- Patricia Paskov* (RAND)
- Sunishchal Dev* (RAND, Algoverse)
- Michael J. Byun* (RAND)
- Anka Reuel (Stanford, Harvard)
- Xavier Roberts-Gaal (Harvard)
- Rachel Calcott (Harvard)
- Evie Coxon (Max Planck School of Cognition)
- Chinmay Deshpande (Center for Democracy & Technology)
Contact: kevinwei@acm.org
Other versions:
- ICML 2025: A version of this paper has been accepted to ICML 2025 as a position paper (spotlight), with the title: "Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist)."
- ICLR 2025 Workshop on Building Trust in Language Models and Applications: A version of this paper was presented at this workshop; reviews and the earlier draft are available on OpenReview.
This paper finds that existing human baselines are neither sufficiently rigorous nor transparent to enable meaningful comparisons of human vs. AI performance. We provide recommendations and a reporting checklist to increase rigor and transparency in human baselines.
Human baselines are reference sets of metrics intended to represent human performance on specific tasks. They are used in AI evaluations to compare human vs. AI performance on evaluation items, adding important context to results and helping inform stakeholders in the broader AI ecosystem (e.g., downstream users, policymakers).
Specifically, this paper makes three contributions:
- Methodological recommendations: Based on a meta-review of the measurement theory and AI evaluation literatures, we provide methodological recommendations for evaluators to build rigorous human baselines in AI evaluations. See the summary table below.
- Reporting checklist: We provide a reporting checklist for evaluators to increase transparency when publishing human baselines. See /reporting_checklist/.
- Literature review: We review 115 human baselines (studies) to identify methodological gaps in existing AI evaluations, and we find substantial shortcomings in the rigor and transparency of existing human baselines.
Maximal rigor may not be possible for all human baselines due to resource limitations. In such cases, we hope to help researchers make informed tradeoffs, acknowledge methodological limitations, narrow their interpretation of results, and transparently report methods and results.
We organize our recommendations around five stages of human baseline development:

| Human Baseline Stage | Definition |
|---|---|
| Baseline Design & Implementation | Baseline design is the initial stage of human baseline development, at which researchers define the baseline's purpose, scope, concepts, evaluation items, and metrics; baseline implementation is the selection and construction of tools and datasets for evaluation. |
| Baseliner Recruitment | Baseliner recruitment is the stage at which human baseliners (the humans who respond to evaluation items) are found and engaged to participate in a baseline. |
| Baseline Execution | Baseline execution is the stage at which the human baseline is conducted and result data is collected, e.g., through surveys or crowdwork platforms. |
| Baseline Analysis | Baseline analysis is the stage, after data collection, at which human baseline data is inspected and compared to AI results. |
| Baseline Documentation | Baseline documentation is the provision of evaluation tasks, datasets, metrics, and experimental materials and resources to relevant audiences. |
Our recommendations for rigorous and transparent human baselines are summarized in the table below; full details are in the paper (paper.pdf).
| Stage | Recommendation |
|---|---|
| Baseline Design & Implementation | Use consistent & representative test sets for human baselines and AI results |
| | Iteratively develop baseline instruments |
| | Collect an adequately sized sample of baseliners (see the power-analysis sketch below the table) |
| | Satisfy ethics requirements for human subjects research |
| Baseliner Recruitment | Specify a human population of interest |
| | Use an appropriate sampling strategy for selecting baseliners |
| | Employ quality controls for baseliner recruitment |
| Baseline Execution | Employ quality controls during baseline execution |
| | Control for method effects and use identical tasks |
| | Control for level of effort |
| | Collect qualitative data from baseliners |
| Baseline Analysis | Quantify uncertainty in human vs. AI performance differences (see the bootstrap sketch below the table) |
| | Use consistent evaluation metrics, scoring methods, and rubrics across human and AI evaluation |
| Baseline Documentation | Report key details about baselining methodology and baseliners |
| | Adopt best practices for open science and reproducibility/replicability |
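Two of these recommendations lend themselves to short worked examples. First, for collecting an adequately sized sample of baseliners, a power analysis can ground the choice of sample size. The following is a minimal sketch, not taken from the paper, using Cohen's effect size h for two proportions; the 82% and 90% accuracy figures are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a difference between two
    proportions (e.g., human vs. AI accuracy) with a two-sided z-test."""
    h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))  # Cohen's h
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # z-score for the desired power
    return int(np.ceil((z_alpha + z_beta) ** 2 / h ** 2))

# Hypothetical: how many baseliners (and items) per group are needed to
# distinguish 82% human accuracy from 90% model accuracy with 80% power?
print(n_per_group(0.82, 0.90))
```

Second, for quantifying uncertainty in human vs. AI performance differences, a paired bootstrap over evaluation items is one common approach. The sketch below assumes per-item binary correctness scores for humans and a model on the same test set; the scores here are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated per-item correctness (1 = correct) on a shared 200-item test set
human = rng.binomial(1, 0.82, size=200).astype(float)
model = rng.binomial(1, 0.90, size=200).astype(float)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean(b) - mean(a),
    resampling evaluation items with replacement (paired across systems)."""
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # bootstrap item indices
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return b.mean() - a.mean(), lo, hi

diff, lo, hi = paired_bootstrap_ci(human, model)
print(f"model - human accuracy: {diff:+.3f} (95% CI [{lo:+.3f}, {hi:+.3f}])")
```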
Raw data from our literature review of human baselines is available in /data/lit_review_data.xlsx.
Summary statistics are available here (provided as a Google Sheet because Excel broke the formulas).
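For readers who want to explore the raw data programmatically, something like the following should work; this is a hypothetical usage sketch, and the actual column schema should be checked against data/README.md.

```python
import pandas as pd

# Load the literature review data (reading .xlsx files requires openpyxl)
df = pd.read_excel("data/lit_review_data.xlsx")

print(df.shape)    # reviewed baselines x coded fields
print(df.columns)  # coded fields; see data/README.md for definitions
print(df.head())
```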
Repository structure:

```
|-- figures.ipynb
|-- paper.pdf
|-- README.md
|-- repo_tree.txt
|-- data
|   |-- lit_review_data.xlsx
|   |-- README.md
|-- figures
|   |-- baseline_language_frequencies.png
|   |-- baseline_year_frequencies.png
|   |-- Summary_Figure.png
|   |-- venue_frequencies.png
|   |-- fonts
|-- reporting_checklist
    |-- reporting_checklist.docx
    |-- reporting_checklist.pdf
    |-- reporting_checklist.tex
```