👋 Welcome to RefineBench — a comprehensive evaluation library for testing refinement capabilities of language models across multiple settings and domains.
- 💥 News
- ✨ Key Features
- 📦 Installation
- 🚀 Quick Start
- 📊 Generating Reports
- ⚙️ Configuration
- 🧐 What is RefineBench?
- 🛠️ Troubleshooting
- 🤝 Contributing
- 👏 Acknowledgements
- 📄 License
- 📝 Citation
- [2025.12.1] Our paper is now available on arXiv
- [2025.10.6] 🎉 RefineBench accepted at the MT-LLM Workshop at NeurIPS 2025 as an Oral (Top 1%)!
- 🔄 Four Refinement Settings: Self-Refinement, Guided Refinement, Partial Guided Refinement, Self-Refinement with Criteria
- ⚡ Lightweight Evaluation: Quick testing with configurable sample sizes
- 📊 Comprehensive Reporting: Pretty reports with accuracy, costs, latency, and improvement tracking
- 🤖 Multi-Model Support: OpenAI, Anthropic, Google, OpenRouter, Together, AWS Bedrock, vLLM
- 🌐 11 Domains: Math, Statistics, STEM, Humanities, Social Science, Law (Reasoning-heavy)
- 📈 Broad Task Coverage: Both free-form and correctness-based tasks
# Clone and install
git clone https://github.com/RefineBench/refinebench-eval.git
cd refinebench-eval
pip install -e .
# Install dependencies
pip install -r requirements.txt
# Optional: Provider-specific dependencies
pip install openai anthropic google-generativeai litellm boto3 vllm
You can configure your API keys either by setting environment variables directly or by creating a `.env` file. The library automatically loads keys from `.env` using `load_dotenv()`.
# Option 1: Environment Variables
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
# Option 2: Create .env file
echo "OPENAI_API_KEY=your-key" >> .envYou can include additional keys (e.g.,
GOOGLE_API_KEY,OPENROUTER_API_KEY) in the same.envfile for multi-provider support.
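For example, a minimal `.env` covering several providers could look like this (only keys for the providers you actually use are needed):

```env
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
GOOGLE_API_KEY=your-key
OPENROUTER_API_KEY=your-key
```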
You can easily load the RefineBench dataset using the Hugging Face datasets library:
from datasets import load_dataset
# Load the RefineBench dataset
dataset = load_dataset("RefineBench/RefineBench")
# Explore the dataset
print(dataset)
We provide a Lightweight Evaluation mode for quick testing on a small subset of RefineBench before running a full-scale evaluation. This mode allows you to specify both the number of samples (`n_samples`) and the maximum number of turns (`max_turn_num`) for rapid experimentation. The evaluation automatically displays a summary report upon completion.
# Run lightweight evaluation with 5 samples and 3 turns
refinebench-eval --mode lightweight --n-samples 5 --max-turn-num 3
# The command automatically displays:
# ✅ Completion status
# 📊 Accuracy per turn
# 📈 Improvement tracking
from refinebench_eval import RefineBenchEvaluator, RefineBenchConfig
config = RefineBenchConfig(
agent=RefineBenchConfig.ModelConfig(
model_name="gpt_4o_mini",
model_path="openrouter/openai/gpt-4o-mini",
temperature=0.9,
top_p=1.0,
max_tokens=10000,
reasoning_effort="no_reasoning"
),
evaluator=RefineBenchConfig.ModelConfig(
model_name="gpt_4o_mini",
model_path="openrouter/openai/gpt-4o-mini",
temperature=0.0,
top_p=1.0,
max_tokens=10000,
reasoning_effort="no_reasoning"
),
experimental_setup=RefineBenchConfig.ExperimentalConfig(
refinement_setting="self_refinement",
max_turn_num=5,
max_workers=4,
seed=42,
debug=False,
use_cache=False
)
)
evaluator = RefineBenchEvaluator(config)
results = evaluator.run_lightweight_evaluation(n_samples=5, max_turn_num=3)
print(f"✅ Evaluated {len(results)} samples")To reproduce the full results reported in our paper, use the provided shell scripts.
Each script corresponds to a specific refinement setting. They first generate model responses for each turn (--mode generate), then evaluate them against the RefineBench checklist (--mode evaluate).
Supported Refinement Settings
| Refinement Setting | Description |
|---|---|
| Self-Refinement | The model iteratively improves its own responses without any external feedback. |
| Guided Refinement | The model refines its responses using explicit feedback derived from checklist results. |
| Partial-Guided Refinement | The model refines its responses based on partially provided feedback (unknown_ratio). |
| Self-Refinement with Criteria | The model refines its responses while being aware of specific evaluation criteria. |
# Run in order (self-refinement first!)
./scripts/run_self_refinement.sh # 🎯 Start here
./scripts/run_guided_refinement.sh
./scripts/run_partial_guided_refinement.sh
./scripts/run_self_refinement_with_criteria.sh
⚠️ Important: Before running guided refinement, partial-guided refinement, or self-refinement with criteria, you must first complete self-refinement, since all subsequent settings depend on the initial outputs from it.
How to set refinement settings
# for self-refinement
config.experimental_setup.refinement_setting = "self_refinement"
# for guided refinement
config.experimental_setup.refinement_setting = "guided_refinement"
# for partial-guided refinement
config.experimental_setup.refinement_setting = "partial_guided_refinement"
config.experimental_setup.unknown_ratio = 0.5 # For partial guided refinement
# for self-refinement with criteria
config.experimental_setup.refinement_setting = "self_refinement_w_criteria"
After running evaluations, you can generate comprehensive reports to analyze the results. The reporting feature provides statistics on Acc (accuracy) and Pass@t (pass accuracy) metrics, along with costs, latency, and improvement tracking across turns.
- Acc_t (Accuracy): Percentage of checklist items that are correct (marked as "Yes")
- Pass_t (Pass Accuracy): Percentage of samples where ALL checklist items are correct (100 if perfect, 0 otherwise)
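To make the two definitions concrete, here is a small sketch of how they could be computed for a single turn. This only illustrates the definitions above; it is not the library's internal implementation, and the exact aggregation follows the reporter.

```python
# Illustration of Acc_t and Pass_t for a single turn t.
# Each inner list holds the binary checklist judgments ("Yes" -> True) for one sample.
samples = [
    [True, True, False],   # 2 of 3 checklist items satisfied
    [True, True, True],    # all checklist items satisfied
]

# Acc_t: percentage of checklist items judged "Yes" (averaged per sample here)
acc_t = 100 * sum(sum(s) / len(s) for s in samples) / len(samples)

# Pass_t: percentage of samples whose checklist items are ALL judged "Yes"
pass_t = 100 * sum(all(s) for s in samples) / len(samples)

print(f"Acc_t = {acc_t:.1f}%, Pass_t = {pass_t:.1f}%")  # Acc_t = 83.3%, Pass_t = 50.0%
```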
# Generate a report from evaluation results
refinebench-eval --mode report --output-dir ./results --max-turn-num 5
# Export report to JSON file
refinebench-eval --mode report --output-dir ./results --export-report ./report.json
# Save detailed CSV/TSV reports
refinebench-eval --mode report --output-dir ./results --save-csv ./csv_reports
# Report for specific refinement setting
refinebench-eval --mode report --output-dir ./results --refinement-setting guided_refinement
from refinebench_eval import RefineBenchReporter
# Initialize reporter
reporter = RefineBenchReporter("./results")
# Load results
reporter.load_results(max_turn_num=5)
# Calculate statistics
stats = reporter.calculate_statistics(max_turn_num=5)
# Print pretty report (shows Acc and Pass@t)
reporter.print_report(stats=stats, max_turn_num=5, show_domains=True)
# Export to JSON
reporter.export_report("./report.json", stats=stats, max_turn_num=5)
# Save detailed CSV/TSV reports
reporter.save_csv_reports("./csv_reports", max_turn_num=5)
The generated report includes:
- Overall Statistics: Total samples, costs, tokens, and latency
- Per-Turn Analysis: Acc and Pass@t metrics, average cost, tokens, and time for each turn
- Improvement Tracking: Shows how Acc and Pass@t change across turns with visual indicators
- Domain-Level Statistics: Breakdown by domain (Math, CS, Biology, etc.)
- Export Options: Save reports as JSON or CSV/TSV for further analysis
When using `--save-csv`, the following files are generated:
- `full_report.tsv`: Complete data for all samples and turns
- `performance_by_turn.tsv`: Acc and Pass@t by turn
- `performance_by_domain.tsv`: Performance breakdown by domain
- `cost_latency_summary.tsv`: Cost and latency statistics
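These are tab-separated files, so they can be inspected with any TSV reader. For example, a quick look with pandas (the exact column names depend on the reporter output):

```python
import pandas as pd

# Load the per-turn performance table produced by --save-csv (tab-separated)
by_turn = pd.read_csv("./csv_reports/performance_by_turn.tsv", sep="\t")
print(by_turn.head())
```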
Check out the `examples/` directory for sample scripts:
- `generate_report.py`: Demonstrates comprehensive report generation
- See `examples/README.md` for more details
In refinebench-eval, users can configure two major components: `ModelConfig` and `ExperimentalConfig`. Detailed parameter information for each configuration is provided below.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | — | Model identifier |
| `model_path` | `str` | — | Provider/model path (e.g., `openai/gpt-4o-mini`) |
| `temperature` | `float` | `0.9` | Sampling temperature |
| `top_p` | `float` | `1.0` | Nucleus sampling parameter |
| `max_tokens` | `int` | `10000` | Maximum number of generation tokens (for o1-mini, this is automatically set internally as `max_completion_tokens`) |
| `reasoning_effort` | `str` | `"medium"` | Reasoning effort level (for reasoning models). If the LM is non-reasoning, set this value to `no_reasoning`. |
Note:
Since RefineBench is a multi-turn benchmark, users must set `max_tokens` to at least 10,000 to ensure sufficient context length across turns.
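For reference, a sketch of a `ModelConfig` for a reasoning model is shown below; the model name and path are illustrative only, and the fields mirror the table above.

```python
from refinebench_eval import RefineBenchConfig

# Illustrative ModelConfig for a reasoning model (model name/path are examples only)
agent = RefineBenchConfig.ModelConfig(
    model_name="o1_mini",
    model_path="openrouter/openai/o1-mini",
    temperature=1.0,
    top_p=1.0,
    max_tokens=10000,            # mapped internally to max_completion_tokens for o1-mini
    reasoning_effort="medium",   # default for reasoning models; use "no_reasoning" for non-reasoning LMs
)
```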
| Parameter | Type | Default | Description |
|---|---|---|---|
| `refinement_setting` | `str` | `"self_refinement"` | Refinement mode: one of `self_refinement`, `guided_refinement`, `partial_guided_refinement`, or `self_refinement_with_criteria` |
| `max_turn_num` | `int` | `5` | Maximum number of refinement turns |
| `max_workers` | `int` | `1` | Number of parallel workers |
| `seed` | `int` | `0` | Random seed for reproducibility |
| `debug` | `bool` | `False` | Enable debug mode |
| `use_cache` | `bool` | `False` | Use cached results if available |
| `unknown_ratio` | `float` | `0.5` | Ratio used for partial-guided refinement (range: 0–1) |
| `run_domain` | `bool` | `False` | Enable domain-specific evaluation |
| `domain` | `str` | `None` | Target domain to filter (e.g., `"math"`, `"cs"`) |
| `enforce_execution` | `bool` | `False` | Force code execution for code-based tasks |
| `mode` | `str` | `None` | Execution mode: one of `lighteval`, `generate`, or `evaluate` |
To evaluate only a specific domain, set run_domain=True and specify the desired domain using one of the abbreviations listed below. (Currently, only one domain can be evaluated at a time; multiple-domain support will be added in future updates.)
| Domain | Abbreviation | Count |
|---|---|---|
| 📚 Math | `math` | 321 |
| 📊 Statistics | `statistics` | 163 |
| 📖 Humanities / Social Science | `humanities` | 185 |
| ⚖️ Law | `law` | 142 |
| 💻 Computer Science / AI | `cs` | 51 |
| 🔬 Physics | `physics` | 69 |
| 🧪 Chemistry | `chem` | 18 |
| 🏗️ Engineering | `engineering` | 14 |
| 🧬 Biology / Medicine | `bio` | 11 |
| 💼 Economics / Business | `economics` | 9 |
| 📦 Other | `other` | 17 |
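For example, to restrict evaluation to the Law subset, assuming the `config` object created in the Quick Start example:

```python
# Evaluate a single domain using the ExperimentalConfig fields described above
config.experimental_setup.run_domain = True
config.experimental_setup.domain = "law"   # one abbreviation from the table above
```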
When a domain is specified, the following internal code is executed automatically:
# Filter dataset by domain
from refinebench_eval.dataset import RefineBenchDataset
dataset = RefineBenchDataset()
math_indices = dataset.filter_by_domain("math")
cs_indices = dataset.filter_by_domain("cs")
RefineBench is a benchmark of 1,002 challenging problems across 11 domains, paired with a checklist-based evaluation framework. Our benchmark has three key features:
- Verifiable and Non-verifiable tasks: it incorporates both free-form generation tasks and tasks evaluated by answer correctness
- Multiple refinement setting support: its evaluation framework assesses both guided refinement and self-refinement scenarios
- Diverse domains: the questions cover 11 domains, including not only math, statistics, and STEM, but also humanities, social science, and law (reasoning-heavy)
As refinement becomes an increasingly important post-hoc method for improving language model (LM) responses, it is crucial to evaluate whether an LM can effectively refine its own previous outputs. However, existing studies present several limitations:
- (1) Most prior work focuses on evaluating refinement capabilities on verifiable tasks, such as mathematical problem solving. Yet, in real-world settings, users often pose open-ended, reasoning-intensive questions that require long or subjective free-form answers—i.e., non-verifiable tasks. For instance, as shown in the figure (left), Gemini 2.5 Pro successfully refines its previous answers toward the correct solution on mathematical problems (e.g., AIME24), but in RefineBench, it struggles to refine effectively, achieving only a marginal improvement (+1.8%) over five turns.
- (2) In real use cases, users typically request refinement repeatedly across multiple turns, a setting underexplored in existing benchmarks.
- (3) Refinement performance is highly dependent on the feedback provided, yet previous analyses rarely control or systematically vary this feedback. As shown in the figure (right), giving more specific feedback and clear directions for improvement leads to substantial gains in self-refinement.
- (4) Finally, a new generation of reasoning-oriented LMs has emerged, raising the question of whether conclusions from prior studies still generalize to these models.
Therefore, to advance the development and evaluation of refinement capabilities in modern LMs, RefineBench serves as a comprehensive testbed for systematically measuring self-refinement performance of frontier LMs.
RefineBench stands out through several unique features:
- Covering verifiable/non-verifiable tasks — It includes both free-form generation tasks and answer-based correctness tasks.
- Supporting various refinement scenarios — The evaluation framework supports both guided refinement and self-refinement settings.
- Broad domain diversity — Questions span 11 domains, ranging from math, statistics, and STEM to humanities, social sciences, and law.
- Checklist-based evaluation — Each task is assessed using a detailed checklist that defines explicit evaluation criteria.
- Consistent multi-turn assessment — Building on the checklist-based evaluation framework, RefineBench measures how much an LM's performance improves across multiple turns, and evaluates refinement capability under both self- and guided-refinement scenarios, especially when checklist items remain unfulfilled (i.e., when the evaluator LM judges a checklist criterion as "No").
RefineBench evaluates an LM's ability to iteratively refine its own answers through a three-step workflow:
- Refinement Step — Given a user query and the previous answer, the target LM generates a refined response (or an initial answer at the first turn). In self-refinement, the model autonomously decides whether to continue refining.
- Evaluation Step — An evaluator LM (GPT-4.1) checks the refined answer against a predefined checklist of binary criteria ("Yes"/"No") to assess quality and completeness.
- Feedback Step — The feedback derived from checklist results is used to form the next query, allowing iterative improvement across multiple turns (typically up to five).
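The loop below is a toy sketch of this workflow. The three helper functions are stand-ins for the target LM, the evaluator LM, and feedback construction; they are not part of the `refinebench_eval` API.

```python
from typing import Dict, Optional

# Toy sketch of the three-step refinement loop described above.

def generate_response(query: str, previous: Optional[str]) -> str:
    return f"refined answer to: {query}"                 # placeholder for the target LM call

def evaluate_checklist(answer: str) -> Dict[str, str]:
    return {"criterion_1": "Yes", "criterion_2": "No"}   # placeholder for the evaluator LM

def build_feedback(query: str, checklist: Dict[str, str]) -> str:
    failed = [c for c, v in checklist.items() if v == "No"]
    return query + "\nUnmet criteria: " + ", ".join(failed)

query, answer = "original question", None
for turn in range(5):                                    # typically up to five turns
    answer = generate_response(query, answer)            # 1. Refinement step
    checklist = evaluate_checklist(answer)               # 2. Evaluation step
    if all(v == "Yes" for v in checklist.values()):
        break                                            # all checklist items satisfied
    query = build_feedback(query, checklist)             # 3. Feedback step
```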
How to provide feedback in RefineBench?
- Self-Refinement — The model receives no explicit feedback and must improve its answer independently.
- Guided Refinement — The model is given explicit feedback on checklist items it failed previously. A partially guided variant (Partial-Guided Refinement) provides only a subset of the checklist feedback, simulating limited real-world supervision.
RefineBench includes 1,000 problems across 11 domains and 239 subjects, each paired with a checklist averaging 9.9 binary criteria. The largest domains are Math (32%), Humanities/Social Science (19%), and Law (14%), ensuring broad coverage of both verifiable and non-verifiable reasoning tasks.
Compared with existing datasets, RefineBench supports both extrinsic (guided) and intrinsic (self-refinement) settings, as well as partially guided refinement. It also provides checklist-based, fine-grained feedback control, enabling precise measurement of how models respond to feedback. In addition, RefineBench covers both verifiable (exact-match) and non-verifiable (free-form) tasks across 11 domains — representing the broadest coverage among existing datasets. Finally, it enables multi-turn, checklist-based evaluation, uniquely tracking consistent improvement across turns.
| Issue | Solution |
|---|---|
| 🔑 API Key Errors | Set keys in environment or .env file |
| 🤖 Model Not Supported | Check model path format (e.g., openai/gpt-4o-mini) |
| 📦 Missing Dependencies | Install provider-specific packages (pip install openai anthropic) |
| 💾 Memory Issues | Reduce max_workers or max_tokens |
| 📊 No Results in Report | Ensure results directory contains JSON files with evaluation data |
Debug Mode:
config.experimental_setup.debug = True  # Evaluates only 2 samples
Check Your Results:
# Verify evaluation completed successfully
refinebench-eval --mode report --output-dir ./results
Contributions are always welcome! 🎉 If you'd like to improve RefineBench, feel free to submit a Pull Request or open an Issue to discuss potential changes. While the current version of RefineBench is primarily English-based and designed for LM evaluation, we strongly believe that extending it to multilingual or multi-modal settings would be both valuable and necessary for advancing robust refinement capabilities across diverse languages and modalities. We warmly encourage contributions in these directions — whether it's adding new data, extending evaluation dimensions, or supporting additional model types. 🚀
We would like to express our sincere gratitude to Changjae Lee (Department of Chemistry, KAIST) and Changyu Lee (Department of Engineering, KAIST) for their invaluable assistance in dataset annotation. Our underlying codebase was developed with reference to the IneqMath project. For inference and model serving, we leveraged several open-source frameworks, including LiteLLM, OpenRouter, vLLM, and the Transformers library. Huge thanks to all the contributors for these awesome repositories!! 🙌
RefineBench is released under the CC BY-NC-ND 4.0 License. We strongly recommend using RefineBench exclusively for research purposes (non-commercial use) in accordance with institutional and ethical guidelines. The CC BY-NC-ND 4.0 License explicitly prohibits redistribution and modification of RefineBench. All accompanying code and evaluation scripts are released under the MIT License.
If you use RefineBench in your work, please kindly cite it using the BibTeX entry below:
@misc{lee2025refinebenchevaluatingrefinementcapability,
title={RefineBench: Evaluating Refinement Capability of Language Models via Checklists},
author={Young-Jun Lee and Seungone Kim and Byung-Kwan Lee and Minkyeong Moon and Yechan Hwang and Jong Myoung Kim and Graham Neubig and Sean Welleck and Ho-Jin Choi},
year={2025},
eprint={2511.22173},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.22173},
}


