RefineBench: Evaluating Refinement Capability of Language Models via Checklists

MIT License CC BY-NC-ND License arXiv Paper Hugging Face Dataset Website

👋 Welcome to RefineBench — a comprehensive evaluation library for testing refinement capabilities of language models across multiple settings and domains.

💥 News

  • [2025.12.1] Our paper is now available on arXiv
  • [2025.10.6] 🎉 RefineBench was accepted as an Oral (Top 1%) at the MT-LLM Workshop at NeurIPS 2025!

✨ Key Features

  • 🔄 Four Refinement Settings: Self-Refinement, Guided Refinement, Partial Guided Refinement, Self-Refinement with Criteria
  • ⚡ Lightweight Evaluation: Quick testing with configurable sample sizes
  • 📊 Comprehensive Reporting: Pretty reports with accuracy, costs, latency, and improvement tracking
  • 🤖 Multi-Model Support: OpenAI, Anthropic, Google, OpenRouter, Together, AWS Bedrock, vLLM
  • 🌐 11 Domains: Math, Statistics, STEM, Humanities, Social Science, Law (Reasoning-heavy)
  • 📈 Broad Task Coverage: Both free-form and correctness-based tasks

📦 Installation

Environment Setup

# Clone and install
git clone https://github.com/RefineBench/refinebench-eval.git
cd refinebench-eval
pip install -e .

# Install dependencies
pip install -r requirements.txt

# Optional: Provider-specific dependencies
pip install openai anthropic google-generativeai litellm boto3 vllm

Set up API Keys

You can configure your API keys either by setting environment variables directly or by creating a .env file. The library automatically loads keys from .env using load_dotenv().

# Option 1: Environment Variables
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

# Option 2: Create .env file
echo "OPENAI_API_KEY=your-key" >> .env

You can include additional keys (e.g., GOOGLE_API_KEY, OPENROUTER_API_KEY) in the same .env file for multi-provider support.
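
To sanity-check which keys were actually picked up, here is a small standalone sketch using python-dotenv and os (a convenience snippet, not part of the library):

import os
from dotenv import load_dotenv

# Load variables from a local .env file into the environment (no-op if the file is absent)
load_dotenv()

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "OPENROUTER_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")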

🚀 Quick Start

🤗 How to Load RefineBench from HuggingFace

You can easily load the RefineBench dataset using the Hugging Face datasets library:

from datasets import load_dataset

# Load the RefineBench dataset
dataset = load_dataset("RefineBench/RefineBench")

# Explore the dataset
print(dataset)
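
For a closer look before running any evaluation, the standard datasets API lets you inspect split sizes and column names (no RefineBench-specific field names are assumed here):

# Inspect each split's size and available columns
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)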

⚡ Lightweight Evaluation (Quick Test)

We provide a Lightweight Evaluation mode for quick testing on a small subset of RefineBench before running a full-scale evaluation. This mode allows you to specify both the number of samples (n_samples) and the maximum number of turns (max_turn_num) for rapid experimentation. The evaluation automatically displays a summary report upon completion.

Command-Line Interface

# Run lightweight evaluation with 5 samples and 3 turns
refinebench-eval --mode lightweight --n-samples 5 --max-turn-num 3

# The command automatically displays:
# ✅ Completion status
# 📊 Accuracy per turn
# 📈 Improvement tracking

Python API Usage

from refinebench_eval import RefineBenchEvaluator, RefineBenchConfig

config = RefineBenchConfig(
    agent=RefineBenchConfig.ModelConfig(
        model_name="gpt_4o_mini",
        model_path="openrouter/openai/gpt-4o-mini",
        temperature=0.9,
        top_p=1.0,
        max_tokens=10000,
        reasoning_effort="no_reasoning"
    ),
    evaluator=RefineBenchConfig.ModelConfig(
        model_name="gpt_4o_mini",
        model_path="openrouter/openai/gpt-4o-mini",
        temperature=0.0,
        top_p=1.0,
        max_tokens=10000,
        reasoning_effort="no_reasoning"
    ),
    experimental_setup=RefineBenchConfig.ExperimentalConfig(
        refinement_setting="self_refinement",
        max_turn_num=5,
        max_workers=4,
        seed=42,
        debug=False,
        use_cache=False
    )
)

evaluator = RefineBenchEvaluator(config)
results = evaluator.run_lightweight_evaluation(n_samples=5, max_turn_num=3)
print(f"✅ Evaluated {len(results)} samples")

🧪 Full Evaluation with Shell Scripts

To reproduce the full results reported in our paper, use the provided shell scripts. Each script corresponds to a specific refinement setting. They first generate model responses for each turn (--mode generate), then evaluate them against the RefineBench checklist (--mode evaluate).

Supported Refinement Settings

  • Self-Refinement: The model iteratively improves its own responses without any external feedback.
  • Guided Refinement: The model refines its responses using explicit feedback derived from checklist results.
  • Partial-Guided Refinement: The model refines its responses based on partially provided feedback (controlled by unknown_ratio).
  • Self-Refinement with Criteria: The model refines its responses while being aware of specific evaluation criteria.

# Run in order (self-refinement first!)
./scripts/run_self_refinement.sh              # 🎯 Start here
./scripts/run_guided_refinement.sh
./scripts/run_partial_guided_refinement.sh
./scripts/run_self_refinement_with_criteria.sh

⚠️ Important: Before running guided refinement, partial-guided refinement, or self-refinement with criteria, you must first complete self-refinement, since all subsequent settings depend on the initial outputs from it.
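
If you prefer to script the two phases yourself instead of using the shell scripts, the sketch below chains --mode generate and --mode evaluate from Python via subprocess; note that passing --refinement-setting to these modes is an assumption on our part, mirroring the report command shown later:

import subprocess

# Phase 1: generate model responses for each turn
subprocess.run(["refinebench-eval", "--mode", "generate",
                "--refinement-setting", "self_refinement"], check=True)

# Phase 2: evaluate the generated responses against the checklist
subprocess.run(["refinebench-eval", "--mode", "evaluate",
                "--refinement-setting", "self_refinement"], check=True)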

How to set refinement settings

# for self-refinement
config.experimental_setup.refinement_setting = "self_refinement"

# for guided refinement
config.experimental_setup.refinement_setting = "guided_refinement"

# for partial-guided refinement
config.experimental_setup.refinement_setting = "partial_guided_refinement"
config.experimental_setup.unknown_ratio = 0.5  # For partial guided refinement

# for self-refinement with criteria
config.experimental_setup.refinement_setting = "self_refinement_w_criteria"

📊 Generating Reports

After running evaluations, you can generate comprehensive reports to analyze the results. The reporting feature provides statistics on Acc (accuracy) and Pass@t (pass accuracy) metrics, along with costs, latency, and improvement tracking across turns.

RefineBench Metrics

  • Acc_t (Accuracy): Percentage of checklist items that are correct (marked as "Yes")
  • Pass_t (Pass Accuracy): Percentage of samples where ALL checklist items are correct (100 if perfect, 0 otherwise)
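
As a concrete illustration, the sketch below computes both metrics for a single sample at a given turn; the function names are ours, not part of the library:

from typing import List

def acc_t(checklist_results: List[bool]) -> float:
    # Acc_t: percentage of checklist items judged "Yes"
    return 100.0 * sum(checklist_results) / len(checklist_results)

def pass_t(checklist_results: List[bool]) -> float:
    # Pass_t: 100 only if every checklist item is judged "Yes", else 0
    return 100.0 if all(checklist_results) else 0.0

# Example: 8 of 10 checklist items satisfied at this turn
results = [True] * 8 + [False] * 2
print(acc_t(results))   # 80.0
print(pass_t(results))  # 0.0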

Command-Line Reporting

# Generate a report from evaluation results
refinebench-eval --mode report --output-dir ./results --max-turn-num 5

# Export report to JSON file
refinebench-eval --mode report --output-dir ./results --export-report ./report.json

# Save detailed CSV/TSV reports
refinebench-eval --mode report --output-dir ./results --save-csv ./csv_reports

# Report for specific refinement setting
refinebench-eval --mode report --output-dir ./results --refinement-setting guided_refinement

Python API for Reporting

from refinebench_eval import RefineBenchReporter

# Initialize reporter
reporter = RefineBenchReporter("./results")

# Load results
reporter.load_results(max_turn_num=5)

# Calculate statistics
stats = reporter.calculate_statistics(max_turn_num=5)

# Print pretty report (shows Acc and Pass@t)
reporter.print_report(stats=stats, max_turn_num=5, show_domains=True)

# Export to JSON
reporter.export_report("./report.json", stats=stats, max_turn_num=5)

# Save detailed CSV/TSV reports
reporter.save_csv_reports("./csv_reports", max_turn_num=5)

Report Contents

The generated report includes:

  • Overall Statistics: Total samples, costs, tokens, and latency
  • Per-Turn Analysis: Acc and Pass@t metrics, average cost, tokens, and time for each turn
  • Improvement Tracking: Shows how Acc and Pass@t change across turns with visual indicators
  • Domain-Level Statistics: Breakdown by domain (Math, CS, Biology, etc.)
  • Export Options: Save reports as JSON or CSV/TSV for further analysis

CSV/TSV Reports

When using --save-csv, the following files are generated:

  • full_report.tsv: Complete data for all samples and turns
  • performance_by_turn.tsv: Acc and Pass@t by turn
  • performance_by_domain.tsv: Performance breakdown by domain
  • cost_latency_summary.tsv: Cost and latency statistics
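
These TSV files are plain tab-separated tables, so they can be loaded directly with pandas for custom analysis (column names below are only inspected, not assumed):

import pandas as pd

# Load the per-turn performance table produced by --save-csv
df = pd.read_csv("./csv_reports/performance_by_turn.tsv", sep="\t")
print(df.columns.tolist())  # see which columns are available
print(df.head())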

Example Scripts

Check out the examples/ directory for sample scripts:

  • generate_report.py: Demonstrates comprehensive report generation
  • See examples/README.md for more details

⚙️ refinebench-eval Configuration

In refinebench-eval, users or practitioners can configure two major components — ModelConfig and ExperimentalConfig. Detailed parameter information for each configuration is provided below.

ModelConfig Parameters

  • model_name (str): Model identifier
  • model_path (str): Provider/model path (e.g., openai/gpt-4o-mini)
  • temperature (float, default 0.9): Sampling temperature
  • top_p (float, default 1.0): Nucleus sampling parameter
  • max_tokens (int, default 10000): Maximum number of generation tokens (for o1-mini, this is automatically set internally as max_completion_tokens)
  • reasoning_effort (str, default "medium"): Reasoning effort level (for reasoning models). If the LM is non-reasoning, specify this value as no_reasoning.

Note:
Since RefineBench is a multi-turn benchmark, users must set max_tokens to at least 10,000 to ensure sufficient context length across turns.


ExperimentalConfig Parameters

  • refinement_setting (str, default "self_refinement"): Refinement mode: one of self_refinement, guided_refinement, partial_guided_refinement, or self_refinement_with_criteria
  • max_turn_num (int, default 5): Maximum number of refinement turns
  • max_workers (int, default 1): Number of parallel workers
  • seed (int, default 0): Random seed for reproducibility
  • debug (bool, default False): Enable debug mode
  • use_cache (bool, default False): Use cached results if available
  • unknown_ratio (float, default 0.5): Ratio used for partial-guided refinement (range: 0–1)
  • run_domain (bool, default False): Enable domain-specific evaluation
  • domain (str, default None): Target domain to filter (e.g., "math", "cs")
  • enforce_execution (bool, default False): Force code execution for code-based tasks
  • mode (str, default None): Execution mode: one of lighteval, generate, or evaluate
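
For example, these parameters can be combined on an existing config object to run a domain-restricted, partial-guided evaluation (values below are illustrative):

# Restrict evaluation to the math domain and reveal only part of the checklist feedback
config.experimental_setup.refinement_setting = "partial_guided_refinement"
config.experimental_setup.unknown_ratio = 0.5
config.experimental_setup.run_domain = True
config.experimental_setup.domain = "math"
config.experimental_setup.max_turn_num = 5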

🎯 Domain-Specific Evaluation

To evaluate only a specific domain, set run_domain=True and specify the desired domain using one of the abbreviations listed below. (Currently, only one domain can be evaluated at a time; multiple-domain support will be added in future updates.)

Domain (abbreviation): number of problems

  • 📚 Math (math): 321
  • 📊 Statistics (statistics): 163
  • 📖 Humanities / Social Science (humanities): 185
  • ⚖️ Law (law): 142
  • 💻 Computer Science / AI (cs): 51
  • 🔬 Physics (physics): 69
  • 🧪 Chemistry (chem): 18
  • 🏗️ Engineering (engineering): 14
  • 🧬 Biology / Medicine (bio): 11
  • 💼 Economics / Business (economics): 9
  • 📦 Other (other): 17

When a domain is specified, the following internal code is executed automatically:

# Filter dataset by domain
from refinebench_eval.dataset import RefineBenchDataset

dataset = RefineBenchDataset()
math_indices = dataset.filter_by_domain("math")
cs_indices = dataset.filter_by_domain("cs")

What is RefineBench?

RefineBench is a benchmark of 1,002 challenging problems across 11 domains paired with a checklist-based evaluation framework. Our benchmark has three key features:

  • Verifiable and Non-verifiable tasks: it incorporates both free-form generation tasks and tasks evaluated by answer correctness
  • Multiple refinement setting support: its evaluation framework assesses both guided refinement and self-refinement scenarios
  • Diverse domains: the questions cover 11 domains, including not only math, statistics, and STEM but also reasoning-heavy fields such as humanities, social science, and law

🤔 Why is RefineBench needed?

As refinement becomes an increasingly important post-hoc method for improving language model (LM) responses, it is crucial to evaluate whether an LM can effectively refine its own previous outputs. However, existing studies present several limitations:

  • (1) Most prior work focuses on evaluating refinement capabilities on verifiable tasks, such as mathematical problem solving. Yet, in real-world settings, users often pose open-ended, reasoning-intensive questions that require long or subjective free-form answers—i.e., non-verifiable tasks. For instance, as shown in the figure (left), Gemini 2.5 Pro successfully refines its previous answers toward the correct solution on mathematical problems (e.g., AIME24), but in RefineBench, it struggles to refine effectively, achieving only a marginal improvement (+1.8%) over five turns.
  • (2) In real use cases, users typically request refinement repeatedly across multiple turns, a setting underexplored in existing benchmarks.
  • (3) Refinement performance is highly dependent on the feedback provided, yet previous analyses rarely control or systematically vary this feedback. As shown in the figure (right), giving more specific feedback and clear directions for improvement leads to substantial gains in self-refinement.
  • (4) Finally, a new generation of reasoning-oriented LMs has emerged, raising the question of whether conclusions from prior studies still generalize to these models.

Therefore, to advance the development and evaluation of refinement capabilities in modern LMs, RefineBench serves as a comprehensive testbed for systematically measuring self-refinement performance of frontier LMs.

refinebench_teaser

🌟 What makes RefineBench special?

RefineBench stands out through several unique features:

  1. Covering verifiable/non-verifiable tasks — It includes both free-form generation tasks and answer-based correctness tasks.
  2. Supporting various refinement scenarios — The evaluation framework supports both guided refinement and self-refinement settings.
  3. Broad domain diversity — Questions span 11 domains, ranging from math, statistics, and STEM to humanities, social sciences, and law.
  4. Checklist-based evaluation — Each task is assessed using a detailed checklist that defines explicit evaluation criteria.
  5. Consistent multi-turn assessment — Building on the checklist-based evaluation framework, RefineBench measures how much an LM's performance improves across multiple turns and evaluates refinement capability under both self- and guided-refinement scenarios—especially when checklist items remain unfulfilled (i.e., when the evaluator LM judges a checklist criterion as “No”).

⚙️ Evaluation Workflow of RefineBench

RefineBench evaluates an LM's ability to iteratively refine its own answers through a three-step workflow (a schematic code sketch follows the list):

  1. Refinement Step — Given a user query and the previous answer, the target LM generates a refined response (or an initial answer at the first turn). In self-refinement, the model autonomously decides whether to continue refining.
  2. Evaluation Step — An evaluator LM (GPT-4.1) checks the refined answer against a predefined checklist of binary criteria ("Yes"/"No") to assess quality and completeness.
  3. Feedback Step — The feedback derived from checklist results is used to form the next query, allowing iterative improvement across multiple turns (typically up to five).
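
The schematic sketch below restates this loop in code; the helper objects and method names (target_lm.generate, evaluator_lm.judge) are placeholders for illustration, not the released API:

def run_refinement(query, checklist, target_lm, evaluator_lm, max_turns=5):
    answer, feedback = None, None
    for turn in range(1, max_turns + 1):
        # 1) Refinement step: produce an initial answer (turn 1) or a refined one
        answer = target_lm.generate(query, previous_answer=answer, feedback=feedback)
        # 2) Evaluation step: judge the answer against the binary checklist
        verdicts = evaluator_lm.judge(answer, checklist)  # list of "Yes"/"No"
        if all(v == "Yes" for v in verdicts):
            break
        # 3) Feedback step: failed items form the basis of the next query
        #    (omitted in self-refinement, given explicitly in guided refinement)
        feedback = [item for item, v in zip(checklist, verdicts) if v == "No"]
    return answer, verdicts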

How to provide feedback in RefineBench?

  • Self-Refinement — The model receives no explicit feedback and must improve its answer independently.
  • Guided Refinement — The model is given explicit feedback on checklist items it failed previously. A partially guided variant (Partial Guided Refinement) provides only a subset of the checklist feedback, simulating limited real-world supervision.
refinebench_workflow
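
To make the partially guided setting concrete, here is a small illustrative sketch of sampling a subset of failed checklist items; treating unknown_ratio as the fraction of failed items that stays hidden is our assumption, not a statement about the library's internals:

import random

def sample_partial_feedback(failed_items, unknown_ratio=0.5, seed=42):
    # Assumption: unknown_ratio is the fraction of failed checklist items kept hidden
    rng = random.Random(seed)
    n_hidden = int(round(unknown_ratio * len(failed_items)))
    return rng.sample(failed_items, len(failed_items) - n_hidden)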

📚 Dataset Statistics & Comparison

RefineBench includes 1,000 problems across 11 domains and 239 subjects, each paired with a checklist averaging 9.9 binary criteria. The largest domains are Math (32%), Humanities/Social Science (19%), and Law (14%), ensuring broad coverage of both verifiable and non-verifiable reasoning tasks.

refinebench_statistics

Compared with existing datasets, RefineBench supports both extrinsic (guided) and intrinsic (self-refinement) settings, as well as partially guided refinement. It also provides checklist-based, fine-grained feedback control, enabling precise measurement of how models respond to feedback. In addition, RefineBench covers both verifiable (exact-match) and non-verifiable (free-form) tasks across 11 domains — representing the broadest coverage among existing datasets. Finally, it enables multi-turn, checklist-based evaluation, uniquely tracking consistent improvement across turns.

refinebench_dataset_comparison

🛠️ Troubleshooting

  • 🔑 API Key Errors: Set keys in environment variables or a .env file
  • 🤖 Model Not Supported: Check the model path format (e.g., openai/gpt-4o-mini)
  • 📦 Missing Dependencies: Install provider-specific packages (pip install openai anthropic)
  • 💾 Memory Issues: Reduce max_workers or max_tokens
  • 📊 No Results in Report: Ensure the results directory contains JSON files with evaluation data

Debug Mode:

config.experimental_setup.debug = True  # Evaluates only 2 samples

Check Your Results:

# Verify evaluation completed successfully
refinebench-eval --mode report --output-dir ./results

🤝 Contributing

Contributions are always welcome! 🎉 If you'd like to improve RefineBench, feel free to submit a Pull Request or open an Issue to discuss potential changes. While the current version of RefineBench is primarily English-based and designed for LM evaluation, we strongly believe that extending it to multilingual or multi-modal settings would be both valuable and necessary for advancing robust refinement capabilities across diverse languages and modalities. We warmly encourage contributions in these directions — whether it's adding new data, extending evaluation dimensions, or supporting additional model types. 🚀

👏 Acknowledgements

We would like to express our sincere gratitude to Changjae Lee (Department of Chemistry, KAIST) and Changyu Lee (Department of Engineering, KAIST) for their invaluable assistance in dataset annotation. Our underlying codebase was developed with reference to the IneqMath project. For inference and model serving, we leveraged several open-source frameworks, including LiteLLM, OpenRouter, vLLM, and the Transformers library. Huge thanks to all the contributors for these awesome repositories!! 🙌

📄 License

RefineBench is released under the CC BY-NC-ND 4.0 License. We strongly recommend using RefineBench exclusively for research purposes (non-commercial use) in accordance with institutional and ethical guidelines. The CC BY-NC-ND 4.0 License explicitly prohibits redistribution and modification of RefineBench. All accompanying code and evaluation scripts are released under the MIT License.

📝 Citation

If you use RefineBench in your work, please kindly cite it using the BibTeX entry below:

@misc{lee2025refinebenchevaluatingrefinementcapability,
      title={RefineBench: Evaluating Refinement Capability of Language Models via Checklists}, 
      author={Young-Jun Lee and Seungone Kim and Byung-Kwan Lee and Minkyeong Moon and Yechan Hwang and Jong Myoung Kim and Graham Neubig and Sean Welleck and Ho-Jin Choi},
      year={2025},
      eprint={2511.22173},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.22173}, 
}
