RefineBench: Evaluating Refinement Capability of Language Models via Checklists

MIT License CC BY-NC-ND License arXiv Paper Hugging Face Dataset Website

👋 Welcome to RefineBench — a comprehensive evaluation library for testing refinement capabilities of language models across multiple settings and domains.

💥 News

  • [2025.12.1] Our paper is now available on arXiv
  • [2025.10.6] 🎉 RefineBench was accepted as an Oral (Top 1%) at the MT-LLM Workshop at NeurIPS 2025!

✨ Key Features

  • 🔄 Four Refinement Settings: Self-Refinement, Guided Refinement, Partial Guided Refinement, Self-Refinement with Criteria
  • ⚡ Lightweight Evaluation: Quick testing with configurable sample sizes
  • 📊 Comprehensive Reporting: Pretty reports with accuracy, costs, latency, and improvement tracking
  • 🤖 Multi-Model Support: OpenAI, Anthropic, Google, OpenRouter, Together, AWS Bedrock, vLLM
  • 🌐 11 Domains: Math, Statistics, STEM, Humanities, Social Science, Law (Reasoning-heavy)
  • 📈 Broad Task Coverage: Both free-form and correctness-based tasks

📦 Installation

Environment Setup

# Clone and install
git clone https://github.com/RefineBench/refinebench-eval.git
cd refinebench-eval
pip install -e .

# Install dependencies
pip install -r requirements.txt

# Optional: Provider-specific dependencies
pip install openai anthropic google-generativeai litellm boto3 vllm

Set up API Keys

You can configure your API keys either by setting environment variables directly or by creating a .env file. The library automatically loads keys from .env using load_dotenv().

# Option 1: Environment Variables
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

# Option 2: Create .env file
echo "OPENAI_API_KEY=your-key" >> .env

You can include additional keys (e.g., GOOGLE_API_KEY, OPENROUTER_API_KEY) in the same .env file for multi-provider support.
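
To sanity-check which keys were actually picked up, here is a small standalone sketch using python-dotenv and os (a convenience snippet, not part of the library):

import os
from dotenv import load_dotenv

# Load variables from a local .env file into the environment (no-op if the file is absent)
load_dotenv()

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "OPENROUTER_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")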

🚀 Quick Start

🤗 How to Load RefineBench from HuggingFace

You can easily load the RefineBench dataset using the Hugging Face datasets library:

from datasets import load_dataset

# Load the RefineBench dataset
dataset = load_dataset("RefineBench/RefineBench")

# Explore the dataset
print(dataset)
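
For a closer look before running any evaluation, the standard datasets API lets you inspect split sizes and column names (no RefineBench-specific field names are assumed here):

# Inspect each split's size and available columns
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)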

⚡ Lightweight Evaluation (Quick Test)

We provide a Lightweight Evaluation mode for quick testing on a small subset of RefineBench before running a full-scale evaluation. This mode allows you to specify both the number of samples (n_samples) and the maximum number of turns (max_turn_num) for rapid experimentation. The evaluation automatically displays a summary report upon completion.

Command-Line Interface

# Run lightweight evaluation with 5 samples and 3 turns
refinebench-eval --mode lightweight --n-samples 5 --max-turn-num 3

# The command automatically displays:
# ✅ Completion status
# 📊 Accuracy per turn
# 📈 Improvement tracking

Python API Usage

from refinebench_eval import RefineBenchEvaluator, RefineBenchConfig

config = RefineBenchConfig(
    agent=RefineBenchConfig.ModelConfig(
        model_name="gpt_4o_mini",
        model_path="openrouter/openai/gpt-4o-mini",
        temperature=0.9,
        top_p=1.0,
        max_tokens=10000,
        reasoning_effort="no_reasoning"
    ),
    evaluator=RefineBenchConfig.ModelConfig(
        model_name="gpt_4o_mini",
        model_path="openrouter/openai/gpt-4o-mini",
        temperature=0.0,
        top_p=1.0,
        max_tokens=10000,
        reasoning_effort="no_reasoning"
    ),
    experimental_setup=RefineBenchConfig.ExperimentalConfig(
        refinement_setting="self_refinement",
        max_turn_num=5,
        max_workers=4,
        seed=42,
        debug=False,
        use_cache=False
    )
)

evaluator = RefineBenchEvaluator(config)
results = evaluator.run_lightweight_evaluation(n_samples=5, max_turn_num=3)
print(f"✅ Evaluated {len(results)} samples")

🧪 Full Evaluation with Shell Scripts

To reproduce the full results reported in our paper, use the provided shell scripts. Each script corresponds to a specific refinement setting. They first generate model responses for each turn (--mode generate), then evaluate them against the RefineBench checklist (--mode evaluate).

Supported Refinement Settings

  • Self-Refinement: The model iteratively improves its own responses without any external feedback.
  • Guided Refinement: The model refines its responses using explicit feedback derived from checklist results.
  • Partial-Guided Refinement: The model refines its responses based on partially provided feedback (controlled by unknown_ratio).
  • Self-Refinement with Criteria: The model refines its responses while being aware of specific evaluation criteria.

# Run in order (self-refinement first!)
./scripts/run_self_refinement.sh              # 🎯 Start here
./scripts/run_guided_refinement.sh
./scripts/run_partial_guided_refinement.sh
./scripts/run_self_refinement_with_criteria.sh

⚠️ Important: Before running guided refinement, partial-guided refinement, or self-refinement with criteria, you must first complete self-refinement, since all subsequent settings depend on the initial outputs from it.
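
If you prefer to script the two phases yourself instead of using the shell scripts, the sketch below chains --mode generate and --mode evaluate from Python via subprocess; note that passing --refinement-setting to these modes is an assumption on our part, mirroring the report command shown later:

import subprocess

# Phase 1: generate model responses for each turn
subprocess.run(["refinebench-eval", "--mode", "generate",
                "--refinement-setting", "self_refinement"], check=True)

# Phase 2: evaluate the generated responses against the checklist
subprocess.run(["refinebench-eval", "--mode", "evaluate",
                "--refinement-setting", "self_refinement"], check=True)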

How to set refinement settings

# for self-refinement
config.experimental_setup.refinement_setting = "self_refinement"

# for guided refinement
config.experimental_setup.refinement_setting = "guided_refinement"

# for partial-guided refinement
config.experimental_setup.refinement_setting = "partial_guided_refinement"
config.experimental_setup.unknown_ratio = 0.5  # For partial guided refinement

# for self-refinement with criteria
config.experimental_setup.refinement_setting = "self_refinement_w_criteria"

📊 Generating Reports

After running evaluations, you can generate comprehensive reports to analyze the results. The reporting feature provides statistics on Acc (accuracy) and Pass@t (pass accuracy) metrics, along with costs, latency, and improvement tracking across turns.

RefineBench Metrics

  • Acc_t (Accuracy): Percentage of checklist items that are correct (marked as "Yes")
  • Pass_t (Pass Accuracy): Percentage of samples where ALL checklist items are correct (100 if perfect, 0 otherwise)
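
As a concrete illustration, the sketch below computes both metrics for a single sample at a given turn; the function names are ours, not part of the library:

from typing import List

def acc_t(checklist_results: List[bool]) -> float:
    # Acc_t: percentage of checklist items judged "Yes"
    return 100.0 * sum(checklist_results) / len(checklist_results)

def pass_t(checklist_results: List[bool]) -> float:
    # Pass_t: 100 only if every checklist item is judged "Yes", else 0
    return 100.0 if all(checklist_results) else 0.0

# Example: 8 of 10 checklist items satisfied at this turn
results = [True] * 8 + [False] * 2
print(acc_t(results))   # 80.0
print(pass_t(results))  # 0.0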

Command-Line Reporting

# Generate a report from evaluation results
refinebench-eval --mode report --output-dir ./results --max-turn-num 5

# Export report to JSON file
refinebench-eval --mode report --output-dir ./results --export-report ./report.json

# Save detailed CSV/TSV reports
refinebench-eval --mode report --output-dir ./results --save-csv ./csv_reports

# Report for specific refinement setting
refinebench-eval --mode report --output-dir ./results --refinement-setting guided_refinement

Python API for Reporting

from refinebench_eval import RefineBenchReporter

# Initialize reporter
reporter = RefineBenchReporter("./results")

# Load results
reporter.load_results(max_turn_num=5)

# Calculate statistics
stats = reporter.calculate_statistics(max_turn_num=5)

# Print pretty report (shows Acc and Pass@t)
reporter.print_report(stats=stats, max_turn_num=5, show_domains=True)

# Export to JSON
reporter.export_report("./report.json", stats=stats, max_turn_num=5)

# Save detailed CSV/TSV reports
reporter.save_csv_reports("./csv_reports", max_turn_num=5)

Report Contents

The generated report includes:

  • Overall Statistics: Total samples, costs, tokens, and latency
  • Per-Turn Analysis: Acc and Pass@t metrics, average cost, tokens, and time for each turn
  • Improvement Tracking: Shows how Acc and Pass@t change across turns with visual indicators
  • Domain-Level Statistics: Breakdown by domain (Math, CS, Biology, etc.)
  • Export Options: Save reports as JSON or CSV/TSV for further analysis

CSV/TSV Reports

When using --save-csv, the following files are generated:

  • full_report.tsv: Complete data for all samples and turns
  • performance_by_turn.tsv: Acc and Pass@t by turn
  • performance_by_domain.tsv: Performance breakdown by domain
  • cost_latency_summary.tsv: Cost and latency statistics
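
These TSV files are plain tab-separated tables, so they can be loaded directly with pandas for custom analysis (column names below are only inspected, not assumed):

import pandas as pd

# Load the per-turn performance table produced by --save-csv
df = pd.read_csv("./csv_reports/performance_by_turn.tsv", sep="\t")
print(df.columns.tolist())  # see which columns are available
print(df.head())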

Example Scripts

Check out the examples/ directory for sample scripts:

  • generate_report.py: Demonstrates comprehensive report generation
  • See examples/README.md for more details

⚙️ refinebench-eval Configuration

In refinebench-eval, users or practitioners can configure two major components — ModelConfig and ExperimentalConfig. Detailed parameter information for each configuration is provided below.

ModelConfig Parameters

  • model_name (str): Model identifier
  • model_path (str): Provider/model path (e.g., openai/gpt-4o-mini)
  • temperature (float, default 0.9): Sampling temperature
  • top_p (float, default 1.0): Nucleus sampling parameter
  • max_tokens (int, default 10000): Maximum number of generation tokens (for o1-mini, this is automatically set internally as max_completion_tokens)
  • reasoning_effort (str, default "medium"): Reasoning effort level (for reasoning models). If the LM is non-reasoning, specify this value as no_reasoning.

Note:
Since RefineBench is a multi-turn benchmark, users must set max_tokens to at least 10,000 to ensure sufficient context length across turns.


ExperimentalConfig Parameters

  • refinement_setting (str, default "self_refinement"): Refinement mode: one of self_refinement, guided_refinement, partial_guided_refinement, or self_refinement_with_criteria
  • max_turn_num (int, default 5): Maximum number of refinement turns
  • max_workers (int, default 1): Number of parallel workers
  • seed (int, default 0): Random seed for reproducibility
  • debug (bool, default False): Enable debug mode
  • use_cache (bool, default False): Use cached results if available
  • unknown_ratio (float, default 0.5): Ratio used for partial-guided refinement (range: 0–1)
  • run_domain (bool, default False): Enable domain-specific evaluation
  • domain (str, default None): Target domain to filter (e.g., "math", "cs")
  • enforce_execution (bool, default False): Force code execution for code-based tasks
  • mode (str, default None): Execution mode: one of lighteval, generate, or evaluate
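
For example, these parameters can be combined on an existing config object to run a domain-restricted, partial-guided evaluation (values below are illustrative):

# Restrict evaluation to the math domain and reveal only part of the checklist feedback
config.experimental_setup.refinement_setting = "partial_guided_refinement"
config.experimental_setup.unknown_ratio = 0.5
config.experimental_setup.run_domain = True
config.experimental_setup.domain = "math"
config.experimental_setup.max_turn_num = 5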

🎯 Domain-Specific Evaluation

To evaluate only a specific domain, set run_domain=True and specify the desired domain using one of the abbreviations listed below. (Currently, only one domain can be evaluated at a time; multiple-domain support will be added in future updates.)

Domain (abbreviation): number of problems

  • 📚 Math (math): 321
  • 📊 Statistics (statistics): 163
  • 📖 Humanities / Social Science (humanities): 185
  • ⚖️ Law (law): 142
  • 💻 Computer Science / AI (cs): 51
  • 🔬 Physics (physics): 69
  • 🧪 Chemistry (chem): 18
  • 🏗️ Engineering (engineering): 14
  • 🧬 Biology / Medicine (bio): 11
  • 💼 Economics / Business (economics): 9
  • 📦 Other (other): 17

When a domain is specified, the following internal code is executed automatically:

# Filter dataset by domain
from refinebench_eval.dataset import RefineBenchDataset

dataset = RefineBenchDataset()
math_indices = dataset.filter_by_domain("math")
cs_indices = dataset.filter_by_domain("cs")

What is RefineBench?

RefineBench is a benchmark of 1,002 challenging problems across 11 domains paired with a checklist-based evaluation framework. Our benchmark has three key features:

  • Verifiable and Non-verifiable tasks: it incorporates both free-form generation tasks and tasks evaluated by answer correctness
  • Multiple refinement setting support: its evaluation framework assesses both guided refinement and self-refinement scenarios
  • Diverse domains: the questions cover 11 domains, including not only math, statistics, and STEM but also reasoning-heavy fields such as humanities, social science, and law

🤔 Why is RefineBench needed?

As refinement becomes an increasingly important post-hoc method for improving language model (LM) responses, it is crucial to evaluate whether an LM can effectively refine its own previous outputs. However, existing studies present several limitations:

  • (1) Most prior work focuses on evaluating refinement capabilities on verifiable tasks, such as mathematical problem solving. Yet, in real-world settings, users often pose open-ended, reasoning-intensive questions that require long or subjective free-form answers—i.e., non-verifiable tasks. For instance, as shown in the figure (left), Gemini 2.5 Pro successfully refines its previous answers toward the correct solution on mathematical problems (e.g., AIME24), but in RefineBench, it struggles to refine effectively, achieving only a marginal improvement (+1.8%) over five turns.
  • (2) In real use cases, users typically request refinement repeatedly across multiple turns, a setting underexplored in existing benchmarks.
  • (3) Refinement performance is highly dependent on the feedback provided, yet previous analyses rarely control or systematically vary this feedback. As shown in the figure (right), giving more specific feedback and clear directions for improvement leads to substantial gains in self-refinement.
  • (4) Finally, a new generation of reasoning-oriented LMs has emerged, raising the question of whether conclusions from prior studies still generalize to these models.

Therefore, to advance the development and evaluation of refinement capabilities in modern LMs, RefineBench serves as a comprehensive testbed for systematically measuring self-refinement performance of frontier LMs.

refinebench_teaser

🌟 What makes RefineBench special?

RefineBench stands out through several unique features:

  1. Covering verifiable/non-verifiable tasks — It includes both free-form generation tasks and answer-based correctness tasks.
  2. Supporting various refinement scenarios — The evaluation framework supports both guided refinement and self-refinement settings.
  3. Broad domain diversity — Questions span 11 domains, ranging from math, statistics, and STEM to humanities, social sciences, and law.
  4. Checklist-based evaluation — Each task is assessed using a detailed checklist that defines explicit evaluation criteria.
  5. Consistent multi-turn assessment — Building on the checklist-based evaluation framework, RefineBench measures how much an LM's performance improves across multiple turns and evaluates refinement capability under both self- and guided-refinement scenarios—especially when checklist items remain unfulfilled (i.e., when the evaluator LM judges a checklist criterion as “No”).

⚙️ Evaluation Workflow of RefineBench

RefineBench evaluates an LM's ability to iteratively refine its own answers through a three-step workflow (a schematic code sketch follows the list):

  1. Refinement Step — Given a user query and the previous answer, the target LM generates a refined response (or an initial answer at the first turn). In self-refinement, the model autonomously decides whether to continue refining.
  2. Evaluation Step — An evaluator LM (GPT-4.1) checks the refined answer against a predefined checklist of binary criteria ("Yes"/"No") to assess quality and completeness.
  3. Feedback Step — The feedback derived from checklist results is used to form the next query, allowing iterative improvement across multiple turns (typically up to five).
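
The schematic sketch below restates this loop in code; the helper objects and method names (target_lm.generate, evaluator_lm.judge) are placeholders for illustration, not the released API:

def run_refinement(query, checklist, target_lm, evaluator_lm, max_turns=5):
    answer, feedback = None, None
    for turn in range(1, max_turns + 1):
        # 1) Refinement step: produce an initial answer (turn 1) or a refined one
        answer = target_lm.generate(query, previous_answer=answer, feedback=feedback)
        # 2) Evaluation step: judge the answer against the binary checklist
        verdicts = evaluator_lm.judge(answer, checklist)  # list of "Yes"/"No"
        if all(v == "Yes" for v in verdicts):
            break
        # 3) Feedback step: failed items form the basis of the next query
        #    (omitted in self-refinement, given explicitly in guided refinement)
        feedback = [item for item, v in zip(checklist, verdicts) if v == "No"]
    return answer, verdicts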

How to provide feedback in RefineBench?

  • Self-Refinement — The model receives no explicit feedback and must improve its answer independently.
  • Guided Refinement — The model is given explicit feedback on checklist items it failed previously. A partially guided variant (Partial Guided Refinement) provides only a subset of the checklist feedback, simulating limited real-world supervision.
refinebench_workflow
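
To make the partially guided setting concrete, here is a small illustrative sketch of sampling a subset of failed checklist items; treating unknown_ratio as the fraction of failed items that stays hidden is our assumption, not a statement about the library's internals:

import random

def sample_partial_feedback(failed_items, unknown_ratio=0.5, seed=42):
    # Assumption: unknown_ratio is the fraction of failed checklist items kept hidden
    rng = random.Random(seed)
    n_hidden = int(round(unknown_ratio * len(failed_items)))
    return rng.sample(failed_items, len(failed_items) - n_hidden)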

📚 Dataset Statistics & Comparison

RefineBench includes 1,000 problems across 11 domains and 239 subjects, each paired with a checklist averaging 9.9 binary criteria. The largest domains are Math (32%), Humanities/Social Science (19%), and Law (14%), ensuring broad coverage of both verifiable and non-verifiable reasoning tasks.

refinebench_statistics

Compared with existing datasets, RefineBench supports both extrinsic (guided) and intrinsic (self-refinement) settings, as well as partially guided refinement. It also provides checklist-based, fine-grained feedback control, enabling precise measurement of how models respond to feedback. In addition, RefineBench covers both verifiable (exact-match) and non-verifiable (free-form) tasks across 11 domains — representing the broadest coverage among existing datasets. Finally, it enables multi-turn, checklist-based evaluation, uniquely tracking consistent improvement across turns.

refinebench_dataset_comparison

🛠️ Troubleshooting

  • 🔑 API Key Errors: Set keys in environment variables or a .env file
  • 🤖 Model Not Supported: Check the model path format (e.g., openai/gpt-4o-mini)
  • 📦 Missing Dependencies: Install provider-specific packages (pip install openai anthropic)
  • 💾 Memory Issues: Reduce max_workers or max_tokens
  • 📊 No Results in Report: Ensure the results directory contains JSON files with evaluation data

Debug Mode:

config.experimental_setup.debug = True  # Evaluates only 2 samples

Check Your Results:

# Verify evaluation completed successfully
refinebench-eval --mode report --output-dir ./results

🤝 Contributing

Contributions are always welcome! 🎉 If you'd like to improve RefineBench, feel free to submit a Pull Request or open an Issue to discuss potential changes. While the current version of RefineBench is primarily English-based and designed for LM evaluation, we strongly believe that extending it to multilingual or multi-modal settings would be both valuable and necessary for advancing robust refinement capabilities across diverse languages and modalities. We warmly encourage contributions in these directions — whether it's adding new data, extending evaluation dimensions, or supporting additional model types. 🚀

👏 Acknowledgements

We would like to express our sincere gratitude to Changjae Lee (Department of Chemistry, KAIST) and Changyu Lee (Department of Engineering, KAIST) for their invaluable assistance in dataset annotation. Our underlying codebase was developed with reference to the IneqMath project. For inference and model serving, we leveraged several open-source frameworks, including LiteLLM, OpenRouter, vLLM, and the Transformers library. Huge thanks to all the contributors for these awesome repositories!! 🙌

📄 License

RefineBench is released under the CC BY-NC-ND 4.0 License. We strongly recommend using RefineBench exclusively for research purposes (non-commercial use) in accordance with institutional and ethical guidelines. The CC BY-NC-ND 4.0 License explicitly prohibits redistribution and modification of RefineBench. All accompanying code and evaluation scripts are released under the MIT License.

📝 Citation

If you use RefineBench in your work, please kindly cite it using the BibTeX entry below:

@misc{lee2025refinebenchevaluatingrefinementcapability,
      title={RefineBench: Evaluating Refinement Capability of Language Models via Checklists}, 
      author={Young-Jun Lee and Seungone Kim and Byung-Kwan Lee and Minkyeong Moon and Yechan Hwang and Jong Myoung Kim and Graham Neubig and Sean Welleck and Ho-Jin Choi},
      year={2025},
      eprint={2511.22173},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.22173}, 
}
