OpenJury makes it easy to benchmark language models against each other while giving you complete control over the evaluation process. Whether you're comparing proprietary models or testing your own fine-tuned creations, OpenJury lets you choose your judge.
- 🎯 Flexible Benchmarking – Evaluate models on Alpaca-Eval, Arena-Hard, m-Arena-Hard and others
- 🔄 Swappable Judges – Switch between self-hosted (vLLM) or remote judges (OpenAI, Together AI, OpenRouter)
- 🌍 Multilingual Support – Test models across multiple languages with m-Arena-Hard
- 🛠️ Provider Agnostic – Works with any model available in LangChain
Here is how OpenJury compares to other libraries, feature by feature:
| Framework | MT-Bench | AlpacaEval | Arena-Hard | M-Arena-Hard | Tuned judge configuration | vLLM judge support |
|---|---|---|---|---|---|---|
| FastChat | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AlpacaEval | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Arena-Hard-Auto | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Lighteval | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Evalchemy | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| OpenJury | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
This comparison was compiled in October 2025. If any of these libraries have since implemented the missing features, please open an issue or send a PR and we will gladly update the table.
```bash
git clone https://github.com/OpenEuroLLM/OpenJury
cd OpenJury
uv sync
uv sync --extra vllm      # Optional: install vLLM support
uv sync --extra llamacpp  # Optional: install LlamaCpp support
```
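As a quick, optional sanity check (not part of OpenJury itself), you can verify that the optional backends installed by the extras are importable in your environment:

```python
# Optional sanity check (not part of OpenJury): confirm that the optional
# backends installed via `uv sync --extra vllm` / `--extra llamacpp` are importable.
import importlib.util

for extra, module in [("vllm", "vllm"), ("llamacpp", "llama_cpp")]:
    status = "installed" if importlib.util.find_spec(module) else "missing"
    print(f"{extra}: {status}")
```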
Compare two models head-to-head:

```bash
python openjury/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A gpt4_1106_preview \
  --model_B VLLM/utter-project/EuroLLM-9B \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 10
```
What happens here?

- Uses the completions already available for `gpt4_1106_preview` in the Alpaca-Eval dataset
- Generates completions for `model_B` with vLLM if they are not already cached
- Compares the two models using `deepseek-chat-v3.1`, which is the cheapest option available on OpenRouter (a rough sketch of this loop is shown below)
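Conceptually, the evaluation boils down to a pairwise "battle" loop: for each instruction, the judge sees both completions and picks a winner or declares a tie. The sketch below is illustrative only; the names and signatures are not the OpenJury API.

```python
# Illustrative sketch of the pairwise battle loop; not the OpenJury API.
from collections import Counter
from typing import Callable

def run_battles(
    instructions: list[str],
    completions_a: list[str],   # e.g. cached gpt4_1106_preview outputs
    completions_b: list[str],   # e.g. freshly generated EuroLLM-9B outputs
    judge: Callable[[str, str, str], str],  # returns "A", "B", or "tie"
) -> Counter:
    results = Counter()
    for instruction, a, b in zip(instructions, completions_a, completions_b):
        results[judge(instruction, a, b)] += 1
    return results

# Dummy judge so the sketch runs standalone: prefers the longer completion.
dummy_judge = lambda instr, a, b: "A" if len(a) > len(b) else "B" if len(b) > len(a) else "tie"
print(run_battles(["Say hi"], ["Hello there!"], ["Hi"], dummy_judge))
```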
It will then display the results of the battles:
```
============================================================
🏆 MODEL BATTLE RESULTS 🏆
📊 Dataset: alpaca-eval
🤖 Competitors: Model A: gpt4_1106_preview vs Model B: VLLM/utter-project/EuroLLM-9B
⚖️ Judge: OpenRouter/deepseek/deepseek-chat-v3.1
📈 Results Summary:
Total Battles: 10
Win Rate (A): 30.0%
✅ Wins: 3
❌ Losses: 6
🤝 Ties: 1
============================================================
```
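Judging from the sample output above, the win rate appears to be wins divided by total battles, with ties counting as neither a win nor a loss. This reading is inferred from the numbers shown, not from the OpenJury source:

```python
# Inferred from the sample output above (not from OpenJury's source):
# win rate = wins / total battles, with ties counted as neither win nor loss.
wins, losses, ties = 3, 6, 1
total = wins + losses + ties
print(f"Win Rate (A): {wins / total:.1%}")  # Win Rate (A): 30.0%
```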
Models are specified using the format `{LangChain Backend}/{Model Path}`. Examples:

- `Together/meta-llama/Llama-3.3-70B-Instruct-Turbo`
- `ChatOpenAI/gpt-4o`
- `LlamaCpp/jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf`
- `VLLM/utter-project/EuroLLM-9B`
- `OpenRouter/deepseek/deepseek-chat-v3.1`
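To make the convention concrete, a spec splits at the first slash into a backend name and a model path. This is an illustrative sketch only; OpenJury's actual parsing may differ:

```python
# Illustrative only: how a "{LangChain Backend}/{Model Path}" spec decomposes.
# OpenJury's internal parsing may differ.
spec = "VLLM/utter-project/EuroLLM-9B"
backend, model_path = spec.split("/", 1)
print(backend)     # VLLM
print(model_path)  # utter-project/EuroLLM-9B
```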
For instance, to run everything locally with vLLM:

```bash
python openjury/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A VLLM/Qwen/Qwen2.5-0.5B-Instruct \
  --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
  --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
  --n_instructions 10
```

LlamaCpp allows you to run GGUF models locally with high efficiency across various hardware, including CPUs, Apple Silicon (Metal), and NVIDIA GPUs. This is ideal for testing your setup without relying on external API keys or high-end server GPUs.
Install the LlamaCpp extra:

```bash
uv sync --extra llamacpp
```

Download GGUF models using `huggingface-cli` (included via `huggingface-hub`):
```bash
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q8_0.gguf --local-dir ./models
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q8_0.gguf --local-dir ./models
```

The LlamaCpp provider expects a file path to a `.gguf` model after the `LlamaCpp/` prefix. For absolute paths, this results in a double slash (e.g., `LlamaCpp//home/user/models/model.gguf`).
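As a minimal sketch of the prefix convention (not the OpenJury implementation), stripping `LlamaCpp/` recovers the file path, which is why absolute paths naturally produce a double slash:

```python
# Minimal sketch, not OpenJury's implementation: recovering the .gguf path
# from a "LlamaCpp/<path>" spec. Absolute paths yield "LlamaCpp//...".
spec = "LlamaCpp//home/user/models/model.gguf"
assert spec.startswith("LlamaCpp/")
gguf_path = spec[len("LlamaCpp/"):]
print(gguf_path)  # /home/user/models/model.gguf
```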
Mixed example — local LlamaCpp model with a remote judge:
```bash
uv run python openjury/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A LlamaCpp/./models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --model_B OpenRouter/qwen/qwen-2.5-7b-instruct \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 10 --max_out_tokens_models 16384
```
Fully local example, no API keys required (useful for verifying your setup):

```bash
uv run python openjury/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A LlamaCpp/./models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --model_B LlamaCpp/./models/qwen2.5-1.5b-instruct-q8_0.gguf \
  --judge_model LlamaCpp/./models/qwen2.5-1.5b-instruct-q8_0.gguf \
  --n_instructions 5 --max_out_tokens_models 16384
```

Note: Ensure you have the required LangChain dependencies installed for your chosen provider. If you use a remote endpoint, you also need to set your API credentials.
When using vLLM, OpenJury automatically picks the right inference method based on the model:
- Instruct/chat models (e.g. `swiss-ai/Apertus-8B-Instruct-2509`): the tokenizer already defines a chat template, so OpenJury uses `vllm.LLM.chat()` and the template is applied automatically.
- Base/pretrained models (e.g. `swiss-ai/Apertus-8B-2509`): these typically don't ship a chat template. OpenJury detects this and falls back to `vllm.LLM.generate()` (plain text, no chat formatting). A warning is printed when this happens (a sketch of this kind of check is shown below).
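The detection essentially hinges on whether the tokenizer ships a chat template. The following is a simplified illustration of that check using `transformers`; OpenJury's actual logic may differ:

```python
# Simplified illustration (not OpenJury's code): decide between chat-style and
# plain-text generation based on whether the tokenizer defines a chat template.
from transformers import AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.chat_template is not None:
    print("Chat template found -> use vllm.LLM.chat()")
else:
    print("No chat template -> fall back to vllm.LLM.generate()")
```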
If you need to force a specific chat template (for example, for a base model that you know works with ChatML), pass it via `--chat_template`:
```bash
python openjury/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A VLLM/swiss-ai/Apertus-8B-2509 \
  --model_B VLLM/swiss-ai/Apertus-8B-Instruct-2509 \
  --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
  --chat_template '{% for message in messages %}<|im_start|>{{ message["role"] }}\n{{ message["content"] }}<|im_end|>\n{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}'
```

This override applies to all vLLM models in the run. For remote providers (OpenAI, Together, OpenRouter), the flag is ignored since they handle templates server-side.
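If you want to preview what a template override produces before launching a full run, `transformers` can render the same Jinja template locally. This is a quick check independent of OpenJury:

```python
# Quick local preview of the ChatML override above; independent of OpenJury.
from transformers import AutoTokenizer

chatml = (
    '{% for message in messages %}<|im_start|>{{ message["role"] }}\n'
    '{{ message["content"] }}<|im_end|>\n{% endfor %}'
    '{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}'
)
tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-8B-2509")
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(
    messages, chat_template=chatml, tokenize=False, add_generation_prompt=True
))
```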
| Dataset | Description |
|---|---|
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard` | More challenging evaluation suite |
| `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
| `m-arena-hard-{lang}` | Language-specific variants (e.g., `ar`, `cs`, `de`) |
| `m-arena-hard-EU` | All EU languages combined |
| `fluency-{lang}` | Fluency evaluation for pretrained models (finnish, french, german, spanish, swedish) |
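For example, to run the same pairwise comparison over a few of the `m-arena-hard-{lang}` variants listed above, you can loop over dataset names. The models and flags below are copied from the earlier examples; adjust them to your setup:

```python
# Run the same battle over a few m-arena-hard language variants.
# Models and flags are taken from the examples above; adjust to your setup.
import subprocess

for lang in ["ar", "cs", "de"]:
    subprocess.run(
        [
            "python", "openjury/generate_and_evaluate.py",
            "--dataset", f"m-arena-hard-{lang}",
            "--model_A", "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
            "--model_B", "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
            "--judge_model", "OpenRouter/deepseek/deepseek-chat-v3.1",
            "--n_instructions", "10",
        ],
        check=True,
    )
```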
Pre-download all datasets before running jobs:
python -c "from openjury.utils import download_all; download_all()" # Download all datasets (optional)Datasets are stored in:
$OPENJURY_EVAL_DATA(if set)~/openjury-eval-data/(default)
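The resolution rule above can be expressed in a couple of lines. This is a sketch of the documented behavior, not the package's internal code:

```python
# Sketch of the storage-location rule described above (not OpenJury internals):
# use $OPENJURY_EVAL_DATA when set, otherwise fall back to ~/openjury-eval-data/.
import os
from pathlib import Path

data_dir = Path(os.environ.get("OPENJURY_EVAL_DATA", Path.home() / "openjury-eval-data"))
print(data_dir)
```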
We welcome contributions! Whether it's bug fixes, new features, or additional benchmark support, feel free to open an issue or submit a pull request.
If you use this work in your research, please cite the following paper:

```bibtex
@inproceedings{
  salinas2025tuning,
  title={Tuning {LLM} Judge Design Decisions for 1/1000 of the Cost},
  author={David Salinas and Omar Swelam and Frank Hutter},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=cve4NOiyVp}
}
```

The judge configurations were tuned in this paper, and a lot of its code is reused in this package.