This framework provides an evaluation system for LLMs with support for:
- Multiple LLM providers (OpenAI, Anthropic, Google, Ollama, Custom OpenAI-compatible deployments)
- Web-search-based answer generation
- Random model selection for judges
- Web-based dashboard for results visualization
- Token / cost estimations
- Python: 3.10 or higher
- Operating System: Linux, macOS, or WSL recommended
- API Keys: Provider API keys for the models
- Ollama: Local Ollama installation for running open-source models
- Web Search: Tavily API key for web search (optional)
- OpenAI Compliant Endpoint: Custom OpenAI-compatible endpoint (optional)
The benchmark is built on a custom dataset of Romanian tax and accounting questions.
Each entry contains both the question (intrebare) and the official or expert answer (raspuns), along with metadata for evaluation.
Schema:
- `id` – unique identifier
- `data` – date of entry
- `link` – reference or context (if available)
- `intrebare` – the user's question
- `raspuns` – expert answer
- `dificultate` – difficulty level (`easy`, `medium`, `hard`)
- `categorie` – topic area (e.g., impozit pe profit, contabilitate, etc.)
- `timp_raspuns_uman_min` – estimated time (minutes) for a human expert to answer
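When preparing your own dataset, a quick schema check can save a failed run. A minimal sketch assuming pandas and the column names above (the `validate_dataset` helper is illustrative, not part of the framework):

```python
import pandas as pd
from io import StringIO

# Columns the benchmark dataset is expected to contain (from the schema above)
EXPECTED_COLUMNS = {
    "id", "data", "link", "intrebare", "raspuns",
    "dificultate", "categorie", "timp_raspuns_uman_min",
}

def validate_dataset(df: pd.DataFrame) -> list[str]:
    """Return a list of schema problems (an empty list means the dataset looks OK)."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "dificultate" in df.columns:
        bad = set(df["dificultate"].dropna()) - {"easy", "medium", "hard"}
        if bad:
            problems.append(f"unknown difficulty values: {sorted(bad)}")
    return problems

# Tiny inline example with the expected header
csv = StringIO(
    "id,data,link,intrebare,raspuns,dificultate,categorie,timp_raspuns_uman_min\n"
    "199837,2025-06-05,,Scutire profit reinvestit,raspuns expert,easy,impozit pe profit,20.0\n"
)
print(validate_dataset(pd.read_csv(csv)))  # []
```

Running this before `evaluate` catches missing columns or typos in the difficulty values early.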
Example rows:
| id | data | intrebare | dificultate | categorie | timp_raspuns_uman_min |
|---|---|---|---|---|---|
| 199999 | 2025-08-05 | Cesiune creanțe pentru care s-a constituit anterior provizion – Impozit pe profit | hard | impozit pe profit | 60.0 |
| 199837 | 2025-06-05 | Scutire profit reinvestit – Impozit pe profit | easy | impozit pe profit | 20.0 |
| 199766 | 2025-04-05 | Completare D101 – Impozit pe profit | medium | impozit pe profit | 15.0 |
This dataset ensures that models are tested on realistic, domain-specific scenarios, combining:
- Legal references (Cod Fiscal, norme metodologice)
- Accounting practices (monografii contabile)
- Difficulty levels for nuanced benchmarking
Each answer is evaluated on three dimensions using binary (VALID/INVALID) judgments:
| Metric | What it evaluates |
|---|---|
| Correctness | Does the answer match the gold answer or provide a valid alternative? Evaluates completeness and reasoning. |
| Legal Citation | Are the cited legal articles correct and aligned with the gold answer? Checks for contradicting or missing citations. |
| Clarity & Structure | Is the answer well-structured, easy to understand, and accessible to non-experts? |
All metrics return VALID (1.0) or INVALID (0.0) with an explanation in Romanian.
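Since each verdict is binary, per-metric scores aggregate as the fraction of VALID judgments. A small illustrative sketch (the field names are hypothetical, not the framework's output schema):

```python
from statistics import mean

# Illustrative judgments: one dict per (answer, metric) verdict
judgments = [
    {"metric": "correctness", "verdict": "VALID"},
    {"metric": "correctness", "verdict": "INVALID"},
    {"metric": "legal_citation", "verdict": "VALID"},
    {"metric": "clarity", "verdict": "VALID"},
]

def metric_scores(items):
    """Map each metric to its fraction of VALID verdicts (VALID=1.0, INVALID=0.0)."""
    by_metric = {}
    for item in items:
        by_metric.setdefault(item["metric"], []).append(
            1.0 if item["verdict"] == "VALID" else 0.0
        )
    return {m: mean(vals) for m, vals in by_metric.items()}

print(metric_scores(judgments))  # {'correctness': 0.5, 'legal_citation': 1.0, 'clarity': 1.0}
```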
The --evaluators flag lets you specify which LLM models act as judges to score the generated answers.
```bash
# Single evaluator
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5

# Multiple evaluators (scores are averaged)
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5 claude-4-sonnet gemini-2.5-pro
```

How it works:
- Each evaluator judges every answer on the metrics above (VALID/INVALID)
- When using multiple evaluators, each answer is scored by a randomly selected evaluator from the list you provide
- Mix cloud and local models: `--evaluators gpt-5 llama3.3:latest qwen3:30b`
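The per-answer random assignment can be pictured with a small sketch (purely illustrative; this is not the framework's internal code):

```python
import random

evaluators = ["gpt-5", "llama3.3:latest", "qwen3:30b"]
answers = ["answer_1", "answer_2", "answer_3", "answer_4"]

rng = random.Random(42)  # seeded only so this sketch is reproducible
# Each answer gets one judge drawn at random from the evaluator list
assignment = {answer: rng.choice(evaluators) for answer in answers}
for answer, judge in assignment.items():
    print(f"{answer} -> judged by {judge}")
```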
Supported evaluator models:
- OpenAI: `gpt-5`, `gpt-5-mini`, `gpt-4o`, `o1`, `o3`, etc.
- Anthropic: `claude-4-sonnet`, `claude-3-5-sonnet`, etc.
- Google: `gemini-2.5-pro`, `gemini-2.5-flash`, etc.
- Ollama (local): `llama3.3:latest`, `qwen3:30b`, `mistral-large`, etc.
Tip: Using multiple diverse evaluators reduces bias and provides more robust scores. Some models are more reliable than others.
- Clone the repository and set up a virtual environment:

```bash
git clone https://github.com/Nexus-Media/RO-FIN-LLM-Benchmark
cd RO-FIN-LLM-Benchmark
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

- Make the CLI script executable:

```bash
chmod +x rollmbenchmark
```

- For Ollama users, install and set up Ollama:

```bash
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Start the Ollama service
ollama serve

# Pull a model (example with Llama 3.1)
ollama pull llama3.1:8b
```

Rename `.env_example` to `.env` and add your API keys:
```bash
# LLM Provider API Keys
OPENAI_API_KEY=placeholder
ANTHROPIC_API_KEY=placeholder
GOOGLE_API_KEY=placeholder

# Model name
MODEL_NAME=gpt-5-nano

# Optional: Ollama URL configuration
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Web search for RAG (only needed for --with-tools)
TAVILY_API_KEY=placeholder

# Optional: Custom endpoint
# VERY IMPORTANT: Must be OpenAI-compatible
# When set, the provider is forced to OpenAI-compatible and the key below is used
# If your endpoint requires an explicit model, set MODEL_NAME accordingly
CUSTOM_ENDPOINT=
CUSTOM_ENDPOINT_API_KEY=
```

Notes:
- When `CUSTOM_ENDPOINT` is set, the provider is treated as OpenAI-compatible and `CUSTOM_ENDPOINT_API_KEY` is used.
- If your custom endpoint does not require a model name, you can omit `MODEL_NAME` (the system will use an internal "auto" value).
Important
Enabling the search tool makes token consumption skyrocket (100k tokens per question for some models)
Make sure the model you are testing is capable of tool calling
Use with caution
The framework provides five main commands through the rollmbenchmark CLI:
Generates answers for questions and evaluates them in one step.
```bash
./rollmbenchmark evaluate [OPTIONS]
```

Options:
- `--model MODEL`: Model to use (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro)
- `--dataset PATH`: Path to the dataset CSV file (optional; uses the default dataset if not specified)
- `--with-tools`: Enable web search during answer generation (IMPORTANT: token usage will skyrocket; use with caution)
- `--custom-endpoint URL`: OpenAI-compatible base URL (uses `CUSTOM_ENDPOINT_API_KEY`)
- `--evaluators`: Models to use as judges (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro, llama3.3); at least one model is required
Examples:
```bash
# Basic evaluation with the default dataset
./rollmbenchmark evaluate --model gpt-5-mini

# Evaluation with tools enabled
./rollmbenchmark evaluate --model claude-3-5-sonnet --with-tools

# Custom dataset and evaluators
./rollmbenchmark evaluate --dataset /path/to/my-dataset.csv --model gemini-2.5-pro --evaluators gpt-5 llama3.3:latest

# Using Groq via custom endpoint (model required)
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark evaluate --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192

# Using your own server (no model required)
CUSTOM_ENDPOINT_API_KEY=my_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server
```

Output:
- Generates answers and saves them to `data/answers/`
- Evaluates the answers using multiple metrics
- Saves evaluation results to `data/evaluations/`
- Prints the run ID and results CSV path
Evaluates answers from a previously generated CSV file.
```bash
./rollmbenchmark evaluate-answers [OPTIONS]
```

Options:
- `--answers-csv PATH`: Path to the answers CSV file (optional; uses the latest generated answers CSV in `data/answers/` if not specified)
- `--evaluators`: List of models from which an evaluator is randomly picked for each answer

Examples:

```bash
# Evaluate the latest answers file
./rollmbenchmark evaluate-answers

# Evaluate a specific answers file
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv

# Evaluate a specific answers file with specific evaluators
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv --evaluators gpt-5
```

Use Case: When you have generated answers separately and want to evaluate them.
Generates answers without evaluation.
```bash
./rollmbenchmark generate-answers [OPTIONS]
```

Options:
- `--model MODEL`: Model to use
- `--dataset PATH`: Path to the dataset CSV file (optional)
- `--with-tools`: Enable web search during generation (token usage will skyrocket; use with caution)
- `--custom-endpoint URL`: OpenAI-compatible base URL (uses `CUSTOM_ENDPOINT_API_KEY`)

Examples:

```bash
# Generate answers only, with tools
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools

# Generate with Groq
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark generate-answers --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192
```

Use Case: When you want to generate answers in bulk and evaluate them later, or compare different models' answers.
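For the comparison use case, answers from several models pivot nicely into a side-by-side view. A sketch assuming the combined-dataset columns `question_id`, `model_name`, and `raspuns_generat`:

```python
import pandas as pd

# Rows in the shape used by data/answers/combined/combined_dataset.csv
rows = [
    {"question_id": 1, "model_name": "gpt-5-mini", "raspuns_generat": "Answer A"},
    {"question_id": 1, "model_name": "llama3.1:8b", "raspuns_generat": "Answer B"},
    {"question_id": 2, "model_name": "gpt-5-mini", "raspuns_generat": "Answer C"},
]
df = pd.DataFrame(rows)

# One row per question, one column per model
side_by_side = df.pivot(index="question_id", columns="model_name", values="raspuns_generat")
print(side_by_side)
```

Questions a model did not answer show up as `NaN` in its column, which makes coverage gaps easy to spot.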
Launches a Streamlit-based web dashboard for visualizing results and conducting human evaluations.
```bash
./rollmbenchmark dashboard
```

The dashboard provides three main pages:
- LLM Evaluation Dashboard (default) - Visualize and compare evaluation results
- Human Review - Rate generated answers using a rubric-based scoring system
- Human A/B Review - Compare two model answers side-by-side and pick the best one
```
data/
├── users.txt                       # Reviewer IDs (one per line, auto-created)
├── answers/
│   ├── generated_answers_*.csv    # Generated answers (from evaluate/generate-answers)
│   └── combined/
│       └── combined_dataset.csv   # Required for human evaluation pages
├── evaluations/
│   ├── automated/                 # LLM evaluation results (from evaluate)
│   │   └── *.csv
│   └── human/
│       ├── rubric_evaluations.csv # Human review scores (auto-generated)
│       └── model_comparison.csv   # A/B comparison results (auto-generated)
```
Important
The LLM Evaluation Dashboard works out of the box after running evaluations.
The Human Review pages require additional setup (see below).
For LLM Evaluation Dashboard:
Run at least one evaluation:

```bash
./rollmbenchmark evaluate --model gpt-5-mini
```

For Human Review pages:

1. Reviewer IDs – A default `data/users.txt` is auto-created with example reviewers. Edit this file to add your team's reviewer IDs (one per line).

2. Create a combined dataset with answers from multiple models:

```bash
mkdir -p data/answers/combined
# Combine your generated answers CSVs into one file
# The file must have columns: question_id, intrebare, raspuns_generat, model_name, raspuns (optional)
```

Example: merge multiple answer files:

```python
import pandas as pd

files = [
    "data/answers/generated_answers_model1.csv",
    "data/answers/generated_answers_model2.csv",
]
combined = pd.concat([pd.read_csv(f) for f in files])
combined.to_csv("data/answers/combined/combined_dataset.csv", index=False)
```
| Page | Purpose | Data source |
|---|---|---|
| LLM Evaluation Dashboard | Automated + human evaluation insights | data/evaluations/automated/*.csv, data/evaluations/human/*.csv |
| Human Review | Score answers on correctness, legal citations, clarity | data/answers/combined/combined_dataset.csv |
| Human A/B Review | Side-by-side comparison of two models | data/answers/combined/combined_dataset.csv |
Generate many answers efficiently using the OpenAI Batch API, then download results and convert them to the standard answers CSV.
Note
Requires a valid OPENAI_API_KEY and access to OpenAI Batch. Custom endpoints are not supported for Batch.
Commands:
- `generate-answers-batch` – create a Batch job from a dataset (no evaluation)
- `batch-status` – check the status of a Batch job
- `batch-download` – download results and produce an answers CSV
Files produced:
- Requests and metadata in `data/batches/openai_<timestamp>/`
- Output JSONL files saved on download
- Final answers CSV saved to `data/answers/`
```bash
# Create a batch job (uses the Responses API by default)
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5-mini

# Create a batch job, wait for completion, then auto-save the answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5 \
  --wait

# Check job status later
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID

# If needed, point to a specific batch_meta.json to improve file placement
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/openai_<timestamp>/batch_meta.json
```

Options (generate-answers-batch):
- `--dataset PATH` – absolute path recommended (CSV)
- `--model MODEL` – model name; the backend is chosen automatically by model:
  - OpenAI (e.g., `gpt-5-mini`, `gpt-4o`): uses the OpenAI Batch API
  - Anthropic (e.g., `claude-3.7-sonnet`): uses Anthropic Message Batches
  - Google Gemini (e.g., `gemini-3-pro`, `gemini-2.5-flash`): uses the Gemini Batch API
- `--wait` – wait for job completion and auto-save the answers CSV
- `--custom-endpoint` – not supported for Batch
Limitations and tips:
- Only OpenAI's official API is supported for Batch (no custom base URLs).
- Very large datasets may take time to process; use `batch-status` to monitor.
- Results are stored under `data/batches/` and converted to CSV under `data/answers/`.
When you pass a Gemini model to generate-answers-batch, it uses the Gemini Batch API under the hood.
```bash
# Create a Gemini batch and wait for completion
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gemini-2.5-pro \
  --wait

# Check job status
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID
```

Files produced (Gemini):
- `data/batches/gemini_<timestamp>/gemini_batch_requests_<timestamp>.json`
- `data/batches/gemini_<timestamp>/batch_meta.json`
- `data/batches/gemini_<timestamp>/gemini_batch_output_<timestamp>.jsonl`
- `data/answers/generated_answers_batch_<timestamp>.csv`
When you pass an Anthropic model to the same generate-answers-batch command, it automatically uses Anthropic Message Batches under the hood.
```bash
# Create an Anthropic batch, wait for completion, then auto-save the answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model claude-3.7-sonnet \
  --wait

# Check job status later (works for both providers)
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV (works for both providers)
./rollmbenchmark batch-download --batch-id BATCH_ID

# Optional: pointing to the meta file helps with file placement and model detection
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/anthropic_<timestamp>/batch_meta.json
```

Files produced (Anthropic):
- `data/batches/anthropic_<timestamp>/anthropic_batch_requests_<timestamp>.json` (requests payload)
- `data/batches/anthropic_<timestamp>/batch_meta.json` (submission metadata)
- `data/batches/anthropic_<timestamp>/anthropic_batch_output_<timestamp>.jsonl` (raw results)
- `data/answers/generated_answers_batch_<timestamp>.csv` (final answers CSV)
Other behavior matches OpenAI Batch notes above (ID normalization, token usage normalization, cost calculation).
Important
To ensure the framework detects the model, give it a generic name like the ones below.
The framework automatically detects Ollama models based on their names. Example patterns include:
- Llama models: `llama3.1:8b`, `llama3.1:70b`, `llama2:7b`
- Mistral models: `mistral:7b`, `mistral-nemo:12b`
- Qwen models: `qwen2.5:7b`, `qwen2.5:14b`
- Phi models: `phi3:3.8b`, `phi3:14b`
- Gemma models: `gemma2:2b`, `gemma2:9b`
- CodeLlama models: `codellama:7b`, `codellama:13b`
- DeepSeek models: `deepseek-coder:6.7b`
- Dolphin models: `dolphin-2.6-mistral:7b`
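A name-based heuristic for this detection can be sketched as follows (an illustrative guess at the logic, not the actual implementation in the repo):

```python
# Family prefixes that look like local Ollama models (illustrative list)
OLLAMA_PREFIXES = (
    "llama", "mistral", "qwen", "phi",
    "gemma", "codellama", "deepseek", "dolphin",
)

def looks_like_ollama_model(name: str) -> bool:
    """Heuristic: a generic family prefix, usually followed by a ':tag' suffix."""
    return name.lower().startswith(OLLAMA_PREFIXES)

print(looks_like_ollama_model("llama3.1:8b"))  # True
print(looks_like_ollama_model("gpt-5-mini"))   # False
```

This is why generic family names matter: a custom tag like `my-finetune:latest` would not match any known prefix.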
1. Start the Ollama service:

```bash
ollama serve
```

2. Pull a model (if not already available):

```bash
ollama pull llama3.1:8b
```

3. Run an evaluation with an Ollama model:

```bash
# Basic evaluation
./rollmbenchmark evaluate --model llama3.1:8b

# With web search tools
./rollmbenchmark evaluate --model mistral:7b --with-tools
```
The framework automatically connects to Ollama at http://localhost:11434. To use a different Ollama instance:
1. Set a custom Ollama URL in `.env`:

```bash
OLLAMA_BASE_URL=http://your-ollama-server:11434
```

2. Verify the connection: the framework will automatically validate the Ollama connection when using Ollama models.
Connection Issues:

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Restart the Ollama service
ollama serve
```

Model Not Found:

```bash
# List available models
ollama list

# Pull the missing model
ollama pull <model-name>
```

```bash
# 1. Generate answers and evaluate in one step
./rollmbenchmark evaluate --model gpt-5-mini --with-tools

# 2. View results in the dashboard
./rollmbenchmark dashboard
```

```bash
# 1. Generate answers with different models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model claude-3-5-sonnet --with-tools
./rollmbenchmark generate-answers --model gemini-2.5-pro --with-tools

# 2. Evaluate each set of answers
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp1>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp2>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp3>.csv

# 3. Compare results in the dashboard
./rollmbenchmark dashboard
```

```bash
# 1. Start the Ollama service
ollama serve

# 2. Pull models (if needed)
ollama pull llama3.1:8b
ollama pull mistral:7b

# 3. Evaluate with local models
./rollmbenchmark evaluate --model llama3.1:8b
./rollmbenchmark evaluate --model mistral:7b --with-tools

# 4. Compare local vs cloud models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model llama3.1:8b --with-tools

# 5. Evaluate and compare
./rollmbenchmark evaluate-answers <answers_csv_path_for_gpt-5-mini> --evaluators gpt-5 qwen3:30b
./rollmbenchmark evaluate-answers <answers_csv_path_for_llama3.1:8b> --evaluators gemini-2.5-pro
./rollmbenchmark dashboard
```

You can point the system at any OpenAI-compatible API using a custom base URL. Two common scenarios:
- Example values (adjust to your account):

```bash
CUSTOM_ENDPOINT=https://api.groq.com/openai/v1
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here
MODEL_NAME=llama3-70b-8192
```

- Run with defaults from `.env`:

```bash
./rollmbenchmark evaluate
```

- Or override via the CLI (recommended):

```bash
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here \
./rollmbenchmark evaluate \
  --custom-endpoint https://api.groq.com/openai/v1 \
  --model llama3-70b-8192
```

- Minimal `.env`:

```bash
CUSTOM_ENDPOINT=https://my-openai-compatible-server
CUSTOM_ENDPOINT_API_KEY=my_server_token
# MODEL_NAME omitted on purpose
```

- Run:

```bash
./rollmbenchmark evaluate
```

- Or with the CLI only (no `.env`):

```bash
CUSTOM_ENDPOINT_API_KEY=my_server_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server --evaluators gpt-5
```

Behavior summary when using a custom endpoint:
- Provider is forced to OpenAI-compatible.
- If `MODEL_NAME` is not set, the system uses an internal "auto" model sentinel.
The pricing is currently based on hardcoded values (per 1 million tokens).
To update or add prices for models:
- Open the file:
src/infrastructure/configuration.py - Locate the following lines:
- Line 70 → Input token pricing
- Line 94 → Output token pricing
- Modify the existing values or add new models with their corresponding prices.
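With those per-1M prices, the cost of a run is a straightforward computation. A sketch using the input prices from the pricing table (the output prices here are assumed for illustration; check the output dictionary in `configuration.py` for real values):

```python
# USD per 1M tokens. Input prices match the pricing table; output prices are
# illustrative assumptions for this sketch.
PRICE_PER_1M_INPUT = {"gpt-5": 1.25, "gpt-5-mini": 0.25}
PRICE_PER_1M_OUTPUT = {"gpt-5": 10.00, "gpt-5-mini": 2.00}  # assumed values

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one run, given per-1M-token prices for the model."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_1M_INPUT[model]
        + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT[model]
    )

# 400k input + 100k output tokens on gpt-5-mini
print(round(run_cost("gpt-5-mini", 400_000, 100_000), 4))  # 0.3
```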
```python
price_per_1m_input_tokens_by_model: dict[str, float] = Field(default_factory=lambda: {
    # OpenAI (last update 17 Sep 2025)
    "gpt-5": 1.25,
    "gpt-5-mini": 0.25,
    "gpt-5-nano": 0.05,
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "o1": 15.00,
    "o1-pro": 150.00,
    "o3": 2.00,
```

To monitor each LLM call in detail, configure your LangSmith environment variables:
```bash
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=<your_api_key>
LANGSMITH_PROJECT=<your_project_name>
```

The experiment_data/ folder contains the analysis datasets used in the paper, including cleaned Excel/CSV exports for model evaluations, human ratings, and preference tests.
It also includes raw generation artifacts under experiment_data/generated_answers/ (JSON/JSONL), plus small utilities used during analysis. See experiment_data/README.md for column-level descriptions of each dataset.
RAG/retrieval results are stored in the repo as Langfuse traces, located in experiment_data/generated_answers/.
Have questions, feedback, or suggestions? Feel free to reach out:
GitHub Issues: Open an issue for bug reports or feature requests.
We’d love to hear from you!

