This framework provides an evaluation system for LLMs with support for:
- Multiple LLM providers (OpenAI, Anthropic, Google, Ollama, Custom OpenAI-compatible deployments)
- Web-search-based answer generation
- Random model selection for judges
- Web-based dashboard for results visualization
- Token / cost estimations
- Python: 3.10 or higher
- Operating System: Linux, macOS, or WSL recommended
- API Keys: Provider API keys for the models
- Ollama: Local Ollama installation for running open-source models
- Web Search: Tavily API key for web search (optional)
- OpenAI Compliant Endpoint: Custom OpenAI-compatible endpoint (optional)
The benchmark is built on a custom dataset of Romanian tax and accounting questions.
Each entry contains both the question (intrebare) and the official or expert answer (raspuns), along with metadata for evaluation.
Schema:
- `id` – unique identifier
- `data` – date of entry
- `link` – reference or context (if available)
- `intrebare` – the user's question
- `raspuns` – expert answer
- `dificultate` – difficulty level (`easy`, `medium`, `hard`)
- `categorie` – topic area (e.g., impozit pe profit, contabilitate, etc.)
- `timp_raspuns_uman_min` – estimated time (minutes) for a human expert to answer
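When preparing your own dataset, a quick schema check can save a failed run. A minimal sketch assuming pandas and the column names above (the `validate_dataset` helper is illustrative, not part of the framework):

```python
import pandas as pd
from io import StringIO

# Columns the benchmark dataset is expected to contain (from the schema above)
EXPECTED_COLUMNS = {
    "id", "data", "link", "intrebare", "raspuns",
    "dificultate", "categorie", "timp_raspuns_uman_min",
}

def validate_dataset(df: pd.DataFrame) -> list[str]:
    """Return a list of schema problems (an empty list means the dataset looks OK)."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "dificultate" in df.columns:
        bad = set(df["dificultate"].dropna()) - {"easy", "medium", "hard"}
        if bad:
            problems.append(f"unknown difficulty values: {sorted(bad)}")
    return problems

# Tiny inline example with the expected header
csv = StringIO(
    "id,data,link,intrebare,raspuns,dificultate,categorie,timp_raspuns_uman_min\n"
    "199837,2025-06-05,,Scutire profit reinvestit,raspuns expert,easy,impozit pe profit,20.0\n"
)
print(validate_dataset(pd.read_csv(csv)))  # []
```

Running this before `evaluate` catches missing columns or typos in the difficulty values early.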
Example rows:
| id | data | intrebare | dificultate | categorie | timp_raspuns_uman_min |
|---|---|---|---|---|---|
| 199999 | 2025-08-05 | Cesiune creanțe pentru care s-a constituit anterior provizion – Impozit pe profit | hard | impozit pe profit | 60.0 |
| 199837 | 2025-06-05 | Scutire profit reinvestit – Impozit pe profit | easy | impozit pe profit | 20.0 |
| 199766 | 2025-04-05 | Completare D101 – Impozit pe profit | medium | impozit pe profit | 15.0 |
This dataset ensures that models are tested on realistic, domain-specific scenarios, combining:
- Legal references (Cod Fiscal, norme metodologice)
- Accounting practices (monografii contabile)
- Difficulty levels for nuanced benchmarking
Each answer is evaluated on three dimensions using binary (VALID/INVALID) judgments:
| Metric | What it evaluates |
|---|---|
| Correctness | Does the answer match the gold answer or provide a valid alternative? Evaluates completeness and reasoning. |
| Legal Citation | Are the cited legal articles correct and aligned with the gold answer? Checks for contradicting or missing citations. |
| Clarity & Structure | Is the answer well-structured, easy to understand, and accessible to non-experts? |
All metrics return VALID (1.0) or INVALID (0.0) with an explanation in Romanian.
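Since each verdict is binary, per-metric scores aggregate as the fraction of VALID judgments. A small illustrative sketch (the field names are hypothetical, not the framework's output schema):

```python
from statistics import mean

# Illustrative judgments: one dict per (answer, metric) verdict
judgments = [
    {"metric": "correctness", "verdict": "VALID"},
    {"metric": "correctness", "verdict": "INVALID"},
    {"metric": "legal_citation", "verdict": "VALID"},
    {"metric": "clarity", "verdict": "VALID"},
]

def metric_scores(items):
    """Map each metric to its fraction of VALID verdicts (VALID=1.0, INVALID=0.0)."""
    by_metric = {}
    for item in items:
        by_metric.setdefault(item["metric"], []).append(
            1.0 if item["verdict"] == "VALID" else 0.0
        )
    return {m: mean(vals) for m, vals in by_metric.items()}

print(metric_scores(judgments))  # {'correctness': 0.5, 'legal_citation': 1.0, 'clarity': 1.0}
```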
The --evaluators flag lets you specify which LLM models act as judges to score the generated answers.
```bash
# Single evaluator
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5

# Multiple evaluators (scores are averaged)
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5 claude-4-sonnet gemini-2.5-pro
```

How it works:
- Each evaluator judges every answer on the metrics above (VALID/INVALID)
- When using multiple evaluators, each answer is scored by a randomly selected evaluator from the list you provide
- Mix cloud and local models: `--evaluators gpt-5 llama3.3:latest qwen3:30b`
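The per-answer random assignment can be pictured with a small sketch (purely illustrative; this is not the framework's internal code):

```python
import random

evaluators = ["gpt-5", "llama3.3:latest", "qwen3:30b"]
answers = ["answer_1", "answer_2", "answer_3", "answer_4"]

rng = random.Random(42)  # seeded only so this sketch is reproducible
# Each answer gets one judge drawn at random from the evaluator list
assignment = {answer: rng.choice(evaluators) for answer in answers}
for answer, judge in assignment.items():
    print(f"{answer} -> judged by {judge}")
```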
Supported evaluator models:
- OpenAI: `gpt-5`, `gpt-5-mini`, `gpt-4o`, `o1`, `o3`, etc.
- Anthropic: `claude-4-sonnet`, `claude-3-5-sonnet`, etc.
- Google: `gemini-2.5-pro`, `gemini-2.5-flash`, etc.
- Ollama (local): `llama3.3:latest`, `qwen3:30b`, `mistral-large`, etc.
Tip: Using multiple diverse evaluators reduces bias and provides more robust scores. Some models are more reliable than others.
- Clone the repository and set up a virtual environment:

```bash
git clone https://github.com/Nexus-Media/RO-FIN-LLM-Benchmark
cd RO-FIN-LLM-Benchmark
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

- Make the CLI script executable:

```bash
chmod +x rollmbenchmark
```

- For Ollama users, install and set up Ollama:

```bash
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Start the Ollama service
ollama serve

# Pull a model (example with Llama 3.1)
ollama pull llama3.1:8b
```

Rename `.env_example` to `.env` and add your API keys:
```bash
# LLM Provider API Keys
OPENAI_API_KEY=placeholder
ANTHROPIC_API_KEY=placeholder
GOOGLE_API_KEY=placeholder

# Model name
MODEL_NAME=gpt-5-nano

# Optional: Ollama URL configuration
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Web search for RAG (only needed for --with-tools)
TAVILY_API_KEY=placeholder

# Optional: Custom endpoint
# VERY IMPORTANT: Must be OpenAI-compatible
# When set, the provider is forced to OpenAI-compatible and the key below is used
# If your endpoint requires an explicit model, set MODEL_NAME accordingly
CUSTOM_ENDPOINT=
CUSTOM_ENDPOINT_API_KEY=
```

Notes:
- When `CUSTOM_ENDPOINT` is set, the provider is treated as OpenAI-compatible and `CUSTOM_ENDPOINT_API_KEY` is used.
- If your custom endpoint does not require a model name, you can omit `MODEL_NAME` (the system will use an internal "auto" value).
Important
Enabling the search tool makes token consumption skyrocket (100k tokens per question for some models)
Make sure the model you are testing is capable of tool calling
Use with caution
The framework provides five main commands through the rollmbenchmark CLI:
Generates answers for questions and evaluates them in one step.
```bash
./rollmbenchmark evaluate [OPTIONS]
```

Options:
- `--model MODEL`: Model to use (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro)
- `--dataset PATH`: Path to the dataset CSV file (optional; uses the default dataset if not specified)
- `--with-tools`: Enable web search during answer generation (IMPORTANT: token usage will skyrocket; use with caution)
- `--custom-endpoint URL`: OpenAI-compatible base URL (uses `CUSTOM_ENDPOINT_API_KEY`)
- `--evaluators`: Models to use as judges (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro, llama3.3); at least one model is required
Examples:
```bash
# Basic evaluation with the default dataset
./rollmbenchmark evaluate --model gpt-5-mini

# Evaluation with tools enabled
./rollmbenchmark evaluate --model claude-3-5-sonnet --with-tools

# Custom dataset and evaluators
./rollmbenchmark evaluate --dataset /path/to/my-dataset.csv --model gemini-2.5-pro --evaluators gpt-5 llama3.3:latest

# Using Groq via custom endpoint (model required)
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark evaluate --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192

# Using your own server (no model required)
CUSTOM_ENDPOINT_API_KEY=my_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server
```

Output:
- Generates answers and saves them to `data/answers/`
- Evaluates the answers using multiple metrics
- Saves evaluation results to `data/evaluations/`
- Prints the run ID and results CSV path
Evaluates answers from a previously generated CSV file.
```bash
./rollmbenchmark evaluate-answers [OPTIONS]
```

Options:
- `--answers-csv PATH`: Path to the answers CSV file (optional; uses the latest generated answers CSV in `data/answers/` if not specified)
- `--evaluators`: List of models from which an evaluator is randomly picked for each answer

Examples:

```bash
# Evaluate the latest answers file
./rollmbenchmark evaluate-answers

# Evaluate a specific answers file
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv

# Evaluate a specific answers file with specific evaluators
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv --evaluators gpt-5
```

Use Case: When you have generated answers separately and want to evaluate them.
Generates answers without evaluation.
```bash
./rollmbenchmark generate-answers [OPTIONS]
```

Options:
- `--model MODEL`: Model to use
- `--dataset PATH`: Path to the dataset CSV file (optional)
- `--with-tools`: Enable web search during generation (token usage will skyrocket; use with caution)
- `--custom-endpoint URL`: OpenAI-compatible base URL (uses `CUSTOM_ENDPOINT_API_KEY`)

Examples:

```bash
# Generate answers only, with tools
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools

# Generate with Groq
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark generate-answers --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192
```

Use Case: When you want to generate answers in bulk and evaluate them later, or compare different models' answers.
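For the comparison use case, answers from several models pivot nicely into a side-by-side view. A sketch assuming the combined-dataset columns `question_id`, `model_name`, and `raspuns_generat`:

```python
import pandas as pd

# Rows in the shape used by data/answers/combined/combined_dataset.csv
rows = [
    {"question_id": 1, "model_name": "gpt-5-mini", "raspuns_generat": "Answer A"},
    {"question_id": 1, "model_name": "llama3.1:8b", "raspuns_generat": "Answer B"},
    {"question_id": 2, "model_name": "gpt-5-mini", "raspuns_generat": "Answer C"},
]
df = pd.DataFrame(rows)

# One row per question, one column per model
side_by_side = df.pivot(index="question_id", columns="model_name", values="raspuns_generat")
print(side_by_side)
```

Questions a model did not answer show up as `NaN` in its column, which makes coverage gaps easy to spot.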
Launches a Streamlit-based web dashboard for visualizing results and conducting human evaluations.
```bash
./rollmbenchmark dashboard
```

The dashboard provides three main pages:
- LLM Evaluation Dashboard (default) - Visualize and compare evaluation results
- Human Review - Rate generated answers using a rubric-based scoring system
- Human A/B Review - Compare two model answers side-by-side and pick the best one
```
data/
├── users.txt                       # Reviewer IDs (one per line, auto-created)
├── answers/
│   ├── generated_answers_*.csv    # Generated answers (from evaluate/generate-answers)
│   └── combined/
│       └── combined_dataset.csv   # Required for human evaluation pages
├── evaluations/
│   ├── automated/                 # LLM evaluation results (from evaluate)
│   │   └── *.csv
│   └── human/
│       ├── rubric_evaluations.csv # Human review scores (auto-generated)
│       └── model_comparison.csv   # A/B comparison results (auto-generated)
```
Important
The LLM Evaluation Dashboard works out of the box after running evaluations.
The Human Review pages require additional setup (see below).
For LLM Evaluation Dashboard:
Run at least one evaluation:

```bash
./rollmbenchmark evaluate --model gpt-5-mini
```

For Human Review pages:

1. Reviewer IDs – A default `data/users.txt` is auto-created with example reviewers. Edit this file to add your team's reviewer IDs (one per line).

2. Create a combined dataset with answers from multiple models:

```bash
mkdir -p data/answers/combined
# Combine your generated answers CSVs into one file
# The file must have columns: question_id, intrebare, raspuns_generat, model_name, raspuns (optional)
```

Example: merge multiple answer files:

```python
import pandas as pd

files = [
    "data/answers/generated_answers_model1.csv",
    "data/answers/generated_answers_model2.csv",
]
combined = pd.concat([pd.read_csv(f) for f in files])
combined.to_csv("data/answers/combined/combined_dataset.csv", index=False)
```
| Page | Purpose | Data source |
|---|---|---|
| LLM Evaluation Dashboard | Automated + human evaluation insights | data/evaluations/automated/*.csv, data/evaluations/human/*.csv |
| Human Review | Score answers on correctness, legal citations, clarity | data/answers/combined/combined_dataset.csv |
| Human A/B Review | Side-by-side comparison of two models | data/answers/combined/combined_dataset.csv |
Generate many answers efficiently using the OpenAI Batch API, then download results and convert them to the standard answers CSV.
Note
Requires a valid OPENAI_API_KEY and access to OpenAI Batch. Custom endpoints are not supported for Batch.
Commands:
- `generate-answers-batch` – create a Batch job from a dataset (no evaluation)
- `batch-status` – check the status of a Batch job
- `batch-download` – download results and produce an answers CSV
Files produced:
- Requests and metadata in `data/batches/openai_<timestamp>/`
- Output JSONL files saved on download
- Final answers CSV saved to `data/answers/`
```bash
# Create a batch job (uses the Responses API by default)
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5-mini

# Create a batch job, wait for completion, then auto-save the answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5 \
  --wait

# Check job status later
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID

# If needed, point to a specific batch_meta.json to improve file placement
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/openai_<timestamp>/batch_meta.json
```

Options (generate-answers-batch):
- `--dataset PATH` – absolute path recommended (CSV)
- `--model MODEL` – model name; the backend is chosen automatically by model:
  - OpenAI (e.g., `gpt-5-mini`, `gpt-4o`): uses the OpenAI Batch API
  - Anthropic (e.g., `claude-3.7-sonnet`): uses Anthropic Message Batches
  - Google Gemini (e.g., `gemini-3-pro`, `gemini-2.5-flash`): uses the Gemini Batch API
- `--wait` – wait for job completion and auto-save the answers CSV
- `--custom-endpoint` – not supported for Batch
Limitations and tips:
- Only OpenAI's official API is supported for Batch (no custom base URLs).
- Very large datasets may take time to process; use `batch-status` to monitor.
- Results are stored under `data/batches/` and converted to CSV under `data/answers/`.
When you pass a Gemini model to generate-answers-batch, it uses the Gemini Batch API under the hood.
```bash
# Create a Gemini batch and wait for completion
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gemini-2.5-pro \
  --wait

# Check job status
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID
```

Files produced (Gemini):
- `data/batches/gemini_<timestamp>/gemini_batch_requests_<timestamp>.json`
- `data/batches/gemini_<timestamp>/batch_meta.json`
- `data/batches/gemini_<timestamp>/gemini_batch_output_<timestamp>.jsonl`
- `data/answers/generated_answers_batch_<timestamp>.csv`
When you pass an Anthropic model to the same generate-answers-batch command, it automatically uses Anthropic Message Batches under the hood.
```bash
# Create an Anthropic batch, wait for completion, then auto-save the answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model claude-3.7-sonnet \
  --wait

# Check job status later (works for both providers)
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce an answers CSV (works for both providers)
./rollmbenchmark batch-download --batch-id BATCH_ID

# Optional: pointing to the meta file helps with file placement and model detection
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/anthropic_<timestamp>/batch_meta.json
```

Files produced (Anthropic):
- `data/batches/anthropic_<timestamp>/anthropic_batch_requests_<timestamp>.json` (requests payload)
- `data/batches/anthropic_<timestamp>/batch_meta.json` (submission metadata)
- `data/batches/anthropic_<timestamp>/anthropic_batch_output_<timestamp>.jsonl` (raw results)
- `data/answers/generated_answers_batch_<timestamp>.csv` (final answers CSV)
Other behavior matches OpenAI Batch notes above (ID normalization, token usage normalization, cost calculation).
Important
To ensure the framework detects the model, give it a generic name like the ones below.
The framework automatically detects Ollama models based on their names. Example patterns include:
- Llama models: `llama3.1:8b`, `llama3.1:70b`, `llama2:7b`
- Mistral models: `mistral:7b`, `mistral-nemo:12b`
- Qwen models: `qwen2.5:7b`, `qwen2.5:14b`
- Phi models: `phi3:3.8b`, `phi3:14b`
- Gemma models: `gemma2:2b`, `gemma2:9b`
- CodeLlama models: `codellama:7b`, `codellama:13b`
- DeepSeek models: `deepseek-coder:6.7b`
- Dolphin models: `dolphin-2.6-mistral:7b`
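A name-based heuristic for this detection can be sketched as follows (an illustrative guess at the logic, not the actual implementation in the repo):

```python
# Family prefixes that look like local Ollama models (illustrative list)
OLLAMA_PREFIXES = (
    "llama", "mistral", "qwen", "phi",
    "gemma", "codellama", "deepseek", "dolphin",
)

def looks_like_ollama_model(name: str) -> bool:
    """Heuristic: a generic family prefix, usually followed by a ':tag' suffix."""
    return name.lower().startswith(OLLAMA_PREFIXES)

print(looks_like_ollama_model("llama3.1:8b"))  # True
print(looks_like_ollama_model("gpt-5-mini"))   # False
```

This is why generic family names matter: a custom tag like `my-finetune:latest` would not match any known prefix.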
1. Start the Ollama service:

```bash
ollama serve
```

2. Pull a model (if not already available):

```bash
ollama pull llama3.1:8b
```

3. Run an evaluation with an Ollama model:

```bash
# Basic evaluation
./rollmbenchmark evaluate --model llama3.1:8b

# With web search tools
./rollmbenchmark evaluate --model mistral:7b --with-tools
```
The framework automatically connects to Ollama at http://localhost:11434. To use a different Ollama instance:
1. Set a custom Ollama URL in `.env`:

```bash
OLLAMA_BASE_URL=http://your-ollama-server:11434
```

2. Verify the connection: the framework will automatically validate the Ollama connection when using Ollama models.
Connection Issues:

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Restart the Ollama service
ollama serve
```

Model Not Found:

```bash
# List available models
ollama list

# Pull the missing model
ollama pull <model-name>
```

```bash
# 1. Generate answers and evaluate in one step
./rollmbenchmark evaluate --model gpt-5-mini --with-tools

# 2. View results in the dashboard
./rollmbenchmark dashboard
```

```bash
# 1. Generate answers with different models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model claude-3-5-sonnet --with-tools
./rollmbenchmark generate-answers --model gemini-2.5-pro --with-tools

# 2. Evaluate each set of answers
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp1>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp2>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp3>.csv

# 3. Compare results in the dashboard
./rollmbenchmark dashboard
```

```bash
# 1. Start the Ollama service
ollama serve

# 2. Pull models (if needed)
ollama pull llama3.1:8b
ollama pull mistral:7b

# 3. Evaluate with local models
./rollmbenchmark evaluate --model llama3.1:8b
./rollmbenchmark evaluate --model mistral:7b --with-tools

# 4. Compare local vs cloud models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model llama3.1:8b --with-tools

# 5. Evaluate and compare
./rollmbenchmark evaluate-answers <answers_csv_path_for_gpt-5-mini> --evaluators gpt-5 qwen3:30b
./rollmbenchmark evaluate-answers <answers_csv_path_for_llama3.1:8b> --evaluators gemini-2.5-pro
./rollmbenchmark dashboard
```

You can point the system at any OpenAI-compatible API using a custom base URL. Two common scenarios:
- Example values (adjust to your account):

```bash
CUSTOM_ENDPOINT=https://api.groq.com/openai/v1
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here
MODEL_NAME=llama3-70b-8192
```

- Run with defaults from `.env`:

```bash
./rollmbenchmark evaluate
```

- Or override via the CLI (recommended):

```bash
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here \
./rollmbenchmark evaluate \
  --custom-endpoint https://api.groq.com/openai/v1 \
  --model llama3-70b-8192
```

- Minimal `.env`:

```bash
CUSTOM_ENDPOINT=https://my-openai-compatible-server
CUSTOM_ENDPOINT_API_KEY=my_server_token
# MODEL_NAME omitted on purpose
```

- Run:

```bash
./rollmbenchmark evaluate
```

- Or with the CLI only (no `.env`):

```bash
CUSTOM_ENDPOINT_API_KEY=my_server_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server --evaluators gpt-5
```

Behavior summary when using a custom endpoint:
- Provider is forced to OpenAI-compatible.
- If `MODEL_NAME` is not set, the system uses an internal "auto" model sentinel.
The pricing is currently based on hardcoded values (per 1 million tokens).
To update or add prices for models:
- Open the file:
src/infrastructure/configuration.py - Locate the following lines:
- Line 70 → Input token pricing
- Line 94 → Output token pricing
- Modify the existing values or add new models with their corresponding prices.
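With those per-1M prices, the cost of a run is a straightforward computation. A sketch using the input prices from the pricing table (the output prices here are assumed for illustration; check the output dictionary in `configuration.py` for real values):

```python
# USD per 1M tokens. Input prices match the pricing table; output prices are
# illustrative assumptions for this sketch.
PRICE_PER_1M_INPUT = {"gpt-5": 1.25, "gpt-5-mini": 0.25}
PRICE_PER_1M_OUTPUT = {"gpt-5": 10.00, "gpt-5-mini": 2.00}  # assumed values

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one run, given per-1M-token prices for the model."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_1M_INPUT[model]
        + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT[model]
    )

# 400k input + 100k output tokens on gpt-5-mini
print(round(run_cost("gpt-5-mini", 400_000, 100_000), 4))  # 0.3
```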
```python
price_per_1m_input_tokens_by_model: dict[str, float] = Field(default_factory=lambda: {
    # OpenAI (last update 17 Sep 2025)
    "gpt-5": 1.25,
    "gpt-5-mini": 0.25,
    "gpt-5-nano": 0.05,
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "o1": 15.00,
    "o1-pro": 150.00,
    "o3": 2.00,
```

To monitor each LLM call in detail, configure your LangSmith environment variables:
```bash
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=<your_api_key>
LANGSMITH_PROJECT=<your_project_name>
```

The experiment_data/ folder contains the analysis datasets used in the paper, including cleaned Excel/CSV exports for model evaluations, human ratings, and preference tests.
It also includes raw generation artifacts under experiment_data/generated_answers/ (JSON/JSONL), plus small utilities used during analysis. See experiment_data/README.md for column-level descriptions of each dataset.
RAG/retrieval results are stored in the repo as Langfuse traces, located in experiment_data/generated_answers/.
Have questions, feedback, or suggestions? Feel free to reach out:
GitHub Issues: Open an issue for bug reports or feature requests.
We’d love to hear from you!

