Nexus-Media/RO-FIN-LLM-Benchmark

RO-FIN-LLM: A Benchmark for Romanian Tax, Accounting, and ERP-Centric Decision Support


Overview

This framework provides an evaluation system for LLMs with support for:

  • Multiple LLM providers (OpenAI, Anthropic, Google, Ollama, Custom OpenAI-compatible deployments)
  • Web-search-based answer generation
  • Random model selection for judges
  • Web-based dashboard for results visualization
  • Token / cost estimations

Requirements

  • Python: 3.10 or higher
  • Operating System: Linux, macOS, or WSL recommended
  • API Keys: Provider API keys for the models
  • Ollama: Local Ollama installation for running open-source models
  • Web Search: Tavily API key for web search (optional)
  • OpenAI Compliant Endpoint: Custom OpenAI-compatible endpoint (optional)

Dataset

The benchmark is built on a custom dataset of Romanian tax and accounting questions.
Each entry contains both the question (intrebare) and the official or expert answer (raspuns), along with metadata for evaluation.

Schema:

  • id – unique identifier
  • data – date of entry
  • link – reference or context (if available)
  • intrebare – the user’s question
  • raspuns – expert answer
  • dificultate – difficulty level (easy, medium, hard)
  • categorie – topic area (e.g., impozit pe profit, contabilitate, etc.)
  • timp_raspuns_uman_min – estimated time (minutes) for a human expert to answer
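A minimal sketch of loading and sanity-checking rows against this schema (the row values below are illustrative, not taken from the real dataset):

```python
import csv
import io

# Hypothetical rows following the schema above (values illustrative)
csv_text = """id,data,link,intrebare,raspuns,dificultate,categorie,timp_raspuns_uman_min
199837,2025-06-05,,Scutire profit reinvestit,Raspuns expert...,easy,impozit pe profit,20.0
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Basic sanity checks against the schema
required = {"id", "intrebare", "raspuns", "dificultate", "categorie"}
missing = required - set(rows[0].keys())
assert not missing, f"dataset is missing columns: {missing}"
assert all(r["dificultate"] in {"easy", "medium", "hard"} for r in rows)
```

The same checks apply when you pass your own CSV via --dataset: the column names must match the schema exactly.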

Example rows:

id data intrebare dificultate categorie timp_raspuns_uman_min
199999 2025-08-05 Cesiune creanțe pentru care s-a constituit anterior provizion – Impozit pe profit hard impozit pe profit 60.0
199837 2025-06-05 Scutire profit reinvestit – Impozit pe profit easy impozit pe profit 20.0
199766 2025-04-05 Completare D101 – Impozit pe profit medium impozit pe profit 15.0

This dataset ensures that models are tested on realistic, domain-specific scenarios, combining:

  • Legal references (Cod Fiscal, norme metodologice)
  • Accounting practices (monografii contabile)
  • Difficulty levels for nuanced benchmarking

Metrics

Each answer is evaluated on three dimensions using binary (VALID/INVALID) judgments:

Metric What it evaluates
Correctness Does the answer match the gold answer or provide a valid alternative? Evaluates completeness and reasoning.
Legal Citation Are the cited legal articles correct and aligned with the gold answer? Checks for contradicting or missing citations.
Clarity & Structure Is the answer well-structured, easy to understand, and accessible to non-experts?

All metrics return VALID (1.0) or INVALID (0.0) with an explanation in Romanian.
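How the binary judgments roll up into per-metric scores might look like the sketch below (the judgment data is illustrative; metric key names are assumptions, not the framework's actual column names):

```python
# Each judgment is VALID (1.0) or INVALID (0.0); per-metric scores are
# simple means over all judged answers. Data below is illustrative.
judgments = [
    {"correctness": 1.0, "legal_citation": 0.0, "clarity": 1.0},
    {"correctness": 1.0, "legal_citation": 1.0, "clarity": 1.0},
]

metrics = judgments[0].keys()
scores = {m: sum(j[m] for j in judgments) / len(judgments) for m in metrics}
# e.g., scores["correctness"] == 1.0, scores["legal_citation"] == 0.5
```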

Evaluators (Judges)

The --evaluators flag lets you specify which LLM models act as judges to score the generated answers.

# Single evaluator
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5

# Multiple evaluators (scores are averaged)
./rollmbenchmark evaluate --model gpt-5-mini --evaluators gpt-5 claude-4-sonnet gemini-2.5-pro

How it works:

  • Each evaluator judges every answer on the metrics above (VALID/INVALID)
  • When using multiple evaluators, each answer is scored by a single randomly selected evaluator from the list you provide
  • Mix cloud and local models: --evaluators gpt-5 llama3.3:latest qwen3:30b

Supported evaluator models:

  • OpenAI: gpt-5, gpt-5-mini, gpt-4o, o1, o3 etc.
  • Anthropic: claude-4-sonnet, claude-3-5-sonnet etc.
  • Google: gemini-2.5-pro, gemini-2.5-flash etc.
  • Ollama (local): llama3.3:latest, qwen3:30b, mistral-large, etc.

Tip: Using multiple diverse evaluators reduces bias and provides more robust scores. Some models are more reliable than others.
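The per-answer random judge selection described above can be sketched as follows (`pick_evaluator` is a hypothetical helper, not the framework's actual function; the optional seed is only for reproducible tests):

```python
import random

def pick_evaluator(evaluators, seed=None):
    """Pick one judge at random from the pool, once per answer."""
    rng = random.Random(seed)
    return rng.choice(evaluators)

pool = ["gpt-5", "claude-4-sonnet", "gemini-2.5-pro"]
judge = pick_evaluator(pool)
assert judge in pool
```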

Installation

  1. Clone the repository and set up a virtual environment:
git clone https://github.com/Nexus-Media/RO-FIN-LLM-Benchmark
cd RO-FIN-LLM-Benchmark
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
  2. Make the CLI script executable:
chmod +x rollmbenchmark
  3. For Ollama users: Install and set up Ollama:
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model (example with Llama 3.1)
ollama pull llama3.1:8b

Configuration

Rename the .env_example to .env and add your API keys:

# LLM Provider API Keys
OPENAI_API_KEY=placeholder
ANTHROPIC_API_KEY=placeholder
GOOGLE_API_KEY=placeholder

# Model name
MODEL_NAME=gpt-5-nano

# Optional: ollama url configuration
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Web search for RAG (only needed for --with-tools)
TAVILY_API_KEY=placeholder

# Optional: Custom endpoint
# VERY IMPORTANT: Must be OpenAI-compatible
# When set, provider is forced to OpenAI-compatible, and the key below is used
# If your endpoint requires an explicit model, set MODEL_NAME accordingly
CUSTOM_ENDPOINT=
CUSTOM_ENDPOINT_API_KEY=

Notes:

  • When CUSTOM_ENDPOINT is set, the provider is treated as OpenAI-compatible and CUSTOM_ENDPOINT_API_KEY is used.
  • If your custom endpoint does not require a model name, you can omit MODEL_NAME (the system will use an internal "auto" value).
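The precedence described in these notes can be sketched as a small resolver (this is an illustration of the documented behavior, not the actual implementation; the dict keys are assumptions):

```python
def resolve_provider(env):
    """Sketch of the documented precedence: a custom endpoint forces the
    OpenAI-compatible provider and uses CUSTOM_ENDPOINT_API_KEY; a missing
    MODEL_NAME falls back to the internal "auto" sentinel."""
    if env.get("CUSTOM_ENDPOINT"):
        return {
            "provider": "openai-compatible",
            "base_url": env["CUSTOM_ENDPOINT"],
            "api_key": env.get("CUSTOM_ENDPOINT_API_KEY", ""),
            "model": env.get("MODEL_NAME") or "auto",
        }
    return {"provider": "default", "model": env.get("MODEL_NAME")}

cfg = resolve_provider({"CUSTOM_ENDPOINT": "https://my-openai-compatible-server"})
assert cfg["provider"] == "openai-compatible" and cfg["model"] == "auto"
```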

Examples and Tutorials

Commands

Important

Enabling the search tool makes token consumption skyrocket (up to ~100k tokens per question for some models)

Make sure the model you are testing supports tool calling

Use with caution

The framework provides five main commands through the rollmbenchmark CLI:

1. evaluate - Generate and Evaluate

Generates answers for questions and evaluates them in one step.

./rollmbenchmark evaluate [OPTIONS]

Options:

  • --model MODEL: Model to use (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro)
  • --dataset PATH: Path to dataset CSV file (optional, uses default if not specified)
  • --with-tools: Enable web search during answer generation (IMPORTANT: token usage will skyrocket; use with caution)
  • --custom-endpoint URL: OpenAI-compatible base URL (uses CUSTOM_ENDPOINT_API_KEY)
  • --evaluators: Models to use as judges (e.g., gpt-5-mini, claude-3-5-sonnet, gemini-2.5-pro, llama3.3); at least one model required

Examples:

# Basic evaluation with default dataset
./rollmbenchmark evaluate --model gpt-5-mini

# Evaluation with tools enabled
./rollmbenchmark evaluate --model claude-3-5-sonnet --with-tools

# Custom dataset and evaluators
./rollmbenchmark evaluate --dataset /path/to/my-dataset.csv --model gemini-2.5-pro --evaluators gpt-5 llama3.3:latest

# Using Groq via custom endpoint (model required)
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark evaluate --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192

# Using your own server (no model required)
CUSTOM_ENDPOINT_API_KEY=my_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server

Output:

  • Generates answers and saves them to data/answers/
  • Evaluates the answers using multiple metrics
  • Saves evaluation results to data/evaluations/
  • Prints run ID and results CSV path

2. evaluate-answers - Evaluate Existing Answers

Evaluates answers from a previously generated CSV file.

./rollmbenchmark evaluate-answers [OPTIONS]

Options:

  • --answers-csv PATH: Path to answers CSV file (optional, uses latest generated answers csv in the data/answers if not specified)
  • --evaluators MODELS: Pool of judge models from which one is randomly selected for each answer

Examples:
# Evaluate the latest answers file
./rollmbenchmark evaluate-answers

# Evaluate specific answers file
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv

# Evaluate a specific answers file with specific evaluators
./rollmbenchmark evaluate-answers --answers-csv /path/to/answers.csv --evaluators gpt-5

Use Case: When you have generated answers separately and want to evaluate them.
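When --answers-csv is omitted, the latest file in data/answers/ is used. A sketch of that fallback (`latest_answers_csv` is a hypothetical helper mirroring the documented behavior, not the framework's actual code):

```python
from pathlib import Path

def latest_answers_csv(answers_dir="data/answers"):
    """Return the most recently modified generated-answers CSV, mirroring
    the fallback evaluate-answers uses when --answers-csv is omitted."""
    files = sorted(
        Path(answers_dir).glob("generated_answers_*.csv"),
        key=lambda p: p.stat().st_mtime,
    )
    return files[-1] if files else None
```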

3. generate-answers - Generate Only

Generates answers without evaluation.

./rollmbenchmark generate-answers [OPTIONS]

Options:

  • --model MODEL: Model to use
  • --dataset PATH: Path to dataset CSV file (optional)
  • --with-tools: Enable web search during generation (token usage will skyrocket; use with caution)
  • --custom-endpoint URL: OpenAI-compatible base URL (uses CUSTOM_ENDPOINT_API_KEY)

Examples:

# Generate answers only with tools
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools

# Generate with Groq
CUSTOM_ENDPOINT_API_KEY=grq_your_key \
./rollmbenchmark generate-answers --custom-endpoint https://api.groq.com/openai/v1 --model llama3-70b-8192

Use Case: When you want to generate answers in bulk and evaluate them later, or compare different models' answers.

4. dashboard - Launch Web Interface

Launches a Streamlit-based web dashboard for visualizing results and conducting human evaluations.

./rollmbenchmark dashboard

What it does

The dashboard provides three main pages:

  1. LLM Evaluation Dashboard (default) - Visualize and compare evaluation results
  2. Human Review - Rate generated answers using a rubric-based scoring system
  3. Human A/B Review - Compare two model answers side-by-side and pick the best one

(dashboard screenshot)

Data Directory Structure

data/
├── users.txt                        # Reviewer IDs (one per line, auto-created)
├── answers/
│   ├── generated_answers_*.csv      # Generated answers (from evaluate/generate-answers)
│   └── combined/
│       └── combined_dataset.csv     # Required for human evaluation pages
├── evaluations/
│   ├── automated/                   # LLM evaluation results (from evaluate)
│   │   └── *.csv
│   └── human/
│       ├── rubric_evaluations.csv   # Human review scores (auto-generated)
│       └── model_comparison.csv     # A/B comparison results (auto-generated)

Setup

Important

The LLM Evaluation Dashboard works out of the box after running evaluations.
The Human Review pages require additional setup (see below).

For LLM Evaluation Dashboard:

Run at least one evaluation:

./rollmbenchmark evaluate --model gpt-5-mini

For Human Review pages:

  1. Reviewer IDs - A default data/users.txt is auto-created with example reviewers. Edit this file to add your team's reviewer IDs (one per line).

  2. Create a combined dataset with answers from multiple models:

    mkdir -p data/answers/combined
    # Combine your generated answers CSVs into one file
    # The file must have columns: question_id, intrebare, raspuns_generat, model_name, raspuns (optional)

    Example: merge multiple answer files:

    import pandas as pd

    # Each input file should already carry a model_name column so answers
    # from different models stay distinguishable after the merge
    files = [
        "data/answers/generated_answers_model1.csv",
        "data/answers/generated_answers_model2.csv",
    ]
    combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    combined.to_csv("data/answers/combined/combined_dataset.csv", index=False)

(human review screenshot)

Dashboard Pages

Page Purpose Data source
LLM Evaluation Dashboard Automated + human evaluation insights data/evaluations/automated/*.csv, data/evaluations/human/*.csv
Human Review Score answers on correctness, legal citations, clarity data/answers/combined/combined_dataset.csv
Human A/B Review Side-by-side comparison of two models data/answers/combined/combined_dataset.csv

5. OpenAI Batch API - Generate at Scale

Generate many answers efficiently using the OpenAI Batch API, then download results and convert them to the standard answers CSV.

Note

Requires a valid OPENAI_API_KEY and access to OpenAI Batch. Custom endpoints are not supported for Batch.

Commands:

  • generate-answers-batch – create a Batch job from a dataset (no evaluation)
  • batch-status – check the status of a Batch job
  • batch-download – download results and produce an answers CSV

Files produced:

  • Requests and metadata in data/batches/openai_<timestamp>/
  • Output JSONL files saved on download
  • Final answers CSV saved to data/answers/
# Create a batch job (uses Responses API by default)
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5-mini

# Create a batch job and wait for completion, then auto-save answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gpt-5 \
  --wait

# Check job status later
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID

# If needed, point to a specific batch_meta.json to improve file placement
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/openai_<timestamp>/batch_meta.json

Options (generate-answers-batch):

  • --dataset PATH – absolute path recommended (CSV)
  • --model MODEL – Model name. Behavior is automatic by model:
    • OpenAI (e.g., gpt-5-mini, gpt-4o): uses OpenAI Batch API
    • Anthropic (e.g., claude-3.7-sonnet): uses Anthropic Message Batches
    • Google Gemini (e.g., gemini-3-pro, gemini-2.5-flash): uses Gemini Batch API
  • --wait – wait for job completion and auto-save answers CSV
  • --custom-endpoint – not supported for Batch

Limitations and tips:

  • Only OpenAI’s official API is supported for Batch (no custom base URLs).
  • Very large datasets may take time to process; use batch-status to monitor.
  • Results are stored under data/batches/ and converted to CSV under data/answers/.
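Monitoring with batch-status can be automated with a simple polling loop; the sketch below is a generic helper (`wait_for_batch` and the status strings are assumptions modeled on typical batch APIs, not the framework's own code), where `fetch_status` would wrap whatever check you use:

```python
import time

def wait_for_batch(fetch_status, poll_seconds=30, timeout_seconds=3600):
    """Poll a batch job until it reaches a terminal state.
    `fetch_status` is any callable returning the job's current status
    string; terminal-state names here are illustrative."""
    terminal = {"completed", "failed", "expired", "cancelled"}
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("batch did not finish in time")
```

For short-lived jobs the built-in --wait flag does this for you; a loop like this is only useful for custom orchestration.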

6. Gemini Batch API

When you pass a Gemini model to generate-answers-batch, it uses the Gemini Batch API under the hood.

# Create a Gemini batch and wait for completion
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model gemini-2.5-pro \
  --wait

# Check job status
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce answers CSV
./rollmbenchmark batch-download --batch-id BATCH_ID

Files produced (Gemini):

  • data/batches/gemini_<timestamp>/gemini_batch_requests_<timestamp>.json
  • data/batches/gemini_<timestamp>/batch_meta.json
  • data/batches/gemini_<timestamp>/gemini_batch_output_<timestamp>.jsonl
  • data/answers/generated_answers_batch_<timestamp>.csv

7. Anthropic Message Batches

When you pass an Anthropic model to the same generate-answers-batch command, it automatically uses Anthropic Message Batches under the hood.

# Create an Anthropic batch and wait for completion, then auto-save answers CSV
./rollmbenchmark generate-answers-batch \
  --dataset /absolute/path/to/dataset.csv \
  --model claude-3.7-sonnet \
  --wait

# Check job status later (works for both providers)
./rollmbenchmark batch-status --batch-id BATCH_ID

# Download results and produce answers CSV (works for both providers)
./rollmbenchmark batch-download --batch-id BATCH_ID

# Optional: pointing to meta helps with file placement and model detection
./rollmbenchmark batch-download --batch-id BATCH_ID --meta-path data/batches/anthropic_<timestamp>/batch_meta.json

Files produced (Anthropic):

  • data/batches/anthropic_<timestamp>/anthropic_batch_requests_<timestamp>.json (requests payload)
  • data/batches/anthropic_<timestamp>/batch_meta.json (submission metadata)
  • data/batches/anthropic_<timestamp>/anthropic_batch_output_<timestamp>.jsonl (raw results)
  • data/answers/generated_answers_batch_<timestamp>.csv (final answers CSV)

Other behavior matches OpenAI Batch notes above (ID normalization, token usage normalization, cost calculation).

Ollama Integration

Supported Ollama Models

Important

To ensure the framework recognizes a model as an Ollama model, give it a generic name like the ones below

The framework automatically detects Ollama models based on their names. Example patterns include:

  • Llama models: llama3.1:8b, llama3.1:70b, llama2:7b
  • Mistral models: mistral:7b, mistral-nemo:12b
  • Qwen models: qwen2.5:7b, qwen2.5:14b
  • Phi models: phi3:3.8b, phi3:14b
  • Gemma models: gemma2:2b, gemma2:9b
  • CodeLlama models: codellama:7b, codellama:13b
  • DeepSeek models: deepseek-coder:6.7b
  • Dolphin models: dolphin-2.6-mistral:7b
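The name-based detection behind these patterns can be sketched as a prefix check (a heuristic illustration of the documented behavior, not the framework's actual detection code):

```python
# Model-family prefixes taken from the supported patterns above
OLLAMA_PREFIXES = (
    "llama", "mistral", "qwen", "phi", "gemma",
    "codellama", "deepseek", "dolphin",
)

def looks_like_ollama_model(name):
    """Heuristic sketch: generic family prefixes (optionally with a
    :tag suffix, e.g. llama3.1:8b) map to the Ollama provider."""
    return name.lower().startswith(OLLAMA_PREFIXES)

assert looks_like_ollama_model("llama3.1:8b")
assert not looks_like_ollama_model("gpt-5-mini")
```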

Using Ollama Models

  1. Start Ollama service:

    ollama serve
  2. Pull a model (if not already available):

    ollama pull llama3.1:8b
  3. Run evaluation with Ollama model:

    # Basic evaluation
    ./rollmbenchmark evaluate --model llama3.1:8b
    
    # With web search tools
    ./rollmbenchmark evaluate --model mistral:7b --with-tools

Ollama Configuration

The framework automatically connects to Ollama at http://localhost:11434. To use a different Ollama instance:

  1. Set custom Ollama URL in .env:

    OLLAMA_BASE_URL=http://your-ollama-server:11434
  2. Verify connection: The framework will automatically validate the Ollama connection when using Ollama models.
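The connection check can be reproduced in a few lines against Ollama's /api/tags endpoint (`ollama_reachable` is a hypothetical helper illustrating the validation, not the framework's internal check):

```python
import json
import urllib.request

def ollama_reachable(base_url="http://localhost:11434"):
    """Return the list of locally pulled model names, or None if the
    Ollama server at base_url cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            tags = json.load(resp)
        return [m["name"] for m in tags.get("models", [])]
    except OSError:
        return None
```

If this returns None, start the server with `ollama serve`; if the list is empty, pull a model first.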

Troubleshooting Ollama

Connection Issues:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama service
ollama serve

Model Not Found:

# List available models
ollama list

# Pull missing model
ollama pull <model-name>

Workflow Examples

Complete Evaluation Workflow

# 1. Generate answers and evaluate in one step
./rollmbenchmark evaluate --model gpt-5-mini --with-tools

# 2. View results in dashboard
./rollmbenchmark dashboard

Comparative Analysis Workflow

# 1. Generate answers with different models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model claude-3-5-sonnet --with-tools
./rollmbenchmark generate-answers --model gemini-2.5-pro --with-tools

# 2. Evaluate each set of answers
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp1>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp2>.csv
./rollmbenchmark evaluate-answers --answers-csv data/answers/generated_answers_<timestamp3>.csv

# 3. Compare results in dashboard
./rollmbenchmark dashboard

Ollama Local Evaluation Workflow

# 1. Start Ollama service
ollama serve

# 2. Pull models (if needed)
ollama pull llama3.1:8b
ollama pull mistral:7b

# 3. Evaluate with local models
./rollmbenchmark evaluate --model llama3.1:8b
./rollmbenchmark evaluate --model mistral:7b --with-tools

# 4. Compare local vs cloud models
./rollmbenchmark generate-answers --model gpt-5-mini --with-tools
./rollmbenchmark generate-answers --model llama3.1:8b --with-tools

# 5. Evaluate and compare
./rollmbenchmark evaluate-answers --answers-csv <answers_csv_path_for_gpt-5-mini> --evaluators gpt-5 qwen3:30b
./rollmbenchmark evaluate-answers --answers-csv <answers_csv_path_for_llama3.1:8b> --evaluators gemini-2.5-pro
./rollmbenchmark dashboard

Custom OpenAI-Compatible Endpoints (Groq example)

You can point the system at any OpenAI-compatible API using a custom base URL. Two common scenarios:

A) Third-party provider (e.g., Groq) that requires a model name

  • Example values (adjust to your account):
CUSTOM_ENDPOINT=https://api.groq.com/openai/v1
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here
MODEL_NAME=llama3-70b-8192
  • Run with defaults from .env:
./rollmbenchmark evaluate
  • Or override via CLI (recommended):
CUSTOM_ENDPOINT_API_KEY=grq_your_key_here \
./rollmbenchmark evaluate \
  --custom-endpoint https://api.groq.com/openai/v1 \
  --model llama3-70b-8192

B) Your own server where the model is implicit (no model required)

  • Minimal .env:
CUSTOM_ENDPOINT=https://my-openai-compatible-server
CUSTOM_ENDPOINT_API_KEY=my_server_token
# MODEL_NAME omitted on purpose
  • Run:
./rollmbenchmark evaluate
  • Or with CLI only (no .env):
CUSTOM_ENDPOINT_API_KEY=my_server_token \
./rollmbenchmark evaluate --custom-endpoint https://my-openai-compatible-server --evaluators gpt-5

Behavior summary when using a custom endpoint:

  • Provider is forced to OpenAI-compatible.

  • If MODEL_NAME is not set, the system uses an internal "auto" model sentinel.

Price calculation

The pricing is currently based on hardcoded values (per 1 million tokens).

To update or add prices for models:

  1. Open the file:
    src/infrastructure/configuration.py
  2. Locate the following lines:
    • Line 70 → Input token pricing
    • Line 94 → Output token pricing
  3. Modify the existing values or add new models with their corresponding prices.
    price_per_1m_input_tokens_by_model: dict[str, float] = Field(default_factory=lambda: {
        # OpenAI (Last update 17 Sep 2025)
        "gpt-5": 1.25,
        "gpt-5-mini": 0.25,
        "gpt-5-nano": 0.05,
        "gpt-4o": 2.50,
        "gpt-4o-mini": 0.15,
        "o1": 15.00,
        "o1-pro": 150.00,
        "o3": 2.00,
        # ...
    })
For more advanced customization, edit the src/configuration.py file.
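Given per-million prices like those above, the cost estimate reduces to tokens / 1,000,000 times the price, summed over input and output. A minimal sketch (the gpt-5-mini input price is taken from the table above; the output price here is a hypothetical placeholder):

```python
# Illustrative price tables per 1M tokens, mirroring the config structure
price_per_1m_input = {"gpt-5-mini": 0.25}   # from the table above
price_per_1m_output = {"gpt-5-mini": 2.00}  # hypothetical output price

def estimate_cost(model, input_tokens, output_tokens):
    """Cost in USD: (tokens / 1M) * per-million price, per direction."""
    return (
        input_tokens / 1_000_000 * price_per_1m_input[model]
        + output_tokens / 1_000_000 * price_per_1m_output[model]
    )

cost = estimate_cost("gpt-5-mini", 100_000, 20_000)  # 0.025 + 0.04 = 0.065
```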

LangSmith integration

To monitor each LLM call in detail, configure your LangSmith environment variables:

LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=<your_api_key>
LANGSMITH_PROJECT=<your_project_name>

Experiment Data (Article Assets)

The experiment_data/ folder contains the analysis datasets used in the paper, including cleaned Excel/CSV exports for model evaluations, human ratings, and preference tests.

It also includes raw generation artifacts under experiment_data/generated_answers/ (JSON/JSONL), plus small utilities used during analysis. See experiment_data/README.md for column-level descriptions of each dataset.

RAG/retrieval results are stored in the repo as Langfuse traces, located in experiment_data/generated_answers/.

Contact

Have questions, feedback, or suggestions? Feel free to reach out:

GitHub Issues: Open an issue for bug reports or feature requests.

We’d love to hear from you!

About

A Benchmark for Romanian Tax, Accounting, and ERP-Centric Decision Support

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors