A comprehensive evaluation tool for verifying conversational AI applications.
This project offers a robust, end-to-end framework for evaluating the performance and reliability of conversational AI systems across a variety of real-world scenarios and quality metrics. The AIEvaluationTool is designed to automate the process of testing, analyzing, and benchmarking conversational agents, ensuring they meet high standards of accuracy, safety, and user experience.
AIEvaluationTool/
├── data/
│ ├── DataPoints.json
│ ├── plans.json
│ ├── strategy_map.json
│ └── ...
├── src/
│ ├── app/sarvam_ai
│ │ └── ... (scripts to run LLMs locally)
│ ├── app/importer
│ │ └── ... (scripts to import data from json to the database)
│ ├── app/interface_manager
│ │ └── ... (scripts to interact with WhatsApp Web or web app bots)
│ ├── app/testcase_executor
│ │ └── ... (scripts to run testcase execution from the prompts stored in the database)
│ ├── app/response_analyzer
│ │ └── ... (scripts to analyse the collected responses, compute scores, and store them in the database)
│ ├── lib/strategy
│ │ └── ... (implementation of model and rules based evaluation strategies)
│ ├── lib/orm
│ │ └── ... (ORM implementation of the data model)
│ ├── lib/data
│ │ └── ... (Pydantic classes of all the data model objects)
│ ├── lib/interface_manager
│ │ └── ... (wrapper class to talk to the Interface Manager stub)
│ ├── lib/utils
│ │ └── ... (helper functions)
│ ├── notebooks
│ │ └── ... (Python notebooks)
└── requirements.txt
- Responsible AI: Assesses the ethical and safe behavior of the AI, including toxicity detection and guardrail enforcement.
- Conversational Quality: Measures the fluency, coherence, and appropriateness of responses using advanced linguistic metrics and human-like judgment.
- Guardrails and Safety: Evaluates the AI's ability to avoid generating unsafe, toxic, or inappropriate content, and to comply with predefined safety and ethical guidelines.
- Language Support: Evaluates the model's ability to understand and generate text in multiple languages, including coverage and similarity metrics.
- Task Understanding: Tests the AI's ability to comprehend and execute user instructions accurately.
- Performance and Scalability: Assesses the system's speed, reliability, and stability through performance and scalability metrics.
- Privacy and Security: Assesses the system's ability to safeguard sensitive information, maintain user trust, and resist misuse or adversarial manipulation while ensuring balanced and responsible handling of safety constraints.
- Test Case Execution: A mechanism to send a diverse set of prompts to the conversational AI, simulating real user interactions across different platforms (e.g., WhatsApp, web interfaces).
- Response Analysis: Applies a suite of custom and standard evaluation strategies to each response, including text similarity, grammar checking, toxicity analysis, and more (a sketch of one such strategy follows this list).
- Metric Aggregation: Aggregates results into comprehensive reports, highlighting strengths and areas for improvement across all tested dimensions.
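As an illustration of how a text-similarity strategy can score a response, the sketch below compares a bot reply against a reference answer using sentence embeddings (with `all-MiniLM-L6-v2`, one of the models listed later in this README). The class name and method signature are hypothetical and do not reflect the tool's internal `lib/strategy` API.

```python
# Hypothetical sketch of an embedding-based text-similarity strategy.
# The actual strategies live in src/lib/strategy; names here are illustrative.
from sentence_transformers import SentenceTransformer, util


class TextSimilarityStrategy:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def score(self, response: str, reference: str) -> float:
        """Return the cosine similarity between the response and the reference."""
        emb = self.model.encode([response, reference], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]).item())


if __name__ == "__main__":
    strategy = TextSimilarityStrategy()
    print(strategy.score(
        "Your plan renews on the 5th.",
        "The plan renews on the 5th of each month.",
    ))
```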
git clone https://github.com/cerai-iitm/AIEvaluationTool
cd AIEvaluationTool

Before installing Python dependencies, ensure you have the following prerequisites installed on your system:
- Python 3.10+
- Google Chrome Browser
- ChromeDriver (must match your Chrome version; this is a mandatory install for interface automation)
- MariaDB Server
Install all dependencies for each component using the provided requirements.txt files:
# For installing dependencies
pip install -r requirements.txt

To use the LLM-as-a-judge mechanism for evaluation, you must have a language model available. You can either:
- Run a model locally (e.g., using Ollama, OpenAI-compatible local models, etc.), or
- Provide API keys for cloud-based models (e.g., OpenAI, Anthropic, etc.)
Supported Models:
- OpenAI GPT-3.5/4 (via API key)
- Anthropic Claude (via API key)
- Ollama (local)
- Any OpenAI-compatible local model
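For a local Ollama setup, the judge model can be queried over Ollama's REST API. The snippet below is a minimal sketch only: the judging prompt, scoring format, and helper function are illustrative and not the tool's actual judge interface. It assumes Ollama is running at the default `http://localhost:11434`.

```python
# Minimal sketch of an LLM-as-judge call against a local Ollama server.
# The judging prompt and helper function are illustrative, not the tool's own.
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint
MODEL = "llama3.1:70b"                 # whatever LLM_AS_JUDGE_MODEL points to


def judge(question: str, answer: str) -> str:
    prompt = (
        "Rate the following answer for correctness on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()


print(judge("What is the capital of France?", "Paris"))
```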
Configuration:
- Ensure that `.env.example` in the root folder is initialized with appropriate values to create a `.env` file:
  - `OLLAMA_URL` points to the installed Ollama instance's endpoint address. Typically it is `http://localhost:11434/`.
  - `LLM_AS_JUDGE_MODEL` points to the name of the LLM (loaded via Ollama) that we want to use as a judge. Typically, it is `llama3.1:70b`.
  - `PERSPECTIVE_API_KEY` should have the API key of the Perspective service for toxicity detection.
  - `GPU_URL` should point to the Sarvam AI REST API server (`./src/app/sarvam_ai/`) hosted elsewhere. Typically, the URL is `http://localhost:8000`.
- For API-based models, set your API key in the `.env` file or as an environment variable (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
- For local models, ensure the model server is running and accessible at the expected endpoint (see your model provider's documentation).
Ensure your model is accessible and properly configured before running the evaluation pipeline. Refer to the relevant documentation for your chosen model provider for setup instructions.
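As a quick sanity check before running the pipeline, a small script like the one below loads the `.env` file (via `python-dotenv`, assumed to be installed) and reports which of the variables described above are set. The variable names come from `.env.example`; the check itself is only an illustration.

```python
# Illustrative sanity check for the environment variables described above.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file from the current working directory

for var in ("OLLAMA_URL", "LLM_AS_JUDGE_MODEL", "PERSPECTIVE_API_KEY", "GPU_URL"):
    status = "set" if os.getenv(var) else "MISSING"
    print(f"{var}: {status}")
```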
Ensure the data/ directory contains the following files (already present in the repository):
- `DataPoints.json` (sample test dataset)
- `plans.json`
- `strategy_map.json`
- `strategy_id.json`
- `metric_strategy_mapping.json`
- A detailed set of seeding data points shall be provided upon request.
Step 1: Import datapoints into Database
Create a database in the MariaDB server and authorize a database user with full privileges. Replace the host, port number, username, password, and database name in the config.json file.
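The exact schema of `config.json` is defined by the importer; as a rough illustration only, a hypothetical check like the one below can confirm that the database credentials you put in it actually work. The key names `host`, `port`, `user`, `password`, and `database` are assumptions here, so adjust them to match your file.

```python
# Hypothetical connectivity check for the MariaDB credentials in config.json.
# Key names below are assumptions; match them to your actual config file.
import json

import pymysql  # any MariaDB-compatible client works

with open("config.json") as f:
    cfg = json.load(f)

conn = pymysql.connect(
    host=cfg["host"],
    port=int(cfg["port"]),
    user=cfg["user"],
    password=cfg["password"],
    database=cfg["database"],
)
print("Connected to MariaDB:", conn.get_server_info())
conn.close()
```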
Open a terminal on your machine and run:
python3 src/app/importer/main.py --config "path to the config file"

After running the importer script, the terminal shows the following output.
Step 2: Start the InterfaceManager API Service
Open a terminal on your machine and run:
cd src/app/interface_manager
python main.py

After starting the InterfaceManager API Service, the terminal shows the following output.
Step 3: Run the Test Case Execution Manager
Replace the host, port number, username, password, and database name in the config.json file. Open another terminal on your machine and run the following to see what options are available in the test case executor:
cd src/app/testcase_executor
python main.py --config "path to config file" -h

To run the test case execution, run the following command:
cd src/app/testcase_executor
python main.py --testplan-id <testplan-id> --testcase-id <testcase-id> --metric-id <metric-id> --max-testcases <max-testcases> --config "path to config file" --execute

(Adjust --testplan-id, --testcase-id, --metric-id, and --max-testcases as needed.)
On running the Test Case Execution Manager, the terminal output should look similar to:
The Test Case Execution Manager leverages interface automation to automatically deliver test cases to the conversational platform and retrieve responses without manual intervention.
This step will execute the test cases and store the responses in data/responses.json.
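If you want to verify that responses were actually collected, a small check like the one below reads data/responses.json back. The per-entry schema is tool-specific, so this sketch only reports how many entries are present.

```python
# Quick check that test-case execution produced responses.
# The per-entry schema is tool-specific, so we only count entries here.
import json

with open("data/responses.json") as f:
    responses = json.load(f)

print(f"Collected {len(responses)} response record(s)")
```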
Step 4: Run the LLMs on your GPUs
For the evaluation framework to work, the following four models need to be in place:
- sarvamai/sarvam-2b-v0.5
- google/shieldgemma-2b
- sarvamai/sarvam-translate
- mistral:7b-instruct (Default LLM as Judge)
cd src/app/sarvam_ai
python main.py

You need to set up port forwarding so that the models running on the GPU machine can be reached from your testing machine. You can use the commands below:
ollama run mistral:7b-instruct
ssh gpu_machine_cred@machineIP -L testing_machine_ip:11434:localhost:11434 -L testing_machine_ip:8000:localhost:8000

Here, port 11434 exposes the LLM-as-judge model (i.e., mistral:7b-instruct) hosted through Ollama, and port 8000 serves the other three models.
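Once the SSH tunnel is up, a quick check like the one below confirms that both forwarded endpoints are reachable from the testing machine. It assumes the defaults mentioned earlier: http://localhost:11434 for Ollama and http://localhost:8000 for the Sarvam AI REST API server.

```python
# Illustrative reachability check for the two forwarded ports.
import requests

for name, url in (
    ("Ollama (LLM-as-judge)", "http://localhost:11434"),
    ("Sarvam AI REST API", "http://localhost:8000"),
):
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: reachable (HTTP {r.status_code})")
    except requests.RequestException as exc:
        print(f"{name}: NOT reachable ({exc})")
```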
There are other small-sized models that are downloaded automatically while running this application. The models are:
- amedvedev/bert-tiny-cognitive-bias
- NousResearch/Minos-v1
- LibrAI/longformer-harmful-ro
- vectara/hallucination_evaluation_model
- thenlper/gte-small
- all-MiniLM-L6-v2
- facebook/bart-large-cnn
- nicholasKluge/ToxiGuardrail
- paraphrase-multilingual-mpnet-base-v2
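These are standard Hugging Face models that the tool fetches on its own. As an illustration of how one of them can be exercised on its own, the sketch below runs the `facebook/bart-large-cnn` summarizer through the `transformers` pipeline; the sample text is made up for the example.

```python
# Illustrative standalone use of one of the auto-downloaded models.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = (
    "The AIEvaluationTool sends prompts to a conversational AI, collects the "
    "responses, and scores them against metrics such as toxicity, similarity, "
    "and task understanding before aggregating the results into a report."
)
summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```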
Step 5: Run the Response Analyzer
Once the previous step has completed and responses.json is populated, open a new terminal and run:
cd src/app/response_analyzer
python analyze.py --config "path to config file" --run-name <run-name>

(Adjust --run-name as needed.)
Note: If you are using a local model (e.g., Ollama or any OpenAI-compatible local model), ensure that the model server is running in the background and accessible before executing the Response Analyzer.
When executed, the Response Analyzer displays a detailed report on the terminal showing the scores evaluated for each metric under the test plan; these scores serve as an indicator of how well the model performs against a particular metric.
A sample evaluation report generated by the Response Analyzer can be seen below: