A comprehensive evaluation tool for verifying conversational AI applications.
This project offers a robust, end-to-end framework for evaluating the performance and reliability of conversational AI systems across a variety of real-world scenarios and quality metrics. The AIEvaluationTool is designed to automate the process of testing, analyzing, and benchmarking conversational agents, ensuring they meet high standards of accuracy, safety, and user experience.
AIEvaluationTool/
├── data/
│ ├── DataPoints.json
│ ├── plans.json
│ ├── strategy_map.json
│ └── ...
├── src/
│ ├── app/sarvam_ai
│ │ └── ... (scripts to run LLMs locally)
│ ├── app/importer
│ │ └── ... (scripts to import data from json to the database)
│ ├── app/interface_manager
│ │ └── ... (scripts to interact with WhatsApp Web or web app bots)
│ ├── app/testcase_executor
│ │ └── ... (scripts to run testcase execution from the prompts stored in the database)
│ ├── app/response_analyzer
│ │ └── ... (scripts to analyse the collected responses, compute scores, and store them in the database)
│ ├── lib/strategy
│ │ └── ... (implementation of model and rules based evaluation strategies)
│ ├── lib/orm
│ │ └── ... (ORM implementation of the data model)
│ ├── lib/data
│ │ └── ... (Pydantic classes of all the data model objects)
│ ├── lib/interface_manager
│ │ └── ... (wrapper class to talk to the Interface Manager stub)
│ ├── lib/utils
│ │ └── ... (helper functions)
│ ├── notebooks
│ │ └── ... (Python notebooks)
└── requirements.txt
- Responsible AI: Assesses the ethical and safe behavior of the AI, including toxicity detection and guardrail enforcement.
- Conversational Quality: Measures the fluency, coherence, and appropriateness of responses using advanced linguistic metrics and human-like judgment.
- Guardrails and Safety: Evaluates the AI's ability to avoid generating unsafe, toxic, or inappropriate content, and to comply with predefined safety and ethical guidelines.
- Language Support: Evaluates the model's ability to understand and generate text in multiple languages, including coverage and similarity metrics.
- Task Understanding: Tests the AI's ability to comprehend and execute user instructions accurately.
- Performance and Scalability: Assesses the system's speed, reliability, and stability through performance and scalability metrics.
- Privacy and Security: Assesses the system's ability to safeguard sensitive information, maintain user trust, and resist misuse or adversarial manipulation while ensuring balanced and responsible handling of safety constraints.
- Test Case Execution: A mechanism to send a diverse set of prompts to the conversational AI, simulating real user interactions across different platforms (e.g., WhatsApp, web interfaces).
- Response Analysis: Applies a suite of custom and standard evaluation strategies to each response, including text similarity, grammar checking, toxicity analysis, and more (a sketch of one such strategy follows this list).
- Metric Aggregation: Aggregates results into comprehensive reports, highlighting strengths and areas for improvement across all tested dimensions.
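As an illustration of how a text-similarity strategy can score a response, the sketch below compares a bot reply against a reference answer using sentence embeddings (with `all-MiniLM-L6-v2`, one of the models listed later in this README). The class name and method signature are hypothetical and do not reflect the tool's internal `lib/strategy` API.

```python
# Hypothetical sketch of an embedding-based text-similarity strategy.
# The actual strategies live in src/lib/strategy; names here are illustrative.
from sentence_transformers import SentenceTransformer, util


class TextSimilarityStrategy:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def score(self, response: str, reference: str) -> float:
        """Return the cosine similarity between the response and the reference."""
        emb = self.model.encode([response, reference], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]).item())


if __name__ == "__main__":
    strategy = TextSimilarityStrategy()
    print(strategy.score(
        "Your plan renews on the 5th.",
        "The plan renews on the 5th of each month.",
    ))
```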
git clone https://github.com/cerai-iitm/AIEvaluationTool
cd AIEvaluationTool

Before installing Python dependencies, ensure you have the following prerequisites installed on your system:
- Python 3.10+
- Google Chrome Browser
- ChromeDriver (must match your Chrome version; this is a mandatory install for interface automation)
- MariaDB Server
Install all dependencies for each component using the provided requirements.txt files:
# For installing dependencies
pip install -r requirements.txt

To use the LLM-as-a-judge mechanism for evaluation, you must have a language model available. You can either:
- Run a model locally (e.g., using Ollama, OpenAI-compatible local models, etc.), or
- Provide API keys for cloud-based models (e.g., OpenAI, Anthropic, etc.)
Supported Models:
- OpenAI GPT-3.5/4 (via API key)
- Anthropic Claude (via API key)
- Ollama (local)
- Any OpenAI-compatible local model
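For a local Ollama setup, the judge model can be queried over Ollama's REST API. The snippet below is a minimal sketch only: the judging prompt, scoring format, and helper function are illustrative and not the tool's actual judge interface. It assumes Ollama is running at the default `http://localhost:11434`.

```python
# Minimal sketch of an LLM-as-judge call against a local Ollama server.
# The judging prompt and helper function are illustrative, not the tool's own.
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint
MODEL = "llama3.1:70b"                 # whatever LLM_AS_JUDGE_MODEL points to


def judge(question: str, answer: str) -> str:
    prompt = (
        "Rate the following answer for correctness on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()


print(judge("What is the capital of France?", "Paris"))
```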
Configuration:
- Ensure that `.env.example` in the root folder is initialized with appropriate values to create a `.env` file:
  - `OLLAMA_URL` points to the installed Ollama instance's endpoint address. Typically it is `http://localhost:11434/`.
  - `LLM_AS_JUDGE_MODEL` points to the name of the LLM (loaded via Ollama) that we want to use as a judge. Typically, it is `llama3.1:70b`.
  - `PERSPECTIVE_API_KEY` should have the API key of the Perspective service for toxicity detection.
  - `GPU_URL` should point to the Sarvam AI REST API server (`./src/app/sarvam_ai/`) hosted elsewhere. Typically, the URL is `http://localhost:8000`.
- For API-based models, set your API key in the `.env` file or as an environment variable (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
- For local models, ensure the model server is running and accessible at the expected endpoint (see your model provider's documentation).
Ensure your model is accessible and properly configured before running the evaluation pipeline. Refer to the relevant documentation for your chosen model provider for setup instructions.
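As a quick sanity check before running the pipeline, a small script like the one below loads the `.env` file (via `python-dotenv`, assumed to be installed) and reports which of the variables described above are set. The variable names come from `.env.example`; the check itself is only an illustration.

```python
# Illustrative sanity check for the environment variables described above.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file from the current working directory

for var in ("OLLAMA_URL", "LLM_AS_JUDGE_MODEL", "PERSPECTIVE_API_KEY", "GPU_URL"):
    status = "set" if os.getenv(var) else "MISSING"
    print(f"{var}: {status}")
```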
Ensure the data/ directory contains the following files (already present in the repository):
- `DataPoints.json` (sample test dataset)
- `plans.json`
- `strategy_map.json`
- `strategy_id.json`
- `metric_strategy_mapping.json`
- A detailed set of seeding data points shall be provided upon request.
Step 1: Import datapoints into Database
Create a database in the MariaDB server and authorize a database user with full privileges. Replace the host, port number, username, password, and database name in the config.json file.
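The exact schema of `config.json` is defined by the importer; as a rough illustration only, a hypothetical check like the one below can confirm that the database credentials you put in it actually work. The key names `host`, `port`, `user`, `password`, and `database` are assumptions here, so adjust them to match your file.

```python
# Hypothetical connectivity check for the MariaDB credentials in config.json.
# Key names below are assumptions; match them to your actual config file.
import json

import pymysql  # any MariaDB-compatible client works

with open("config.json") as f:
    cfg = json.load(f)

conn = pymysql.connect(
    host=cfg["host"],
    port=int(cfg["port"]),
    user=cfg["user"],
    password=cfg["password"],
    database=cfg["database"],
)
print("Connected to MariaDB:", conn.get_server_info())
conn.close()
```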
Open a terminal on your machine and run:
python3 src/app/importer/main.py --config "path to the config file"

After running the importer script, the terminal shows the following output.
Step 2: Start the InterfaceManager API Service
Open a terminal on your machine and run:
cd src/app/interface_manager
python main.py

After starting the InterfaceManager API Service, the terminal shows the following output.
Step 3: Run the Test Case Execution Manager
Replace the host, port number, username, password, and database name in the config.json file. Open another terminal on your machine and run the following to see what options are available in the test case executor:
cd src/app/testcase_executor
python main.py --config "path to config file" -h

To run the test case execution, run the following command:
cd src/app/testcase_executor
python main.py --testplan-id <testplan-id> --testcase-id <testcase-id> --metric-id <metric-id> --max-testcases <max-testcases> --config "path to config file" --execute

(Adjust --testplan-id, --testcase-id, --metric-id, and --max-testcases as needed.)
On running the Test Case Execution Manager, the terminal output should look similar to:
The Test Case Execution Manager leverages interface automation to automatically deliver test cases to the conversational platform and retrieve responses without manual intervention.
This step will execute the test cases and store the responses in data/responses.json.
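If you want to verify that responses were actually collected, a small check like the one below reads data/responses.json back. The per-entry schema is tool-specific, so this sketch only reports how many entries are present.

```python
# Quick check that test-case execution produced responses.
# The per-entry schema is tool-specific, so we only count entries here.
import json

with open("data/responses.json") as f:
    responses = json.load(f)

print(f"Collected {len(responses)} response record(s)")
```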
Step 4: Run the LLMs on your GPUs
For the evaluation framework to work, the following four models need to be in place:
- sarvamai/sarvam-2b-v0.5
- google/shieldgemma-2b
- sarvamai/sarvam-translate
- mistral:7b-instruct (Default LLM as Judge)
cd src/app/sarvam_ai
python main.py

You need to set up port forwarding so that the models running on the GPU machine can be reached from your testing machine. You can use the commands below:
ollama run mistral:7b-instruct
ssh gpu_machine_cred@machineIP -L testing_machine_ip:11434:localhost:11434 -L testing_machine_ip:8000:localhost:8000

Here, port 11434 exposes the LLM-as-judge model (i.e., mistral:7b-instruct) hosted through Ollama, and port 8000 serves the other three models.
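Once the SSH tunnel is up, a quick check like the one below confirms that both forwarded endpoints are reachable from the testing machine. It assumes the defaults mentioned earlier: http://localhost:11434 for Ollama and http://localhost:8000 for the Sarvam AI REST API server.

```python
# Illustrative reachability check for the two forwarded ports.
import requests

for name, url in (
    ("Ollama (LLM-as-judge)", "http://localhost:11434"),
    ("Sarvam AI REST API", "http://localhost:8000"),
):
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: reachable (HTTP {r.status_code})")
    except requests.RequestException as exc:
        print(f"{name}: NOT reachable ({exc})")
```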
There are other small-sized models that are downloaded automatically while running this application. The models are:
- amedvedev/bert-tiny-cognitive-bias
- NousResearch/Minos-v1
- LibrAI/longformer-harmful-ro
- vectara/hallucination_evaluation_model
- thenlper/gte-small
- all-MiniLM-L6-v2
- facebook/bart-large-cnn
- nicholasKluge/ToxiGuardrail
- paraphrase-multilingual-mpnet-base-v2
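These are standard Hugging Face models that the tool fetches on its own. As an illustration of how one of them can be exercised on its own, the sketch below runs the `facebook/bart-large-cnn` summarizer through the `transformers` pipeline; the sample text is made up for the example.

```python
# Illustrative standalone use of one of the auto-downloaded models.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = (
    "The AIEvaluationTool sends prompts to a conversational AI, collects the "
    "responses, and scores them against metrics such as toxicity, similarity, "
    "and task understanding before aggregating the results into a report."
)
summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```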
Step 5: Run the Response Analyzer
Once the previous step has completed and responses.json is populated, open a new terminal and run:
cd src/app/response_analyzer
python analyze.py --config "path to config file" --run-name <run-name>

(Adjust --run-name as needed.)
Note: If you are using a local model (e.g., Ollama or any OpenAI-compatible local model), ensure that the model server is running in the background and accessible before executing the Response Analyzer.
When executed, the Response Analyzer displays a detailed report on the terminal showing the scores evaluated for each metric under the test plan; these scores serve as an indicator of how well the model performs against a particular metric.
A sample evaluation report generated by the Response Analyzer can be seen below: