A tool that helps you improve the instructions you give to AI assistants (like ChatGPT, Claude, or Gemini).
When you build an AI-powered application, you write a "system prompt" - the hidden instructions that tell the AI how to behave. But how do you know if your instructions are actually working?
This tool answers: What parts of my AI instructions are working, what's broken, and how do I fix them?
- Analyse - The system breaks your instructions down into testable sections and identifies potential problem areas
- Test - It runs your instructions through multiple AI models (Claude, Gemini, GPT-4, etc.) with real user scenarios
- Evaluate - An AI judge scores how well each model followed your instructions, checking:
  - Did it use the right tools?
  - Did it respond in the right style/voice?
  - Was the response helpful and accurate?
- Report - You get a diagnostic report showing which parts of your instructions work well and which need improvement, with specific recommendations
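The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not the project's actual code: `split_sections`, `run_diagnostic`, `needs_improvement`, the stub model and the stub judge are all hypothetical names, and the "one section per blank-line-separated block" rule is an assumption made for the example.

```python
from statistics import mean
from typing import Callable


def split_sections(system_prompt: str) -> list[str]:
    """Analyse: treat each blank-line-separated block as one testable section."""
    return [block.strip() for block in system_prompt.split("\n\n") if block.strip()]


def run_diagnostic(
    system_prompt: str,
    models: list[str],
    scenarios: list[str],
    call_model: Callable[[str, str, str], str],
    judge: Callable[[str, str], float],
) -> dict[str, float]:
    """Test + Evaluate: average the judge's score per section over all runs."""
    report: dict[str, float] = {}
    for section in split_sections(system_prompt):
        scores = []
        for model in models:
            for scenario in scenarios:
                reply = call_model(model, system_prompt, scenario)
                scores.append(judge(section, reply))
        report[section] = mean(scores)
    return report


def needs_improvement(report: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Report: list the sections whose average score falls below the threshold."""
    return [section for section, score in report.items() if score < threshold]


# Hypothetical stand-ins so the sketch runs without any API access.
def stub_model(model: str, system_prompt: str, scenario: str) -> str:
    # Pretend every model answers politely but never calls a tool.
    return f"Certainly! Here is help with: {scenario}"


def stub_judge(section: str, reply: str) -> float:
    # Toy rule: the tool-related section fails, everything else passes.
    return 0.0 if "tool" in section.lower() else 1.0


PROMPT = "Always answer politely.\n\nUse the search tool for factual questions."
report = run_diagnostic(
    PROMPT,
    models=["model-a", "model-b"],
    scenarios=["What is the capital of France?"],
    call_model=stub_model,
    judge=stub_judge,
)
weak = needs_improvement(report)
print(weak)
```

Passing the model caller and the judge in as plain callables keeps the loop itself deterministic, so it can be exercised with stubs like the ones shown before any real model is involved.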
The system uses a 3-layer architecture for reliability:
┌────────────────┬────────────────────────────────┐
│ Directives     │ What to do (Markdown SOPs)     │
├────────────────┼────────────────────────────────┤
│ Orchestration  │ AI decision-making layer       │
├────────────────┼────────────────────────────────┤
│ Execution      │ Deterministic Python scripts   │
└────────────────┴────────────────────────────────┘
This separation ensures that complex operations are handled by tested, reliable code rather than unpredictable AI generation.
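As a rough illustration of that separation, the sketch below gives each layer a toy counterpart. Everything here is hypothetical: the directive format (`run: <name>` lines), the `EXECUTION_TOOLS` registry and the two helper functions are invented for the example, and a trivial parser stands in for the AI orchestrator.

```python
from typing import Callable


# Execution layer: plain, deterministic functions that can be unit-tested.
def count_words(text: str) -> int:
    return len(text.split())


def count_sections(text: str) -> int:
    return len([block for block in text.split("\n\n") if block.strip()])


EXECUTION_TOOLS: dict[str, Callable[[str], int]] = {
    "count_words": count_words,
    "count_sections": count_sections,
}

# Directive layer: a Markdown SOP describing what to do, not how to do it.
DIRECTIVE = """# Prompt size check
1. run: count_sections
2. run: count_words
"""


# Orchestration layer: decides which execution step to call and in what order.
# Here a simple line parser plays that role (an assumption for this sketch).
def orchestrate(directive: str, payload: str) -> dict[str, int]:
    results: dict[str, int] = {}
    for line in directive.splitlines():
        _, found, name = line.partition("run:")
        name = name.strip()
        if found and name in EXECUTION_TOOLS:
            results[name] = EXECUTION_TOOLS[name](payload)
    return results


result = orchestrate(DIRECTIVE, "Be polite.\n\nUse tools when needed.")
print(result)
```

The point of the sketch is that the orchestrator only chooses among registered functions; the work itself is done by code whose output is the same on every run.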
- Clone the repository
- Copy .env.example to .env and add your API keys
- Install dependencies: pip install -r requirements.txt
- Run a diagnostic: python scripts/run_diagnostic.py
- Python 3.11+
- API key for OpenRouter (provides access to multiple AI models)
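A small preflight check along these lines can confirm both requirements before a diagnostic is run. It is a sketch under assumptions: the `preflight` helper is hypothetical, and `OPENROUTER_API_KEY` is a guessed variable name, so use whatever name .env.example actually defines.

```python
import os
import sys
from collections.abc import Mapping, Sequence

# Assumed variable name; check .env.example for the real one.
API_KEY_VAR = "OPENROUTER_API_KEY"


def preflight(
    version_info: Sequence[int] = sys.version_info,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: list[str] = []
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11 or newer is required")
    if not env.get(API_KEY_VAR):
        problems.append(f"{API_KEY_VAR} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(f"preflight: {problem}")
```

Taking the version and environment as parameters makes the check easy to test without changing the real interpreter or shell environment.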
├── directives/ # Instructions for the AI orchestrator
├── execution/ # Python scripts that do the actual work
├── config/ # Model and tool configurations
├── scripts/ # Entry points for running diagnostics
└── tests/ # Test suite
MIT