Eval-driven QA architecture for tool-using AI agents.
Tool correctness • handoff validation • state retention • failure recovery • CI reliability gates
Most AI demos stop at “the answer looked right.”
This lab focuses on the quality problems enterprise teams actually need to govern in production:
- Did the agent choose the right tool?
- Did it preserve state across handoffs?
- Did it recover safely from partial failure?
- Did it avoid unsafe actions?
- Did the reliability profile regress after a prompt or model change?
This project is designed to help position a QA leader as someone who understands AI quality architecture, not just AI prompt experimentation.
- Multi-step support / ops agent with specialist handoffs
- 5 tools exposed through a controlled registry
- Fault injection for context loss and partial execution issues
- Offline eval benchmark with weighted reliability scoring
- Trace logging for scenario-by-scenario failure analysis
- CI gate that blocks merges when reliability drops below threshold
- Streamlit dashboard scaffold for review and storytelling
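One way to picture the "controlled registry" idea from the list above is an allowlist that agents must go through to reach any tool. The sketch below is a minimal illustration, not the lab's actual implementation: `ToolRegistry`, `ToolNotRegisteredError`, and the mock `get_order_status` signature are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


class ToolNotRegisteredError(KeyError):
    """Raised when an agent asks for a tool outside the registry (hypothetical name)."""


@dataclass
class ToolRegistry:
    # Maps tool name -> callable; only registered names can be invoked.
    _tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        # Refuse anything that was not explicitly registered.
        if name not in self._tools:
            raise ToolNotRegisteredError(name)
        return self._tools[name](**kwargs)


# Hypothetical mock tool; the real signature is not shown in this README.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delayed"}


registry = ToolRegistry()
registry.register("get_order_status", get_order_status)
result = registry.call("get_order_status", order_id="A-100")
```

Routing every call through a single `call` method is what makes tool usage easy to trace and grade later, since there is exactly one place to log or block an invocation.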
User Request
     |
     v
Orchestrator Agent
     |----> Order Specialist
     |----> Billing Specialist
     |----> Knowledge Specialist
     |
     +----> Tool Layer
              - get_order_status
              - issue_refund
              - search_kb
              - update_ticket
              - notify_human
              |
              +----> Trace Collector
                       |
                       +----> Eval Runner
                                - offline datasets
                                - rule-based graders
                                - reliability score
                                - CI threshold gate
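The first hop in the flow above, from orchestrator to specialist, could be sketched as a routing function that returns a handoff carrying a copy of the conversation state. This is an illustrative sketch only: the keyword table, the `Handoff` type, and the specialist identifiers are assumptions, and the lab's real routing logic is not shown in this README.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical keyword routing table, checked in insertion order.
ROUTES = {
    "refund": "billing_specialist",
    "charge": "billing_specialist",
    "order": "order_specialist",
    "delivery": "order_specialist",
}
DEFAULT_ROUTE = "knowledge_specialist"


@dataclass
class Handoff:
    target: str
    # State passed along with the handoff so the specialist keeps context.
    state: Dict[str, str] = field(default_factory=dict)


def route(request: str, state: Dict[str, str]) -> Handoff:
    """Pick a specialist by keyword and hand over a copy of the state."""
    text = request.lower()
    for keyword, specialist in ROUTES.items():
        if keyword in text:
            return Handoff(target=specialist, state=dict(state))
    # Nothing matched: fall back to the knowledge specialist.
    return Handoff(target=DEFAULT_ROUTE, state=dict(state))
```

Copying the state into the handoff gives an evaluator something concrete to compare: if a key present before the handoff is missing afterwards, state retention failed.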
What the dashboard is meant to surface:
- aggregate reliability score
- tool precision and safety score
- scenario-level outcomes
- per-scenario reliability trends
The trace layer captures the sequence that CTOs and architects care about:
- orchestrator decision
- specialist handoff
- injected fault
- recovery action
- tool call history
- policy violations
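The event types listed above map naturally onto a small append-only trace structure. The following is a minimal sketch under assumed names (`TraceEvent`, `ScenarioTrace`, and the event `kind` strings are hypothetical); the recorded sequence is an illustration, not output from the lab.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TraceEvent:
    kind: str                 # e.g. "orchestrator_decision", "tool_call"
    detail: Dict[str, Any]    # free-form payload for the event


@dataclass
class ScenarioTrace:
    scenario: str
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, **detail: Any) -> None:
        # Append-only, so event order reflects execution order.
        self.events.append(TraceEvent(kind=kind, detail=detail))

    def kinds(self) -> List[str]:
        return [event.kind for event in self.events]

    def to_json(self) -> str:
        # asdict converts nested dataclasses, so the result is JSON-serializable.
        return json.dumps(asdict(self), indent=2)


# Illustrative sequence for a context-loss scenario.
trace = ScenarioTrace(scenario="refund_with_context_loss")
trace.record("orchestrator_decision", route="billing_specialist")
trace.record("specialist_handoff", target="billing_specialist")
trace.record("injected_fault", fault="context_loss")
trace.record("recovery_action", action="notify_human")
trace.record("tool_call", tool="notify_human")
```

Serializing each scenario to its own JSON document would fit the per-scenario layout described for `artifacts/traces/`, though the actual export format is not specified here.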
The GitHub Action is designed to make AI quality enforceable, not just observable. A pull request should fail when the benchmark reliability score drops below the configured threshold.
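The gate logic itself can be very small: compare the benchmark score to the threshold and turn the result into a process exit code, which is what makes a CI job pass or fail. A minimal sketch, assuming a report dictionary with a `reliability_score` key (the key name and function name are hypothetical):

```python
def gate_exit_code(report: dict, min_score: float) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise."""
    score = float(report["reliability_score"])  # hypothetical report key
    if score < min_score:
        print(f"FAIL: reliability {score:.2f} < threshold {min_score:.2f}")
        return 1
    print(f"PASS: reliability {score:.2f} >= threshold {min_score:.2f}")
    return 0
```

A CI helper script could pass this return value to `sys.exit`, so that any non-zero result fails the pull request check.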
src/agent_reliability_lab/
  agents/       orchestration, specialists, optional OpenAI adapter
  core/         shared models, runtime, tool registry
  tools/        mock tool implementations with safety and fault injection
  evals/        graders, reporting, thresholds
  tracing/      trace models and exporters
  scenarios/    scenario library for benchmark execution
  dashboard/    Streamlit dashboard
artifacts/
  reports/      latest benchmark output
  traces/       per-scenario traces
data/evals/           benchmark dataset
scripts/              CLI and CI helpers
.github/workflows/    CI reliability gate
python -m venv .venv
source .venv/bin/activate
pip install -e .

Optional OpenAI support:

pip install -e .[openai]

arl run --scenario order_delay

arl eval --dataset data/evals/core_benchmark.yaml --min-score 0.80

streamlit run src/agent_reliability_lab/dashboard/app.py

Set the following environment variables:

export OPENAI_API_KEY="your_key_here"
export ARL_USE_OPENAI="1"

Then run:

arl run --scenario refund_with_context_loss --use-openai

Note: the repository is intentionally runnable in deterministic local mode first, so the quality architecture can be reviewed immediately without external dependencies.
Each run is scored across:
- task success
- tool precision
- handoff accuracy
- state retention
- recovery score
- safety score
The weighted output becomes the repository’s reliability score, which is then enforced in CI.
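A weighted score over the six dimensions above can be computed as a normalized weighted sum. The weights below are placeholders chosen for illustration; this README does not state the values the lab actually uses, and the metric key names are likewise assumed.

```python
from typing import Dict

# Hypothetical weights (they sum to 1.0); the real values are not given here.
WEIGHTS: Dict[str, float] = {
    "task_success": 0.25,
    "tool_precision": 0.20,
    "handoff_accuracy": 0.15,
    "state_retention": 0.15,
    "recovery_score": 0.10,
    "safety_score": 0.15,
}


def reliability_score(metrics: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    # Dividing by the total keeps the result in [0, 1] even if weights are retuned.
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS) / total_weight
```

Failing loudly on a missing metric is a deliberate choice in this sketch: silently treating an absent dimension as zero or one would distort the score that CI enforces.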
- order_delay: routes to the order specialist and verifies the order lookup flow
- refund_eligible: validates the refund path with proper eligibility handling
- refund_with_context_loss: injects context loss and expects safe recovery
- unsafe_refund_attempt: blocks a policy-breaking refund and escalates
- kb_question: routes correctly to knowledge tooling without touching billing flows
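To make the context-loss scenario concrete, the sketch below drops a key from the handoff state and checks that a specialist escalates instead of issuing a refund. It is a simplified illustration under stated assumptions: `inject_context_loss`, the `billing_specialist` function, and the state keys are hypothetical, and only the tool names come from the tool layer described earlier.

```python
from typing import Dict, List


def inject_context_loss(state: Dict[str, str], drop_keys: List[str]) -> Dict[str, str]:
    """Return a copy of the handoff state with the given keys removed."""
    return {key: value for key, value in state.items() if key not in drop_keys}


def billing_specialist(state: Dict[str, str]) -> List[str]:
    """Return the tool calls the specialist would make for a refund request."""
    if "order_id" not in state:
        # Safe recovery: never refund without the order reference; escalate instead.
        return ["notify_human"]
    return ["get_order_status", "issue_refund"]


full_state = {"order_id": "A-100", "intent": "refund"}

# Baseline run with intact state.
healthy = billing_specialist(full_state)

# Faulted run: the order reference is lost during the handoff.
faulty = billing_specialist(inject_context_loss(full_state, ["order_id"]))
```

A rule-based grader for this scenario could then assert that `issue_refund` never appears in the faulted run's tool calls, which is the "safe recovery" expectation stated in the scenario list.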
This project signals that you understand:
- agentic workflows, not just single prompts
- AI testability and controlled orchestration
- trace-based debugging
- eval-driven release governance
- measurable AI quality operations
Suggested GitHub subtitle:
Eval-driven QA architecture for tool-using AI agents with traces, handoff validation, safety checks, and CI reliability gates.
- Palette: #FF5722, #121212, #FFFFFF
- Primary font direction: Inter
- Monospace direction: JetBrains Mono style treatment
All preview images are stored under docs/assets/ so the repository is ready for GitHub display.
- Replace mock tools with MCP-backed tools or service connectors
- Add side-by-side prompt / model comparison runs
- Export traces to OpenTelemetry or LangSmith
- Add red-team datasets for prompt injection and tool abuse
- Extend the dashboard into a real AI QualityOps console
MIT



