Eval Agent is an end-to-end, agentic application evaluation system that plans, executes, observes, and evaluates real applications running in real environments. It is built around a strict control plane / execution plane split: the backend owns all intelligence, and the desktop runner owns all physical interaction.
This repository was formed by merging two separate projects:
- app_eval_desktop -> desktop
- app_evaluation_agent -> backend
The unified project name is Eval_Agent.
- Overview
- Project Lineage
- Architecture At A Glance
- System Architecture Diagram
- Execution Model
- Desktop Runner (Execution Plane)
- Backend (Control Plane)
- Component Breakdown
- Bug Tracking and Triage
- Vision and Coordinate Mapping
- Database Schema
- API Reference
- Backend Setup and Installation
- Testing
- Docs Map
Eval Agent evaluates applications by executing real steps on real targets while a backend LLM-driven control plane decides the next action. The system is intentionally split:
- Backend (Control Plane): planning, vision analysis, action selection, bug reasoning, metrics, storage.
- Desktop Runner (Execution Plane): capture, deterministic action execution, visualization.
Eval Agent is the merged successor of two projects:
- app_eval_desktop -> desktop (Electron + TypeScript executor)
- app_evaluation_agent -> backend (FastAPI control plane)
All prior architecture, endpoints, and README material has been consolidated here for a single, canonical entrypoint.
- Control plane orchestrates evaluation lifecycles, test plans, test cases, LLM reasoning, bug management, and summaries.
- Execution plane performs capture and input on real machines, using deterministic action execution.
- Two execution modes are supported: secure cloud execution (headless VM) and interactive local execution (desktop runner).
graph TD
subgraph "Client / CI/CD"
U1[Desktop Runner]
CI[CI/CD Pipeline]
end
subgraph "Control Plane (FastAPI Backend)"
API[FastAPI Server]
DB[(PostgreSQL)]
REDIS[(Redis Queue)]
S3[S3 Artifact Storage]
SCAN[Virus Scanner]
end
subgraph "Execution - Cloud"
EXE[Executor Host]
VM[Ephemeral Test VM]
end
subgraph "Execution - Local"
DR[Desktop Runner Client]
AUT[App Under Test]
end
%% Cloud Path
CI -->|POST /evaluations cloud| API
API --> SCAN
SCAN -->|Clean| S3
API -->|enqueue| REDIS
EXE -->|poll queue| REDIS
EXE --> VM
VM -->|test + send results| API
API --> DB
%% Local Path
U1 -->|POST /evaluations/upload local| API
API --> SCAN
SCAN -->|Clean| S3
U1 --> DR
DR -->|GET /jobs/next| API
API --> DR
DR --> AUT
DR -->|Screenshots + Context| API
API -->|LLM reasoning + coord map| DR
DR -->|Exec actions| AUT
DR -->|Results| API
API --> DB
Eval Agent uses a TestCase runner model:
- Desktop runner polls for the next assigned TestCase.
- If assigned, the runner launches the target application (desktop or web).
- The runner enters a deterministic step loop, sending a screenshot + context and receiving exactly one action per step.
- The backend terminates the TestCase with a finish_task action.
sequenceDiagram
participant DR as Desktop Runner
participant API as Backend
participant AUT as App Under Test
DR->>API: GET /api/v1/testcases/next?executor_id=...
alt TestCase assigned
DR->>AUT: Launch app
loop Step loop (max 40)
DR->>API: POST /api/v1/vision/analyze (screenshot + context)
API-->>DR: {thought, action, description}
DR->>AUT: Execute action
end
DR->>API: PATCH /api/v1/testcases/{id} (result)
DR->>AUT: Tear down app
else None assigned
DR->>DR: Idle or stop
end
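The control flow in the diagram above can be sketched as follows. This is a minimal illustration in Python (the actual runner is TypeScript), not the runner's real code: `capture`, `analyze`, and `execute` are hypothetical callables standing in for screen capture, POST /api/v1/vision/analyze, and action execution, and the `"type"` key on the action is an assumption. Only the 40-step cap, the `{thought, action, description}` reply shape, and `finish_task` come from this README.

```python
from typing import Callable, Dict, List

MAX_STEPS = 40  # per-TestCase step cap stated in this README


def run_testcase(
    capture: Callable[[], bytes],
    analyze: Callable[[bytes, Dict], Dict],
    execute: Callable[[Dict], None],
) -> Dict:
    """Run one TestCase: one screenshot in, exactly one action out, per step."""
    history: List[Dict] = []
    for step in range(MAX_STEPS):
        screenshot = capture()
        # Hypothetical context shape; the real AgentExecutionContext is richer.
        context = {"step": step, "history": list(history)}
        reply = analyze(screenshot, context)  # {thought, action, description}
        action = reply["action"]
        history.append(action)
        # Assumption: the action carries its kind under a "type" key.
        if action.get("type") == "finish_task":
            return {"status": "finished", "steps": step + 1}
        execute(action)
    # The backend never sent finish_task within the step budget.
    return {"status": "step_limit_reached", "steps": MAX_STEPS}
```

Passing the three collaborators in as callables keeps the loop deterministic and easy to exercise with a fake backend.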
App Eval Desktop is the Electron + TypeScript executor. It is intentionally thin: no local perception, UI parsing, or model inference. All reasoning and decision-making lives in the backend.
Capture
- High-performance BGRA capture via Windows Desktop Duplication API
- Multi-monitor aware
- Optional exclusion from capture (WDA_EXCLUDEFROMCAPTURE)
- Conversion to PNG before upload
Execution
- Mouse actions: click, double-click, right-click, hover, drag, scroll
- Keyboard actions: shortcuts, simulated typing
- Clipboard-based direct text entry (paste)
- Deterministic waits and task completion
Orchestration
- Polls backend for next assigned TestCase execution
- Runs TestCases sequentially
- Handles pause, resume, stop
- Manages app lifecycle (launch + teardown)
Visualization
- Live screenshot preview
- Structured logs
- Step-by-step run timeline
- Agent context inspection
- Evaluation and TestCase history
desktop/
├── scripts/
│ ├── utils/copy-recursive.js
│ ├── copy-renderer.js
│ └── copy-native.js
├── src/
│ ├── main.ts # Electron entry, windows, IPC
│ ├── preload.ts # Secure IPC bridge
│ ├── config.ts # Runtime configuration
│ │
│ ├── core/
│ │ ├── orchestrator.ts # TestCase runner loop
│ │ ├── context.ts # AgentExecutionContext
│ │ └── logger.ts # Structured logging
│ │
│ ├── agent/
│ │ ├── executor.ts # nut-js action executor
│ │ ├── coord-mapper.ts # analysis/capture -> screen mapping
│ │ └── capture/native/ # C++ Desktop Duplication addon
│ │
│ ├── api/
│ │ └── client.ts # REST + vision calls
│ │
│ ├── renderer/
│ │ ├── locales/
│ │ ├── pages/
│ │ ├── shared/
│ │ └── styles/
│ │
│ └── types/
│ └── evaluations.d.ts
├── test/
│ └── test-window-capture.ts
Per-step loop
- Capture native screenshot -> PNG (with brightness sanity checks).
- Assemble AgentExecutionContext and last focus coordinates.
- POST screenshot + context to /api/v1/vision/analyze.
- Receive { thought, action, description }.
- Map coordinates via coord-mapper.ts and execute.
- Update scratchpad, action history, and UI timeline.
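The README does not say how the brightness sanity check in the first step works. One plausible reading is a guard against blank (near-black) frames, which a failed capture can produce. The sketch below is written in Python for illustration, assuming a tightly packed BGRA buffer; the function names and the `min_brightness` threshold are hypothetical.

```python
def mean_brightness_bgra(frame: bytes) -> float:
    """Average luma (0-255) of a tightly packed BGRA buffer."""
    if len(frame) == 0 or len(frame) % 4 != 0:
        raise ValueError("expected a non-empty BGRA buffer (4 bytes per pixel)")
    pixels = len(frame) // 4
    total = 0.0
    for i in range(0, len(frame), 4):
        b, g, r = frame[i], frame[i + 1], frame[i + 2]  # alpha byte is ignored
        total += 0.114 * b + 0.587 * g + 0.299 * r  # Rec. 601 luma weights
    return total / pixels


def looks_blank(frame: bytes, min_brightness: float = 2.0) -> bool:
    """True when the frame is darker than the (hypothetical) threshold."""
    return mean_brightness_bgra(frame) < min_brightness
```

A runner could retry the capture when `looks_blank` returns true instead of uploading an unusable screenshot.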
Agent View (Run)
- Live screenshot preview
- Step timeline (thought + action + screenshot)
- Structured logs (SYSTEM / JOB / AGENT / TOOL / CAPTURE / WARN / ERROR)
- Pause / resume
- Compact mode toggle
Apps
- Browse apps, versions, and evaluations
- Focus a lineage branch and reset focus
- Create apps + versions (upload or URL)
- Delete apps or versions
- Jump to evaluation history
Evaluations
- Assigned evaluations list
- Metadata: goal, app type, timestamps
- Link to history
- Delete evaluation
- Regenerate or edit summary (for completed evaluations)
History
- Infinite scroll TestCase history
- Markdown rendering of results
- Copy / download summary
- Re-run TestCase
Bugs
- Bug list per app with filtering and search
- Create, edit, delete bugs (status, severity, priority, fingerprint)
- Track occurrences tied to evaluations/TestCases
- Record branch-scoped fixes and verification notes
Compact Mode
- Always-on-top minimal window
- Logs + status
- Execution controls
Renderer communicates only via preload IPC. Key channels include:
- getAssignedEvaluations
- fetchEvaluation
- deleteEvaluation
- run:start / pause / resume / stop
- injectHumanPrompt
- agent-context-updated
- evaluation-attached
- run-timeline-entry
- history:refresh
- getLogBuffer / onLogUpdate
- listBugs / getBug / createBug / updateBug / deleteBug
- listBugOccurrences / createBugOccurrence
- listBugFixes / createBugFix / deleteBugFix
Key endpoints used by the desktop runner:
- GET /api/v1/apps
- POST /api/v1/apps
- DELETE /api/v1/apps/{app_id}
- GET /api/v1/apps/{app_id}/versions
- POST /api/v1/apps/{app_id}/versions
- DELETE /api/v1/apps/{app_id}/versions/{version_id}
- GET /api/v1/apps/{app_id}/versions/{version_id}/evaluations
- GET /api/v1/apps/{app_id}/bugs
- POST /api/v1/bugs
- GET /api/v1/bugs/{bug_id}
- PATCH /api/v1/bugs/{bug_id}
- DELETE /api/v1/bugs/{bug_id}
- GET /api/v1/bugs/{bug_id}/occurrences
- POST /api/v1/bugs/{bug_id}/occurrences
- GET /api/v1/bugs/{bug_id}/fixes
- POST /api/v1/bugs/{bug_id}/fixes
- DELETE /api/v1/bugs/{bug_id}/fixes/{fix_id}
- GET /api/v1/testcases/next
- PATCH /api/v1/testcases/{id}
- POST /api/v1/vision/analyze
- GET /api/v1/evaluations/{id}
- PATCH /api/v1/evaluations/{id}/summary
- POST /api/v1/evaluations/{id}/regenerate-summary
See docs/endpoints.md for the authoritative schema.
.env:
API_BASE_URL=http://127.0.0.1:8000
EXECUTOR_ID=<unique-machine-id>
Additional behavior:
- Capture defaults controlled in desktop/src/config.ts
- Theme, language, executor ID configurable via Settings UI
- Executor ID persists across restarts
Install dependencies:
npm install
Build:
npm run build
Dev (renderer + Electron with Vite HMR):
npm run dev
Dev (separate terminals):
npm run dev:renderer
npm run dev:electron
Run:
npm start
Package:
npm run make
Test native capture:
npx ts-node test/test-window-capture.ts
- Renderer is fully sandboxed (no Node access)
- All side effects happen in main/orchestrator
- Vision is non-streaming
- Clipboard is restored after direct text entry
- Click-through is reference-counted to avoid stuck windows
- Max 40 steps per TestCase
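The reference-counted click-through noted above can be illustrated with a small counter. This is a sketch in Python rather than the runner's TypeScript. The class name and the `set_click_through` callback are hypothetical; in Electron the callback would presumably wrap something like `BrowserWindow.setIgnoreMouseEvents`. The idea is that nested callers may each request click-through, and the window only becomes interactive again once the last one releases it.

```python
from typing import Callable


class ClickThroughGuard:
    """Enable click-through on first acquire, disable on last release."""

    def __init__(self, set_click_through: Callable[[bool], None]) -> None:
        self._set = set_click_through  # hypothetical window-toggle callback
        self._count = 0

    def acquire(self) -> None:
        self._count += 1
        if self._count == 1:  # only the first holder flips the window state
            self._set(True)

    def release(self) -> None:
        if self._count == 0:
            return  # unbalanced release is ignored rather than going negative
        self._count -= 1
        if self._count == 0:  # last holder restores normal input handling
            self._set(False)
```

Ignoring an unbalanced `release` is one way to keep a stray call from leaving the window stuck in the wrong state.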
No screenshots
- Rebuild native addon
- Update GPU drivers
- Run capture test script
Actions misaligned
- Check space + normalized flags
- Verify capture resolution vs model space
- Inspect desktop/src/agent/coord-mapper.ts
Agent stuck
- Confirm backend returns finish_task
- Check TestCase status transitions
- Inspect vision analyze logs
The backend is an AI-based automated application evaluation system. It uses a visual multimodal model and multi-agent collaboration to achieve end-to-end app exploration, evaluation metric calculation, full bug lifecycle management, and test case generation.
App Information Parsing
- Textual material grading
- Level 1: functional brief of <=200 words
- Level 2: introductory guide covering basic operations
- Level 3: full official documentation/manual
- Interface understanding
- Automatically detects and classifies UI elements (buttons, inputs, icons)
- Supports structural layout parsing
Automated Evaluation Process
- Test case generation
- Feature exploration execution
- Bug detection and management
- Version difference analysis
- Evaluation metric calculation
- Evaluation report generation
| Metric | Calculation Method | Core Parameters |
|---|---|---|
| Stability | 1 - (Crash Rate * 0.7 + Functional Abnormality Rate * 0.3) | Crash Count / Total Tasks |
| Usability | 1 - (Step Efficiency * 0.5 + Time Efficiency * 0.5) | Steps / Avg Steps |
| Learnability | (1 - Basic Exploration Efficiency) * Text Level Coeff + Feature Coverage * 0.2 | Exploration time / Avg |
| Completeness | Feature Coverage * 0.4 + Integrity * 0.6 | Implemented Features |
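As a worked example of the table, the Stability and Completeness formulas can be written directly as functions. This is a minimal sketch, not the backend's implementation. The function and parameter names are hypothetical, and computing the functional abnormality rate as abnormal tasks over total tasks is an assumption; the table only defines the crash rate that way.

```python
def stability(crash_count: int, abnormal_count: int, total_tasks: int) -> float:
    """1 - (Crash Rate * 0.7 + Functional Abnormality Rate * 0.3)."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    crash_rate = crash_count / total_tasks
    # Assumption: abnormality rate uses the same denominator as crash rate.
    abnormal_rate = abnormal_count / total_tasks
    return 1 - (crash_rate * 0.7 + abnormal_rate * 0.3)


def completeness(feature_coverage: float, integrity: float) -> float:
    """Feature Coverage * 0.4 + Integrity * 0.6, with both inputs in [0, 1]."""
    return feature_coverage * 0.4 + integrity * 0.6
```

For instance, 1 crash and 2 functional abnormalities across 10 tasks give a stability of 1 - (0.1 * 0.7 + 0.2 * 0.3) = 0.87.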
- app_evaluation_agent/services/agents/coordinator.py: CoordinatorAgent (bootstraps plans and test cases)
- app_evaluation_agent/services/agents/planner.py: PlannerAgent (LLM plan + test case generation)
- app_evaluation_agent/services/agents/analyzer.py: AnalyzerAgent (vision analysis + coordinate mapping)
- app_evaluation_agent/services/agents/summarizer.py: SummarizerAgent (final evaluation summary)
- app_evaluation_agent/services/agents/bug_triage.py: BugTriageAgent (extracts bugs from results)
Severity level definitions
| Level | Definition | Response Time | Example |
|---|---|---|---|
| P0 | Critical blocker | 24 hours | Crash on launch |
| P1 | Severe abnormality | 3 days | Payment cannot submit |
| P2 | General abnormality | One iteration | Button unresponsive |
| P3 | Minor issue | Next major release | UI contrast issue |
Status transition rules
New -> In Progress -> Pending Verification -> Closed -> (optional Reopen)
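This README also notes that the backend validates status values but does not enforce transitions, so a client that wants to follow the rule above has to check it itself. A minimal sketch of such a check follows; the table and function names are hypothetical, and treating "Reopen" as a move from Closed back to In Progress is an assumption, since the rule does not say where a reopened bug lands.

```python
from typing import Dict, Set

# Allowed next states for each bug status, following the rule above.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "New": {"In Progress"},
    "In Progress": {"Pending Verification"},
    "Pending Verification": {"Closed"},
    "Closed": {"In Progress"},  # assumption: reopening re-enters In Progress
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` follows the rule."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

A desktop client could call `can_transition` before sending PATCH /api/v1/bugs/{bug_id} and warn on skipped states.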
General Task Description
- Task ID
- Description
- Expected Result
- Priority
Version-Specific Steps
- Numbered operational steps for each version
- Functional specification
- Evaluation report
- Bug list
- Bug tracking sheet
- Test case set
- Operation process dataset
Secure Cloud Execution (Headless VM Testing)
- Intended for CI/CD, regression testing, and scalable automation
- Backend enqueues background tasks via Redis ARQ
- Cloud executor (outside this repo) consumes jobs and launches ephemeral VMs
Interactive Local Execution (Desktop Runner)
- Used for developer debugging and exploratory testing
- Desktop runner polls for test cases, captures screenshots, and executes actions locally
- Coordinate correction is handled server-side
Cloud path
- Client or CI submits an evaluation
- Backend scans and stores artifacts
- Evaluation is enqueued via Redis
- Cloud executor pulls the job
- VM runs the app headlessly
- Results are sent back and persisted
Local path
- Desktop runner uploads or selects an evaluation
- Backend scans and stores artifacts
- Runner polls /testcases/next
- Runner captures screenshots and sends context
- Backend vision agent returns actions
- Runner executes actions locally
- Results are stored and summarized
Entry and API Layer
- FastAPI app: backend/app_evaluation_agent/main.py
- Routes under api/v1/:
  - apps.py - app + version management
  - evaluations.py - evaluation CRUD and lifecycle
  - testplans.py - test plan access
  - testcases.py - test case assignment and updates
  - vision.py - screenshot + LLM vision reasoning
  - logs.py - log streaming/export
Agent Layer (services/agents/)
- PlannerAgent: generates high-level plans and test cases
- CoordinatorAgent: bootstraps evaluations and assigns test cases to executors
- AnalyzerAgent: handles /vision/analyze, builds prompts, calls vLLM, remaps coords
- SummarizerAgent: produces final evaluation reports after test case completion
- BugTriageAgent: extracts and dedupes bugs by fingerprint
- LLM client: llm_client.py
Business Services
- services/apps.py - app + version management, evaluation creation
- services/evaluations.py - evaluation lifecycle and planner bootstrap
- services/testcases.py - test case assignment, completion, bug triage
- Bug extraction happens when a runner patches a test case with results via PATCH /api/v1/testcases/{testcase_id}.
- BugTriageAgent parses result payloads and emits 0..N bug drafts.
- Bugs are deduped per app by fingerprint, with last_seen_at updated on repeats.
- Each observation is stored as a BUG_OCCURRENCE linked to evaluation, test case, app version, step index, action/expected/actual, plus optional artifact URIs.
- Fixes are recorded in BUG_FIX with fixed_in_version_id and optional verified_by_evaluation_id.
- Severity/status enums are validated; state transitions are not enforced by the backend.
- AnalyzerAgent consumes screenshot + AgentContext and calls the vision LLM
- Coordinates are remapped using services/vllm_coordinate_mapper.py
- Supports letterboxing, normalization, and capture origin offsets
- SQLAlchemy models: backend/app_evaluation_agent/storage/models.py
- Async engine/session: backend/app_evaluation_agent/storage/database.py
- Schemas: backend/app_evaluation_agent/schemas/
- Migrations: Alembic (backend/alembic/)
- Redis ARQ worker: backend/app_evaluation_agent/worker.py
- Used for summarization and cloud job enqueueing
- Local execution bypasses Redis where possible
- Config: backend/app_evaluation_agent/utils/config.py (TOML-based)
- Virus scanning: backend/app_evaluation_agent/integrations/virus_scanner.py
- Artifact storage: backend/app_evaluation_agent/integrations/s3_client.py
- Real-time events: backend/app_evaluation_agent/realtime.py
- Logging: backend/app_evaluation_agent/logging_utils.py
backend/app_evaluation_agent/
├── main.py
├── worker.py
├── logging_utils.py
├── logs/
├── api/
│ └── v1/
│ ├── apps.py
│ ├── evaluations.py
│ ├── testplans.py
│ ├── testcases.py
│ ├── vision.py
│ └── logs.py
├── services/
│ ├── apps.py
│ ├── agents/
│ │ ├── planner.py
│ │ ├── coordinator.py
│ │ ├── analyzer.py
│ │ ├── summarizer.py
│ │ ├── bug_triage.py
│ │ ├── llm_client.py
│ │ └── prompt_loader.py
│ ├── prompts/
│ │ ├── planner/
│ │ ├── bug_triage/
│ │ └── summarizer/
│ ├── evaluations.py
│ ├── testcases.py
│ └── vllm_coordinate_mapper.py
├── storage/
│ ├── models.py
│ └── database.py
├── schemas/
│ ├── evaluation.py
│ ├── testplan.py
│ ├── testcase.py
│ └── agent.py
├── integrations/
│ ├── virus_scanner.py
│ └── s3_client.py
└── utils/
└── config.py
- Classical UI element detection is currently disabled
- Vision endpoints accept a PNG screenshot and execution context
- Returned model coordinates are preserved as raw values and remapped to screen pixels
- This design supports future detector insertion and coordinate drift debugging
erDiagram
APP {
int id PK
string name
enum app_type
datetime created_at
datetime updated_at
}
APP_VERSION {
int id PK
int app_id FK
int previous_version_id FK
string version
string artifact_uri
string app_url
datetime release_date
text change_log
datetime created_at
datetime updated_at
}
APP_VERSION_LINEAGE {
int app_version_id PK, FK
int previous_version_id PK, FK
}
EVALUATION {
int id PK
int app_version_id FK
enum status
string execution_mode
string assigned_executor_id
json results
string local_application_path
string high_level_goal
bool run_on_current_screen
datetime created_at
datetime updated_at
}
TEST_PLAN {
int id PK
int evaluation_id FK
enum status
json summary
datetime created_at
datetime updated_at
}
TEST_CASE {
int id PK
int plan_id FK
int evaluation_id FK
string name
text description
json input_data
enum status
json result
int execution_order
string assigned_executor_id
datetime created_at
datetime updated_at
}
BUG {
int id PK
int app_id FK
string title
text description
enum severity_level
int priority
enum status
int discovered_version_id FK
string fingerprint
json environment
json reproduction_steps
datetime first_seen_at
datetime last_seen_at
datetime created_at
datetime updated_at
}
BUG_OCCURRENCE {
int id PK
int bug_id FK
int evaluation_id FK
int test_case_id FK
int app_version_id FK
int step_index
json action
text expected
text actual
json result_snapshot
string screenshot_uri
string log_uri
json raw_model_coords
datetime observed_at
string executor_id
datetime created_at
datetime updated_at
}
BUG_FIX {
int id PK
int bug_id FK
int fixed_in_version_id FK
int verified_by_evaluation_id FK
text note
datetime created_at
}
APP ||--o{ APP_VERSION : has
APP_VERSION ||--o{ EVALUATION : runs
EVALUATION ||--o{ TEST_PLAN : owns
EVALUATION ||--o{ TEST_CASE : owns
TEST_PLAN ||--o{ TEST_CASE : contains
APP_VERSION ||--o{ APP_VERSION_LINEAGE : has
APP_VERSION_LINEAGE }o--|| APP_VERSION : previous
APP ||--o{ BUG : owns
APP_VERSION ||--o{ BUG : discovered_in
BUG ||--o{ BUG_OCCURRENCE : observed
EVALUATION ||--o{ BUG_OCCURRENCE : observed_in
TEST_CASE ||--o{ BUG_OCCURRENCE : linked_to
APP_VERSION ||--o{ BUG_OCCURRENCE : observed_on
BUG ||--o{ BUG_FIX : fixed_in
APP_VERSION ||--o{ BUG_FIX : fixed_on
EVALUATION ||--o{ BUG_FIX : verified_by
The canonical API documentation lives at docs/endpoints.md. Use the interactive docs when running the backend:
- /docs (Swagger UI)
- /redoc
Key operational endpoints:
- POST /api/v1/evaluations (JSON)
- POST /api/v1/evaluations/upload (desktop app upload)
- POST /api/v1/evaluations/url (web app URL)
- POST /api/v1/evaluations/live (use runner current screen)
- GET /api/v1/testplans/{plan_id}
- GET /api/v1/testcases/next
- PATCH /api/v1/testcases/{testcase_id}
- POST /api/v1/bugs
- GET /api/v1/bugs/{bug_id}
- PATCH /api/v1/bugs/{bug_id}
- DELETE /api/v1/bugs/{bug_id}
- GET /api/v1/bugs/{bug_id}/occurrences
- POST /api/v1/bugs/{bug_id}/occurrences
- GET /api/v1/bugs/{bug_id}/fixes
- POST /api/v1/bugs/{bug_id}/fixes
- DELETE /api/v1/bugs/{bug_id}/fixes/{fix_id}
- POST /api/v1/vision/analyze
- GET /api/v1/logs/export
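As a small illustration of calling one of these endpoints from a script, the sketch below builds the polling request for the next test case using only the Python standard library. The helper name is hypothetical. The `executor_id` query parameter is taken from the execution-model diagram in this README; docs/endpoints.md remains the authority on request and response schemas.

```python
from urllib.parse import urlencode
from urllib.request import Request


def next_testcase_request(base_url: str, executor_id: str) -> Request:
    """Build (but do not send) GET /api/v1/testcases/next for an executor."""
    query = urlencode({"executor_id": executor_id})  # escapes unsafe characters
    url = f"{base_url.rstrip('/')}/api/v1/testcases/next?{query}"
    return Request(url, method="GET")
```

Against a running backend, the returned object can be passed to `urllib.request.urlopen` to perform the poll.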
Prerequisites
- Python 3.10+
- Poetry
- Docker + Docker Compose
1) Clone
git clone https://github.com/Region-AI/EvalAgent
cd EvalAgent
2) Install dependencies
cd backend
poetry install
3) Configure
cp config/settings.example.toml config/settings.toml
Edit backend/config/settings.toml to configure:
- PostgreSQL URL
- Redis host
- LLM base URL and API key
- Model paths (if applicable)
4) Start backend services
cp docker-compose.example.yaml docker-compose.yaml
docker-compose up -d
5) Run migrations
cp env.example.py alembic/env.py
poetry run alembic upgrade head
6) Run the backend
Terminal 1 (background worker):
arq app_evaluation_agent.worker.WorkerSettings
Terminal 2 (API server):
uvicorn app_evaluation_agent.main:app --reload
API docs are available at http://127.0.0.1:8000/docs.
Backend
Most vision tests depend on a configured LLM endpoint and a screenshot.png fixture:
cd backend
poetry run pytest tests/test_vllm_coordinate_mapper.py -q
Desktop
cd desktop
npx ts-node test/test-window-capture.ts
- docs/overview.md
- docs/architecture.md
- docs/backend.md
- docs/desktop.md
- docs/endpoints.md
- docs/troubleshooting.md
- ARCHITECTURE_backend.md (legacy full backend architecture)
- ARCHITECTURE_frontend.md (legacy full desktop architecture)
- README_backend.md (legacy backend README)
- README_frontend.md (legacy desktop README)
Apache-2.0. See LICENSE.