Eval Agent

Eval Agent is an end-to-end, agentic application evaluation system that plans, executes, observes, and evaluates real applications running in real environments. It is built around a strict control plane / execution plane split: the backend owns all intelligence, and the desktop runner owns all physical interaction.

This repository was formed by merging two separate projects:

app_eval_desktop -> desktop
app_evaluation_agent -> backend

The unified project name is Eval_Agent.

Overview

Eval Agent evaluates applications by executing real steps on real targets while a backend LLM-driven control plane decides the next action. The system is intentionally split:

Backend (Control Plane): planning, vision analysis, action selection, bug reasoning, metrics, storage.
Desktop Runner (Execution Plane): capture, deterministic action execution, visualization.

Project Lineage

Eval Agent is the merged successor of two projects:

app_eval_desktop -> desktop (Electron + TypeScript executor)
app_evaluation_agent -> backend (FastAPI control plane)

All prior architecture, endpoint, and README material has been consolidated here to provide a single, canonical entry point.

Architecture At A Glance

Control plane orchestrates evaluation lifecycles, test plans, test cases, LLM reasoning, bug management, and summaries.
Execution plane performs capture and input on real machines, using deterministic action execution.
Two execution modes are supported: secure cloud execution (headless VM) and interactive local execution (desktop runner).

System Architecture Diagram

graph TD
    subgraph "Client / CI/CD"
        U1[Desktop Runner]
        CI[CI/CD Pipeline]
    end

    subgraph "Control Plane (FastAPI Backend)"
        API[FastAPI Server]
        DB[(PostgreSQL)]
        REDIS[(Redis Queue)]
        S3[S3 Artifact Storage]
        SCAN[Virus Scanner]
    end

    subgraph "Execution - Cloud"
        EXE[Executor Host]
        VM[Ephemeral Test VM]
    end

    subgraph "Execution - Local"
        DR[Desktop Runner Client]
        AUT[App Under Test]
    end

    %% Cloud Path
    CI -->|POST /evaluations cloud| API
    API --> SCAN
    SCAN -->|Clean| S3
    API -->|enqueue| REDIS
    EXE -->|poll queue| REDIS
    EXE --> VM
    VM -->|test + send results| API
    API --> DB

    %% Local Path
    U1 -->|POST /evaluations/upload local| API
    API --> SCAN
    SCAN -->|Clean| S3
    U1 --> DR
    DR -->|GET /testcases/next| API
    API --> DR
    DR --> AUT
    DR -->|Screenshots + Context| API
    API -->|LLM reasoning + coord map| DR
    DR -->|Exec actions| AUT
    DR -->|Results| API
    API --> DB

Execution Model

Eval Agent uses a TestCase runner model:

Desktop runner polls for the next assigned TestCase.
If assigned, the runner launches the target application (desktop or web).
The runner enters a deterministic step loop, sending a screenshot + context and receiving exactly one action per step.
The backend terminates the TestCase with a finish_task action.

sequenceDiagram
    participant DR as Desktop Runner
    participant API as Backend
    participant AUT as App Under Test
    DR->>API: GET /api/v1/testcases/next?executor_id=...
    alt TestCase assigned
        DR->>AUT: Launch app
        loop Step loop (max 40)
            DR->>API: POST /api/v1/vision/analyze (screenshot + context)
            API-->>DR: {thought, action, description}
            DR->>AUT: Execute action
        end
        DR->>API: PATCH /api/v1/testcases/{id} (result)
        DR->>AUT: Tear down app
    else None assigned
        DR->>DR: Idle or stop
    end
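
The runner model above can be sketched as a small loop. This is an illustration in Python, not the actual orchestrator (which lives in desktop/src/core/orchestrator.ts). The function name run_test_case, the injected callables, and the "type" key on the action object are assumptions made for the example; only the finish_task action and the 40-step cap come from this README.

```python
from typing import Any, Callable, Dict

MAX_STEPS = 40  # step cap stated in this README


def run_test_case(
    capture: Callable[[], bytes],
    analyze: Callable[[bytes, Dict[str, Any]], Dict[str, Any]],
    execute: Callable[[Dict[str, Any]], None],
    max_steps: int = MAX_STEPS,
) -> Dict[str, Any]:
    """Run one TestCase: capture, request exactly one action, execute it."""
    context: Dict[str, Any] = {"action_history": []}
    for step in range(max_steps):
        screenshot = capture()
        # Stands in for POST /api/v1/vision/analyze (screenshot + context).
        reply = analyze(screenshot, context)
        action = reply["action"]
        context["action_history"].append(action)
        # Assumption: the action is a dict whose "type" names the action.
        if action.get("type") == "finish_task":
            return {"status": "finished", "steps": step + 1}
        execute(action)
    return {"status": "step_limit_reached", "steps": max_steps}


# Usage with stub callables instead of a real backend and a real app.
replies = iter([
    {"thought": "open menu", "action": {"type": "click", "x": 10, "y": 20},
     "description": "click the menu"},
    {"thought": "done", "action": {"type": "finish_task"},
     "description": "finish"},
])
executed = []
result = run_test_case(
    capture=lambda: b"png-bytes",
    analyze=lambda shot, ctx: next(replies),
    execute=executed.append,
)
print(result, len(executed))
```

Passing capture, analyze, and execute in as callables keeps the loop deterministic and easy to test without a display or a network connection.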

Desktop Runner (Execution Plane)

App Eval Desktop is the Electron + TypeScript executor. It is intentionally thin: no local perception, UI parsing, or model inference. All reasoning and decision-making lives in the backend.

Core Responsibilities

Capture

High-performance BGRA capture via Windows Desktop Duplication API
Multi-monitor aware
Optional exclusion from capture (WDA_EXCLUDEFROMCAPTURE)
Conversion to PNG before upload

Execution

Mouse actions: click, double-click, right-click, hover, drag, scroll
Keyboard actions: shortcuts, simulated typing
Clipboard-based direct text entry (paste)
Deterministic waits and task completion

Orchestration

Polls backend for next assigned TestCase execution
Runs TestCases sequentially
Handles pause, resume, stop
Manages app lifecycle (launch + teardown)

Visualization

Live screenshot preview
Structured logs
Step-by-step run timeline
Agent context inspection
Evaluation and TestCase history

Desktop System Architecture

desktop/
├── scripts/
│   ├── utils/copy-recursive.js
│   ├── copy-renderer.js
│   └── copy-native.js
├── src/
│   ├── main.ts              # Electron entry, windows, IPC
│   ├── preload.ts           # Secure IPC bridge
│   ├── config.ts            # Runtime configuration
│   │
│   ├── core/
│   │   ├── orchestrator.ts  # TestCase runner loop
│   │   ├── context.ts       # AgentExecutionContext
│   │   └── logger.ts        # Structured logging
│   │
│   ├── agent/
│   │   ├── executor.ts      # nut-js action executor
│   │   ├── coord-mapper.ts  # analysis/capture -> screen mapping
│   │   └── capture/native/  # C++ Desktop Duplication addon
│   │
│   ├── api/
│   │   └── client.ts        # REST + vision calls
│   │
│   ├── renderer/
│   │   ├── locales/
│   │   ├── pages/
│   │   ├── shared/
│   │   └── styles/
│   │
│   └── types/
│       └── evaluations.d.ts
└── test/
    └── test-window-capture.ts

Data and Control Flow

Per-step loop

Capture native screenshot -> PNG (with brightness sanity checks).
Assemble AgentExecutionContext and last focus coordinates.
POST screenshot + context to /api/v1/vision/analyze.
Receive { thought, action, description }.
Map coordinates via coord-mapper.ts and execute.
Update scratchpad, action history, and UI timeline.

User Interface

Agent View (Run)

Live screenshot preview
Step timeline (thought + action + screenshot)
Structured logs (SYSTEM / JOB / AGENT / TOOL / CAPTURE / WARN / ERROR)
Pause / resume
Compact mode toggle

Apps

Browse apps, versions, and evaluations
Focus a lineage branch and reset focus
Create apps + versions (upload or URL)
Delete apps or versions
Jump to evaluation history

Evaluations

Assigned evaluations list
Metadata: goal, app type, timestamps
Link to history
Delete evaluation
Regenerate or edit summary (for completed evaluations)

History

Infinite scroll TestCase history
Markdown rendering of results
Copy / download summary
Re-run TestCase

Bugs

Bug list per app with filtering and search
Create, edit, delete bugs (status, severity, priority, fingerprint)
Track occurrences tied to evaluations/TestCases
Record branch-scoped fixes and verification notes

Compact Mode

Always-on-top minimal window
Logs + status
Execution controls

IPC Contracts

Renderer communicates only via preload IPC. Key channels include:

getAssignedEvaluations
fetchEvaluation
deleteEvaluation
run:start / pause / resume / stop
injectHumanPrompt
agent-context-updated
evaluation-attached
run-timeline-entry
history:refresh
getLogBuffer / onLogUpdate
listBugs / getBug / createBug / updateBug / deleteBug
listBugOccurrences / createBugOccurrence
listBugFixes / createBugFix / deleteBugFix

Backend API Contract (Desktop)

Key endpoints used by the desktop runner:

GET /api/v1/apps
POST /api/v1/apps
DELETE /api/v1/apps/{app_id}
GET /api/v1/apps/{app_id}/versions
POST /api/v1/apps/{app_id}/versions
DELETE /api/v1/apps/{app_id}/versions/{version_id}
GET /api/v1/apps/{app_id}/versions/{version_id}/evaluations
GET /api/v1/apps/{app_id}/bugs
POST /api/v1/bugs
GET /api/v1/bugs/{bug_id}
PATCH /api/v1/bugs/{bug_id}
DELETE /api/v1/bugs/{bug_id}
GET /api/v1/bugs/{bug_id}/occurrences
POST /api/v1/bugs/{bug_id}/occurrences
GET /api/v1/bugs/{bug_id}/fixes
POST /api/v1/bugs/{bug_id}/fixes
DELETE /api/v1/bugs/{bug_id}/fixes/{fix_id}
GET /api/v1/testcases/next
PATCH /api/v1/testcases/{id}
POST /api/v1/vision/analyze
GET /api/v1/evaluations/{id}
PATCH /api/v1/evaluations/{id}/summary
POST /api/v1/evaluations/{id}/regenerate-summary

See docs/endpoints.md for the authoritative schema.

Desktop Configuration

.env:

API_BASE_URL=http://127.0.0.1:8000
EXECUTOR_ID=<unique-machine-id>

Additional behavior:

Capture defaults controlled in desktop/src/config.ts
Theme, language, executor ID configurable via Settings UI
Executor ID persists across restarts
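
As a rough illustration of how the two .env values combine into the polling request, the sketch below parses the file contents and builds the URL for GET /api/v1/testcases/next. It is written in Python for readability; the desktop runner itself is TypeScript. The helper names parse_env and next_testcase_url and the executor ID runner-01 are hypothetical.

```python
from typing import Dict
from urllib.parse import urlencode, urljoin


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def next_testcase_url(api_base_url: str, executor_id: str) -> str:
    """Build the polling URL for the next assigned TestCase."""
    base = api_base_url.rstrip("/") + "/"
    query = urlencode({"executor_id": executor_id})
    return urljoin(base, "api/v1/testcases/next") + "?" + query


# "runner-01" is a made-up executor ID used only for this example.
ENV_TEXT = "API_BASE_URL=http://127.0.0.1:8000\nEXECUTOR_ID=runner-01\n"
env = parse_env(ENV_TEXT)
url = next_testcase_url(env["API_BASE_URL"], env["EXECUTOR_ID"])
print(url)
```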

Desktop Build and Run

Install dependencies:

npm install

Build:

npm run build

Dev (renderer + Electron with Vite HMR):

npm run dev

Dev (separate terminals):

npm run dev:renderer
npm run dev:electron

Run:

npm start

Package:

npm run make

Test native capture:

npx ts-node test/test-window-capture.ts

Desktop Development Notes

Renderer is fully sandboxed (no Node access)
All side effects happen in main / orchestrator
Vision is non-streaming
Clipboard is restored after direct text entry
Click-through is reference-counted to avoid stuck windows
Max 40 steps per TestCase

Desktop Troubleshooting

No screenshots

Rebuild native addon
Update GPU drivers
Run capture test script

Actions misaligned

Check space + normalized flags
Verify capture resolution vs model space
Inspect desktop/src/agent/coord-mapper.ts

Agent stuck

Confirm backend returns finish_task
Check TestCase status transitions
Inspect vision analyze logs

Backend (Control Plane)

Product Overview

The backend is an AI-based automated application evaluation system. It combines a multimodal vision model with multi-agent collaboration to perform end-to-end app exploration, evaluation metric calculation, full bug lifecycle management, and test case generation.

Core Functional Requirements

App Information Parsing

Textual material grading
- Level 1: functional brief of <=200 words
- Level 2: introductory guide covering basic operations
- Level 3: full official documentation/manual
Interface understanding
- Automatically detects and classifies UI elements (buttons, inputs, icons)
- Supports structural layout parsing

Automated Evaluation Process

Test case generation
Feature exploration execution
Bug detection and management
Version difference analysis
Evaluation metric calculation
Evaluation report generation

Evaluation System Metrics

| Metric | Calculation Method | Core Parameters |
| --- | --- | --- |
| Stability | `1 - (Crash Rate * 0.7 + Functional Abnormality Rate * 0.3)` | Crash Count / Total Tasks |
| Usability | `1 - (Step Efficiency * 0.5 + Time Efficiency * 0.5)` | Steps / Avg Steps |
| Learnability | `(1 - Basic Exploration Efficiency) * Text Level Coeff + Feature Coverage * 0.2` | Exploration time / Avg |
| Completeness | `Feature Coverage * 0.4 + Integrity * 0.6` | Implemented Features |
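
The four formulas translate directly into code. The sketch below is a plain transcription, assuming every input is already a ratio (for example, crash rate = crash count / total tasks). The function names and the sample numbers are illustrative, and the value used for the text level coefficient is an assumption because this README does not define it.

```python
def stability(crash_rate: float, functional_abnormality_rate: float) -> float:
    return 1 - (crash_rate * 0.7 + functional_abnormality_rate * 0.3)


def usability(step_efficiency: float, time_efficiency: float) -> float:
    return 1 - (step_efficiency * 0.5 + time_efficiency * 0.5)


def learnability(
    basic_exploration_efficiency: float,
    text_level_coeff: float,
    feature_coverage: float,
) -> float:
    return (1 - basic_exploration_efficiency) * text_level_coeff + feature_coverage * 0.2


def completeness(feature_coverage: float, integrity: float) -> float:
    return feature_coverage * 0.4 + integrity * 0.6


# Illustrative inputs: 2 crashes in 20 tasks, 20% functional abnormalities.
crash_rate = 2 / 20
print(round(stability(crash_rate, 0.2), 2))
print(round(usability(0.4, 0.2), 2))
# text_level_coeff=0.8 is a made-up value for the example.
print(round(learnability(0.5, 0.8, 0.5), 2))
print(round(completeness(0.5, 0.75), 2))
```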

Core Agent Design

app_evaluation_agent/services/agents/coordinator.py: CoordinatorAgent (bootstraps plans and test cases)
app_evaluation_agent/services/agents/planner.py: PlannerAgent (LLM plan + test case generation)
app_evaluation_agent/services/agents/analyzer.py: AnalyzerAgent (vision analysis + coordinate mapping)
app_evaluation_agent/services/agents/summarizer.py: SummarizerAgent (final evaluation summary)
app_evaluation_agent/services/agents/bug_triage.py: BugTriageAgent (extracts bugs from results)

Bug Management Specification

Severity level definitions

| Level | Definition | Response Time | Example |
| --- | --- | --- | --- |
| P0 | Critical blocker | 24 hours | Crash on launch |
| P1 | Severe abnormality | 3 days | Payment cannot submit |
| P2 | General abnormality | One iteration | Button unresponsive |
| P3 | Minor issue | Next major release | UI contrast issue |

Status transition rules

New -> In Progress -> Pending Verification -> Closed -> (optional Reopen)
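
The transition chain can be expressed as a lookup table. This is a minimal sketch of a check that a client or reviewer could run, not backend behavior. The status strings are copied from the chain above, "Reopen" is treated as a status name, and because the README does not say what follows a reopen, the sketch gives it no outgoing transitions.

```python
from typing import Dict, Set

# Allowed next statuses, transcribed from the chain above.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "New": {"In Progress"},
    "In Progress": {"Pending Verification"},
    "Pending Verification": {"Closed"},
    "Closed": {"Reopen"},  # optional step
    "Reopen": set(),       # assumption: what follows is not specified
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from current to target follows the chain."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("New", "In Progress"))
print(can_transition("New", "Closed"))
```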

Test Case Management

General Task Description

Task ID
Description
Expected Result
Priority

Version-Specific Steps

Numbered operational steps for each version

Product Deliverables

Functional specification
Evaluation report
Bug list
Bug tracking sheet
Test case set
Operation process dataset

Execution Modes

Secure Cloud Execution (Headless VM Testing)

Intended for CI/CD, regression testing, and scalable automation
Backend enqueues background tasks via Redis ARQ
Cloud executor (outside this repo) consumes jobs and launches ephemeral VMs

Interactive Local Execution (Desktop Runner)

Used for developer debugging and exploratory testing
Desktop runner polls for test cases, captures screenshots, and executes actions locally
Coordinate correction is handled server-side

High-Level Workflow

Cloud path

Client or CI submits an evaluation
Backend scans and stores artifacts
Evaluation is enqueued via Redis
Cloud executor pulls the job
VM runs the app headlessly
Results are sent back and persisted

Local path

Desktop runner uploads or selects an evaluation
Backend scans and stores artifacts
Runner polls /testcases/next
Runner captures screenshots and sends context
Backend vision agent returns actions
Runner executes actions locally
Results are stored and summarized

Component Breakdown

Entry and API Layer

FastAPI app: backend/app_evaluation_agent/main.py
Routes under api/v1/:
- apps.py - app + version management
- evaluations.py - evaluation CRUD and lifecycle
- testplans.py - test plan access
- testcases.py - test case assignment and updates
- vision.py - screenshot + LLM vision reasoning
- logs.py - log streaming/export

Agent Layer (services/agents/)

PlannerAgent: generates high-level plans and test cases
CoordinatorAgent: bootstraps evaluations and assigns test cases to executors
AnalyzerAgent: handles /vision/analyze, builds prompts, calls vLLM, remaps coords
SummarizerAgent: produces final evaluation reports after test case completion
BugTriageAgent: extracts and dedupes bugs by fingerprint
LLM client: llm_client.py

Business Services

services/apps.py - app + version management, evaluation creation
services/evaluations.py - evaluation lifecycle and planner bootstrap
services/testcases.py - test case assignment, completion, bug triage

Bug Tracking and Triage

Bug extraction happens when a runner patches a test case with results via PATCH /api/v1/testcases/{testcase_id}.
BugTriageAgent parses result payloads and emits 0..N bug drafts.
Bugs are deduped per app by fingerprint, with last_seen_at updated on repeats.
Each observation is stored as a BUG_OCCURRENCE linked to evaluation, test case, app version, step index, action/expected/actual, plus optional artifact URIs.
Fixes are recorded in BUG_FIX with fixed_in_version_id and optional verified_by_evaluation_id.
Severity/status enums are validated; state transitions are not enforced by the backend.
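
The deduplication rule can be illustrated with an in-memory store keyed by (app_id, fingerprint). This is a sketch of the idea only: the Bug and BugStore classes and the record method are hypothetical and do not reflect the SQLAlchemy models or the BugTriageAgent code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class Bug:
    app_id: int
    fingerprint: str
    title: str
    first_seen_at: datetime
    last_seen_at: datetime
    occurrences: List[dict] = field(default_factory=list)


class BugStore:
    """In-memory stand-in for the BUG and BUG_OCCURRENCE tables."""

    def __init__(self) -> None:
        self._bugs: Dict[Tuple[int, str], Bug] = {}

    def record(self, app_id: int, fingerprint: str, title: str,
               occurrence: dict, seen_at: datetime) -> Bug:
        key = (app_id, fingerprint)
        bug = self._bugs.get(key)
        if bug is None:
            # First sighting: create the bug.
            bug = Bug(app_id, fingerprint, title, seen_at, seen_at)
            self._bugs[key] = bug
        else:
            # Repeat sighting: only move last_seen_at forward.
            bug.last_seen_at = max(bug.last_seen_at, seen_at)
        # Every observation is kept as its own occurrence.
        bug.occurrences.append(occurrence)
        return bug

    def __len__(self) -> int:
        return len(self._bugs)


store = BugStore()
t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
store.record(1, "crash-on-save", "Crash on save", {"step_index": 3}, t1)
bug = store.record(1, "crash-on-save", "Crash on save", {"step_index": 5}, t2)
print(len(store), len(bug.occurrences), bug.last_seen_at.isoformat())
```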

Vision and Coordinate Mapping

AnalyzerAgent consumes screenshot + AgentContext and calls the vision LLM
Coordinates are remapped using services/vllm_coordinate_mapper.py
Supports letterboxing, normalization, and capture origin offsets
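
To make the three terms concrete, here is a minimal sketch of mapping a model-space point back to screen pixels. It assumes the capture was scaled to fit the model input with its aspect ratio preserved and centered with padding (letterboxing); the actual behavior of services/vllm_coordinate_mapper.py may differ. The function name model_to_screen and its parameters are hypothetical.

```python
from typing import Tuple


def model_to_screen(
    x: float,
    y: float,
    model_size: Tuple[int, int],
    capture_size: Tuple[int, int],
    capture_origin: Tuple[int, int] = (0, 0),
    normalized: bool = False,
) -> Tuple[int, int]:
    """Map a point from model space to absolute screen pixels."""
    model_w, model_h = model_size
    cap_w, cap_h = capture_size
    if normalized:
        # Normalized coordinates are fractions of the model input size.
        x, y = x * model_w, y * model_h
    # Letterboxing: uniform scale plus centered padding on one axis.
    scale = min(model_w / cap_w, model_h / cap_h)
    pad_x = (model_w - cap_w * scale) / 2
    pad_y = (model_h - cap_h * scale) / 2
    cap_x = (x - pad_x) / scale
    cap_y = (y - pad_y) / scale
    # Clamp points that land in the padding back onto the capture.
    cap_x = min(max(cap_x, 0), cap_w - 1)
    cap_y = min(max(cap_y, 0), cap_h - 1)
    # Capture origin offset: where the captured monitor sits on the desktop.
    return capture_origin[0] + round(cap_x), capture_origin[1] + round(cap_y)


# Center of a 1920x1080 capture on a second monitor placed at x=1920.
point = model_to_screen(0.5, 0.5, (1000, 1000), (1920, 1080),
                        capture_origin=(1920, 0), normalized=True)
print(point)
```

Keeping the raw model coordinates alongside the mapped result makes it possible to debug drift when a click lands off target.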

Persistence Layer

SQLAlchemy models: backend/app_evaluation_agent/storage/models.py
Async engine/session: backend/app_evaluation_agent/storage/database.py
Schemas: backend/app_evaluation_agent/schemas/
Migrations: Alembic (backend/alembic/)

Background Tasks

Redis ARQ worker: backend/app_evaluation_agent/worker.py
Used for summarization and cloud job enqueueing
Local execution bypasses Redis where possible

Integrations and Utilities

Config: backend/app_evaluation_agent/utils/config.py (TOML-based)
Virus scanning: backend/app_evaluation_agent/integrations/virus_scanner.py
Artifact storage: backend/app_evaluation_agent/integrations/s3_client.py
Real-time events: backend/app_evaluation_agent/realtime.py
Logging: backend/app_evaluation_agent/logging_utils.py

Backend File Structure

backend/app_evaluation_agent/
├── main.py
├── worker.py
├── logging_utils.py
├── logs/
├── api/
│   └── v1/
│       ├── apps.py
│       ├── evaluations.py
│       ├── testplans.py
│       ├── testcases.py
│       ├── vision.py
│       └── logs.py
├── services/
│   ├── apps.py
│   ├── agents/
│   │   ├── planner.py
│   │   ├── coordinator.py
│   │   ├── analyzer.py
│   │   ├── summarizer.py
│   │   ├── bug_triage.py
│   │   ├── llm_client.py
│   │   └── prompt_loader.py
│   ├── prompts/
│   │   ├── planner/
│   │   ├── bug_triage/
│   │   └── summarizer/
│   ├── evaluations.py
│   ├── testcases.py
│   └── vllm_coordinate_mapper.py
├── storage/
│   ├── models.py
│   └── database.py
├── schemas/
│   ├── evaluation.py
│   ├── testplan.py
│   ├── testcase.py
│   └── agent.py
├── integrations/
│   ├── virus_scanner.py
│   └── s3_client.py
└── utils/
    └── config.py

Vision Pipeline Notes

Classical UI element detection is currently disabled
Vision endpoints accept a PNG screenshot and execution context
Returned model coordinates are preserved as raw values and remapped to screen pixels
This design supports future detector insertion and coordinate drift debugging

Database Schema

erDiagram
  APP {
    int id PK
    string name
    enum app_type
    datetime created_at
    datetime updated_at
  }

  APP_VERSION {
    int id PK
    int app_id FK
    int previous_version_id FK
    string version
    string artifact_uri
    string app_url
    datetime release_date
    text change_log
    datetime created_at
    datetime updated_at
  }

  APP_VERSION_LINEAGE {
    int app_version_id PK, FK
    int previous_version_id PK, FK
  }

  EVALUATION {
    int id PK
    int app_version_id FK
    enum status
    string execution_mode
    string assigned_executor_id
    json results
    string local_application_path
    string high_level_goal
    bool run_on_current_screen
    datetime created_at
    datetime updated_at
  }

  TEST_PLAN {
    int id PK
    int evaluation_id FK
    enum status
    json summary
    datetime created_at
    datetime updated_at
  }

  TEST_CASE {
    int id PK
    int plan_id FK
    int evaluation_id FK
    string name
    text description
    json input_data
    enum status
    json result
    int execution_order
    string assigned_executor_id
    datetime created_at
    datetime updated_at
  }

  BUG {
    int id PK
    int app_id FK
    string title
    text description
    enum severity_level
    int priority
    enum status
    int discovered_version_id FK
    string fingerprint
    json environment
    json reproduction_steps
    datetime first_seen_at
    datetime last_seen_at
    datetime created_at
    datetime updated_at
  }

  BUG_OCCURRENCE {
    int id PK
    int bug_id FK
    int evaluation_id FK
    int test_case_id FK
    int app_version_id FK
    int step_index
    json action
    text expected
    text actual
    json result_snapshot
    string screenshot_uri
    string log_uri
    json raw_model_coords
    datetime observed_at
    string executor_id
    datetime created_at
    datetime updated_at
  }

  BUG_FIX {
    int id PK
    int bug_id FK
    int fixed_in_version_id FK
    int verified_by_evaluation_id FK
    text note
    datetime created_at
  }

  APP ||--o{ APP_VERSION : has
  APP_VERSION ||--o{ EVALUATION : runs
  EVALUATION ||--o{ TEST_PLAN : owns
  EVALUATION ||--o{ TEST_CASE : owns
  TEST_PLAN ||--o{ TEST_CASE : contains
  APP_VERSION ||--o{ APP_VERSION_LINEAGE : has
  APP_VERSION_LINEAGE }o--|| APP_VERSION : previous
  APP ||--o{ BUG : owns
  APP_VERSION ||--o{ BUG : discovered_in
  BUG ||--o{ BUG_OCCURRENCE : observed
  EVALUATION ||--o{ BUG_OCCURRENCE : observed_in
  TEST_CASE ||--o{ BUG_OCCURRENCE : linked_to
  APP_VERSION ||--o{ BUG_OCCURRENCE : observed_on
  BUG ||--o{ BUG_FIX : fixed_in
  APP_VERSION ||--o{ BUG_FIX : fixed_on
  EVALUATION ||--o{ BUG_FIX : verified_by

API Reference

The canonical API documentation lives at docs/endpoints.md. Use the interactive docs when running the backend:

/docs (Swagger UI)
/redoc

Key operational endpoints:

POST /api/v1/evaluations (JSON)
POST /api/v1/evaluations/upload (desktop app upload)
POST /api/v1/evaluations/url (web app URL)
POST /api/v1/evaluations/live (use runner current screen)
GET /api/v1/testplans/{plan_id}
GET /api/v1/testcases/next
PATCH /api/v1/testcases/{testcase_id}
POST /api/v1/bugs
GET /api/v1/bugs/{bug_id}
PATCH /api/v1/bugs/{bug_id}
DELETE /api/v1/bugs/{bug_id}
GET /api/v1/bugs/{bug_id}/occurrences
POST /api/v1/bugs/{bug_id}/occurrences
GET /api/v1/bugs/{bug_id}/fixes
POST /api/v1/bugs/{bug_id}/fixes
DELETE /api/v1/bugs/{bug_id}/fixes/{fix_id}
POST /api/v1/vision/analyze
GET /api/v1/logs/export

Backend Setup and Installation

Prerequisites

Python 3.10+
Poetry
Docker + Docker Compose

1) Clone

git clone https://github.com/Region-AI/EvalAgent
cd EvalAgent

2) Install dependencies

cd backend
poetry install

3) Configure

cp config/settings.example.toml config/settings.toml

Edit backend/config/settings.toml to configure:

PostgreSQL URL
Redis host
LLM base URL and API key
Model paths (if applicable)

4) Start backend services

cp docker-compose.example.yaml docker-compose.yaml
docker-compose up -d

5) Run migrations

cp env.example.py alembic/env.py
poetry run alembic upgrade head

6) Run the backend

Terminal 1 (background worker):

arq app_evaluation_agent.worker.WorkerSettings

Terminal 2 (API server):

uvicorn app_evaluation_agent.main:app --reload

API docs are available at http://127.0.0.1:8000/docs.

Testing

Backend

Most vision tests depend on a configured LLM endpoint and a screenshot.png fixture:

cd backend
poetry run pytest tests/test_vllm_coordinate_mapper.py -q

Desktop

cd desktop
npx ts-node test/test-window-capture.ts

Docs Map

docs/overview.md
docs/architecture.md
docs/backend.md
docs/desktop.md
docs/endpoints.md
docs/troubleshooting.md
ARCHITECTURE_backend.md (legacy full backend architecture)
ARCHITECTURE_frontend.md (legacy full desktop architecture)
README_backend.md (legacy backend README)
README_frontend.md (legacy desktop README)

License

Apache-2.0. See LICENSE.
