Skip to content

Fzkuji/GUI-Agent-Harness

Repository files navigation

GUI Agent Harness

Autonomous GUI agent — give it a task, it operates the desktop.
Visual memory • One-shot UI learning • Any LLM provider • Local or VM


🇺🇸 English · 🇨🇳 中文


News

  • [2026-06-05] 🏆 UI-Vision 68.64% — GPT-5.5, 5,479 samples across basic/functional/spatial splits. Results →
  • [2026-06-05] 🏆 MMBench-GUI-L2 91.52% — GPT-5.5, 3,594 samples. Results →
  • [2026-06-05] 🏆 ScreenSpot Pro 87.9% — GPT-5.5, 1,581 samples across 23 apps. Results →
  • [2026-06-02] 🏆 ScreenSpot v2 96.78% — GPT-5.5, 1,272 samples. Results →
  • [2026-04-18] 📦 OpenProgram — Renamed from Agentic Programming. GitHub
  • [2026-04-14] 🏆 OSWorld Multi-Apps 79.8% — 72.6/91 evaluated. Results →
  • [2026-04-07] 🤖 Agent-native architecture — Unified GUI perception + agent actions under single decision loop.
  • [2026-03-30] 📐 ImageContext — Scale-independent coordinate system, fixes crop bugs.
  • [2026-03-29] 🎬 v0.3 — Unified Actionsgui_action.py single entry point, auto platform detection.
  • [2026-03-23] 🏆 OSWorld Chrome 93.5% — 43/46 one attempt, 45/46 two attempts. Results →
  • [2026-03-10] 🚀 Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching.

What is GUI Agent Harness?

A CLI tool that turns any LLM into a GUI automation agent. Give it a natural-language task, it operates the desktop autonomously — screenshots, clicks, types, verifies, and repeats until the task is done.

gui-agent --work-dir /private/tmp/gui-agent-desktop "Install the Orchis GNOME theme"
gui-agent --work-dir /private/tmp/gui-agent-vm --vm http://172.16.82.132:5000 "Open GitHub in Chrome and Python docs"

Built on OpenProgram — the runtime handles provider abstraction, context management, and structured LLM calls. The harness adds GUI perception (YOLO detection, OCR, template matching) and action execution (mouse, keyboard, clipboard).

Grounding Pipeline: Iterative Zoom

At the core of the harness is a dedicated GUI element grounding pipeline. Given a screenshot and a natural-language description of a target element, it outputs precise click coordinates through progressive refinement.

Screenshot + Target description
         │
         ▼
  Phase 1: Detection          GPA-GUI-Detector (YOLO) + OCR → all visible UI elements
         │
         ▼
  Phase 2: Candidate Match    Template-match against stored visual memory
         │
         ▼
  Phase 3: LLM Grounding      VLM sees full screen + component list → identifies target region
         │
         ▼
  Phase 4: Iterative Zoom     Crop → upscale → re-ground → verify, repeat up to 8 rounds
         │
         ▼
     Precise (x, y)

Key design decisions:

  • Multi-source perception — YOLO detection + OCR + visual memory templates provide rich spatial context to the VLM, so it reasons over labeled components rather than raw pixels alone.
  • Progressive refinement — Instead of one-shot coordinate prediction, the pipeline iteratively crops and zooms into candidate regions. Each round gives the VLM a higher-resolution view of a smaller area.
  • Verifier gate — After each zoom level, a separate verification step checks whether the predicted point actually lands on the target. False predictions are rejected before they become wrong clicks.
  • Cacheable prompt layout — Fixed rules are hoisted into a cacheable prefix; only the task, component list, and image change per call. This maximizes prompt cache hit rate across the 8-round pipeline.
  • Configurable scale strategypreserve mode keeps large images at native resolution (no information loss from downscaling small targets); fill mode matches legacy behavior for controlled comparisons.

Benchmark Results

Benchmark Samples Accuracy Paper Best Delta
MMBench-GUI-L2 (full) 3,594 91.52% 74.25% (UI-TARS-72B-DPO) +17.3
MMBench-GUI-L2 (basic) 1,787 94.89%
MMBench-GUI-L2 (advanced) 1,807 88.17%
ScreenSpot Pro (full) 1,581 87.9%
ScreenSpot v2 1,272 96.78%
UI-Vision (full) 5,479 68.64%
UI-Vision (basic) 1,772 73.1%
UI-Vision (functional) 1,772 67.0%
UI-Vision (spatial) 1,935 66.0%

Full per-platform breakdown: benchmarks/mmbench_gui_l2/ | benchmarks/screenspot_pro/

Agent Loop: Observe → Verify → Plan → Dispatch

For full task automation (beyond grounding), the harness runs a 4-phase loop:

  • Observe (Python) — Screenshot + YOLO detection + OCR + template match. Identifies visible UI state.
  • Verify (LLM) — Checks whether the previous action succeeded.
  • Plan (LLM) — Sees the screenshot, detected components, and verification result. Chooses one action.
  • Dispatch (Python) — Executes the action. For clicks, delegates to the iterative zoom grounding pipeline.

All phases are @agentic_function calls with structured feedback between steps.

Visual Memory

UI components are detected once, labeled by a VLM, and stored as templates. On subsequent encounters, template matching replaces expensive re-detection (~5x faster, ~60x fewer tokens). States are modeled as sets of visible components, matched by Jaccard similarity. Components auto-forget after 15 consecutive misses.

OSWorld Results

Multi-Apps: 79.8% (72.6/91) | Chrome: 93.5% (43/46)

Domain Tasks Passed Accuracy
Chrome 46 43 93.5%
Multi-Apps 91 63 79.8%

Full OSWorld results →

Quick Start

1. Install

The GUI agent is a normal OpenProgram program: programs live in openprogram/functions/agentics/, and anything cloned into that folder auto-registers on the next start. So you install the OpenProgram host, then clone this repo into that folder and run its installer — the same pattern any harness (including your own) uses to plug into OpenProgram.

Step 1 — Install the OpenProgram host

macOS / Linux

git clone https://github.com/Fzkuji/OpenProgram && cd OpenProgram
./scripts/install.sh

Windows (PowerShell)

git clone https://github.com/Fzkuji/OpenProgram; cd OpenProgram
.\scripts\install.ps1

Step 2 — Add the GUI agent

The quickest path is OpenProgram's program installer (the first-run wizard offers the same choice):

openprogram programs install gui     # clones this repo + installs its deps
                                     # (PyTorch: CPU wheel auto-selected on
                                     # GPU-less Linux, CUDA on NVIDIA boxes)

For explicit control over the torch variant and the asset setup (detector weight, OCR models, system tools), clone this repo into the agentics folder and run its own installer instead:

macOS / Linux

cd openprogram/functions/agentics
git clone https://github.com/Fzkuji/GUI-Agent-Harness
cd GUI-Agent-Harness
./scripts/install.sh            # auto-detects an NVIDIA GPU; --cpu / --cuda cuXXX to force

Windows (PowerShell)

cd openprogram\functions\agentics
git clone https://github.com/Fzkuji/GUI-Agent-Harness
cd GUI-Agent-Harness
.\scripts\install.ps1           # auto-detects an NVIDIA GPU; -Cpu / -Cuda cuXXX to force

It's one command, but the heavy lifting is platform-specific — here's exactly what it sets up on each OS, so nothing is left for you to chase down:

macOS Windows Linux
PyTorch universal MPS/CPU wheel NVIDIA-CUDA auto-detected, else CPU NVIDIA-CUDA auto-detected, else CPU
OCR engine Apple Vision + EasyOCR fallback EasyOCR en+ch_sim (~300 MB) EasyOCR en+ch_sim (~300 MB)
Detector GPA-GUI-Detector weight → ~/GPA-GUI-Detector/model.pt %USERPROFILE%\GPA-GUI-Detector\model.pt ~/GPA-GUI-Detector/model.pt
System tools Xcode CLT (Swift, for Apple Vision) none — Win32 + PowerShell clipboard built-in xclip (required) + wmctrl/xdotool/scrot, via apt/dnf/pacman
Manual step grant the terminal Screen Recording + Accessibility none none

macOS only: the agent cannot screenshot or click until you grant your terminal Screen Recording and Accessibility under System Settings → Privacy & Security. Apple Vision OCR also needs the Xcode command-line tools (xcode-select --install); the installer requests them, and EasyOCR is installed as a cross-platform fallback either way.

After Step 2, restart the worker (or hit Refresh on the web UI's Functions page): gui_agent is registered and shows up in the web UI and the gui-agent CLI. The first time you run openprogram it walks you through provider setup.

Offline pre-fetch, forcing a CUDA tag, or skipping pieces (--no-weights / --no-ocr / --no-system): docs/install.md.

How OpenProgram detects this harness (and how to build your own)

OpenProgram walks openprogram/functions/agentics/ at startup and loads any cloned repo that satisfies the harness contract:

GUI-Agent-Harness/                   ← cloned into functions/agentics/
├── pyproject.toml                   ← declares THIS repo's own deps only
└── gui_harness/                     ← importable package
    ├── __init__.py                  ← kept dependency-light (lazy heavy imports)
    └── agentics/
        └── __init__.py              ← exposes AGENTIC_FUNCTIONS = [gui_agent]

Importing gui_harness.agentics fires the @agentic_function decorators, which self-register the functions. Two rules keep this safe: the top-level __init__ must import cleanly on a machine without the harness's heavy deps (this repo lazy-loads cv2/torch for exactly that reason), and pyproject.toml must NOT declare openprogram as a dependency (the host already provides it). Full contract: docs/installing-harnesses.md.

2. Run

--work-dir is an absolute path the agent may write to; use a native path per OS.

# macOS / Linux — local desktop
gui-agent --work-dir /tmp/gui-agent-firefox --app firefox "Open Firefox, go to google.com"

# Windows (PowerShell) — local desktop
gui-agent --work-dir C:\temp\gui-agent-firefox --app firefox "Open Firefox, go to google.com"

# Any platform — drive a remote VM (e.g. OSWorld)
gui-agent --work-dir /tmp/gui-agent-vm --vm http://VM_IP:5000 "Install the Orchis GNOME theme"

Project Structure

GUI-Agent-Harness/
├── gui_harness/
│   ├── main.py                   # CLI entry + agent loop
│   ├── openprogram_compat.py     # OpenProgram boundary
│   ├── action/input.py           # Mouse, keyboard, clipboard
│   ├── perception/               # Screenshot, YOLO detection, OCR
│   ├── planning/
│   │   ├── component_memory.py   # Visual memory + template matching
│   │   └── screenspot_locator.py # Iterative zoom grounding pipeline
│   └── adapters/vm_adapter.py    # Remote VM I/O
├── benchmarks/
│   ├── screenspot_pro/           # ScreenSpot Pro (1,581 samples, 87.9%)
│   ├── screenspot_v2/            # ScreenSpot v2 (1,272 samples, 95.83%)
│   ├── mmbench_gui_l2/           # MMBench-GUI-L2 (3,594 samples, 91.52%)
│   ├── ui_vision/                # UI-Vision (5,479 samples, 68.64%)
│   └── osworld/                  # OSWorld
├── memory/                       # Per-app visual templates
├── SKILL.md                      # LLM skill definition
└── pyproject.toml

License

MIT — see LICENSE.

Citation

@misc{fu2026gui-agent-harness,
  author       = {Fu, Zichuan},
  title        = {GUI Agent Harness: Autonomous GUI Automation with Visual Memory},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Fzkuji/GUI-Agent-Harness},
}

Built with OpenProgram

About

Autonomous GUI agent — give it a task, it operates the desktop. Visual memory, one-shot UI learning. | 自主GUI代理——给它一个任务,它操作桌面。视觉记忆,一次学习即可操作。

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages