GitHub - Fzkuji/GUI-Agent-Harness: Autonomous GUI agent — give it a task, it operates the desktop. Visual memory, one-shot UI learning. | 自主GUI代理——给它一个任务，它操作桌面。视觉记忆，一次学习即可操作。

Autonomous GUI agent — give it a task, it operates the desktop.
_{Visual memory • One-shot UI learning • Any LLM provider • Local or VM}

🇺🇸 English · 🇨🇳 中文

News

[2026-06-05] 🏆 UI-Vision 68.64% — GPT-5.5, 5,479 samples across basic/functional/spatial splits. Results →
[2026-06-05] 🏆 MMBench-GUI-L2 91.52% — GPT-5.5, 3,594 samples. Results →
[2026-06-05] 🏆 ScreenSpot Pro 87.9% — GPT-5.5, 1,581 samples across 23 apps. Results →
[2026-06-02] 🏆 ScreenSpot v2 96.78% — GPT-5.5, 1,272 samples. Results →
[2026-04-18] 📦 OpenProgram — Renamed from Agentic Programming. GitHub
[2026-04-14] 🏆 OSWorld Multi-Apps 79.8% — 72.6/91 evaluated. Results →
[2026-04-07] 🤖 Agent-native architecture — Unified GUI perception + agent actions under single decision loop.
[2026-03-30] 📐 ImageContext — Scale-independent coordinate system, fixes crop bugs.
[2026-03-29] 🎬 v0.3 — Unified Actions — gui_action.py single entry point, auto platform detection.
[2026-03-23] 🏆 OSWorld Chrome 93.5% — 43/46 one attempt, 45/46 two attempts. Results →
[2026-03-10] 🚀 Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching.

What is GUI Agent Harness?

A CLI tool that turns any LLM into a GUI automation agent. Give it a natural-language task, it operates the desktop autonomously — screenshots, clicks, types, verifies, and repeats until the task is done.

gui-agent --work-dir /private/tmp/gui-agent-desktop "Install the Orchis GNOME theme"
gui-agent --work-dir /private/tmp/gui-agent-vm --vm http://172.16.82.132:5000 "Open GitHub in Chrome and Python docs"

Built on OpenProgram — the runtime handles provider abstraction, context management, and structured LLM calls. The harness adds GUI perception (YOLO detection, OCR, template matching) and action execution (mouse, keyboard, clipboard).

Grounding Pipeline: Iterative Zoom

At the core of the harness is a dedicated GUI element grounding pipeline. Given a screenshot and a natural-language description of a target element, it outputs precise click coordinates through progressive refinement.

Screenshot + Target description
         │
         ▼
  Phase 1: Detection          GPA-GUI-Detector (YOLO) + OCR → all visible UI elements
         │
         ▼
  Phase 2: Candidate Match    Template-match against stored visual memory
         │
         ▼
  Phase 3: LLM Grounding      VLM sees full screen + component list → identifies target region
         │
         ▼
  Phase 4: Iterative Zoom     Crop → upscale → re-ground → verify, repeat up to 8 rounds
         │
         ▼
     Precise (x, y)

Key design decisions:

Multi-source perception — YOLO detection + OCR + visual memory templates provide rich spatial context to the VLM, so it reasons over labeled components rather than raw pixels alone.
Progressive refinement — Instead of one-shot coordinate prediction, the pipeline iteratively crops and zooms into candidate regions. Each round gives the VLM a higher-resolution view of a smaller area.
Verifier gate — After each zoom level, a separate verification step checks whether the predicted point actually lands on the target. False predictions are rejected before they become wrong clicks.
Cacheable prompt layout — Fixed rules are hoisted into a cacheable prefix; only the task, component list, and image change per call. This maximizes prompt cache hit rate across the 8-round pipeline.
Configurable scale strategy — preserve mode keeps large images at native resolution (no information loss from downscaling small targets); fill mode matches legacy behavior for controlled comparisons.

Benchmark Results

Benchmark	Samples	Accuracy	Paper Best	Delta
MMBench-GUI-L2 (full)	3,594	91.52%	74.25% (UI-TARS-72B-DPO)	+17.3
MMBench-GUI-L2 (basic)	1,787	94.89%	—	—
MMBench-GUI-L2 (advanced)	1,807	88.17%	—	—
ScreenSpot Pro (full)	1,581	87.9%	—	—
ScreenSpot v2	1,272	96.78%	—	—
UI-Vision (full)	5,479	68.64%	—	—
UI-Vision (basic)	1,772	73.1%	—	—
UI-Vision (functional)	1,772	67.0%	—	—
UI-Vision (spatial)	1,935	66.0%	—	—

Full per-platform breakdown: benchmarks/mmbench_gui_l2/ | benchmarks/screenspot_pro/

Agent Loop: Observe → Verify → Plan → Dispatch

For full task automation (beyond grounding), the harness runs a 4-phase loop:

Observe (Python) — Screenshot + YOLO detection + OCR + template match. Identifies visible UI state.
Verify (LLM) — Checks whether the previous action succeeded.
Plan (LLM) — Sees the screenshot, detected components, and verification result. Chooses one action.
Dispatch (Python) — Executes the action. For clicks, delegates to the iterative zoom grounding pipeline.

All phases are @agentic_function calls with structured feedback between steps.

Visual Memory

UI components are detected once, labeled by a VLM, and stored as templates. On subsequent encounters, template matching replaces expensive re-detection (~5x faster, ~60x fewer tokens). States are modeled as sets of visible components, matched by Jaccard similarity. Components auto-forget after 15 consecutive misses.

OSWorld Results

Multi-Apps: 79.8% (72.6/91) | Chrome: 93.5% (43/46)

Domain	Tasks	Passed	Accuracy
Chrome	46	43	93.5%
Multi-Apps	91	63	79.8%

Full OSWorld results →

Quick Start

1. Install

The GUI agent is a normal OpenProgram program: programs live in openprogram/functions/agentics/, and anything cloned into that folder auto-registers on the next start. So you install the OpenProgram host, then clone this repo into that folder and run its installer — the same pattern any harness (including your own) uses to plug into OpenProgram.

Step 1 — Install the OpenProgram host

macOS / Linux

git clone https://github.com/Fzkuji/OpenProgram && cd OpenProgram
./scripts/install.sh

Windows (PowerShell)

git clone https://github.com/Fzkuji/OpenProgram; cd OpenProgram
.\scripts\install.ps1

Step 2 — Add the GUI agent

The quickest path is OpenProgram's program installer (the first-run wizard offers the same choice):

openprogram programs install gui     # clones this repo + installs its deps
                                     # (PyTorch: CPU wheel auto-selected on
                                     # GPU-less Linux, CUDA on NVIDIA boxes)

For explicit control over the torch variant and the asset setup (detector weight, OCR models, system tools), clone this repo into the agentics folder and run its own installer instead:

macOS / Linux

cd openprogram/functions/agentics
git clone https://github.com/Fzkuji/GUI-Agent-Harness
cd GUI-Agent-Harness
./scripts/install.sh            # auto-detects an NVIDIA GPU; --cpu / --cuda cuXXX to force

Windows (PowerShell)

cd openprogram\functions\agentics
git clone https://github.com/Fzkuji/GUI-Agent-Harness
cd GUI-Agent-Harness
.\scripts\install.ps1           # auto-detects an NVIDIA GPU; -Cpu / -Cuda cuXXX to force

It's one command, but the heavy lifting is platform-specific — here's exactly what it sets up on each OS, so nothing is left for you to chase down:

	macOS	Windows	Linux
PyTorch	universal MPS/CPU wheel	NVIDIA-CUDA auto-detected, else CPU	NVIDIA-CUDA auto-detected, else CPU
OCR engine	Apple Vision + EasyOCR fallback	EasyOCR `en`+`ch_sim` (~300 MB)	EasyOCR `en`+`ch_sim` (~300 MB)
Detector	GPA-GUI-Detector weight → `~/GPA-GUI-Detector/model.pt`	→ `%USERPROFILE%\GPA-GUI-Detector\model.pt`	→ `~/GPA-GUI-Detector/model.pt`
System tools	Xcode CLT (Swift, for Apple Vision)	none — Win32 + PowerShell clipboard built-in	`xclip` (required) + `wmctrl`/`xdotool`/`scrot`, via apt/dnf/pacman
Manual step	grant the terminal Screen Recording + Accessibility	none	none

macOS only: the agent cannot screenshot or click until you grant your terminal Screen Recording and Accessibility under System Settings → Privacy & Security. Apple Vision OCR also needs the Xcode command-line tools (xcode-select --install); the installer requests them, and EasyOCR is installed as a cross-platform fallback either way.

After Step 2, restart the worker (or hit Refresh on the web UI's Functions page): gui_agent is registered and shows up in the web UI and the gui-agent CLI. The first time you run openprogram it walks you through provider setup.

Offline pre-fetch, forcing a CUDA tag, or skipping pieces (--no-weights / --no-ocr / --no-system): docs/install.md.

How OpenProgram detects this harness (and how to build your own)

OpenProgram walks openprogram/functions/agentics/ at startup and loads any cloned repo that satisfies the harness contract:

GUI-Agent-Harness/                   ← cloned into functions/agentics/
├── pyproject.toml                   ← declares THIS repo's own deps only
└── gui_harness/                     ← importable package
    ├── __init__.py                  ← kept dependency-light (lazy heavy imports)
    └── agentics/
        └── __init__.py              ← exposes AGENTIC_FUNCTIONS = [gui_agent]

Importing gui_harness.agentics fires the @agentic_function decorators, which self-register the functions. Two rules keep this safe: the top-level __init__ must import cleanly on a machine without the harness's heavy deps (this repo lazy-loads cv2/torch for exactly that reason), and pyproject.toml must NOT declare openprogram as a dependency (the host already provides it). Full contract: docs/installing-harnesses.md.

2. Run

--work-dir is an absolute path the agent may write to; use a native path per OS.

# macOS / Linux — local desktop
gui-agent --work-dir /tmp/gui-agent-firefox --app firefox "Open Firefox, go to google.com"

# Windows (PowerShell) — local desktop
gui-agent --work-dir C:\temp\gui-agent-firefox --app firefox "Open Firefox, go to google.com"

# Any platform — drive a remote VM (e.g. OSWorld)
gui-agent --work-dir /tmp/gui-agent-vm --vm http://VM_IP:5000 "Install the Orchis GNOME theme"

Project Structure

GUI-Agent-Harness/
├── gui_harness/
│   ├── main.py                   # CLI entry + agent loop
│   ├── openprogram_compat.py     # OpenProgram boundary
│   ├── action/input.py           # Mouse, keyboard, clipboard
│   ├── perception/               # Screenshot, YOLO detection, OCR
│   ├── planning/
│   │   ├── component_memory.py   # Visual memory + template matching
│   │   └── screenspot_locator.py # Iterative zoom grounding pipeline
│   └── adapters/vm_adapter.py    # Remote VM I/O
├── benchmarks/
│   ├── screenspot_pro/           # ScreenSpot Pro (1,581 samples, 87.9%)
│   ├── screenspot_v2/            # ScreenSpot v2 (1,272 samples, 95.83%)
│   ├── mmbench_gui_l2/           # MMBench-GUI-L2 (3,594 samples, 91.52%)
│   ├── ui_vision/                # UI-Vision (5,479 samples, 68.64%)
│   └── osworld/                  # OSWorld
├── memory/                       # Per-app visual templates
├── SKILL.md                      # LLM skill definition
└── pyproject.toml

License

MIT — see LICENSE.

Citation

@misc{fu2026gui-agent-harness,
  author       = {Fu, Zichuan},
  title        = {GUI Agent Harness: Autonomous GUI Automation with Visual Memory},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Fzkuji/GUI-Agent-Harness},
}

_{Built with OpenProgram}

Name		Name	Last commit message	Last commit date
Latest commit History 593 Commits
actions		actions
assets		assets
benchmarks		benchmarks
desktop_env		desktop_env
docs		docs
gui_agent_harness.egg-info		gui_agent_harness.egg-info
gui_harness		gui_harness
memory/apps		memory/apps
platforms		platforms
scripts		scripts
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SKILL.md		SKILL.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

News

What is GUI Agent Harness?

Grounding Pipeline: Iterative Zoom

Benchmark Results

Agent Loop: Observe → Verify → Plan → Dispatch

Visual Memory

OSWorld Results

Quick Start

1. Install

Step 1 — Install the OpenProgram host

Step 2 — Add the GUI agent

2. Run

Project Structure

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

News

What is GUI Agent Harness?

Grounding Pipeline: Iterative Zoom

Benchmark Results

Agent Loop: Observe → Verify → Plan → Dispatch

Visual Memory

OSWorld Results

Quick Start

1. Install

Step 1 — Install the OpenProgram host

Step 2 — Add the GUI agent

2. Run

Project Structure

License

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages