Android Agent Reliability Runtime

Project URL: https://github.com/rty90/Android-Agent-Reliability-Runtime

Android Agent Reliability Runtime is a debugging, safety, and recovery layer for Android GUI agents. It is not trying to be yet another fully autonomous mobile agent. The existing agent is treated as an execution kernel; this project adds the runtime layer that decides when the agent should act, wait, stop, diagnose, or ask a human to take over.

Why This Exists

Mobile GUI agents often fail silently:

They tap while the screen is still loading.
They mistake blocked pages for task success.
They repeat actions that do not change the UI.
They treat one-off failures as reusable memory.
They leave no reproducible trace when something goes wrong.

This project focuses on reliability instead of raw autonomy. The goal is not to make agents act more. The goal is to know when they should stop.

Runtime Loop

Read screen
-> Normalize state
-> Classify readiness
-> Detect blocker / loop / risk
-> Propose action
-> Policy gate
-> Execute or ask human
-> Verify progress
-> Diagnose failure
-> Store trace
-> Promote reusable lessons only when validated

Core Ideas

Readiness First

Before a model or procedure proposes an action, the runtime classifies the screen as:

ready
loading
blocked
uncertain
complete

If the screen is not ready, normal actions are blocked. The runtime should only allow safe actions such as wait, diagnose, or manual_handoff.
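
The gating rule above can be sketched in a few lines. This is an illustrative sketch only, not the code in app/readiness.py: the Readiness enum and the gate_action helper are hypothetical names, and only the five state names and the three safe actions come from this README.

```python
from enum import Enum


class Readiness(Enum):
    """The five readiness states listed above (hypothetical enum)."""
    READY = "ready"
    LOADING = "loading"
    BLOCKED = "blocked"
    UNCERTAIN = "uncertain"
    COMPLETE = "complete"


# Actions this README names as safe when the screen is not ready.
SAFE_ACTIONS = frozenset({"wait", "diagnose", "manual_handoff"})


def gate_action(readiness: Readiness, proposed_action: str) -> bool:
    """Return True if the proposed action may run in this readiness state."""
    if readiness is Readiness.READY:
        return True
    # Any non-ready state only lets the safe actions through.
    return proposed_action in SAFE_ACTIONS
```

With this sketch, gate_action(Readiness.LOADING, "tap") is rejected while gate_action(Readiness.LOADING, "wait") is allowed.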

Progress Is Verified

Executing a tap is not the same as making progress. A successful action should mean:

action_executed == true
state_progress_verified == true

If the UI does not meaningfully change after an action, the runtime should mark the step as no_progress, stuck, or false_success, not success.
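
One way to picture the two-flag check is to fingerprint the UI dump before and after an action. The sketch below is an assumption about how such a check could look, not the project's verifier: ui_fingerprint, StepResult, verify_step, and the not_executed label are hypothetical, and a raw hash comparison is much cruder than a real "meaningful change" test.

```python
import hashlib
from dataclasses import dataclass


def ui_fingerprint(ui_xml: str) -> str:
    """Hash a UI dump after collapsing whitespace (hypothetical helper)."""
    normalized = " ".join(ui_xml.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class StepResult:
    action_executed: bool
    state_progress_verified: bool

    @property
    def label(self) -> str:
        # "not_executed" is an illustrative label, not one from the README.
        if not self.action_executed:
            return "not_executed"
        return "success" if self.state_progress_verified else "no_progress"


def verify_step(before_xml: str, after_xml: str, action_executed: bool) -> StepResult:
    """A step counts as progress only if it ran AND the UI fingerprint changed."""
    changed = ui_fingerprint(before_xml) != ui_fingerprint(after_xml)
    return StepResult(action_executed, action_executed and changed)
```

A tap that executes but leaves the dump unchanged yields the label no_progress rather than success.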

Failures Are Diagnosed

Failures should produce clear, stable labels such as:

loading_loop
permission_blocker
modal_blocker
wrong_page
target_missing
no_ui_change
false_success
uncertain_state
unsafe_action_blocked
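
As an illustration of how one of these labels might be produced, the sketch below flags no_ui_change when the same action keeps landing on the same screen. The detect_repeat function, its history format, and the window of three steps are all assumptions made for this example; the project's real detection logic may differ.

```python
from typing import List, Optional, Tuple


def detect_repeat(history: List[Tuple[str, str]], window: int = 3) -> Optional[str]:
    """Return "no_ui_change" if the last `window` steps are identical.

    Each history entry is (action, ui_fingerprint_after_action).
    """
    if len(history) < window:
        return None
    recent = history[-window:]
    # Identical (action, fingerprint) pairs mean the action changed nothing.
    if len(set(recent)) == 1:
        return "no_ui_change"
    return None
```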

Memory Is Conservative

Raw failures should not become operational memory automatically. The intended memory pipeline is:

raw_trace -> candidate_lesson -> promoted_lesson

Lessons should default to hints. They should become control rules only after repeated evidence or human approval.
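
The roadmap lists this pipeline as planned work, so the following is only a sketch of the intended rule, not existing code. The Lesson dataclass, the promote function, and the threshold of three supporting runs are hypothetical, and treating a promoted lesson as a control rule is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    text: str
    stage: str = "raw_trace"      # raw_trace -> candidate_lesson -> promoted_lesson
    kind: str = "hint"            # lessons default to hints
    supporting_runs: int = 0      # how many runs reproduced the lesson
    human_approved: bool = False


def promote(lesson: Lesson, min_evidence: int = 3) -> Lesson:
    """Advance a lesson by at most one stage per call."""
    if lesson.stage == "raw_trace":
        lesson.stage = "candidate_lesson"
    elif lesson.stage == "candidate_lesson" and (
        lesson.human_approved or lesson.supporting_runs >= min_evidence
    ):
        # Only repeated evidence or human approval turns a hint into a rule.
        lesson.stage = "promoted_lesson"
        lesson.kind = "control_rule"
    return lesson
```

Advancing one stage per call keeps a single failure from jumping straight to a control rule.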

Architecture

Important modules:

app/utils/adb.py - low-level ADB wrapper for device interaction.
app/executor.py - execution engine for bounded skills and plans.
app/skills/ - atomic Android actions such as tap, type, wait, back, and search.
app/readiness.py - readiness classification for ready/loading/blocked/uncertain/complete states.
app/ui_facts.py - reusable UI fact extraction helpers.
app/ui_policy.py - generic blocker and policy detection.
app/ui_state.py - normalized UI state and goal-progress assessment.
app/procedural_skills.py - generic procedure layer for common safe actions.
app/reasoning_orchestrator.py - action proposer that is now gated by readiness.
app/diagnostics.py - stable failure diagnostic reports.
scripts/chaos_ui_harness.py - deterministic blocker and overlay regression harness.
scripts/chaos_ui_e2e_smoke.py - minimal execute-and-verify smoke test.
scripts/long_tail_agent_smoke.py - mixed long-tail test runner for messy real cases.

Guided UI reasoning order:

read_screen
-> ui_facts / ui_policy / ui_state / readiness
-> procedural_skills
-> memory / model fallback
-> executor
-> verifier / diagnostics

Setup

Requirements:

Python 3.8+
Android Studio Emulator or an Android device with ADB enabled
Android platform-tools available through adb
Optional: chaos fixture APK for deterministic blocker tests

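Install Python dependencies (the commands below use Windows paths; on macOS or Linux, activate with source .venv/bin/activate instead):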

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Check that a device is online:

adb devices
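
If you want to script this check, the output of adb devices can be parsed directly. The helper names below are hypothetical and not part of app/utils/adb.py; the sketch assumes the standard output layout of a header line followed by one serial and state per line.

```python
from typing import Dict


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Map each device serial to its state from `adb devices` output."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and adb daemon startup messages.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def is_online(output: str, device_id: str) -> bool:
    """A device is usable only when adb reports its state as "device"."""
    return parse_adb_devices(output).get(device_id) == "device"
```

States such as offline or unauthorized mean the device is listed but not ready for commands.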

Basic CLI Usage

Show CLI help:

python -m app.main --help

Read the current screen:

python -m app.main --task "read the current screen and summarize it" --task-type read_current_screen

Run a guided UI task with the reasoning stack:

python -m app.main --task "open settings and inspect the current page" --task-type guided_ui_task --reasoner-backend stack --agent-mode interactive --max-steps 3 --auto-confirm

Run coach mode, where the system suggests actions while a human operates:

python -m app.main --task "open chrome and search for llm" --task-type guided_ui_task --agent-mode coach --reasoner-backend stack

Supported Task Types

The older bounded flows are still available as execution-kernel capabilities:

send_message
extract_and_copy
create_reminder
read_current_screen
guided_ui_task
unsupported

The new project direction is centered on guided_ui_task, diagnostics, coach mode, and reliability testing.

Model Configuration

The runtime can use local or OpenAI-compatible model services for reasoning, but models should be treated as action proposers rather than final decision-makers.

Common environment variables (PowerShell syntax):

$env:LOCAL_TEXT_REASONER_BASE_URL="http://127.0.0.1:9000/v1"
$env:LOCAL_TEXT_REASONER_MODEL="Qwen/Qwen3.5-0.8B"
$env:REASONING_REQUEST_TIMEOUT_SECONDS="30"
$env:REASONING_DISABLE_LOCAL_TEXT_AFTER_FAILURE="1"
$env:REASONING_ENABLE_LOCAL_VL="0"

Do not commit API keys. Use environment variables or your shell profile.
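
For reference, a Python process could read these variables roughly as follows. This is a sketch under assumptions: load_reasoner_config is a hypothetical helper, the fallback defaults are placeholders rather than the project's real defaults, and reading "1" as true and anything else as false is a guess at the flag semantics.

```python
import os
from typing import Mapping, Optional


def load_reasoner_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the reasoning settings named above from the environment."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LOCAL_TEXT_REASONER_BASE_URL", ""),
        "model": env.get("LOCAL_TEXT_REASONER_MODEL", ""),
        # Placeholder default; the project's actual default is not documented here.
        "timeout_seconds": float(env.get("REASONING_REQUEST_TIMEOUT_SECONDS", "30")),
        # Assumed convention: "1" enables a flag, any other value disables it.
        "disable_local_text_after_failure":
            env.get("REASONING_DISABLE_LOCAL_TEXT_AFTER_FAILURE", "0") == "1",
        "enable_local_vl": env.get("REASONING_ENABLE_LOCAL_VL", "0") == "1",
    }
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in unit tests.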

Chaos UI Regression Harness

The chaos harness is a deterministic ADB regression tool for UI blockers:

permission dialogs
onboarding overlays
bottom sheets
loading states
error states
stylus / IME overlays
Chrome search overlay cases

Default fixture APK path used during development:

F:\virtualver\app\build\outputs\apk\debug\app-debug.apk

Example ADB path on Windows:

C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe

Run one dry-run decision case:

python scripts\chaos_ui_harness.py --case fixture_input_surface --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"

Recommended smoke cases:

python scripts\chaos_ui_harness.py --case fixture_notification_permission --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_input_surface --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_loading_state --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_error_state --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case chrome_search_stylus_overlay --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe"

Run the minimal execute-and-verify E2E smoke:

python scripts\chaos_ui_e2e_smoke.py --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"

Artifacts are written under:

data\tmp\chaos\...
data\tmp\chaos_e2e\...

Long-Tail Agent Smoke

Use this when you want a longer mixed run with real-ish and randomized goals. It mixes chaos fixture blockers, real Settings read-only inspection, Chrome random search questions, and one execute-and-verify input E2E.

python scripts\long_tail_agent_smoke.py --iterations 18 --seed 20260502 --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"

Use the Chrome torture profile for messy real web pages that can expose blank WebView loads, JS challenges, cookie/captcha blockers, and repeated-search mistakes:

python scripts\long_tail_agent_smoke.py --iterations 8 --seed 20260506 --profile chrome_torture --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk" --skip-install

Artifacts are written under:

data\tmp\long_tail\long_tail_<timestamp>_seed_<seed>\long_tail_report.json

The long-tail runner keeps screenshots, XML, summaries, decisions, and diagnostics for every round.

Failure Diagnostics

Agent and harness failures write a stable diagnostic JSON report using schema agent.diagnostic.v1.

Example shape:

{
  "schema_version": "agent.diagnostic.v1",
  "status": "fail",
  "kind": "adb_error | unhandled_exception | agent_result_failure | chaos_harness_failure | chaos_e2e_failure",
  "human_summary": "Short explanation for humans",
  "error": {"type": "...", "message": "...", "traceback": "..."},
  "device": {
    "requested_device": "emulator-5554",
    "connected": true,
    "current_focus": "...",
    "foreground_package": "...",
    "top_activity": "...",
    "crash_log_tail": "..."
  },
  "artifacts": {
    "diagnostic_report_path": "...",
    "screenshot_path": "...",
    "ui_dump_path": "...",
    "screen_summary_path": "..."
  }
}

Default diagnostic locations:

data\tmp\diagnostics\...
data\tmp\chaos\...\diagnostics\diagnostic.json
data\tmp\chaos_e2e\...\diagnostics\diagnostic.json

device.crash_log_tail is the tail of adb logcat -b crash. It is useful for emulator or app crash clues, but it may include earlier crashes if the buffer was not cleared before the run.
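
A report in this shape is easy to triage from a script. The sketch below uses only field names from the example shape above; load_diagnostic and summarize_diagnostic are hypothetical helpers, and the one-line summary format is an arbitrary choice.

```python
import json
from pathlib import Path


def load_diagnostic(path: str) -> dict:
    """Read a diagnostic.json file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_diagnostic(report: dict) -> str:
    """Build a one-line summary from an agent.diagnostic.v1 report."""
    if report.get("schema_version") != "agent.diagnostic.v1":
        raise ValueError("unexpected schema_version: %r" % report.get("schema_version"))
    device = report.get("device") or {}
    error = report.get("error") or {}
    return "{status}/{kind}: {summary} [device={dev} connected={conn} error={etype}]".format(
        status=report.get("status", "?"),
        kind=report.get("kind", "?"),
        summary=report.get("human_summary", ""),
        dev=device.get("requested_device", "?"),
        conn=device.get("connected", "?"),
        etype=error.get("type", "?"),
    )
```

Rejecting unknown schema versions early keeps a triage script from silently misreading a future report format.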

Tests

Run all unit tests:

python -m unittest discover -s tests -v

Run the focused reliability tests:

python -m unittest tests.test_readiness tests.test_ui_state tests.test_procedural_skills tests.test_reasoning_orchestrator tests.test_diagnostics tests.test_adb -v

Syntax check:

python -m py_compile app\readiness.py app\ui_state.py app\reasoning_orchestrator.py scripts\long_tail_agent_smoke.py

Safety Notes

The project is emulator-first.
The runtime reads foreground UI state; it does not inspect private app data.
Browser WebView content is often incomplete in uiautomator XML, so readiness may conservatively return uncertain.
Captcha, login walls, and web challenges should normally trigger diagnosis or human handoff rather than automated bypass attempts.
High-risk actions should require confirmation.

Roadmap

Add a visual readiness layer for Chrome/WebView pages.
Add stronger post-action progress verification.
Add raw_trace -> candidate_lesson -> promoted_lesson storage.
Make coach mode the default user-facing experience.
Improve failure taxonomy and recovery recommendations.
Keep procedures generic; avoid app-specific if-else growth.
