Project URL: https://github.com/rty90/Android-Agent-Reliability-Runtime
Android Agent Reliability Runtime is a debugging, safety, and recovery layer for Android GUI agents. It is not trying to be yet another fully autonomous mobile agent. The existing agent is treated as an execution kernel; this project adds the runtime layer that decides when the agent should act, wait, stop, diagnose, or ask a human to take over.
Mobile GUI agents often fail silently:
- They tap while the screen is still loading.
- They mistake blocked pages for task success.
- They repeat actions that do not change the UI.
- They treat one-off failures as reusable memory.
- They leave no reproducible trace when something goes wrong.
This project focuses on reliability instead of raw autonomy. The goal is not to make agents act more. The goal is to know when they should stop.
Read screen
-> Normalize state
-> Classify readiness
-> Detect blocker / loop / risk
-> Propose action
-> Policy gate
-> Execute or ask human
-> Verify progress
-> Diagnose failure
-> Store trace
-> Promote reusable lessons only when validated
Before a model or procedure proposes an action, the runtime classifies the screen as:
- ready
- loading
- blocked
- uncertain
- complete
If the screen is not ready, normal actions are blocked. The runtime should only
allow safe actions such as wait, diagnose, or manual_handoff.
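The readiness gate described above can be sketched as follows. This is a minimal illustration, not the project's actual API: `Readiness`, `SAFE_ACTIONS`, and `gate_action` are hypothetical names, and treating `complete` like the other non-ready states is an assumption.

```python
from enum import Enum


class Readiness(Enum):
    READY = "ready"
    LOADING = "loading"
    BLOCKED = "blocked"
    UNCERTAIN = "uncertain"
    COMPLETE = "complete"


# The safe actions named in the text; everything else counts as a normal action.
SAFE_ACTIONS = frozenset({"wait", "diagnose", "manual_handoff"})


def gate_action(readiness: Readiness, proposed_action: str) -> bool:
    """Return True if the proposed action may run in this readiness state."""
    if readiness is Readiness.READY:
        return True
    # Assumption: every non-ready state (including complete) admits only
    # the safe actions.
    return proposed_action in SAFE_ACTIONS
```

With this shape, a proposer can suggest anything it likes; a tap proposed during `loading` is simply refused, while `wait` still goes through.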
Executing a tap is not the same as making progress. A successful action should mean:
action_executed == true
state_progress_verified == true
If the UI does not meaningfully change after an action, the runtime should mark
the step as no_progress, stuck, or false_success, not success.
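One way to separate "executed" from "made progress" is to compare the UI before and after the action. The sketch below assumes the UI is available as an XML dump string and uses a plain hash comparison, which is cruder than a "meaningful change" check; `ui_fingerprint`, `label_step`, and the `not_executed` label are hypothetical.

```python
import hashlib


def ui_fingerprint(ui_xml: str) -> str:
    """Hash a UI dump so two screens can be compared cheaply."""
    return hashlib.sha256(ui_xml.encode("utf-8")).hexdigest()


def label_step(action_executed: bool, before_xml: str, after_xml: str) -> str:
    """Label a step using both execution and observed UI change."""
    if not action_executed:
        return "not_executed"  # hypothetical label, not from the project
    # Assumption: any difference in the dump counts as progress.
    state_progress_verified = (
        ui_fingerprint(before_xml) != ui_fingerprint(after_xml)
    )
    return "success" if state_progress_verified else "no_progress"
```

A real verifier would likely normalize the dump first (for example, ignoring clocks or animations) so that cosmetic changes are not mistaken for progress.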
Failures should produce clear, stable labels such as:
- loading_loop
- permission_blocker
- modal_blocker
- wrong_page
- target_missing
- no_ui_change
- false_success
- uncertain_state
- unsafe_action_blocked
Raw failures should not become operational memory automatically. The intended memory pipeline is:
raw_trace -> candidate_lesson -> promoted_lesson
Lessons should default to hints. They should become control rules only after repeated evidence or human approval.
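The promotion rule can be illustrated with a small sketch. The `Lesson` class, the `MIN_EVIDENCE` threshold of 3, and the assumption that a promoted lesson is what acts as a control rule are all illustrative choices, not details taken from the project.

```python
from dataclasses import dataclass

MIN_EVIDENCE = 3  # assumed threshold for "repeated evidence"


@dataclass
class Lesson:
    text: str
    stage: str = "raw_trace"  # raw_trace -> candidate_lesson -> promoted_lesson
    evidence_count: int = 0
    human_approved: bool = False


def is_validated(lesson: Lesson) -> bool:
    """Validated means repeated evidence or explicit human approval."""
    return lesson.human_approved or lesson.evidence_count >= MIN_EVIDENCE


def promote_if_validated(lesson: Lesson) -> Lesson:
    """Only candidates can be promoted, and only when validated."""
    if lesson.stage == "candidate_lesson" and is_validated(lesson):
        lesson.stage = "promoted_lesson"
    return lesson


def enforcement(lesson: Lesson) -> str:
    """Lessons default to hints; only promoted lessons act as control rules."""
    return "control_rule" if lesson.stage == "promoted_lesson" else "hint"
```

Note that a raw trace never jumps straight to a control rule here, however much evidence it carries; it has to pass through the candidate stage first.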
Important modules:
- app/utils/adb.py - low-level ADB wrapper for device interaction.
- app/executor.py - execution engine for bounded skills and plans.
- app/skills/ - atomic Android actions such as tap, type, wait, back, and search.
- app/readiness.py - readiness classification for ready/loading/blocked/uncertain states.
- app/ui_facts.py - reusable UI fact extraction helpers.
- app/ui_policy.py - generic blocker and policy detection.
- app/ui_state.py - normalized UI state and goal-progress assessment.
- app/procedural_skills.py - generic procedure layer for common safe actions.
- app/reasoning_orchestrator.py - action proposer that is now gated by readiness.
- app/diagnostics.py - stable failure diagnostic reports.
- scripts/chaos_ui_harness.py - deterministic blocker and overlay regression harness.
- scripts/chaos_ui_e2e_smoke.py - minimal execute-and-verify smoke test.
- scripts/long_tail_agent_smoke.py - mixed long-tail test runner for messy real cases.
Guided UI reasoning order:
read_screen
-> ui_facts / ui_policy / ui_state / readiness
-> procedural_skills
-> memory / model fallback
-> executor
-> verifier / diagnostics
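Read as a fallback chain, the ordering above means cheaper, more deterministic sources get the first chance to propose an action, and the model is consulted only when they decline. The sketch below assumes that reading; `propose_action` and the two example proposers are hypothetical and do not mirror the real module interfaces.

```python
from typing import Callable, Optional, Sequence

# A proposer inspects the normalized state and returns an action name or None.
Proposer = Callable[[dict], Optional[str]]


def propose_action(state: dict, proposers: Sequence[Proposer]) -> Optional[str]:
    """Return the first non-None proposal, trying proposers in order."""
    for proposer in proposers:
        action = proposer(state)
        if action is not None:
            return action
    return None


def procedural(state: dict) -> Optional[str]:
    # Hypothetical procedure: dismiss a modal if one is present.
    return "dismiss_dialog" if state.get("modal") else None


def model_fallback(state: dict) -> Optional[str]:
    # Stand-in for a model call; always proposes something.
    return "tap_search"
```

For example, `propose_action({"modal": True}, [procedural, model_fallback])` returns the procedural proposal, and the model stand-in is only reached when no modal is present.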
Requirements:
- Python 3.8+
- Android Studio Emulator or an Android device with ADB enabled
- Android platform-tools available through adb
- Optional: chaos fixture APK for deterministic blocker tests
Install Python dependencies:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
Check that a device is online:
adb devices
Show CLI help:
python -m app.main --help
Read the current screen:
python -m app.main --task "read the current screen and summarize it" --task-type read_current_screen
Run a guided UI task with the reasoning stack:
python -m app.main --task "open settings and inspect the current page" --task-type guided_ui_task --reasoner-backend stack --agent-mode interactive --max-steps 3 --auto-confirm
Run coach mode, where the system suggests actions while a human operates:
python -m app.main --task "open chrome and search for llm" --task-type guided_ui_task --agent-mode coach --reasoner-backend stack
The older bounded flows are still available as execution-kernel capabilities:
- send_message
- extract_and_copy
- create_reminder
- read_current_screen
- guided_ui_task
- unsupported
The new project direction is centered on guided_ui_task, diagnostics, coach
mode, and reliability testing.
The runtime can use local or OpenAI-compatible model services for reasoning, but models should be treated as action proposers rather than final decision-makers.
Common environment variables:
$env:LOCAL_TEXT_REASONER_BASE_URL="http://127.0.0.1:9000/v1"
$env:LOCAL_TEXT_REASONER_MODEL="Qwen/Qwen3.5-0.8B"
$env:REASONING_REQUEST_TIMEOUT_SECONDS="30"
$env:REASONING_DISABLE_LOCAL_TEXT_AFTER_FAILURE="1"
$env:REASONING_ENABLE_LOCAL_VL="0"
Do not commit API keys. Use environment variables or your shell profile.
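On the Python side, these variables could be read into a typed config object along the following lines. The variable names come from the list above; the `ReasonerConfig` class, the `load_reasoner_config` helper, the fallback defaults, and the rule that "1" means enabled are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ReasonerConfig:
    base_url: str
    model: str
    timeout_seconds: float
    disable_local_text_after_failure: bool
    enable_local_vl: bool


def load_reasoner_config(env: Mapping[str, str] = os.environ) -> ReasonerConfig:
    """Build a config from environment variables (defaults are assumed)."""

    def flag(name: str) -> bool:
        # Assumption: "1" enables a flag; anything else, or unset, disables it.
        return env.get(name, "0").strip() == "1"

    return ReasonerConfig(
        base_url=env.get("LOCAL_TEXT_REASONER_BASE_URL", ""),
        model=env.get("LOCAL_TEXT_REASONER_MODEL", ""),
        timeout_seconds=float(env.get("REASONING_REQUEST_TIMEOUT_SECONDS", "30")),
        disable_local_text_after_failure=flag(
            "REASONING_DISABLE_LOCAL_TEXT_AFTER_FAILURE"
        ),
        enable_local_vl=flag("REASONING_ENABLE_LOCAL_VL"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in unit tests without touching the real environment.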
The chaos harness is a deterministic ADB regression tool for UI blockers:
- permission dialogs
- onboarding overlays
- bottom sheets
- loading states
- error states
- stylus / IME overlays
- Chrome search overlay cases
Default fixture APK path used during development:
F:\virtualver\app\build\outputs\apk\debug\app-debug.apk
Example ADB path on Windows:
C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe
Run one dry-run decision case:
python scripts\chaos_ui_harness.py --case fixture_input_surface --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
Recommended smoke cases:
python scripts\chaos_ui_harness.py --case fixture_notification_permission --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_input_surface --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_loading_state --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case fixture_error_state --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
python scripts\chaos_ui_harness.py --case chrome_search_stylus_overlay --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe"
Run the minimal execute-and-verify E2E smoke:
python scripts\chaos_ui_e2e_smoke.py --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
Artifacts are written under:
data\tmp\chaos\...
data\tmp\chaos_e2e\...
Use the long-tail runner (scripts\long_tail_agent_smoke.py) when you want a longer mixed run with realistic and randomized goals. It mixes chaos fixture blockers, read-only inspection of the real Settings app, random Chrome search questions, and one execute-and-verify input E2E.
python scripts\long_tail_agent_smoke.py --iterations 18 --seed 20260502 --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk"
Use the Chrome torture profile for messy real web pages that can expose blank WebView loads, JS challenges, cookie/captcha blockers, and repeated-search mistakes:
python scripts\long_tail_agent_smoke.py --iterations 8 --seed 20260506 --profile chrome_torture --device-id emulator-5554 --adb-path "C:\Users\zhufe\AppData\Local\Android\Sdk\platform-tools\adb.exe" --fixture-apk "F:\virtualver\app\build\outputs\apk\debug\app-debug.apk" --skip-install
Artifacts are written under:
data\tmp\long_tail\long_tail_<timestamp>_seed_<seed>\long_tail_report.json
The long-tail runner keeps screenshots, XML, summaries, decisions, and diagnostics for every round.
Agent and harness failures write a stable diagnostic JSON report using schema
agent.diagnostic.v1.
Example shape:
{
  "schema_version": "agent.diagnostic.v1",
  "status": "fail",
  "kind": "adb_error | unhandled_exception | agent_result_failure | chaos_harness_failure | chaos_e2e_failure",
  "human_summary": "Short explanation for humans",
  "error": {"type": "...", "message": "...", "traceback": "..."},
  "device": {
    "requested_device": "emulator-5554",
    "connected": true,
    "current_focus": "...",
    "foreground_package": "...",
    "top_activity": "...",
    "crash_log_tail": "..."
  },
  "artifacts": {
    "diagnostic_report_path": "...",
    "screenshot_path": "...",
    "ui_dump_path": "...",
    "screen_summary_path": "..."
  }
}
Default diagnostic locations:
data\tmp\diagnostics\...
data\tmp\chaos\...\diagnostics\diagnostic.json
data\tmp\chaos_e2e\...\diagnostics\diagnostic.json
device.crash_log_tail is the tail of adb logcat -b crash. It is useful for emulator or app crash clues, but it may include earlier crashes if the buffer was not cleared before the run.
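A consumer of these reports might sanity-check them before relying on individual fields. The sketch below checks only the schema version and the top-level keys shown in the example shape; `check_report` is a hypothetical helper, and treating every one of those keys as required is an assumption.

```python
import json

EXPECTED_SCHEMA = "agent.diagnostic.v1"

# Top-level keys from the example shape; assumed to be required.
REQUIRED_KEYS = (
    "schema_version",
    "status",
    "kind",
    "human_summary",
    "error",
    "device",
    "artifacts",
)


def check_report(raw: str) -> list:
    """Return a list of problems found in a diagnostic report (empty if none)."""
    report = json.loads(raw)
    problems = []
    if report.get("schema_version") != EXPECTED_SCHEMA:
        problems.append("unexpected schema_version")
    for key in REQUIRED_KEYS:
        if key not in report:
            problems.append(f"missing key: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one keeps the check usable in batch runs, where a summary across many reports is more helpful than a single exception.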
Run all unit tests:
python -m unittest discover -s tests -v
Run the focused reliability tests:
python -m unittest tests.test_readiness tests.test_ui_state tests.test_procedural_skills tests.test_reasoning_orchestrator tests.test_diagnostics tests.test_adb -v
Syntax check:
python -m py_compile app\readiness.py app\ui_state.py app\reasoning_orchestrator.py scripts\long_tail_agent_smoke.py
- The project is emulator-first.
- The runtime reads foreground UI state; it does not inspect private app data.
- Browser WebView content is often incomplete in uiautomator XML, so readiness may conservatively return uncertain.
- Captcha, login walls, and web challenges should normally trigger diagnosis or human handoff rather than automated bypass attempts.
- High-risk actions should require confirmation.
- Add a visual readiness layer for Chrome/WebView pages.
- Add stronger post-action progress verification.
- Add raw_trace -> candidate_lesson -> promoted_lesson storage.
- Make coach mode the default user-facing experience.
- Improve failure taxonomy and recovery recommendations.
- Keep procedures generic; avoid app-specific if-else growth.