baremobile — Agent Integration Guide

Use this file as context when building agents that control Android/iOS devices via baremobile.

Core Loop

Every agent interaction follows observe-think-act:

import { connect } from 'baremobile';

const page = await connect();    // auto-detect device (auto-reconnects WiFi if needed)
let snapshot = await page.snapshot();  // observe

// Agent reads snapshot, picks action
await page.tap(5);               // act
snapshot = await page.snapshot(); // observe again

Always snapshot after every action. Refs reset per snapshot — never cache them.
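
The loop above can be wrapped in a small driver. This is a sketch, not part of baremobile: runAgent, the decide callback, and the action object shape are hypothetical names chosen for illustration. page is whatever connect() returned.

```javascript
// Hypothetical driver for the observe-think-act loop (not a baremobile API).
// `decide` is your own policy: it receives the snapshot text and returns an
// action object, or null when the task is finished.
async function runAgent(page, decide, maxSteps = 20) {
  for (let step = 0; step < maxSteps; step++) {
    const snapshot = await page.snapshot();       // observe (fresh refs every time)
    const action = await decide(snapshot, step);  // think
    if (!action) return { done: true, steps: step };

    // act: refs in `action` must come from the snapshot taken in this iteration
    switch (action.kind) {
      case 'tap':   await page.tap(action.ref); break;
      case 'type':  await page.type(action.ref, action.text); break;
      case 'press': await page.press(action.key); break;
      default: throw new Error(`Unknown action kind: ${action.kind}`);
    }
  }
  return { done: false, steps: maxSteps };        // step budget exhausted
}
```

Because the snapshot is taken at the top of every iteration, a ref never outlives the snapshot it came from.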

Snapshot Format

- ScrollView [ref=1]
  - Group
    - Text "Settings"
    - Group [ref=2]
      - Text "Search settings"
  - List
    - Group [ref=3]
      - Text "Wi-Fi"
      - Switch [ref=4] (Wi-Fi) [checked]
    - Group [ref=5] [disabled]
      - Text "Airplane mode"

What to read:

  • [ref=N] — interactive element, use with tap/type/scroll
  • "quoted text" — visible text on screen
  • (parenthesized) — contentDesc / accessibility label
  • [checked], [selected], [focused], [disabled] — element state
  • Indentation = nesting (parent-child)

Roles: Text, TextInput, Button, Image, ImageButton, CheckBox, Switch, Radio, Toggle, Slider, Progress, Select, List, ScrollView, Group, TabList, Tab. Unknown classes show their short Java class name.
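
If an agent needs structured data rather than raw text, the format above is simple enough to parse line by line. A minimal sketch, assuming one node per line and two spaces per nesting level as in the sample; parseSnapshot is a hypothetical helper, not baremobile's own parser.

```javascript
// Sketch of a snapshot parser. Assumes each node is one line of the form
//   <indent>- Role [ref=N] (contentDesc) "text" [state]...
function parseSnapshot(snapshot) {
  const nodes = [];
  for (const line of snapshot.split('\n')) {
    const m = line.match(/^(\s*)- (\S+)(.*)$/);
    if (!m) continue;
    const [, indent, role, rest] = m;
    const text = rest.match(/"([^"]*)"/);
    const unquoted = rest.replace(/"[^"]*"/g, ''); // don't match markers inside visible text
    const ref = unquoted.match(/\[ref=(\d+)\]/);
    const desc = unquoted.match(/\(([^)]*)\)/);
    const states = [...unquoted.matchAll(/\[(checked|selected|focused|disabled)\]/g)]
      .map((s) => s[1]);
    nodes.push({
      depth: indent.length / 2,            // two spaces per nesting level
      role,
      ref: ref ? Number(ref[1]) : null,    // null = no ref on this node
      text: text ? text[1] : null,
      desc: desc ? desc[1] : null,
      states,
    });
  }
  return nodes;
}
```

Filtering the result for entries with a non-null ref gives the list of elements the agent can act on.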

Page Methods

Navigation

await page.launch('com.android.settings');  // open app by package
await page.intent('android.settings.BLUETOOTH_SETTINGS');  // deep nav via intent
await page.back();                          // press back
await page.home();                          // press home
await page.press('recent');                 // app switcher

Reading

const yaml = await page.snapshot();    // pruned YAML with refs
const png = await page.screenshot();   // PNG buffer

Interaction

await page.tap(ref);                        // tap element
await page.tapXY(540, 1200);               // tap by pixel coordinates
await page.tapGrid('C5');                  // tap by grid cell
await page.type(ref, 'text');               // type into field
await page.type(ref, 'new', {clear: true}); // clear field first, then type
await page.press('enter');                  // press key
await page.scroll(ref, 'down');             // scroll within element
await page.longPress(ref);                  // long press
await page.swipe(x1, y1, x2, y2, 300);     // raw swipe

Waiting

const textSnap = await page.waitForText('Bluetooth', 10000);    // poll until text appears
const stateSnap = await page.waitForState(3, 'checked', 10000); // poll until state matches
// States: 'enabled', 'disabled', 'checked', 'unchecked', 'focused', 'selected'

Keys for press()

back, home, enter, delete, tab, escape, up, down, left, right, space, power, volup, voldown, recent

Common Patterns

Type into a field

Snapshot shows:  TextInput [ref=3] "Search settings" [focused]
  • If [focused] — just type, no extra tap needed: page.type(3, 'wifi')
  • If not focused — page.type(3, 'wifi') will tap first automatically
  • To replace existing text: page.type(3, 'new text', {clear: true})

Navigate a list

Snapshot shows:  ScrollView [ref=1] → List → Group [ref=2] "Wi-Fi" ...
  • Tap an item: page.tap(2)
  • Scroll for more: page.scroll(1, 'down') then snapshot again
  • Items at the bottom may not be visible — scroll and re-snapshot
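
The scroll-and-re-snapshot pattern can be written as a loop. A sketch under stated assumptions: scrollUntilVisible is a hypothetical helper, and it assumes the first ScrollView in the snapshot is the list to scroll.

```javascript
// Hypothetical helper: scroll a list until `text` shows up in a snapshot.
// The ScrollView ref is looked up again after every snapshot because refs reset.
async function scrollUntilVisible(page, text, maxScrolls = 5) {
  for (let i = 0; i <= maxScrolls; i++) {
    const snapshot = await page.snapshot();
    if (snapshot.includes(`"${text}"`)) return snapshot;  // found: caller picks the ref
    const scroller = snapshot.match(/ScrollView \[ref=(\d+)\]/);
    if (!scroller) return null;                           // nothing scrollable on screen
    if (i < maxScrolls) await page.scroll(Number(scroller[1]), 'down');
  }
  return null;                                            // gave up after maxScrolls
}
```

The helper returns the snapshot in which the text appeared, so the caller reads the ref from that same snapshot before tapping.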

Handle a dialog

Snapshot shows:  Text "Allow access?" → Button [ref=5] "Allow" → Button [ref=6] "Deny"
  • Read dialog text, decide, tap the appropriate button
  • Dialogs always have their buttons in the snapshot with refs

Open an app

await page.launch('com.android.settings');
await new Promise(r => setTimeout(r, 2000)); // wait for app to load
const snapshot = await page.snapshot();

Common packages: com.android.settings, com.android.chrome, com.google.android.apps.messaging, com.google.android.dialer, com.android.contacts

Deep navigation with intents

await page.intent('android.settings.BLUETOOTH_SETTINGS');
await page.intent('android.settings.WIFI_SETTINGS');
await page.intent('android.settings.DISPLAY_SETTINGS');
await page.intent('android.settings.SOUND_SETTINGS');
await page.intent('android.settings.LOCATION_SOURCE_SETTINGS');
await page.intent('android.settings.AIRPLANE_MODE_SETTINGS');
await page.intent('android.settings.APPLICATION_SETTINGS');
// With extras:
await page.intent('android.intent.action.VIEW', { url: 'https://example.com' });

Skip multi-step navigation when you know the intent action.

Vision fallback (when the accessibility tree fails)

const png = await page.screenshot();    // get visual
const grid = await page.grid();         // get grid info
console.log(grid.text);                 // "Screen: 1080×2400, Grid: 10 cols (A-J) × 22 rows..."
// Send screenshot + grid.text to vision model
// Model responds: "tap C5"
await page.tapGrid('C5');               // or page.tapXY(x, y)

Use when: Flutter apps crash uiautomator, WebView content is invisible, or the snapshot seems wrong.
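
If the vision model's answer needs to be turned into pixels (for tapXY, or for logging), the cell-to-coordinate math is straightforward. A sketch only: it assumes columns are lettered from A, rows are numbered from 1 at the top, and cells are equal-sized. baremobile's own tapGrid mapping may differ, and gridCellToXY is a hypothetical name.

```javascript
// Sketch: convert a grid cell such as "C5" to the pixel at the cell's center.
function gridCellToXY(cell, { width, height, cols, rows }) {
  const m = /^([A-Z])(\d+)$/.exec(cell.toUpperCase());
  if (!m) throw new Error(`Bad grid cell: ${cell}`);
  const col = m[1].charCodeAt(0) - 65;  // A -> 0
  const row = Number(m[2]) - 1;         // 1 -> 0
  if (col >= cols || row < 0 || row >= rows) {
    throw new Error(`Cell out of range: ${cell}`);
  }
  return {
    x: Math.round((col + 0.5) * (width / cols)),   // horizontal center of the cell
    y: Math.round((row + 0.5) * (height / rows)),  // vertical center of the cell
  };
}
```

With the 1080×2400 screen and 10 × 22 grid from the example, the result for 'C5' can be passed straight to page.tapXY(x, y).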

Send a message (multi-step)

  1. launch('com.google.android.apps.messaging')
  2. Snapshot → find "Start chat" button → tap(ref)
  3. Snapshot → find TextInput for "To:" → type(ref, '5551234567')
  4. Snapshot → find suggestion like "Send to (555) 123-4567" → tap(ref)
  5. Snapshot → find compose TextInput → type(ref, 'Hello!')
  6. Snapshot → find "Send SMS" button → tap(ref)

Each step: snapshot, read, decide, act. The agent adapts to whatever the UI shows.
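
For reference, the six steps can be scripted directly. This is a rigid sketch of what an agent would normally decide step by step: refOf and sendSms are hypothetical helpers, the label patterns come from the steps above and will vary by app version, and settle delays are omitted for brevity. refOf only matches when the label sits on the same line as the ref.

```javascript
// Hypothetical helper: ref of the first line that has a ref and matches `pattern`.
function refOf(snapshot, pattern) {
  for (const line of snapshot.split('\n')) {
    const ref = line.match(/\[ref=(\d+)\]/);
    if (ref && pattern.test(line)) return Number(ref[1]);
  }
  throw new Error(`No element matching ${pattern}`);
}

// Sketch of the message flow: one fresh snapshot per step, never a cached ref.
async function sendSms(page, number, body) {
  await page.launch('com.google.android.apps.messaging');
  const find = async (pattern) => refOf(await page.snapshot(), pattern);

  await page.tap(await find(/Start chat/));
  await page.type(await find(/TextInput.*To/), number);
  await page.tap(await find(/Send to/));
  await page.type(await find(/TextInput/), body);
  await page.tap(await find(/Send SMS/));
}
```

A real agent should read each snapshot and adapt instead of throwing when a pattern does not match.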

Pick an emoji

  1. In compose view, find emoji button (contentDesc contains "emoji") → tap(ref)
  2. Snapshot → emoji grid appears, each emoji is View [ref=N] (😀) with name in contentDesc
  3. Tap the emoji ref → it inserts into the TextInput
  4. Press back or tap outside to close emoji panel

Attach a file

  1. Find attach/+ button (contentDesc "Show attach" or "Show more options") → tap(ref)
  2. Snapshot → options appear: Gallery, Files, Location, etc. → tap(ref) for Files
  3. System file picker opens → snapshot shows folders and files with refs
  4. Navigate to file → tap(ref) to select

Unlock the screen

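await page.press('power');                    // wake the screen
await page.swipe(540, 1800, 540, 800, 300);   // swipe up to reach the PIN entry
const snapshot = await page.snapshot();       // read the PIN field's ref from this snapshot
await page.type(ref, '1234');                 // PIN (if needed); ref comes from the snapshot above
await page.press('enter');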

Gotchas

Core ADB + Termux ADB (screen control)

Refs reset every snapshot. Never store a ref and use it after another snapshot. Always re-read.

Snapshot takes 1-5 seconds. uiautomator dump is slow, especially on emulators. Don't snapshot in a tight loop.

Wait after actions. UI needs time to settle. Wait 500ms-2s after taps, 2-3s after launching apps.

Some list items aren't clickable. Android file picker drawer items and some system UI elements don't have clickable=true, so they don't get refs. Fall back to coordinates: tapXY() for a tap, or a raw swipe().

WebView content is invisible. uiautomator can't see inside WebViews. If the snapshot looks empty/shallow in a browser or hybrid app, that's why. Future: CDP bridge.

Switch/toggle may disappear when off. Android sometimes removes unchecked Switch/Toggle elements from the accessibility tree. On the Bluetooth page, when BT is off the Switch disappears — only Text "Use Bluetooth" remains. No switch present = off. Don't look for Switch [unchecked].
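
That rule can be encoded as a small check on the snapshot text. A sketch, assuming the Switch carries its label as contentDesc in parentheses as in the sample snapshot; isSwitchOn is a hypothetical helper.

```javascript
// Sketch: infer a toggle's state from snapshot text.
// A Switch line with [checked] means on; an unchecked or missing Switch means off.
// Pass `label` to match a specific switch by its (contentDesc); omit it to match any.
function isSwitchOn(snapshot, label) {
  return snapshot.split('\n').some((line) =>
    /\bSwitch\b/.test(line) &&
    line.includes('[checked]') &&
    (label === undefined || line.includes(`(${label})`)));
}
```

Treating "no Switch found" as off avoids waiting for a Switch [unchecked] that may never appear.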

Toggles have transitional states. After tapping a system toggle (Bluetooth, WiFi), it briefly shows [disabled] while the hardware state changes. Use waitForText() or waitForState() instead of fixed delays to confirm the action completed.

HTML entities in text. Decoded at parse time: &amp; → &, &lt; → <, etc. Snapshots show clean text.

Emojis show as entities in contentDesc. View [ref=8] (&#128512;) means the emoji 😀. The agent can read the unicode codepoint or just tap by ref position in the grid.
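
To compare a contentDesc against an actual emoji, the numeric entity can be decoded first. A minimal sketch; decodeNumericEntities is a hypothetical helper and handles only numeric references (decimal and hex), not named entities.

```javascript
// Sketch: turn numeric character references such as "&#128512;" into the
// characters they encode.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(Number(dec)));
}
```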

type() is word-by-word. On API 35+, adb input text is broken for spaces. baremobile splits text into words and injects KEYCODE_SPACE between them. This means typing is slower for long strings. Shell special characters (& | ; $ ~ # % ^ * { } [ ] ! ? and quotes) are escaped automatically.
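
To make the cost of word-by-word typing concrete, here is one way such a plan could be built. This is illustrative only: planTyping, the step shape, and the backslash-escaping rule are assumptions, not baremobile's actual implementation.

```javascript
// Illustrative sketch: split text into words, with a space keyevent between
// them, and backslash-escape shell special characters inside each word.
function planTyping(text) {
  const steps = [];
  text.split(' ').forEach((word, i) => {
    if (i > 0) steps.push({ keyevent: 'KEYCODE_SPACE' });  // one keyevent per space
    if (word) steps.push({ text: word.replace(/[&|;$~#%^*{}[\]!?'"\\]/g, '\\$&') });
  });
  return steps;
}
```

A sentence of n words becomes roughly 2n - 1 separate injections, which is why long strings type slowly.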

Termux ADB only

Wireless debugging drops on reboot. Must re-enable in Developer Options and re-pair after every device restart. The connection is not persistent.

Pairing port differs from connect port. The port shown when tapping "Pair device with pairing code" is NOT the port for adb connect. The connect port is shown on the main Wireless debugging screen.

Termux:API only

No screen control. Termux:API cannot read the screen, take snapshots, or tap elements. It provides direct Android API access only (SMS, calls, location, etc.). Use Termux ADB for screen control.

Commands are blocking. termux-* commands run synchronously. location() can take several seconds waiting for a GPS fix. cameraPhoto() blocks until capture completes.
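
A caller can bound how long it waits on a slow command. A sketch, assuming the api functions return promises (they are awaited in the examples in this guide); withTimeout is a hypothetical helper, and it does not cancel the underlying termux-* command.

```javascript
// Sketch: reject if `promise` has not settled within `ms` milliseconds.
// The underlying command keeps running; this only stops the caller from waiting.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage would look like: const loc = await withTimeout(api.location({ provider: 'network' }), 15000, 'location');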

Some commands need a real device. smsSend(), call(), location() require hardware (SIM card, GPS) that emulators don't have. batteryStatus(), clipboardGet/Set(), volumeGet(), wifiInfo(), vibrate() work on emulators.

Termux:API addon must be installed separately. The termux-api package (CLI tools) AND the Termux:API Android app (F-Droid) are both required. Missing the app causes silent failures.

Termux Setup (on-device control)

baremobile can run inside Termux on the phone itself — no USB, no host machine.

Termux + ADB (full screen control)

# In Termux:
pkg install android-tools nodejs-lts

# On the phone: Settings → Developer options → Wireless debugging → ON
# Tap "Pair device with pairing code" — note the port + code
adb pair localhost:PORT CODE

# Note the connect port (shown on Wireless debugging screen, different from pairing port)
adb connect localhost:PORT

# Verify
adb devices  # should show localhost:PORT  device

Then in Node.js:

import { connect } from 'baremobile';
const page = await connect({ termux: true });  // or auto-detects
const snap = await page.snapshot();

Limitations: Wireless debugging must be re-enabled after every reboot. The pairing code is one-time but the connection drops on reboot.

Termux:API (direct Android APIs, no ADB)

Install Termux:API addon from F-Droid, then:

pkg install termux-api

Then in Node.js:

import * as api from 'baremobile/src/termux-api.js';

// Check availability
if (await api.isAvailable()) {
  await api.smsSend('5551234', 'Hello from baremobile!');
  const inbox = await api.smsList({ limit: 5, type: 'inbox' });
  await api.call('5551234');
  const loc = await api.location({ provider: 'network' });
  const battery = await api.batteryStatus();
  await api.clipboardSet('copied text');
  const text = await api.clipboardGet();
  await api.notify('Agent', 'Task complete', { sound: true });
  await api.torch(true);  // flashlight on
  await api.vibrate({ duration: 500 });
}

Termux:API is not screen control — it's direct Android API access. Use it for SMS, calls, location, camera, clipboard. Faster and more reliable than tapping through the UI.

iOS (WDA-based)

Same snapshot() / tap(ref) pattern as Android. WDA XML is translated into the shared prune/format pipeline, producing identical YAML output.

Quick start

import { connect } from 'baremobile/src/ios.js';

const page = await connect();
console.log(await page.snapshot());
await page.tap(1);
await page.type(2, 'hello');
await page.launch('com.apple.Preferences');
await page.back();
await page.screenshot();
page.close();

iOS Page Methods

| Method | What it does |
| --- | --- |
| page.snapshot() | Hierarchical YAML (same format as Android) |
| page.tap(ref) | Coordinate tap at bounds center |
| page.type(ref, text, opts) | Tap to focus + WDA keys. {clear: true} to clear first |
| page.scroll(ref, direction) | Swipe within element bounds (up/down/left/right) |
| page.swipe(x1, y1, x2, y2, duration) | Raw swipe between coordinates |
| page.longPress(ref) | Long press at bounds center (1s) |
| page.tapXY(x, y) | Tap by pixel coordinates |
| page.back() | Find back button in refMap, fallback to swipe-from-left-edge |
| page.home() | WDA homescreen |
| page.launch(bundleId) | Launch app by bundle ID |
| page.screenshot() | PNG buffer |
| page.waitForText(text, timeout) | Poll snapshot until text appears |
| page.press(key) | home, volumeup, volumedown only |
| page.unlock(passcode) | Unlock device (throws if wrong passcode) |
| page.close() | Close connection and clean up |

Key differences from Android

  • Bundle IDs, not package names (com.apple.Preferences, not com.android.settings)
  • No intents — use page.launch(bundleId) for app navigation
  • No grid/tapGrid — coordinate tap from bounds is reliable
  • Back is semantic — searches refMap for back button, falls back to swipe gesture
  • press() is limited — only home, volumeup, volumedown. Use tap(ref) for UI buttons.

Requirements

  • WDA on device — signed with free Apple ID (7-day cert, re-sign weekly)
  • pymobiledevice3 — setup only (tunnel, DDI mount, WDA launch). Python 3.12.
  • USB cable required — WiFi tunnel needs Mac/Xcode, not possible on Linux
  • Developer Mode on iPhone — required for developer services

Setup

baremobile setup           # interactive wizard — Android (emulator/USB/WiFi/Termux) + iOS
baremobile ios resign      # re-sign WDA when cert expires (every 7 days)
baremobile ios teardown    # kill tunnel/WDA processes

MCP Server

MCP server (mcp-server.js) for Claude Code and other MCP clients.

claude mcp add baremobile -- node /path/to/baremobile/mcp-server.js

Tools (11, dual-platform)

All tools accept optional platform: "android" | "ios" (default: android).

| Tool | Params | Returns |
| --- | --- | --- |
| snapshot | maxChars?, platform? | YAML tree (or file path if >30K chars) |
| tap | ref, platform? | 'ok' |
| type | ref, text, clear?, platform? | 'ok' |
| press | key, platform? | 'ok' |
| scroll | ref, direction, platform? | 'ok' |
| swipe | x1, y1, x2, y2, duration?, platform? | 'ok' |
| long_press | ref, platform? | 'ok' |
| launch | pkg, platform? | 'ok' |
| screenshot | platform? | base64 PNG |
| back | platform? | 'ok' |
| find_by_text | text, platform? | ref number or null |

Action tools return 'ok' — call snapshot to observe the result. Large snapshots saved to .baremobile/screen-{timestamp}.yml when exceeding maxChars (default 30,000). iOS cert warning prepended to first snapshot if cert is >6 days old.

CLI

Session-based control for shell scripting and automation.

Session lifecycle

baremobile open [--device=SERIAL] [--platform=android|ios]
baremobile status
baremobile close

Commands

# Screen
baremobile snapshot                  # -> .baremobile/screen-*.yml
baremobile screenshot                # -> .baremobile/screenshot-*.png
baremobile grid                      # screen grid info (for vision fallback)

# Interaction
baremobile tap <ref>
baremobile tap-xy <x> <y>
baremobile tap-grid <cell>
baremobile type <ref> <text> [--clear]
baremobile press <key>
baremobile scroll <ref> <direction>
baremobile swipe <x1> <y1> <x2> <y2> [--duration=N]
baremobile long-press <ref>
baremobile launch <pkg>
baremobile intent <action> [--extra-string key=val ...]
baremobile back
baremobile home

# Waiting
baremobile wait-text <text> [--timeout=N]
baremobile wait-state <ref> <state> [--timeout=N]

# iOS management
baremobile setup
baremobile ios resign
baremobile ios teardown

# Logging
baremobile logcat [--filter=TAG] [--clear]

Output conventions

All output goes to .baremobile/ in the current directory. Action commands print ok. File-producing commands print the file path. Errors go to stderr with non-zero exit.

JSON mode (--json)

baremobile open --json       # {"ok":true,"pid":1234,"port":40049}
baremobile snapshot --json   # {"ok":true,"file":"/path/.baremobile/screen-*.yml"}
baremobile tap 4 --json      # {"ok":true}
baremobile status --json     # {"ok":false,"error":"No session found."}

Every response has ok: true|false. File-producing commands include file. Errors include error.
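
A script consuming --json output only needs to check the ok field. A minimal sketch; parseCliResult is a hypothetical helper that takes the stdout of one command (for example, captured via Node's child_process) and relies only on the ok, file, and error fields described above.

```javascript
// Sketch: interpret the stdout of one `baremobile ... --json` invocation.
// Returns the parsed object when ok is true, throws otherwise.
function parseCliResult(stdout) {
  let result;
  try {
    result = JSON.parse(stdout.trim());
  } catch {
    throw new Error(`Not JSON: ${stdout.slice(0, 80)}`);
  }
  if (result.ok !== true) {
    throw new Error(result.error || 'baremobile command failed');
  }
  return result;
}
```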

Error Recovery

If an action doesn't seem to work:

  1. waitForText — use waitForText('expected text', 5000) instead of guessing delays
  2. Snapshot again — the UI may have changed during the action
  3. Screenshot + vision: screenshot() + grid() if the accessibility tree looks wrong
  4. Press back — if stuck in an unexpected state, back out and retry
  5. Home + relaunch — nuclear option to reset to known state
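
The escalation can be sketched as a retry wrapper. Assumptions are labeled: tryWithRecovery, action, and appId are hypothetical names, the vision step is left out, and the sketch assumes waitForText rejects when the timeout expires, which this guide does not state.

```javascript
// Sketch of the recovery ladder: confirm with waitForText, then back out and
// retry, then go home and relaunch as a last resort.
async function tryWithRecovery(page, action, expectedText, appId) {
  const attempt = async () => {
    await action(await page.snapshot());          // re-read refs before acting
    return page.waitForText(expectedText, 5000);  // confirm instead of guessing delays
  };
  try {
    return await attempt();
  } catch {
    await page.back();                            // back out of the unexpected state
  }
  try {
    return await attempt();
  } catch {
    await page.home();                            // reset to a known state
    await page.launch(appId);
    return attempt();                             // last try; errors propagate
  }
}
```

The action callback receives a fresh snapshot on every attempt, so it must look up its refs there and never reuse old ones.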