Hi all,
I'm a user of AndroidWorld (AW) and I think the benchmark is great. However, recent experiments suggest that current LLM-based GUI agents can "solve" the existing benchmark quite easily (regardless of whether they overfit on IID data), which indicates the benchmark may no longer be sufficiently challenging.
Since state initialization in the emulator and task-success evaluation are the hardest parts to rebuild from scratch, I am considering extending AW on top of its existing apps by enriching the task designs and evaluation metrics.
## Potential Task Directions
| Category | Why It Might Be Challenging | Example |
|---|---|---|
| Cross-App & Multi-Goal Tasks | Workflows across multiple apps; single instruction has multiple subtasks | Browser → Markor → Broccoli; “Take today’s calendar screenshot, save to Markor, send to Alice, add related recipe in Broccoli” |
| Long-Term Memory / History Dependency | Task may require recalling previous states or results | “Was the previous task completed?” |
| Browser / Async Tasks | Dynamic web pages, forms, async content | Fill complex forms; click buttons after async JS loads |
| Popup / Distraction Handling & Network Robustness | Random system prompts, ads, delays | Random “allow/deny” prompts; slow-loading webpage |
| Time-Controlled Tasks | Delayed execution | “Record a 10-second audio clip” |
| Non-Home Initial State | Task starts from mid-app or non-home page | App opened in a subpage; partially filled form |
| Plan-Follow | Execute pre-defined plan, recover from failure | Follow a sequence of steps; adapt if a step fails |
| Ambiguous Instructions | Vague or incomplete instructions | “Organize yesterday’s notes” |
## Possible Evaluation Metrics
- Step Efficiency – ratio of the optimal step count to the number of steps actually executed
- Time Efficiency – wall-clock time to complete the task
- Partial Success – fraction of subtasks completed within a task
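The metrics above are cheap to compute once each episode records its step trace and per-subtask outcomes. A minimal sketch, assuming a hypothetical `EpisodeResult` record (the field names are illustrative, not AW's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    """Hypothetical per-episode record; field names are illustrative only."""
    optimal_steps: int            # length of a known shortest action sequence
    executed_steps: int           # steps the agent actually took
    elapsed_seconds: float        # wall-clock duration of the episode
    subtask_results: list = field(default_factory=list)  # list[bool], one per subtask

def step_efficiency(r: EpisodeResult) -> float:
    """Optimal steps / executed steps; 1.0 means a maximally efficient run."""
    return r.optimal_steps / r.executed_steps if r.executed_steps else 0.0

def partial_success(r: EpisodeResult) -> float:
    """Fraction of subtasks completed; full success when this equals 1.0."""
    return sum(r.subtask_results) / len(r.subtask_results) if r.subtask_results else 0.0
```

For example, an agent that takes 10 steps on a 5-step-optimal task and finishes 2 of 4 subtasks scores 0.5 on both metrics, which makes partial progress visible where a binary success flag would just report failure.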
## Other Considerations
- State Initialization – Currently, AW operates apps directly. One option is to save a separate AVD snapshot for resets, though this may be slower.
- Task Evaluation – Depending on the task, success can be determined by parsing XML or the a11y (accessibility) tree, or by reading system/app state. Key checkpoints could be added at intermediate points in a task. I would prefer to avoid LLM-based judging for evaluation.
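For the snapshot-based reset option, the emulator console exposes `avd snapshot save`/`load` commands reachable through `adb emu`. A minimal sketch (the snapshot name `aw_clean` is a placeholder, and the actual reset latency would need measuring):

```python
import subprocess

def snapshot_cmd(action: str, name: str = "aw_clean") -> list:
    """Build the emulator-console command to save or load a named snapshot.

    `action` must be "save" or "load"; `name` is an illustrative snapshot name.
    """
    if action not in ("save", "load"):
        raise ValueError(f"unsupported snapshot action: {action}")
    return ["adb", "emu", "avd", "snapshot", action, name]

def reset_emulator(name: str = "aw_clean") -> None:
    """Restore the emulator to a known-clean state before each task."""
    subprocess.run(snapshot_cmd("load", name), check=True)
```

Saving one clean snapshot per task suite (rather than per task) would keep the storage cost bounded, at the price of still needing per-task app setup on top of the restored state.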
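For a11y-tree-based checkpoints, a checker only needs to confirm that a node with given attributes exists in the dumped UI hierarchy (e.g. the XML produced by `uiautomator dump`). A minimal sketch; the sample attribute names follow the uiautomator dump format, but the checkpoint itself is hypothetical:

```python
import xml.etree.ElementTree as ET

def checkpoint_node_exists(a11y_xml: str, attrs: dict) -> bool:
    """Return True if any node in the UI-hierarchy XML matches all given attributes.

    `attrs` maps XML attribute names (e.g. "text", "resource-id") to expected values.
    """
    root = ET.fromstring(a11y_xml)
    for node in root.iter():
        if all(node.get(key) == value for key, value in attrs.items()):
            return True
    return False
```

Chaining several such checks at key points during a task would give the per-subtask signals needed for a partial-success metric, without resorting to an LLM judge.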
## Questions for discussion
- Are these task directions reasonable and meaningful?
- Are there other factors we should consider to better assess the agent's ability?
- Any suggestions for initial task designs or evaluation metrics?