
Discussion: Enhancing AndroidWorld #339

@Silung

Hi all,

I'm a user of AndroidWorld (AW) and I think the benchmark is great. However, recent experiments suggest that current LLM-based GUI agents can "solve" the existing benchmark quite easily (regardless of whether they overfit on IID data). This indicates that the benchmark may no longer be sufficiently challenging.

Since state initialization in the emulator and task-success evaluation are still the challenging parts to build, I am considering enhancing AW on top of its existing apps by enriching the task designs and evaluation metrics.

Potential Task Directions

| Category | Why It Might Be Challenging | Example |
| --- | --- | --- |
| Cross-App & Multi-Goal Tasks | Workflows span multiple apps; a single instruction contains multiple subtasks (see the sketch below) | Browser → Markor → Broccoli: “Take today’s calendar screenshot, save it to Markor, send it to Alice, add a related recipe in Broccoli” |
| Long-Term Memory / History Dependency | The task may require recalling previous states or results | “Was the previous task completed?” |
| Browser / Async Tasks | Dynamic web pages, forms, async content | Fill complex forms; click buttons after async JS loads |
| Popup / Distraction Handling & Network Robustness | Random system prompts, ads, delays | Random “allow/deny” prompts; slow-loading webpage |
| Time-Controlled Tasks | Delayed or timed execution | “Record a 10-second audio clip” |
| Non-Home Initial State | The task starts mid-app or on a non-home page | App opened on a subpage; partially filled form |
| Plan-Follow | Execute a pre-defined plan, recover from failure | Follow a sequence of steps; adapt if a step fails |
| Ambiguous Instructions | Vague or incomplete instructions | “Organize yesterday’s notes” |
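
To make the cross-app / multi-goal direction concrete, here is a minimal sketch of a composite task scored by per-subtask checkpoints. It deliberately does not commit to AndroidWorld's actual `TaskEval` interface; `Subtask`, `CrossAppTask`, the helper checks, and the `env` handle are all hypothetical stand-ins, and the Markor path is an assumption:

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


def markor_file_exists(env, filename: str) -> bool:
    """Check for a file in Markor's notebook folder via adb.

    The path below is an assumption about Markor's default storage
    location; adjust it to the configured notebook directory.
    """
    out = subprocess.run(
        ["adb", "shell", "ls", "/sdcard/Documents/markor"],
        capture_output=True, text=True,
    )
    return filename in out.stdout


def message_sent_to(env, contact: str, attachment: str) -> bool:
    # Placeholder: would parse the messaging app's DB or a11y tree.
    return False


def broccoli_has_recipe(env, title: str) -> bool:
    # Placeholder: would query Broccoli's recipe database.
    return False


@dataclass
class Subtask:
    """One checkpoint of a composite task."""
    name: str
    check: Callable  # reads emulator/app state, returns True if passed


class CrossAppTask:
    """A cross-app, multi-goal task scored by per-subtask checkpoints."""

    def __init__(self, subtasks: List[Subtask]):
        self.subtasks = subtasks

    def evaluate(self, env) -> float:
        """Fraction of subtasks whose checkpoint passes."""
        passed = sum(1 for s in self.subtasks if s.check(env))
        return passed / len(self.subtasks)


# Wiring for the Browser -> Markor -> Broccoli example from the table.
task = CrossAppTask([
    Subtask("screenshot saved to Markor",
            lambda env: markor_file_exists(env, "calendar_today.png")),
    Subtask("file sent to Alice",
            lambda env: message_sent_to(env, "Alice", "calendar_today.png")),
    Subtask("recipe added in Broccoli",
            lambda env: broccoli_has_recipe(env, "related recipe")),
])
```

Checkpoint checks of this shape would also give the Partial Success metric below for free, since `evaluate` already returns the fraction of subtasks passed.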

Possible Evaluation Metrics

  • Step Efficiency – ratio of the optimal step count to the number of steps actually executed (see the sketch after this list)
  • Time Efficiency – wall-clock time to complete the task
  • Partial Success – completion rate over the task's subtasks
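
As a rough sketch of how these per-episode metrics could be computed (the `EpisodeResult` fields are illustrative, not an existing AW schema):

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    optimal_steps: int    # length of a reference (oracle) trajectory
    executed_steps: int   # steps the agent actually took
    wall_time_s: float    # episode duration in seconds
    subtasks_passed: int
    subtasks_total: int


def step_efficiency(r: EpisodeResult) -> float:
    # 1.0 means the agent matched the reference trajectory length;
    # capped so shorter-than-reference runs don't score above 1.
    return min(1.0, r.optimal_steps / max(1, r.executed_steps))


def partial_success(r: EpisodeResult) -> float:
    return r.subtasks_passed / max(1, r.subtasks_total)


r = EpisodeResult(optimal_steps=6, executed_steps=9, wall_time_s=74.2,
                  subtasks_passed=2, subtasks_total=3)
print(step_efficiency(r))  # 0.666...
print(partial_success(r))  # 0.666...
```

Time efficiency is harder to normalize, since wall-clock time depends on model latency and emulator speed as much as on agent behavior, so reporting it alongside step efficiency rather than folding both into one score seems safer.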

Other Considerations

  • State Initialization – Currently, AW sets up state by operating the apps directly. One option is to save a separate AVD snapshot and restore it for resets (see the sketch after this list), though this may be slower.
  • Task Evaluation – Depending on the task, success can be determined by parsing the UI XML / a11y tree or by reading system/app state, and key checkpoints could be added during tasks. I would prefer to avoid LLM-based judging for evaluation.
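
For the snapshot-based reset, a rough sketch using the emulator console through `adb emu` (assuming a single running emulator; the snapshot name is arbitrary):

```python
import subprocess


def snapshot(action: str, name: str) -> None:
    """Save or load an AVD snapshot via the emulator console.

    `adb emu` forwards console commands to the running emulator;
    pass `-s <serial>` to adb if several devices are attached.
    """
    subprocess.run(["adb", "emu", "avd", "snapshot", action, name], check=True)


# One-time: boot the emulator, install apps, drive it to the desired
# initial state, then persist that state.
snapshot("save", "aw_clean_state")

# Before each episode: restore the saved state instead of re-driving the UI.
snapshot("load", "aw_clean_state")
```

Whether this beats AW's current direct re-initialization depends on snapshot size: loading large snapshots can add seconds of I/O per reset, which is the "may be slower" caveat above.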

Questions for discussion:

  • Are these task directions reasonable and meaningful?
  • Are there other factors we should consider to better assess the agent's ability?
  • Any suggestions for initial task designs or evaluation metrics?
