
Discussion: Enhancing AndroidWorld #339

@Silung

Hi all,

I'm a user of AndroidWorld (AW) and I think the benchmark is great. However, recent experiments suggest that current LLM-based GUI agents can "solve" the existing benchmark quite easily (regardless of whether they overfit on IID data). This indicates that the benchmark may no longer be sufficiently challenging.

Since state initialization in the emulator and task-success evaluation are still the challenging parts to build, I am considering enhancing AW on top of its existing apps by enriching the task designs and evaluation metrics.

Potential Task Directions

| Category | Why It Might Be Challenging | Example |
| --- | --- | --- |
| Cross-App & Multi-Goal Tasks | Workflows span multiple apps; a single instruction contains multiple subtasks (see the sketch below) | Browser → Markor → Broccoli: “Take today’s calendar screenshot, save it to Markor, send it to Alice, add a related recipe in Broccoli” |
| Long-Term Memory / History Dependency | The task may require recalling previous states or results | “Was the previous task completed?” |
| Browser / Async Tasks | Dynamic web pages, forms, async content | Fill complex forms; click buttons after async JS loads |
| Popup / Distraction Handling & Network Robustness | Random system prompts, ads, delays | Random “allow/deny” prompts; slow-loading webpage |
| Time-Controlled Tasks | Delayed or timed execution | “Record a 10-second audio clip” |
| Non-Home Initial State | The task starts mid-app or on a non-home page | App opened on a subpage; partially filled form |
| Plan-Follow | Execute a pre-defined plan, recover from failure | Follow a sequence of steps; adapt if a step fails |
| Ambiguous Instructions | Vague or incomplete instructions | “Organize yesterday’s notes” |
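
To make the cross-app / multi-goal direction concrete, here is a minimal sketch of a composite task scored by per-subtask checkpoints. It deliberately does not commit to AndroidWorld's actual `TaskEval` interface; `Subtask`, `CrossAppTask`, the helper checks, and the `env` handle are all hypothetical stand-ins, and the Markor path is an assumption:

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


def markor_file_exists(env, filename: str) -> bool:
    """Check for a file in Markor's notebook folder via adb.

    The path below is an assumption about Markor's default storage
    location; adjust it to the configured notebook directory.
    """
    out = subprocess.run(
        ["adb", "shell", "ls", "/sdcard/Documents/markor"],
        capture_output=True, text=True,
    )
    return filename in out.stdout


def message_sent_to(env, contact: str, attachment: str) -> bool:
    # Placeholder: would parse the messaging app's DB or a11y tree.
    return False


def broccoli_has_recipe(env, title: str) -> bool:
    # Placeholder: would query Broccoli's recipe database.
    return False


@dataclass
class Subtask:
    """One checkpoint of a composite task."""
    name: str
    check: Callable  # reads emulator/app state, returns True if passed


class CrossAppTask:
    """A cross-app, multi-goal task scored by per-subtask checkpoints."""

    def __init__(self, subtasks: List[Subtask]):
        self.subtasks = subtasks

    def evaluate(self, env) -> float:
        """Fraction of subtasks whose checkpoint passes."""
        passed = sum(1 for s in self.subtasks if s.check(env))
        return passed / len(self.subtasks)


# Wiring for the Browser -> Markor -> Broccoli example from the table.
task = CrossAppTask([
    Subtask("screenshot saved to Markor",
            lambda env: markor_file_exists(env, "calendar_today.png")),
    Subtask("file sent to Alice",
            lambda env: message_sent_to(env, "Alice", "calendar_today.png")),
    Subtask("recipe added in Broccoli",
            lambda env: broccoli_has_recipe(env, "related recipe")),
])
```

Checkpoint checks of this shape would also give the Partial Success metric below for free, since `evaluate` already returns the fraction of subtasks passed.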

Possible Evaluation Metrics

  • Step Efficiency – ratio of the optimal step count to the number of steps actually executed (see the sketch after this list)
  • Time Efficiency – wall-clock time to complete the task
  • Partial Success – completion rate over the task's subtasks
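
As a rough sketch of how these per-episode metrics could be computed (the `EpisodeResult` fields are illustrative, not an existing AW schema):

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    optimal_steps: int    # length of a reference (oracle) trajectory
    executed_steps: int   # steps the agent actually took
    wall_time_s: float    # episode duration in seconds
    subtasks_passed: int
    subtasks_total: int


def step_efficiency(r: EpisodeResult) -> float:
    # 1.0 means the agent matched the reference trajectory length;
    # capped so shorter-than-reference runs don't score above 1.
    return min(1.0, r.optimal_steps / max(1, r.executed_steps))


def partial_success(r: EpisodeResult) -> float:
    return r.subtasks_passed / max(1, r.subtasks_total)


r = EpisodeResult(optimal_steps=6, executed_steps=9, wall_time_s=74.2,
                  subtasks_passed=2, subtasks_total=3)
print(step_efficiency(r))  # 0.666...
print(partial_success(r))  # 0.666...
```

Time efficiency is harder to normalize, since wall-clock time depends on model latency and emulator speed as much as on agent behavior, so reporting it alongside step efficiency rather than folding both into one score seems safer.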

Other Considerations

  • State Initialization – Currently, AW sets up state by operating the apps directly. One option is to save a separate AVD snapshot and restore it for resets (see the sketch after this list), though this may be slower.
  • Task Evaluation – Depending on the task, success can be determined by parsing the UI XML / a11y tree or by reading system/app state, and key checkpoints could be added during tasks. I would prefer to avoid LLM-based judging for evaluation.
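
For the snapshot-based reset, a rough sketch using the emulator console through `adb emu` (assuming a single running emulator; the snapshot name is arbitrary):

```python
import subprocess


def snapshot(action: str, name: str) -> None:
    """Save or load an AVD snapshot via the emulator console.

    `adb emu` forwards console commands to the running emulator;
    pass `-s <serial>` to adb if several devices are attached.
    """
    subprocess.run(["adb", "emu", "avd", "snapshot", action, name], check=True)


# One-time: boot the emulator, install apps, drive it to the desired
# initial state, then persist that state.
snapshot("save", "aw_clean_state")

# Before each episode: restore the saved state instead of re-driving the UI.
snapshot("load", "aw_clean_state")
```

Whether this beats AW's current direct re-initialization depends on snapshot size: loading large snapshots can add seconds of I/O per reset, which is the "may be slower" caveat above.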

Questions for discussion:

  • Are these task directions reasonable and meaningful?
  • Are there other factors we should consider to better assess the agent's ability?
  • Any suggestions for initial task designs or evaluation metrics?
