User Input Output
───────── ──────
"--prompt 'Go to X and extract Y'" evidence/run_XXXX/
or ├── sample_001/
"--task spec.json --input data.csv" │ ├── 01_page.png (SHA-256 hashed)
│ ├── result.json (extracted fields)
                                    │   └── action_log.json (step trace)
▼ ├── sample_002/...
┌─────────────────┐ └── combined.csv
│ TASK PLANNER │ (--prompt only)
│ Claude converts│
│ plain English │
│ → task spec │
│ → sample URLs │
└────────┬────────┘
│
▼
┌─────────────────┐
│ ORCHESTRATOR │ main.py
│ Load samples │
│ Skip completed │ ← idempotent restart
│ Launch workers │
└────────┬────────┘
│
├──────────────┬──────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ WORKER 1 │ │ WORKER 2 │ │ WORKER N │
│ BrowserCtx │ │ BrowserCtx │ │ BrowserCtx │ ← isolated sessions
│ (sample A) │ │ (sample B) │ │ (sample N) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ AGENT LOOP (per sample) │
│ │
│ for step in range(max_steps): │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ OBSERVE │──▶│ DECIDE │──▶│ ACT │ │
│ │ │ │ │ │ │ │
│ │ DOM │ │ Claude │ │Playwright│ │
│ │ a11y │ │ tool_use│ │ goto │ │
│ │ tree │ │ returns │ │ click │ │
│  │ pruned  │   │ 1 of 12 │   │ type    │    │
│ │ to ~80 │ │ actions │ │ screenshot│ │
│ │ nodes │ │ │ │ done/fail│ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ ▲ │ │
│ └───────────────────────────┘ │
│ loop until done/fail │
└──────────────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ MERGE CSV │ main.py reads all result.json
│ Sort by ID │ → combined.csv
└─────────────────┘
The task planner is used only with `--prompt`. It calls Claude once to convert natural language into a task spec and a sample list:
```
Input:  "Go to torvalds GitHub and extract name, followers, pinned repos"
Output: {
  task_spec: { system_prompt, goal, output_schema, keywords, max_steps... },
  samples:   [{ sample_id: "torvalds", url: "https://github.com/torvalds" }]
}
```
The generated spec is saved to evidence/run_XXXX/generated_task_spec.json for inspection. After this, execution is identical to using a pre-built --task JSON file.
The orchestrator coordinates the run:

- Loads the task spec (from file or planner)
- Loads samples (from CSV, `--url`, or planner)
- Checks `evidence/run_XXXX/` for already-completed samples and skips them (`--resume`)
- Launches N workers via `asyncio.gather` + `Semaphore(N)` for bounded concurrency (sketched below)
- After all workers finish, merges all `result.json` → `combined.csv`
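A minimal sketch of that bounded-concurrency pattern; `run_all` and `run_worker` are illustrative names, not the actual main.py API:

```python
import asyncio

async def run_all(samples, task_spec, run_worker, max_workers: int = 4):
    sem = asyncio.Semaphore(max_workers)          # at most N workers in flight

    async def bounded(sample):
        async with sem:                           # blocks until a slot frees up
            return await run_worker(sample, task_spec)

    # Each worker catches its own exceptions, so gather() only ever sees results
    return await asyncio.gather(*(bounded(s) for s in samples))
```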
One worker per sample. Creates an isolated BrowserContext — own cookies, own session, no bleed between workers. If auth_profile is set (e.g., LinkedIn), loads saved cookies into the context.
```python
ctx = await browser.new_context(color_scheme="light", storage_state=auth_file)
page = await ctx.new_page()
await agent_loop.run(page, sample, task_spec, output_mgr)
```

All exceptions are caught and written to `result.json`. A worker never crashes the batch.
The agent loop is a ReAct cycle that repeats until `done`, `fail`, or `max_steps`:
OBSERVE — DOM extractor reads the page's accessibility tree via Playwright's aria_snapshot(). Raw tree (~2000 nodes) is pruned through 4 passes:
Pass 1: Skip navigation/banner/footer blocks (entire subtrees removed)
Pass 2: Keep semantic roles only (link, button, heading, textbox, checkbox...)
Pass 3: Boost nodes matching task keywords, keep links/buttons always
Pass 4: Trim to 120 nodes max
Result: compact indexed text like:
```
[0] [heading] "Linus Torvalds"
[1] [link] "linux" → https://github.com/torvalds/linux
[2] [button] "Follow"
[3] [textbox] "Search" (value="hello")   ← current input values enriched
```
If dom_confidence < 0.6 (canvas/SVG-heavy pages), vision activates — takes a screenshot and asks Claude a targeted question.
DECIDE — Sends to Claude via Anthropic SDK:
- `system`: the task spec's system_prompt (static, prompt-cached across steps)
- `messages`: one user message with page state + budget-fitted history (5-25 items) + goal + output schema + reflection context
- `tools`: 12 action definitions (each with optional reflection fields)
- `tool_choice: {"type": "any"}` — forces structured output, never prose
Claude returns one or more tool calls. Each includes optional structured reflection:
- `evaluation_previous_step`: did the last action work?
- `memory_update`: key fact to carry forward
- `next_goal`: what the agent intends next
When ENABLE_MULTI_ACTIONS=true, multiple actions can execute per LLM call (max 3 by default).
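A minimal sketch of what this DECIDE call looks like through the Anthropic SDK; the model id, tool schema, and prompt assembly here are abbreviated illustrations, not the agent's actual definitions:

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # shared module-level client (see LLM client notes below)

async def decide(system_prompt: str, dom_text: str, history: str):
    resp = await client.messages.create(
        model="claude-sonnet-4-20250514",          # illustrative model id
        max_tokens=1024,
        system=system_prompt,                      # static across steps, prompt-cacheable
        messages=[{"role": "user",
                   "content": f"{history}\n\nPage state:\n{dom_text}"}],
        tools=[{                                   # one entry per action; the agent registers 12
            "name": "click",
            "description": "Click an element by index, text, or CSS selector.",
            "input_schema": {
                "type": "object",
                "properties": {"selector": {"type": "string"}},
                "required": ["selector"],
            },
        }],
        tool_choice={"type": "any"},               # forces a tool call, never prose
    )
    return [b for b in resp.content if b.type == "tool_use"]
```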
ACT — Dispatches the action to Playwright:
| Action | Playwright Call | Element Resolution |
|---|---|---|
| `goto(url)` | `page.goto()` | Direct URL |
| `click(selector)` | `page.get_by_role()` / `page.get_by_text()` | 3-strategy: index → text → CSS |
| `type(selector, text)` | `page.fill()` | Same 3-strategy |
| `scroll(direction)` | `page.mouse.wheel()` | N/A |
| `screenshot(label)` | `page.screenshot()` | N/A, saves with SHA-256 |
| `extract(selector)` | `locator.inner_text()` | Same 3-strategy |
| `wait(selector)` | `wait_for_selector()` | Text or CSS |
| `download(selector)` | `page.expect_download()` + save | Same 3-strategy |
| `select_option(selector, value)` | `locator.select_option()` | Index/text/label on native `<select>` |
| `done(extracted)` | Validates + writes result | N/A |
| `fail(note)` | Writes failure + exits | N/A |
| `save_progress(extracted, note)` | Checkpoint data, continue | N/A, deep-merges with previous |
Every action returns ActionResult(success, description, error) — never raises. Each dispatch is wrapped in a 60-second timeout to prevent hung workers.
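A sketch of that never-raise contract; the `dispatch` wrapper and `handler` argument are illustrative, only the `ActionResult` shape and the 60-second timeout come from the text above:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ActionResult:
    success: bool
    description: str = ""
    error: str | None = None

async def dispatch(handler, page, **args) -> ActionResult:
    """Run one action handler; never raises, always returns an ActionResult."""
    try:
        # 60-second ceiling so a hung Playwright call cannot wedge the worker
        desc = await asyncio.wait_for(handler(page, **args), timeout=60)
        return ActionResult(True, description=str(desc))
    except Exception as exc:              # asyncio.TimeoutError lands here too
        return ActionResult(False, error=str(exc))
```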
CHECK — When the agent calls `done`:

- Verify `required_fields` are present and not None (but 0/false are valid)
- Verify `required_artifacts` labels match saved screenshot filenames
- If missing + steps remain → bounce back with a notice
- If missing + last step → write `needs_review`
- If all good → write `result.json` + `action_log.json`
SELF-CORRECTION (Escalating Recovery):
The agent has a multi-layered recovery system inspired by browser-use's decision hygiene:
- Structured reflection: every action includes `evaluation_previous_step`, `memory_update`, and `next_goal` fields — explicit working memory instead of incidental text
- Stagnation detection: page signature (URL + DOM hash) tracked across steps. Same page + no new data triggers escalation:
  - Level 1 (3 steps stagnant): gentle nudge — "try a different approach"
  - Level 2 (5 steps stagnant): forceful demand — "CHANGE YOUR STRATEGY NOW" + checkpoint saved
  - Level 3 (8 steps stagnant): forced consolidation — "MUST call done or fail"
- Budget pressure warnings: one-time notices at 75% ("start consolidating") and 90% ("save/finalize NOW") of the step budget
- Last-step tool restriction: on the final step, only `done` and `fail` are available — no wasted actions
- Spam detection: 4+ identical action types on the same URL → forced stop
- Failure recovery: 3+ consecutive failures → inject a list of visible interactive elements
- Final consolidation: when max_steps is exhausted or the LLM fails with accumulated data, one last LLM call produces best-effort structured output
The DOM extractor is the primary perception layer. It converts the browser page into LLM-digestible text:
aria_snapshot() → parse YAML → filter semantic → keyword boost → trim → enrich input values
Input value enrichment: Reads current values from live <input> elements via JavaScript and attaches them to DOM nodes. This prevents the agent from re-filling already-filled form fields.
CDP fallback: If aria_snapshot() returns < 5 nodes (broken a11y tree), falls back to Chrome DevTools Protocol Accessibility.getFullAXTree.
DOM confidence: A penalty score starting at 1.0, computed by injecting JavaScript into the live page:
```python
score = 1.0
score -= 0.3 * (canvas_count / total_nodes)         # <canvas> elements are black boxes to the a11y tree
score -= 0.2 * (missing_aria_labels / interactive)  # icon-only buttons with no text and no aria-label
score -= 0.1 * (svg_count / interactive)            # SVG status icons (✓/✗) have no text equivalent
if semantic_nodes < 10: score -= 0.3                # barely any meaningful elements found
```
- `canvas_count`: `document.querySelectorAll('canvas').length` — charts, maps, and drawing apps are invisible to the DOM
- `missing_aria_labels`: buttons/links/inputs with no `aria-label` and no visible text (e.g., `<button><svg>...</svg></button>` — a hamburger menu icon). The agent can't click what it can't name
- `svg_count`: SVGs often represent visual-only status indicators (green checkmark, red X) that the DOM sees as `[img]` with no text
- `semantic_nodes`: count of nodes with meaningful roles (heading, link, button, textbox, etc.). Below 10 = the page is mostly canvas/images or still loading
Normal pages (GitHub, LinkedIn) score ~0.9. A dashboard with SVG charts and icon buttons might score 0.4 — that triggers vision.
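A sketch of how those counts could be gathered in one `page.evaluate()` round-trip; the selector heuristics here are simplified assumptions:

```python
async def dom_confidence_inputs(page):
    # One JS round-trip collects the raw counts the scoring formula above consumes
    return await page.evaluate("""() => {
        const interactive = document.querySelectorAll('a, button, input, select, textarea');
        const unlabeled = [...interactive].filter(
            el => !el.getAttribute('aria-label') && !el.textContent.trim());
        return {
            canvas_count: document.querySelectorAll('canvas').length,
            svg_count: document.querySelectorAll('svg').length,
            interactive: interactive.length,
            missing_aria_labels: unlabeled.length,
        };
    }""")
```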
The vision fallback activates when `dom_confidence` < 0.6. The question sent to Claude is targeted, not open-ended — it tells Claude exactly what the DOM already captured and asks what's missing:
f"The DOM shows interactive elements but some visual information is missing. "
f"Based on the task goal: '{task_goal}', "
f"what information is visible in this screenshot that the following DOM text does not capture?\n\n"
f"DOM text:\n{dom_context[:1000]}\n\n"
f"Focus on: status icons, color-coded badges, visual indicators, "
f"and any text rendered as images or SVGs."It's not "describe this page." It's: "here's what the DOM already captured, here's the task goal — what visual info is the DOM missing?"
Example: Task is "check CI pipeline status." The page has green/red SVG checkmarks next to build steps. The DOM only sees [img] or [svg] with no text. Vision sees the screenshot and responds: "The icon next to 'build/test' is a green checkmark — status is passing."
The flow: our code reads DOM → our code computes confidence → if low, our code takes a screenshot + builds the targeted question → Claude vision answers → the answer is appended to the DOM text that goes into the DECIDE step.
The LLM client uses `AsyncAnthropic` with a shared module-level client for connection pooling.
The output manager provides deterministic evidence packaging per sample:

- Screenshots: `{counter:02d}_{label}.png` — sequential, never renamed
- SHA-256 hash computed at write time, stored in `result.json`
- Atomic writes: `.tmp` → `Path.replace()` — no partial files on crash (sketched below)
- Download filenames sanitized (path traversal prevention)
- `combined.csv`: single merge at batch end, sorted by sample_id, non-scalar values JSON-serialized
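A minimal sketch of the atomic-write-plus-hash pattern; `write_artifact` is an illustrative name:

```python
import hashlib
from pathlib import Path

def write_artifact(path: Path, data: bytes) -> str:
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)                      # write the sidecar file first
    tmp.replace(path)                          # atomic rename: no partial file on crash
    return hashlib.sha256(data).hexdigest()    # hash computed at write time
```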
Rate limiting: per-domain throttling via `asyncio.Lock`:

```python
RATE_LIMITS = {            # minimum seconds between requests per domain
    "linkedin.com": 3.0,
    "github.com": 0.5,
    "default": 0.2,
}
```

Concurrency-safe — all workers share one event loop, one lock.
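A sketch of the throttle under those limits, reusing `RATE_LIMITS` from above; the function and the sleep-while-holding-the-lock simplification are illustrative:

```python
import asyncio, time
from urllib.parse import urlparse

_lock = asyncio.Lock()
_last_hit: dict[str, float] = {}

async def throttle(url: str):
    domain = urlparse(url).netloc.removeprefix("www.")
    delay = RATE_LIMITS.get(domain, RATE_LIMITS["default"])
    async with _lock:                          # one lock shared by all workers
        wait = _last_hit.get(domain, 0) + delay - time.monotonic()
        if wait > 0:
            await asyncio.sleep(wait)          # simple: serializes hits while holding the lock
        _last_hit[domain] = time.monotonic()
```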
Standard tasks (profile extraction, single-page audit) complete in 2-10 steps. Long-horizon tasks (multi-page audits, cross-link navigation chains) need 30-50+ steps. Multiple mechanisms make this work:
One of the 12 agent actions. It checkpoints partial data without stopping the loop:
```
Step 8:  save_progress({ "prs": [{ "title": "Fix editor...", "author": "alice" }] })
         → checkpoint.json updated, agent continues
Step 16: save_progress({ "prs": [{ "title": "Refactor sync...", "author": "bob" }] })
         → data merged with previous checkpoint, agent continues
Step 22: done({ "total_prs_audited": 2, "all_checks_passed": true })
         → accumulated + final data merged → result.json
```
Data is deep-merged across calls — arrays append, dicts recurse. If the agent crashes at step 20, checkpoint.json has all data from steps 8 and 16.
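A sketch of that merge rule; how scalar conflicts resolve is an assumption, the text only specifies arrays and dicts:

```python
def deep_merge(old, new):
    """Merge checkpoint data: dicts recurse, lists append."""
    if isinstance(old, dict) and isinstance(new, dict):
        merged = dict(old)
        for key, value in new.items():
            merged[key] = deep_merge(old[key], value) if key in old else value
        return merged
    if isinstance(old, list) and isinstance(new, list):
        return old + new                  # arrays append across save_progress calls
    return new                            # scalar conflict: newer value wins (assumption)
```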
`checkpoint.json` is written to the sample's evidence folder every 5 steps and on every `save_progress` call. You can watch it update in real time:
```json
{
  "sample_id": "pr_chain_audit",
  "status": "in_progress",
  "step": 16,
  "accumulated_data": {
    "prs": [
      { "title": "Fix editor crash", "author": "alice", "reviewers": ["bob"] },
      { "title": "Refactor sync module", "author": "bob", "reviewers": ["alice", "carol"] }
    ]
  },
  "progress_notes": ["Completed PR #1 of 5", "Completed PR #2 of 5"],
  "artifacts_so_far": [{"filename": "01_pr_overview.png", "sha256": "..."}],
  "steps_logged": 16,
  "updated_at": "2026-03-27T18:30:00Z"
}
```

Monitor it live: `watch -n 1 cat evidence/run_XXXX/sample_id/checkpoint.json`
Every 10 steps, Claude (fast model — Haiku) summarizes the old history into 2-3 sentences:
```
Steps 1-10: Navigated to the merged PR list, clicked into PR #305569 by benibenj.
Extracted title, author, and reviewer (justschen). Took screenshot of PR overview.
Saved progress with PR #1 data and navigated back to the list.
```
The agent always sees in its prompt:
- Structured LLM summaries of earlier work (FOUND/GAPS/NEXT format)
- Dynamic budget-fitted recent actions (5-25 items, importance-scored)
- Full accumulated data from save_progress (what was collected)
- Structured run state (failed URLs, blocked selectors, dead ends, exhausted pages)
- Step budget ("Step 16 of 40 — 24 remaining")
- Budget warnings at 75% and 90% thresholds
- Memory hints from earlier successful samples in the same run (first 3 steps only)
- Reflection context from recent actions (memory updates, goals)
Uses the fast/cheap model so summary calls cost < $0.001 each.
When the agent clicks a pagination control ("Next", "Load more", "Page 2", and the like), the system detects it and grants +3 bonus steps to the step budget, so pagination doesn't eat into the task's working budget:
```
Step 15 | click("Next page") → OK → Pagination detected → +3 bonus (effective_max=43)
Step 25 | click("Load more") → OK → Pagination detected → +3 bonus (effective_max=46)
```
Detection is keyword-based: next, next page, load more, show more, older, newer, », ›, etc.
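A sketch of that keyword check; the keyword list comes from the text above, while the function name and the hard cap are assumptions:

```python
PAGINATION_KEYWORDS = ("next", "next page", "load more", "show more",
                       "older", "newer", "»", "›")

def pagination_bonus(clicked_text: str, effective_max: int, hard_cap: int = 60) -> int:
    """Grant +3 bonus steps when a click target looks like pagination."""
    label = clicked_text.strip().lower()
    if any(k in label for k in PAGINATION_KEYWORDS):
        return min(effective_max + 3, hard_cap)   # hard_cap is an assumed safety ceiling
    return effective_max
```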
Replaces the original flat watchdog with a 3-level escalating system using page signature hashing:
- Page signature = MD5 of (normalized URL + first 2K of DOM text)
- Same signature + no new data → stagnation counter increments
| Level | Trigger | Response |
|---|---|---|
| 1 (gentle) | 3 stagnant steps | "Try a different approach, scroll, or extract" |
| 2 (forceful) | 5 stagnant steps | "CHANGE YOUR STRATEGY NOW" + checkpoint saved |
| 3 (critical) | 8 stagnant steps | "MUST call done or fail. No more browsing." |
The watchdog resets when genuinely new data arrives (successful extract or save_progress with new data).
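A sketch of the signature from that definition; exactly what URL normalization strips is an assumption:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def page_signature(url: str, dom_text: str) -> str:
    # Normalize: drop query + fragment so trivial URL noise doesn't reset the watchdog
    parts = urlsplit(url)
    normalized = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return hashlib.md5((normalized + dom_text[:2000]).encode()).hexdigest()
```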
For tasks involving 10+ items with individual URLs (e.g., "extract all 200 org members"), the planner can flag needs_discovery: true. The orchestrator then:
- Runs a discovery phase — one agent paginates the listing page, collects all URLs
- Each discovered URL becomes a separate parallel sample
- Samples are distributed across N concurrent workers
This means a "200 org members" task becomes 200 parallel samples (bounded by the concurrency limit), each doing a simple 3-5 step extraction — much faster and more reliable than one agent doing 500+ steps.
Decision boundary:

- Prompt mode (`--prompt`): the planner decides whether discovery is needed
- Manual task mode (`--task`): discovery is explicit via `--discover --start-url`
- Workers never self-trigger discovery: it runs once at the orchestrator layer, then execution workers process the discovered samples
Manual-mode fallback:

- If no `--input`, `--url`, or `--discover` is provided, the orchestrator infers the safest path from the task spec
- Discovery task + concrete `start_url` → auto-run discovery
- Execution task + concrete `start_url` → auto-run one sample
- Placeholder URLs like `https://github.com/{username}` are not auto-runnable and still require input data
The agent doesn't just stop at max_steps. Multiple termination conditions are checked before every step:
| Trigger | Status | Logic |
|---|---|---|
| Agent calls `done` + all requirements met | `done` | Machine-verified fields + artifacts |
| Agent calls `done` + array count < `expected_items` | `partial_success` | Got some but not all items |
| Wall-clock timeout (`max_time_seconds`) | `partial_success` or `failed` | Real time limit for long-running tasks |
| Network circuit breaker (5 consecutive infra errors) | `partial_success` or `failed` | Site down, DNS failure, browser crash |
| Watchdog stall (5 steps, no new data) | warning injected | Agent gets hard nudge to produce data or stop |
| `max_steps` exhausted | `failed` | Hard ceiling (accumulated data saved) |
| LLM API error | `failed` | Claude unreachable |
| Agent calls `fail(reason)` | `failed` | Agent gives up intentionally |
| Budget 75% reached | warning injected | "Start consolidating results" |
| Budget 90% reached | warning injected | "Save/finalize NOW" |
| Final step | tools restricted | Only `done` and `fail` available |
| LLM API error (with data) | `partial_success` | Final consolidation attempted first |
| `max_steps` exhausted (with data) | `partial_success` | Final consolidation attempted first |
Infrastructure error detection classifies errors as infra (timeout, DNS, connection refused, page crashed, SSL) vs logic (element not found, click failed). Only infra errors count toward the circuit breaker — a click failing because the wrong selector was used does NOT trigger early termination.
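A sketch of that classification as a substring check; the marker list mirrors the examples above, the real classifier is presumably richer:

```python
INFRA_MARKERS = ("timeout", "dns", "connection refused", "page crashed", "ssl")

def is_infra_error(error: str) -> bool:
    """Only infra errors count toward the 5-strike circuit breaker."""
    msg = error.lower()
    return any(marker in msg for marker in INFRA_MARKERS)
```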
partial_success status — when the agent collected some data but couldn't finish (e.g., 4 of 5 PRs audited, then the 5th page 404'd), the result is partial_success not failed. The accumulated data is preserved in result.json.
expected_items — task specs can set expected_items: 5. When save_progress is called 5 times, the agent gets a nudge: "All items collected. Call done now." The final done validation also checks array lengths against this count.
New task spec fields:
```json
{
  "max_steps": 50,
  "max_time_seconds": 300,
  "expected_items": 5,
  "max_consecutive_network_errors": 5
}
```

If the agent hits any termination condition, accumulated data is not lost:

- `checkpoint.json` has the latest checkpoint (written on every termination)
- `result.json` includes accumulated data (with the appropriate status)
- `action_log.json` has the full step trace up to the termination point
The agent learns within a run. Memory is stored inside each run's evidence folder — no cross-run leakage, no stale patterns from old tasks.
```
evidence/run_2026-03-29_030929/
├── memory/
│   ├── patterns.json    ← learned from successful samples in THIS run
│   └── failures.json    ← failure patterns from THIS run
├── combined.csv
├── commit_001/
└── commit_002/
```
| Type | File | Learned from | Contains |
|---|---|---|---|
| Procedural patterns | `evidence/run_XXXX/memory/patterns.json` | Successful `done` samples | Action sequences, navigation tips, things to avoid |
| Episodic warnings | `evidence/run_XXXX/memory/failures.json` | `failed` / `partial_success` samples | Dead URLs, broken selectors, failure reasons |
How it works:
- Sample N finishes → Claude Haiku distills its action log into abstract navigation patterns
- Pattern saved to `evidence/run_XXXX/memory/patterns.json`
- Sample N+1 starts → loads memory → gets tips from earlier samples in this run
- Patterns are domain-keyed and task-aware — `get_hints()` ranks by keyword overlap with the current goal (see the sketch after this list)
- Each run starts fresh — no old patterns from different tasks can interfere
- On `--resume`, the existing memory is loaded and updated (not recreated)
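A sketch of keyword-overlap ranking for `get_hints()`; the pattern record shape with `domain`, `keywords`, and `hint` keys is an assumption:

```python
def get_hints(patterns: list[dict], goal: str, domain: str, top_k: int = 3) -> list[str]:
    goal_words = set(goal.lower().split())

    def score(p: dict) -> float:
        if domain not in p.get("domain", ""):     # domain-keyed: wrong domain scores zero
            return 0.0
        keywords = set(p.get("keywords", []))
        return len(goal_words & keywords)         # rank by keyword overlap with the goal

    ranked = sorted(patterns, key=score, reverse=True)
    return [p["hint"] for p in ranked[:top_k] if score(p) > 0]
```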
When ENABLE_MULTI_ACTIONS=true, the LLM can return multiple actions per step:
- Max 3 actions per step (configurable via `MAX_ACTIONS_PER_STEP`)
- Batch-breaking actions: `goto`, `done`, `fail`, `save_progress` abort the remaining batch
- URL change aborts the batch: if a click causes navigation, the remaining actions are stale
- DOM stability check: if the interactive element count shifts >20%, the batch aborts (prevents stale index targeting)
- Per-sub-action logging: every sub-action gets its own `StepRecord` in `action_log.json`
- Fresh DOM per sub-action: element map refreshed before each dispatch
Best for: form fills (type + type + click), repetitive extraction. Off by default.
When ENABLE_FALLBACK_LLM=true and the primary model fails with retryable errors:
- Primary model retried 3x with exponential backoff (sketched below)
- One attempt on `FALLBACK_LLM_MODEL` (default: Claude Haiku)
- Final consolidation also prefers the fallback when the primary just failed
- Model switch is explicitly logged — no silent swaps
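A sketch of the retry-then-fallback policy; the `call` factory and plain `print` logging are simplifications:

```python
import asyncio

async def call_with_fallback(call, primary: str, fallback: str, retries: int = 3):
    for attempt in range(retries):
        try:
            return await call(model=primary)
        except Exception:
            await asyncio.sleep(2 ** attempt)       # exponential backoff: 1s, 2s, 4s
    print(f"model switch: {primary} -> {fallback}") # explicit log, no silent swap
    return await call(model=fallback)               # single fallback attempt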
PR Audit Chain — the showcase for long-horizon execution:

```bash
python main.py --task tasks/github_pr_audit_chain.json \
  --input tasks/inputs/github_pr_chain.csv --no-headless
```

The agent navigates the merged PR list → clicks into each PR → extracts fields → screenshots → checkpoints → navigates back → repeats for 3-5 PRs. ~30-50 steps.
Contributor Deep Audit — cross-page navigation:
```bash
python main.py --task tasks/github_contributor_deep_audit.json \
  --input tasks/inputs/github_contributors.csv --no-headless
```

The agent visits the contributors page → clicks each profile → extracts details → screenshots → checkpoints → navigates back → repeats for the top 3. ~30-40 steps.
Zero site-specific code in any Python file. The agent reads the live DOM and reasons about it. All site-specific knowledge lives in:
- `tasks/*.json` — goal, keywords, output schema, system prompt
- `.env` — credentials and agent behavior tuning (reflection mode, fallback LLM, multi-action batching)
- `evidence/run_XXXX/memory/` — run-local learned navigation patterns and failure warnings
To add a new site: write one JSON file. No code changes.

