diff --git a/README.md b/README.md index b09a99d..650e20a 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ Your App → Token0 Proxy → [Analyze → Classify → Route → Transform → Database (logs every optimization decision + savings) ``` -Token0 applies **11 optimizations** automatically: +Token0 applies **12 optimizations** automatically: ### Core Optimizations (Free Tier) @@ -58,6 +58,8 @@ Token0 applies **11 optimizations** automatically: **11. Saliency-Based ROI Cropping** — Detects which region of an image the prompt is asking about and crops to that region before sending to the LLM. "What's the total on this invoice?" → crops to the bottom 40% of the image. "Read the header" → crops to the top 25%. Rule-based spatial keyword matching (zero ML deps). Delivers ~60% additional pixel reduction on document and form images before any other optimization runs. +**12. Accessibility Tree Routing** — UI automation agents often have both a screenshot and an accessibility tree (AXUIElement, Playwright, Chrome DevTools). Token0 accepts both and routes to the cheaper representation automatically. If the tree is complete (no canvas/iframe/opaque elements), the screenshot is dropped and the tree is serialized as compact text — **93-97% token savings** vs a 1080p screenshot. If the tree has opaque nodes (a Figma canvas, a video element), Token0 keeps the screenshot. Supports Playwright/CDP format, macOS AXUIElement format, and pre-serialized strings. + --- ## Benchmarks @@ -202,10 +204,58 @@ Using OpenAI's published token formulas on real images and GPT-4.1 pricing ($2.0 9. **On cloud APIs, total image savings reach 98.9%** when all optimizations are combined with model cascading. 10. **Video deduplication collapses 60-frame clips to ~10 keyframes** — 13-45% savings on local models, ~83% projected on GPT-4.1. 11. **Model-aware OCR skip is critical** — ultra-efficient encoders like llama3.2-vision use <50 tokens/image; OCR text output would cost more, not less. +12. 
**Accessibility tree routing eliminates screenshot cost entirely** for UI agents — 93-97% savings when the tree is complete; screenshot fallback is automatic when canvas/iframe nodes are detected. + +### Accessibility Tree Benchmark (GPT-4o pricing) + +UI agents that send both a screenshot and an accessibility tree can route to the cheaper representation automatically. + +**Real browser results — Playwright, 1280×720, live pages** (actual reported prompt_tokens): + +| Page | Screenshot Tokens | Tree Tokens | Savings | Model | +|---|---|---|---|---| +| Hacker News | 750 | 192 | **74.4%** | moondream | +| Hacker News | 602 | 164 | **72.8%** | llava:7b | +| GitHub Home | 751 | 560 | **25.4%** | moondream | +| GitHub Home | 601 | 560 | **6.8%** | llava:7b | +| Wikipedia | 747 | 747 | **0%** | moondream | +| Wikipedia | 599 | 1,165 | **-94.5%** | llava:7b — tree too large | + +> Wikipedia's rich navigation tree exceeded the screenshot token count on llava:7b — token0 would correctly fall back to the screenshot in this case. Hacker News (minimal DOM) shows the best real-world savings. + +**Ollama model results — 7 vision models, synthetic 800×600 screenshots** (actual reported prompt_tokens): + +> Synthetic screenshots: PIL-generated images with drawn UI elements (login form, todo list). Not real browser screenshots. 
+ +| Model | Screenshot Tokens | Tree Tokens | Savings | Note | +|---|---|---|---|---| +| granite3.2-vision | 10,328 | 218 | **97.9%** | High-res encoder | +| moondream | 1,500 | 168 | **88.8%** | | +| llava:7b | 1,202 | 160 | **86.7%** | | +| llava-llama3 | 1,201 | 164 | **86.3%** | | +| minicpm-v | 704 | 128 | **81.8%** | | +| gemma3:4b | 566 | 145 | **74.4%** | | +| llama3.2-vision | 46 | 130 | n/a | Ultra-efficient encoder — tree costs more; screenshot wins | + +**Cloud API extrapolation** (tree tokens from Ollama measurements, screenshot tokens from published formulas, 800×600 images): + +| Provider | Screenshot Tokens (2 scenarios) | Tree Tokens (avg per scenario) | Savings | At 100K calls/day, saved/mo | +|---|---|---|---|---| +| OpenAI GPT-4o | 1,530 | ~80 | **89.6%** | **~$10,282** | +| Anthropic Claude | 1,280 | ~80 | **87.6%** | **~$10,089** | + +> Tree token counts are text-based and roughly provider-agnostic (~4 chars/token). Screenshot tokens use OpenAI tile formula (85 + 170×tiles) and Anthropic pixel formula (w×h/750), summed over the two benchmark scenarios (2 × 765 = 1,530 and 2 × 640 = 1,280). Savings and monthly figures compare those totals against the two-scenario tree total (~159 tokens, i.e. ~80 per scenario). Canvas/iframe nodes trigger automatic screenshot fallback — no configuration needed. 
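The two screenshot formulas quoted above can be checked by hand. The following is a minimal sketch, assuming the simplified forms used in this benchmark (512px tiles with no provider-side resizing step, and pixels divided by 750); the function names are hypothetical and are not part of Token0's API.

```python
import math


def openai_tile_tokens(width: int, height: int, tile: int = 512) -> int:
    """Simplified tile formula from this README: 85 + 170 per 512px tile.

    Assumption: the image is tiled at its given size; any resizing a
    provider applies before tiling is deliberately left out.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 85 + 170 * tiles


def anthropic_pixel_tokens(width: int, height: int) -> int:
    """Simplified pixel formula from this README: width * height / 750."""
    return round(width * height / 750)


def savings_pct(before: int, after: int) -> float:
    """Percentage saved when `after` tokens replace `before` tokens."""
    return (before - after) / before * 100 if before else 0.0


if __name__ == "__main__":
    # 800x600 synthetic screenshot used in the extrapolation table
    print(openai_tile_tokens(800, 600))       # 2 x 2 tiles
    print(anthropic_pixel_tokens(800, 600))
    # 1080p screenshot vs the 132-token tree from the usage example
    print(round(savings_pct(openai_tile_tokens(1920, 1080), 132), 1))
```

Under these assumptions an 800×600 image costs 765 tokens (OpenAI) or 640 tokens (Anthropic), which is where the per-scenario figures in the table come from.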
+ +Run benchmarks: +```bash +python -m benchmarks.bench_ax_tree # formula-based projections +python -m benchmarks.bench_ax_tree_models # all 7 Ollama vision models (synthetic) +python -m benchmarks.bench_ax_tree_real # real browser pages via Playwright +``` ### Additional Test Coverage -Token0 includes **171 unit tests** and benchmarks across multiple suites: +Token0 includes **216 unit tests** and benchmarks across multiple suites: | Suite | Tests | What It Validates | |---|---|---| @@ -222,6 +272,7 @@ Token0 includes **171 unit tests** and benchmarks across multiple suites: | `pdf` | 8 | PDF detection, decode, text extraction, token estimation | | `estimate` | 11 | /v1/estimate endpoint: single image, multiple images, remote URL skip, cost calc | | `langchain` | 8 | LangChain callback: import, text passthrough, image optimization, role mapping | +| `ax_tree` | 22 | AX tree serialize, opaque detection, AXUIElement format, combo routing | --- @@ -369,6 +420,40 @@ response = client.chat.completions.create( # ~83% savings on GPT-4.1 ``` +### Accessibility Tree Support (UI Agents) + +If your agent captures both a screenshot and an accessibility tree, send both — Token0 picks the cheaper path automatically: + +```python +import json + +# Playwright example +page = await browser.new_page() +snapshot = await page.accessibility.snapshot() # returns a dict + +response = client.chat.completions.create( + model="gpt-4.1", + messages=[{ + "role": "user", + "content": [ + {"type": "text", "text": "What button should I click to submit the form?"}, + # Screenshot fallback — only used if tree has canvas/iframe nodes + {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}, + # Accessibility tree — token0 routes to this when complete + {"type": "accessibility_tree", "accessibility_tree": { + "data": snapshot, + "source": "playwright" + }}, + ] + }], + extra_headers={"X-Provider-Key": "sk-..."} +) +# GitHub PR page: 2,125 tokens (screenshot) → 132 tokens (tree) — 
93.8% savings +# response.token0.optimizations_applied = ["ax tree → text (1,993 tokens saved vs screenshot)"] +``` + +Works with **Playwright**, **macOS AXUIElement**, **Chrome DevTools Protocol**, and pre-serialized strings. Canvas, iframe, and video elements trigger automatic screenshot fallback — no configuration needed. + ### Streaming Support Token0 supports `stream=true` — images are optimized before streaming begins, then tokens flow word-by-word via SSE: diff --git a/benchmarks/bench_ax_tree.py b/benchmarks/bench_ax_tree.py new file mode 100644 index 0000000..ee204c0 --- /dev/null +++ b/benchmarks/bench_ax_tree.py @@ -0,0 +1,359 @@ +"""Benchmark: AX tree routing vs raw screenshot token cost. + +Measures token savings when token0 routes an accessibility tree to text +instead of passing a screenshot to the LLM. + +Four scenarios: + 1. Screenshot only — baseline (what everyone does today) + 2. AX tree only — best case (no screenshot at all) + 3. Combo (screenshot + tree, tree is complete) — token0 drops screenshot + 4. 
Combo (screenshot + tree, tree has canvas) — token0 keeps screenshot + +Usage: + python -m benchmarks.bench_ax_tree +""" + +from __future__ import annotations + +import sys +import textwrap +from pathlib import Path + +# --------------------------------------------------------------------------- +# Representative AX trees (no real browser needed) +# --------------------------------------------------------------------------- + +# Typical GitHub PR page — all interactive elements, no canvas +GITHUB_PR_TREE = { + "role": "WebArea", + "name": "Pull request #42 · Pritom14/token0", + "children": [ + { + "role": "navigation", + "name": "Main", + "children": [ + {"role": "link", "name": "Code", "children": []}, + {"role": "link", "name": "Issues", "children": []}, + {"role": "link", "name": "Pull requests", "children": []}, + ], + }, + { + "role": "main", + "name": "", + "children": [ + {"role": "heading", "name": "feat: AX tree routing", "children": []}, + { + "role": "group", + "name": "PR actions", + "children": [ + {"role": "button", "name": "Merge pull request", "children": []}, + {"role": "button", "name": "Close pull request", "children": []}, + ], + }, + { + "role": "list", + "name": "Commits", + "children": [ + { + "role": "listitem", + "name": "feat: AX tree routing — accept accessibility_tree content parts", + "children": [], + }, + { + "role": "listitem", + "name": "fix: remove unused pytest import", + "children": [], + }, + ], + }, + { + "role": "group", + "name": "Review", + "children": [ + {"role": "radio", "name": "Comment", "children": []}, + {"role": "radio", "name": "Approve", "children": []}, + {"role": "radio", "name": "Request changes", "children": []}, + {"role": "button", "name": "Submit review", "children": []}, + ], + }, + ], + }, + ], +} + +# Figma editor — has canvas element (opaque, needs screenshot) +FIGMA_TREE = { + "role": "application", + "name": "Figma", + "children": [ + { + "role": "toolbar", + "name": "Tools", + "children": [ + {"role": 
"button", "name": "Move", "children": []}, + {"role": "button", "name": "Frame", "children": []}, + {"role": "button", "name": "Text", "children": []}, + ], + }, + { + "role": "main", + "name": "Canvas", + "children": [ + # The actual design is rendered in a canvas — not accessible + {"role": "canvas", "name": "", "children": []}, + ], + }, + { + "role": "complementary", + "name": "Layers", + "children": [ + {"role": "treeitem", "name": "Frame 1", "children": []}, + {"role": "treeitem", "name": "Button component", "children": []}, + ], + }, + ], +} + +# macOS Finder — AXUIElement format +FINDER_AXUI_TREE = { + "AXRole": "AXWindow", + "AXTitle": "Finder", + "AXChildren": [ + { + "AXRole": "AXToolbar", + "AXTitle": "", + "AXChildren": [ + {"AXRole": "AXButton", "AXTitle": "Back", "AXEnabled": True, "AXChildren": []}, + {"AXRole": "AXButton", "AXTitle": "Forward", "AXEnabled": False, "AXChildren": []}, + { + "AXRole": "AXTextField", + "AXTitle": "Search", + "AXValue": "", + "AXEnabled": True, + "AXChildren": [], + }, + ], + }, + { + "AXRole": "AXOutline", + "AXTitle": "Files", + "AXChildren": [ + { + "AXRole": "AXRow", + "AXTitle": "Documents", + "AXChildren": [ + { + "AXRole": "AXRow", + "AXTitle": "runbookai", + "AXChildren": [], + }, + { + "AXRole": "AXRow", + "AXTitle": "token0", + "AXChildren": [], + }, + ], + }, + {"AXRole": "AXRow", "AXTitle": "Downloads", "AXChildren": []}, + {"AXRole": "AXRow", "AXTitle": "Desktop", "AXChildren": []}, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Token estimation helpers (no LLM calls needed) +# --------------------------------------------------------------------------- + +# GPT-4o: 1080p screenshot (1920×1080) → high detail +# = 85 + 170 × ceil(1920/512) × ceil(1080/512) = 85 + 170 × 4 × 3 = 2,125 tiles tokens +# real-world measurements land around 1,500–5,000 depending on content; use 2,125 as baseline +SCREENSHOT_1080P_TOKENS = 2_125 + +# Same screenshot but resized 
by token0 to provider max (2048px longest edge) +# 2048×1152 → tiles: ceil(2048/512)×ceil(1152/512) = 4×3 = 12 tiles = 2125 tokens (same for 1080p) +# For a 4K screenshot (3840×2160): ceil(3840/512)×ceil(2160/512) = 8×5 = 40 tiles = 6885 tokens raw; token0 would resize to 2048×1152: +SCREENSHOT_4K_TOKENS_RAW = 6_885 # 4K without any optimization +SCREENSHOT_4K_TOKENS_RESIZED = 2_125 # after token0 resize to 2048px + +COST_PER_TOKEN_USD = 2.50 / 1_000_000 # GPT-4o input + + +def _ax_tokens(tree) -> int: + from token0.optimization.ax_tree import estimate_ax_tree_tokens, serialize_ax_tree + + return estimate_ax_tree_tokens(serialize_ax_tree(tree)) + + +def _ax_serialized(tree) -> str: + from token0.optimization.ax_tree import serialize_ax_tree + + return serialize_ax_tree(tree) + + +def _is_opaque(tree) -> bool: + from token0.optimization.ax_tree import has_opaque_nodes + + return has_opaque_nodes(tree) + + +# --------------------------------------------------------------------------- +# Benchmark runner +# --------------------------------------------------------------------------- + +WIDTH = 72 + +def _header(title: str) -> None: + print() + print("=" * WIDTH) + print(f" {title}") + print("=" * WIDTH) + + +def _row(label: str, tokens: int, cost_usd: float, note: str = "") -> None: + savings_col = f" {note}" if note else "" + print(f" {label:<38} {tokens:>6,} tokens ${cost_usd:.4f}{savings_col}") + + +def _divider() -> None: + print(" " + "-" * (WIDTH - 2)) + + +def run_scenario(name: str, tree, screenshot_tokens: int) -> dict: + from token0.optimization.ax_tree import ( + estimate_ax_tree_tokens, + has_opaque_nodes, + serialize_ax_tree, + ) + + serialized = serialize_ax_tree(tree) + tree_tokens = estimate_ax_tree_tokens(serialized) + opaque = has_opaque_nodes(tree) + + if opaque: + # token0 keeps screenshot, drops tree + optimized_tokens = screenshot_tokens + strategy = "screenshot kept (opaque nodes)" + else: + # token0 drops screenshot, uses tree text + optimized_tokens = tree_tokens + strategy = "tree text used (screenshot 
dropped)" + + savings = screenshot_tokens - optimized_tokens + savings_pct = savings / screenshot_tokens * 100 if screenshot_tokens else 0 + cost_before = screenshot_tokens * COST_PER_TOKEN_USD + cost_after = optimized_tokens * COST_PER_TOKEN_USD + + return { + "name": name, + "screenshot_tokens": screenshot_tokens, + "tree_tokens": tree_tokens, + "optimized_tokens": optimized_tokens, + "savings": savings, + "savings_pct": savings_pct, + "cost_before": cost_before, + "cost_after": cost_after, + "strategy": strategy, + "opaque": opaque, + "serialized_chars": len(serialized), + } + + +def main() -> None: + sys.path.insert(0, str(Path(__file__).parent.parent)) + + scenarios = [ + ("GitHub PR page (Playwright tree)", GITHUB_PR_TREE, SCREENSHOT_1080P_TOKENS), + ("Figma editor (canvas — opaque)", FIGMA_TREE, SCREENSHOT_1080P_TOKENS), + ("macOS Finder (AXUIElement)", FINDER_AXUI_TREE, SCREENSHOT_1080P_TOKENS), + ("4K screenshot, no tree (baseline)", None, SCREENSHOT_4K_TOKENS_RAW), + ("4K screenshot + Finder tree", FINDER_AXUI_TREE, SCREENSHOT_4K_TOKENS_RAW), + ] + + results = [] + for name, tree, shot_tokens in scenarios: + if tree is None: + # Baseline: no tree, no optimization + r = { + "name": name, + "screenshot_tokens": shot_tokens, + "tree_tokens": 0, + "optimized_tokens": shot_tokens, + "savings": 0, + "savings_pct": 0.0, + "cost_before": shot_tokens * COST_PER_TOKEN_USD, + "cost_after": shot_tokens * COST_PER_TOKEN_USD, + "strategy": "no tree provided — passthrough", + "opaque": False, + "serialized_chars": 0, + } + else: + r = run_scenario(name, tree, shot_tokens) + results.append(r) + + # --------------------------------------------------------------------------- + # Print results + # --------------------------------------------------------------------------- + _header("AX Tree Routing — Token Savings Benchmark (GPT-4o pricing)") + + for r in results: + print() + print(f" Scenario: {r['name']}") + print(f" Strategy: {r['strategy']}") + if r["serialized_chars"]: 
+ print(f" Tree size: {r['serialized_chars']:,} chars → {r['tree_tokens']:,} tokens") + _divider() + _row("Screenshot (no optimization)", r["screenshot_tokens"], r["cost_before"]) + _row( + "token0 optimized", + r["optimized_tokens"], + r["cost_after"], + f" (-{r['savings_pct']:.1f}%)" if r["savings_pct"] else "", + ) + if r["savings"] > 0: + print(f" >> Saved: {r['savings']:,} tokens ${r['cost_before'] - r['cost_after']:.4f}/call") + + # --------------------------------------------------------------------------- + # At-scale projection + # --------------------------------------------------------------------------- + _header("At-Scale Projection — GitHub PR agent (100K calls/day)") + + github_r = results[0] # GitHub PR tree + calls_per_day = 100_000 + days = 30 + + before_daily = github_r["cost_before"] * calls_per_day + after_daily = github_r["cost_after"] * calls_per_day + before_monthly = before_daily * days + after_monthly = after_daily * days + + print(f"\n Per call: ${github_r['cost_before']:.4f} → ${github_r['cost_after']:.4f}") + print(f" Daily: ${before_daily:,.2f} → ${after_daily:,.2f}") + print(f" Monthly: ${before_monthly:,.2f} → ${after_monthly:,.2f}") + print(f" Saved/mo: ${before_monthly - after_monthly:,.2f} ({github_r['savings_pct']:.1f}%)") + + # --------------------------------------------------------------------------- + # Summary table + # --------------------------------------------------------------------------- + _header("Summary") + print(f"\n {'Scenario':<42} {'Before':>8} {'After':>8} {'Savings':>10}") + print(" " + "-" * 70) + for r in results: + pct = f"-{r['savings_pct']:.1f}%" if r["savings_pct"] else "n/a" + print( + f" {r['name']:<42} {r['screenshot_tokens']:>6,}t " + f"{r['optimized_tokens']:>6,}t {pct:>10}" + ) + + print() + print(" Notes:") + print(" - Token counts use GPT-4o tile formula (85 + 170×tiles)") + print(" - 1080p screenshot = 1920×1080 = 12 tiles = 2,125 tokens") + print(" - AX tree tokens estimated at 4 chars/token") 
+ print(" - Figma (canvas) forces screenshot path — no savings expected") + print() + + +if __name__ == "__main__": + main() diff --git a/benchmarks/bench_ax_tree_models.py b/benchmarks/bench_ax_tree_models.py new file mode 100644 index 0000000..0e520e6 --- /dev/null +++ b/benchmarks/bench_ax_tree_models.py @@ -0,0 +1,422 @@ +"""Benchmark: AX tree routing vs screenshot images on real Ollama vision models. + +Compares two input modalities for the same UI: + - Screenshot: PIL image (base64 JPEG data URI) + - AX Tree: serialized accessibility tree as plain text + +Measures real prompt_tokens from Ollama for both, calculates savings. + +Usage: + python -m benchmarks.bench_ax_tree_models + python -m benchmarks.bench_ax_tree_models --model moondream + python -m benchmarks.bench_ax_tree_models --model llava:7b --model minicpm-v +""" + +import argparse +import asyncio +import base64 +import io +import time +from typing import Optional + +from PIL import Image, ImageDraw + +from token0.optimization.ax_tree import serialize_ax_tree +from token0.providers.ollama import OllamaProvider + +VISION_MODELS = [ + "moondream", + "llava:7b", + "llava-llama3", + "minicpm-v", + "gemma3:4b", + "granite3.2-vision", + "llama3.2-vision", +] + + +def _pil_to_data_uri(img: Image.Image, quality: int = 85) -> str: + """Convert PIL Image to base64 JPEG data URI.""" + buf = io.BytesIO() + img.save(buf, format="JPEG", quality=quality) + b64 = base64.b64encode(buf.getvalue()).decode() + return f"data:image/jpeg;base64,{b64}" + + +def _create_login_form_screenshot() -> Image.Image: + """Create a login form screenshot: header, email/password fields, login button, forgot link.""" + img = Image.new("RGB", (800, 600), color="white") + draw = ImageDraw.Draw(img) + + # Gray header bar + draw.rectangle([0, 0, 800, 50], fill="lightgray") + + # "Sign In" heading (top center) + draw.text((300, 80), "Sign In", fill="black") + + # Email label + draw.text((200, 180), "Email", fill="black") + # Email input box + 
draw.rectangle([200, 200, 600, 230], outline="black") + + # Password label + draw.text((200, 260), "Password", fill="black") + # Password input box + draw.rectangle([200, 280, 600, 310], outline="black") + + # Blue "Log In" button + draw.rectangle([300, 330, 500, 370], fill="blue") + draw.text((340, 345), "Log In", fill="white") + + # "Forgot password?" link + draw.text((310, 400), "Forgot password?", fill="blue") + + return img + + +def _create_todo_list_screenshot() -> Image.Image: + """Create a todo list screenshot with 3 tasks (one checked) and add button.""" + img = Image.new("RGB", (800, 600), color="white") + draw = ImageDraw.Draw(img) + + # "My Tasks" heading + draw.text((300, 40), "My Tasks", fill="black") + + # Task row 1: Buy groceries (checked) + draw.rectangle([200, 120, 220, 140], fill="green") # checked box + draw.text((230, 120), "Buy groceries", fill="black") + + # Task row 2: Write report (unchecked) + draw.rectangle([200, 180, 220, 200], outline="black") # empty box + draw.text((230, 180), "Write report", fill="black") + + # Task row 3: Call dentist (unchecked) + draw.rectangle([200, 240, 220, 260], outline="black") # empty box + draw.text((230, 240), "Call dentist", fill="black") + + # Green "Add Task" button + draw.rectangle([300, 340, 500, 380], fill="green") + draw.text((340, 355), "Add Task", fill="white") + + return img + + +def _create_login_ax_tree() -> dict: + """Return login form accessibility tree.""" + return { + "role": "WebArea", + "name": "Sign In", + "children": [ + {"role": "heading", "name": "Sign In", "children": []}, + {"role": "textbox", "name": "Email", "value": "", "children": []}, + {"role": "textbox", "name": "Password", "value": "", "children": []}, + {"role": "button", "name": "Log In", "children": []}, + {"role": "link", "name": "Forgot password?", "children": []}, + ], + } + + +def _create_todo_ax_tree() -> str: + """Return todo list tree as serialized text (to include checked state).""" + # Manually build the tree to 
preserve "checked" state info + tree_text = """WebArea "My Tasks" + heading "My Tasks" + list "Tasks" + checkbox "Buy groceries" [checked] + checkbox "Write report" + checkbox "Call dentist" + button "Add Task" +""" + return tree_text.strip() + + +async def run_ax_tree_scenario( + model: str, + provider: OllamaProvider, + scenario_name: str, + question: str, + screenshot: Image.Image, + ax_tree: str, + required_substrings: list[str], +) -> Optional[dict]: + """Run a single AX tree scenario: screenshot vs tree. Returns result dict or None on error.""" + print(f"\n Scenario: {scenario_name}") + print(f' Question: "{question}"') + + # --- Screenshot path --- + print(" Screenshot: ", end="", flush=True) + data_uri = _pil_to_data_uri(screenshot) + screenshot_messages = [ + { + "role": "user", + "content": [ + {"type": "text", "text": question}, + { + "type": "image_url", + "image_url": {"url": data_uri, "detail": "auto"}, + }, + ], + } + ] + + screenshot_start = time.time() + try: + screenshot_resp = await provider.chat_completion( + model=model, messages=screenshot_messages, max_tokens=200 + ) + screenshot_latency = int((time.time() - screenshot_start) * 1000) + screenshot_tokens = screenshot_resp.prompt_tokens + screenshot_text = screenshot_resp.content + print(f"{screenshot_tokens:,} tokens | {screenshot_latency}ms") + except Exception as e: + print(f"ERROR: {e}") + return None + + # --- Tree path --- + print(" AX Tree: ", end="", flush=True) + tree_question = f"{question}\n\nUI Accessibility Tree:\n{ax_tree}" + tree_messages = [ + { + "role": "user", + "content": [{"type": "text", "text": tree_question}], + } + ] + + tree_start = time.time() + try: + tree_resp = await provider.chat_completion( + model=model, messages=tree_messages, max_tokens=200 + ) + tree_latency = int((time.time() - tree_start) * 1000) + tree_tokens = tree_resp.prompt_tokens + tree_text = tree_resp.content + print(f"{tree_tokens:,} tokens | {tree_latency}ms", end="") + + # Calculate savings + 
saved = screenshot_tokens - tree_tokens + pct = (saved / screenshot_tokens * 100) if screenshot_tokens > 0 else 0 + print(f" ({-pct:.1f}%)") + except Exception as e: + print(f"ERROR: {e}") + return None + + # --- Verify screenshot answer contains key items (tree may interpret differently) --- + screenshot_lower = screenshot_text.lower() + screenshot_has_items = all( + substring.lower() in screenshot_lower for substring in required_substrings + ) + + print(f" Screenshot captured key items: {'YES' if screenshot_has_items else 'NO'}") + print(f' Screenshot: "{screenshot_text[:60]}..."') + print(f' Tree: "{tree_text[:60]}..."') + + return { + "scenario": scenario_name, + "question": question, + "screenshot_tokens": screenshot_tokens, + "tree_tokens": tree_tokens, + "tokens_saved": saved, + "savings_pct": round(pct, 1), + "screenshot_latency_ms": screenshot_latency, + "tree_latency_ms": tree_latency, + "screenshot_answer": screenshot_text, + "tree_answer": tree_text, + "screenshot_captured_items": screenshot_has_items, + } + + +async def run_all_benchmarks(models: list[str]): + """Run AX tree benchmarks for all models.""" + provider = OllamaProvider(base_url="http://localhost:11434/v1") + + print("=" * 80) + print(" AX Tree Routing Benchmark — Real Ollama Models") + print("=" * 80) + + # Create test scenarios + scenarios = [ + { + "name": "Login Form", + "question": "List every interactive element on this page (buttons, links, inputs).", + "screenshot": _create_login_form_screenshot(), + "ax_tree": serialize_ax_tree(_create_login_ax_tree()), + "required_substrings": ["email", "password", "log in"], + }, + { + "name": "Todo List", + "question": "How many tasks are shown and which ones are completed?", + "screenshot": _create_todo_list_screenshot(), + "ax_tree": _create_todo_ax_tree(), + "required_substrings": ["buy groceries"], + }, + ] + + all_results = {} + + for model in models: + print(f"\n{'=' * 80}") + print(f" Model: {model}") + print(f"{'=' * 80}") + + 
model_results = [] + + # Check if model is available + try: + await provider.chat_completion( + model=model, + messages=[{"role": "user", "content": [{"type": "text", "text": "test"}]}], + max_tokens=5, + ) + except Exception as e: + print(f" SKIPPED: Model not available ({e})") + continue + + for scenario in scenarios: + result = await run_ax_tree_scenario( + model=model, + provider=provider, + scenario_name=scenario["name"], + question=scenario["question"], + screenshot=scenario["screenshot"], + ax_tree=scenario["ax_tree"], + required_substrings=scenario["required_substrings"], + ) + if result: + model_results.append(result) + + all_results[model] = model_results + + # Print model summary + if model_results: + total_screenshot = sum(r["screenshot_tokens"] for r in model_results) + total_tree = sum(r["tree_tokens"] for r in model_results) + total_saved = total_screenshot - total_tree + total_pct = (total_saved / total_screenshot * 100) if total_screenshot > 0 else 0 + + print(f"\n --- {model} Summary ---") + print(f" {'Scenario':<20s} {'Screenshot':>12s} {'Tree':>8s} {'Savings':>8s}") + print(f" {'-' * 20} {'-' * 12} {'-' * 8} {'-' * 8}") + for r in model_results: + print( + f" {r['scenario']:<20s} {r['screenshot_tokens']:>12,} " + f"{r['tree_tokens']:>8,} {r['savings_pct']:>7.1f}%" + ) + print(f" {'TOTAL':<20s} {total_screenshot:>12,} {total_tree:>8,} {total_pct:>7.1f}%") + + # --- Grand summary across all models --- + print(f"\n{'=' * 80}") + print(" Grand Summary — All Models") + print(f"{'=' * 80}") + print(f"\n {'Model':<20s} {'Screenshot':>12s} {'Tree':>12s} {'Savings':>8s}") + print(f" {'-' * 20} {'-' * 12} {'-' * 12} {'-' * 8}") + + for model, results in all_results.items(): + if results: + total_screenshot = sum(r["screenshot_tokens"] for r in results) + total_tree = sum(r["tree_tokens"] for r in results) + total_saved = total_screenshot - total_tree + pct = (total_saved / total_screenshot * 100) if total_screenshot > 0 else 0 + print(f" {model:<20s} 
{total_screenshot:>12,} {total_tree:>12,} {pct:>7.1f}%") + + print(f"\n{'=' * 80}\n") + + # --- Cloud API extrapolation --- + # Tree tokens are text — roughly constant across all models and providers. + # Screenshot tokens for OpenAI/Anthropic are calculated from their published formulas. + # We use the average tree tokens measured across all Ollama models as our estimate. + successful = {m: r for m, r in all_results.items() if r} + if not successful: + return + + all_tree_tokens = [s["tree_tokens"] for r in successful.values() for s in r] + avg_tree_tokens_per_scenario = sum(all_tree_tokens) / len(all_tree_tokens) + num_scenarios = len(scenarios) + total_avg_tree = avg_tree_tokens_per_scenario * num_scenarios + + # OpenAI GPT-4o: 800x600 JPEG → tile formula (512px tiles) + # tiles = ceil(800/512) * ceil(600/512) = 2 * 2 = 4 tiles + # tokens = 85 + 170 * 4 = 765 per image + openai_screenshot_per_scenario = 765 + openai_total_screenshot = openai_screenshot_per_scenario * num_scenarios + + # Anthropic Claude: pixels / 750 + # 800 * 600 / 750 = 640 per image + anthropic_screenshot_per_scenario = 640 + anthropic_total_screenshot = anthropic_screenshot_per_scenario * num_scenarios + + def _savings(before, after): + saved = before - after + pct = saved / before * 100 if before else 0 + return saved, pct + + openai_saved, openai_pct = _savings(openai_total_screenshot, total_avg_tree) + anthropic_saved, anthropic_pct = _savings(anthropic_total_screenshot, total_avg_tree) + + # Pricing (input tokens) + openai_price_per_m = 2.50 # GPT-4o + anthropic_price_per_m = 3.00 # Claude Sonnet + + openai_cost_before = openai_total_screenshot * openai_price_per_m / 1_000_000 + openai_cost_after = total_avg_tree * openai_price_per_m / 1_000_000 + anthropic_cost_before = anthropic_total_screenshot * anthropic_price_per_m / 1_000_000 + anthropic_cost_after = total_avg_tree * anthropic_price_per_m / 1_000_000 + + print("=" * 80) + print(" Cloud API Extrapolation (based on avg 
Ollama tree token measurements)") + print("=" * 80) + avg_str = f"{avg_tree_tokens_per_scenario:.0f}" + print(f"\n Avg tree tokens/scenario across Ollama models: {avg_str}") + print(f" Total tree tokens ({num_scenarios} scenarios): {total_avg_tree:.0f}") + print() + hdr = f" {'Provider':<22} {'Screenshot':>12} {'Tree':>8} {'Savings':>9} {'$/1M saved':>12}" + print(hdr) + print(f" {'-' * 22} {'-' * 12} {'-' * 8} {'-' * 9} {'-' * 12}") + + for label, shot_tok, pct, cb, ca in [ + ("OpenAI GPT-4o", openai_total_screenshot, openai_pct, + openai_cost_before, openai_cost_after), + ("Anthropic Claude", anthropic_total_screenshot, anthropic_pct, + anthropic_cost_before, anthropic_cost_after), + ]: + saved_per_m = (cb - ca) * 1_000_000 + print( + f" {label:<22} {shot_tok:>12,} {total_avg_tree:>8.0f}" + f" {pct:>8.1f}% ${saved_per_m:>10,.0f}" + ) + + print() + print(" At-scale (100K UI agent calls/day, 30 days):") + print(f" {'Provider':<22} {'Direct/mo':>12} {'Token0/mo':>12} {'Saved/mo':>12}") + print(f" {'-' * 22} {'-' * 12} {'-' * 12} {'-' * 12}") + calls = 100_000 * 30 + for label, cost_before, cost_after in [ + ("OpenAI GPT-4o", openai_cost_before, openai_cost_after), + ("Anthropic Claude", anthropic_cost_before, anthropic_cost_after), + ]: + mo_before = cost_before * calls + mo_after = cost_after * calls + saved_mo = mo_before - mo_after + print(f" {label:<22} ${mo_before:>10,.0f} ${mo_after:>10,.0f} ${saved_mo:>10,.0f}") + + print() + print(" Notes:") + print(" - Screenshot tokens: OpenAI tile formula (85 + 170×tiles), Anthropic w×h/750") + print(" - Tree tokens: measured from real Ollama calls — text tokenization is") + print(" provider-agnostic (~4 chars/token, consistent across OpenAI/Anthropic/Ollama)") + print(" - Image size: 800×600 synthetic screenshots (matches our benchmark)") + print(f"\n{'=' * 80}\n") + + +def main(): + parser = argparse.ArgumentParser(description="AX tree routing benchmark against Ollama models") + parser.add_argument( + "--model", 
action="append", help="Ollama model(s) to test (can specify multiple)" + ) + args = parser.parse_args() + + models = args.model or VISION_MODELS + asyncio.run(run_all_benchmarks(models)) + + +if __name__ == "__main__": + main() diff --git a/benchmarks/bench_ax_tree_real.py b/benchmarks/bench_ax_tree_real.py new file mode 100644 index 0000000..ecdbd58 --- /dev/null +++ b/benchmarks/bench_ax_tree_real.py @@ -0,0 +1,686 @@ +"""Benchmark: AX tree routing on REAL browser pages via Playwright. + +Requires Ollama running locally with moondream and/or llava:7b pulled. +Playwright + Chromium are installed automatically on first run. + +Usage: + python -m benchmarks.bench_ax_tree_real +""" + +import asyncio +import base64 +import subprocess +import sys +import time +from typing import Optional + +from token0.optimization.ax_tree import ( + has_opaque_nodes, + serialize_ax_tree, +) +from token0.providers.ollama import OllamaProvider + +FAST_MODELS = ["moondream", "llava:7b"] + +URLS = [ + { + "url": "https://github.com", + "name": "GitHub Home", + "question": ( + "List every interactive element visible " + "(buttons, links, search inputs)." + ), + "required_substrings": ["sign"], + }, + { + "url": "https://news.ycombinator.com", + "name": "Hacker News", + "question": ( + "How many story links are visible? " + "Name the first 3 stories." + ), + "required_substrings": [], + }, + { + "url": "https://en.wikipedia.org/wiki/Main_Page", + "name": "Wikipedia", + "question": ( + "What search and navigation elements are available " + "on this page?" 
+ ), + "required_substrings": ["search"], + }, +] + +_INTERACTIVE_ROLES = frozenset( + { + "button", + "link", + "textbox", + "searchbox", + "combobox", + "checkbox", + "radio", + "slider", + "spinbutton", + "switch", + "tab", + "menuitem", + "menuitemcheckbox", + "menuitemradio", + "option", + "treeitem", + } +) +_STRUCTURAL_ROLES = frozenset( + { + "heading", + "list", + "listitem", + "table", + "row", + "cell", + "navigation", + "main", + "banner", + "contentinfo", + "complementary", + "form", + "search", + "dialog", + "alertdialog", + "tablist", + "toolbar", + "menu", + "menubar", + "tree", + "grid", + "treegrid", + "WebArea", + "RootWebArea", + } +) +_WRAPPER_ROLES = frozenset( + { + "generic", + "none", + "presentation", + "group", + "Section", + } +) + + +def _ensure_playwright(): + """Install Playwright if missing, then install Chromium.""" + try: + import playwright # noqa: F401 + except ImportError: + print("Installing playwright...") + subprocess.check_call( + [sys.executable, "-m", "pip", "install", "playwright"] + ) + print("Installing Chromium...") + subprocess.check_call( + [sys.executable, "-m", "playwright", "install", "chromium"] + ) + + +def prune_ax_tree(node: Optional[dict], depth: int = 0, max_depth: int = 6): + """Prune AX tree to interactive/structural nodes only.""" + if node is None: + return None + + role = node.get("role", "") + name = node.get("name", "") + value = node.get("value") + children = node.get("children", []) + + # Hard depth limit + if depth > max_depth: + if role in _INTERACTIVE_ROLES and name: + return {"role": role, "name": name[:80]} + return None + + # Prune children first + pruned_children = [] + for child in children: + pruned = prune_ax_tree(child, depth + 1, max_depth) + if pruned: + pruned_children.append(pruned) + + # Collapse wrappers with 1 child + if ( + role in _WRAPPER_ROLES + and not name + and len(pruned_children) == 1 + ): + return pruned_children[0] + + is_interactive = role in _INTERACTIVE_ROLES + 
is_structural = role in _STRUCTURAL_ROLES + has_name = bool(name) + has_children = len(pruned_children) > 0 + + keep = ( + is_interactive + or (is_structural and (has_name or has_children)) + or (has_name and has_children) + ) + + if depth == 0: + keep = True + + if not keep and not has_children: + return None + + if not keep and has_children and len(pruned_children) == 1: + return pruned_children[0] + + if not keep and has_children and len(pruned_children) > 1: + return {"role": role, "children": pruned_children} + + # Build result + result: dict = {"role": role} + if has_name: + result["name"] = name[:80] + if is_interactive and value: + result["value"] = str(value)[:80] + if has_children: + result["children"] = pruned_children + + # Hard cap + serialized = str(result) + if len(serialized) > 8000: + result["children"] = pruned_children[:10] + + return result + + +async def capture_page(browser, url: str, timeout_ms: int = 30000): + """Capture screenshot and AX snapshot from real page.""" + page = None + try: + page = await browser.new_page( + viewport={"width": 1280, "height": 720} + ) + await page.goto(url, wait_until="networkidle", timeout=timeout_ms) + await page.wait_for_timeout(2000) + screenshot_bytes = await page.screenshot( + type="jpeg", quality=85, full_page=False + ) + + # Build simple AX tree from DOM structure + ax_snapshot = await _extract_ax_tree(page) + return screenshot_bytes, ax_snapshot + finally: + if page: + await page.close() + + +async def _extract_ax_tree(page): + """Extract a simple AX tree via JavaScript evaluation.""" + tree = await page.evaluate( + """ + () => { + function buildTree(node) { + if (!node) return null; + const role = node.getAttribute('role') || + node.tagName.toLowerCase(); + const ariaLabel = node.getAttribute('aria-label'); + const ariaPressed = node.getAttribute('aria-pressed'); + const name = ariaLabel || node.getAttribute('title') || + (node.textContent ? 
+ node.textContent.trim().slice(0, 100) : ''); + + const children = []; + for (let child of node.children) { + const subtree = buildTree(child); + if (subtree) children.push(subtree); + } + + const result = {role, name}; + if (ariaPressed) result.value = ariaPressed; + if (children.length > 0) result.children = children; + return result; + } + return buildTree(document.documentElement); + } + """ + ) + return tree + + +def _bytes_to_data_uri(jpeg_bytes: bytes) -> str: + """Convert JPEG bytes to base64 data URI.""" + b64 = base64.b64encode(jpeg_bytes).decode() + return f"data:image/jpeg;base64,{b64}" + + +async def _run_real_scenario( + model: str, + provider: OllamaProvider, + name: str, + question: str, + screenshot_uri: str, + ax_tree_text: str, + required_substrings: list, + has_opaque: bool, +) -> Optional[dict]: + """Run single scenario: screenshot vs AX tree.""" + print(f"\n Scenario: {name}") + print(f' Question: "{question}"') + if has_opaque: + print(" NOTE: opaque nodes detected — benchmarking both paths") + + # Screenshot path + print(" Screenshot: ", end="", flush=True) + screenshot_messages = [ + { + "role": "user", + "content": [ + {"type": "text", "text": question}, + { + "type": "image_url", + "image_url": {"url": screenshot_uri, "detail": "auto"}, + }, + ], + } + ] + + screenshot_start = time.time() + try: + screenshot_resp = await provider.chat_completion( + model=model, messages=screenshot_messages, max_tokens=200 + ) + screenshot_latency = int((time.time() - screenshot_start) * 1000) + screenshot_tokens = screenshot_resp.prompt_tokens + screenshot_text = screenshot_resp.content + print(f"{screenshot_tokens:,} tokens | {screenshot_latency}ms") + except Exception as e: + print(f"ERROR: {e}") + return None + + # Tree path + print(" AX Tree: ", end="", flush=True) + tree_question = f"{question}\n\nUI Accessibility Tree:\n{ax_tree_text}" + tree_messages = [ + { + "role": "user", + "content": [{"type": "text", "text": tree_question}], + } + ] + + 
tree_start = time.time() + try: + tree_resp = await provider.chat_completion( + model=model, messages=tree_messages, max_tokens=200 + ) + tree_latency = int((time.time() - tree_start) * 1000) + tree_tokens = tree_resp.prompt_tokens + tree_text = tree_resp.content + print(f"{tree_tokens:,} tokens | {tree_latency}ms", end="") + + saved = screenshot_tokens - tree_tokens + pct = (saved / screenshot_tokens * 100) if screenshot_tokens > 0 else 0 + print(f" ({-pct:.1f}%)") + except Exception as e: + print(f"ERROR: {e}") + return None + + # Verify key substrings + screenshot_lower = screenshot_text.lower() + screenshot_has_items = all( + substring.lower() in screenshot_lower + for substring in required_substrings + ) + + print( + f" Screenshot captured key items: " + f"{'YES' if screenshot_has_items else 'NO'}" + ) + print(f' Screenshot: "{screenshot_text[:60]}..."') + print(f' Tree: "{tree_text[:60]}..."') + + return { + "scenario": name, + "question": question, + "screenshot_tokens": screenshot_tokens, + "tree_tokens": tree_tokens, + "tokens_saved": saved, + "savings_pct": round(pct, 1), + "screenshot_latency_ms": screenshot_latency, + "tree_latency_ms": tree_latency, + "screenshot_answer": screenshot_text, + "tree_answer": tree_text, + "screenshot_captured_items": screenshot_has_items, + "has_opaque": has_opaque, + } + + +async def run_real_benchmarks(): + """Run benchmarks on real pages via Playwright.""" + _ensure_playwright() + + from playwright.async_api import async_playwright + + provider = OllamaProvider(base_url="http://localhost:11434/v1") + + print("=" * 80) + print(" AX Tree Routing Benchmark — Real Browser Pages") + print("=" * 80) + + # Phase 1: Capture all pages + print("\n" + "=" * 80) + print(" Phase 1: Capturing Real Pages") + print("=" * 80) + + captures = {} + + async with async_playwright() as p: + browser = await p.chromium.launch() + + for url_info in URLS: + url = url_info["url"] + name = url_info["name"] + print(f"\n {name}: ", end="", 
flush=True) + try: + screenshot_bytes, ax_snapshot = await capture_page( + browser, url + ) + if ax_snapshot is None: + print("FAILED: No AX snapshot") + continue + + pruned = prune_ax_tree(ax_snapshot) + tree_text = serialize_ax_tree(pruned) + opaque = has_opaque_nodes(pruned) + + captures[name] = { + "url": url, + "screenshot_bytes": screenshot_bytes, + "screenshot_uri": _bytes_to_data_uri(screenshot_bytes), + "tree_text": tree_text, + "has_opaque": opaque, + } + + print( + f"OK ({len(tree_text)} chars, " + f"opaque={opaque})" + ) + except Exception as e: + print(f"FAILED: {e}") + + await browser.close() + + if not captures: + print("\nNo captures succeeded. Exiting.") + return + + # Phase 2: Benchmark each model + print("\n" + "=" * 80) + print(" Phase 2: Benchmarking Models") + print("=" * 80) + + all_results = {} + + for model in FAST_MODELS: + print(f"\n{'=' * 80}") + print(f" Model: {model}") + print(f"{'=' * 80}") + + model_results = [] + + # Check model availability + try: + await provider.chat_completion( + model=model, + messages=[ + {"role": "user", + "content": [{"type": "text", "text": "test"}]} + ], + max_tokens=5, + ) + except Exception as e: + print(f" SKIPPED: Model not available ({e})") + continue + + for url_info in URLS: + name = url_info["name"] + if name not in captures: + continue + + cap = captures[name] + result = await _run_real_scenario( + model=model, + provider=provider, + name=name, + question=url_info["question"], + screenshot_uri=cap["screenshot_uri"], + ax_tree_text=cap["tree_text"], + required_substrings=url_info.get( + "required_substrings", [] + ), + has_opaque=cap["has_opaque"], + ) + if result: + model_results.append(result) + + all_results[model] = model_results + + # Summary table + if model_results: + total_screenshot = sum( + r["screenshot_tokens"] for r in model_results + ) + total_tree = sum(r["tree_tokens"] for r in model_results) + total_saved = total_screenshot - total_tree + total_pct = ( + (total_saved / 
total_screenshot * 100) + if total_screenshot > 0 + else 0 + ) + + print(f"\n --- {model} Summary ---") + print( + f" {'Scenario':<20s} {'Screenshot':>12s} " + f"{'Tree':>8s} {'Savings':>8s}" + ) + print( + f" {'-' * 20} {'-' * 12} {'-' * 8} {'-' * 8}" + ) + for r in model_results: + print( + f" {r['scenario']:<20s} " + f"{r['screenshot_tokens']:>12,} " + f"{r['tree_tokens']:>8,} " + f"{r['savings_pct']:>7.1f}%" + ) + print( + f" {'TOTAL':<20s} {total_screenshot:>12,} " + f"{total_tree:>8,} {total_pct:>7.1f}%" + ) + + # Grand summary + print(f"\n{'=' * 80}") + print(" Grand Summary — All Models") + print(f"{'=' * 80}") + print( + f"\n {'Model':<20s} {'Screenshot':>12s} " + f"{'Tree':>12s} {'Savings':>8s}" + ) + print(f" {'-' * 20} {'-' * 12} {'-' * 12} {'-' * 8}") + + for model, results in all_results.items(): + if results: + total_screenshot = sum( + r["screenshot_tokens"] for r in results + ) + total_tree = sum(r["tree_tokens"] for r in results) + total_saved = total_screenshot - total_tree + pct = ( + (total_saved / total_screenshot * 100) + if total_screenshot > 0 + else 0 + ) + print( + f" {model:<20s} {total_screenshot:>12,} " + f"{total_tree:>12,} {pct:>7.1f}%" + ) + + print(f"\n{'=' * 80}\n") + + # Cloud extrapolation + successful = {m: r for m, r in all_results.items() if r} + if not successful: + return + + all_tree_tokens = [ + s["tree_tokens"] + for r in successful.values() + for s in r + ] + avg_tree_tokens_per_scenario = ( + sum(all_tree_tokens) / len(all_tree_tokens) + ) + num_scenarios = len(captures) + total_avg_tree = avg_tree_tokens_per_scenario * num_scenarios + + # Real 1280x720 viewport + # OpenAI: ceil(1280/512) * ceil(720/512) = 3 * 2 = 6 tiles + # tokens = 85 + 170 * 6 = 1105 + openai_screenshot_per_scenario = 1105 + openai_total_screenshot = ( + openai_screenshot_per_scenario * num_scenarios + ) + + # Anthropic: 1280 * 720 / 750 = 1229 + anthropic_screenshot_per_scenario = 1229 + anthropic_total_screenshot = ( + 
anthropic_screenshot_per_scenario * num_scenarios + ) + + def _savings(before, after): + saved = before - after + pct = saved / before * 100 if before else 0 + return saved, pct + + openai_saved, openai_pct = _savings( + openai_total_screenshot, total_avg_tree + ) + anthropic_saved, anthropic_pct = _savings( + anthropic_total_screenshot, total_avg_tree + ) + + openai_price_per_m = 2.50 + anthropic_price_per_m = 3.00 + + openai_cost_before = ( + openai_total_screenshot * openai_price_per_m / 1_000_000 + ) + openai_cost_after = ( + total_avg_tree * openai_price_per_m / 1_000_000 + ) + anthropic_cost_before = ( + anthropic_total_screenshot * anthropic_price_per_m / 1_000_000 + ) + anthropic_cost_after = ( + total_avg_tree * anthropic_price_per_m / 1_000_000 + ) + + print("=" * 80) + print( + " Cloud API Extrapolation " + "(based on avg Ollama tree token measurements)" + ) + print("=" * 80) + avg_str = f"{avg_tree_tokens_per_scenario:.0f}" + print(f"\n Avg tree tokens/scenario: {avg_str}") + print( + f" Total tree tokens " + f"({num_scenarios} scenarios): {total_avg_tree:.0f}" + ) + print() + hdr = ( + f" {'Provider':<22} {'Screenshot':>12} {'Tree':>8} " + f"{'Savings':>9} {'$/1M saved':>12}" + ) + print(hdr) + print( + f" {'-' * 22} {'-' * 12} {'-' * 8} " + f"{'-' * 9} {'-' * 12}" + ) + + for label, shot_tok, pct, cb, ca in [ + ( + "OpenAI GPT-4o", + openai_total_screenshot, + openai_pct, + openai_cost_before, + openai_cost_after, + ), + ( + "Anthropic Claude", + anthropic_total_screenshot, + anthropic_pct, + anthropic_cost_before, + anthropic_cost_after, + ), + ]: + saved_per_m = (cb - ca) * 1_000_000 + print( + f" {label:<22} {shot_tok:>12,} " + f"{total_avg_tree:>8.0f} {pct:>8.1f}% " + f"${saved_per_m:>10,.0f}" + ) + + print() + print(" At-scale (100K calls/day, 30 days):") + print( + f" {'Provider':<22} {'Direct/mo':>12} " + f"{'Token0/mo':>12} {'Saved/mo':>12}" + ) + print( + f" {'-' * 22} {'-' * 12} {'-' * 12} {'-' * 12}" + ) + calls = 100_000 * 30 + for label, 
cost_before, cost_after in [ + ("OpenAI GPT-4o", openai_cost_before, openai_cost_after), + ( + "Anthropic Claude", + anthropic_cost_before, + anthropic_cost_after, + ), + ]: + mo_before = cost_before * calls + mo_after = cost_after * calls + saved_mo = mo_before - mo_after + print( + f" {label:<22} ${mo_before:>10,.0f} " + f"${mo_after:>10,.0f} ${saved_mo:>10,.0f}" + ) + + print() + print(" Notes:") + print( + " - Real 1280x720 screenshots cost ~1105 tokens (OpenAI) " + "vs ~765 for synthetic 800x600." + ) + print( + " - AX tree text tokens scale with page complexity, " + "not resolution, so savings vary widely across real pages." + ) + print( + " - Pricing: OpenAI $2.50/1M, Anthropic $3.00/1M " + "(input tokens)" + ) + + print(f"\n{'=' * 80}\n") + + +def main(): + asyncio.run(run_real_benchmarks()) + + +if __name__ == "__main__": + main() diff --git a/pyproject.toml b/pyproject.toml index 5856d67..551cb15 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "token0" -version = "0.3.2" +version = "0.3.3" description = "Open-source API proxy that makes vision LLM calls 5-10x cheaper" readme = "README.md" license = "Apache-2.0" diff --git a/tests/test_ax_tree.py b/tests/test_ax_tree.py new file mode 100644 index 0000000..4e77770 --- /dev/null +++ b/tests/test_ax_tree.py @@ -0,0 +1,229 @@ +"""Tests for AX tree serialization, opaque detection, and combo routing.""" + +from token0.optimization.ax_tree import ( + estimate_ax_tree_tokens, + has_opaque_nodes, + serialize_ax_tree, +) + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + +PLAYWRIGHT_TREE = { + "role": "WebArea", + "name": "GitHub", + "children": [ + { + "role": "navigation", + "name": "Main", + "children": [ + {"role": "link", "name": "Home", "children": []}, + {"role": "link", "name": "About", "children": []}, + ], + }, + { + "role": "main", + "name": "", 
"children": [ + {"role": "heading", "name": "Welcome", "children": []}, + {"role": "button", "name": "Get Started", "children": []}, + {"role": "textbox", "name": "Search", "value": "foo", "children": []}, + ], + }, + ], +} + +AXUI_TREE = { + "AXRole": "AXWindow", + "AXTitle": "Finder", + "AXChildren": [ + { + "AXRole": "AXButton", + "AXTitle": "Close", + "AXEnabled": True, + "AXChildren": [], + }, + { + "AXRole": "AXTextField", + "AXTitle": "Search", + "AXValue": "query", + "AXEnabled": True, + "AXChildren": [], + }, + ], +} + +CANVAS_TREE = { + "role": "WebArea", + "name": "App", + "children": [ + {"role": "button", "name": "OK", "children": []}, + {"role": "canvas", "name": "", "children": []}, + ], +} + +IFRAME_TREE = { + "role": "document", + "name": "", + "children": [ + {"role": "iframe", "name": "embedded", "children": []}, + ], +} + + +# --------------------------------------------------------------------------- +# serialize_ax_tree +# --------------------------------------------------------------------------- + + +def test_serialize_playwright_tree_contains_roles(): + result = serialize_ax_tree(PLAYWRIGHT_TREE) + assert "WebArea" in result + assert "button" in result + assert "Get Started" in result + + +def test_serialize_playwright_tree_is_indented(): + result = serialize_ax_tree(PLAYWRIGHT_TREE) + lines = result.splitlines() + # Root has no indent; children have at least 2 spaces + assert lines[0].startswith("WebArea") + assert any(line.startswith(" ") for line in lines) + + +def test_serialize_axui_tree_normalizes_roles(): + result = serialize_ax_tree(AXUI_TREE) + assert "AXWindow" in result + assert "AXButton" in result + assert "Close" in result + assert "Search" in result + + +def test_serialize_axui_includes_value(): + result = serialize_ax_tree(AXUI_TREE) + assert "query" in result + + +def test_serialize_list_of_roots(): + roots = [ + {"role": "button", "name": "OK", "children": []}, + {"role": "button", "name": "Cancel", "children": []}, + ] + 
result = serialize_ax_tree(roots) + assert "OK" in result + assert "Cancel" in result + + +def test_serialize_string_passthrough(): + pre = "button: Submit\n text: Click me" + assert serialize_ax_tree(pre) == pre + + +def test_serialize_disabled_node(): + tree = {"role": "button", "name": "Submit", "disabled": True, "children": []} + result = serialize_ax_tree(tree) + assert "[disabled]" in result + + +def test_serialize_value_shown_when_different_from_name(): + tree = {"role": "textbox", "name": "Email", "value": "user@example.com", "children": []} + result = serialize_ax_tree(tree) + assert "user@example.com" in result + + +# --------------------------------------------------------------------------- +# estimate_ax_tree_tokens +# --------------------------------------------------------------------------- + + +def test_estimate_tokens_proportional_to_length(): + short = "button OK" + long_text = "button OK\n" * 100 + assert estimate_ax_tree_tokens(long_text) > estimate_ax_tree_tokens(short) + + +def test_estimate_tokens_minimum_ten(): + assert estimate_ax_tree_tokens("hi") == 10 + + +def test_estimate_tokens_approx_four_chars(): + text = "a" * 400 + assert estimate_ax_tree_tokens(text) == 100 + + +# --------------------------------------------------------------------------- +# has_opaque_nodes +# --------------------------------------------------------------------------- + + +def test_no_opaque_in_clean_playwright_tree(): + assert has_opaque_nodes(PLAYWRIGHT_TREE) is False + + +def test_no_opaque_in_axui_tree(): + assert has_opaque_nodes(AXUI_TREE) is False + + +def test_canvas_role_is_opaque(): + assert has_opaque_nodes(CANVAS_TREE) is True + + +def test_iframe_role_is_opaque(): + assert has_opaque_nodes(IFRAME_TREE) is True + + +def test_opaque_detected_in_nested_child(): + nested = { + "role": "main", + "name": "", + "children": [ + { + "role": "section", + "name": "", + "children": [ + {"role": "canvas", "name": "", "children": []}, + ], + } + ], + } + assert 
has_opaque_nodes(nested) is True + + +def test_opaque_string_contains_canvas(): + assert has_opaque_nodes("button OK\ncanvas [OPAQUE]") is True + + +def test_opaque_string_contains_iframe(): + assert has_opaque_nodes("main\n iframe embedded") is True + + +def test_clean_string_is_not_opaque(): + assert has_opaque_nodes("button OK\nlink Home\ntextbox Search") is False + + +def test_opaque_list_of_roots(): + roots = [ + {"role": "button", "name": "OK", "children": []}, + {"role": "canvas", "name": "", "children": []}, + ] + assert has_opaque_nodes(roots) is True + + +def test_clean_list_of_roots(): + roots = [ + {"role": "button", "name": "OK", "children": []}, + {"role": "link", "name": "Home", "children": []}, + ] + assert has_opaque_nodes(roots) is False + + +def test_axui_aximage_is_opaque(): + tree = { + "AXRole": "AXGroup", + "AXTitle": "", + "AXChildren": [ + {"AXRole": "AXImage", "AXTitle": "", "AXChildren": []}, + ], + } + assert has_opaque_nodes(tree) is True diff --git a/token0/api/v1/chat.py b/token0/api/v1/chat.py index 9f881ca..3d8a7c1 100644 --- a/token0/api/v1/chat.py +++ b/token0/api/v1/chat.py @@ -98,6 +98,27 @@ def _optimize_messages(request: ChatRequest, prompt_detail: str): continue optimized_parts = [] + + # AX tree combo detection: if this message has both image_url and + # accessibility_tree parts, pick the cheaper representation once. 
+ parts_list = msg.content # already confirmed to be a list + has_tree = any(p.type == "accessibility_tree" for p in parts_list) + has_image = any(p.type == "image_url" for p in parts_list) + ax_drop_image = False # True → skip image_url parts (tree wins) + ax_drop_tree = False # True → skip accessibility_tree parts (image wins) + + if has_tree and has_image and request.token0_optimize: + from token0.optimization.ax_tree import has_opaque_nodes + + tree_parts = [p for p in parts_list if p.type == "accessibility_tree"] + tree_data = tree_parts[0].accessibility_tree.data + if has_opaque_nodes(tree_data): + # Tree has canvas/iframe — screenshot needed; drop tree to avoid redundancy. + ax_drop_tree = True + else: + # Tree is complete — route to text; drop screenshot (saves 90%+ tokens). + ax_drop_image = True + for part in msg.content: if part.type == "text": optimized_parts.append({"type": "text", "text": part.text}) @@ -176,7 +197,35 @@ def _optimize_messages(request: ChatRequest, prompt_detail: str): dropped_frames = video_stats["total_video_frames"] - video_stats["frames_selected"] total_tokens_before += dropped_frames * tokens_per_frame_avg + elif part.type == "accessibility_tree" and part.accessibility_tree: + if ax_drop_tree: + # Combo: tree has opaque nodes → screenshot wins, skip tree. 
+ continue + from token0.optimization.ax_tree import ( + estimate_ax_tree_tokens, + serialize_ax_tree, + ) + + serialized = serialize_ax_tree(part.accessibility_tree.data) + token_count = estimate_ax_tree_tokens(serialized) + screenshot_tokens = 5000 # fixed rough estimate for a 1080p screenshot + total_tokens_before += screenshot_tokens if ax_drop_image else token_count + total_tokens_after += token_count + if ax_drop_image: + saved = screenshot_tokens - token_count + optimizations_applied.append( + f"ax tree → text ({saved:,} tokens saved vs screenshot)" + ) + optimized_parts.append( + {"type": "text", "text": f"[UI Accessibility Tree]:\n{serialized}"} + ) + elif part.type == "image_url" and part.image_url and request.token0_optimize: + if ax_drop_image: + # Combo: tree is complete → tree text wins, skip screenshot. + # Avoided screenshot cost is already counted in the accessibility_tree branch. + continue + image_data = part.image_url.url # PDF pre-processing: extract text layer if available diff --git a/token0/models/request.py b/token0/models/request.py index 83dc714..459e1fa 100644 --- a/token0/models/request.py +++ b/token0/models/request.py @@ -10,11 +10,17 @@ class VideoUrl(BaseModel): url: str +class AccessibilityTree(BaseModel): + data: dict | list | str # Playwright/CDP dict, list of roots, or pre-serialized string + source: str | None = None # "playwright", "axui", "selenium", "cdp" — informational only + + class ContentPart(BaseModel): - type: str # "text", "image_url", or "video_url" + type: str # "text", "image_url", "video_url", or "accessibility_tree" text: str | None = None image_url: ImageUrl | None = None video_url: VideoUrl | None = None + accessibility_tree: AccessibilityTree | None = None class Message(BaseModel): diff --git a/token0/optimization/ax_tree.py b/token0/optimization/ax_tree.py new file mode 100644 index 0000000..00443d0 --- /dev/null +++ b/token0/optimization/ax_tree.py @@ -0,0 +1,157 @@ +"""AX (Accessibility) Tree routing — convert UI accessibility 
trees to compact text. + +When a UI automation agent provides both a screenshot and an accessibility tree, +token0 picks the cheaper representation: +- Tree is complete (no canvas/iframe/opaque nodes): use text (usually far fewer tokens) +- Tree has opaque elements: fall back to screenshot for visual accuracy + +Supported formats: +- Web (Chrome DevTools / Playwright): {"role": "...", "name": "...", "children": [...]} +- macOS AXUIElement: {"AXRole": "...", "AXTitle": "...", "AXChildren": [...]} +- Pre-serialized string: passed through (surrounding whitespace stripped) +""" + +from __future__ import annotations + +import logging + +logger = logging.getLogger("token0.ax_tree") + +# Roles that cannot be represented textually — require visual rendering. +_OPAQUE_ROLES: frozenset[str] = frozenset( + { + "canvas", + "AXCanvas", + "embed", + "object", + "plugin", + "img", + "image", + "figure", + "math", + "meter", + "progressbar", + "AXImage", + } +) + +# HTML tag names that are inherently opaque. +_OPAQUE_TAGS: frozenset[str] = frozenset( + {"canvas", "iframe", "embed", "object", "video", "audio", "svg"} +) + + +def _normalize_node(node: dict) -> dict: + """Return a uniform dict from either AXUIElement or Playwright/CDP format.""" + if "AXRole" in node: + # macOS AXUIElement + return { + "role": node.get("AXRole", ""), + "name": (node.get("AXTitle") or node.get("AXDescription") or node.get("AXValue") or ""), + "value": node.get("AXValue", ""), + "enabled": node.get("AXEnabled", True), + "children": node.get("AXChildren", []), + } + # Web / Playwright / Chrome DevTools Protocol + return { + "role": node.get("role", ""), + "name": node.get("name", ""), + "value": node.get("value", ""), + "enabled": not node.get("disabled", False), + "children": node.get("children", []), + } + + +def _serialize_node(node: dict, depth: int, lines: list[str]) -> None: + """Recursively append compact indented lines for one node.""" + n = _normalize_node(node) + role = n["role"] + name = n["name"] + value = str(n["value"]) if 
n["value"] else "" + enabled = n["enabled"] + + indent = " " * depth + tokens: list[str] = [role] + if name: + tokens.append(f'"{name}"') + if value and value != name: + tokens.append(f"={value!r}") + if not enabled: + tokens.append("[disabled]") + + lines.append(indent + " ".join(tokens)) + + for child in n["children"]: + _serialize_node(child, depth + 1, lines) + + +def serialize_ax_tree(tree: dict | list | str) -> str: + """Convert an AX tree to compact indented text for the LLM. + + Args: + tree: Nested dict (Playwright/AXUIElement), list of root nodes, or + pre-serialized string (returned as-is). + + Returns: + Multi-line string representation of the tree. + """ + if isinstance(tree, str): + return tree.strip() + + lines: list[str] = [] + if isinstance(tree, list): + for node in tree: + _serialize_node(node, 0, lines) + elif isinstance(tree, dict): + _serialize_node(tree, 0, lines) + else: + return str(tree) + + return "\n".join(lines) + + +def estimate_ax_tree_tokens(serialized: str) -> int: + """Estimate LLM token count for a serialized AX tree (~4 chars per token).""" + return max(10, len(serialized) // 4) + + +def _node_is_opaque(node: dict) -> bool: + """Return True if this node or any descendant needs visual rendering.""" + n = _normalize_node(node) + role = n["role"] + + if role in _OPAQUE_ROLES: + return True + if role.lower() in _OPAQUE_TAGS: + return True + + return any(_node_is_opaque(child) for child in n["children"]) + + +def has_opaque_nodes(tree: dict | list | str) -> bool: + """Return True when the tree contains elements that require a screenshot fallback. + + Canvas elements, iframes, embedded media, and images without text equivalents + cannot be described by the tree alone — the screenshot must be kept. + + Args: + tree: Same formats as serialize_ax_tree. + + Returns: + True → keep screenshot, discard tree (tree alone is insufficient). + False → use tree text only, drop screenshot (90%+ token savings). 
+ """ + if isinstance(tree, str): + lower = tree.lower() + return any( + kw in lower + for kw in ("canvas", "iframe", "embed", "