feat(harness): web::fetch page-reading mode, tool-result images, context-safe caps #233

Merged: andersonleal merged 1 commit into main from feat/web-fetch-page-reading on Jun 8, 2026

@andersonleal (Collaborator) commented Jun 8, 2026

Summary

Adds a page-reading mode to web::fetch so agents can read web pages (not just call APIs), returns viewable images for image URLs, and reworks the response/timeout caps into a default-vs-ceiling split that is context-safe without silently truncating raw fetches.

Page-reading mode (format)

  • New optional format param: "markdown" (HTML→Markdown, best for reading), "text" (HTML→plain text), "html" (raw).
  • When set, the request goes out with a browser User-Agent + format-matched Accept/Accept-Language, and retries the hop once with the honest configured UA on a Cloudflare challenge (403 + cf-mitigated: challenge) — only for idempotent GET/HEAD.
  • Caller-supplied headers always win over the page-mode injections.
  • HTML transforms (turndown / htmlparser2) live in the new web/convert.ts. Transform failures (e.g. hostile deep nesting that overflows turndown) fall back to the raw body; bodies above max_transform_bytes skip the synchronous transform entirely to protect the worker event loop.
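
A minimal sketch of the two request-side rules above: caller headers override the page-mode injections, and the honest-UA retry fires only on a Cloudflare challenge for GET/HEAD. The function names, the `HeaderMap` type, and every header value are assumptions for illustration; the PR text does not show the actual strings or signatures used in `harness/src/web/fetch.ts`.

```typescript
// Hypothetical sketch: names and header values are illustrative assumptions,
// not the identifiers or strings used in the harness.
type PageFormat = "markdown" | "text" | "html";
type HeaderMap = Record<string, string>;

// Placeholder values (assumed); the real constants are not shown in this PR text.
const ASSUMED_BROWSER_UA = "Mozilla/5.0 (placeholder browser UA)";
const ASSUMED_ACCEPT_LANGUAGE = "en-US,en;q=0.9";

// Assumed format-matched Accept values.
function assumedAcceptFor(format: PageFormat): string {
  return format === "text"
    ? "text/html,text/plain;q=0.9,*/*;q=0.8"
    : "text/html,application/xhtml+xml,*/*;q=0.8";
}

// Header names are case-insensitive, so normalize keys before merging.
function lowerKeys(headers: HeaderMap): HeaderMap {
  const out: HeaderMap = {};
  for (const [key, value] of Object.entries(headers)) {
    out[key.toLowerCase()] = value;
  }
  return out;
}

// Page-mode injections first, caller headers last, so the caller always wins.
function buildPageHeaders(format: PageFormat, callerHeaders: HeaderMap = {}): HeaderMap {
  const injected: HeaderMap = {
    "user-agent": ASSUMED_BROWSER_UA,
    accept: assumedAcceptFor(format),
    "accept-language": ASSUMED_ACCEPT_LANGUAGE,
  };
  return { ...injected, ...lowerKeys(callerHeaders) };
}

// Retry only idempotent requests that hit a Cloudflare challenge
// (403 + cf-mitigated: challenge), as described in the PR summary.
function shouldRetryWithConfiguredUa(
  method: string,
  status: number,
  responseHeaders: HeaderMap,
): boolean {
  const upper = method.toUpperCase();
  const idempotent = upper === "GET" || upper === "HEAD";
  const challenged =
    status === 403 && lowerKeys(responseHeaders)["cf-mitigated"] === "challenge";
  return idempotent && challenged;
}
```

Putting the caller's headers last in the spread keeps the precedence rule in one place, which makes it easy to cover with a single regression test.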

Tool-result images

  • Image responses in page mode come back as an {content, details} envelope with an image block (allowlisted mime, 2xx, complete/non-truncated, non-empty), routed through the Anthropic provider wire so the model can view them; text-only providers fall back to the text line. Anything not safely viewable falls through to the normal base64 envelope.
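
The viewability gate described above can be sketched as a single predicate. This is a hedged illustration only: the `FetchedBody` shape, the function name, and the specific MIME allowlist are assumptions, since the PR text names the conditions (allowlisted mime, 2xx, non-truncated, non-empty) but not the concrete list or types.

```typescript
// Hypothetical sketch: the allowlist contents and field names are assumptions.
const ASSUMED_VIEWABLE_MIMES = new Set<string>([
  "image/png",
  "image/jpeg",
  "image/gif",
  "image/webp",
]);

interface FetchedBody {
  status: number;
  contentType: string; // raw Content-Type header, may carry parameters
  bytes: Uint8Array;
  truncated: boolean; // true when the body hit the byte cap
}

// An image is returned as a viewable block only when every condition holds;
// anything else would fall through to the normal envelope.
function isSafelyViewableImage(res: FetchedBody): boolean {
  // Drop parameters such as "; charset=..." and normalize case.
  const mime = res.contentType.replace(/;.*$/, "").trim().toLowerCase();
  const allowlisted = ASSUMED_VIEWABLE_MIMES.has(mime);
  const success = res.status >= 200 && res.status < 300;
  const complete = !res.truncated;
  const nonEmpty = res.bytes.length > 0;
  return allowlisted && success && complete && nonEmpty;
}
```

Gating on completeness matters because a truncated image would base64-encode cleanly yet fail to decode on the model side.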

Context-safe caps (the fix from the adversarial review)

  • Split byte/timeout caps into default vs ceiling:
    • resolveMaxBytes: raw fetches keep defaulting to the 5 MiB ceiling (preserving the historical contract — existing API/download callers are not silently truncated), while page-reading mode defaults to a context-safe 256 KiB (a transformed 1 MiB+ SPA page would otherwise blow the turn's context window).
    • resolveTimeout: new default_timeout_ms (30s) separate from a raised max_timeout_ms ceiling (120s).
  • Transformed bodies append an in-band truncation marker when the body hit max_bytes, since agents rarely re-check bytes_truncated.
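
As a rough illustration of the default-vs-ceiling split, the sketch below resolves both caps from a config object. The function signatures and the `CapsConfig` shape are assumptions; the field names `default_response_bytes`, `max_response_bytes`, `default_timeout_ms`, and `max_timeout_ms` appear in this PR, but mapping `default_response_bytes` to the page-reading default is an inference from the review discussion, not confirmed code.

```typescript
// Hypothetical sketch: signatures and the config shape are assumptions.
interface CapsConfig {
  max_response_bytes: number; // ceiling, also the raw-fetch default
  default_response_bytes: number; // assumed to be the page-reading default
  default_timeout_ms: number;
  max_timeout_ms: number; // ceiling
}

// Values taken from the PR summary: 5 MiB, 256 KiB, 30s, 120s.
const SUMMARY_DEFAULTS: CapsConfig = {
  max_response_bytes: 5 * 1024 * 1024,
  default_response_bytes: 256 * 1024,
  default_timeout_ms: 30000,
  max_timeout_ms: 120000,
};

function resolveMaxBytesSketch(
  cfg: CapsConfig,
  requested: number | undefined,
  pageMode: boolean,
): number {
  if (requested !== undefined) {
    // An explicit value overrides the default but is clamped to the ceiling.
    return Math.min(requested, cfg.max_response_bytes);
  }
  // Raw fetches keep the historical ceiling default; page mode is context-safe.
  return pageMode
    ? Math.min(cfg.default_response_bytes, cfg.max_response_bytes)
    : cfg.max_response_bytes;
}

function resolveTimeoutSketch(cfg: CapsConfig, requested?: number): number {
  const wanted = requested ?? cfg.default_timeout_ms;
  return Math.min(wanted, cfg.max_timeout_ms);
}
```

Keeping the raw-fetch default equal to the ceiling is what preserves the old contract: a caller that never passed `max_bytes` sees the same cap as before this change.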

Test plan

  • harness typecheck clean (tsc -b --noEmit)
  • harness full suite: 1282 tests pass / 121 files
  • biome check clean on changed files
  • New unit coverage: resolveMaxBytes (raw→ceiling regression guard, page-mode→256 KiB, explicit override, clamping), HTML→markdown/text transforms, Cloudflare retry, image viewability gating, transform bounds
  • Manual smoke against a live page (format: "markdown") and an image URL

Summary by CodeRabbit

Release Notes

  • New Features

    • Added page-reading mode to web::fetch with configurable format parameter (markdown, text, html) to intelligently transform web content and prevent context flooding.
    • Image responses are now returned in page-reading mode with base64 encoding and descriptive text.
    • Enhanced web content handling with HTML-to-Markdown conversion for cleaner, compact page reads.
  • Bug Fixes

    • Improved handling of Cloudflare bot challenges with browser-like request headers and retry logic.
    • Strengthened SSRF validation across redirect chains.
  • Documentation

    • Updated web::fetch skill documentation with page-reading mode, format options, and image return behavior.

…ext-safe caps

Add a `format` param ("markdown" | "text" | "html") to web::fetch for
reading web pages rather than calling APIs. HTML is converted to Markdown
or plain text (turndown/htmlparser2); requests go out with a browser UA +
format-matched Accept/Accept-Language and retry once with the honest
configured UA on a Cloudflare challenge. Image responses come back as a
viewable image block ({content, details} envelope) routed through the
Anthropic provider wire, with text-only providers falling back to a text
line. Bodies above max_transform_bytes skip the synchronous transform to
protect the worker event loop.

Split the byte and timeout caps into default-vs-ceiling. Raw fetches keep
defaulting to the 5 MiB ceiling (resolveMaxBytes), preserving the historical
contract so existing API/download callers are not silently truncated; only
page-reading mode defaults to the context-safe 256 KiB, since a transformed
1 MiB+ SPA page would otherwise blow the turn's context window. Timeout
gains a default_timeout_ms separate from the raised 120s ceiling.
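
The transform bounds in this commit message (skip above `max_transform_bytes`, fall back to the raw body on failure), together with the in-band truncation marker from the PR summary, could look roughly like the sketch below. The converter is injected as a plain function so the example stays self-contained; the function name, the option names, and the marker text are all assumptions rather than the harness's actual code.

```typescript
// Hypothetical sketch: the converter stands in for turndown/htmlparser2,
// and the marker wording is an assumption.
type HtmlConverter = (html: string) => string;

interface TransformOutcome {
  body: string;
  transformed: boolean;
}

const ASSUMED_TRUNCATION_MARKER = "\n\n[truncated: response hit max_bytes]";

function transformPageBody(
  html: string,
  convert: HtmlConverter,
  opts: { maxTransformBytes: number; bytesTruncated: boolean },
): TransformOutcome {
  // Oversized bodies skip the synchronous transform entirely.
  const size = new TextEncoder().encode(html).length;
  if (size > opts.maxTransformBytes) {
    return { body: html, transformed: false };
  }
  try {
    const converted = convert(html);
    // Flag truncation in-band, since callers may not check bytes_truncated.
    const body = opts.bytesTruncated ? converted + ASSUMED_TRUNCATION_MARKER : converted;
    return { body, transformed: true };
  } catch {
    // A failing transform (for example, stack overflow on deep nesting)
    // falls back to the raw body instead of failing the fetch.
    return { body: html, transformed: false };
  }
}
```

Measuring the size before calling the converter is the point of the guard: the check is cheap, while the conversion is synchronous and would block the event loop for its full duration.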
@vercel

vercel Bot commented Jun 8, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| workers | Ready | Preview, Comment | Jun 8, 2026 3:07pm |

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Walkthrough

This PR adds page-reading mode to web::fetch, enabling HTML-to-markdown/text conversion for agent document retrieval. It introduces format-driven response transforms, image envelope handling, Cloudflare bot-challenge retry logic, browser-like headers, and Anthropic wire-message integration for image blocks.

Changes

Web Fetch Page-Reading Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Schemas, Types, and Configuration | harness/src/web/schemas.ts, harness/src/web/config.ts | PageFormatSchema (markdown/text/html), FetchImageResult type, and WebConfig extensions add default/ceiling timeouts, response bytes, and transform size bounds. Optional format field in FetchPayloadSchema and optional content_type/transformed in FetchResult document page-reading metadata. |
| HTML Conversion and Request Utilities | harness/src/web/convert.ts, harness/tests/web/convert.test.ts | New module provides convertHtmlToMarkdown (via turndown) and extractTextFromHtml (via htmlparser2), MIME classification (isImageMime, isViewableImageMime), and browser-like headers (BROWSER_USER_AGENT, ACCEPT_LANGUAGE, acceptHeaderFor). Comprehensive tests verify markdown rules, text extraction, header ordering, and identity constants. |
| Fetch Implementation with Page-Reading and Image Handling | harness/src/web/fetch.ts | executeFetch gains resolveTimeout/resolveMaxBytes helpers, page-mode browser header injection (UA, Accept, Accept-Language), conditional Cloudflare bot-challenge retry (403 + cf-mitigated), and expanded response handling: viewable image MIME returns FetchImageResult with base64 block; HTML transforms to markdown/text within max_transform_bytes with explicit truncation notice; transform failure falls back to raw body. Redirect chains attached to results. |
| Fetch Unit and Integration Tests | harness/tests/web/fetch.test.ts, harness/tests/web/fetch.integration.test.ts | Unit tests validate resolveTimeout, resolveMaxBytes, schema format parsing (markdown/text/html, reject unknown, backward compat). Integration tests exercise loopback server with HTML/image endpoints, redirects, timeouts, and POST; verify page-format transforms, browser header precedence, Cloudflare retry behavior, image envelope hardening, response byte caps, and transform bounds with truncation notice. |
| Handler, Wire Protocol, and Anthropic Provider Integration | harness/src/web/handlers/fetch.ts, harness/src/types/wire.ts, harness/src/provider-anthropic/wire-messages.ts, harness/tests/.../{handler,wire-messages,wire}.test.ts | Handler return type widened to `Promise<FetchResult |
| System Prompts and Model Guidance | harness/src/turn-orchestrator/prompt/{default,anthropic,gpt,kimi}.ts, harness/tests/turn-orchestrator/system-prompt.test.ts | Updated prompts for all model variants to steer toward page-reading with format: "markdown" instead of raw HTML. Documents timeout caps, SSRF protection, and HTML-to-markdown behavior. Tests assert guidance is present across all prompt families. |
| Dependencies and User Documentation | harness/package.json, harness/src/web/skills/index.md | Added htmlparser2, turndown, @types/turndown dependencies; switched dev:all script from bun --watch to tsx --watch. Skill documentation describes format request field, browser-UA behavior, Cloudflare retry semantics, image-response special case (returns image + one-line summary), page-reading-only response fields (content_type, transformed), and expanded error table (timeout, too_many_redirects, transport_error). Clarified response_format is ignored when format is set. Added markdown page-read example. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

  • iii-hq/workers#227: Introduces and refactors the per-model system prompt strings (PROMPT_DEFAULT, PROMPT_ANTHROPIC, PROMPT_GPT, PROMPT_KIMI) that are further updated in this PR to steer model behavior toward page-reading mode.

Suggested reviewers

  • sergiofilhowz

Poem

🐰 A rabbit bounces through the web,
With turndown's help, no HTML step,
From markdown morsels, pages refined,
Bold images base64-aligned,
No floods of raw HTML to find! 🌿✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: adds page-reading mode to web::fetch, support for tool-result images in wire messages, and context-safe caps with separate defaults/ceilings. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

skill-check — worker

0 verified, 14 skipped (no docs/).

Layer Result
structure
vale
ai
render

Four for four. Nicely done.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@harness/src/web/schemas.ts`:
- Line 60: Update the payload description string in harness/src/web/schemas.ts
that currently hard-codes "256 KiB" and "5 MiB": remove the literal sizes and
instead refer to the configurable parameters default_response_bytes and
max_response_bytes (and keep the behavior note about truncation and
bytes_truncated:true). Locate the schema/property whose description contains
"Cap on response body bytes..." and revise the text to state that the default
and maximum sizes are configurable via default_response_bytes and
max_response_bytes rather than fixed byte values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31027a60-1077-4140-8adc-e6af85f4df0c

📥 Commits

Reviewing files that changed from the base of the PR and between 444f47e and cc8eacf.

⛔ Files ignored due to path filters (1)
  • harness/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (20)
  • harness/package.json
  • harness/src/provider-anthropic/wire-messages.ts
  • harness/src/turn-orchestrator/prompt/anthropic.ts
  • harness/src/turn-orchestrator/prompt/default.ts
  • harness/src/turn-orchestrator/prompt/gpt.ts
  • harness/src/turn-orchestrator/prompt/kimi.ts
  • harness/src/types/wire.ts
  • harness/src/web/config.ts
  • harness/src/web/convert.ts
  • harness/src/web/fetch.ts
  • harness/src/web/handlers/fetch.ts
  • harness/src/web/schemas.ts
  • harness/src/web/skills/index.md
  • harness/tests/provider-anthropic/wire-messages.test.ts
  • harness/tests/turn-orchestrator/system-prompt.test.ts
  • harness/tests/types/wire.test.ts
  • harness/tests/web/convert.test.ts
  • harness/tests/web/fetch.integration.test.ts
  • harness/tests/web/fetch.test.ts
  • harness/tests/web/handler.test.ts

       .optional()
       .describe(
-      'Cap on response body bytes. Larger responses are truncated and bytes_truncated:true is returned.',
+      'Cap on response body bytes. Defaults to the worker ceiling (5 MiB) for raw fetches, or a context-safe 256 KiB in page-reading mode (`format` set); pass an explicit value to override (up to the 5 MiB ceiling). Larger responses are truncated and bytes_truncated:true is returned.',

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid hard-coded byte limits in the payload contract text.

Line 60 hard-codes 256 KiB and 5 MiB, but those values are now configurable via default_response_bytes and max_response_bytes. This can produce incorrect tool guidance when deployments override defaults.

Suggested wording update
-      'Cap on response body bytes. Defaults to the worker ceiling (5 MiB) for raw fetches, or a context-safe 256 KiB in page-reading mode (`format` set); pass an explicit value to override (up to the 5 MiB ceiling). Larger responses are truncated and bytes_truncated:true is returned.',
+      'Cap on response body bytes. Defaults to the worker raw-fetch default, or a context-safe page-reading default when `format` is set; pass an explicit value to override (up to the worker max_response_bytes ceiling). Larger responses are truncated and bytes_truncated:true is returned.',

@andersonleal andersonleal merged commit 7962447 into main Jun 8, 2026
13 checks passed
@andersonleal andersonleal deleted the feat/web-fetch-page-reading branch June 8, 2026 15:45