feat(harness): web::fetch page-reading mode, tool-result images, context-safe caps #233
Add a `format` param ("markdown" | "text" | "html") to web::fetch for
reading web pages rather than calling APIs. HTML is converted to Markdown
or plain text (turndown/htmlparser2); requests go out with a browser UA +
format-matched Accept/Accept-Language and retry once with the honest
configured UA on a Cloudflare challenge. Image responses come back as a
viewable image block ({content, details} envelope) routed through the
Anthropic provider wire, with text-only providers falling back to a text
line. Bodies above max_transform_bytes skip the synchronous transform to
protect the worker event loop.
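
As a rough illustration of the Cloudflare retry decision described above, the check could look like the sketch below. The PR summary describes the challenge as a 403 carrying `cf-mitigated: challenge`, retried once and only for idempotent GET/HEAD. The names `HopResponse`, `isCloudflareChallenge`, and `shouldRetryWithConfiguredUa` are hypothetical and not taken from the harness code.

```typescript
// Hypothetical response shape: header names are assumed to be lower-cased already.
interface HopResponse {
  status: number;
  headers: Record<string, string | undefined>;
}

// A Cloudflare challenge, per the PR text: HTTP 403 plus `cf-mitigated: challenge`.
function isCloudflareChallenge(res: HopResponse): boolean {
  const mitigated = (res.headers["cf-mitigated"] ?? "").trim().toLowerCase();
  return res.status === 403 && mitigated === "challenge";
}

// Retry at most once, and only for idempotent methods (GET/HEAD).
function shouldRetryWithConfiguredUa(
  method: string,
  res: HopResponse,
  alreadyRetried: boolean,
): boolean {
  const m = method.toUpperCase();
  const idempotent = m === "GET" || m === "HEAD";
  return idempotent && !alreadyRetried && isCloudflareChallenge(res);
}
```

Keeping the decision in a pure function like this makes the "retry once" and "idempotent only" rules easy to unit test without a network.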
Split the byte and timeout caps into default-vs-ceiling. Raw fetches keep
defaulting to the 5 MiB ceiling (resolveMaxBytes), preserving the historical
contract so existing API/download callers are not silently truncated; only
page-reading mode defaults to the context-safe 256 KiB, since a transformed
1 MiB+ SPA page would otherwise blow the turn's context window. Timeout
gains a default_timeout_ms separate from the raised 120s ceiling.
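
A minimal sketch of the default-vs-ceiling split, assuming a config object whose `default_response_bytes` holds the page-reading default and whose `max_response_bytes` holds the ceiling. The function signatures, the `WebFetchConfig` shape, and the example values are assumptions for illustration; the real `resolveMaxBytes` may be structured differently.

```typescript
// Hypothetical config shape; the field names follow the PR text, the layout is assumed.
interface WebFetchConfig {
  default_response_bytes: number; // assumed: page-reading default (256 KiB in the PR)
  max_response_bytes: number;     // ceiling (5 MiB in the PR)
  default_timeout_ms: number;     // 30s in the PR
  max_timeout_ms: number;         // 120s ceiling in the PR
}

const KIB = 1024;
const MIB = 1024 * KIB;

const exampleConfig: WebFetchConfig = {
  default_response_bytes: 256 * KIB,
  max_response_bytes: 5 * MIB,
  default_timeout_ms: 30_000,
  max_timeout_ms: 120_000,
};

// Raw fetches default to the ceiling; page-reading mode defaults to the smaller cap.
// An explicit request always wins but is clamped to the ceiling.
function resolveMaxBytes(
  cfg: WebFetchConfig,
  requested: number | undefined,
  pageMode: boolean,
): number {
  if (requested !== undefined) return Math.min(requested, cfg.max_response_bytes);
  return pageMode
    ? Math.min(cfg.default_response_bytes, cfg.max_response_bytes)
    : cfg.max_response_bytes;
}

// The timeout falls back to the default and is clamped to the ceiling.
function resolveTimeout(cfg: WebFetchConfig, requested?: number): number {
  return Math.min(requested ?? cfg.default_timeout_ms, cfg.max_timeout_ms);
}
```

The raw-fetch branch returning the ceiling is the part that preserves the historical contract: a caller that never passed a cap sees the same limit as before.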
skill-check — worker0 verified, 14 skipped (no docs/).
Four for four. Nicely done.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@harness/src/web/schemas.ts`:
- Line 60: Update the payload description string in harness/src/web/schemas.ts
that currently hard-codes "256 KiB" and "5 MiB": remove the literal sizes and
instead refer to the configurable parameters default_response_bytes and
max_response_bytes (and keep the behavior note about truncation and
bytes_truncated:true). Locate the schema/property whose description contains
"Cap on response body bytes..." and revise the text to state that the default
and maximum sizes are configurable via default_response_bytes and
max_response_bytes rather than fixed byte values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31027a60-1077-4140-8adc-e6af85f4df0c
⛔ Files ignored due to path filters (1)
harness/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (20)
- harness/package.json
- harness/src/provider-anthropic/wire-messages.ts
- harness/src/turn-orchestrator/prompt/anthropic.ts
- harness/src/turn-orchestrator/prompt/default.ts
- harness/src/turn-orchestrator/prompt/gpt.ts
- harness/src/turn-orchestrator/prompt/kimi.ts
- harness/src/types/wire.ts
- harness/src/web/config.ts
- harness/src/web/convert.ts
- harness/src/web/fetch.ts
- harness/src/web/handlers/fetch.ts
- harness/src/web/schemas.ts
- harness/src/web/skills/index.md
- harness/tests/provider-anthropic/wire-messages.test.ts
- harness/tests/turn-orchestrator/system-prompt.test.ts
- harness/tests/types/wire.test.ts
- harness/tests/web/convert.test.ts
- harness/tests/web/fetch.integration.test.ts
- harness/tests/web/fetch.test.ts
- harness/tests/web/handler.test.ts
  .optional()
  .describe(
- 'Cap on response body bytes. Larger responses are truncated and bytes_truncated:true is returned.',
+ 'Cap on response body bytes. Defaults to the worker ceiling (5 MiB) for raw fetches, or a context-safe 256 KiB in page-reading mode (`format` set); pass an explicit value to override (up to the 5 MiB ceiling). Larger responses are truncated and bytes_truncated:true is returned.',
Avoid hard-coded byte limits in the payload contract text.
Line 60 hard-codes 256 KiB and 5 MiB, but those values are now configurable via default_response_bytes and max_response_bytes. This can produce incorrect tool guidance when deployments override defaults.
Suggested wording update
- 'Cap on response body bytes. Defaults to the worker ceiling (5 MiB) for raw fetches, or a context-safe 256 KiB in page-reading mode (`format` set); pass an explicit value to override (up to the 5 MiB ceiling). Larger responses are truncated and bytes_truncated:true is returned.',
+ 'Cap on response body bytes. Defaults to the worker raw-fetch default, or a context-safe page-reading default when `format` is set; pass an explicit value to override (up to the worker max_response_bytes ceiling). Larger responses are truncated and bytes_truncated:true is returned.',
Summary
Adds a page-reading mode to `web::fetch` so agents can read web pages (not just call APIs), returns viewable images for image URLs, and reworks the response/timeout caps into a default-vs-ceiling split that is context-safe without silently truncating raw fetches.

Page-reading mode (`format`)
- `format` param: `"markdown"` (HTML→Markdown, best for reading), `"text"` (HTML→plain text), `"html"` (raw).
- Requests go out with a browser UA and format-matched `Accept`/`Accept-Language`, and the hop is retried once with the honest configured UA on a Cloudflare challenge (403 + `cf-mitigated: challenge`), only for idempotent GET/HEAD.
- The HTML transforms (`turndown`/`htmlparser2`) live in the new `web/convert.ts`. Transform failures (e.g. hostile deep nesting that overflows turndown) fall back to the raw body; bodies above `max_transform_bytes` skip the synchronous transform entirely to protect the worker event loop.

Tool-result images
- Image responses come back as a `{content, details}` envelope with an `image` block (allowlisted mime, 2xx, complete/non-truncated, non-empty), routed through the Anthropic provider wire so the model can view them; text-only providers fall back to the text line. Anything not safely viewable falls through to the normal base64 envelope.

Context-safe caps (the fix from the adversarial review)
- `resolveMaxBytes`: raw fetches keep defaulting to the 5 MiB ceiling (preserving the historical contract: existing API/download callers are not silently truncated), while page-reading mode defaults to a context-safe 256 KiB (a transformed 1 MiB+ SPA page would otherwise blow the turn's context window).
- `resolveTimeout`: new `default_timeout_ms` (30s) separate from a raised `max_timeout_ms` ceiling (120s).
- … `max_bytes`, since agents rarely re-check `bytes_truncated`.

Test plan
- `harness` typecheck clean (`tsc -b --noEmit`)
- `harness` full suite: 1282 tests pass / 121 files
- `biome check` clean on changed files
- `resolveMaxBytes` (raw→ceiling regression guard, page-mode→256 KiB, explicit override, clamping), HTML→markdown/text transforms, Cloudflare retry, image viewability gating, transform bounds
- … `format: "markdown"`) and an image URL

Summary by CodeRabbit
Release Notes
New Features
- `web::fetch` with configurable `format` parameter (`markdown`, `text`, `html`) to intelligently transform web content and prevent context flooding.
Bug Fixes
Documentation
- `web::fetch` skill documentation with page-reading mode, format options, and image return behavior.
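
For reference, the image viewability gate described under "Tool-result images" could be sketched roughly as follows. The four conditions (allowlisted mime, 2xx, non-truncated, non-empty) come from the PR description. The type names, the field names inside the `{content, details}` envelope, and the specific allowlist entries are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real harness types may differ.
interface FetchedBody {
  status: number;
  contentType: string;     // raw Content-Type header value
  base64: string;          // response body, base64-encoded
  bytesTruncated: boolean; // true when the byte cap cut the body short
}

type ContentBlock =
  | { type: "image"; mimeType: string; data: string }
  | { type: "text"; text: string };

interface ToolResultEnvelope {
  content: ContentBlock[];
  details: { status: number; mimeType: string };
}

// Assumed allowlist; the PR only states that the mime type must be allowlisted.
const VIEWABLE_MIME_TYPES = new Set(["image/png", "image/jpeg", "image/gif", "image/webp"]);

// Strip parameters such as "; charset=..." and normalise case.
function baseMimeType(contentType: string): string {
  return (contentType.split(";")[0] ?? "").trim().toLowerCase();
}

// All four conditions from the PR text must hold.
function isViewableImage(body: FetchedBody): boolean {
  return (
    VIEWABLE_MIME_TYPES.has(baseMimeType(body.contentType)) &&
    body.status >= 200 &&
    body.status < 300 &&
    !body.bytesTruncated &&
    body.base64.length > 0
  );
}

// Returns null when the body is not safely viewable, so the caller can
// fall through to the normal base64 envelope.
function toImageEnvelope(body: FetchedBody): ToolResultEnvelope | null {
  if (!isViewableImage(body)) return null;
  const mimeType = baseMimeType(body.contentType);
  return {
    content: [{ type: "image", mimeType, data: body.base64 }],
    details: { status: body.status, mimeType },
  };
}
```

Rejecting truncated bodies matters here because a partially downloaded image would be sent to the model as a corrupt block rather than as a readable failure.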