
feat: Error recovery classification and standard error vocabulary #1083

Open
KonstantinMirin wants to merge 6 commits into prebid:main from KonstantinMirin:KonstantinMirin/v3.6-error-resilience

@KonstantinMirin
Collaborator

Summary

Adds RecoveryHint classification (transient / correctable / terminal) to the
AdCPError hierarchy and propagates it through all three transport boundaries (MCP, A2A, REST).
Buyer agents can now inspect the recovery field in error responses to decide whether to
retry, fix their request, or abandon.
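For illustration only, a buyer agent reading the REST error body could branch on the recovery value roughly like this. This is a minimal sketch, not code from this PR: `decide_next_step`, the action strings, and the `max_attempts` retry budget are hypothetical, and treating a missing hint as terminal is a conservative assumption.

```python
from typing import Literal

Action = Literal["retry", "fix_and_resend", "abandon"]


def decide_next_step(error_body: dict, attempt: int, max_attempts: int = 3) -> Action:
    """Map an error response's recovery hint to a buyer-agent action (hypothetical helper)."""
    # Assumption: if the server sent no hint, do not retry blindly.
    recovery = error_body.get("recovery", "terminal")
    if recovery == "transient":
        # Same request may succeed later; bound the number of attempts.
        return "retry" if attempt < max_attempts else "abandon"
    if recovery == "correctable":
        # The request itself must change (e.g. fix a bad reference) before resending.
        return "fix_and_resend"
    return "abandon"


body = {"error_code": "RATE_LIMIT_EXCEEDED", "message": "slow down", "recovery": "transient"}
print(decide_next_step(body, attempt=1))  # retry
```

A real agent would also add backoff between transient retries; the sketch only shows the classification step.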

  • Add RecoveryHint type and recovery field to AdCPError base class
  • Add 4 new exception subclasses: AdCPConflictError (409), AdCPGoneError (410),
    AdCPBudgetExhaustedError (422), AdCPServiceUnavailableError (503)
  • Align error codes with the AdCP specification (RATE_LIMITED → RATE_LIMIT_EXCEEDED)
  • Propagate recovery through MCP (ToolError args), A2A (data field), and REST (JSONResponse)
  • Audit all ~87 raise sites across src/ and add 13 explicit overrides where the subclass default was wrong
  • Add vocabulary consistency guard test to prevent future drift

Motivation

Closes #1072. Partial progress on #1075 (idempotency key schema prep is deferred because it is
blocked on the adcp Python library shipping the field).

Without recovery hints, buyer agents have no machine-readable signal for whether an error is
worth retrying. A transient adapter failure and a permanent validation error both look the same.
This PR gives every error a classification so agents can self-heal.

Changes

Exception hierarchy (src/core/exceptions.py)

Subclass                     HTTP status  Error code            Default recovery
AdCPError (base)             500          INTERNAL_ERROR        terminal
AdCPValidationError          400          VALIDATION_ERROR      correctable
AdCPAuthenticationError      401          AUTHENTICATION_ERROR  terminal
AdCPAuthorizationError       403          AUTHORIZATION_ERROR   terminal
AdCPNotFoundError            404          NOT_FOUND             terminal
AdCPConflictError            409          CONFLICT              correctable
AdCPGoneError                410          GONE                  terminal
AdCPBudgetExhaustedError     422          BUDGET_EXHAUSTED      terminal
AdCPRateLimitError           429          RATE_LIMIT_EXCEEDED   transient
AdCPAdapterError             502          ADAPTER_ERROR         transient
AdCPServiceUnavailableError  503          SERVICE_UNAVAILABLE   transient
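A minimal sketch of how the class-level defaults in the table and the per-instance recovery= override could fit together. Only the recovery= kwarg, to_dict(), the status/error codes, and the REST field names come from this PR; the attribute names (status_code, error_code, default_recovery) and the constructor signature are assumptions, and only four of the eleven subclasses are shown.

```python
from typing import Any, Literal, Optional

RecoveryHint = Literal["transient", "correctable", "terminal"]


class AdCPError(Exception):
    """Base error; subclasses override the class-level defaults below."""

    status_code: int = 500
    error_code: str = "INTERNAL_ERROR"
    default_recovery: RecoveryHint = "terminal"

    def __init__(self, message: str, details: Optional[dict[str, Any]] = None,
                 recovery: Optional[RecoveryHint] = None) -> None:
        super().__init__(message)
        self.message = message
        self.details = details
        # A per-instance override wins; otherwise fall back to the subclass default.
        self.recovery: RecoveryHint = recovery or self.default_recovery

    def to_dict(self) -> dict[str, Any]:
        return {"error_code": self.error_code, "message": self.message,
                "recovery": self.recovery, "details": self.details}


class AdCPValidationError(AdCPError):
    status_code, error_code, default_recovery = 400, "VALIDATION_ERROR", "correctable"


class AdCPConflictError(AdCPError):
    status_code, error_code, default_recovery = 409, "CONFLICT", "correctable"


class AdCPRateLimitError(AdCPError):
    status_code, error_code, default_recovery = 429, "RATE_LIMIT_EXCEEDED", "transient"


class AdCPServiceUnavailableError(AdCPError):
    status_code, error_code, default_recovery = 503, "SERVICE_UNAVAILABLE", "transient"
```

Keeping the default on the class means a new subclass only declares three attributes, while unusual raise sites can still pass recovery= explicitly.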

Transport propagation

  • MCP: ToolError(code, message, recovery); recovery is carried as the 3rd positional arg (args[2])
  • A2A: InvalidParamsError(message=..., data={"recovery": "..."}); recovery travels in the JSON-RPC data field
  • REST: {"error_code": "...", "message": "...", "recovery": "...", "details": ...}
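The three transport shapes can be sketched as plain functions over a small error record. ToolErrorStandIn is a local placeholder for the MCP framework's ToolError, ErrorInfo and the to_* helper names are hypothetical, and falling back to terminal when an MCP error carries only two args is an assumption rather than documented behavior. extract_error_info mirrors the 3-tuple contract described in the commit notes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ErrorInfo:
    """Hypothetical stand-in for the fields an AdCPError carries."""
    error_code: str
    message: str
    recovery: str
    details: Optional[dict[str, Any]] = None


class ToolErrorStandIn(Exception):
    """Local placeholder for the MCP ToolError; real code uses the framework's class."""


def to_mcp(err: ErrorInfo) -> ToolErrorStandIn:
    # Positional args (code, message, recovery): recovery ends up in args[2].
    return ToolErrorStandIn(err.error_code, err.message, err.recovery)


def extract_error_info(exc: Exception) -> tuple[str, str, str]:
    # Assumption: a legacy 2-arg error is reported as terminal.
    code, message, *rest = exc.args
    return code, message, (rest[0] if rest else "terminal")


def to_a2a_fields(err: ErrorInfo) -> dict[str, Any]:
    # Keyword arguments for a JSON-RPC InvalidParams-style error; recovery rides in `data`.
    return {"message": err.message, "data": {"recovery": err.recovery}}


def to_rest_body(err: ErrorInfo) -> dict[str, Any]:
    # Same field names as the REST JSONResponse body listed above.
    return {"error_code": err.error_code, "message": err.message,
            "recovery": err.recovery, "details": err.details}
```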

Raise site audit (13 overrides)

Where the subclass default was incorrect for the specific error context:

  • AdCPValidationError → terminal (server data integrity, not user-fixable)
  • AdCPValidationError → transient (pending admin setup / external service)
  • AdCPNotFoundError → correctable (user can fix reference and retry)
  • AdCPAdapterError → terminal (internal DB failure, not external adapter)
  • AdCPAuthorizationError → correctable (buyer can add missing brand param)
  • AdCPValidationError → terminal (feature gap, not user error)
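Overrides happen at the raise site through the recovery= kwarg. Here is a hedged sketch of the format-not-found case, where the NotFound default (terminal) is swapped for correctable. resolve_format and KNOWN_FORMATS are hypothetical names, and the error classes are trimmed-down stand-ins for the real hierarchy.

```python
from typing import Optional


class AdCPError(Exception):
    default_recovery = "terminal"

    def __init__(self, message: str, recovery: Optional[str] = None) -> None:
        super().__init__(message)
        self.recovery = recovery or self.default_recovery


class AdCPNotFoundError(AdCPError):
    default_recovery = "terminal"  # subclass default from the table


KNOWN_FORMATS = {"display_300x250", "video_15s"}  # hypothetical format catalog


def resolve_format(format_id: str) -> str:
    if format_id not in KNOWN_FORMATS:
        # The buyer sent a bad reference and can pick a valid format and retry,
        # so this raise site overrides the NotFound default to correctable.
        raise AdCPNotFoundError(f"Format '{format_id}' not found", recovery="correctable")
    return format_id
```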

Test plan

  • 63 new test functions across unit and integration suites
  • All 11 subclasses tested for recovery in MCP, A2A, and REST serialization
  • Custom recovery= overrides tested through all 3 boundaries
  • Roundtrip tests: raise → catch at boundary → serialize → verify recovery
  • Vocabulary consistency guard prevents error code drift from spec
  • make quality: 3294 passed, 0 failed
  • ./run_all_tests.sh: all 5 suites pass (unit, integration, integration_v2, e2e, ui)

Related

… AdCPError hierarchy

Implements GH prebid#1072 — each AdCPError subclass now carries a recovery hint
that helps buyer agents decide whether to retry, fix, or abandon a request.

Defaults: ValidationError=correctable, RateLimitError=transient,
AdapterError=transient, all others=terminal. Per-instance override via
recovery= kwarg. to_dict() includes recovery in serialized output.
Rename RATE_LIMIT_EXCEEDED to RATE_LIMITED per AdCP spec. Add four new
exception subclasses: AdCPConflictError (409), AdCPGoneError (410),
AdCPBudgetExhaustedError (422), AdCPServiceUnavailableError (503).
Update A2A error mapping and transport boundary translation for all
new types. All error codes now use consistent UPPER_SNAKE_CASE.
…undaries

MCP: ToolError now carries recovery as args[2]; extract_error_info returns a
3-tuple (code, message, recovery).

A2A: _adcp_to_a2a_error forwards recovery in the JSON-RPC data field.

REST: adcp_error_handler already used to_dict() which includes recovery;
_handle_tool_error in api_v1.py now also extracts and forwards recovery.

Tests: 31 assertions across all 3 transports verify recovery propagation.
…sses

Cover recovery propagation through all 3 transport boundaries (MCP, A2A, REST):
- to_dict() includes recovery for all 11 subclasses
- Custom recovery= override preserved through MCP, A2A, and REST boundaries
- Roundtrip tests: raise -> boundary catch -> serialize -> deserialize -> verify
- Parametrized tests for all subclasses in both MCP and A2A transports
- Integration tests for recovery in A2A ServerError.data and REST JSON body
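A dependency-free sketch of the raise -> boundary catch -> serialize -> deserialize -> verify roundtrip, looping over cases instead of using pytest parametrization. rest_boundary, roundtrip, and CASES are hypothetical names, and the classes are reduced stand-ins for the real hierarchy.

```python
import json
from typing import Callable, Optional


class AdCPError(Exception):
    error_code = "INTERNAL_ERROR"
    default_recovery = "terminal"

    def __init__(self, message: str, recovery: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.recovery = recovery or self.default_recovery

    def to_dict(self) -> dict:
        return {"error_code": self.error_code, "message": self.message, "recovery": self.recovery}


class AdCPValidationError(AdCPError):
    error_code, default_recovery = "VALIDATION_ERROR", "correctable"


class AdCPAdapterError(AdCPError):
    error_code, default_recovery = "ADAPTER_ERROR", "transient"


def rest_boundary(handler: Callable[[], None]) -> str:
    """Hypothetical boundary: run a handler and serialize any AdCPError as JSON."""
    try:
        handler()
    except AdCPError as exc:
        return json.dumps(exc.to_dict())
    return json.dumps({"ok": True})


def roundtrip(exc: AdCPError) -> str:
    def handler() -> None:
        raise exc
    # Deserialize what the boundary produced and read back the hint.
    return json.loads(rest_boundary(handler))["recovery"]


CASES = [
    (AdCPError("boom"), "terminal"),
    (AdCPValidationError("bad field"), "correctable"),
    (AdCPAdapterError("upstream 502"), "transient"),
    # A per-instance override must survive the boundary too.
    (AdCPAdapterError("db write failed", recovery="terminal"), "terminal"),
]

for exc, expected in CASES:
    assert roundtrip(exc) == expected, (type(exc).__name__, expected)
```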

beads: salesagent-ezem
Review all ~87 raise AdCPError sites and add explicit recovery=
overrides where the subclass default is wrong:

- AdCPValidationError (default correctable) overridden to terminal
  for server data integrity errors (missing pricing_options, missing
  rate on fixed pricing, internal bugs, unimplemented features,
  missing identity)
- AdCPValidationError overridden to transient for setup-incomplete
  and property-list-resolver failures (external service dependency)
- AdCPNotFoundError (default terminal) overridden to correctable
  for format-not-found, creative-IDs-not-found, and package-not-found
  (buyer can fix their request and retry)
- AdCPAdapterError (default transient) overridden to terminal for
  workflow context creation failure (internal DB, not external adapter)
- AdCPAuthorizationError (default terminal) overridden to correctable
  for brand-manifest-required (buyer can add brand parameter)

Also updates structural guard allowlist for line shift in products.py.

Cross-referenced our error_code strings against the canonical vocabulary
in adcp-req ERROR_CODE_VOCABULARY.md. Found one mismatch: RATE_LIMITED
should be RATE_LIMIT_EXCEEDED. Added vocabulary consistency guard test
to prevent future divergence from the spec.
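One way such a vocabulary guard could work: walk every AdCPError subclass and fail if any error_code is missing from the canonical set. This is a sketch assuming the spec codes are available as a set; CANONICAL_CODES is a hard-coded, partial stand-in for ERROR_CODE_VOCABULARY.md, and all_subclasses / unknown_error_codes are hypothetical helpers.

```python
class AdCPError(Exception):
    error_code = "INTERNAL_ERROR"


class AdCPRateLimitError(AdCPError):
    error_code = "RATE_LIMIT_EXCEEDED"


class AdCPGoneError(AdCPError):
    error_code = "GONE"


# Hypothetical, partial stand-in for the codes in ERROR_CODE_VOCABULARY.md;
# a real guard would load them from the spec document instead of hard-coding.
CANONICAL_CODES = frozenset({"INTERNAL_ERROR", "RATE_LIMIT_EXCEEDED", "GONE", "CONFLICT"})


def all_subclasses(cls: type) -> list[type]:
    """Depth-first walk of the hierarchy, including indirect subclasses."""
    found: list[type] = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


def unknown_error_codes(base: type[AdCPError], canonical: frozenset) -> set[str]:
    codes = {base.error_code} | {sub.error_code for sub in all_subclasses(base)}
    return codes - canonical


def test_error_codes_match_spec() -> None:
    drift = unknown_error_codes(AdCPError, CANONICAL_CODES)
    assert not drift, f"error codes not in spec vocabulary: {sorted(drift)}"
```

Walking __subclasses__() recursively means a newly added subclass is checked automatically, without anyone having to register it in the test.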
@KonstantinMirin KonstantinMirin force-pushed the KonstantinMirin/v3.6-error-resilience branch from 08287b2 to 96bdba5 on March 6, 2026 17:34
@KonstantinMirin KonstantinMirin marked this pull request as ready for review March 6, 2026 17:34
@ChrisHuie ChrisHuie self-requested a review March 6, 2026 22:57

Development

Successfully merging this pull request may close these issues.

feat: Add Error.recovery hints and standard error codes
