Skip to content

fix(octo): surface real failure detail in task:failed IM replies#33

Merged
lml2468 merged 1 commit into
mainfrom
fix/octo-task-failed-error-detail
Jun 15, 2026
Merged

fix(octo): surface real failure detail in task:failed IM replies#33
lml2468 merged 1 commit into
mainfrom
fix/octo-task-failed-error-detail

Conversation

@lml2468

@lml2468 lml2468 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

问题

线上关联 Octo bot 后,给 bot 发消息总是收到无信息量的英文提示 "The agent run failed.",无法判断 agent run 为何失败。

根因

task:failed 事件的 payload 只携带了粗粒度的 failure_reason 分类器:

  • broadcastTaskEvent(daemon 通过 FailTask 主动上报的路径)既没有 error 也没有 chat_session_id
  • HandleFailedTasks(sweeper 批处理路径)有 failure_reasonerrorchat_session_id

而 Octo 的 outbound relay (errorMessageFromPayload) 只读 error / error_message 两个字段 → 永远取不到 → 永远回退到英文兜底文案。同时 chat task 经 sweeper 失败时因缺 chat_session_id 根本无法路由到 IM relay

真实错误其实写进了 chat_message 表,只是从未进入事件 payload,所以传不到 IM。

改动

server/internal/service/task.go — 两条 EventTaskFailed 发布路径都补全:

  • 加入 error(经 redact.Text 脱敏的真实错误)+ failure_reason
  • sweeper 路径补 chat_session_id,使 chat task 失败也能路由到 Octo/Lark

server/internal/integrations/octo/outbound.goerrorMessageFromPayloadfailureMessageFromPayload,三级优先级:

  1. 显式 error / error_message 真实详情
  2. failure_reason 翻译为中文友好提示(如 provider_auth_or_access → "模型服务认证失败,请检查 Agent runtime 的 API Key 配置")
  3. 通用中文兜底(替代原英文文案)
  • 未知 reason 优雅降级,不泄露原始枚举值(符合 CLAUDE.md 的 enum-drift-downgrades 约定)

outbound_test.go — 新增测试:显式 error 优先、failure_reason 兜底、未知 reason 降级、空/非 map payload。

附带价值

Lark relay 同样受益——它本来就读 error,而 FailTask 路径之前从不填充该字段,现在 Lark 也能拿到真实错误。

验证

  • go build ./...
  • go vet(octo + service)✅
  • 测试通过:octo / service / cmd/server / lark

注:本 PR 让失败原因可见。线上 agent run 失败的底层根因(大概率是模型 provider 配置/凭据)需在部署后据 IM 提示进一步确认。

The task:failed event payload only carried the coarse failure_reason
classifier (and the FailTask broadcast path carried neither error nor
chat_session_id). The Octo outbound relay only read error/error_message,
so it always fell back to the English 'The agent run failed.' — IM users
never saw the actual cause and chat tasks failed via the sweeper path
weren't routed at all.

- task.go: include redacted error + failure_reason on both task:failed
  publish paths, and chat_session_id on the sweeper-batch path so chat
  tasks route to the IM relay.
- octo/outbound.go: replace errorMessageFromPayload with
  failureMessageFromPayload — prefer explicit error, fall back to a
  friendly Chinese description of failure_reason, then a generic default.
  Unknown reasons downgrade gracefully (no raw enum leak).
- Lark benefits too: it already read error, which the FailTask path now
  populates.

Tests cover explicit-error precedence, failure_reason fallback, unknown
reason downgrade, and empty/non-map payloads.

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q · automated review]

Verdict: Approve — no blocking findings; notes below (data-flow traced).


PR#33 Review Report — fix(octo): surface real failure detail in task:failed IM replies

Reviewer: Octo-Q (automated review)
Head SHA: 342a65caeefb10f2059d9f7bf3937dcc97f03583
Repo: Mininglamp-OSS/Octo-Q
Files: outbound.go (+58/-13), outbound_test.go (+67/-0), task.go (+29/-7)


1. Verification Summary

Item Status Evidence
FailTask path enriches payload with error + failure_reason task.go:2025-2031 — reads task.FailureReason and task.Error from the FailAgentTask RETURNING row; both fields are persisted at task.go:1381-1387
HandleFailedTasks path adds chat_session_id + error task.go:1775-1797 — conditionally adds chat_session_id when t.ChatSessionID.Valid, adds error (redacted) when t.Error.Valid
failureMessageFromPayload precedence correct outbound.go:228-248 — error → error_message → failure_reason map → default; always returns non-empty string
failureReasonText map covers all canonical reasons 21/21 keys match server/pkg/taskfailure/failure.go constants exactly; no drift
redact.Text safe for error strings server/pkg/redact/redact.go:82-95 — regex-based secret masking, never returns empty for non-empty input
Tests cover key scenarios outbound_test.go — explicit error priority, failure_reason fallback, unknown reason downgrade, empty/non-map/nil payload, empty error fall-through
sendReply never drops the message outbound.go:148 drops empty content, but "⚠️ " + failureMessageFromPayload(...) is always non-empty since defaultFailureMessage is non-empty

2. Findings

No P0 or P1 findings.

All changes are strictly additive improvements. No working path becomes unavailable, no user-visible wrong data is produced, and no behavior is made worse.

P2 Observations (non-blocking)

P2-1: Lark relay has no failure_reason fallback (pre-existing)

  • diff-scope: pre-existing — Lark's errorMessageFromPayload (lark/outbound.go:480-491) only reads error/error_message, not failure_reason. This PR does not touch the Lark relay.
  • Impact: When a sweeper-path task has FailureReason set but Error NULL, Lark users see an empty error card. The PR partially mitigates this by adding error to the payload when t.Error.Valid, but the gap remains for NULL-error rows.
  • Suggestion: Consider adding a failure_reason fallback to Lark's errorMessageFromPayload in a follow-up, similar to the Octo relay's 3-level precedence.

P2-2: local_directory_error not in failureReasonText map

  • diff-scope: pre-existing — the daemon produces this reason (daemon.go:2384,2395,2463) but it's not in the canonical taskfailure.Reason enum or the display map.
  • Impact: Falls through to defaultFailureMessage gracefully. No user-visible breakage, but the user gets a generic message instead of a specific one about local directory issues.
  • Suggestion: Add "local_directory_error" to both taskfailure/failure.go and failureReasonText in a follow-up.

P2-3: Two independent event-publishing paths for EventTaskFailed

  • diff-scope: pre-existing architecture — broadcastTaskEvent (FailTask path, task.go:2007-2039) and HandleFailedTasks (sweeper path, task.go:1775-1797) construct similar payloads independently.
  • Impact: Maintenance risk for future drift. The PR improves parity between the two paths but doesn't consolidate them.
  • Suggestion: Consider extracting a shared failedTaskPayload(task) helper in a follow-up.

P2-4: Neither path sets top-level events.Event.TaskID / ChatSessionID

  • diff-scope: pre-existing — both paths leave the optional routing hints as empty strings. Downstream taskAndSessionFromEvent falls back to reading from the payload map, so functionally correct.
  • Impact: Realtime fanout routing scope hints are always empty for task:failed events. No user-visible impact today.

3. Data-Flow Tracing

FailTask path (daemon → IM)

daemon.go:2318 → client.FailTask(errMsg, failureReason)
  → handler/daemon.go:2002 → TaskService.FailTask()
    → FailAgentTask SQL: UPDATE ... RETURNING * (persists Error + FailureReason)
    → broadcastTaskEvent() [task.go:2007]
      → payload["failure_reason"] = task.FailureReason.String  ✅
      → payload["error"] = redact.Text(task.Error.String)      ✅
      → payload["chat_session_id"] = ...                       ✅
  → Bus.Publish(EventTaskFailed)
  → Octo outbound: failureMessageFromPayload reads error → failure_reason → default  ✅
  → Lark outbound: errorMessageFromPayload reads error  ✅

HandleFailedTasks path (sweeper → IM)

sweeper → HandleFailedTasks(tasks) [task.go:1708]
  → failureReason = t.FailureReason.String (or "agent_error" default)
  → payload["failure_reason"] = failureReason                  ✅
  → payload["chat_session_id"] = ... (if t.ChatSessionID.Valid) ✅ NEW — fixes chat task routing
  → payload["error"] = redact.Text(t.Error.String) (if Valid)  ✅ NEW
  → Bus.Publish(EventTaskFailed)
  → Octo/Lark outbound: same consumption as above  ✅

Octo outbound consumption

failureMessageFromPayload(payload):
  1. payload.(map[string]any) fails → defaultFailureMessage  ✅
  2. m["error"].(string) non-empty → return it               ✅
  3. m["error_message"].(string) non-empty → return it       ✅
  4. m["failure_reason"] lookup in failureReasonText → return mapped text  ✅
  5. unknown reason → defaultFailureMessage (no enum leak)   ✅
  6. no fields → defaultFailureMessage                       ✅

4. R5 Blindspot Checklist

  • C1 — Dual-path parity: Both broadcastTaskEvent and HandleFailedTasks now enrich payloads with error, failure_reason, and chat_session_id. The two paths produce consistent payload shapes. ✅ Clear.
  • C2 — Control-flow ordering / nesting: failureMessageFromPayload is a pure function with clear 3-level precedence. redact.Text is called before payload insertion (correct ordering — redact at source, not at display). No nested/double-application risk. ✅ Clear.
  • C3 — Authorization boundary: No new endpoints, tools, or capabilities exposed. Purely internal event enrichment and IM relay message improvement. ✅ N/A.

5. Verdict

[Octo-Q] verdict: APPROVE

No P0/P1 findings. The PR is a clean, well-tested improvement that:

  1. Fixes a real bug (chat task failures invisible to IM users due to missing chat_session_id)
  2. Surfaces actionable failure detail instead of a generic English fallback
  3. Degrades gracefully for unknown failure reasons (no enum leak)
  4. Has comprehensive test coverage for the new relay logic

The four P2 observations are all pre-existing and not amplified by this PR. They are suitable for follow-up work.

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is in scope and correctly improves task failure relay detail for Multica chat integrations.

💬 Non-blocking

🟡 Warning — server/internal/integrations/octo/outbound.go: one Go comment includes a non-English phrase. Repository conventions require code comments to be English-only, so this should be reworded.

🟡 Warning — server/internal/integrations/octo/outbound.go: the new user-facing localized fallback strings appear to use English product terms where the repository voice guide prefers localized terminology for some concepts. This is not a functional blocker, but worth aligning before or after merge for consistency.

✅ Highlights

The failure payload now carries redacted error detail and failure_reason on both failure publication paths, including the sweeper path’s chat_session_id needed by IM relays.

The Octo fallback logic handles explicit errors, known classifier values, unknown future values, nil payloads, and non-map payloads without leaking raw enum drift.

I verified the focused backend tests with:

cd server && go test ./internal/integrations/octo ./internal/service

Both packages passed.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #33 (multica)

Reviewed at head 342a65c. Build, go vet, and the octo + service package tests all pass locally (Go 1.26). The fix is correct, well-scoped, and the routing mechanism it depends on checks out. Verdict: APPROVE — the findings below are all non-blocking (P2 / nits / follow-ups); none require a change before merge.

What the PR does, and why it's correct

The root cause is accurately diagnosed: the task:failed event payload carried only the coarse failure_reason, and the FailTask broadcast path carried neither error nor chat_session_id, so the Octo relay always fell back to the English generic and sweeper-failed chat tasks were never routed at all.

I verified the routing mechanism end-to-end. taskAndSessionFromEvent (octo/outbound.go) reads e.ChatSessionID (struct field) first and falls back to payload["chat_session_id"]. Both publish sites in task.go set only the payload map, so this payload-map fallback is exactly what makes sweeper chat-task routing work — confirmed reachable: Subscribe(EventTaskFailed)handleEventprocessEventsendReply(failureMessageFromPayload(...)).

Other checks that passed:

  • Enum sync: all 21 keys in failureReasonText exactly match the canonical taskfailure.Reason constants. No typos, no dead keys.
  • Unknown-reason downgrade: unmapped reasons fall through to the generic default rather than leaking a raw enum value — matches the repo's documented enum-drift convention.
  • No leakage onto success events: broadcastTaskEvent guards the new error/failure_reason with .Valid checks, and the octo relay only consumes chat:done/task:failed, so completed/cancelled events never surface these fields.
  • util.UUIDToString safely returns "" for invalid UUIDs (no panic, no zero-UUID); the unconditional issue_id serialization is pre-existing and unchanged.

Findings (all non-blocking)

P2 — error precedence makes the localized copy unreachable for sweeper/platform failures

failureMessageFromPayload prefers error over failure_reason. But all four sweeper SQL queries write a non-empty English error alongside failure_reason ('task timed out', 'runtime went offline', 'task expired in queue', 'daemon restarted while task was in flight'), and HandleFailedTasks always forwards that error. So for queued_expired / runtime_offline / runtime_recovery / timeout the friendly Chinese copy is effectively unreachable — IM users see the redacted English sentinel instead. The same holds for the daemon-derived platform reasons via broadcastTaskEvent. This is a deliberate tradeoff (showing the real error does satisfy the PR goal), but the localized strings for those reasons are largely dead in production. Consider treating those canned English sentinels as "no real detail" and preferring failure_reason for them, or consciously accept English copy for those paths.

P2 — Error text egress relies on a denylist redactor with no length cap

task.Error originates on the daemon host, but redact.Text only masks the server process's home dir/username and a fixed ~11-pattern secret denylist. It does not mask daemon-side filesystem paths/usernames, non-vendor secrets (e.g. AIza…, https://user:pass@host), stack traces, or internal hostnames/IPs, and applies no length bound. Because error wins precedence, whatever slips through is relayed verbatim.

Important context that keeps this non-blocking: this is not a new external sink. The pre-existing chat:done success path already relays redact.Text(<full daemon agent output>) to the same Octo/Lark channel through the same redactor (untouched by this PR). The bound channel is the user's own chat session, and redact.Text is the repo's established redact-at-egress convention. This PR adds the failure subset of the same data class through the same guard — consistent with what already ships. Worth a follow-up to (a) prefer the classifier copy and treat raw error as opt-in/bounded on the IM path, and (b) strengthen redact.Text for external egress — but it does not gate this PR.

P2 — The PR's headline routing fix has no regression test

Every new/existing test sets ChatSessionID on the event struct field, which short-circuits routing before the payload-map fallback. The payload-only chat_session_id path — the exact code this PR relies on to route sweeper-failed chat tasks — has zero coverage. A future refactor of the publish sites could silently break IM failure replies with all tests still green. Add one case: Payload: {task_id, chat_session_id, failure_reason} with the struct fields left empty, asserting s.sent == 1.

P2 — Map keys are string literals, decoupled from the enum, with no parity test

failureReasonText is keyed by raw literals and the package doesn't import taskfailure; the only guard is a "keep in sync" comment. All 21 currently match, but a future enum addition silently falls through to default with no CI signal. A parity test iterating taskfailure.AllReasons() and asserting a map entry exists would close this.

Nit — Lark only benefits where error is set

Lark's errorMessageFromPayload still reads only error/error_message, so a failure_reason-only payload renders the generic Run failed. card while Octo localizes it. In practice this is latent (no in-tree call site produces a failure_reason-only payload — sweepers and FailTask both populate error), and it's not a regression. The PR body's "Lark benefits too" is accurate for the paths that set error. Consider lifting failureReasonText into a shared package both relays consume, as a follow-up.

Nit — task.Error stored raw, redacted per-consumer

The DB column holds the raw value and each of the (now six) consumers must remember redact.Text. This PR's two new sites do so correctly and don't double-redact. Pre-existing structural property, not introduced here; redacting at the FailAgentTask write boundary would be defense-in-depth for a separate ticket.

Coverage note

Inspected: the 3 changed files, taskfailure/failure.go (full enum), redact.Text + its pattern list, the sweeper SQL (agent.sql/runtime.sql failure queries), the daemon FailTask flow, the chat:done success-path egress (for the trust-boundary comparison), the octo event-routing path, and the Lark relay. Built + vetted + ran octo/service tests (green). Not separately exercised: live Octo/Lark delivery, and the daemon-side production error strings (assessed by data-flow, not runtime capture).

@lml2468 lml2468 merged commit ac1ba3b into main Jun 15, 2026
5 checks passed
@lml2468 lml2468 deleted the fix/octo-task-failed-error-detail branch June 15, 2026 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants