fix(octo): surface real failure detail in task:failed IM replies#33
Conversation
The task:failed event payload only carried the coarse failure_reason classifier (and the FailTask broadcast path carried neither error nor chat_session_id). The Octo outbound relay only read error/error_message, so it always fell back to the English 'The agent run failed.' — IM users never saw the actual cause and chat tasks failed via the sweeper path weren't routed at all. - task.go: include redacted error + failure_reason on both task:failed publish paths, and chat_session_id on the sweeper-batch path so chat tasks route to the IM relay. - octo/outbound.go: replace errorMessageFromPayload with failureMessageFromPayload — prefer explicit error, fall back to a friendly Chinese description of failure_reason, then a generic default. Unknown reasons downgrade gracefully (no raw enum leak). - Lark benefits too: it already read error, which the FailTask path now populates. Tests cover explicit-error precedence, failure_reason fallback, unknown reason downgrade, and empty/non-map payloads.
mochashanyao
left a comment
There was a problem hiding this comment.
[Octo-Q · automated review]
Verdict: Approve — no blocking findings; notes below (data-flow traced).
PR#33 Review Report — fix(octo): surface real failure detail in task:failed IM replies
Reviewer: Octo-Q (automated review)
Head SHA: 342a65caeefb10f2059d9f7bf3937dcc97f03583
Repo: Mininglamp-OSS/Octo-Q
Files: outbound.go (+58/-13), outbound_test.go (+67/-0), task.go (+29/-7)
1. Verification Summary
| Item | Status | Evidence |
|---|---|---|
FailTask path enriches payload with error + failure_reason |
✅ | task.go:2025-2031 — reads task.FailureReason and task.Error from the FailAgentTask RETURNING row; both fields are persisted at task.go:1381-1387 |
HandleFailedTasks path adds chat_session_id + error |
✅ | task.go:1775-1797 — conditionally adds chat_session_id when t.ChatSessionID.Valid, adds error (redacted) when t.Error.Valid |
failureMessageFromPayload precedence correct |
✅ | outbound.go:228-248 — error → error_message → failure_reason map → default; always returns non-empty string |
failureReasonText map covers all canonical reasons |
✅ | 21/21 keys match server/pkg/taskfailure/failure.go constants exactly; no drift |
redact.Text safe for error strings |
✅ | server/pkg/redact/redact.go:82-95 — regex-based secret masking, never returns empty for non-empty input |
| Tests cover key scenarios | ✅ | outbound_test.go — explicit error priority, failure_reason fallback, unknown reason downgrade, empty/non-map/nil payload, empty error fall-through |
sendReply never drops the message |
✅ | outbound.go:148 drops empty content, but "⚠️ " + failureMessageFromPayload(...) is always non-empty since defaultFailureMessage is non-empty |
2. Findings
No P0 or P1 findings.
All changes are strictly additive improvements. No working path becomes unavailable, no user-visible wrong data is produced, and no behavior is made worse.
P2 Observations (non-blocking)
P2-1: Lark relay has no failure_reason fallback (pre-existing)
- diff-scope: pre-existing — Lark's
errorMessageFromPayload(lark/outbound.go:480-491) only readserror/error_message, notfailure_reason. This PR does not touch the Lark relay. - Impact: When a sweeper-path task has
FailureReasonset butErrorNULL, Lark users see an empty error card. The PR partially mitigates this by addingerrorto the payload whent.Error.Valid, but the gap remains for NULL-error rows. - Suggestion: Consider adding a
failure_reasonfallback to Lark'serrorMessageFromPayloadin a follow-up, similar to the Octo relay's 3-level precedence.
P2-2: local_directory_error not in failureReasonText map
- diff-scope: pre-existing — the daemon produces this reason (
daemon.go:2384,2395,2463) but it's not in the canonicaltaskfailure.Reasonenum or the display map. - Impact: Falls through to
defaultFailureMessagegracefully. No user-visible breakage, but the user gets a generic message instead of a specific one about local directory issues. - Suggestion: Add
"local_directory_error"to bothtaskfailure/failure.goandfailureReasonTextin a follow-up.
P2-3: Two independent event-publishing paths for EventTaskFailed
- diff-scope: pre-existing architecture —
broadcastTaskEvent(FailTask path,task.go:2007-2039) andHandleFailedTasks(sweeper path,task.go:1775-1797) construct similar payloads independently. - Impact: Maintenance risk for future drift. The PR improves parity between the two paths but doesn't consolidate them.
- Suggestion: Consider extracting a shared
failedTaskPayload(task)helper in a follow-up.
P2-4: Neither path sets top-level events.Event.TaskID / ChatSessionID
- diff-scope: pre-existing — both paths leave the optional routing hints as empty strings. Downstream
taskAndSessionFromEventfalls back to reading from the payload map, so functionally correct. - Impact: Realtime fanout routing scope hints are always empty for
task:failedevents. No user-visible impact today.
3. Data-Flow Tracing
FailTask path (daemon → IM)
daemon.go:2318 → client.FailTask(errMsg, failureReason)
→ handler/daemon.go:2002 → TaskService.FailTask()
→ FailAgentTask SQL: UPDATE ... RETURNING * (persists Error + FailureReason)
→ broadcastTaskEvent() [task.go:2007]
→ payload["failure_reason"] = task.FailureReason.String ✅
→ payload["error"] = redact.Text(task.Error.String) ✅
→ payload["chat_session_id"] = ... ✅
→ Bus.Publish(EventTaskFailed)
→ Octo outbound: failureMessageFromPayload reads error → failure_reason → default ✅
→ Lark outbound: errorMessageFromPayload reads error ✅
HandleFailedTasks path (sweeper → IM)
sweeper → HandleFailedTasks(tasks) [task.go:1708]
→ failureReason = t.FailureReason.String (or "agent_error" default)
→ payload["failure_reason"] = failureReason ✅
→ payload["chat_session_id"] = ... (if t.ChatSessionID.Valid) ✅ NEW — fixes chat task routing
→ payload["error"] = redact.Text(t.Error.String) (if Valid) ✅ NEW
→ Bus.Publish(EventTaskFailed)
→ Octo/Lark outbound: same consumption as above ✅
Octo outbound consumption
failureMessageFromPayload(payload):
1. payload.(map[string]any) fails → defaultFailureMessage ✅
2. m["error"].(string) non-empty → return it ✅
3. m["error_message"].(string) non-empty → return it ✅
4. m["failure_reason"] lookup in failureReasonText → return mapped text ✅
5. unknown reason → defaultFailureMessage (no enum leak) ✅
6. no fields → defaultFailureMessage ✅
4. R5 Blindspot Checklist
- C1 — Dual-path parity: Both
broadcastTaskEventandHandleFailedTasksnow enrich payloads witherror,failure_reason, andchat_session_id. The two paths produce consistent payload shapes. ✅ Clear. - C2 — Control-flow ordering / nesting:
failureMessageFromPayloadis a pure function with clear 3-level precedence.redact.Textis called before payload insertion (correct ordering — redact at source, not at display). No nested/double-application risk. ✅ Clear. - C3 — Authorization boundary: No new endpoints, tools, or capabilities exposed. Purely internal event enrichment and IM relay message improvement. ✅ N/A.
5. Verdict
[Octo-Q] verdict: APPROVE
No P0/P1 findings. The PR is a clean, well-tested improvement that:
- Fixes a real bug (chat task failures invisible to IM users due to missing
chat_session_id) - Surfaces actionable failure detail instead of a generic English fallback
- Degrades gracefully for unknown failure reasons (no enum leak)
- Has comprehensive test coverage for the new relay logic
The four P2 observations are all pre-existing and not amplified by this PR. They are suitable for follow-up work.
Jerry-Xin
left a comment
There was a problem hiding this comment.
This PR is in scope and correctly improves task failure relay detail for Multica chat integrations.
💬 Non-blocking
🟡 Warning — server/internal/integrations/octo/outbound.go: one Go comment includes a non-English phrase. Repository conventions require code comments to be English-only, so this should be reworded.
🟡 Warning — server/internal/integrations/octo/outbound.go: the new user-facing localized fallback strings appear to use English product terms where the repository voice guide prefers localized terminology for some concepts. This is not a functional blocker, but worth aligning before or after merge for consistency.
✅ Highlights
The failure payload now carries redacted error detail and failure_reason on both failure publication paths, including the sweeper path’s chat_session_id needed by IM relays.
The Octo fallback logic handles explicit errors, known classifier values, unknown future values, nil payloads, and non-map payloads without leaking raw enum drift.
I verified the focused backend tests with:
cd server && go test ./internal/integrations/octo ./internal/serviceBoth packages passed.
yujiawei
left a comment
There was a problem hiding this comment.
Code Review — PR #33 (multica)
Reviewed at head 342a65c. Build, go vet, and the octo + service package tests all pass locally (Go 1.26). The fix is correct, well-scoped, and the routing mechanism it depends on checks out. Verdict: APPROVE — the findings below are all non-blocking (P2 / nits / follow-ups); none require a change before merge.
What the PR does, and why it's correct
The root cause is accurately diagnosed: the task:failed event payload carried only the coarse failure_reason, and the FailTask broadcast path carried neither error nor chat_session_id, so the Octo relay always fell back to the English generic and sweeper-failed chat tasks were never routed at all.
I verified the routing mechanism end-to-end. taskAndSessionFromEvent (octo/outbound.go) reads e.ChatSessionID (struct field) first and falls back to payload["chat_session_id"]. Both publish sites in task.go set only the payload map, so this payload-map fallback is exactly what makes sweeper chat-task routing work — confirmed reachable: Subscribe(EventTaskFailed) → handleEvent → processEvent → sendReply(failureMessageFromPayload(...)).
Other checks that passed:
- Enum sync: all 21 keys in
failureReasonTextexactly match the canonicaltaskfailure.Reasonconstants. No typos, no dead keys. - Unknown-reason downgrade: unmapped reasons fall through to the generic default rather than leaking a raw enum value — matches the repo's documented enum-drift convention.
- No leakage onto success events:
broadcastTaskEventguards the newerror/failure_reasonwith.Validchecks, and the octo relay only consumeschat:done/task:failed, so completed/cancelled events never surface these fields. util.UUIDToStringsafely returns""for invalid UUIDs (no panic, no zero-UUID); the unconditionalissue_idserialization is pre-existing and unchanged.
Findings (all non-blocking)
P2 — error precedence makes the localized copy unreachable for sweeper/platform failures
failureMessageFromPayload prefers error over failure_reason. But all four sweeper SQL queries write a non-empty English error alongside failure_reason ('task timed out', 'runtime went offline', 'task expired in queue', 'daemon restarted while task was in flight'), and HandleFailedTasks always forwards that error. So for queued_expired / runtime_offline / runtime_recovery / timeout the friendly Chinese copy is effectively unreachable — IM users see the redacted English sentinel instead. The same holds for the daemon-derived platform reasons via broadcastTaskEvent. This is a deliberate tradeoff (showing the real error does satisfy the PR goal), but the localized strings for those reasons are largely dead in production. Consider treating those canned English sentinels as "no real detail" and preferring failure_reason for them, or consciously accept English copy for those paths.
P2 — Error text egress relies on a denylist redactor with no length cap
task.Error originates on the daemon host, but redact.Text only masks the server process's home dir/username and a fixed ~11-pattern secret denylist. It does not mask daemon-side filesystem paths/usernames, non-vendor secrets (e.g. AIza…, https://user:pass@host), stack traces, or internal hostnames/IPs, and applies no length bound. Because error wins precedence, whatever slips through is relayed verbatim.
Important context that keeps this non-blocking: this is not a new external sink. The pre-existing chat:done success path already relays redact.Text(<full daemon agent output>) to the same Octo/Lark channel through the same redactor (untouched by this PR). The bound channel is the user's own chat session, and redact.Text is the repo's established redact-at-egress convention. This PR adds the failure subset of the same data class through the same guard — consistent with what already ships. Worth a follow-up to (a) prefer the classifier copy and treat raw error as opt-in/bounded on the IM path, and (b) strengthen redact.Text for external egress — but it does not gate this PR.
P2 — The PR's headline routing fix has no regression test
Every new/existing test sets ChatSessionID on the event struct field, which short-circuits routing before the payload-map fallback. The payload-only chat_session_id path — the exact code this PR relies on to route sweeper-failed chat tasks — has zero coverage. A future refactor of the publish sites could silently break IM failure replies with all tests still green. Add one case: Payload: {task_id, chat_session_id, failure_reason} with the struct fields left empty, asserting s.sent == 1.
P2 — Map keys are string literals, decoupled from the enum, with no parity test
failureReasonText is keyed by raw literals and the package doesn't import taskfailure; the only guard is a "keep in sync" comment. All 21 currently match, but a future enum addition silently falls through to default with no CI signal. A parity test iterating taskfailure.AllReasons() and asserting a map entry exists would close this.
Nit — Lark only benefits where error is set
Lark's errorMessageFromPayload still reads only error/error_message, so a failure_reason-only payload renders the generic Run failed. card while Octo localizes it. In practice this is latent (no in-tree call site produces a failure_reason-only payload — sweepers and FailTask both populate error), and it's not a regression. The PR body's "Lark benefits too" is accurate for the paths that set error. Consider lifting failureReasonText into a shared package both relays consume, as a follow-up.
Nit — task.Error stored raw, redacted per-consumer
The DB column holds the raw value and each of the (now six) consumers must remember redact.Text. This PR's two new sites do so correctly and don't double-redact. Pre-existing structural property, not introduced here; redacting at the FailAgentTask write boundary would be defense-in-depth for a separate ticket.
Coverage note
Inspected: the 3 changed files, taskfailure/failure.go (full enum), redact.Text + its pattern list, the sweeper SQL (agent.sql/runtime.sql failure queries), the daemon FailTask flow, the chat:done success-path egress (for the trust-boundary comparison), the octo event-routing path, and the Lark relay. Built + vetted + ran octo/service tests (green). Not separately exercised: live Octo/Lark delivery, and the daemon-side production error strings (assessed by data-flow, not runtime capture).
问题
线上关联 Octo bot 后,给 bot 发消息总是收到无信息量的英文提示 "The agent run failed.",无法判断 agent run 为何失败。
根因
task:failed事件的 payload 只携带了粗粒度的failure_reason分类器:broadcastTaskEvent(daemon 通过FailTask主动上报的路径)既没有error也没有chat_session_idHandleFailedTasks(sweeper 批处理路径)有failure_reason但缺error和chat_session_id而 Octo 的 outbound relay (
errorMessageFromPayload) 只读error/error_message两个字段 → 永远取不到 → 永远回退到英文兜底文案。同时 chat task 经 sweeper 失败时因缺chat_session_id根本无法路由到 IM relay。真实错误其实写进了
chat_message表,只是从未进入事件 payload,所以传不到 IM。改动
server/internal/service/task.go— 两条EventTaskFailed发布路径都补全:error(经redact.Text脱敏的真实错误)+failure_reasonchat_session_id,使 chat task 失败也能路由到 Octo/Larkserver/internal/integrations/octo/outbound.go—errorMessageFromPayload→failureMessageFromPayload,三级优先级:error/error_message真实详情failure_reason翻译为中文友好提示(如provider_auth_or_access→ "模型服务认证失败,请检查 Agent runtime 的 API Key 配置")outbound_test.go— 新增测试:显式 error 优先、failure_reason 兜底、未知 reason 降级、空/非 map payload。附带价值
Lark relay 同样受益——它本来就读
error,而FailTask路径之前从不填充该字段,现在 Lark 也能拿到真实错误。验证
go build ./...✅go vet(octo + service)✅octo/service/cmd/server/lark✅