
Risk control group enhance #1

Closed
andrenoah307 wants to merge 27 commits into Calderic:main from andrenoah307:dev/risk-control-group-scoping-20260426


@andrenoah307

Contributor

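⚠️ PR Notice

Important

  • Please provide a concise, human-written summary; avoid pasting unedited AI output.

📝 Description

Enhance the risk control center and add content moderation.
Attempt to fix the billing problem on abnormally terminated streams.

🚀 Type of change

  • 🐛 Bug fix - please link the corresponding Issue; do not classify design trade-offs, misunderstandings, or mismatched expectations as bugs
  • ✨ New feature - for major features, discuss in an Issue first
  • ⚡ Performance optimization / Refactor
  • 📝 Documentation

🔗 Related Issue

  • Closes # (if any)

✅ Checklist

  • Human review: I have organized and written this description myself and have not pasted unprocessed AI output.
  • Not a duplicate: I have searched existing Issues and PRs and confirmed this is not a duplicate submission.
  • Bug fix note: if this PR is marked as a bug fix, I have filed or linked the corresponding Issue, and I am not classifying design trade-offs, mismatched expectations, or misunderstandings as bugs.
  • Understanding of changes: I understand how these changes work and what they may affect.
  • [x] Scope focus: this PR contains no code changes unrelated to the current task.
  • Local verification: I have run the tests locally or verified manually, and maintainers can reproduce the result.
  • Security and compliance: the code contains no sensitive credentials and follows the project's coding conventions.

📸 Proof of Work

(Paste screenshots, key logs, or test reports here to show that the change works)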

- introduce per-group whitelist (EnabledGroups) plus per-group mode
  override (GroupModes) on RiskControlSetting; default empty so the
  engine has zero impact on upgrade until admins flip groups in
- key every metric/inflight/block/rule-hit redis key and memory map
  by (scope, subjectID, group); same token tracked across groups now
  has independent counters and block state
- snapshot UsingGroup into RelayInfo.RiskGroup at BeforeRelay so the
  finish/start pair always lands on the same bucket even when auto
  cross-group retry rewrites UsingGroup mid-request
- rebuild risk_subject_snapshot unique index to (subject_type,
  subject_id, group) with idempotent cross-db drop+recreate; add
  group columns and indexes to risk_rules and risk_incident
- gate every BeforeRelay/AfterRelay/enqueue call on
  isRiskControlEnabledForGroup so unlisted groups bypass risk control
  entirely; reject auto from EnabledGroups during normalize
- ship GET /api/risk/groups returning schema_version=1 matrix used by
  the new admin "分组启用矩阵" (group enablement matrix) widget; require
  ?group= on unblock, document why non-whitelisted unblocks are still allowed
- enforce "rule must bind groups before enable" in validateRiskRule
  and the rule editor; surface unconfigured + unlisted rule counts on
  the overview cards
- TDD coverage: triple-key isolation, group-aware evaluate, effective
  mode truth table, normalize filter for auto/invalid mode, controller
  unblock contract, sortRiskGroups order
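
The triple-key isolation described in this commit can be illustrated with a minimal sketch. The `riskKey` type, the `rc:` prefix, and the segment order are assumptions made for illustration; the commit only states that every key is built from (scope, subjectID, group).

```go
package main

import (
	"fmt"
	"strings"
)

// riskKey is a hypothetical composite key mirroring the
// (scope, subjectID, group) triple named in the commit message.
type riskKey struct {
	Scope     string // e.g. "token" or "user"
	SubjectID int
	Group     string
}

// redisKey renders the triple as a namespaced key. The "rc:" prefix and
// the segment order are assumptions of this sketch.
func (k riskKey) redisKey(kind string) string {
	return strings.Join([]string{"rc", kind, k.Scope, fmt.Sprint(k.SubjectID), k.Group}, ":")
}

// counters is an in-memory stand-in for the per-key metric map.
type counters map[riskKey]int

// incr bumps the counter for exactly one (scope, subject, group) bucket.
func (c counters) incr(k riskKey) int {
	c[k]++
	return c[k]
}

func main() {
	c := counters{}
	vip := riskKey{Scope: "token", SubjectID: 42, Group: "vip"}
	free := riskKey{Scope: "token", SubjectID: 42, Group: "free"}
	c.incr(vip)
	c.incr(vip)
	c.incr(free)
	fmt.Println(vip.redisKey("metric"), c[vip])   // rc:metric:token:42:vip 2
	fmt.Println(free.redisKey("metric"), c[free]) // rc:metric:token:42:free 1
}
```

Because the group is part of the key, the same token accumulates separate counts in `vip` and `free`, which is the behaviour the commit calls independent counters and block state.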
- triggers on dev/** and dev-* branches so admins can validate staging
  builds before merging back to main
- runs go vet and go test on the focused risk control packages so the
  TDD suite stays green
- builds and pushes a linux/amd64 image to ghcr.io with two tags:
    dev-<branch-slug>            (floating)
    dev-<branch-slug>-<sha7>     (immutable per commit)
- self-hosted runner references and goreleaser/helm/kustomize blocks
  from the reference workflow are intentionally omitted; this repo
  only needs github-hosted runners and a single ghcr push
The metadata-action emits multiple tags newline-separated. Inlining
${{ steps.meta.outputs.tags }} into "for tag in ..." injects literal
newlines into the shell script, which the runner parses as
"syntax error near unexpected token". Pass the value through env and
iterate with `while read` instead.
Two independent features sharing the risk console.

Login warning (enforce path):
- User.RiskWarningPendingAt timestamp refreshed by the engine whenever an
  enforce-mode user-scope decision is non-allow; ack handler zeroes it
- GET /api/user/self exposes a boolean risk_warning_pending without
  leaking timestamps, scopes, or rule names so users cannot reverse the
  thresholds
- POST /api/user/self/risk_warning/ack clears the flag without lifting
  the actual block
- Dashboard shell shows AccountRiskWarningModal once per fresh decision,
  closable only via the explicit acknowledge button

Async OpenAI omni-moderation:
- New independent ModerationSetting with EnabledGroups/GroupModes
  (mirroring the risk-control gate pattern), multi-key list, sampling
  rate (integer percent), threshold, two-tier retention (flagged rows
  kept for 30 days by default for downstream client handling, benign
  rows for 72h)
- ModerationKeyRing rotates keys round-robin with per-key cooldowns;
  429 honours Retry-After then falls back to exponential backoff
- moderationCenter copies relay payloads off the gin context and
  enqueues via gopool.Go so the relay path never blocks; debug card
  uses the same async pipeline with a polled debugStore for results
- Engine implements buildModerationRequest / parseModerationResponse
  per the official OpenAI API schema (multi-modal input array, results
  with category_scores and category_applied_input_types)
- New endpoints under /api/risk/moderation/{config,overview,debug,
  debug/:id,incidents}; admin tab in the risk page wraps the existing
  distribution-detection workflow in a top-level Tabs strip
- PreflightModerationHook stubbed for future enforce-mode work; the
  signature is locked so callers do not change later

Tests (all green via go test ./service/ ./model/ ./controller/):
- KeyRing round-robin / cooldown skip / all-cooldown / reset / empty
- buildModerationRequest emits multi-modal array and rejects empty
- parseModerationResponse folds max score and applied input types
- parseRetryAfter handles plain seconds and invalid values
- ModerationSetting normalize filters auto and clamps sampling/threshold
- IsModerationEnabledForGroup truth table
- PreflightModerationHook stub allow-all
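
The round-robin rotation with per-key cooldowns listed above could look roughly like the following. This is a sketch under assumptions, not the PR's ModerationKeyRing: the `keyRing` type, its method names, and the use of plain unix seconds are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoUsableKey = errors.New("no usable key (empty ring or all keys cooling down)")

// keyRing is a hypothetical stand-in for a round-robin key ring.
type keyRing struct {
	mu       sync.Mutex
	keys     []string
	next     int              // index where the next search starts
	cooldown map[string]int64 // key -> unix second at which it is usable again
}

func newKeyRing(keys ...string) *keyRing {
	return &keyRing{keys: keys, cooldown: make(map[string]int64)}
}

// pick returns the next key in round-robin order, skipping every key
// whose cooldown has not yet expired at `now` (unix seconds).
func (r *keyRing) pick(now int64) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i := 0; i < len(r.keys); i++ {
		k := r.keys[(r.next+i)%len(r.keys)]
		if until, ok := r.cooldown[k]; ok && now < until {
			continue
		}
		r.next = (r.next + i + 1) % len(r.keys)
		return k, nil
	}
	return "", errNoUsableKey
}

// cool marks a key unusable until the given unix second, for example
// after a 429 response that carried a Retry-After value.
func (r *keyRing) cool(key string, until int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cooldown[key] = until
}

func main() {
	r := newKeyRing("k1", "k2", "k3")
	r.cool("k2", 30) // k2 is unusable until t=30
	for i := 0; i < 3; i++ {
		k, _ := r.pick(0)
		fmt.Println(k) // k1, then k3, then k1
	}
}
```

Passing `now` in explicitly keeps the sketch deterministic, which is also convenient for the cooldown-skip and all-cooldown test cases listed above.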
…gories

Replace the single-threshold gate with a full rule system mirroring the
distribution-detection workflow:

- ModerationRule model + CRUD with name/match_mode/action/priority/
  score_weight/conditions/groups, indexed by name and enabled. Reload
  drops rules with empty groups so admins cannot accidentally enable a
  rule that silently never fires.
- ModerationCondition is the unit predicate: { category, op, value,
  apply_input_type, applied_input_type }. ApplyInputType is a per-row
  toggle so admins choose whether the rule must match the OpenAI
  category_applied_input_types list (text/image) or the raw category
  score regardless of modality.
- ValidateModerationRule pins category to the official OpenAI 13-item
  list, value to [0,1], op to the shared comparator set, and rejects
  image-only filters on text-only categories (e.g. sexual/minors).
- EvaluateModerationRules runs every group-applicable rule with the
  rule's own match_mode (all → AND, any → OR; default all). Conditions
  whose category is missing from the response are short-circuit
  failures under AND, ignored under OR.
- BuildModerationDecision picks the most severe action (block > flag
  > observe). Block is recorded in incidents but does NOT short-circuit
  the relay path in v3 — the existing PreflightModerationHook stub is
  the future home for that behavior.
- moderation_center now records incidents only when a rule fires; the
  legacy FlagScoreThreshold fallback has been removed per the v3 design
  ("不保留兜底", i.e. keep no fallback). Debug events still record so
  admins can audit threshold-tuning sessions; previewModerationDecision
  in debug mode evaluates against every enabled rule (group-agnostic) so
  the editor shows what would fire.
- Five seeded default rules (sexual/minors block, violent illicit flag,
  text-only sexual flag, image-only violence flag, hate/harassment
  combo observe) — all Enabled=false until an operator binds a group.
- Frontend: the 内容审核 (content moderation) tab gains an "审核规则" (moderation rules) card with table, switch,
  edit/delete actions, and a 6-field-per-row editor modal. Each
  condition row exposes the apply_input_type toggle plus a category
  picker that disables the image option on text-only categories.
- ModerationIncident gains decision/primary_rule/matched_rules columns
  so downstream dashboards can pivot on rule names.

Tests:
- AND mode requires every condition; OR mode needs one
- ApplyInputType toggle distinguishes text-only vs image-only matches
- Group filtering, decision severity ordering, allow on empty match
- Validate rejects unknown categories, out-of-range scores, image
  filters on text-only categories, enabled rules without groups
- previewModerationDecision picks across all enabled rules regardless
  of group
- Category list exposes image_scored flag for the UI dropdown
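
The match-mode semantics and the severity ordering described in this commit can be sketched as follows. The types, the comparator set, and the function names are simplified assumptions; group filtering and the apply_input_type toggle are left out, and treating a rule with no conditions as a non-match is also an assumption.

```go
package main

import "fmt"

// condition and rule are simplified, hypothetical versions of the
// ModerationCondition and ModerationRule described above.
type condition struct {
	Category string
	Op       string // ">", ">=", "<" or "<=" in this sketch
	Value    float64
}

type rule struct {
	Name       string
	MatchMode  string // "any" means OR; anything else is treated as "all" (AND)
	Action     string // "observe", "flag" or "block"
	Conditions []condition
}

func (c condition) holds(score float64) bool {
	switch c.Op {
	case ">":
		return score > c.Value
	case ">=":
		return score >= c.Value
	case "<":
		return score < c.Value
	case "<=":
		return score <= c.Value
	}
	return false
}

// matches: a category missing from the response fails the rule under AND
// and is simply ignored under OR.
func (r rule) matches(scores map[string]float64) bool {
	anyMode := r.MatchMode == "any"
	for _, c := range r.Conditions {
		score, present := scores[c.Category]
		hit := present && c.holds(score)
		if anyMode && hit {
			return true
		}
		if !anyMode && !hit {
			return false
		}
	}
	return !anyMode && len(r.Conditions) > 0
}

// severity orders actions as block > flag > observe; "allow" ranks 0.
var severity = map[string]int{"observe": 1, "flag": 2, "block": 3}

// decide returns the most severe action among matching rules, or "allow".
func decide(rules []rule, scores map[string]float64) string {
	best := "allow"
	for _, r := range rules {
		if r.matches(scores) && severity[r.Action] > severity[best] {
			best = r.Action
		}
	}
	return best
}

func main() {
	scores := map[string]float64{"violence": 0.9, "hate": 0.2}
	rules := []rule{
		{Name: "combo", MatchMode: "all", Action: "block", Conditions: []condition{
			{"hate", ">=", 0.1}, {"harassment", ">=", 0.1}}}, // harassment missing: AND fails
		{Name: "either", MatchMode: "any", Action: "flag", Conditions: []condition{
			{"harassment", ">=", 0.1}, {"violence", ">=", 0.8}}}, // violence hits: OR passes
	}
	fmt.Println(decide(rules, scores)) // flag
}
```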
Production debug runs were silently routed through a synthetic
"__debug__" group that no rule binds to, so the rule engine returned
allow and the recorded incident showed 未命中 (no hit), even when OpenAI
itself had flagged the input. Meanwhile the debug card rendered the
upstream raw flagged field, which is a different field from the
persisted decision, so admins saw 命中 (hit) on the result Tag and
未命中 (no hit) in the incident table at the same time.

- SubmitModerationDebug now accepts a group parameter; non-empty
  evaluates the request against that group's bound rules (mirroring
  production traffic in that group), empty falls back to the legacy
  preview that scans every enabled rule. The whitelist gate is
  bypassed in both cases so admins can rehearse before flipping on.
- The frontend debug card adds a group selector (default = preview;
  groups list mirrors the moderation enablement matrix and excludes
  auto). The submit payload includes the chosen group.
- Result rendering distinguishes the OpenAI raw flagged tag from the
  rule-engine decision tag and lists matched rule names so the card
  no longer disagrees with the persisted incident row.
Both engines (distribution detection and moderation) now hand off to a
single enforcement service that owns the user-facing email policy,
per-user hit counters, and the auto-ban decision. Engines stay
decoupled — neither one knows anything about email plumbing.

- EnforcementSetting registers under the "enforcement" option key with
  defaults locked off (Enabled=false, EmailOn*=false, BanThreshold=0)
  so upgrade is zero-effect until an operator opts in. Per-source ban
  thresholds let admins weight moderation hits differently from
  distribution hits, and the email rate limiter is "max N emails per M
  minutes per user" with a 3-per-10min default.
- service/enforcement.go.EnforcementHit is the engine-facing entry
  point, gopool-spawned and side-effect free when disabled. Hit
  processing follows decision points 1-9: fixed-window counter (resets
  at expiry), atomic per-user UPDATE, audit row with merged email
  status, optional auto-ban that flips User.Status to Disabled exactly
  like the existing admin disable flow. Already-banned users skip
  silently to avoid duplicate emails. Vague email templates intentionally
  omit rule names, only carrying time/group/source/count/threshold.
- Counters live on the User row (HitCountRisk, HitCountModeration,
  WindowStartAt, LastHitAt, EmailWindowStartAt, EmailCountInWindow,
  AutoBannedAt) so increments are a single UPDATE without joins. Manual
  unban / reset zeroes everything per decision point 6.
- The two engines call EnforcementHit at the same place they fire
  user-level vague warnings (risk_control on enforce-mode user
  decisions; moderation_center on rule-engine non-allow decisions for
  relay-source events only — debug runs stay local).
- 8 admin endpoints under /api/risk/enforcement/* (config, overview,
  incidents, counters, reset_counter, unban, test_email). The test
  email endpoint is hard-wired to the calling admin's mailbox so it
  cannot be repurposed as a relay (decision point 7).
- Frontend ships a third top-level tab "处置操作" (enforcement actions) with overview cards,
  full strategy editor (sources / window / thresholds / per-source
  thresholds / email rate limit / templates), per-user counter table,
  and audit incident list with source/action filters.
- Tests cover Normalize source filtering and rate-limit defaults, the
  source gate truth table, per-source threshold fallback, email body
  rendering with the "no rule names" red line, and a no-op assertion
  that confirms EnforcementHit is safe to call when disabled.
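
The "max N emails per M minutes per user" limiter with a fixed window that resets at expiry might be sketched like this. The `emailWindow` type only mirrors the EmailWindowStartAt / EmailCountInWindow pair named above; the method name and the unix-second representation are assumptions.

```go
package main

import "fmt"

// emailWindow is a hypothetical per-user fixed window, loosely mirroring
// the EmailWindowStartAt / EmailCountInWindow columns. Times are unix seconds.
type emailWindow struct {
	StartAt int64
	Count   int
}

// allow reports whether one more email may be sent at `now` under a
// "max emails per windowSecs" budget. The window is fixed: it starts with
// the first email and resets only once it has fully expired.
func (w *emailWindow) allow(now, windowSecs int64, max int) bool {
	if w.StartAt == 0 || now-w.StartAt >= windowSecs {
		w.StartAt, w.Count = now, 0
	}
	if w.Count >= max {
		return false
	}
	w.Count++
	return true
}

func main() {
	var w emailWindow
	const window, max = 600, 3 // the 3-per-10min default from the commit
	for i := 0; i < 4; i++ {
		fmt.Println(w.allow(1000, window, max)) // true, true, true, false
	}
	fmt.Println(w.allow(1600, window, max)) // true: the window expired and reset
}
```

In the PR the counters live on the User row and are updated with a single UPDATE, so a real implementation would perform this check in the database rather than in memory.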
The 内容审核 (content moderation) tab derived its enabledGroupSet from the
riskGroups prop. That response is sourced from GET /api/risk/groups, which
reads the distribution-detection whitelist. Because the two engines are
decoupled, an admin who enabled the default group only for content
moderation would see "default 已启用" (default enabled) in the per-tab
matrix card but still find debug card dropdowns and rule editor labels
marking default as "分组未启用内容审核" (group not enabled for content moderation).

Production capture confirmed the divergence:

  moderation.enabled_groups   = ["default"]
  risk_control.enabled_groups = ["svip"]

Switch the in-tab badge source to config.enabled_groups (the live
moderation setting). The riskGroups prop is still passed in but only
used to enumerate the available group names — never to decide whether
a group is enabled for moderation.
… batch audit writes

Production captured a real bug: ban-notification emails were being
dropped because the single hit-email rate-limit budget was exhausted
by an earlier flurry of hit emails. SSH-validated database snapshot
confirmed the dropped ban email and showed channel-500/404 upstream
failures still entering the moderation queue, contaminating hit
counters with content the user never actually saw.

Changes:

- Split email rate limiting into two independent buckets. Hit emails
  keep the existing 10min/3 default. Ban emails get their own bucket
  (default 60min/3) backed by new User columns
  enforcement_ban_email_window_start_at + count. Hit-bucket exhaustion
  cannot starve ban notifications. JSON migration preserves the v2
  email_rate_limit_* keys so saved configs upgrade in place.
- Skip moderation entirely when relayErr != nil and the response
  delivered no chunks (RelayInfo.SendResponseCount == 0). Failed
  upstream requests no longer waste OpenAI tokens or pollute hit
  counters. Streaming responses that delivered at least one SSE chunk
  before failing still moderate, since the user did receive content.
- Switch the in-memory queue to ring-buffer semantics — when full we
  drop the OLDEST event, preserving freshness. Add an optional Redis
  LIST persistence layer (LPUSH/LTRIM/RPOPLPUSH/LREM with startup
  recovery from rc:mod:processing:* lists) so events survive container
  restarts. Falls back to memory-only when Redis is unreachable.
- Add a moderation_incidents batcher that aggregates rows for a
  configurable interval (default 500ms) or batch size (default 100)
  and uses CreateInBatches. PG write latency is now decoupled from
  OpenAI worker throughput. Synchronous fallback on submit-channel
  saturation guarantees no dropped audit rows under sustained load.
- Tune defaults for the target sizing scenario captured in DEV_GUIDE
  §14: WorkerCount 16, HTTPTimeoutMS 3000, EventQueueSize 32768.
  Beefier http.Transport keeps OpenAI keep-alives healthy.
- New /api/risk/moderation/queue_stats endpoint plus a moderation tab
  status card that polls every 15 seconds — admins can watch queue
  depth, per-worker idle/processing state, drop count, and the
  incident batcher backlog without leaving the page.
- DEV_GUIDE §14 records the choice to ship Redis LIST instead of
  asynq, the capacity math, and the migration trigger if throughput
  ever outgrows the simple queue.
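
The drop-oldest ring-buffer semantics mentioned in the changes above can be shown with a small sketch. The `ring` type is hypothetical, assumes a capacity greater than zero, and is not goroutine-safe; a real queue would add locking or use a channel in front of it.

```go
package main

import "fmt"

// ring is a hypothetical fixed-capacity queue that drops the OLDEST
// event when full, so the newest events are always retained.
type ring struct {
	buf     []string
	head    int // index of the oldest element
	size    int
	dropped int // number of events evicted because the ring was full
}

// newRing assumes capacity > 0.
func newRing(capacity int) *ring {
	return &ring{buf: make([]string, capacity)}
}

// push appends an event, evicting the oldest one if the ring is full.
func (r *ring) push(ev string) {
	if r.size == len(r.buf) {
		r.head = (r.head + 1) % len(r.buf)
		r.size--
		r.dropped++
	}
	r.buf[(r.head+r.size)%len(r.buf)] = ev
	r.size++
}

// pop removes and returns the oldest remaining event.
func (r *ring) pop() (string, bool) {
	if r.size == 0 {
		return "", false
	}
	ev := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	return ev, true
}

func main() {
	q := newRing(3)
	for _, ev := range []string{"a", "b", "c", "d"} {
		q.push(ev)
	}
	for {
		ev, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println(ev) // b, c, d ("a" was dropped)
	}
	fmt.Println("dropped:", q.dropped) // dropped: 1
}
```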
…ts still moderate

Production captured the symptom: usage logs were being written for every
relay 200 response but moderation_incidents stayed empty no matter how
many requests went through. SSH into micu-us-1 confirmed two consecutive
claude-opus-4-7 /v1/messages successes followed by zero new rows in
moderation_incidents while the corresponding consume-log rows landed
normally.

Root cause: when ShouldCheckPromptSensitive() and CountToken are both
false (the production default), controller/relay.go takes the
fastTokenCountMetaForPricing fast path. That helper only fills MaxTokens
for ClaudeRequest / OpenAIResponsesRequest / GeneralOpenAIRequest and
deliberately leaves CombineText + Files empty to avoid the strings.Join
allocation hot path. The moderation hook then saw text=="" and
len(images)==0 and bailed out before ever enqueuing the event — usage
logs are an independent code path so they continued to land.

Two changes restore moderation:

- EnqueueModerationFromRelay now extracts text/images via a small helper
  that returns ("", nil) for nil/empty meta, then defers the real check
  until inside the gopool callback. If the initial extraction is empty
  we lazily call info.Request.GetTokenCountMeta() to build the full meta
  on-demand. The strings.Join cost is paid only when moderation is
  actually configured for the request's group, and only after the relay
  client has already received its response.
- New regression test TestExtractModerationPayloadHandlesNilAndEmpty
  pins the helper's nil/empty contract so a future refactor cannot
  re-introduce the silent-drop behaviour.
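
The nil/empty contract and the lazy fallback described above can be sketched as below. The `tokenCountMeta` struct is a simplified stand-in with only the two fields named in the commit, and `payloadWithFallback` is a hypothetical name for the "extract, then lazily build the full meta" step; the real types in the repository may differ.

```go
package main

import "fmt"

// tokenCountMeta is a simplified, hypothetical stand-in that keeps only
// the two fields the commit message mentions.
type tokenCountMeta struct {
	CombineText string
	Files       []string
}

// extractModerationPayload returns ("", nil) for a nil or empty meta,
// which is the contract the regression test is said to pin.
func extractModerationPayload(meta *tokenCountMeta) (string, []string) {
	if meta == nil || (meta.CombineText == "" && len(meta.Files) == 0) {
		return "", nil
	}
	return meta.CombineText, meta.Files
}

// payloadWithFallback tries the fast-path meta first and only calls
// buildFull (the expensive path) when the fast path yielded nothing.
func payloadWithFallback(fast *tokenCountMeta, buildFull func() *tokenCountMeta) (string, []string) {
	text, images := extractModerationPayload(fast)
	if text == "" && len(images) == 0 && buildFull != nil {
		text, images = extractModerationPayload(buildFull())
	}
	return text, images
}

func main() {
	calls := 0
	buildFull := func() *tokenCountMeta {
		calls++
		return &tokenCountMeta{CombineText: "hello"}
	}
	text, _ := payloadWithFallback(&tokenCountMeta{}, buildFull) // fast path is empty
	fmt.Println(text, calls)                                     // hello 1
	text, _ = payloadWithFallback(&tokenCountMeta{CombineText: "fast"}, buildFull)
	fmt.Println(text, calls) // fast 1
}
```

The expensive builder runs only when the fast path produced nothing, which matches the stated goal of paying the strings.Join cost only when moderation actually needs the text.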
… complete

Production deployment of the lazy-meta fix correctly enqueued events
and the worker pool actually processed them — the queue stats card
showed workers cycling between idle and processing — but the
moderation_incidents table stayed empty because the v3 design
short-circuited persistence whenever the rule decision was "allow".
With six successful relay 200 responses in the last hour and zero
incident rows admins had no way to distinguish "moderation ran and
nothing matched" from "moderation is silently broken".

Change recordResult to persist every successfully scored event,
regardless of whether a rule fired. The flagged column distinguishes
the two cases (true == rule hit, false == benign), and the existing
two-tier retention (BenignRetentionHours defaults to 72h, kept short
on purpose) prevents the table from growing unbounded. Failures
(result.Error != "") still skip persistence — those rows would only
record "OpenAI couldn't be reached", which is more useful as a SysLog
line than a database row.

A unit test pins the flagged-vs-decision mapping so a future refactor
cannot reintroduce the silent-drop behaviour.
CombineText aggregates system prompts, all conversation history, tool
definitions and role labels for token counting — sending all of that to
the moderation API pollutes the signal. Switch relay path to
extractLastMessagePayload which type-asserts the request and pulls text
and images from messages[-1] only.
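
A minimal sketch of the "type-assert the request, then read only the last message" idea follows. The `message` and `chatRequest` types and the `lastMessagePayload` name are hypothetical simplifications; the real helper handles the project's own request types.

```go
package main

import "fmt"

// message and chatRequest are hypothetical, flattened request types.
type message struct {
	Role   string
	Text   string
	Images []string
}

type chatRequest struct {
	Messages []message
}

// lastMessagePayload type-asserts the request and returns the text and
// images of the final message only; unknown request types and empty
// conversations yield an empty payload.
func lastMessagePayload(req interface{}) (string, []string) {
	chat, ok := req.(*chatRequest)
	if !ok || chat == nil || len(chat.Messages) == 0 {
		return "", nil
	}
	last := chat.Messages[len(chat.Messages)-1]
	return last.Text, last.Images
}

func main() {
	req := &chatRequest{Messages: []message{
		{Role: "system", Text: "You are a helpful assistant."},
		{Role: "user", Text: "first question"},
		{Role: "assistant", Text: "first answer"},
		{Role: "user", Text: "latest question", Images: []string{"img-1"}},
	}}
	text, images := lastMessagePayload(req)
	fmt.Println(text, images) // latest question [img-1]
}
```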
…, detail modal

- New setting record_unmatched_inputs (default false): when off, only
  flagged incidents are persisted, reducing DB pressure significantly.
- Flagged incidents now store the full input text without truncation;
  list API truncates to 200 chars in Go for transport efficiency.
- New GET /api/risk/moderation/incidents/:id returns the full record
  for the detail modal.
- Input summary column: tooltip on header explains protocol tags are
  normal; click opens a modal with complete content and metadata.
- Global config card gains a Switch for the new toggle.
The previous stream billing fix only checked `usage == nil`, which never
triggers for mainstream stream handlers (OaiStreamHandler /
ClaudeStreamHandler always return non-nil usage). This left users charged
for incomplete or empty output when streams failed mid-way.

Two-layer fix:
- Layer 1 (billing): calculateTextQuotaSummary forces zero tokens on any
  server-side stream error (timeout, scanner error, panic, ping failure),
  regardless of whether usage was reported. client_gone is excluded since
  the user initiated the disconnect.
- Layer 2 (retry): StreamAbortRetryError returns a 503 when the stream
  failed before any data was sent to the client, enabling the retry loop
  to try another channel transparently.

The check is inserted in all three major helpers (TextHelper,
ClaudeHelper, GeminiHelper) including their chatCompletionsViaResponses
code paths.
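
The two layers can be summarized as two small predicates over how the stream ended. This is an illustrative sketch of the decision logic only; the `streamEnd` values and function names are assumptions and do not reproduce calculateTextQuotaSummary or StreamAbortRetryError.

```go
package main

import "fmt"

// streamEnd is a hypothetical enumeration of how a stream terminated.
type streamEnd string

const (
	endOK           streamEnd = "ok"
	endTimeout      streamEnd = "timeout"
	endScannerError streamEnd = "scanner_error"
	endPanic        streamEnd = "panic"
	endPingFail     streamEnd = "ping_fail"
	endClientGone   streamEnd = "client_gone"
)

// isServerSideError is true for failures the user did not cause.
// client_gone is excluded because the user initiated the disconnect.
func isServerSideError(e streamEnd) bool {
	switch e {
	case endTimeout, endScannerError, endPanic, endPingFail:
		return true
	}
	return false
}

// billableTokens sketches layer 1: force zero tokens on any server-side
// stream error, regardless of what usage was reported.
func billableTokens(e streamEnd, reported int) int {
	if isServerSideError(e) {
		return 0
	}
	return reported
}

// shouldRetry sketches layer 2: retry on another channel only when the
// stream failed server-side before any data reached the client.
func shouldRetry(e streamEnd, chunksSent int) bool {
	return isServerSideError(e) && chunksSent == 0
}

func main() {
	fmt.Println(billableTokens(endTimeout, 120), shouldRetry(endTimeout, 0))       // 0 true
	fmt.Println(billableTokens(endTimeout, 120), shouldRetry(endTimeout, 5))       // 0 false
	fmt.Println(billableTokens(endClientGone, 120), shouldRetry(endClientGone, 0)) // 120 false
}
```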
Backend:
- Redis pipeline counters (HINCRBY) on log write path for real-time metric collection
- Background aggregation loop (master-node only) reads Redis buckets and writes DB snapshots
- 3 DB tables: ChannelMonitoringStat, GroupMonitoringStat, MonitoringHistory
- DB fallback aggregation when Redis unavailable (simplified LOG_DB query)
- 7 API endpoints: admin CRUD + public read-only, rate-limited refresh trigger
- Hook mechanism (common.GroupMonitoringHook) avoids model→service circular dependency

Frontend:
- GroupMonitoringDashboard with responsive card grid and 60s auto-refresh
- GroupStatusCard with availability/cache progress bars and mini VChart history
- GroupDetailPanel SideSheet with full chart and admin channel detail table
- Settings page for monitoring groups, periods, exclude rules
- Sidebar and route integration

Tests:
- ParseMonitoringKey/ParseBucketValues table-driven tests
- IsGroupMonitored cache correctness tests
- RecordMonitoringMetric auto/empty group skip tests
- TriggerAggregationRefresh CAS guard test
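
The hook mechanism that avoids the model→service dependency can be sketched in a single file. In the real layout the variable would live in a leaf package (the commit names common.GroupMonitoringHook); the two-argument signature, `recordLog`, and `installMonitoring` are assumptions made for illustration.

```go
package main

import "fmt"

// groupMonitoringHook stays nil until an upper layer installs a handler,
// so the lower layer never has to import the upper one.
var groupMonitoringHook func(group string, success bool)

// recordLog stands in for the model-layer log writers: it calls the hook
// only when one is installed, so an unset hook costs a single nil check.
func recordLog(group string, success bool) {
	if groupMonitoringHook != nil {
		groupMonitoringHook(group, success)
	}
}

type groupStats struct{ total, errors int }

// installMonitoring plays the role of the service layer at startup. It
// skips the empty and "auto" groups, as the tests listed above describe.
func installMonitoring(stats map[string]*groupStats) {
	groupMonitoringHook = func(group string, success bool) {
		if group == "" || group == "auto" {
			return
		}
		s := stats[group]
		if s == nil {
			s = &groupStats{}
			stats[group] = s
		}
		s.total++
		if !success {
			s.errors++
		}
	}
}

func main() {
	recordLog("vip", true) // hook not installed yet: silently ignored
	stats := map[string]*groupStats{}
	installMonitoring(stats)
	recordLog("vip", true)
	recordLog("vip", false)
	recordLog("auto", true)
	fmt.Println(stats["vip"].total, stats["vip"].errors, len(stats)) // 2 1 1
}
```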
…group monitoring

Frontend used `group_monitoring.*` prefix while backend registered as
`group_monitoring_setting.*`, causing settings to never reach the Go
config struct. Also renamed `groups` to `monitoring_groups` and added
`group_display_order` sync on save.
…ader nav

- parseArrayField: coerce all elements to strings, filter out numeric
  indices and invalid values; short-circuit on "[]"/"null"
- selectedGroups: filter against availableGroups to drop stale entries
- HeaderNavModules: add `monitoring` toggle (default true, backward
  compatible); add monitoring link to header navigation bar
- useHeaderBar/useNavigation: handle missing `monitoring` field in old
  configs
… filters

Backend:
- Expand GroupMonitoringHook signature to carry modelName, statusCode,
  content — enables filtering at recording time
- RecordMonitoringMetric now checks AvailabilityExcludeModels,
  AvailabilityExcludeKeywords, AvailabilityExcludeStatusCodes, and
  CacheHitExcludeModels before incrementing Redis counters
- Excluded errors skip t/s/e counters; excluded cache models skip ct/pt
- Add AvailabilityExcludeStatusCodes []int field to config struct
- Pass statusCode from other["status_code"] in RecordErrorLog

Frontend:
- Add "可用率排除状态码" (availability exclude status codes) TagInput to group monitoring settings card
- Add i18n keys for the new field
…ings

When monitoring_groups stored numeric indices (0, 1, 2) instead of group
names, map them to the corresponding entries from /api/group/ and
auto-correct the state so the resolved names get persisted on save.
The /api/group/ endpoint returns a flat array of group names, not an
object. Object.keys on an array returns numeric indices ["0","1","2"]
instead of the actual element values, which was the root cause of
monitoring groups displaying indices instead of group names.
Calderic added a commit that referenced this pull request Apr 27, 2026
When a stream ends abnormally due to server-side causes (timeout, scanner
error, ping fail, panic) before any chunk is delivered to the client,
return a 503 so the relay loop can transparently retry through another
channel. Client-initiated disconnects are excluded — the user chose to
stop, no retry needed.

Cherry-picked from PR #1, scoped to retry behavior only. The PR's
text_quota.go change is intentionally NOT taken; we keep the upstream
billing semantics (zero-charge only when usage is nil and no chunks
sent), which trusts upstream-provided usage data even on partial
streams.

- relay/common/stream_status.go: add IsServerSideError() helper
- service/stream_abort.go: 503 retry shim used by handlers
- relay/{claude,compatible,gemini}_handler.go: 4-line hooks at
  post-DoResponse points
Calderic added a commit that referenced this pull request Apr 27, 2026
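Merges two tightly coupled features from PR #1:

1. Risk control isolated by group
   All risk metrics, decisions, and snapshots are stored along the
   (scope, subject, group) dimension, so the same user/token has
   independent risk state in different groups (for example vip / free).
   Adds a one-time migration: before AutoMigrate creates the new
   three-column unique index v2, DROP the old two-column unique index
   on risk_subject_snapshot. All three databases (SQLite/MySQL/PG) are
   handled idempotently.

2. Unified hit enforcement layer (enforcement)
   Decouples email rate limiting (independent hit / ban buckets), audit
   writes to enforcement_incident, and threshold-based auto-ban; the
   moderation engine will also reuse this layer later.

Performance optimization relative to the PR: the read-modify-write
counter update becomes a FOR UPDATE row lock inside a single transaction
(IncrementEnforcementHit), so concurrent hits on the same user do not
lose increments. SQLite serializes writes by default; MySQL/PG use row locks.

Incidental changes:
- model/user.go: risk_warning_pending_at + 9 enforcement counter fields
- controller/user.go: GetSelf exposes risk_warning_pending (boolean only);
  adds POST /api/user/risk_warning/ack so users can dismiss the login modal
- relay/common/relay_info.go: adds a RiskGroup snapshot field so that the
  defer still accounts to the correct group on cross-group retry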
Calderic added a commit that referenced this pull request Apr 27, 2026
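Brings in the asynchronous content moderation engine from PR #1,
simplified for a scale of about 1000 RPM:

Simplifications
- Remove the Redis-persisted queue: 1000 RPM × 100% sampling ≈ 17 RPS,
  so an in-memory channel + ring buffer is enough; losing unprocessed
  events on restart is acceptable
- Remove the batcher: at most single-digit INSERTs per second, so
  synchronous direct writes are simpler
- Remove the stopCh / stopOnce dead code: it was never read, and the
  goroutines can simply exit with the process
- WorkerCount lowered from 16 to 8, EventQueueSize from 32768 to 4096

Kept and wired in
- OpenAI omni-moderation multi-key rotation + cooldown
- Rule engine (AND/OR over OpenAI categories) + default rule seeds
- Asynchronous sampling, debug dry runs, retention cleanup (bucketed by
  flagged/benign)
- Automatic EnforcementHit call after a hit, reaching the unified
  enforcement layer
- Asynchronous hook on the relay path: failed requests
  (SendResponseCount=0) are not counted

Files touched
- service/moderation_center.go: simplified core engine
- service/moderation_keyring.go: API key rotation + cooldown
- service/moderation_rules.go: rule engine
- model/moderation_{incident,rule}.go: audit + rule models
- controller/moderation.go: admin CRUD + overview + debug
- setting/operation_setting/moderation_setting.go: configuration
- types/moderation.go: shared types
- controller/relay.go: attaches the asynchronous scoring hook inside the defer
- main.go: startup wiring
- model/main.go: AutoMigrate registration
- router/api-router.go: admin routes (already behind AdminAuth)

Not brought in
- service/moderation_redis_queue.go (159 lines in the PR)
- service/moderation_incident_batcher.go (160 lines in the PR)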
Calderic added a commit that referenced this pull request Apr 27, 2026
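Brings in group monitoring from PR #1: real-time per-group aggregation
of request volume, token usage, first-byte latency, and status code
distribution based on Redis counters, with a DB fallback path for when
Redis is unavailable.

Files touched
- common/monitoring_hook.go: global hook function pointer, so model does
  not depend back on service
- model/log.go: calls the monitoring hook at the end of RecordErrorLog and
  RecordConsumeLog (zero overhead when the hook is nil)
- model/group_monitoring.go: the three monitoring tables
  ChannelMonitoringStat / GroupMonitoringStat / MonitoringHistory
- service/group_monitoring{,_metric}.go: main aggregation loop + Redis/DB
  dual-path implementation + history maintenance
- setting/operation_setting/group_monitoring_setting.go: configuration
  (sampling, excluded status codes, availability window, etc.)
- controller/group_monitoring.go: two API sets, admin and public
- router/api-router.go: /monitoring/admin (AdminAuth) +
  /monitoring/public (TryUserAuth)
- main.go: wires in StartGroupMonitoringAggregation
- model/main.go: AutoMigrate registration for the three monitoring tables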
Calderic added a commit that referenced this pull request Apr 27, 2026
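Merges all frontend changes from PR #1:

New pages and components
- web/src/pages/Risk/index.jsx: risk control center enhancements (multiple
  tabs: risk control / moderation / enforcement / subscription), rule
  editing, incident detail modal
- web/src/pages/GroupMonitoring/index.jsx: group monitoring page entry
- web/src/components/monitoring/*: group monitoring dashboard, cards,
  availability line chart, history trend chart
- web/src/components/common/modals/AccountRiskWarningModal.jsx:
  risk warning modal for logged-in users (shows only a vague notice and
  does not expose rules)
- web/src/components/settings/GroupMonitoringSetting.jsx +
  pages/Setting/Operation/SettingsGroupMonitoring.jsx: monitoring settings

Modified existing files
- web/src/App.jsx: routes + mounting of the risk warning modal
- web/src/components/dashboard/index.jsx: monitoring entry point
- web/src/components/layout/SiderBar.jsx: monitoring navigation item
- web/src/components/settings/OperationSetting.jsx: adds the monitoring tab
- web/src/helpers/render.jsx: status display helpers
- web/src/hooks/common/{useHeaderBar,useNavigation}.js: monitoring entry
  in the header/sidebar
- web/src/hooks/dashboard/useDashboardData.js: fetches the monitoring overview
- web/src/pages/Setting/Operation/SettingsHeaderNavModules.jsx +
  SettingsSidebarModulesAdmin.jsx: module toggles
- web/src/i18n/locales/{en,zh-CN,zh-TW,fr,ja,ru,vi}.json: new entries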
GROUPS is a reserved keyword in MySQL 8.0+, causing Error 1064 in
CountEnabledRiskRulesWithoutGroups. Use commonGroupsCol variable
(backtick-quoted for MySQL/SQLite, double-quoted for PostgreSQL).
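
A dialect-aware quoting helper in the spirit of the commonGroupsCol variable might look like the sketch below. The `quoteColumn` function and the dialect strings are assumptions for illustration; the commit only states which quote character each database gets.

```go
package main

import "fmt"

// quoteColumn is a hypothetical helper: PostgreSQL quotes identifiers
// with double quotes, while this sketch uses backticks for MySQL and
// SQLite, as the commit message describes.
func quoteColumn(dialect, name string) string {
	if dialect == "postgres" {
		return `"` + name + `"`
	}
	return "`" + name + "`"
}

func main() {
	for _, d := range []string{"mysql", "sqlite", "postgres"} {
		// Builds a WHERE fragment that stays valid even though GROUPS
		// is a reserved word in MySQL 8.0+.
		fmt.Printf("%-8s WHERE %s = ''\n", d, quoteColumn(d, "groups"))
	}
}
```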
Use GORM's map-based Where/Or conditions so the ORM handles column
quoting automatically, eliminating reserved-word issues across all
database backends.
The backend returns history records with recorded_at (unix seconds),
but the chart read a non-existent timestamp field, producing NaN and
rendering an empty chart. Also fix aggregation_interval_minutes being
read from the wrong response level in GroupDetailPanel.
…implify drawer

- GroupStatusCard: guard null availRate/cacheRate (show N/A), fix is_online for admin format
- MiniHistoryChart: use recorded_at (unix seconds) instead of non-existent timestamp field
- GroupMonitoringDashboard: fix history response parsing level, admin-only drawer and card click
- GroupDetailPanel: remove history chart (now in card), keep only channel details for admin
…across polls

- alignAndFillHistory: skip availability_rate/cache_hit_rate when < 0 (backend
  returns -1 for no-data), preventing chart y-axis from stretching to -1
- Dashboard poll (fetchGroups without history): use functional state update to
  preserve existing history data instead of overwriting with empty group stats
…ards

MiniHistoryChart never rendered because it lacked initVChartSemiTheme
initialization and used invalid width:'auto' in the VChart spec.
Instead of patching it, reuse the proven AvailabilityCacheChart with a
new compact prop (120px, no legends, no y-axis labels, smaller fonts).
Claude's input_tokens EXCLUDES cache_read_input_tokens, while OpenAI's
prompt_tokens INCLUDES cached_tokens. The monitoring aggregation formula
ct/pt*100 assumed pt always includes ct, producing ~8000% cache hit rates
for Claude channels.

Fix: at recording time, detect usage_semantic=anthropic and add cache
tokens to prompt tokens before HINCRBY, so pt in Redis always means
total prompt including cache. Remove the CacheTokensSeparateGroups
branching in aggregation since the data is now normalized at source.
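
The arithmetic behind the ~8000% figure, and the normalization at recording time, can be worked through with a small sketch. The function names and the sample token counts are assumptions; only the ct/pt*100 formula, the usage_semantic=anthropic check, and the -1 "no data" value come from the commit messages.

```go
package main

import "fmt"

// normalizedPromptTokens makes pt always mean "total prompt tokens
// including cache". Anthropic-style usage reports input tokens that
// exclude cache reads, so the cache tokens are added back in.
func normalizedPromptTokens(usageSemantic string, promptTokens, cacheTokens int64) int64 {
	if usageSemantic == "anthropic" {
		return promptTokens + cacheTokens
	}
	return promptTokens
}

// cacheHitRate applies ct/pt*100 and returns -1 when there is no data.
func cacheHitRate(ct, pt int64) float64 {
	if pt <= 0 {
		return -1
	}
	return float64(ct) / float64(pt) * 100
}

func main() {
	// Hypothetical Claude request: 100 uncached input tokens, 8000 cache reads.
	var input, cacheRead int64 = 100, 8000

	// Before the fix: pt excludes the cache tokens, so the rate explodes.
	fmt.Printf("%.1f\n", cacheHitRate(cacheRead, input)) // 8000.0

	// After the fix: pt is normalized before it is recorded.
	pt := normalizedPromptTokens("anthropic", input, cacheRead)
	fmt.Printf("%.1f\n", cacheHitRate(cacheRead, pt)) // 98.8
}
```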
@Calderic

Owner

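The following adjustments were made during integration:

  • Kept the upstream fix for abnormal-stream billing that was already merged; the PR's version was not adopted
  • enforcement counters now use FOR UPDATE in a single transaction, fixing the TOCTOU in the original PR
  • moderation: removed the batcher (direct writes are enough at 17~50 RPS)
  • moderation: Redis queue rewritten as a per-instance WAL (fixes two bugs in the original PR: cross-instance key collisions and dead letters during recovery)
  • .github/workflows/dev.yml was not adopted (it does not fit this repository's CI strategy)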

@Calderic Calderic closed this Apr 28, 2026