
MUL-2488 feat(timezone): Scheduling / Viewing two-layer timezone architecture#2968

Merged
Bohan-J merged 9 commits into multica-ai:main from yyclaw:feat/timezone-architecture on May 21, 2026

yyclaw (Contributor) commented May 21, 2026

What does this PR do?

Collapses timezone into two independent product concepts and refactors the data layer accordingly.

Today the timezone concept is overloaded: the same fields carry several meanings, which causes two problems:

1. #2822 was a correct fix, but it did not address the root cause. #2822 found that the workspace usage page's picker drove the weekly boundary in the user's tz while the backend rollup materialised in UTC, so rows crossing UTC midnight fell into the wrong calendar week. The fix locked the weekly boundary to UTC and removed the picker. That eliminated the frontend/backend tz mismatch, but the user need ("view reports in my own tz") remains. The root cause is in the data layer: the rollup materialises in a fixed tz and cannot slice day boundaries at read time by the caller's tz.

2. Operational tz does not belong on runtime. Operational tz is a property of the physical machine: multiple runtimes on one machine share a single OS clock, so their operational tz is necessarily identical. Putting tz on the agent_runtime row copies a machine-level fact onto every runtime row on that machine, which is inherently redundant and permits an illegal state ("two runtimes on the same machine with inconsistent tz"). An audit showed that every reader of the column wants reporting semantics, not operational semantics, so the column is dropped entirely.

Solution: collapse timezone into Scheduling (autopilot_trigger.timezone, unchanged) and Viewing (new user.timezone field). The data layer merges task_usage_daily + task_usage_dashboard_daily into a UTC hourly-grain task_usage_hourly; report queries slice day boundaries at read time by the caller's @tz. Full design in docs/timezone-architecture-rfc.md.

Related Issue

Closes #2967

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Refactor / code improvement (no behavior change)

Changes Made

  • DB: migrations 100–104 add "user".timezone (Viewing tz, nullable), create task_usage_hourly and its rollup pipeline, drop the legacy task_usage_daily / task_usage_dashboard_daily pipelines, and drop the agent_runtime.timezone column. server/pkg/db/queries/*.sql and the sqlc-generated code are updated.
  • Backfill: server/cmd/backfill_task_usage_hourly replaces the two legacy backfill commands.
  • Server handlers: resolveViewingTZ (?tz → user.timezone → UTC) resolves the viewer's tz and passes it to the hourly-rollup queries. The UseDailyRollup* feature flags, the old dual query paths, and the /api/usage endpoints are removed. The daemon no longer reports host tz, and PATCH /api/runtimes/:id no longer accepts timezone.
  • Core: the packages/core API client and dashboard/runtime queries send ?tz with each request; the user type gains timezone; the runtime timezone field and mutation are removed.
  • Views: add the useViewingTimezone hook and a Timezone setting in Preferences; report charts and the dashboard week boundary follow the viewer tz; remove the runtime detail timezone editor and its locale strings.
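
The fallback chain in the server-handler item (?tz, then user.timezone, then UTC) could look roughly like the Go sketch below. The signature is an assumption: the real resolveViewingTZ presumably takes the request and a user record. Skipping invalid names instead of returning an error is also an assumption of this sketch, not something the PR states.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works without system tzdata
)

// resolveViewingTZ picks the viewer's timezone in priority order:
// the ?tz query parameter, then the stored user.timezone (nullable),
// then UTC. Hypothetical signature for illustration only.
func resolveViewingTZ(queryTZ string, userTZ *string) *time.Location {
	candidates := []string{queryTZ}
	if userTZ != nil {
		candidates = append(candidates, *userTZ)
	}
	for _, name := range candidates {
		// LoadLocation("") yields UTC and LoadLocation("Local") yields the
		// server's own zone; neither is a viewer preference, so skip both.
		if name == "" || name == "Local" {
			continue
		}
		if loc, err := time.LoadLocation(name); err == nil {
			return loc
		}
	}
	return time.UTC
}

func main() {
	stored := "Europe/Berlin"
	fmt.Println(resolveViewingTZ("Asia/Tokyo", &stored)) // query param wins
	fmt.Println(resolveViewingTZ("", &stored))           // falls back to user.timezone
	fmt.Println(resolveViewingTZ("Not/AZone", nil))      // falls back to UTC
}
```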

How to Test

  1. Run migrations 100–104 (make migrate-up); confirm task_usage_hourly is created and agent_runtime.timezone is dropped.
  2. make test and pnpm test pass.
  3. Set a Timezone in Settings → Preferences; confirm the "today" label and day boundaries of the dashboard / runtime detail reports follow it, and the runtime detail page no longer has a timezone control.

Checklist

  • I have included a thinking path that traces from project context to this change
  • I have run tests locally and they pass
  • I have added or updated tests where applicable
  • If this change affects the UI, I have included before/after screenshots
  • I have updated relevant documentation to reflect my changes
  • I have considered and documented any risks above

AI Disclosure

AI tool used: Claude Code

Prompt / approach: Implemented against the design in docs/timezone-architecture-rfc.md, and used Claude Code to split the work into commits by category and draft the issue and this PR.

Risks

  • The invalidation queue TTL is mandatory; omitting it lets dirty rows grow unbounded under heavy load.
  • Hourly backfill puts read pressure on the source table; coordinate with the DB team in advance and run it per-workspace during off-peak hours.
  • DST 23h/25h "days": DATE(bucket_hour AT TIME ZONE @tz) handles it correctly, but any front-end "a day = 24h" hardcoded offset must be tested at DST boundaries.

vercel Bot commented May 21, 2026

@yyclaw is attempting to deploy a commit to the IndexLabs Team on Vercel.

A member of the Team first needs to authorize it.

yyclaw changed the title from "feat(timezone): Scheduling / Viewing 两层 timezone 架构重构" (Chinese for "two-layer timezone architecture refactor") to "feat(timezone): Scheduling / Viewing two-layer timezone architecture" on May 21, 2026

yyclaw (Contributor, Author) commented May 21, 2026

@forrestchang Please review this PR. I think it fundamentally resolves the issue you raised in #2822.

RFC: https://github.com/multica-ai/multica/pull/2968/changes#diff-75ed0324f90dea51c317aee4e9fcf434fcd6dd610c3311406620b8d0156a119b

yyclaw force-pushed the feat/timezone-architecture branch from b261cb8 to 354ed4b on May 21, 2026 04:30
Bohan-J changed the title from "feat(timezone): Scheduling / Viewing two-layer timezone architecture" to "MUL-2488 feat(timezone): Scheduling / Viewing two-layer timezone architecture" on May 21, 2026

Bohan-J (Collaborator) left a comment

After aligning with the product owner on Multica MUL-2488, I'm promoting the relatively more important items from the earlier nit list to blocking. The rest (KPI tile task_count over-count, resolveViewingTZ cold-path GetUser caching, rebase, etc.) stay as post-merge followups and do not block this PR.

1. The self-host upgrade order needs a hint that the operator can actually see

100–104 are a single migration group. If a self-host operator just runs make migrate-up:

  1. 101 creates task_usage_hourly;
  2. 102 installs the trigger + state (cron is not yet registered);
  3. 103 immediately drops the task_usage_daily / task_usage_dashboard_daily pipelines;
  4. The backfill has not run yet, so the hourly table only contains buckets the triggers have written since they were installed;
  5. Until backfill + cron catch up, dashboards appear empty. With years of history and a per-tick cap of 1 day, catching up can take tens to hundreds of ticks.

The rollout order is written in the SQL comments of server/migrations/102_task_usage_hourly_pipeline.up.sql:507-519 — entirely invisible to whoever runs migrate-up. The package comment at the top of server/cmd/backfill_task_usage_hourly/main.go only says "run before registering pg_cron"; it does not mention the relationship with 103/104.

Pick any of these (not mutually exclusive):

  • Preferred: add a prominent runbook to the package comment at the top of server/cmd/backfill_task_usage_hourly/main.go — "self-host upgrade order: apply 101+102 → run this backfill → apply 103+104; if you run migrate-up straight through to 103/104, dashboards will be empty until cron catches up."
  • Add a RAISE NOTICE at the top of 103_drop_legacy_daily_rollups.up.sql so the migrate-up command line at least surfaces one warning.
  • Or renumber 103/104 to a clearly later range (e.g. 110+) so the operator naturally has room to slot the backfill in.

2. Trailing blank line at packages/views/runtimes/components/runtime-detail.tsx:673

$ git diff --check origin/main...HEAD
packages/views/runtimes/components/runtime-detail.tsx:673: new blank line at EOF.

Trivial — just delete it.

3. Add an invariant comment on enqueue_task_usage_hourly_dirty_for_atq

trg_atq_dirty_hourly (server/migrations/102_task_usage_hourly_pipeline.up.sql:130-132) only watches OF runtime_id, issue_id OR DELETE, under the assumption that agent_task_queue.agent_id is immutable once a row is inserted (consistent with 084). If a future feature (reassign / quick-create rebind) makes agent_id mutable, this trigger will silently miss dirty buckets under the old agent_id.

Suggested comment above the CREATE TRIGGER:

-- INVARIANT: agent_task_queue.agent_id is immutable once a row is inserted.
-- If a future feature makes agent_id mutable (e.g. reassign / rebind), it
-- MUST be added to this trigger's `OF` column list, otherwise dirty
-- buckets for the old agent_id will not be enqueued and historical
-- aggregates will silently rot.

Very low cost; prevents a whole class of latent bug.

Followups (non-blocking, just tracking)

  • The KPI tile Tasks · Nd (SUM-over-hourly) vs leaderboard task_count discrepancy: in the future derive the KPI from DashboardRunTimeDaily to align (the SQL / frontend comments already acknowledge "acceptable for KPI" — this is polish).
  • Add per-context memoization to the resolveViewingTZ cold path so an old desktop client doesn't trigger 4 GetUser calls on a single dashboard open.
  • Rebase origin/main before merging — current merge-base is fairly old.
  • GetRuntimeTaskHourlyActivity (pre-existing, not introduced by this PR) has no started_at time-window filter; for long-lived runtimes, the detail-page heatmap scans the entire agent_task_queue. Add a 90d window in a follow-up.

yyclaw force-pushed the feat/timezone-architecture branch from 354ed4b to aa8af08 on May 21, 2026 07:23

yyclaw (Contributor, Author) commented May 21, 2026

@Bohan-J Addressed all three blocking items as suggested, and rebased onto the latest main. The non-blocking followups are left for post-merge tracking. Ready for another look.

yyclaw added 9 commits May 21, 2026 15:26
…zone

Migrations 100-104: add "user".timezone (Viewing tz), build the UTC
hourly task_usage_hourly rollup with its pipeline, drop the legacy
task_usage_daily / task_usage_dashboard_daily pipelines, and drop the
agent_runtime.timezone column. Report queries now slice day boundaries
at read time by the caller-supplied @tz instead of materialising in a
fixed tz. Regenerate sqlc.

Replace the two legacy backfill commands (daily / dashboard_daily) with
a single backfill_task_usage_hourly that loads historical task_usage
into the new UTC hourly rollup, sliced per workspace.

Report handlers resolve the Viewing tz per request (?tz query param,
then user.timezone, then UTC) and pass it to the hourly-rollup queries.
Drop the UseDailyRollup feature flags and the old raw-scan/daily-rollup
dual paths, remove the /api/usage endpoints, and stop the daemon from
reporting and the runtime handler from accepting host timezone.

API client and dashboard/runtime queries send ?tz with each report
request, the user schema/types carry the new timezone field, and the
runtime timezone field/mutation is removed.

Add the useViewingTimezone hook and a Timezone setting in Preferences;
report charts and the dashboard week boundary follow the viewer tz.
Remove the runtime detail timezone editor and its locale strings.

The timezone architecture refactor changed several types without
updating dependent test code:

- RuntimeDevice no longer has a timezone field — drop it from the
  create-agent-dialog runtime fixture.
- User now requires a timezone field — add it to the apps/web mockUser
  fixture.
- The PreferencesTab timezone tests asserted on the async save handler
  (PATCH then store update) with a bare expect, racing the mutation's
  settle callback, and timed out querying the Select's ~600-option IANA
  list on a loaded CI runner. Wrap the assertions in waitFor and extend
  the timeout for those three tests.

Add a SELF-HOST UPGRADE ORDER runbook to the backfill command's package
comment: applying migrations 100-104 in a single migrate-up drops the
legacy daily rollups before the hourly backfill runs, leaving dashboards
empty until cron catches up.

Add an INVARIANT comment on trg_atq_dirty_hourly noting that agent_id
must be added to the trigger's OF list if it ever becomes mutable,
otherwise dirty buckets for the old agent_id are silently missed.
yyclaw force-pushed the feat/timezone-architecture branch from aa8af08 to d4cf87b on May 21, 2026 07:28
yyclaw requested a review from Bohan-J on May 21, 2026 07:31

Bohan-J (Collaborator) left a comment

Thanks @yyclaw — all three blocking items are addressed exactly where they need to be (self-host runbook at the top of backfill_task_usage_hourly's package comment so go doc surfaces it, the agent_id immutability invariant right next to trg_atq_dirty_hourly, and the EOF whitespace in runtime-detail.tsx). Rebase on origin/main is a nice bonus that clears one of the followups too.

Approving. Will merge once the frontend CI run finishes (backend + installers already green).

Bohan-J merged commit 614dfae into multica-ai:main on May 21, 2026
4 checks passed
Bohan-J added a commit that referenced this pull request May 21, 2026
…migrate MUL-2488 (#2998)

* fix(timezone): harden hourly-rollup rollout against straight-through migrate

MUL-2488

PR #2968 introduced the new task_usage_hourly rollup but assumed operators
would stop migrate between 102 and 103 to run the one-shot
cmd/backfill_task_usage_hourly. Two pieces made that unsafe in practice:

1. The Dockerfile only shipped server / multica / migrate, so a deployed
   container has no backfill binary to run between phases.
2. cmd/migrate has no per-version stop, and entrypoint.sh runs `migrate up`
   to the latest version, so 103 silently drops the legacy daily rollups
   even when nobody ran the backfill — leaving usage dashboards at zero
   despite source data being intact in task_usage.

Changes:

- Build cmd/backfill_task_usage_hourly into the runtime image alongside
  the other binaries so operators can `docker exec` the backfill instead
  of needing a source checkout.
- Add a fail-closed plpgsql guard at the top of migration 103 that
  aborts the migration when task_usage has rows but task_usage_hourly is
  empty. Fresh databases (no task_usage rows) are exempt because the new
  triggers from 102 will populate the hourly table on the first event.

Already-applied databases are unaffected — schema_migrations tracks by
version only, so 103 is not re-run.

Co-authored-by: multica-agent <github@multica.ai>

* fix(timezone): use watermark coverage for hourly-rollup guard

The previous check only required `task_usage_hourly` to be non-empty,
which an interrupted backfill or a manual `rollup_task_usage_hourly_window`
call both satisfy. The completion signal we actually trust is
`task_usage_hourly_rollup_state.watermark_at` — backfill only stamps it
to `now() - 5 min` after every monthly slice succeeded, and the cron
worker only advances it on a real tick. Default after migration 101 is
`1970-01-01`, so an unrun or partial backfill is trivially detected.

Also corrects the comment about fresh-install behavior: the triggers in
102 only enqueue dirty keys for agent_task_queue / issue / task_usage
DELETE — they do not write hourly rows. INSERT/UPDATE flows through the
`updated_at` watermark window of `rollup_task_usage_hourly()`, which
only runs once the operator registers it as a pg_cron job.

Co-authored-by: multica-agent <github@multica.ai>

---------

Co-authored-by: multica-agent <github@multica.ai>
Development

Successfully merging this pull request may close these issues.

[Feature]: Timezone architecture refactor — Scheduling / Viewing two-layer model

2 participants