MUL-2488 feat(timezone): Scheduling / Viewing two-layer timezone architecture#2968
Conversation
|
@yyclaw is attempting to deploy a commit to the IndexLabs Team on Vercel. A member of the Team first needs to authorize it. |
|
@forrestchang Please review this PR. I think it fundamentally resolves the issue you raised in #2822. |
b261cb8 to
354ed4b
Compare
There was a problem hiding this comment.
After aligning with the product owner on Multica MUL-2488, I'm promoting the relatively more important items from the earlier nit list to blocking. The rest (KPI tile task_count over-count, resolveViewingTZ cold-path GetUser caching, rebase, etc.) stay as post-merge followups and do not block this PR.
1. The self-host upgrade order needs a hint that the operator can actually see
100–104 are a single migration group. If a self-host operator just runs make migrate-up:
- 101 creates
task_usage_hourly; - 102 installs the trigger + state (cron is not yet registered);
- 103 immediately drops the
task_usage_daily/task_usage_dashboard_dailypipelines; - The backfill has not run yet, so the hourly table only contains buckets the triggers have written since they were installed;
- Until backfill + cron catch up, dashboards display empty. With years of history and a per-tick cap of 1 day, this can be tens to hundreds of ticks.
The rollout order is written in the SQL comments of server/migrations/102_task_usage_hourly_pipeline.up.sql:507-519 — entirely invisible to whoever runs migrate-up. The package comment at the top of server/cmd/backfill_task_usage_hourly/main.go only says "run before registering pg_cron"; it does not mention the relationship with 103/104.
Pick any of these (not mutually exclusive):
- Preferred: add a prominent runbook to the package comment at the top of
server/cmd/backfill_task_usage_hourly/main.go— "self-host upgrade order: apply 101+102 → run this backfill → apply 103+104; if you runmigrate-upstraight through to 103/104, dashboards will be empty until cron catches up." - Add a
RAISE NOTICEat the top of103_drop_legacy_daily_rollups.up.sqlso themigrate-upcommand line at least surfaces one warning. - Or renumber 103/104 to a clearly later range (e.g. 110+) so the operator naturally has room to slot the backfill in.
2. Trailing blank line at packages/views/runtimes/components/runtime-detail.tsx:673
$ git diff --check origin/main...HEAD
packages/views/runtimes/components/runtime-detail.tsx:673: new blank line at EOF.
Trivial — just delete it.
3. Add an invariant comment on enqueue_task_usage_hourly_dirty_for_atq
trg_atq_dirty_hourly (server/migrations/102_task_usage_hourly_pipeline.up.sql:130-132) only watches OF runtime_id, issue_id OR DELETE, under the assumption that agent_task_queue.agent_id is immutable once a row is inserted (consistent with 084). If a future feature (reassign / quick-create rebind) makes agent_id mutable, this trigger will silently miss dirty buckets under the old agent_id.
Suggested comment above the CREATE TRIGGER:
-- INVARIANT: agent_task_queue.agent_id is immutable once a row is inserted.
-- If a future feature makes agent_id mutable (e.g. reassign / rebind), it
-- MUST be added to this trigger's `OF` column list, otherwise dirty
-- buckets for the old agent_id will not be enqueued and historical
-- aggregates will silently rot.Very low cost; prevents a whole class of latent bug.
Followups (non-blocking, just tracking)
- The KPI tile
Tasks · Nd(SUM-over-hourly) vs leaderboardtask_countdiscrepancy: in the future derive the KPI fromDashboardRunTimeDailyto align (the SQL / frontend comments already acknowledge "acceptable for KPI" — this is polish). - Add per-context memoization to the
resolveViewingTZcold path so an old desktop client doesn't trigger 4GetUsercalls on a single dashboard open. - Rebase
origin/mainbefore merging — current merge-base is fairly old. GetRuntimeTaskHourlyActivity(pre-existing, not introduced by this PR) has nostarted_attime-window filter; for long-lived runtimes, the detail-page heatmap scans the entireagent_task_queue. Add a 90d window in a follow-up.
354ed4b to
aa8af08
Compare
|
@Bohan-J Addressed all three blocking items as suggested, and rebased onto the latest main. The non-blocking followups are left for post-merge tracking. Ready for another look. |
…zone Migrations 100-104: add "user".timezone (Viewing tz), build the UTC hourly task_usage_hourly rollup with its pipeline, drop the legacy task_usage_daily / task_usage_dashboard_daily pipelines, and drop the agent_runtime.timezone column. Report queries now slice day boundaries at read time by the caller-supplied @tz instead of materialising in a fixed tz. Regenerate sqlc.
Replace the two legacy backfill commands (daily / dashboard_daily) with a single backfill_task_usage_hourly that loads historical task_usage into the new UTC hourly rollup, sliced per workspace.
Report handlers resolve the Viewing tz per request (?tz query param, then user.timezone, then UTC) and pass it to the hourly-rollup queries. Drop the UseDailyRollup feature flags and the old raw-scan/daily-rollup dual paths, remove the /api/usage endpoints, and stop the daemon from reporting and the runtime handler from accepting host timezone.
API client and dashboard/runtime queries send ?tz with each report request, the user schema/types carry the new timezone field, and the runtime timezone field/mutation is removed.
Add the useViewingTimezone hook and a Timezone setting in Preferences; report charts and the dashboard week boundary follow the viewer tz. Remove the runtime detail timezone editor and its locale strings.
The timezone architecture refactor changed several types without updating dependent test code: - RuntimeDevice no longer has a timezone field — drop it from the create-agent-dialog runtime fixture. - User now requires a timezone field — add it to the apps/web mockUser fixture. - The PreferencesTab timezone tests asserted on the async save handler (PATCH then store update) with a bare expect, racing the mutation's settle callback, and timed out querying the Select's ~600-option IANA list on a loaded CI runner. Wrap the assertions in waitFor and extend the timeout for those three tests.
Add a SELF-HOST UPGRADE ORDER runbook to the backfill command's package comment: applying migrations 100-104 in a single migrate-up drops the legacy daily rollups before the hourly backfill runs, leaving dashboards empty until cron catches up. Add an INVARIANT comment on trg_atq_dirty_hourly noting that agent_id must be added to the trigger's OF list if it ever becomes mutable, otherwise dirty buckets for the old agent_id are silently missed.
aa8af08 to
d4cf87b
Compare
Bohan-J
left a comment
There was a problem hiding this comment.
Thanks @yyclaw — all three blocking items are addressed exactly where they need to be (self-host runbook at the top of backfill_task_usage_hourly's package comment so go doc surfaces it, the agent_id immutability invariant right next to trg_atq_dirty_hourly, and the EOF whitespace in runtime-detail.tsx). Rebase on origin/main is a nice bonus that clears one of the followups too.
Approving. Will merge once the frontend CI run finishes (backend + installers already green).
…migrate MUL-2488 (#2998) * fix(timezone): harden hourly-rollup rollout against straight-through migrate MUL-2488 PR #2968 introduced the new task_usage_hourly rollup but assumed operators would stop migrate between 102 and 103 to run the one-shot cmd/backfill_task_usage_hourly. Two pieces made that unsafe in practice: 1. The Dockerfile only shipped server / multica / migrate, so a deployed container has no backfill binary to run between phases. 2. cmd/migrate has no per-version stop, and entrypoint.sh runs `migrate up` to the latest version, so 103 silently drops the legacy daily rollups even when nobody ran the backfill — leaving usage dashboards at zero despite source data being intact in task_usage. Changes: - Build cmd/backfill_task_usage_hourly into the runtime image alongside the other binaries so operators can `docker exec` the backfill instead of needing a source checkout. - Add a fail-closed plpgsql guard at the top of migration 103 that aborts the migration when task_usage has rows but task_usage_hourly is empty. Fresh databases (no task_usage rows) are exempt because the new triggers from 102 will populate the hourly table on the first event. Already-applied databases are unaffected — schema_migrations tracks by version only, so 103 is not re-run. Co-authored-by: multica-agent <github@multica.ai> * fix(timezone): use watermark coverage for hourly-rollup guard The previous check only required `task_usage_hourly` to be non-empty, which an interrupted backfill or a manual `rollup_task_usage_hourly_window` call both satisfy. The completion signal we actually trust is `task_usage_hourly_rollup_state.watermark_at` — backfill only stamps it to `now() - 5 min` after every monthly slice succeeded, and the cron worker only advances it on a real tick. Default after migration 101 is `1970-01-01`, so an unrun or partial backfill is trivially detected. Also corrects the comment about fresh-install behavior: the triggers in 102 only enqueue dirty keys for agent_task_queue / issue / task_usage DELETE — they do not write hourly rows. INSERT/UPDATE flows through the `updated_at` watermark window of `rollup_task_usage_hourly()`, which only runs once the operator registers it as a pg_cron job. Co-authored-by: multica-agent <github@multica.ai> --------- Co-authored-by: multica-agent <github@multica.ai>
What does this PR do?
Collapses
timezoneinto two independent product concepts and refactors the data layer accordingly.timezoneis currently overloaded onto the same fields with several meanings, which surfaces two problems:1. #2822 was a correct fix, but did not address the root cause. #2822 found the workspace usage page's picker drove the weekly boundary in the user's tz while the backend rollup materialised in UTC, so rows crossing UTC midnight fell into the wrong calendar week. The fix locked weekly to UTC and removed the picker — that eliminated the front/back tz mismatch bug, but the user need ("view reports in my own tz") remains. The root cause is in the data layer: the rollup materialises on a fixed tz and cannot support slicing day boundaries at read time by the caller's tz.
2. Operational tz does not belong on
runtime. Operational tz is a property of the physical machine — multiple runtimes on one machine share a single OS clock, so their operational tz is necessarily identical. Putting tz on theagent_runtimerow copies a machine-level fact onto every runtime row on that machine: inherently redundant, and it permits an illegal state ("two runtimes on the same machine with inconsistent tz"). After auditing, every reader wants the reporting semantics, not operational semantics — so the column is dropped entirely.Solution: collapse timezone into Scheduling (
autopilot_trigger.timezone, unchanged) and Viewing (newuser.timezonefield). The data layer mergestask_usage_daily+task_usage_dashboard_dailyinto a UTC hourly-graintask_usage_hourly; report queries slice day boundaries at read time by the caller's@tz. Full design indocs/timezone-architecture-rfc.md.Related Issue
Closes #2967
Type of Change
Changes Made
"user".timezone(Viewing tz, nullable); createtask_usage_hourly+ its rollup pipeline; drop the legacytask_usage_daily/task_usage_dashboard_dailypipelines; drop theagent_runtime.timezonecolumn.server/pkg/db/queries/*.sqland sqlc-generated code updated.server/cmd/backfill_task_usage_hourlyreplaces the two legacy backfill commands.resolveViewingTZ(?tz→user.timezone→ UTC) resolves the viewer's tz and passes it to the hourly-rollup queries; remove theUseDailyRollup*feature flags and the old dual query paths, and the/api/usageendpoints; the daemon no longer reports host tz, andPATCH /api/runtimes/:idno longer acceptstimezone.packages/coreAPI client and dashboard/runtime queries send?tzwith each request; the user type gainstimezone; the runtime timezone field and mutation are removed.useViewingTimezonehook and a Timezone setting in Preferences; report charts and the dashboard week boundary follow the viewer tz; remove the runtime detail timezone editor and its locale strings.How to Test
make migrate-up); confirmtask_usage_hourlyis created andagent_runtime.timezoneis dropped.make testandpnpm testpass.Checklist
AI Disclosure
AI tool used: Claude Code
Prompt / approach: Implemented against the design in
docs/timezone-architecture-rfc.md, and used Claude Code to split the work into commits by category and draft the issue and this PR.Risks
DATE(bucket_hour AT TIME ZONE @tz)handles it correctly, but any front-end "a day = 24h" hardcoded offset must be tested at DST boundaries.