feat(metrics): usage & system-scale observability#239
Conversation
Implements plan.md (workstreams B–E + docs) end to end.
- WS-B: rate_limit_exceeded_total{endpoint} recorded in the 429 handler
(Endpoint + Recorder threaded through middleware.RateLimitConfig).
- WS-C: oauth_consent_granted_total{flow} / oauth_consent_revoked_total{actor}
via AuthorizationService (recorder injected, nil->noop default).
- WS-D: authgate_users_active{auth_source}, authgate_users_registered,
authgate_oauth_clients_registered{status}, authgate_build_info{version,commit}
reconciled through the cached gauge-update job. New store aggregates
CountActiveUsersByAuthSource / CountAllUsers; MetricsStore + CacheWrapper extended.
- WS-E: RecordSessionExpired decrements sessions_active and increments
sessions_expired_total{reason} on idle-timeout and fingerprint-mismatch,
fixing the upward drift.
- Docs: METRICS.md drift fixed (removed phantom auth_external_api_duration_seconds
/ sessions_invalidated_total + dead INTEGRATION.md link), documented existing
cache_* metrics and all new series with Grafana queries + alerts; MONITORING.md
and CLAUDE.md summaries synced.
Open-question answers folded in: authgate_ prefix, auth_source label on active
users, and the build_info gauge.
Tests: metrics unit (new recorders + cache getters), store
(CountActiveUsersByAuthSource / CountAllUsers), middleware integration
(rate-limit 429 -> metric, idle-timeout + fingerprint-mismatch drift), consent
service paths, and bootstrap-level /metrics scrape e2e (happy-path series
exposed, disabled-silent 404, bearer-token auth). mock_metrics.go / mock_store.go
regenerated via make generate. make fmt/lint/test all green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-review cleanup (no behavior change): - Reuse: replace the duplicate buildInfoVersion() in bootstrap with the existing version-default logic, now exported as version.GetVersion(). - Altitude: derive the authgate_users_active label set from the canonical auth.Provider* constants instead of hardcoded "github"/"gitea"/... strings, so the metric labels can't silently drift from the provider registry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…a live session
Code-review follow-up. The sessions_active drift fix (WS-E) covered the
idle-timeout and fingerprint-mismatch session-clear paths but missed the third
path the plan itself named: loadUserFromSession clears a *live* authenticated
cookie session when the backing user was deleted or disabled, yet never
decremented the gauge — so disabling/deleting an active user left sessions_active
drifting upward (the same bug WS-E set out to close).
- middleware/auth.go: add clearLiveSession() which clears + records
sessions_expired_total{reason="account_inactive"} and decrements
sessions_active only on a successful save (mirrors the idle/fingerprint paths).
Thread an optional recorder through loadUserFromSession / RequireAuth /
OptionalAuth (nil-safe).
- bootstrap/router.go: pass the recorder to the auth middleware at all call sites.
- Tests: TestRequireAuth_DeletedUser_* now asserts the decrement + counter.
- Docs/comments: document the new account_inactive reason.
make fmt/lint/test all green (1763 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Re-pad the session metrics table so the rows match the widened labels column
Performance follow-up from the code review. - Index the columns the population-gauge reconciliation filters on so the per-source / per-status COUNTs use index scans instead of full table scans: a composite users(is_active, auth_source) index for authgate_users_active and an oauth_applications(status) index for authgate_oauth_clients_registered. Both are added via gorm tags and created by the existing AutoMigrate. - Harden rate_limit_exceeded_total: when a limiter has no Endpoint set, fall back to the constant "unknown" instead of the raw request path, so a misconfigured limiter can't explode the metric's label cardinality. - Add a store test asserting both indexes exist (guards the perf fix). - Document the index-backed COUNTs in METRICS.md. make fmt/lint/test all green (1764 tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oups Cleanup follow-up. setupAllRoutes already constructs optionalAuth and injectPending once and reuses them; the auth-middleware threading added a fresh middleware.RequireAuth(h.userService, prometheusMetrics) at each of the 6 route groups. Build it once into a local and reuse, matching the surrounding pattern and collapsing the duplicated multi-line .Use() blocks. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly has
Code-review follow-up. The population gauge labelled authgate_users_active by
auth_source using {local, http_api, microsoft, github, gitea, gitlab}, but
User.AuthSource never holds a provider name: createUserWithOAuth stores every
OAuth user (GitHub/Gitea/GitLab/Microsoft) as "local" (the provider lives in
OAuthConnection), and external users are "http_api". So the four provider labels
were permanently-zero series and the "local" count silently conflated local +
OAuth users.
Reduce metricsAuthSources to the values auth_source can actually take
({local, http_api}); drop the now-unused internal/auth import. Per-provider
breakdown remains available via the runtime-labelled auth_login_total{auth_source}.
Docs/comments and the example labels in two tests updated to match.
make fmt/lint/test all green (1764 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts in the auth middleware where main added login-session tracking (loginSessions + allowBackfill) while this branch added a metrics recorder. Both new parameters are kept: - loadUserFromSession / OptionalAuth / RequireAuth now take (userService, recorder, loginSessions[, allowBackfill]). - router.go keeps the reusable optionalAuth/requireAuth handlers and threads both prometheusMetrics and loginMW into them. - Updated all test callers and fixed the merged authorization_test.go NewAuthorizationService call to pass the recorder arg. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 868870380b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…gative RecordOAuthCallback established a browser session (the handler sets SessionUserID right after a successful callback) but never incremented sessions_active / sessions_created_total, unlike RecordLogin. Meanwhile RecordSessionExpired decrements sessions_active on idle timeout, fingerprint mismatch, and account-inactive clears. OAuth-authenticated sessions were therefore decremented without ever being counted, driving the sessions_active gauge below the true count (or negative) in OAuth-enabled deployments. Mirror RecordLogin: increment both gauges on a successful OAuth callback. Add TestRecordOAuthCallbackCountsSession to lock in the symmetry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
authgate/internal/middleware/auth.go
Line 146 in 6af146c
When LOGIN_SESSION_TRACKING_ENABLED is on, revoking a device from /account/devices makes enforceLoginSession return !valid on that device's next request; this branch still clears the live authenticated cookie without calling the new RecordSessionExpired path. Because local/OAuth login increments sessions_active and the revoke handlers only mark the login-session row revoked, every remote sign-out continues to leave sessions_active too high despite the drift fix.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…drift The previous fix incremented sessions_active/sessions_created_total inside RecordOAuthCallback (and RecordLogin already did so), but both are called before session.Save(). If the cookie save fails (oversized/invalid store write) the handler returns 500 yet the gauges were already incremented for a session that was never established, reintroducing upward drift. Decouple session-gauge accounting from the auth-outcome counters: - RecordLogin / RecordOAuthCallback are now pure event counters. - Add RecordSessionStarted(), called by both the local-login and OAuth-callback handlers only after session.Save() succeeds, so a failed save can never inflate the gauges. This makes both login paths symmetric and pairs cleanly with RecordSessionExpired / RecordLogout on the decrement side, eliminating the drift class at its root. Regenerated the Recorder mock and added tests covering the start/expire net-zero invariant and the counter decoupling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Operators running AuthGate behind Prometheus/Grafana could see request/token/auth rates but not how the system is actually used — active users, registered clients, consent activity, or rate-limit abuse — and the
sessions_activegauge drifted upward because non-logout expiries never decremented it. This PR adds four categories of usage/system-scale metrics, fixes the session gauge drift (across all server-observable clear paths), and correctsdocs/METRICS.mddrift. New recording is additive (no renamed/removed series) and a no-op whenMETRICS_ENABLED=false.rate_limit_exceeded_total{endpoint}recorded in the 429 handler (Endpoint+Recorderthreaded throughmiddleware.RateLimitConfig).oauth_consent_granted_total{flow}/oauth_consent_revoked_total{actor}viaAuthorizationService(recorder injected,nil→noop default).authgate_users_active{auth_source},authgate_users_registered,authgate_oauth_clients_registered{status},authgate_build_info{version,commit}, reconciled through the cached gauge-update job. New read-only store aggregatesCountActiveUsersByAuthSource/CountAllUsers/CountClientsByStatus.sessions_activedrift fix:RecordSessionExpired(reason)incrementssessions_expired_total{reason}and decrementssessions_activeon idle-timeout, fingerprint-mismatch, and account-inactive (deleted/disabled user) clear paths — previously onlylogoutdecremented.INTEGRATION.mdlink fromdocs/METRICS.md; documented the existingcache_*metrics and all new series with Grafana queries + alerts; synceddocs/MONITORING.mdandCLAUDE.md.Architecture / flow
How recording is wired. Every node maps to a file this PR changes; the two green nodes are the review-critical follow-ups (the
account_inactivedrift fix and the index-backed COUNTs). All recording funnels throughcore.Recorder, which is the realMetricswhen enabled and a no-op otherwise.flowchart TD subgraph req[Request-path recording] RL["ratelimit.go — 429 handler"] SESS["auth.go — session clear<br/>idle / fingerprint / account_inactive"] CONS["authorization.go — consent grant/revoke"] end subgraph bg[Background job + startup] JOB["server.go — updatePopulationGauges"] STORE[("store COUNT — index-backed")] BUILD["cache.go — SetBuildInfo (startup)"] end REC{{"core.Recorder<br/>Metrics or Noop"}} RL -->|RecordRateLimitExceeded| REC SESS -->|RecordSessionExpired| REC CONS -->|"RecordConsentGranted / Revoked"| REC JOB --> STORE STORE -->|"SetUsers* / SetClients*Count"| REC BUILD -->|SetBuildInfo| REC REC --> SCRAPE["GET /metrics (router.go)"] subgraph series[New series exposed] S1["rate_limit_exceeded_total"] S2["sessions_active / sessions_expired_total"] S3["oauth_consent_granted / revoked_total"] S4["authgate_users_active / _registered"] S5["authgate_oauth_clients_registered"] S6["authgate_build_info"] end SCRAPE --> S1 & S2 & S3 & S4 & S5 & S6 style SESS fill:#dff0d8,stroke:#3c763d style STORE fill:#dff0d8,stroke:#3c763dTwo indexes are added via gorm tags and created by the existing
AutoMigrateon next startup:users(is_active, auth_source)→idx_users_active_source(forauthgate_users_active)oauth_applications(status)(forauthgate_oauth_clients_registered)They make the population-gauge
COUNTqueries index scans instead of full table scans. Deploy note: on a large existing Postgres table the first startup runs a one-timeCREATE INDEX(brief write lock); SQLite is fast. No columns/tables added, no data migration.Post-review follow-ups (commits 2–5)
The first commit was the feature; the rest are review/perf follow-ups (all green):
refactor— reuseversion.GetVersion()instead of a duplicate helper; derive the auth_source label set fromauth.Provider*constants instead of magic strings.fix— complete thesessions_activedrift fix for the account-inactive clear path (the third path the plan named but the feature commit missed); thread a nil-safe recorder throughRequireAuth/OptionalAuth.perf— the two indexes above; hardenrate_limit_exceeded_totalto fall back to a constant"unknown"label (never the raw request path) so a misconfigured limiter can't explode cardinality.docs— table alignment.AI Authorship
Change classification
Touches the shared
core.Recorderinterface, session/rate-limit/auth middleware, bootstrap wiring, and the DB schema (indexes). Mitigating factor: all new recording is additive and a no-op whenMETRICS_ENABLED=false; the indexes are additive and don't change query results.Plan reference
plan.md(force-added to the branch). Done-definition checklist is fully ticked.Verification
TestPopulationGaugeIndexes).middleware/metrics_integration_test.go(429→counter, idle-timeout + fingerprint-mismatch decrement);services/authorization_metrics_test.go(all four consent paths);middleware/auth_test.go(deleted-user clear →account_inactivedecrement).bootstrap/metrics_endpoint_test.go): happy-path scrape exposes the new series;METRICS_ENABLED=false→/metrics404 (silent); bearer-token auth (401 without/wrong, 200 with).make generate && make build;METRICS_ENABLED=true ./bin/authgate server;curl -s localhost:8080/metrics | grep -E '^(rate_limit_exceeded|oauth_consent|authgate_users_|authgate_oauth_clients_registered|cache_(hits|misses))'.Gates:
make generate(mocks regenerated, no diff),make fmt,make lint(0 issues),make test(1764 passed / 20 packages).Verifiability check
docs/METRICS.md.database_query_errors_totalcovers gauge-reconciliation query failures.Security check
s3cr3t-metrics-token) inmetrics_endpoint_test.go.endpointlabel now falls back to a constant"unknown", not the raw path, so cardinality stays bounded even if misconfigured./metricsbearer-token auth path is exercised by the e2e test.session.Save().Risk & rollback
CREATE INDEX(brief write lock on large Postgres tables).git revertthe relevant commit(s). Reverting code leaves the (harmless, unused) indexes in place — drop them manually if desired. No data or config migration.Reviewer guide
internal/middleware/auth.go—clearLiveSession+ theRecordSessionExpiredcalls on the idle-timeout / fingerprint-mismatch / account-inactive clear paths (confirm no control-flow change; recorded only after a successful save).internal/models/user.go+oauth_application.go— the new index tags (the migration).internal/core/metrics.go+internal/metrics/{metrics,http,noop}.go— new Recorder methods land in interface + Metrics + Noop + mock together.internal/bootstrap/server.go/router.go—updatePopulationGaugesreconciliation and the recorder threaded through allRequireAuth/OptionalAuthcall sites.internal/services/authorization.go— the four consent recording sites.mock_metrics.go/mock_store.go;docs/*andCLAUDE.md; test files; the mechanical*_test.goconstructor-signature updates.🤖 Generated with Claude Code