feat(upsampling) - Support upsampled error count with performance optimizations #4
zaibkhan wants to merge 3 commits into
…(#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link The events-stats API now checks whether all projects in the query are allowlisted for upsampling and, if so, converts the count query to a sum over `sample_weight` in Snuba. This is done by defining a new SnQL function, `upsampled_count()`. This endpoint also uses the eps() and epm() functions. For correctness they should probably also use `sample_weight` rather than counting naively, and I considered (and even started on) swapping them as well. That caused some complications, and since they are used only by specific dashboard widgets and are not available in Discover, I decided to defer changing them until we know it is needed.
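To make the difference between a naive count and a sum over `sample_weight` concrete, here is a minimal sketch in plain Python. It assumes each stored event carries a `sample_weight` equal to the inverse of its sample rate; that relationship, the `Event` class, and the function names are illustrative assumptions, not Sentry or Snuba code.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical stand-in for one stored error row.
    sample_weight: float  # assumed to be 1 / sample_rate


def naive_count(events):
    """What count() reports: one per stored row."""
    return len(events)


def upsampled_count(events):
    """What a sum over sample_weight reports: an estimate of the original volume."""
    return sum(e.sample_weight for e in events)


# Three events kept at a 10% sample rate, two kept at 100%.
events = [Event(10.0)] * 3 + [Event(1.0)] * 2
print(naive_count(events))      # 5
print(upsampled_count(events))  # 32.0
```

With mixed sample rates the naive count understates the original volume, which is why a rate function such as eps() or epm() built on a naive count would be affected in the same way.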
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
Codoki PR Review
Summary: Register upsampled_count, fix cache key stability, apply safe transforms
Issues (Critical & High only)
Key Feedback
Confidence: 3/5 - Needs work before merge (1 high · status: Requires changes)
Sequence Diagram
sequenceDiagram
participant Client
participant Endpoint
participant Helper
Client->>Endpoint: GET /organization-events-stats
Endpoint->>Helper: is_errors_query_for_error_upsampled_projects(snuba_params, org, dataset, request)
alt cache hit
Helper-->>Endpoint: cached_result && _should_apply_sample_weight_transform()
else cache miss
Helper->>Helper: _are_all_projects_error_upsampled()
Helper->>Helper: cache.set(key, is_eligible, 60)
Helper-->>Endpoint: is_eligible && _should_apply_sample_weight_transform()
end
alt should_upsample
Endpoint->>Endpoint: transform_query_columns_for_error_upsampling()
end
Endpoint-->>Client: timeseries response
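The cache-hit and cache-miss branches in the diagram can be sketched as follows. This is a simplified illustration, not the PR's implementation: the in-memory dictionary stands in for a shared cache, the `allowlist` argument and the function names are hypothetical, and treating an empty project list as ineligible is an assumption. The separate `_should_apply_sample_weight_transform()` check from the diagram is left out because it runs outside the cached part.

```python
import time

CACHE_TTL_SECONDS = 60
_cache = {}  # key -> (expires_at, value); stand-in for a shared cache


def cache_get(key, now=None):
    """Return the cached value, or None if the entry is missing or expired."""
    now = time.monotonic() if now is None else now
    entry = _cache.get(key)
    if entry is None or entry[0] <= now:
        return None
    return entry[1]


def cache_set(key, value, ttl, now=None):
    """Store a value that expires ttl seconds from now."""
    now = time.monotonic() if now is None else now
    _cache[key] = (now + ttl, value)


def is_eligible_cached(org_id, project_ids, allowlist, now=None):
    """Check the allowlist at most once per TTL window for a given project set."""
    ids = ",".join(map(str, sorted(project_ids)))
    key = f"error_upsampling_eligible:{org_id}:{ids}"
    cached = cache_get(key, now)
    if cached is not None:  # a cached False is still a hit
        return cached
    # Assumption: an empty project list is not eligible.
    eligible = bool(project_ids) and all(p in allowlist for p in project_ids)
    cache_set(key, eligible, CACHE_TTL_SECONDS, now)
    return eligible
```

The `is not None` comparison matters here: testing the cached value for truthiness would treat a cached `False` as a miss and repeat the lookup on every request for ineligible projects.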
    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
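🔷 Medium: Python's built-in hash() is not a stable basis for a cache key. String and bytes hashes are randomized per process (PYTHONHASHSEED). Hashes of integer tuples are not randomized, but they are a CPython implementation detail that can differ across Python versions and builds, and distinct project sets can collide on the same value. In a shared cache this risks missed hits, unreliable invalidation, and, on a collision, one project set reading another's eligibility. Build the key from the sorted project IDs instead.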
Suggested change:
- cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
+ cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(map(str, sorted(snuba_params.project_ids)))}"
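The suggested key grows with the number of project IDs. A possible variant, sketched below, keeps the key at a fixed length by digesting the joined IDs with `hashlib`, whose output does not depend on the process or Python version. This goes beyond what the review proposes: the helper name and the 16-character truncation are assumptions, and whether key length matters depends on the cache backend in use.

```python
import hashlib


def eligibility_cache_key(organization_id, project_ids):
    """Build a deterministic, fixed-length cache key (hypothetical helper)."""
    # Sort so that the same set of projects always yields the same key.
    ids = ",".join(map(str, sorted(project_ids)))
    # SHA-256 is stable across processes, unlike relying on built-in hash().
    digest = hashlib.sha256(ids.encode("utf-8")).hexdigest()[:16]
    return f"error_upsampling_eligible:{organization_id}:{digest}"


# Order of the IDs does not change the key; a different set does.
same = eligibility_cache_key(1, [3, 1, 2]) == eligibility_cache_key(1, [1, 2, 3])
different = eligibility_cache_key(1, [1, 2]) != eligibility_cache_key(1, [1, 2, 3])
print(same, different)  # True True
```

Truncating the digest trades a small collision probability for a shorter key; keeping the full digest, or the plain joined IDs from the suggestion, avoids that trade-off.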
    This should be called when the allowlist configuration changes to ensure
    cache consistency across the system.
    """
    cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
🔷 Medium: Same issue as above. The invalidation key must match the key used when the entry was set, in every process; deriving it from hash() makes eviction unreliable.
Suggested change:
- cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
+ cache_key = f"error_upsampling_eligible:{organization_id}:{','.join(map(str, sorted(project_ids)))}"
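One way to keep the set and invalidate paths from drifting apart is to route both through a single key builder. The sketch below illustrates that idea only; `DictCache` is an in-memory stand-in rather than a real cache backend, and all names are hypothetical.

```python
def eligibility_cache_key(organization_id, project_ids):
    """Single source of truth for the key format (hypothetical helper)."""
    ids = ",".join(map(str, sorted(project_ids)))
    return f"error_upsampling_eligible:{organization_id}:{ids}"


class DictCache:
    """In-memory stand-in for a shared cache; TTL handling is omitted."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


cache = DictCache()


def remember_eligibility(organization_id, project_ids, is_eligible):
    cache.set(eligibility_cache_key(organization_id, project_ids), is_eligible)


def invalidate_eligibility(organization_id, project_ids):
    cache.delete(eligibility_cache_key(organization_id, project_ids))
```

Because the key encodes the exact project set, invalidation evicts only the entry for that same set. Entries cached for other combinations of projects stay in place until their own expiry.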
    if column_lower == "count()":
        # Transform to upsampled count - assumes sample_weight column exists
        # for all events in allowlisted projects per our data model requirements
        transformed_columns.append("upsampled_count() as count")
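For context, the branch above presumably sits inside a loop over the requested columns. The following is a guess at the surrounding shape, reusing the function name from the sequence diagram; everything outside the four lines quoted in the diff (the signature, the normalization, and the pass-through branch) is an assumption.

```python
def transform_query_columns_for_error_upsampling(query_columns):
    """Swap count() for upsampled_count(), leaving other columns untouched."""
    transformed_columns = []
    for column in query_columns:
        # Assumed normalization so "Count() " and "count()" are treated alike.
        column_lower = column.lower().strip()
        if column_lower == "count()":
            # Alias back to "count" so callers keep reading the same field name.
            transformed_columns.append("upsampled_count() as count")
        else:
            # eps() and epm() pass through unchanged, matching the PR description.
            transformed_columns.append(column)
    return transformed_columns


print(transform_query_columns_for_error_upsampling(["count()", "epm()"]))
# ['upsampled_count() as count', 'epm()']
```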
This PR adds upsampled error count support to the events-stats API, so that counts for projects allowlisted for upsampling are computed from `sample_weight` rather than from the number of stored events. It also caches the upsampling eligibility check for 60 seconds to avoid repeated allowlist lookups.
Key changes:
Benchmark reference:
Replicated from ai-code-review-evaluation/sentry-greptile PR #3