feat(upsampling) - Support upsampled error count with performance optimizations#18
feat(upsampling) - Support upsampled error count with performance optimizations#18ShashankFC wants to merge 2 commits into
Conversation
…(#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link Events-stats API will now check if all projects in the query are allowlisted for upsampling, and convert the count query to a sum over `sample_weight` in Snuba, this is done by defining a new SnQL function `upsampled_count()`. I noticed there are also eps() and epm() functions in use in this endpoint. I considered (and even worked on) also supporting swapping eps() and epm() which for correctness should probably also not count naively and use `sample_weight`, but this caused some complications and since they are only in use by specific dashboard widgets and not available in discover I decided to defer changing them until we realize it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance - Separate upsampling eligibility check from query transformation for better optimization - Remove unnecessary null checks in upsampled_count() function per schema requirements - Add cache invalidation utilities for configuration management This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
|
bugbot run |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| expensive repeated option lookups during high-traffic periods. This is safe | ||
| because allowlist changes are infrequent and eventual consistency is acceptable. | ||
| """ | ||
| cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}" |
There was a problem hiding this comment.
Non-deterministic hash() makes cache keys unstable across processes
Medium Severity
The cache key uses Python's hash() on a tuple of project IDs, but hash() is non-deterministic across processes due to PYTHONHASHSEED randomization (enabled by default since Python 3.3). Each worker process computes a different hash for the same set of project IDs, so the shared cache (memcached/Redis) ends up with per-process entries that are never shared. This renders the caching ineffective and makes invalidate_upsampling_cache unable to delete entries set by other processes. A deterministic representation (e.g., a sorted comma-joined string of IDs) would be needed instead.
Additional Locations (1)
| # This cached result ensures consistent behavior across query execution | ||
| should_upsample = is_errors_query_for_error_upsampled_projects( | ||
| snuba_params, organization, dataset, request | ||
| ) |
There was a problem hiding this comment.
Upsampling check uses outer dataset instead of scoped_dataset
Medium Severity
The upsampling eligibility check passes the outer closure variable dataset (from the HTTP request) to is_errors_query_for_error_upsampled_projects, but _get_event_stats receives scoped_dataset as a parameter, which can differ from dataset during dashboard widget split processing. When _dashboards_check_and_get_event_stats calls _get_event_stats with a different dataset (e.g., discover for error-split widgets), the upsampling decision is based on the original request dataset rather than the actual dataset being queried.
| query = request.GET.get("query", "").lower() | ||
|
|
||
| if "event.type:error" in query: | ||
| return True |
There was a problem hiding this comment.
Negated error queries incorrectly match as error-focused
Medium Severity
The substring check "event.type:error" in query in _is_error_focused_query also matches negated forms like !event.type:error, which explicitly excludes error events. This could incorrectly trigger upsampling on the discover dataset for queries targeting non-error events, where sample_weight may not be populated, leading to incorrect counts.
| [Function("sum", [Column("sample_weight")])], | ||
| alias, | ||
| ), | ||
| default_result_type="number", |
There was a problem hiding this comment.
upsampled_count result type mismatches count it replaces
Low Severity
The upsampled_count function has default_result_type="number" but the count function it transparently replaces (via the "upsampled_count() as count" alias) has default_result_type="integer". Since this function is designed to be a drop-in substitute for count(), the result type metadata returned in the API response will change from "integer" to "number" when upsampling is active, which could affect how the frontend formats or renders the values (e.g., displaying decimal places).
| cache consistency across the system. | ||
| """ | ||
| cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}" | ||
| cache.delete(cache_key) |
There was a problem hiding this comment.
invalidate_upsampling_cache is defined but never used
Low Severity
invalidate_upsampling_cache is a public function that is defined but never imported or called anywhere in the codebase. Beyond being dead code, it also suffers from the same non-deterministic hash() cache key issue, so even if it were called, it couldn't reliably invalidate cache entries set by other worker processes.


Test 3
Summary by CodeRabbit
Release Notes
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
Replicated from ai-code-review-evaluation/sentry-coderabbit#3
Note
Medium Risk
Changes query aggregation semantics for a widely used stats endpoint and relies on
sample_weight/allowlist correctness; mistakes could skew customer-facing counts, though the behavior is gated and covered by tests.Overview
organization-events-statsnow conditionally rewritescount()aggregations to an upsampled equivalent when querying error events and all requested projects are in theissues.client_error_sampling.project_allowlist.This introduces a new helper (
sentry.api.helpers.error_upsampling) with a short-lived cache for allowlist eligibility, adds anupsampled_count()SnQL function (summingsample_weight) todiscover, updates test event normalization to surfacesample_ratefromcontexts.error_sampling, and adds unit + endpoint tests covering allowlisted/partial/no allowlist and transaction-query bypass cases.Written by Cursor Bugbot for commit 6ad6fe3. Configure here.