Skip to content

feat(upsampling) - Support upsampled error count with performance optimizations#18

Open
ShashankFC wants to merge 2 commits into
masterfrom
error-upsampling-race-condition
Open

feat(upsampling) - Support upsampled error count with performance optimizations#18
ShashankFC wants to merge 2 commits into
masterfrom
error-upsampling-race-condition

Conversation

@ShashankFC

@ShashankFC ShashankFC commented Jan 22, 2026

Copy link
Copy Markdown
Contributor

Test 3

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced error upsampling support for organization event statistics queries, improving accuracy of error metrics for eligible projects through sample-weighted counting.
    • Upsampling applies selectively based on project allowlisting and query type.
  • Tests

    • Added comprehensive test coverage for error upsampling across various project allowlist scenarios and event dataset types.

✏️ Tip: You can customize this high-level summary in your review settings.


Replicated from ai-code-review-evaluation/sentry-coderabbit#3


Note

Medium Risk
Changes query aggregation semantics for a widely used stats endpoint and relies on sample_weight/allowlist correctness; mistakes could skew customer-facing counts, though the behavior is gated and covered by tests.

Overview
organization-events-stats now conditionally rewrites count() aggregations to an upsampled equivalent when querying error events and all requested projects are in the issues.client_error_sampling.project_allowlist.

This introduces a new helper (sentry.api.helpers.error_upsampling) with a short-lived cache for allowlist eligibility, adds an upsampled_count() SnQL function (summing sample_weight) to discover, updates test event normalization to surface sample_rate from contexts.error_sampling, and adds unit + endpoint tests covering allowlisted/partial/no allowlist and transaction-query bypass cases.

Written by Cursor Bugbot for commit 6ad6fe3. Configure here.

yuvmen and others added 2 commits July 25, 2025 09:48
…(#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

Events-stats API will now check if all projects in the query are
allowlisted for upsampling, and convert the count query to a sum over
`sample_weight` in Snuba, this is done by defining a new SnQL function
`upsampled_count()`.

I noticed there are also eps() and epm() functions in use in this
endpoint. I considered (and even worked on) also supporting
swapping eps() and epm() which for correctness should probably also not
count naively and use `sample_weight`, but this
caused some complications and since they are only in use by specific
dashboard widgets and not available in discover
I decided to defer changing them until we realize it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
@ShashankFC

Copy link
Copy Markdown
Contributor Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 5 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

expensive repeated option lookups during high-traffic periods. This is safe
because allowlist changes are infrequent and eventual consistency is acceptable.
"""
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-deterministic hash() makes cache keys unstable across processes

Medium Severity

The cache key uses Python's hash() on a tuple of project IDs, but hash() is non-deterministic across processes due to PYTHONHASHSEED randomization (enabled by default since Python 3.3). Each worker process computes a different hash for the same set of project IDs, so the shared cache (memcached/Redis) ends up with per-process entries that are never shared. This renders the caching ineffective and makes invalidate_upsampling_cache unable to delete entries set by other processes. A deterministic representation (e.g., a sorted comma-joined string of IDs) would be needed instead.

Additional Locations (1)

Fix in Cursor Fix in Web

# This cached result ensures consistent behavior across query execution
should_upsample = is_errors_query_for_error_upsampled_projects(
snuba_params, organization, dataset, request
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upsampling check uses outer dataset instead of scoped_dataset

Medium Severity

The upsampling eligibility check passes the outer closure variable dataset (from the HTTP request) to is_errors_query_for_error_upsampled_projects, but _get_event_stats receives scoped_dataset as a parameter, which can differ from dataset during dashboard widget split processing. When _dashboards_check_and_get_event_stats calls _get_event_stats with a different dataset (e.g., discover for error-split widgets), the upsampling decision is based on the original request dataset rather than the actual dataset being queried.

Fix in Cursor Fix in Web

query = request.GET.get("query", "").lower()

if "event.type:error" in query:
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negated error queries incorrectly match as error-focused

Medium Severity

The substring check "event.type:error" in query in _is_error_focused_query also matches negated forms like !event.type:error, which explicitly excludes error events. This could incorrectly trigger upsampling on the discover dataset for queries targeting non-error events, where sample_weight may not be populated, leading to incorrect counts.

Fix in Cursor Fix in Web

[Function("sum", [Column("sample_weight")])],
alias,
),
default_result_type="number",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upsampled_count result type mismatches count it replaces

Low Severity

The upsampled_count function has default_result_type="number" but the count function it transparently replaces (via the "upsampled_count() as count" alias) has default_result_type="integer". Since this function is designed to be a drop-in substitute for count(), the result type metadata returned in the API response will change from "integer" to "number" when upsampling is active, which could affect how the frontend formats or renders the values (e.g., displaying decimal places).

Fix in Cursor Fix in Web

cache consistency across the system.
"""
cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
cache.delete(cache_key)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalidate_upsampling_cache is defined but never used

Low Severity

invalidate_upsampling_cache is a public function that is defined but never imported or called anywhere in the codebase. Beyond being dead code, it also suffers from the same non-deterministic hash() cache key issue, so even if it were called, it couldn't reliably invalidate cache entries set by other worker processes.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants