Skip to content

feat(upsampling) - Support upsampled error count with performance optimizations#4

Open
zaibkhan wants to merge 3 commits into
masterfrom
error-upsampling-race-condition
Open

feat(upsampling) - Support upsampled error count with performance optimizations#4
zaibkhan wants to merge 3 commits into
masterfrom
error-upsampling-race-condition

Conversation

@zaibkhan

@zaibkhan zaibkhan commented Sep 3, 2025

Copy link
Copy Markdown

This PR introduces upsampled error count support in the events-stats API, improving error metrics accuracy under race conditions. It also includes performance optimizations with caching to reduce response latency.

Key changes:

  • Add upsampled error count path to events-stats
  • Mitigate race conditions around sampling computations
  • Add caching/perf improvements for hot endpoints
  • Update tests to cover the new path

Benchmark reference:
Replicated from ai-code-review-evaluation/sentry-greptile PR #3

yuvmen and others added 2 commits July 25, 2025 09:48
…(#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

Events-stats API will now check if all projects in the query are
allowlisted for upsampling, and convert the count query to a sum over
`sample_weight` in Snuba, this is done by defining a new SnQL function
`upsampled_count()`.

I noticed there are also eps() and epm() functions in use in this
endpoint. I considered (and even worked on) also supporting
swapping eps() and epm() which for correctness should probably also not
count naively and use `sample_weight`, but this
caused some complications and since they are only in use by specific
dashboard widgets and not available in discover
I decided to defer changing them until we realize it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
@codoki-ai codoki-ai deleted a comment from codoki-pr-intelligence Bot Sep 3, 2025
@codoki-ai codoki-ai deleted a comment from codoki-pr-intelligence Bot Sep 3, 2025
@codoki-ai codoki-ai deleted a comment from codoki-pr-intelligence Bot Sep 3, 2025
@codoki-pr-intelligence

codoki-pr-intelligence Bot commented Sep 3, 2025

Copy link
Copy Markdown

Codoki PR Review

Summary: Register upsampled_count, fix cache key stability, apply safe transforms
What’s good: Clear separation of concerns: eligibility check, dataset guard, and late column transformation minimize unintended scope. Integration tests exercise the new path across errors/transactions and allowlist mixes.
Review Status: ❌ Requires changes
Overall Priority: High

Issues (Critical & High only)

Severity Issue Why it matters File
High Correctness — upsampled_count not registered in production registry upsampled_count is only registered in the test factory but not in the production function registry (e.g., sentry.snuba.functions). In production, this will cause query parsing to fail when the transform is applied, breaking the endpoint for allowlisted orgs. …/helpers/error_upsampling.py
Key Feedback (click to expand)
  • Needs improvement: Cache key construction uses Python's hash(), which is non-deterministic across processes and will break shared caching/invalidation; switch to a deterministic key. The new upsampling function is only registered in test factories, so live queries will error where the transform is applied.
  • Testing: Integration tests pass by registering upsampled_count in test factories, masking the fact that production function registry lacks this function. Add a test that runs through the real function registry (e.g., via sentry.snuba.functions) to ensure 'upsampled_count' is defined in production code.
  • Documentation: Consider documenting in error_upsampling.py where/when invalidate_upsampling_cache must be triggered (e.g., option change hook), and that upsampled_count must be present in the production function registry.
  • Performance: Using Python's hash() in cache keys prevents cross-process cache reuse and breaks invalidation; use a deterministic join or stable digest of project IDs to enable shared caching and reliable eviction.
  • Open questions: What mechanism will call invalidate_upsampling_cache when the allowlist option changes? If none exists, should we reduce TTL or wire a signal to ensure consistency?

Confidence: 3/5 — Needs work before merge (1 high · status: Requires changes)

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint
    participant Helper
    Client->>Endpoint: GET /organization-events-stats
    Endpoint->>Helper: is_errors_query_for_error_upsampled_projects(snuba_params, org, dataset, request)
    alt cache hit
        Helper-->>Endpoint: cached_result && _should_apply_sample_weight_transform()
    else cache miss
        Helper->>Helper: _are_all_projects_error_upsampled()
        Helper->>Helper: cache.set(key, is_eligible, 60)
        Helper-->>Endpoint: is_eligible && _should_apply_sample_weight_transform()
    end
    alt should_upsample
        Endpoint->>Endpoint: transform_query_columns_for_error_upsampling()
    end
    Endpoint-->>Client: timeseries response
Loading

React with 👍 or 👎 if you found this review useful.

expensive repeated option lookups during high-traffic periods. This is safe
because allowlist changes are infrequent and eventual consistency is acceptable.
"""
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔷 Medium: Using Python's hash() in a cache key is non-deterministic across processes (hash randomization) and will break shared cache reuse and invalidation, leading to inconsistent behavior and missed performance benefits.

Suggested change
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(map(str, sorted(snuba_params.project_ids)))}"

This should be called when the allowlist configuration changes to ensure
cache consistency across the system.
"""
cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔷 Medium: Same issue as above: invalidation key must match the setting key across processes; using hash() prevents reliable eviction.

Suggested change
cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
cache_key = f"error_upsampling_eligible:{organization_id}:{','.join(map(str, sorted(project_ids)))}"

if column_lower == "count()":
# Transform to upsampled count - assumes sample_weight column exists
# for all events in allowlisted projects per our data model requirements
transformed_columns.append("upsampled_count() as count")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ High: upsampled_count is only registered in the test factory but not in the production function registry (e.g., sentry.snuba.functions). In production, this will cause query parsing to fail when the transform is applied, breaking the endpoint for allowlisted orgs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants