Skip to content

feat(upsampling) - Support upsampled error count with performance optimizations#3

Draft
ShashankFC wants to merge 2 commits into
masterfrom
error-upsampling-race-condition
Draft

feat(upsampling) - Support upsampled error count with performance optimizations#3
ShashankFC wants to merge 2 commits into
masterfrom
error-upsampling-race-condition

Conversation

@ShashankFC

Copy link
Copy Markdown
Collaborator

Test 3nn---nReplicated from ai-code-review-evaluation/sentry-greptile#3

yuvmen and others added 2 commits July 25, 2025 09:48
…(#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

Events-stats API will now check if all projects in the query are
allowlisted for upsampling, and convert the count query to a sum over
`sample_weight` in Snuba, this is done by defining a new SnQL function
`upsampled_count()`.

I noticed there are also eps() and epm() functions in use in this
endpoint. I considered (and even worked on) also supporting
swapping eps() and epm() which for correctness should probably also not
count naively and use `sample_weight`, but this
caused some complications and since they are only in use by specific
dashboard widgets and not available in discover
I decided to defer changing them until we realize it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
@ShashankFC

Copy link
Copy Markdown
Collaborator Author

@claude

@claude

claude Bot commented Jan 7, 2026

Copy link
Copy Markdown

Claude finished @ShashankFC's task in 7m 14s —— View job


PR Review

Todo List

  • Gather context and review PR changes
  • Read modified and new files
  • Analyze implementation for issues and improvements
  • Provide comprehensive review feedback

Overview

This PR implements error upsampling functionality for the events-stats API with performance optimizations including caching. The implementation allows Sentry to report upsampled error counts based on sample_weight for allowlisted projects.

Critical Issues

1. Race Condition in Cache Implementation ⚠️

Location: src/sentry/api/helpers/error_upsampling.py:27-40

There's a critical race condition in the caching logic that matches the PR branch name error-upsampling-race-condition:

cached_result = cache.get(cache_key)
if cached_result is not None:
    return cached_result and _should_apply_sample_weight_transform(dataset, request)

# Cache miss - perform fresh allowlist check
is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
cache.set(cache_key, is_eligible, 60)

Problem: The function _are_all_projects_error_upsampled() (lines 50-54) reads the allowlist configuration fresh every time it's called, even though the result is cached. This creates a window where:

  1. Thread A checks cache (miss), reads allowlist (returns True)
  2. Thread B checks cache (miss), reads allowlist (returns True)
  3. Allowlist configuration changes (project removed from allowlist)
  4. Thread A caches True
  5. Thread B caches True
  6. Both continue with stale data for 60 seconds

Additionally, there's a second race condition between the eligibility check and the dataset check:

  1. Eligibility is checked and cached (project is allowlisted)
  2. _should_apply_sample_weight_transform is called later
  3. Configuration could change between these two checks

Impact: Events could be incorrectly upsampled for projects that were just removed from the allowlist, or vice versa, for up to 60 seconds.

Recommendation:

# Option 1: Cache the complete decision
should_upsample = is_errors_query_for_error_upsampled_projects(...)
# Store this boolean and use it consistently throughout request

# Option 2: Document this as acceptable eventual consistency
# Add comment explaining the 60-second window is intentional

Fix race condition →

2. Inconsistent Cache Key Generation

Location: src/sentry/api/helpers/error_upsampling.py:27 and error_upsampling.py:73

The cache key uses hash(tuple(sorted(project_ids))) which can have collisions:

cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

Problem: Python's hash() is not cryptographically secure and can produce collisions, especially for similar project ID combinations. Different sets of project IDs could hash to the same value, causing incorrect cache hits.

Recommendation: Use a deterministic string representation instead:

cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(map(str, sorted(snuba_params.project_ids)))}"

Fix cache key →

3. Missing Error Handling

Location: src/sentry/api/endpoints/organization_events_stats.py:220-226

The upsampling check has no error handling:

should_upsample = is_errors_query_for_error_upsampled_projects(
    snuba_params, organization, dataset, request
)

Problem: If the options service fails, cache fails, or any exception occurs, the entire query will fail. This could cause a production outage.

Recommendation: Add defensive error handling with fallback to non-upsampled queries:

try:
    should_upsample = is_errors_query_for_error_upsampled_projects(...)
except Exception as e:
    sentry_sdk.capture_exception(e)
    should_upsample = False  # Fail safe - use regular count

Add error handling →

Major Issues

4. Type Safety Issue with ModuleType

Location: src/sentry/api/helpers/error_upsampling.py:16

def is_errors_query_for_error_upsampled_projects(
    snuba_params: SnubaParams,
    organization: Organization,
    dataset: ModuleType,  # ❌ Too generic
    request: Request,
) -> bool:

Problem: Using ModuleType for the dataset parameter is overly permissive and loses type safety. Any Python module can be passed, not just dataset modules.

Recommendation: Define a protocol or use a more specific type:

from typing import Protocol

class DatasetModule(Protocol):
    """Protocol for dataset modules"""
    pass

def is_errors_query_for_error_upsampled_projects(
    snuba_params: SnubaParams,
    organization: Organization,
    dataset: DatasetModule,  # ✅ More specific
    request: Request,
) -> bool:

5. Incomplete Test Coverage

Location: tests/sentry/api/helpers/test_error_upsampling.py

The unit tests don't cover:

  • Cache hit scenarios
  • Cache invalidation (invalidate_upsampling_cache is never tested)
  • Race conditions between eligibility and dataset checks
  • Edge cases like empty project lists with cached results

Recommendation: Add tests for:

def test_cache_hit_scenario(self):
    # Test that cached results are used
    
def test_invalidate_upsampling_cache(self):
    # Test cache invalidation works
    
def test_race_condition_scenario(self):
    # Test behavior when config changes between checks

6. Magic Number - Cache TTL

Location: src/sentry/api/helpers/error_upsampling.py:38

cache.set(cache_key, is_eligible, 60)  # ❌ Magic number

Recommendation: Define as a named constant:

ERROR_UPSAMPLING_CACHE_TTL = 60  # seconds - balance between performance and config freshness

cache.set(cache_key, is_eligible, ERROR_UPSAMPLING_CACHE_TTL)

Minor Issues

7. Redundant Variable Assignment

Location: src/sentry/api/endpoints/organization_events_stats.py:224-226

upsampling_enabled = should_upsample  # Redundant
final_columns = query_columns

Problem: Creates unnecessary intermediate variables that don't add clarity.

Recommendation: Use should_upsample directly throughout the function.

8. Code Duplication

Location: src/sentry/api/endpoints/organization_events_stats.py:232-233, 276-277, 295-296

The transformation is duplicated three times:

if upsampling_enabled:
    final_columns = transform_query_columns_for_error_upsampling(query_columns)

Recommendation: Extract to a helper function or apply transformation once at the beginning.

9. Comment Grammar

Location: src/sentry/api/helpers/error_upsampling.py:52-53

# NOTE: This function reads the allowlist configuration fresh each time,
# which means it can return different results between calls if the 

The comment is helpful but could be more precise about why this behavior exists and the implications.

10. Missing Docstring for upsampled_count

Location: src/sentry/search/events/datasets/discover.py:1041-1052

SnQLFunction(
    "upsampled_count",
    required_args=[],
    # Optimized aggregation for error upsampling - assumes sample_weight
    # exists for all events in allowlisted projects as per schema design
    snql_aggregate=lambda args, alias: Function(
        "toInt64",
        [Function("sum", [Column("sample_weight")])],
        alias,
    ),
    default_result_type="number",
),

Recommendation: Add a docstring explaining when this function should be used and its assumptions.

Performance Concerns

11. Cache Key Complexity

The cache key computation involves:

  • Tuple creation
  • Sorting project IDs
  • Hashing

For queries with many projects, this could be expensive. Consider profiling for high project counts.

12. Redundant Dataset Checks

The _should_apply_sample_weight_transform function is called after caching, which means the dataset check happens on every request even for cache hits. Consider including the dataset type in the cache key.

Security Concerns

13. Query Injection Risk (Low)

Location: src/sentry/api/helpers/error_upsampling.py:135-136

query = request.GET.get("query", "").lower()
if "event.type:error" in query:

This is a simple string check, but it doesn't validate query syntax. An attacker could potentially craft a query that includes event.type:error as a comment or string literal to trigger upsampling inappropriately.

Recommendation: Use proper query parsing instead of string matching if available in the codebase.

Documentation Issues

14. Missing Architecture Documentation

The PR description is minimal ("Test 3nn---n*Replicated from..."). There should be documentation explaining:

  • How error upsampling works
  • When to use it
  • How to add projects to the allowlist
  • Performance characteristics
  • Known limitations (eventual consistency)

15. Unclear Function Naming

The function _is_error_focused_query (line 130) only checks for event.type:error but doesn't check for error-specific filters. The name suggests broader checking.

Test Issues

16. Mock Test Data Inconsistency

Location: tests/snuba/api/endpoints/test_organization_events_stats.py:3576-3580

"contexts": {"error_sampling": {"client_sample_rate": 0.1}},

The test data includes client_sample_rate: 0.1 suggesting events should be upsampled by 10x, but the assertions expect exactly 10:

assert data[0][1][0]["count"] == 10  # First bucket has 1 event

This works, but the comment "has 1 event" is misleading - it has 1 event that's being reported as 10 due to upsampling.

17. Missing Edge Case Tests

No tests for:

  • Queries with top_events > 0
  • Queries with comparison delta
  • Empty result sets
  • Very large project lists

Positive Observations ✅

  1. Good separation of concerns - upsampling logic is in a dedicated module
  2. Comprehensive test coverage for basic scenarios
  3. Caching strategy is a good optimization approach
  4. Code follows Sentry patterns from CLAUDE.md
  5. Type hints are used throughout
  6. Comments explain intent well in most places

Summary

This PR implements a valuable feature but has several critical issues that should be addressed before merging:

Must Fix:

  1. Race condition in cache implementation (addresses PR branch name concern)
  2. Cache key collision risk
  3. Missing error handling

Should Fix:
4. Type safety with ModuleType
5. Test coverage gaps
6. Magic numbers

Nice to Have:
7. Code deduplication
8. Documentation improvements
9. Performance optimizations

The implementation is on the right track but needs refinement for production readiness, especially around the caching strategy and race condition handling.

--- | Branch: error-upsampling-race-condition

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants