feat(upsampling) - Support upsampled error count with performance optimizations #14
akshayutture-augment wants to merge 2 commits into
…(#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API now checks whether all projects in the query are allowlisted for upsampling and, if so, converts the count query to a sum over `sample_weight` in Snuba. This is done by defining a new SnQL function, `upsampled_count()`.

I noticed that `eps()` and `epm()` are also used in this endpoint. I considered (and even started working on) swapping them as well, since for correctness they should probably also use `sample_weight` rather than count naively. That caused some complications, and because they are only used by specific dashboard widgets and are not available in Discover, I decided to defer changing them until we find it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
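The flow described above (check that every project is allowlisted, then swap the count aggregate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: the helper names `all_projects_allowlisted`, `rewrite_count_columns`, and `select_columns` are hypothetical, the allowlist is modeled as a plain set of project IDs, and the rewritten column string `upsampled_count() as count` is an assumed spelling.

```python
from collections.abc import Sequence


def all_projects_allowlisted(project_ids: Sequence[int], allowlist: set[int]) -> bool:
    """Return True only if the query has projects and every one is allowlisted."""
    return bool(project_ids) and all(pid in allowlist for pid in project_ids)


def rewrite_count_columns(columns: Sequence[str]) -> list[str]:
    """Swap the naive count() aggregate for upsampled_count(); keep other columns."""
    return [
        # Assumed column spelling; eps() and epm() are deliberately left alone,
        # matching the PR description's decision to defer them.
        "upsampled_count() as count" if col.strip().lower() == "count()" else col
        for col in columns
    ]


def select_columns(
    project_ids: Sequence[int], allowlist: set[int], columns: Sequence[str]
) -> list[str]:
    """Apply the rewrite only when the whole query is eligible for upsampling."""
    if all_projects_allowlisted(project_ids, allowlist):
        return rewrite_count_columns(columns)
    return list(columns)
```

Keeping the eligibility check separate from the column rewrite, as the second bullet describes, means the check can be cached on its own while the rewrite stays a pure function of the requested columns.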
@claude review this PR
Claude finished @akshayutture-augment's task

Code Review - Error Upsampling with Performance Optimizations

I've completed my review of this PR. Overall, the implementation is solid with good test coverage, but I've identified several important issues that should be addressed.

✅ Strengths
🔴 Critical Issues

1. Race Condition in Caching Logic (src/sentry/api/helpers/error_upsampling.py:27-40)

The caching implementation has a critical race condition that can cause inconsistent query results:

    # Line 27-40
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
    # Check cache first for performance optimization
    cached_result = cache.get(cache_key)
    if cached_result is not None:
        return cached_result and _should_apply_sample_weight_transform(dataset, request)
    # Cache miss - perform fresh allowlist check
    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
    # Cache for 60 seconds to improve performance during traffic spikes
    cache.set(cache_key, is_eligible, 60)
    return is_eligible and _should_apply_sample_weight_transform(dataset, request)

Problem: The function caches
If the dataset or request query changes between calls with the same cache key, you'll get inconsistent results.

Solution: Cache the full eligibility decision:

    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}:{dataset.__name__}:{request.GET.get('query', '')}"
    cached_result = cache.get(cache_key)
    if cached_result is not None:
        return cached_result
    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
    should_transform = is_eligible and _should_apply_sample_weight_transform(dataset, request)
    cache.set(cache_key, should_transform, 60)
    return should_transform

2. Potential Data Inconsistency (src/sentry/api/helpers/error_upsampling.py:43-64)

The comment on lines 50-53 acknowledges a serious issue but doesn't address it:

    # NOTE: This function reads the allowlist configuration fresh each time,
    # which means it can return different results between calls if the
    # configuration changes during request processing. This is intentional
    # to ensure we always have the latest configuration state.

Problem: Combined with the cache, this creates a window where:

Solution: Either:
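
To make the stale-cache window from issue 2 concrete, here is a minimal sketch. It is an illustration only and makes several assumptions: `TTLCache` is a hypothetical in-memory stand-in (not Sentry's cache backend), `eligibility_key` and `is_eligible` are hypothetical helpers, and the allowlist is a plain set. With a 60-second TTL, a configuration change stays invisible until the entry expires or is explicitly deleted.

```python
import time
from typing import Callable, Hashable, Iterable


class TTLCache:
    """Tiny in-memory TTL cache; a stand-in for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: dict[Hashable, tuple[float, object]] = {}

    def get(self, key: Hashable):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: Hashable, value: object, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)

    def delete(self, key: Hashable) -> None:
        self._store.pop(key, None)


def eligibility_key(org_id: int, project_ids: Iterable[int]) -> tuple:
    """Order-insensitive key for one organization and project set (hypothetical)."""
    return ("error_upsampling_eligible", org_id, tuple(sorted(project_ids)))


def is_eligible(cache: TTLCache, org_id: int, project_ids: list[int],
                allowlist: set[int], ttl: float = 60.0) -> bool:
    """Return the cached allowlist result, recomputing it only on a miss."""
    key = eligibility_key(org_id, project_ids)
    cached = cache.get(key)
    if cached is not None:  # a cached False is still a hit
        return cached
    result = bool(project_ids) and all(p in allowlist for p in project_ids)
    cache.set(key, result, ttl)
    return result
```

Calling `cache.delete(eligibility_key(org_id, project_ids))` whenever the allowlist changes closes the window at the cost of coupling configuration writes to the cache. Letting entries expire keeps the code simpler but accepts up to 60 seconds of staleness.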

3. Missing Null Safety (src/sentry/api/helpers/error_upsampling.py:87-100)

    def transform_query_columns_for_error_upsampling(
        query_columns: Sequence[str],
    ) -> list[str]:
        # ...
        # Note: We rely on the database schema to ensure sample_weight exists for all
        # events in allowlisted projects, so no additional null checks are needed here.

Problem: The comment claims schema guarantees, but:

Impact: Queries will break with NULL errors if

Solution: Add defensive null handling in the SQL function definition in discover.py:1046-1050:

    snql_aggregate=lambda args, alias: Function(
        "toInt64",
        [Function("sum", [Function("coalesce", [Column("sample_weight"), 1])])],
        alias,
    ),
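
The intended semantics of the `coalesce` suggestion above can be sketched in plain Python rather than SnQL. This is an illustration under assumptions: `upsampled_count` here is a hypothetical Python function, not the SnQL function itself. A missing `sample_weight` is modeled as `None` and counted as 1, and `int()` (which truncates toward zero) stands in for the `toInt64` wrapper.

```python
from typing import Iterable, Optional


def upsampled_count(sample_weights: Iterable[Optional[float]]) -> int:
    """Sum the weights, counting an event with no weight as 1, then truncate."""
    # Mirrors sum(coalesce(sample_weight, 1)): None falls back to a weight of 1.
    return int(sum(1 if w is None else w for w in sample_weights))
```

With this fallback, an event that was never sampled contributes exactly one to the total, so the result degrades to a plain count when no weights are present.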
Test 3
Replicated from ai-code-review-evaluation/sentry-greptile#3