Skip to content

Replays Self-Serve Bulk Delete System#5

Draft
ShashankFC wants to merge 32 commits into
replays-delete-vulnerablefrom
replays-delete-stable
Draft

Replays Self-Serve Bulk Delete System#5
ShashankFC wants to merge 32 commits into
replays-delete-vulnerablefrom
replays-delete-stable

Conversation

@ShashankFC

Copy link
Copy Markdown
Collaborator

Test 5nn---nReplicated from ai-code-review-evaluation/sentry-greptile#5

armenzg and others added 30 commits June 20, 2025 12:49
…o 'low' (#93927)"

This reverts commit 8d04522.

Co-authored-by: roaga <47861399+roaga@users.noreply.github.com>
Missed in the initial commit, leading to some relevant logs being
unannotated.
We have had a few tasks get killed at 10% rollout.
Also add a test, so that this doesn't happen again
Fixes DE-129 and DE-156

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
These transitions should be matching
…` (#93946)

Use `project_id` on the replay record instead of the URL (where it does
not always exist).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
Also fixed `replay.view_html` -> `replay.view-html`

---------

Co-authored-by: Michelle Zhang <56095982+michellewzhang@users.noreply.github.com>
…948)

gets `npx @typescript/native-preview` passing again
The conditions associated with a DCG can change over time, and it's good
if we can be completely confident that they're consistent within a given
task execution.
This is unused and most regex experiments have required broader changes
to ensure that regexes are evaluated in a specific order (ex:
traceparent). Removing this for now to simplify the code and very
slightly improve runtime performance.
From some testing (on feedback lists of all different lengths), this
prompt seems to work better. It doesn't write overly long sentences and
also does a better job at "summarizing" versus just mentioning a few
specific topics and leaving out others.
Just remove a couple custom Flex* classes in favor of the Flex primitive
This has been killed a few times.

Refs SENTRY-42M7
…n table (#93892)

<!-- Describe your PR here. -->

[ticket](https://linear.app/getsentry/issue/ID-156/grouping-info-remove-type-field-from-ui)
The Type field in the Grouping Info section of the issue details page
was redundant.
This removes the Type row from all variant types while keeping the
underlying data structure intact.

before
![Screenshot 2025-06-20 at 12 00
54 PM](https://github.com/user-attachments/assets/97ca72da-0a52-4446-9825-cd4fcb505adf)

after
![Screenshot 2025-06-20 at 11 59
29 AM](https://github.com/user-attachments/assets/a4284d2b-c9f5-442f-b010-7fe72a598e39)
### Changes
Related to this PR: getsentry/sentry#93810. This
is part 1 of the change, which is pulling out the new component and just
adding it to the repo. Also includes some simplification of the logic in
the base component.

Part 2 will be replacing tables in widgets.

### Before/After

There is no UI change as the table is not being used yet. There is a new
story page for the component.
…93943)

to prevent this issue from becoming too noisy, add a noise config
Unfortunately, 'event_data' went from being the variable for current
event context to being the complete parsed data from Redis, and we
continued logging it per group.
That's more data than we should be logging even arguably once, let alone
per group.
Co-authored-by: Abdullah Khan <abdullahkhan@PG9Y57YDXQ.local>
Adds some simple analytics to our endpoint so we can begin building a
dashboard in Amplitude.
Previously, explore supported multiple y axis per chart, so each
visualize supported multiple y axis. That functionality has since been
removed for simplicity so update the types here to match. Keep in mind
that saved queries still store them as an array so when
serializing/deserializing, we still need to treat it as an array.
We'll need the `useGetTraceItemAttributeKeys` hook in other places so
refactoring it so that it can exported.
mrduncan and others added 2 commits June 20, 2025 13:20
When the max segment ID is null the process fails. We should exit early
since if there aren't any segments to delete there's nothing to do.
@ShashankFC

Copy link
Copy Markdown
Collaborator Author

@claude

@claude

claude Bot commented Jan 7, 2026

Copy link
Copy Markdown

Claude finished @ShashankFC's task in 2m 19s —— View job


Code Review Complete

I've reviewed this PR for the Replays Self-Serve Bulk Delete System. This is a large PR with 100 files changed (+2312/-981 lines) across multiple areas of the codebase. Below are my findings organized by severity.


Critical Issues

1. Security: Unbounded Thread Pool in Replay Deletion

File: src/sentry/replays/usecases/delete.py:74

with cf.ThreadPoolExecutor(max_workers=100) as pool:
    pool.map(_delete_if_exists, _make_recording_filenames(project_id, row))

Issue: The thread pool creates up to 100 threads per replay deletion. For replays with many segments, this could exhaust system resources. If multiple deletions happen concurrently, this could create thousands of threads.

Recommendation: Limit max_workers to a more reasonable number (e.g., 10-20) and rely on batching multiple replays instead of parallel thread execution per replay.

Fix this →

2. Potential Data Loss: Silent Exception Handling

File: src/sentry/replays/usecases/delete.py:78-83

def _delete_if_exists(filename: str) -> None:
    """Delete the blob if it exists or silence the 404."""
    try:
        storage_kv.delete(filename)
    except NotFound:
        pass

Issue: This function silences ALL exceptions from storage_kv.delete(), not just NotFound. Other failures (permissions, network errors, etc.) will be silently ignored, potentially leaving orphaned data.

Recommendation: Be more specific with exception handling:

def _delete_if_exists(filename: str) -> None:
    try:
        storage_kv.delete(filename)
    except NotFound:
        pass  # File doesn't exist, which is fine
    except Exception as e:
        logger.warning("Failed to delete replay segment", extra={"filename": filename, "error": str(e)})
        sentry_sdk.capture_exception(e)

Fix this →

3. TODO in Production Code with Detector Logic

File: src/sentry/workflow_engine/processors/delayed_workflow.py:437

if (
    evaluation.logic_result and workflow_id is None
):  # TODO: detector trigger passes. do something like create issue
    pass

Issue: This TODO indicates incomplete detector trigger logic. When a detector condition passes but there's no workflow_id, nothing happens. This could be a silent failure case.

Recommendation: Either implement the detector trigger logic or add logging/metrics to track when this condition occurs.

Fix this →


High Priority Issues

4. Performance: N+1 Query Pattern in Delayed Workflow

File: src/sentry/workflow_engine/processors/delayed_workflow.py:168

Organization.objects.filter(project__id=project_id).get().id

Issue: This query is executed inside a retry loop within execute_query. This could be optimized by passing the organization_id directly rather than looking it up each time.

Recommendation: Pass organization_id as a parameter or cache it to avoid repeated queries.

5. XXX Comments Indicating Technical Debt

Multiple XXX comments in the codebase indicate areas needing attention:

File: src/sentry/replays/endpoints/project_replay_summarize_breadcrumbs.py:179

# XXX: I have to deserialize this request so it can be "automatically" reserialized by the
# paginate method. This is less than ideal.
return json.loads(make_seer_request(request_data).decode("utf-8"))

File: src/sentry/replays/endpoints/project_replay_summarize_breadcrumbs.py:185

# XXX: Request isn't streaming. Limitation of Seer authentication. Would be much faster if we
# could stream the request data since the GCS download will (likely) dominate latency.

These indicate performance and architecture concerns that should be tracked.

6. Missing Permission Checks in Replay Endpoints

File: src/sentry/replays/endpoints/project_replay_summarize_breadcrumbs.py:52-65

The endpoint checks for feature flags but doesn't appear to check user permissions beyond basic authentication. Ensure replay access permissions are properly validated.


Medium Priority Issues

7. Browser Reporting Validation Logic Issue

File: src/sentry/issues/endpoints/browser_reporting_collector.py:50-60

def validate_timestamp(self, value: int) -> int:
    """Validate that age is absent, but timestamp is present."""
    if self.initial_data.get("age"):
        raise serializers.ValidationError("If timestamp is present, age must be absent")
    return value

def validate_age(self, value: int) -> int:
    """Validate that age is present, but not timestamp."""
    if self.initial_data.get("timestamp"):
        raise serializers.ValidationError("If age is present, timestamp must be absent")
    return value

Issue: Both validators raise errors if the other field is present, which will result in duplicate error messages when both fields are provided (as shown in test line 168-175). This is confusing for API consumers.

Recommendation: Use a single validate() method instead to handle mutual exclusivity cleanly.

8. Inconsistent Error Handling in Replay Summarization

File: src/sentry/replays/endpoints/project_replay_summarize_breadcrumbs.py:104-123

The fetch_error_details function catches all exceptions but only logs them to Sentry without returning any indication of partial failure. Consider returning a status indicator or logging which specific errors failed to load.

9. Hardcoded Sample Rate

File: src/sentry/issues/endpoints/browser_reporting_collector.py:127

sample_rate=1.0,  # XXX: Remove this once we have a ballpark figure

This 100% sampling rate should be reduced once initial data collection is complete to avoid metric explosion.


Low Priority / Code Quality Issues

10. TODO Comments Throughout Codebase

Multiple TODO comments indicate future work:

  • Line 192, 200 in commit_context.py: Exception handling improvements needed for GitLab
  • Line 227 in commit_context.py: Abstract method implementation for all integrations

11. Deprecated Test File

File: tests/sentry/migrations/test_0917_convert_org_saved_searches_to_views.py (DELETED)

Good - this test file was properly deleted, indicating the migration is now stable.

12. Type Safety in EventInstance Validation

File: src/sentry/workflow_engine/processors/delayed_workflow.py:93-97

The validator for occurrence_id returns None for empty strings, but this could mask data quality issues. Consider raising a validation error instead for malformed data.


Positive Observations

  1. Good Test Coverage: The browser reporting collector has comprehensive test coverage including edge cases (lines 106-176 in test file)
  2. Proper Error Logging: Most error paths include structured logging with appropriate context
  3. Retry Logic: The delayed workflow processor implements proper retry logic with exponential backoff (lines 160-164)
  4. Feature Flags: New functionality is properly gated behind feature flags
  5. Null Safety: The fix for max_segment_id being null (line 89-90 in delete.py) properly handles edge cases

Testing Recommendations

  1. Load Testing: Test replay deletion with high segment counts to validate thread pool behavior
  2. Concurrent Deletion: Test multiple concurrent replay deletions to ensure resource limits are respected
  3. Failure Scenarios: Test storage deletion failures to ensure they're properly logged
  4. Browser Reporting: Monitor the sample_rate=1.0 metric carefully in production

Summary

This PR implements a significant feature for replay deletion but has some critical issues around resource management and error handling that should be addressed before merging. The most important fixes are:

  1. Reduce thread pool size to prevent resource exhaustion
  2. Improve exception handling in storage deletion
  3. Implement or document the detector trigger logic
  4. Review and optimize the N+1 query pattern

The codebase shows good practices overall with feature flags, comprehensive testing, and proper logging, but the technical debt indicated by XXX/TODO comments should be tracked and addressed.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.