ref(crons): Reorganize incident creation / issue occurrence logic by akshayutture-augment · Pull Request #8 · ai-code-review-evaluations/coderabbit-sentry

akshayutture-augment · 2025-11-14T23:42:35Z

Test 8

Summary by CodeRabbit

Refactor
- Enhanced monitor incident handling with improved threshold evaluation and incident occurrence creation logic.
- Streamlined internal architecture for better code maintainability and clarity in incident processing workflows.
- Incidents now include enriched contextual information with localized failure status messaging for improved visibility.

Replicated from ai-code-review-evaluation/sentry-coderabbit#8

…0528) Since we'll be doing more with issue occurrences, split out the concept of incidents into its own logic module, as well as incident_occurrence into its own module. Part of GH-80527

akshayutture-augment · 2025-11-14T23:47:47Z

@coderabbitai review

akshayutture-augment · 2025-11-18T22:28:23Z

@coderabbitai review

coderabbitai · 2025-11-18T22:31:54Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2025-11-18T22:32:14Z

Walkthrough

This pull request introduces a new incident occurrence module that structures monitor failure data into Issues and publishes them to Kafka. It adds a try_incident_threshold function to centralize incident creation logic and refactors the mark_failed module to delegate to this new function. A new SimpleCheckIn type is added to support these changes.

Changes

Cohort / File(s)	Change Summary
Incident Occurrence Logic `src/sentry/monitors/logic/incident_occurrence.py`	New module with `create_incident_occurrence`, `get_failure_reason`, and `get_monitor_environment_context` functions. Builds IssueOccurrence structures with failure context, evidence, and trace information, then publishes to Kafka. Includes human-readable status mappings and failure summarization.
Incidents Threshold Handling `src/sentry/monitors/logic/incidents.py`	New `try_incident_threshold` function evaluates monitor failure thresholds, manages monitor environment status transitions to ERROR, creates or retrieves MonitorIncident objects, and orchestrates occurrence creation. Branches on monitor status and respects muting rules.
Mark Failed Refactor `src/sentry/monitors/logic/mark_failed.py`	Simplified to delegate incident threshold logic to `try_incident_threshold`. Removed internal logic for check-in analysis, state mutation, incident creation, and occurrence handling. Now serves as thin wrapper.
Type Definitions `src/sentry/monitors/types.py`	Added `SimpleCheckIn` TypedDict with fields: `id` (int), `date_added` (datetime), `status` (int).

Sequence Diagram

sequenceDiagram
    participant mark_failed as mark_failed()
    participant try_threshold as try_incident_threshold()
    participant incidents as MonitorIncident<br/>(fetch/create)
    participant occurrence as create_incident_occurrence()
    participant kafka as Kafka<br/>(produce_occurrence)

    mark_failed->>try_threshold: failed_checkin,<br/>threshold, received
    
    try_threshold->>try_threshold: fetch prior<br/>check-ins
    
    alt Monitor status == OK
        try_threshold->>try_threshold: evaluate<br/>threshold
        try_threshold->>incidents: create/retrieve<br/>incident
    end
    
    alt Not muted & incident exists
        try_threshold->>occurrence: create_incident_occurrence<br/>(failed_checkins,<br/>incident, ...)
        occurrence->>occurrence: build IssueOccurrence<br/>+ event_data
        occurrence->>kafka: publish to Kafka<br/>(PayloadType.OCCURRENCE)
        kafka-->>occurrence: ✓
    end
    
    try_threshold->>try_threshold: emit<br/>monitor_environment_failed
    try_threshold-->>mark_failed: return bool
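
The flow in the diagram can be approximated with a small, self-contained sketch. Everything here is a hypothetical stand-in (the `Env` dataclass, the string statuses, the `published` list that replaces the Kafka producer); it is not the code from the PR, and the `monitor_environment_failed` signal is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Env:
    """Hypothetical stand-in for a monitor environment."""

    status: str = "ok"
    muted: bool = False
    incident: Optional[str] = None
    published: list = field(default_factory=list)


def try_incident_threshold(env: Env, recent_statuses: list, threshold: int) -> bool:
    """Toy version of the diagram's flow, not the Sentry implementation."""
    if env.status == "ok":
        # Evaluate the threshold over the most recent check-ins.
        window = recent_statuses[-threshold:]
        if "ok" in window:
            return False  # a success inside the window: no incident yet
        env.status = "error"
        env.incident = "incident-1"  # stands in for create/retrieve incident
    elif env.status != "error":
        return False  # short-circuit for any other status

    if not env.muted and env.incident is not None:
        # Stands in for create_incident_occurrence plus the Kafka publish.
        env.published.append(f"occurrence for {env.incident}")
    return True


def mark_failed(env: Env, recent_statuses: list, threshold: int) -> bool:
    """Thin wrapper that only delegates, mirroring the refactor's shape."""
    return try_incident_threshold(env, recent_statuses, threshold)
```

In this sketch a muted environment still transitions to the error state and returns True; only the occurrence publish is skipped.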

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Key areas requiring attention:
- incident_occurrence.py: Dense logic around IssueOccurrence construction, context assembly, and Kafka payload formation; verify trace_id derivation and event_data structuring
- try_incident_threshold in incidents.py: Control flow branching on monitor status; verify threshold evaluation logic and incident state transitions
- mark_failed.py refactor: Confirm that delegating to try_incident_threshold preserves original behavior and error handling semantics
- Integration between modules: Ensure proper data flow from try_incident_threshold to create_incident_occurrence

Poem

🐰 When monitors fail with a solemn beep,
Occurrences rise from the data heap,
Kafka streams carry the news so true,
Incidents tracked through and through! 🎯

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'ref(crons): Reorganize incident creation / issue occurrence logic' clearly and concisely describes the main refactoring objective of reorganizing incident and occurrence creation logic across multiple modules.

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fc9f67 and c3b32a7.

📒 Files selected for processing (4)

src/sentry/monitors/logic/incident_occurrence.py (1 hunks)
src/sentry/monitors/logic/incidents.py (1 hunks)
src/sentry/monitors/logic/mark_failed.py (2 hunks)
src/sentry/monitors/types.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.14.5)

src/sentry/monitors/logic/incident_occurrence.py

146-146: Prefer next(iter(status_counts.keys())) over single element slice

Replace with next(iter(status_counts.keys()))

(RUF015)

🔇 Additional comments (3)

src/sentry/monitors/types.py (1)

105-112: SimpleCheckIn shape looks appropriate and consistent

The stripped-down check-in representation matches how it’s consumed (id, date_added, status) and keeps coupling minimal. No issues from a typing or usage perspective.
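
As an illustration of the shape described here, the following sketch declares a TypedDict with the three fields named in the change summary. The `to_simple` helper and the sample row are hypothetical and only show how a wider row could be narrowed to that shape.

```python
from datetime import datetime, timezone
from typing import TypedDict


class SimpleCheckIn(TypedDict):
    """Stripped-down check-in with only the fields named in the summary."""

    id: int
    date_added: datetime
    status: int


def to_simple(row: dict) -> SimpleCheckIn:
    # Hypothetical helper: keep only the three typed fields of a wider row.
    return {"id": row["id"], "date_added": row["date_added"], "status": row["status"]}


# Illustrative row with an extra key that the narrow type drops.
row = {
    "id": 7,
    "date_added": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "status": 2,
    "duration": 1500,
}
simple = to_simple(row)
```

A TypedDict is a plain dict at runtime, so this matches rows produced by a `.values("id", "date_added", "status")` query without any conversion cost.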

src/sentry/monitors/logic/mark_failed.py (1)

78-79: mark_failed now mirrors try_incident_threshold’s boolean semantics

The refactor cleanly delegates all incident/occurrence logic to try_incident_threshold, which is good for cohesion. However, mark_failed now returns False whenever try_incident_threshold short-circuits (e.g., monitor env status not OK/ACTIVE/ERROR), which might differ from previous behavior if callers only cared about whether the env timestamps were updated.

Please double-check call sites to confirm that mark_failed’s return value is still interpreted as “incident/side-effects were processed” rather than “timestamps updated,” and adjust docstrings or callers if needed.

src/sentry/monitors/logic/incident_occurrence.py (1)

28-113: create_incident_occurrence wiring and payload construction look solid

The occurrence construction and Kafka publish path are cohesive:

Guarding on monitor_env is None is correct.

Enriching evidence with failure reason, environment, and last successful check-in is clear.

Trace context is conditionally added and tagged via monitor.id/slug/incident, which matches the expected monitor incident story.

I don’t see functional issues in this block as written.

coderabbitai · 2025-11-18T22:35:10Z

+def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
+    """
+    Builds a humam readible string from a list of failed check-ins.
+
+    "3 missed check-ins detected"
+    "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
+    "A failed check-in was detected"
+    """
+
+    status_counts = Counter(
+        checkin["status"]
+        for checkin in failed_checkins
+        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
+    )
+
+    if sum(status_counts.values()) == 1:
+        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+
+    human_status = get_text_list(
+        [
+            "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
+            for status, count in status_counts.items()
+        ],
+        last_word=_("and"),
+    )
+
+    return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}


⚠️ Potential issue | 🟡 Minor

Small cleanups in get_failure_reason (RUF015 + minor polish)

Two small improvements here:

Key selection for the singular case: Ruff is right that list(status_counts.keys())[0] is wasteful; you can avoid materializing a list:

-    if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+    if sum(status_counts.values()) == 1:
+        key = next(iter(status_counts.keys()))
+        return SINGULAR_HUMAN_FAILURE_MAP[key]

Docstring typo (optional): “humam readible” → “human readable” for clarity.

These changes are non-functional but bring the function in line with style and static-analysis guidance.
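
A minimal, runnable sketch of the singular-case key selection follows. The status codes and messages are hypothetical placeholders, not the maps defined in the PR; the only point is that `next(iter(counts))` yields the single key without building a list.

```python
from collections import Counter

# Hypothetical status codes and messages; the real maps live in the PR's module.
MISSED, TIMEOUT = 1, 2
SINGULAR = {
    MISSED: "A missed check-in was detected",
    TIMEOUT: "A timeout check-in was detected",
}


def singular_reason(statuses):
    """Return the singular message when exactly one known failure is present."""
    counts = Counter(s for s in statuses if s in SINGULAR)
    if sum(counts.values()) != 1:
        return None
    # Iterating a Counter yields its keys, so no intermediate list is needed.
    key = next(iter(counts))
    return SINGULAR[key]
```

Because the sum of counts is checked first, the Counter is guaranteed to hold exactly one key at the point where `next` is called, so `StopIteration` cannot occur there.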

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
     """
     Builds a humam readible string from a list of failed check-ins.

     "3 missed check-ins detected"
     "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
     "A failed check-in was detected"
     """

     status_counts = Counter(
         checkin["status"]
         for checkin in failed_checkins
         if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
     )

     if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+        key = next(iter(status_counts.keys()))
+        return SINGULAR_HUMAN_FAILURE_MAP[key]

     human_status = get_text_list(
         [
             "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
             for status, count in status_counts.items()
         ],
         last_word=_("and"),
     )

     return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}

🤖 Prompt for AI Agents

In src/sentry/monitors/logic/incident_occurrence.py around lines 130 to 156, the singular-case key extraction is wastefully creating a list with list(status_counts.keys())[0] and the docstring contains a typo "humam readible"; change the singular-case selection to use an iterator (e.g. next(iter(status_counts))) to avoid materializing a list, and fix the docstring text to "human readable".

coderabbitai · 2025-11-18T22:35:10Z

+def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
+    config = monitor_environment.monitor.config.copy()
+    if "schedule_type" in config:
+        config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
+
+    return {
+        "id": str(monitor_environment.monitor.guid),
+        "slug": str(monitor_environment.monitor.slug),
+        "name": monitor_environment.monitor.name,
+        "config": monitor_environment.monitor.config,
+        "status": monitor_environment.get_status_display(),
+        "type": monitor_environment.monitor.get_type_display(),
+    }


⚠️ Potential issue | 🟠 Major

Bug: schedule_type override in context is never applied

get_monitor_environment_context builds a config copy and conditionally normalizes schedule_type, but then returns the original monitor_environment.monitor.config instead of the modified copy:

config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
    config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
    # ...
    "config": monitor_environment.monitor.config,  # ignores modified `config`
}

As written, the schedule type in the returned context will remain the raw stored value, not the human-readable display value.

You likely meant to return the adjusted config:

 def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
     config = monitor_environment.monitor.config.copy()
     if "schedule_type" in config:
         config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

     return {
         "id": str(monitor_environment.monitor.guid),
         "slug": str(monitor_environment.monitor.slug),
         "name": monitor_environment.monitor.name,
-        "config": monitor_environment.monitor.config,
+        "config": config,
         "status": monitor_environment.get_status_display(),
         "type": monitor_environment.monitor.get_type_display(),
     }

This keeps the original config immutable and ensures consumers see the normalized schedule type.
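
The copy-then-return pattern can be checked in isolation with plain dicts. This is a sketch under assumptions: `build_context`, its parameters, and the sample values are hypothetical and stand in for the model methods used in the PR.

```python
def build_context(stored_config: dict, schedule_type_display: str) -> dict:
    """Return a context whose config carries the display value for schedule_type."""
    config = stored_config.copy()  # shallow copy: top-level keys are independent
    if "schedule_type" in config:
        config["schedule_type"] = schedule_type_display
    # Return the adjusted copy, not the stored config.
    return {"config": config}


# Illustrative values only.
stored = {"schedule_type": 1, "schedule": "* * * * *"}
context = build_context(stored, "crontab")
```

Note that `dict.copy()` is shallow, so any nested mutable values would still be shared with the stored config; that is harmless here because only a top-level key is reassigned.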

🤖 Prompt for AI Agents

In src/sentry/monitors/logic/incident_occurrence.py around lines 159 to 171, the function modifies a local copy named `config` to normalize `schedule_type` but then returns the original `monitor_environment.monitor.config`, so the normalized value is never used; change the returned dictionary to use the modified `config` variable instead of `monitor_environment.monitor.config` so the display-friendly `schedule_type` is included while keeping the original config immutable.

coderabbitai · 2025-11-18T22:35:10Z

+    # check to see if we need to update the status
+    if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
+        if failure_issue_threshold == 1:
+            previous_checkins: list[SimpleCheckIn] = [
+                {
+                    "id": failed_checkin.id,
+                    "date_added": failed_checkin.date_added,
+                    "status": failed_checkin.status,
+                }
+            ]
+        else:
+            previous_checkins = cast(
+                list[SimpleCheckIn],
+                # Using .values for performance reasons
+                MonitorCheckIn.objects.filter(
+                    monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
+                )
+                .order_by("-date_added")
+                .values("id", "date_added", "status"),
+            )
+
+            # reverse the list after slicing in order to start with oldest check-in
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            # If we have any successful check-ins within the threshold of
+            # commits we have NOT reached an incident state
+            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+                return False
+
+        # change monitor status + update fingerprint timestamp


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Incident threshold can trigger early when there are fewer failures than failure_issue_threshold

The logic for failure_issue_threshold > 1 slices and reverses recent check-ins but never verifies that you actually have at least failure_issue_threshold check-ins before treating the threshold as reached:

previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
# ...
if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
    return False
# otherwise we proceed to ERROR + incident creation

If the monitor has fewer than failure_issue_threshold total check-ins, all of which are failures, this will still promote the env to ERROR and create an incident, effectively ignoring the configured threshold.

If the intended semantics are “require N consecutive failing check-ins,” you likely want an explicit length guard, e.g.:

-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
-
-            # If we have any successful check-ins within the threshold of
-            # commits we have NOT reached an incident state
-            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
-                return False
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            # If we haven't yet hit the configured number of check-ins, bail out
+            if len(previous_checkins) < failure_issue_threshold:
+                return False
+
+            # If we have any successful check-ins within the threshold window,
+            # we have NOT reached an incident state
+            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
+                return False

Please confirm the intended semantics and adjust accordingly.
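
To make the edge case concrete, here is a small sketch of the check with the length guard, assuming the intended semantics really are "N consecutive failing check-ins". The function name and the string statuses are hypothetical; the input is ordered newest first, like the query in the diff.

```python
OK, ERROR, MISSED = "ok", "error", "missed"  # hypothetical status values


def threshold_reached(statuses_newest_first, threshold):
    """True only when the newest `threshold` check-ins exist and none is OK."""
    window = statuses_newest_first[:threshold]
    # Length guard: slicing silently returns fewer items when history is short.
    if len(window) < threshold:
        return False
    return all(status != OK for status in window)
```

Without the length guard, three failures against a threshold of eight would pass the "no OK in window" test and be treated as a breach.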

🏁 Script executed:

#!/bin/bash
# Extract the full try_incident_threshold function to understand complete context
rg -A 100 'def try_incident_threshold' src/sentry/monitors/logic/incidents.py -n

Length of output: 3681

🏁 Script executed:

#!/bin/bash
# Look for tests related to this function
fd -e py -path '*/test*' | xargs rg -l 'try_incident_threshold' 2>/dev/null | head -5

Length of output: 248

🏁 Script executed:

#!/bin/bash
# Search for where try_incident_threshold is called to understand usage
rg -n 'try_incident_threshold' --type py -B2 -A2 | head -50

Length of output: 966

🏁 Script executed:

#!/bin/bash
# Check the MonitorStatus and CheckInStatus enums to understand the status types
rg -A 10 'class MonitorStatus|class CheckInStatus' --type py | head -40

Length of output: 1293

🏁 Script executed:

#!/bin/bash
# Look for any validation or length checks on previous_checkins or similar patterns
rg -n 'len\(previous_checkins\)|failure_issue_threshold' src/sentry/monitors/logic/incidents.py -B2 -A2

Length of output: 678

🏁 Script executed:

#!/bin/bash
# Search for incident creation to understand the downstream impact
rg -n 'incident_issue|starting_checkin' src/sentry/monitors/logic/incidents.py -B3 -A3 | head -60

Length of output: 566

🏁 Script executed:

#!/bin/bash
# Look at the full file structure to understand the broader context
head -100 src/sentry/monitors/logic/incidents.py

Length of output: 3662

🏁 Script executed:

#!/bin/bash
# Check for any related monitor environment update logic
rg -n 'monitor_env.status.*ERROR|MonitorStatus.ERROR' src/sentry/monitors/logic/incidents.py -B2 -A2

Length of output: 369

🏁 Script executed:

#!/bin/bash
# Check database/ORM query to understand what .values returns
rg -n 'SimpleCheckIn' src/sentry/monitors/logic/incidents.py -B3 -A3

Length of output: 973

🏁 Script executed:

#!/bin/bash
# Look for type hints or definitions of SimpleCheckIn
fd -e py | xargs rg -l 'SimpleCheckIn' | head -10

Length of output: 116

🏁 Script executed:

#!/bin/bash
# Check if there are integration tests that exercise the threshold logic
fd -e py -path '*/test*monitor*' | xargs rg -l 'failure_issue_threshold' 2>/dev/null

Length of output: 643

🏁 Script executed:

#!/bin/bash
# Search for guard clauses related to check-in counts
rg -n 'if.*len.*check' src/sentry/monitors/logic/incidents.py -B1 -A1

Length of output: 0

🏁 Script executed:

#!/bin/bash
# Look at the logic flow around the threshold logic more carefully
sed -n '26,80p' src/sentry/monitors/logic/incidents.py

Length of output: 2269

🏁 Script executed:

#!/bin/bash
# Check the test file to understand intended behavior
cat tests/sentry/monitors/logic/test_mark_failed.py | head -150

Length of output: 5417

🏁 Script executed:

#!/bin/bash
# Search for test cases specifically about failure_issue_threshold
rg -n 'failure_issue_threshold' tests/sentry/monitors/logic/test_mark_failed.py -B5 -A10

Length of output: 9843

🏁 Script executed:

#!/bin/bash
# Check mark_failed.py to understand how threshold is determined and used
cat src/sentry/monitors/logic/mark_failed.py

Length of output: 2882

🏁 Script executed:

#!/bin/bash
# Look for any documentation or comments about what the threshold means
rg -n 'failure_issue_threshold|failure.*threshold' src/sentry/monitors/ -B2 -A2 | head -80

Length of output: 3624

Add explicit threshold validation to prevent early incident creation

The code lacks a length check before treating insufficient check-ins as a threshold breach. If a monitor has fewer total failures than failure_issue_threshold (with no intervening OK check-ins), the incident triggers prematurely.

The test passes only because it creates an initial OK check-in that eventually gets pushed outside the slice window. An uncovered edge case: threshold=8 with only 3 actual failures triggers an incident anyway.

Add the length guard:

previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))

# Ensure we have accumulated enough failures to meet the threshold
if len(previous_checkins) < failure_issue_threshold:
    return False

# If we have any successful check-ins within the threshold window,
# we have NOT reached an incident state
if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
    return False

🤖 Prompt for AI Agents

In src/sentry/monitors/logic/incidents.py around lines 26 to 55, the logic slices and reverses previous_checkins but does not verify that the number of returned check-ins actually meets failure_issue_threshold, which can cause incidents to be created when there are fewer failures than the threshold; modify the flow so after reversing and slicing previous_checkins you check if len(previous_checkins) < failure_issue_threshold and return False if so, and only then proceed to check for any CheckInStatus.OK entries (return False if any exist) before changing the monitor status.

ref(crons): Reorganize incident creation / issue occurrence logic (#8…

c3b32a7

…0528) Since we'll be doing more with issue occurrences, split out the concept of incidents into its own logic module, as well as incident_occurrence into its own module. Part of GH-80527

ai-code-review-evaluations deleted a comment from coderabbitai Bot Nov 18, 2025

coderabbitai Bot reviewed Nov 18, 2025

ref(crons): Reorganize incident creation / issue occurrence logic #8
akshayutture-augment wants to merge 1 commit into monitor-incident-refactor-before from monitor-incident-refactor-after

2 participants
