Skip to content

ref(crons): Reorganize incident creation / issue occurrence logic#13

Open
ShashankFC wants to merge 1 commit into
monitor-incident-refactor-beforefrom
monitor-incident-refactor-after
Open

ref(crons): Reorganize incident creation / issue occurrence logic#13
ShashankFC wants to merge 1 commit into
monitor-incident-refactor-beforefrom
monitor-incident-refactor-after

Conversation

@ShashankFC

@ShashankFC ShashankFC commented Jan 5, 2026

Copy link
Copy Markdown
Collaborator

Test 8

Summary by CodeRabbit

Release Notes

  • New Features
    • Implemented automatic incident creation and detection for monitor failures, respecting configurable failure thresholds.
    • Enhanced incident context with human-readable failure reason summaries, environment details, and tracking of last successful check-in times to aid in incident diagnosis and resolution.

✏️ Tip: You can customize this high-level summary in your review settings.


Replicated from ai-code-review-evaluation/sentry-coderabbit#8

…0528)

Since we'll be doing more with issue occurrences split out the concept
of incidents into it's own logic module, as well as incident_occurrence
into it's own module

Part of GH-80527
@ShashankFC

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jan 6, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jan 6, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

A new incident occurrence module is introduced alongside refactored threshold-based incident detection logic. The changes establish a workflow for creating, serializing, and emitting monitor incident occurrences to Kafka when monitors fail, with comprehensive failure reason generation and monitor environment context capture.

Changes

Cohort / File(s) Summary
Incident Occurrence Creation
src/sentry/monitors/logic/incident_occurrence.py
New module creating IssueOccurrence objects with metadata (id, fingerprint, type, title, evidence display). Includes functions to construct human-readable failure reasons, serialize monitor environment context, and emit occurrences via Kafka. Defines localized failure status mappings for error/missed/timeout cases.
Incident Threshold Detection
src/sentry/monitors/logic/incidents.py
New try_incident_threshold function determines incident creation/update based on monitor environment status and failure thresholds. Implements branching logic for OK, ACTIVE, and ERROR states; queries prior check-ins; manages status transitions; emits monitor environment signals.
Monitor Failure Refactoring
src/sentry/monitors/logic/mark_failed.py
Delegates incident/threshold logic to try_incident_threshold, removing in-function branching. Retains timing field updates and environment refresh while consolidating incident creation handling.
Type System
src/sentry/monitors/types.py
Introduces SimpleCheckIn TypedDict with id, date\_added, and status fields for check-in data representation.

Sequence Diagram

sequenceDiagram
    actor Monitor as Monitor System
    participant ME as MonitorEnvironment
    participant INC as Incident Handler
    participant KF as Kafka/Occurrence Producer
    
    Monitor->>ME: check_failed_checkin()
    ME-->>Monitor: status check (OK/ACTIVE/ERROR)
    
    alt Status: OK or ACTIVE
        Monitor->>INC: try_incident_threshold()
        activate INC
        INC->>INC: query prior check-ins up to threshold
        INC->>INC: validate check-in statuses
        INC->>ME: update status to ERROR
        INC->>INC: create or retrieve incident
        deactivate INC
    end
    
    alt Status: ERROR
        Monitor->>INC: try_incident_threshold()
        activate INC
        INC->>INC: retrieve active incident
        deactivate INC
    end
    
    alt Incident exists & not muted
        INC->>INC: create_incident_occurrence()
        activate INC
        INC->>INC: build IssueOccurrence (metadata, failure reason, context)
        INC->>KF: produce_occurrence_to_kafka()
        deactivate INC
    end
    
    KF-->>Monitor: ✓ Occurrence emitted
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A monitor fails, we hop into action swift,
With thresholds and status, we carefully sift,
Each incident born from a check-in's demise,
To Kafka it flows—oh, what graceful goodbyes!
New failures detected, the logic runs deep,
Our incident occurrences, cataloged and neat! 📋✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main change: reorganizing incident creation and issue occurrence logic across the codebase into separate modules.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py:
- Around line 130-137: Docstring in get_failure_reason contains a typo "humam" —
change it to "human" in the triple-quoted description inside the
get_failure_reason function so the docstring reads "Builds a human readable
string from a list of failed check-ins." and ensure surrounding examples remain
unchanged.
- Around line 159-171: In get_monitor_environment_context, you copy and modify
config (variable config) but then return monitor_environment.monitor.config;
change the returned "config" field to use the modified local variable config so
the human-readable schedule_type is included; locate the function
get_monitor_environment_context and replace the returned config reference with
the local config variable.
🧹 Nitpick comments (2)
src/sentry/monitors/logic/incidents.py (1)

50-53: Use generator expression instead of list comprehension inside any().

any() short-circuits on the first truthy value, but the list comprehension evaluates all elements first. Use a generator expression for efficiency.

🔎 Proposed fix
-            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
                 return False
src/sentry/monitors/logic/incident_occurrence.py (1)

145-146: Prefer next(iter(...)) over single-element slice.

Creating a list just to access the first element is less efficient than using an iterator directly.

🔎 Proposed fix
     if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+        return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts.keys()))]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fc9f67 and c3b32a7.

📒 Files selected for processing (4)
  • src/sentry/monitors/logic/incident_occurrence.py
  • src/sentry/monitors/logic/incidents.py
  • src/sentry/monitors/logic/mark_failed.py
  • src/sentry/monitors/types.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/sentry/monitors/logic/incidents.py (3)
src/sentry/monitors/logic/incident_occurrence.py (1)
  • create_incident_occurrence (28-113)
src/sentry/monitors/models.py (3)
  • MonitorCheckIn (472-582)
  • MonitorIncident (747-795)
  • MonitorStatus (80-121)
src/sentry/monitors/types.py (1)
  • SimpleCheckIn (105-112)
src/sentry/monitors/logic/incident_occurrence.py (5)
src/sentry/eventstore/models.py (2)
  • datetime (112-120)
  • trace_id (90-94)
src/sentry/monitors/models.py (6)
  • timezone (340-341)
  • CheckInStatus (124-176)
  • MonitorCheckIn (472-582)
  • MonitorIncident (747-795)
  • get_last_successful_checkin (707-712)
  • get_schedule_type_display (343-344)
src/sentry/issues/grouptype.py (1)
  • MonitorIncidentType (528-536)
src/sentry/monitors/types.py (1)
  • SimpleCheckIn (105-112)
src/sentry/issues/producer.py (2)
  • PayloadType (25-34)
  • produce_occurrence_to_kafka (50-90)
src/sentry/monitors/logic/mark_failed.py (2)
src/sentry/monitors/logic/incidents.py (1)
  • try_incident_threshold (14-104)
src/sentry/monitors/models.py (3)
  • CheckInStatus (124-176)
  • MonitorCheckIn (472-582)
  • MonitorEnvironment (644-724)
🪛 Ruff (0.14.10)
src/sentry/monitors/logic/incident_occurrence.py

146-146: Prefer next(iter(status_counts.keys())) over single element slice

Replace with next(iter(status_counts.keys()))

(RUF015)

🔇 Additional comments (4)
src/sentry/monitors/types.py (1)

105-113: LGTM!

The SimpleCheckIn TypedDict cleanly encapsulates the minimal check-in data needed for incident processing, and its fields align with the .values() query in incidents.py.

src/sentry/monitors/logic/mark_failed.py (1)

78-79: Clean refactor separating concerns.

The delegation to try_incident_threshold after updating timing fields and refreshing from DB is a clean separation. This keeps mark_failed focused on monitor environment state updates while incident creation logic is centralized.

src/sentry/monitors/logic/incidents.py (1)

89-104: LGTM!

The occurrence creation logic correctly handles both initial incident creation (multiple occurrences for threshold > 1) and ongoing failures (single occurrence per check-in). The muting check and signal emission are appropriately placed.

src/sentry/monitors/logic/incident_occurrence.py (1)

28-113: LGTM!

The occurrence creation is well-structured with proper null handling, comprehensive evidence display, and appropriate Kafka event data including optional trace context.

Comment on lines +130 to +137
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
Builds a humam readible string from a list of failed check-ins.

"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo: "humam" → "human".

🔎 Proposed fix
 def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
     """
-    Builds a humam readible string from a list of failed check-ins.
+    Builds a human readable string from a list of failed check-ins.
 
     "3 missed check-ins detected"
     "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
     "A failed check-in was detected"
     """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
Builds a humam readible string from a list of failed check-ins.
"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
Builds a human readable string from a list of failed check-ins.
"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""
🤖 Prompt for AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py around lines 130 - 137,
Docstring in get_failure_reason contains a typo "humam" — change it to "human"
in the triple-quoted description inside the get_failure_reason function so the
docstring reads "Builds a human readable string from a list of failed
check-ins." and ensure surrounding examples remain unchanged.

Comment on lines +148 to +156
human_status = get_text_list(
[
"%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
for status, count in status_counts.items()
],
last_word=_("and"),
)

return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Grammar issue: Plural formatting doesn't handle singular counts correctly.

The current format produces strings like "1 missed and 1 error check-ins detected" when counts are 1, which is grammatically incorrect. The docstring example shows "1 timeout check-in" (singular) which the implementation doesn't produce.

Consider including "check-in(s)" per status entry with proper pluralization, or simplifying the docstring to match actual output.

Comment on lines +159 to +171
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: Modified config is not returned.

Lines 160-162 create a copy of config and update schedule_type to a human-readable display value, but line 168 returns the original monitor_environment.monitor.config instead of the modified config variable.

🔎 Proposed fix
 def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
     config = monitor_environment.monitor.config.copy()
     if "schedule_type" in config:
         config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
 
     return {
         "id": str(monitor_environment.monitor.guid),
         "slug": str(monitor_environment.monitor.slug),
         "name": monitor_environment.monitor.name,
-        "config": monitor_environment.monitor.config,
+        "config": config,
         "status": monitor_environment.get_status_display(),
         "type": monitor_environment.monitor.get_type_display(),
     }
🤖 Prompt for AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py around lines 159 - 171, In
get_monitor_environment_context, you copy and modify config (variable config)
but then return monitor_environment.monitor.config; change the returned "config"
field to use the modified local variable config so the human-readable
schedule_type is included; locate the function get_monitor_environment_context
and replace the returned config reference with the local config variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants