ref(crons): Reorganize incident creation / issue occurrence logic #8

Open
akshayutture-augment wants to merge 1 commit into monitor-incident-refactor-before from monitor-incident-refactor-after

Conversation

@akshayutture-augment

@akshayutture-augment akshayutture-augment commented Nov 14, 2025


Test 8

Summary by CodeRabbit

  • Refactor
    • Enhanced monitor incident handling with improved threshold evaluation and incident occurrence creation logic.
    • Streamlined internal architecture for better code maintainability and clarity in incident processing workflows.
    • Incidents now include enriched contextual information with localized failure status messaging for improved visibility.

Replicated from ai-code-review-evaluation/sentry-coderabbit#8

…0528)

Since we'll be doing more with issue occurrences, split out the concept
of incidents into its own logic module, as well as incident_occurrence
into its own module

Part of GH-80527
@akshayutture-augment

Author

@coderabbitai review

@akshayutture-augment

Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Nov 18, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Nov 18, 2025


Walkthrough

This pull request introduces a new incident occurrence module that structures monitor failure data into Issues and publishes them to Kafka. It adds a try_incident_threshold function to centralize incident creation logic and refactors the mark_failed module to delegate to this new function. A new SimpleCheckIn type is added to support these changes.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Incident Occurrence Logic<br/>src/sentry/monitors/logic/incident_occurrence.py | New module with create_incident_occurrence, get_failure_reason, and get_monitor_environment_context functions. Builds IssueOccurrence structures with failure context, evidence, and trace information, then publishes to Kafka. Includes human-readable status mappings and failure summarization. |
| Incidents Threshold Handling<br/>src/sentry/monitors/logic/incidents.py | New try_incident_threshold function evaluates monitor failure thresholds, manages monitor environment status transitions to ERROR, creates or retrieves MonitorIncident objects, and orchestrates occurrence creation. Branches on monitor status and respects muting rules. |
| Mark Failed Refactor<br/>src/sentry/monitors/logic/mark_failed.py | Simplified to delegate incident threshold logic to try_incident_threshold. Removed internal logic for check-in analysis, state mutation, incident creation, and occurrence handling. Now serves as thin wrapper. |
| Type Definitions<br/>src/sentry/monitors/types.py | Added SimpleCheckIn TypedDict with fields: id (int), date_added (datetime), status (int). |

Sequence Diagram

sequenceDiagram
    participant mark_failed as mark_failed()
    participant try_threshold as try_incident_threshold()
    participant incidents as MonitorIncident<br/>(fetch/create)
    participant occurrence as create_incident_occurrence()
    participant kafka as Kafka<br/>(produce_occurrence)

    mark_failed->>try_threshold: failed_checkin,<br/>threshold, received
    
    try_threshold->>try_threshold: fetch prior<br/>check-ins
    
    alt Monitor status == OK
        try_threshold->>try_threshold: evaluate<br/>threshold
        try_threshold->>incidents: create/retrieve<br/>incident
    end
    
    alt Not muted & incident exists
        try_threshold->>occurrence: create_incident_occurrence<br/>(failed_checkins,<br/>incident, ...)
        occurrence->>occurrence: build IssueOccurrence<br/>+ event_data
        occurrence->>kafka: publish to Kafka<br/>(PayloadType.OCCURRENCE)
        kafka-->>occurrence: ✓
    end
    
    try_threshold->>try_threshold: emit<br/>monitor_environment_failed
    try_threshold-->>mark_failed: return bool
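The control flow in the diagram above can be sketched roughly as follows. This is a simplified illustration only, not the code from the PR: `EnvSketch`, `try_incident_threshold_sketch`, the string statuses, and the `log` list are hypothetical stand-ins for the monitor environment model, the real function, the status enums, and the Kafka/signal side effects. Branches the diagram does not show are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnvSketch:
    """Hypothetical stand-in for a monitor environment; not the Sentry model."""

    status: str = "ok"
    is_muted: bool = False
    incident: Optional[str] = None
    log: List[str] = field(default_factory=list)


def try_incident_threshold_sketch(env: EnvSketch, recent: List[str], threshold: int) -> bool:
    # `recent` holds check-in statuses, newest first, including the failing one.
    if env.status == "ok":
        window = recent[:threshold]
        if any(status == "ok" for status in window):
            return False  # threshold not reached; no side effects
        env.status = "error"
        env.incident = "incident-1"  # stands in for create/retrieve incident

    if not env.is_muted and env.incident is not None:
        env.log.append("occurrence")  # stands in for create_incident_occurrence

    env.log.append("monitor_environment_failed")  # stands in for the emitted signal
    return True
```

In this sketch a muted environment still transitions to the error state and emits the failure signal, but records no occurrence, which matches the "Not muted & incident exists" branch of the diagram.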

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Key areas requiring attention:
    • incident_occurrence.py: Dense logic around IssueOccurrence construction, context assembly, and Kafka payload formation; verify trace_id derivation and event_data structuring
    • try_incident_threshold in incidents.py: Control flow branching on monitor status; verify threshold evaluation logic and incident state transitions
    • mark_failed.py refactor: Confirm that delegating to try_incident_threshold preserves original behavior and error handling semantics
    • Integration between modules: Ensure proper data flow from try_incident_threshold to create_incident_occurrence

Poem

🐰 When monitors fail with a solemn beep,
Occurrences rise from the data heap,
Kafka streams carry the news so true,
Incidents tracked through and through! 🎯

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ref(crons): Reorganize incident creation / issue occurrence logic' clearly and concisely describes the main refactoring objective of reorganizing incident and occurrence creation logic across multiple modules. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fc9f67 and c3b32a7.

📒 Files selected for processing (4)
  • src/sentry/monitors/logic/incident_occurrence.py (1 hunks)
  • src/sentry/monitors/logic/incidents.py (1 hunks)
  • src/sentry/monitors/logic/mark_failed.py (2 hunks)
  • src/sentry/monitors/types.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.5)
src/sentry/monitors/logic/incident_occurrence.py

146-146: Prefer next(iter(status_counts.keys())) over single element slice

Replace with next(iter(status_counts.keys()))

(RUF015)

🔇 Additional comments (3)
src/sentry/monitors/types.py (1)

105-112: SimpleCheckIn shape looks appropriate and consistent

The stripped-down check-in representation matches how it’s consumed (id, date_added, status) and keeps coupling minimal. No issues from a typing or usage perspective.
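For reference, a TypedDict with the three fields named in the change summary might look like the sketch below. The field names and types come from the summary table; the docstring and the sample value (including the status integer) are illustrative assumptions, not taken from the PR.

```python
from datetime import datetime, timezone
from typing import TypedDict


class SimpleCheckIn(TypedDict):
    """Stripped-down check-in with only the fields the review summary lists."""

    id: int
    date_added: datetime
    status: int


# Hypothetical sample value; the status integer is illustrative only.
example: SimpleCheckIn = {
    "id": 1,
    "date_added": datetime(2025, 11, 14, tzinfo=timezone.utc),
    "status": 2,
}
```

Because a TypedDict is a plain dict at runtime, rows returned by a `.values("id", "date_added", "status")` query can be treated as this shape without constructing model instances.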

src/sentry/monitors/logic/mark_failed.py (1)

78-79: mark_failed now mirrors try_incident_threshold’s boolean semantics

The refactor cleanly delegates all incident/occurrence logic to try_incident_threshold, which is good for cohesion. However, mark_failed now returns False whenever try_incident_threshold short-circuits (e.g., monitor env status not OK/ACTIVE/ERROR), which might differ from previous behavior if callers only cared about whether the env timestamps were updated.

Please double-check call sites to confirm that mark_failed’s return value is still interpreted as “incident/side-effects were processed” rather than “timestamps updated,” and adjust docstrings or callers if needed.

src/sentry/monitors/logic/incident_occurrence.py (1)

28-113: create_incident_occurrence wiring and payload construction look solid

The occurrence construction and Kafka publish path are cohesive:

  • Guarding on monitor_env is None is correct.
  • Enriching evidence with failure reason, environment, and last successful check-in is clear.
  • Trace context is conditionally added and tagged via monitor.id/slug/incident, which matches the expected monitor incident story.

I don’t see functional issues in this block as written.

Comment on lines +130 to +156
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
    """
    Builds a humam readible string from a list of failed check-ins.

    "3 missed check-ins detected"
    "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
    "A failed check-in was detected"
    """

    status_counts = Counter(
        checkin["status"]
        for checkin in failed_checkins
        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
    )

    if sum(status_counts.values()) == 1:
        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]

    human_status = get_text_list(
        [
            "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
            for status, count in status_counts.items()
        ],
        last_word=_("and"),
    )

    return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}


⚠️ Potential issue | 🟡 Minor

Small cleanups in get_failure_reason (RUF015 + minor polish)

Two small improvements here:

  1. Key selection for the singular case: Ruff is right that list(status_counts.keys())[0] is wasteful; you can avoid materializing a list:
-    if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+    if sum(status_counts.values()) == 1:
+        key = next(iter(status_counts.keys()))
+        return SINGULAR_HUMAN_FAILURE_MAP[key]
  2. Docstring typo (optional): “humam readible” → “human readable” for clarity.

These changes are non-functional but bring the function in line with style and static-analysis guidance.
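The singular-case lookup can be illustrated in isolation. The sketch below is self-contained and uses hypothetical stand-ins: the string statuses, `SINGULAR_MESSAGES`, and `singular_reason` are not the module's real constants or function, and only the "failed check-in" message appears in the original docstring.

```python
from collections import Counter

# Hypothetical stand-ins for the module's status constants and lookup map.
MISSED, TIMEOUT, ERROR = "missed", "timeout", "error"
SINGULAR_MESSAGES = {
    MISSED: "A missed check-in was detected",
    TIMEOUT: "A timeout check-in was detected",
    ERROR: "A failed check-in was detected",
}


def singular_reason(statuses):
    """Return the singular message when exactly one failing check-in is counted."""
    counts = Counter(s for s in statuses if s in SINGULAR_MESSAGES)
    if sum(counts.values()) != 1:
        return None
    # Exactly one key exists here, so take it without building a list.
    key = next(iter(counts))
    return SINGULAR_MESSAGES[key]
```

Iterating the Counter directly yields its keys, so `next(iter(counts))` is equivalent to `next(iter(counts.keys()))` in this case.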

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
     """
     Builds a humam readible string from a list of failed check-ins.
     "3 missed check-ins detected"
     "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
     "A failed check-in was detected"
     """
     status_counts = Counter(
         checkin["status"]
         for checkin in failed_checkins
         if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
     )
     if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+        key = next(iter(status_counts.keys()))
+        return SINGULAR_HUMAN_FAILURE_MAP[key]
     human_status = get_text_list(
         [
             "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
             for status, count in status_counts.items()
         ],
         last_word=_("and"),
     )
     return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}

🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py around lines 130 to 156, the
singular-case key extraction is wastefully creating a list with
list(status_counts.keys())[0] and the docstring contains a typo "humam
readible"; change the singular-case selection to use an iterator (e.g.
next(iter(status_counts))) to avoid materializing a list, and fix the docstring
text to "human readable".

Comment on lines +159 to +171
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
    config = monitor_environment.monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

    return {
        "id": str(monitor_environment.monitor.guid),
        "slug": str(monitor_environment.monitor.slug),
        "name": monitor_environment.monitor.name,
        "config": monitor_environment.monitor.config,
        "status": monitor_environment.get_status_display(),
        "type": monitor_environment.monitor.get_type_display(),
    }


⚠️ Potential issue | 🟠 Major

Bug: schedule_type override in context is never applied

get_monitor_environment_context builds a config copy and conditionally normalizes schedule_type, but then returns the original monitor_environment.monitor.config instead of the modified copy:

config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
    config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
    # ...
    "config": monitor_environment.monitor.config,  # ignores modified `config`
}

As written, the schedule type in the returned context will remain the raw stored value, not the human-readable display value.

You likely meant to return the adjusted config:

 def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
     config = monitor_environment.monitor.config.copy()
     if "schedule_type" in config:
         config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
 
     return {
         "id": str(monitor_environment.monitor.guid),
         "slug": str(monitor_environment.monitor.slug),
         "name": monitor_environment.monitor.name,
-        "config": monitor_environment.monitor.config,
+        "config": config,
         "status": monitor_environment.get_status_display(),
         "type": monitor_environment.monitor.get_type_display(),
     }

This keeps the original config immutable and ensures consumers see the normalized schedule type.
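The effect of returning the wrong variable can be reproduced with plain dictionaries. The sketch below is a minimal illustration, assuming a stored config whose schedule_type is a raw integer and a display value of "crontab"; `build_context_buggy` and `build_context_fixed` are hypothetical helper names, not functions from the PR.

```python
def build_context_buggy(stored_config: dict, schedule_type_display: str) -> dict:
    """Normalizes a copy, then returns the original by mistake."""
    config = stored_config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = schedule_type_display
    return {"config": stored_config}  # the normalized copy is discarded


def build_context_fixed(stored_config: dict, schedule_type_display: str) -> dict:
    """Normalizes a copy and returns that copy."""
    config = stored_config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = schedule_type_display
    return {"config": config}


# Assumed sample data: the raw value 1 and the label "crontab" are illustrative.
stored = {"schedule_type": 1, "schedule": "* * * * *"}
buggy = build_context_buggy(stored, "crontab")
fixed = build_context_fixed(stored, "crontab")
```

Here `buggy["config"]["schedule_type"]` is still the raw value, while `fixed["config"]["schedule_type"]` carries the display label and `stored` is left unmodified in both cases.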

🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py around lines 159 to 171, the
function modifies a local copy named `config` to normalize `schedule_type` but
then returns the original `monitor_environment.monitor.config`, so the
normalized value is never used; change the returned dictionary to use the
modified `config` variable instead of `monitor_environment.monitor.config` so
the display-friendly `schedule_type` is included while keeping the original
config immutable.

Comment on lines +26 to +55
# check to see if we need to update the status
if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
    if failure_issue_threshold == 1:
        previous_checkins: list[SimpleCheckIn] = [
            {
                "id": failed_checkin.id,
                "date_added": failed_checkin.date_added,
                "status": failed_checkin.status,
            }
        ]
    else:
        previous_checkins = cast(
            list[SimpleCheckIn],
            # Using .values for performance reasons
            MonitorCheckIn.objects.filter(
                monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
            )
            .order_by("-date_added")
            .values("id", "date_added", "status"),
        )

        # reverse the list after slicing in order to start with oldest check-in
        previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))

        # If we have any successful check-ins within the threshold of
        # commits we have NOT reached an incident state
        if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
            return False

    # change monitor status + update fingerprint timestamp


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Incident threshold can trigger early when there are fewer failures than failure_issue_threshold

The logic for failure_issue_threshold > 1 slices and reverses recent check-ins but never verifies that you actually have at least failure_issue_threshold check-ins before treating the threshold as reached:

previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
# ...
if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
    return False
# otherwise we proceed to ERROR + incident creation

If the monitor has fewer than failure_issue_threshold total check-ins, all of which are failures, this will still promote the env to ERROR and create an incident, effectively ignoring the configured threshold.

If the intended semantics are “require N consecutive failing check-ins,” you likely want an explicit length guard, e.g.:

-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
-
-            # If we have any successful check-ins within the threshold of
-            # commits we have NOT reached an incident state
-            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
-                return False
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            # If we haven't yet hit the configured number of check-ins, bail out
+            if len(previous_checkins) < failure_issue_threshold:
+                return False
+
+            # If we have any successful check-ins within the threshold window,
+            # we have NOT reached an incident state
+            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
+                return False

Please confirm the intended semantics and adjust accordingly.
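The difference the length guard makes can be shown with a small standalone function. This is a sketch under the "require N consecutive failing check-ins" reading; `threshold_reached`, the string statuses, and the `require_full_window` flag are hypothetical and exist only to compare the two behaviors side by side.

```python
OK, ERROR = "ok", "error"  # hypothetical stand-ins for CheckInStatus values


def threshold_reached(statuses_newest_first, threshold, require_full_window):
    """Decide whether the most recent check-ins amount to an incident."""
    window = statuses_newest_first[:threshold]
    # With the guard, a short history can never satisfy the threshold.
    if require_full_window and len(window) < threshold:
        return False
    # Any successful check-in inside the window means no incident yet.
    return not any(status == OK for status in window)
```

With three failures and a threshold of 8, the unguarded variant reports the threshold as reached, while the guarded variant does not until eight failing check-ins exist.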



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract the full try_incident_threshold function to understand complete context
rg -A 100 'def try_incident_threshold' src/sentry/monitors/logic/incidents.py -n
```

Length of output: 3681


🏁 Script executed:

#!/bin/bash
# Look for tests related to this function
fd -e py -path '*/test*' | xargs rg -l 'try_incident_threshold' 2>/dev/null | head -5

Length of output: 248


🏁 Script executed:

#!/bin/bash
# Search for where try_incident_threshold is called to understand usage
rg -n 'try_incident_threshold' --type py -B2 -A2 | head -50

Length of output: 966


🏁 Script executed:

#!/bin/bash
# Check the MonitorStatus and CheckInStatus enums to understand the status types
rg -A 10 'class MonitorStatus|class CheckInStatus' --type py | head -40

Length of output: 1293


🏁 Script executed:

#!/bin/bash
# Look for any validation or length checks on previous_checkins or similar patterns
rg -n 'len\(previous_checkins\)|failure_issue_threshold' src/sentry/monitors/logic/incidents.py -B2 -A2

Length of output: 678


🏁 Script executed:

#!/bin/bash
# Search for incident creation to understand the downstream impact
rg -n 'incident_issue|starting_checkin' src/sentry/monitors/logic/incidents.py -B3 -A3 | head -60

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Look at the full file structure to understand the broader context
head -100 src/sentry/monitors/logic/incidents.py

Length of output: 3662


🏁 Script executed:

#!/bin/bash
# Check for any related monitor environment update logic
rg -n 'monitor_env.status.*ERROR|MonitorStatus.ERROR' src/sentry/monitors/logic/incidents.py -B2 -A2

Length of output: 369


🏁 Script executed:

#!/bin/bash
# Check database/ORM query to understand what .values returns
rg -n 'SimpleCheckIn' src/sentry/monitors/logic/incidents.py -B3 -A3

Length of output: 973


🏁 Script executed:

#!/bin/bash
# Look for type hints or definitions of SimpleCheckIn
fd -e py | xargs rg -l 'SimpleCheckIn' | head -10

Length of output: 116


🏁 Script executed:

#!/bin/bash
# Check if there are integration tests that exercise the threshold logic
fd -e py -path '*/test*monitor*' | xargs rg -l 'failure_issue_threshold' 2>/dev/null

Length of output: 643


🏁 Script executed:

#!/bin/bash
# Search for guard clauses related to check-in counts
rg -n 'if.*len.*check' src/sentry/monitors/logic/incidents.py -B1 -A1

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look at the logic flow around the threshold logic more carefully
sed -n '26,80p' src/sentry/monitors/logic/incidents.py

Length of output: 2269



---

🏁 Script executed:

```shell
#!/bin/bash
# Check the test file to understand intended behavior
cat tests/sentry/monitors/logic/test_mark_failed.py | head -150
```

Length of output: 5417


🏁 Script executed:

#!/bin/bash
# Search for test cases specifically about failure_issue_threshold
rg -n 'failure_issue_threshold' tests/sentry/monitors/logic/test_mark_failed.py -B5 -A10

Length of output: 9843


🏁 Script executed:

#!/bin/bash
# Check mark_failed.py to understand how threshold is determined and used
cat src/sentry/monitors/logic/mark_failed.py

Length of output: 2882


🏁 Script executed:

#!/bin/bash
# Look for any documentation or comments about what the threshold means
rg -n 'failure_issue_threshold|failure.*threshold' src/sentry/monitors/ -B2 -A2 | head -80

Length of output: 3624


Add explicit threshold validation to prevent early incident creation

The code never checks that at least failure_issue_threshold check-ins exist before treating the window as a threshold breach. If a monitor has fewer total check-ins than failure_issue_threshold and all of them are failures (no intervening OK check-ins), the incident triggers prematurely.

The test passes only because it creates an initial OK check-in that eventually gets pushed outside the slice window. An uncovered edge case: with threshold=8, a monitor whose entire history is only 3 failed check-ins triggers an incident anyway.

Add the length guard:

previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))

# Ensure we have accumulated enough failures to meet the threshold
if len(previous_checkins) < failure_issue_threshold:
    return False

# If we have any successful check-ins within the threshold window,
# we have NOT reached an incident state
if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
    return False
🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incidents.py around lines 26 to 55, the logic
slices and reverses previous_checkins but does not verify that the number of
returned check-ins actually meets failure_issue_threshold, which can cause
incidents to be created when there are fewer failures than the threshold; modify
the flow so after reversing and slicing previous_checkins you check if
len(previous_checkins) < failure_issue_threshold and return False if so, and
only then proceed to check for any CheckInStatus.OK entries (return False if any
exist) before changing the monitor status.
