ref(crons): Reorganize incident creation / issue occurrence logic#8
Conversation
…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527
|
@coderabbitai review |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
WalkthroughThis pull request introduces a new incident occurrence module that structures monitor failure data into Issues and publishes them to Kafka. It adds a Changes
Sequence DiagramsequenceDiagram
participant mark_failed as mark_failed()
participant try_threshold as try_incident_threshold()
participant incidents as MonitorIncident<br/>(fetch/create)
participant occurrence as create_incident_occurrence()
participant kafka as Kafka<br/>(produce_occurrence)
mark_failed->>try_threshold: failed_checkin,<br/>threshold, received
try_threshold->>try_threshold: fetch prior<br/>check-ins
alt Monitor status == OK
try_threshold->>try_threshold: evaluate<br/>threshold
try_threshold->>incidents: create/retrieve<br/>incident
end
alt Not muted & incident exists
try_threshold->>occurrence: create_incident_occurrence<br/>(failed_checkins,<br/>incident, ...)
occurrence->>occurrence: build IssueOccurrence<br/>+ event_data
occurrence->>kafka: publish to Kafka<br/>(PayloadType.OCCURRENCE)
kafka-->>occurrence: ✓
end
try_threshold->>try_threshold: emit<br/>monitor_environment_failed
try_threshold-->>mark_failed: return bool
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/sentry/monitors/logic/incident_occurrence.py(1 hunks)src/sentry/monitors/logic/incidents.py(1 hunks)src/sentry/monitors/logic/mark_failed.py(2 hunks)src/sentry/monitors/types.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.5)
src/sentry/monitors/logic/incident_occurrence.py
146-146: Prefer next(iter(status_counts.keys())) over single element slice
Replace with next(iter(status_counts.keys()))
(RUF015)
🔇 Additional comments (3)
src/sentry/monitors/types.py (1)
105-112: SimpleCheckIn shape looks appropriate and consistentThe stripped-down check-in representation matches how it’s consumed (id, date_added, status) and keeps coupling minimal. No issues from a typing or usage perspective.
src/sentry/monitors/logic/mark_failed.py (1)
78-79: mark_failed now mirrors try_incident_threshold’s boolean semanticsThe refactor cleanly delegates all incident/occurrence logic to
try_incident_threshold, which is good for cohesion. However,mark_failednow returnsFalsewhenevertry_incident_thresholdshort-circuits (e.g., monitor env status not OK/ACTIVE/ERROR), which might differ from previous behavior if callers only cared about whether the env timestamps were updated.Please double-check call sites to confirm that
mark_failed’s return value is still interpreted as “incident/side-effects were processed” rather than “timestamps updated,” and adjust docstrings or callers if needed.src/sentry/monitors/logic/incident_occurrence.py (1)
28-113: create_incident_occurrence wiring and payload construction look solidThe occurrence construction and Kafka publish path are cohesive:
- Guarding on
monitor_env is Noneis correct.- Enriching evidence with failure reason, environment, and last successful check-in is clear.
- Trace context is conditionally added and tagged via
monitor.id/slug/incident, which matches the expected monitor incident story.I don’t see functional issues in this block as written.
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | ||
| """ | ||
| Builds a humam readible string from a list of failed check-ins. | ||
|
|
||
| "3 missed check-ins detected" | ||
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | ||
| "A failed check-in was detected" | ||
| """ | ||
|
|
||
| status_counts = Counter( | ||
| checkin["status"] | ||
| for checkin in failed_checkins | ||
| if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() | ||
| ) | ||
|
|
||
| if sum(status_counts.values()) == 1: | ||
| return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] | ||
|
|
||
| human_status = get_text_list( | ||
| [ | ||
| "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]} | ||
| for status, count in status_counts.items() | ||
| ], | ||
| last_word=_("and"), | ||
| ) | ||
|
|
||
| return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status} |
There was a problem hiding this comment.
Small cleanups in get_failure_reason (RUF015 + minor polish)
Two small improvements here:
- Key selection for the singular case: Ruff is right that
list(status_counts.keys())[0]is wasteful; you can avoid materializing a list:
- if sum(status_counts.values()) == 1:
- return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+ if sum(status_counts.values()) == 1:
+ key = next(iter(status_counts.keys()))
+ return SINGULAR_HUMAN_FAILURE_MAP[key]- Docstring typo (optional): “humam readible” → “human readable” for clarity.
These changes are non-functional but bring the function in line with style and static-analysis guidance.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | |
| """ | |
| Builds a humam readible string from a list of failed check-ins. | |
| "3 missed check-ins detected" | |
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | |
| "A failed check-in was detected" | |
| """ | |
| status_counts = Counter( | |
| checkin["status"] | |
| for checkin in failed_checkins | |
| if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() | |
| ) | |
| if sum(status_counts.values()) == 1: | |
| return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] | |
| human_status = get_text_list( | |
| [ | |
| "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]} | |
| for status, count in status_counts.items() | |
| ], | |
| last_word=_("and"), | |
| ) | |
| return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status} | |
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | |
| """ | |
| Builds a humam readible string from a list of failed check-ins. | |
| "3 missed check-ins detected" | |
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | |
| "A failed check-in was detected" | |
| """ | |
| status_counts = Counter( | |
| checkin["status"] | |
| for checkin in failed_checkins | |
| if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() | |
| ) | |
| if sum(status_counts.values()) == 1: | |
| key = next(iter(status_counts.keys())) | |
| return SINGULAR_HUMAN_FAILURE_MAP[key] | |
| human_status = get_text_list( | |
| [ | |
| "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]} | |
| for status, count in status_counts.items() | |
| ], | |
| last_word=_("and"), | |
| ) | |
| return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status} |
🧰 Tools
🪛 Ruff (0.14.5)
146-146: Prefer next(iter(status_counts.keys())) over single element slice
Replace with next(iter(status_counts.keys()))
(RUF015)
🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py around lines 130 to 156, the
singular-case key extraction is wastefully creating a list with
list(status_counts.keys())[0] and the docstring contains a typo "humam
readible"; change the singular-case selection to use an iterator (e.g.
next(iter(status_counts))) to avoid materializing a list, and fix the docstring
text to "human readable".
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | ||
| config = monitor_environment.monitor.config.copy() | ||
| if "schedule_type" in config: | ||
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), | ||
| "slug": str(monitor_environment.monitor.slug), | ||
| "name": monitor_environment.monitor.name, | ||
| "config": monitor_environment.monitor.config, | ||
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } |
There was a problem hiding this comment.
Bug: schedule_type override in context is never applied
get_monitor_environment_context builds a config copy and conditionally normalizes schedule_type, but then returns the original monitor_environment.monitor.config instead of the modified copy:
config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
return {
# ...
"config": monitor_environment.monitor.config, # ignores modified `config`
}As written, the schedule type in the returned context will remain the raw stored value, not the human-readable display value.
You likely meant to return the adjusted config:
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
return {
"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
- "config": monitor_environment.monitor.config,
+ "config": config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}This keeps the original config immutable and ensures consumers see the normalized schedule type.
🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py around lines 159 to 171, the
function modifies a local copy named `config` to normalize `schedule_type` but
then returns the original `monitor_environment.monitor.config`, so the
normalized value is never used; change the returned dictionary to use the
modified `config` variable instead of `monitor_environment.monitor.config` so
the display-friendly `schedule_type` is included while keeping the original
config immutable.
| # check to see if we need to update the status | ||
| if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]: | ||
| if failure_issue_threshold == 1: | ||
| previous_checkins: list[SimpleCheckIn] = [ | ||
| { | ||
| "id": failed_checkin.id, | ||
| "date_added": failed_checkin.date_added, | ||
| "status": failed_checkin.status, | ||
| } | ||
| ] | ||
| else: | ||
| previous_checkins = cast( | ||
| list[SimpleCheckIn], | ||
| # Using .values for performance reasons | ||
| MonitorCheckIn.objects.filter( | ||
| monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added | ||
| ) | ||
| .order_by("-date_added") | ||
| .values("id", "date_added", "status"), | ||
| ) | ||
|
|
||
| # reverse the list after slicing in order to start with oldest check-in | ||
| previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold])) | ||
|
|
||
| # If we have any successful check-ins within the threshold of | ||
| # commits we have NOT reached an incident state | ||
| if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]): | ||
| return False | ||
|
|
||
| # change monitor status + update fingerprint timestamp |
There was a problem hiding this comment.
🧩 Analysis chain
Incident threshold can trigger early when there are fewer failures than failure_issue_threshold
The logic for failure_issue_threshold > 1 slices and reverses recent check-ins but never verifies that you actually have at least failure_issue_threshold check-ins before treating the threshold as reached:
previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
# ...
if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
return False
# otherwise we proceed to ERROR + incident creationIf the monitor has fewer than failure_issue_threshold total check-ins, all of which are failures, this will still promote the env to ERROR and create an incident, effectively ignoring the configured threshold.
If the intended semantics are “require N consecutive failing check-ins,” you likely want an explicit length guard, e.g.:
- previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
-
- # If we have any successful check-ins within the threshold of
- # commits we have NOT reached an incident state
- if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
- return False
+ previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+ # If we haven't yet hit the configured number of check-ins, bail out
+ if len(previous_checkins) < failure_issue_threshold:
+ return False
+
+ # If we have any successful check-ins within the threshold window,
+ # we have NOT reached an incident state
+ if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
+ return FalsePlease confirm the intended semantics and adjust accordingly.
---
🏁 Script executed:
```shell
#!/bin/bash
# Extract the full try_incident_threshold function to understand complete context
rg -A 100 'def try_incident_threshold' src/sentry/monitors/logic/incidents.py -n
Length of output: 3681
🏁 Script executed:
#!/bin/bash
# Look for tests related to this function
fd -e py -path '*/test*' | xargs rg -l 'try_incident_threshold' 2>/dev/null | head -5Length of output: 248
🏁 Script executed:
#!/bin/bash
# Search for where try_incident_threshold is called to understand usage
rg -n 'try_incident_threshold' --type py -B2 -A2 | head -50Length of output: 966
🏁 Script executed:
#!/bin/bash
# Check the MonitorStatus and CheckInStatus enums to understand the status types
rg -A 10 'class MonitorStatus|class CheckInStatus' --type py | head -40Length of output: 1293
🏁 Script executed:
#!/bin/bash
# Look for any validation or length checks on previous_checkins or similar patterns
rg -n 'len\(previous_checkins\)|failure_issue_threshold' src/sentry/monitors/logic/incidents.py -B2 -A2Length of output: 678
🏁 Script executed:
#!/bin/bash
# Search for incident creation to understand the downstream impact
rg -n 'incident_issue|starting_checkin' src/sentry/monitors/logic/incidents.py -B3 -A3 | head -60Length of output: 566
🏁 Script executed:
#!/bin/bash
# Look at the full file structure to understand the broader context
head -100 src/sentry/monitors/logic/incidents.pyLength of output: 3662
🏁 Script executed:
#!/bin/bash
# Check for any related monitor environment update logic
rg -n 'monitor_env.status.*ERROR|MonitorStatus.ERROR' src/sentry/monitors/logic/incidents.py -B2 -A2Length of output: 369
🏁 Script executed:
#!/bin/bash
# Check database/ORM query to understand what .values returns
rg -n 'SimpleCheckIn' src/sentry/monitors/logic/incidents.py -B3 -A3Length of output: 973
🏁 Script executed:
#!/bin/bash
# Look for type hints or definitions of SimpleCheckIn
fd -e py | xargs rg -l 'SimpleCheckIn' | head -10Length of output: 116
🏁 Script executed:
#!/bin/bash
# Check if there are integration tests that exercise the threshold logic
fd -e py -path '*/test*monitor*' | xargs rg -l 'failure_issue_threshold' 2>/dev/nullLength of output: 643
🏁 Script executed:
#!/bin/bash
# Search for guard clauses related to check-in counts
rg -n 'if.*len.*check' src/sentry/monitors/logic/incidents.py -B1 -A1Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look at the logic flow around the threshold logic more carefully
sed -n '26,80p' src/sentry/monitors/logic/incidents.pyLength of output: 2269
---
🏁 Script executed:
```shell
#!/bin/bash
# Check the test file to understand intended behavior
cat tests/sentry/monitors/logic/test_mark_failed.py | head -150
Length of output: 5417
🏁 Script executed:
#!/bin/bash
# Search for test cases specifically about failure_issue_threshold
rg -n 'failure_issue_threshold' tests/sentry/monitors/logic/test_mark_failed.py -B5 -A10Length of output: 9843
🏁 Script executed:
#!/bin/bash
# Check mark_failed.py to understand how threshold is determined and used
cat src/sentry/monitors/logic/mark_failed.pyLength of output: 2882
🏁 Script executed:
#!/bin/bash
# Look for any documentation or comments about what the threshold means
rg -n 'failure_issue_threshold|failure.*threshold' src/sentry/monitors/ -B2 -A2 | head -80Length of output: 3624
Add explicit threshold validation to prevent early incident creation
The code lacks a length check before treating insufficient check-ins as a threshold breach. If a monitor has fewer total failures than failure_issue_threshold (with no intervening OK check-ins), the incident triggers prematurely.
The test passes only because it creates an initial OK check-in that eventually gets pushed outside the slice window. An uncovered edge case: threshold=8 with only 3 actual failures triggers an incident anyway.
Add the length guard:
previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
# Ensure we have accumulated enough failures to meet the threshold
if len(previous_checkins) < failure_issue_threshold:
return False
# If we have any successful check-ins within the threshold window,
# we have NOT reached an incident state
if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
return False🤖 Prompt for AI Agents
In src/sentry/monitors/logic/incidents.py around lines 26 to 55, the logic
slices and reverses previous_checkins but does not verify that the number of
returned check-ins actually meets failure_issue_threshold, which can cause
incidents to be created when there are fewer failures than the threshold; modify
the flow so after reversing and slicing previous_checkins you check if
len(previous_checkins) < failure_issue_threshold and return False if so, and
only then proceed to check for any CheckInStatus.OK entries (return False if any
exist) before changing the monitor status.
Test 8
Summary by CodeRabbit
Replicated from ai-code-review-evaluation/sentry-coderabbit#8