ref(crons): Reorganize incident creation / issue occurrence logic#13
ref(crons): Reorganize incident creation / issue occurrence logic#13ShashankFC wants to merge 1 commit into
Conversation
…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughA new incident occurrence module is introduced alongside refactored threshold-based incident detection logic. The changes establish a workflow for creating, serializing, and emitting monitor incident occurrences to Kafka when monitors fail, with comprehensive failure reason generation and monitor environment context capture. Changes
Sequence DiagramsequenceDiagram
actor Monitor as Monitor System
participant ME as MonitorEnvironment
participant INC as Incident Handler
participant KF as Kafka/Occurrence Producer
Monitor->>ME: check_failed_checkin()
ME-->>Monitor: status check (OK/ACTIVE/ERROR)
alt Status: OK or ACTIVE
Monitor->>INC: try_incident_threshold()
activate INC
INC->>INC: query prior check-ins up to threshold
INC->>INC: validate check-in statuses
INC->>ME: update status to ERROR
INC->>INC: create or retrieve incident
deactivate INC
end
alt Status: ERROR
Monitor->>INC: try_incident_threshold()
activate INC
INC->>INC: retrieve active incident
deactivate INC
end
alt Incident exists & not muted
INC->>INC: create_incident_occurrence()
activate INC
INC->>INC: build IssueOccurrence (metadata, failure reason, context)
INC->>KF: produce_occurrence_to_kafka()
deactivate INC
end
KF-->>Monitor: ✓ Occurrence emitted
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py:
- Around line 130-137: Docstring in get_failure_reason contains a typo "humam" —
change it to "human" in the triple-quoted description inside the
get_failure_reason function so the docstring reads "Builds a human readable
string from a list of failed check-ins." and ensure surrounding examples remain
unchanged.
- Around line 159-171: In get_monitor_environment_context, you copy and modify
config (variable config) but then return monitor_environment.monitor.config;
change the returned "config" field to use the modified local variable config so
the human-readable schedule_type is included; locate the function
get_monitor_environment_context and replace the returned config reference with
the local config variable.
🧹 Nitpick comments (2)
src/sentry/monitors/logic/incidents.py (1)
50-53: Use generator expression instead of list comprehension insideany().
any()short-circuits on the first truthy value, but the list comprehension evaluates all elements first. Use a generator expression for efficiency.🔎 Proposed fix
- if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]): + if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins): return Falsesrc/sentry/monitors/logic/incident_occurrence.py (1)
145-146: Prefernext(iter(...))over single-element slice.Creating a list just to access the first element is less efficient than using an iterator directly.
🔎 Proposed fix
if sum(status_counts.values()) == 1: - return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] + return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts.keys()))]
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/sentry/monitors/logic/incident_occurrence.pysrc/sentry/monitors/logic/incidents.pysrc/sentry/monitors/logic/mark_failed.pysrc/sentry/monitors/types.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/sentry/monitors/logic/incidents.py (3)
src/sentry/monitors/logic/incident_occurrence.py (1)
create_incident_occurrence(28-113)src/sentry/monitors/models.py (3)
MonitorCheckIn(472-582)MonitorIncident(747-795)MonitorStatus(80-121)src/sentry/monitors/types.py (1)
SimpleCheckIn(105-112)
src/sentry/monitors/logic/incident_occurrence.py (5)
src/sentry/eventstore/models.py (2)
datetime(112-120)trace_id(90-94)src/sentry/monitors/models.py (6)
timezone(340-341)CheckInStatus(124-176)MonitorCheckIn(472-582)MonitorIncident(747-795)get_last_successful_checkin(707-712)get_schedule_type_display(343-344)src/sentry/issues/grouptype.py (1)
MonitorIncidentType(528-536)src/sentry/monitors/types.py (1)
SimpleCheckIn(105-112)src/sentry/issues/producer.py (2)
PayloadType(25-34)produce_occurrence_to_kafka(50-90)
src/sentry/monitors/logic/mark_failed.py (2)
src/sentry/monitors/logic/incidents.py (1)
try_incident_threshold(14-104)src/sentry/monitors/models.py (3)
CheckInStatus(124-176)MonitorCheckIn(472-582)MonitorEnvironment(644-724)
🪛 Ruff (0.14.10)
src/sentry/monitors/logic/incident_occurrence.py
146-146: Prefer next(iter(status_counts.keys())) over single element slice
Replace with next(iter(status_counts.keys()))
(RUF015)
🔇 Additional comments (4)
src/sentry/monitors/types.py (1)
105-113: LGTM!The
SimpleCheckInTypedDict cleanly encapsulates the minimal check-in data needed for incident processing, and its fields align with the.values()query inincidents.py.src/sentry/monitors/logic/mark_failed.py (1)
78-79: Clean refactor separating concerns.The delegation to
try_incident_thresholdafter updating timing fields and refreshing from DB is a clean separation. This keepsmark_failedfocused on monitor environment state updates while incident creation logic is centralized.src/sentry/monitors/logic/incidents.py (1)
89-104: LGTM!The occurrence creation logic correctly handles both initial incident creation (multiple occurrences for threshold > 1) and ongoing failures (single occurrence per check-in). The muting check and signal emission are appropriately placed.
src/sentry/monitors/logic/incident_occurrence.py (1)
28-113: LGTM!The occurrence creation is well-structured with proper null handling, comprehensive evidence display, and appropriate Kafka event data including optional trace context.
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | ||
| """ | ||
| Builds a humam readible string from a list of failed check-ins. | ||
|
|
||
| "3 missed check-ins detected" | ||
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | ||
| "A failed check-in was detected" | ||
| """ |
There was a problem hiding this comment.
Fix typo: "humam" → "human".
🔎 Proposed fix
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
- Builds a humam readible string from a list of failed check-ins.
+ Builds a human readable string from a list of failed check-ins.
"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | |
| """ | |
| Builds a humam readible string from a list of failed check-ins. | |
| "3 missed check-ins detected" | |
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | |
| "A failed check-in was detected" | |
| """ | |
| def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): | |
| """ | |
| Builds a human readable string from a list of failed check-ins. | |
| "3 missed check-ins detected" | |
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | |
| "A failed check-in was detected" | |
| """ |
🤖 Prompt for AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py around lines 130 - 137,
Docstring in get_failure_reason contains a typo "humam" — change it to "human"
in the triple-quoted description inside the get_failure_reason function so the
docstring reads "Builds a human readable string from a list of failed
check-ins." and ensure surrounding examples remain unchanged.
| human_status = get_text_list( | ||
| [ | ||
| "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]} | ||
| for status, count in status_counts.items() | ||
| ], | ||
| last_word=_("and"), | ||
| ) | ||
|
|
||
| return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status} |
There was a problem hiding this comment.
Grammar issue: Plural formatting doesn't handle singular counts correctly.
The current format produces strings like "1 missed and 1 error check-ins detected" when counts are 1, which is grammatically incorrect. The docstring example shows "1 timeout check-in" (singular) which the implementation doesn't produce.
Consider including "check-in(s)" per status entry with proper pluralization, or simplifying the docstring to match actual output.
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | ||
| config = monitor_environment.monitor.config.copy() | ||
| if "schedule_type" in config: | ||
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), | ||
| "slug": str(monitor_environment.monitor.slug), | ||
| "name": monitor_environment.monitor.name, | ||
| "config": monitor_environment.monitor.config, | ||
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } |
There was a problem hiding this comment.
Bug: Modified config is not returned.
Lines 160-162 create a copy of config and update schedule_type to a human-readable display value, but line 168 returns the original monitor_environment.monitor.config instead of the modified config variable.
🔎 Proposed fix
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
return {
"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
- "config": monitor_environment.monitor.config,
+ "config": config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}🤖 Prompt for AI Agents
In @src/sentry/monitors/logic/incident_occurrence.py around lines 159 - 171, In
get_monitor_environment_context, you copy and modify config (variable config)
but then return monitor_environment.monitor.config; change the returned "config"
field to use the modified local variable config so the human-readable
schedule_type is included; locate the function get_monitor_environment_context
and replace the returned config reference with the local config variable.
Test 8
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.
Replicated from ai-code-review-evaluation/sentry-coderabbit#8