171 changes: 171 additions & 0 deletions src/sentry/monitors/logic/incident_occurrence.py
@@ -0,0 +1,171 @@
from __future__ import annotations

import logging
import uuid
from collections import Counter
from collections.abc import Mapping, Sequence
from datetime import datetime, timezone
from typing import TYPE_CHECKING

from django.utils.text import get_text_list
from django.utils.translation import gettext_lazy as _

from sentry.issues.grouptype import MonitorIncidentType
from sentry.monitors.models import (
CheckInStatus,
MonitorCheckIn,
MonitorEnvironment,
MonitorIncident,
)
from sentry.monitors.types import SimpleCheckIn

🟡 MINOR — Unused import in types.py consumer (confidence: 88%)

incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.

Evidence:

  • incident_occurrence.py line 20: from sentry.monitors.types import SimpleCheckIn
  • incidents.py line 9: from sentry.monitors.types import SimpleCheckIn
  • Both modules use SimpleCheckIn in type annotations and dict construction

Agent: logic

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.

Suggested change
from sentry.monitors.types import SimpleCheckIn
from sentry.monitors.types import SimpleCheckIn

🤖 Grapple PR auto-fix • minor • confidence: 88%

🟡 MINOR — Incomplete type export (confidence: 100%)

The SimpleCheckIn type is imported from sentry.monitors.types which is correct, but any external code that previously imported SimpleCheckIn from sentry.monitors.logic.mark_failed will break since it has been removed from that module and no re-export is provided.

Evidence:

  • Original SimpleCheckIn was defined in mark_failed.py (lines removed in diff)
  • New definition is in types.py
  • No re-export alias in mark_failed.py for backward compatibility

Agent: logic

[Grapple PR] Auto-fix — logic agent (Small fix (6 lines, 1 file))

The SimpleCheckIn type is imported from sentry.monitors.types which is correct, but any external code that previously imported SimpleCheckIn from sentry.monitors.logic.mark_failed will break since it has been removed from that module and no re-export is provided.

Suggested change
from sentry.monitors.types import SimpleCheckIn
# Re-export SimpleCheckIn for backward compatibility.
# The canonical definition has moved to sentry.monitors.types as part of the
# incident_occurrence.py refactor. External code importing SimpleCheckIn from
# this module will continue to work without modification.
from sentry.monitors.types import SimpleCheckIn as SimpleCheckIn # noqa: F401

🤖 Grapple PR auto-fix • minor • confidence: 100%


if TYPE_CHECKING:
from django.utils.functional import _StrPromise

logger = logging.getLogger(__name__)


def create_incident_occurrence(
failed_checkins: Sequence[SimpleCheckIn],
failed_checkin: MonitorCheckIn,
incident: MonitorIncident,
received: datetime | None,
) -> None:
from sentry.issues.issue_occurrence import IssueEvidence, IssueOccurrence

🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 99%)

The occurrence creation function was renamed from create_issue_platform_occurrence to create_incident_occurrence. Any code outside mark_failed.py that imports or references create_issue_platform_occurrence by name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.

Evidence:

  • Original function was create_issue_platform_occurrence in mark_failed.py
  • New function is create_incident_occurrence in incident_occurrence.py
  • Tests commonly mock internal functions like create_issue_platform_occurrence by path; once the name is gone, mock.patch raises AttributeError by default, and with create=True the mock silently becomes a no-op

Agent: architecture
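
For reference, a minimal sketch of how `unittest.mock.patch` behaves when the patched name no longer exists. The module `fake_incident_occurrence` is a hypothetical in-memory stand-in, not the real Sentry module, and the function bodies are placeholders. With default settings the patch of the old name fails loudly with `AttributeError`; a silent no-op would only occur if a test passed `create=True`.

```python
import sys
import types
from unittest import mock

# Hypothetical stand-in for the refactored module; only the new name exists.
fake_module = types.ModuleType("fake_incident_occurrence")


def create_incident_occurrence():
    return "occurrence created"


fake_module.create_incident_occurrence = create_incident_occurrence
sys.modules["fake_incident_occurrence"] = fake_module


def patch_old_name() -> str:
    """Try to patch the pre-rename name and report what happens."""
    try:
        with mock.patch("fake_incident_occurrence.create_issue_platform_occurrence"):
            return "patched"
    except AttributeError:
        return "AttributeError"


def patch_new_name() -> str:
    """Patching the current name works and intercepts the call."""
    with mock.patch("fake_incident_occurrence.create_incident_occurrence") as patched:
        fake_module.create_incident_occurrence()
        return "patched" if patched.called else "not called"
```

Under these assumptions, `patch_old_name()` reports the `AttributeError` path and `patch_new_name()` reports a successful patch, so stale mock targets would likely surface as test errors rather than pass unnoticed.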

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (11 lines, 1 file) — review recommended)

The occurrence creation function was renamed from create_issue_platform_occurrence to create_incident_occurrence. Any code outside mark_failed.py that imports or references create_issue_platform_occurrence by name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.

--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -168,3 +168,13 @@
         "type": monitor_environment.monitor.get_type_display(),
     }
+
+
+# Backward-compatible alias for the previous name of this function when it lived
+# in mark_failed.py. Tests that mock or import `create_issue_platform_occurrence`
+# by path (e.g. via `mock.patch("...incident_occurrence.create_issue_platform_occurrence")`)
+# will continue to work without modification.
+#
+# Deprecated: use `create_incident_occurrence` directly.
+# TODO: Remove this alias once all references to `create_issue_platform_occurrence`
+#       have been updated across tests and callers.
+create_issue_platform_occurrence = create_incident_occurrence

🤖 Grapple PR auto-fix • major • Review this diff before applying

from sentry.issues.producer import PayloadType, produce_occurrence_to_kafka

monitor_env = failed_checkin.monitor_environment

if monitor_env is None:
return

current_timestamp = datetime.now(timezone.utc)

# Get last successful check-in to show in evidence display
last_successful_checkin_timestamp = "Never"
last_successful_checkin = monitor_env.get_last_successful_checkin()
if last_successful_checkin:
last_successful_checkin_timestamp = last_successful_checkin.date_added.isoformat()

occurrence = IssueOccurrence(
id=uuid.uuid4().hex,
resource_id=None,
project_id=monitor_env.monitor.project_id,
event_id=uuid.uuid4().hex,
fingerprint=[incident.grouphash],
type=MonitorIncidentType,
issue_title=f"Monitor failure: {monitor_env.monitor.name}",
subtitle="Your monitor has reached its failure threshold.",
evidence_display=[
IssueEvidence(
name="Failure reason",
value=str(get_failure_reason(failed_checkins)),
important=True,
),
IssueEvidence(
name="Environment",
value=monitor_env.get_environment().name,
important=False,
),
IssueEvidence(
name="Last successful check-in",
value=last_successful_checkin_timestamp,
important=False,
),
],
evidence_data={},
culprit="",
detection_time=current_timestamp,
level="error",
assignee=monitor_env.monitor.owner_actor,
)

if failed_checkin.trace_id:
trace_id = failed_checkin.trace_id.hex
else:
trace_id = None

event_data = {
"contexts": {"monitor": get_monitor_environment_context(monitor_env)},
"environment": monitor_env.get_environment().name,
"event_id": occurrence.event_id,
"fingerprint": [incident.grouphash],
"platform": "other",
"project_id": monitor_env.monitor.project_id,
# We set this to the time that the checkin that triggered the occurrence was written to relay if available
"received": (received if received else current_timestamp).isoformat(),
"sdk": None,
"tags": {
"monitor.id": str(monitor_env.monitor.guid),
"monitor.slug": str(monitor_env.monitor.slug),
"monitor.incident": str(incident.id),
},
"timestamp": current_timestamp.isoformat(),
}

if trace_id:
event_data["contexts"]["trace"] = {"trace_id": trace_id, "span_id": None}

produce_occurrence_to_kafka(
payload_type=PayloadType.OCCURRENCE,
occurrence=occurrence,
event_data=event_data,
)


HUMAN_FAILURE_STATUS_MAP: Mapping[int, _StrPromise] = {
CheckInStatus.ERROR: _("error"),
CheckInStatus.MISSED: _("missed"),
CheckInStatus.TIMEOUT: _("timeout"),
}

# Exists due to the vowel differences (A vs An) in the statuses
SINGULAR_HUMAN_FAILURE_MAP: Mapping[int, _StrPromise] = {
CheckInStatus.ERROR: _("An error check-in was detected"),
CheckInStatus.MISSED: _("A missed check-in was detected"),
CheckInStatus.TIMEOUT: _("A timeout check-in was detected"),
}


def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
Builds a humam readible string from a list of failed check-ins.

"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""

🟡 MINOR — Documentation (confidence: 76%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • The function get_failure_reason has a docstring with a spelling error: 'Builds a humam readible string...'
  • This is a public function that may be referenced in documentation or by other developers

Agent: style

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Typo in docstring: 'humam readible' should be 'human readable'

Suggested change
Builds a humam readible string from a list of failed check-ins.
Builds a human readable string from a list of failed check-ins.

🤖 Grapple PR auto-fix • minor • confidence: 76%


🔵 INFO — Documentation (confidence: 82%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • Line 130 contains: 'Builds a humam readible string from a list of failed check-ins.'
  • This is a documentation typo that should be corrected for clarity

Agent: style

status_counts = Counter(

🟡 MINOR — Documentation (confidence: 86%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • The docstring for get_failure_reason contains a spelling error that reduces code clarity
  • This is a public function with a docstring that should follow proper English conventions

Agent: style

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Typo in docstring: 'humam readible' should be 'human readable'

Suggested change
Builds a humam readible string from a list of failed check-ins.
Builds a human readable string from a list of failed check-ins.

🤖 Grapple PR auto-fix • minor • confidence: 86%

checkin["status"]
for checkin in failed_checkins
if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()

🟠 MAJOR — Edge case / KeyError risk (confidence: 100%)

In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.

Evidence:

  • Lines 136-139: status_counts = Counter(checkin['status'] for checkin in failed_checkins if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys())
  • If no check-ins match, status_counts is empty and sum(status_counts.values()) == 0
  • Falls through to get_text_list([], last_word=_('and')) which returns ''
  • Returns _('%(problem_checkins)s check-ins detected') % {'problem_checkins': ''}, which evaluates to ' check-ins detected'
  • This is then surfaced as the 'Failure reason' evidence in the issue occurrence shown to users
  • HUMAN_FAILURE_STATUS_MAP has keys: ERROR, MISSED, TIMEOUT
  • SINGULAR_HUMAN_FAILURE_MAP has keys: ERROR, MISSED, TIMEOUT
  • If these maps ever diverge, line 142 would raise KeyError
  • Also: if all check-ins are filtered out (status_counts is empty), sum == 0, so the code goes to get_text_list with an empty list, producing a malformed string like ' check-ins detected'

Agent: logic
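
The empty-counter path described above can be reproduced with a small sketch. `join_text_list` is a simplified stand-in for `django.utils.text.get_text_list` (translation omitted), the integer status values are illustrative rather than the real `CheckInStatus` values, and `plural_reason` mirrors only the multi-check-in branch of `get_failure_reason`.

```python
from collections import Counter

# Stand-ins for CheckInStatus members; the integers are illustrative only.
ERROR, MISSED, TIMEOUT, IN_PROGRESS = 1, 2, 3, 4
HUMAN_FAILURE_STATUS_MAP = {ERROR: "error", MISSED: "missed", TIMEOUT: "timeout"}


def join_text_list(items, last_word="and"):
    """Simplified stand-in for django.utils.text.get_text_list."""
    if not items:
        return ""
    if len(items) == 1:
        return str(items[0])
    return f"{', '.join(items[:-1])} {last_word} {items[-1]}"


def plural_reason(failed_checkins):
    """Mirrors only the plural branch of get_failure_reason."""
    status_counts = Counter(
        checkin["status"]
        for checkin in failed_checkins
        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP
    )
    human_status = join_text_list(
        [
            f"{count} {HUMAN_FAILURE_STATUS_MAP[status]}"
            for status, count in status_counts.items()
        ]
    )
    return f"{human_status} check-ins detected"
```

With these stand-ins, a list holding only an unrecognized status yields a message that starts with a space, which is the malformed string the finding describes.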

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (16 lines, 1 file) — review recommended)

In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.

--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -139,10 +139,18 @@ def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
     status_counts = Counter(
         checkin["status"]
         for checkin in failed_checkins
         if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
     )
 
+    total = sum(status_counts.values())
+
+    # No check-ins with recognized failure statuses; return a generic fallback
+    # rather than producing a malformed string like ' check-ins detected'.
+    if total == 0:
+        return _("A failed check-in was detected")
+
-    if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+    if total == 1:
+        status_key = list(status_counts.keys())[0]
+        # Defensive .get() in case SINGULAR_HUMAN_FAILURE_MAP and
+        # HUMAN_FAILURE_STATUS_MAP ever diverge (they currently share keys).
+        return SINGULAR_HUMAN_FAILURE_MAP.get(
+            status_key, _("A failed check-in was detected")
+        )
 
     human_status = get_text_list(
         [

🤖 Grapple PR auto-fix • major • Review this diff before applying

)

if sum(status_counts.values()) == 1:
return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]

human_status = get_text_list(

🟡 MINOR — Edge Case - Empty failed_checkins (confidence: 100%)

The get_failure_reason function produces a malformed message if failed_checkins contains only check-ins with statuses NOT in HUMAN_FAILURE_STATUS_MAP (i.e., all are filtered out). No KeyError is raised in this case: status_counts is empty, sum(status_counts.values()) is 0, which is not == 1, so the code falls through to the get_text_list path with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.

Evidence:

  • If all check-in statuses are filtered out by the if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys() guard, status_counts is an empty Counter
  • sum of empty Counter values is 0, which != 1, so it proceeds to format an empty list
  • get_text_list with an empty list returns '' which then gets interpolated into the message as ' check-ins detected'

Agent: logic

[Grapple PR] Auto-fix — logic agent (Small fix (7 lines, 1 file))

The get_failure_reason function produces a malformed message if failed_checkins contains only check-ins with statuses NOT in HUMAN_FAILURE_STATUS_MAP (i.e., all are filtered out). No KeyError is raised in this case: status_counts is empty, sum(status_counts.values()) is 0, which is not == 1, so the code falls through to the get_text_list path with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.

Suggested change
human_status = get_text_list(
# Guard against the edge case where all check-ins have statuses not present
# in HUMAN_FAILURE_STATUS_MAP (e.g. unexpected/future statuses). Without this,
# get_text_list([]) returns '' and we'd produce the malformed string
# " check-ins detected".
if not status_counts:
return _("A failed check-in was detected")

human_status = get_text_list(

🤖 Grapple PR auto-fix • minor • confidence: 100%

[
"%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
for status, count in status_counts.items()
],
last_word=_("and"),
)

return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}


def get_monitor_environment_context(monitor_environment: MonitorEnvironment):

🟡 MINOR — Unused Variable (confidence: 100%)

In get_monitor_environment_context, the local variable config is computed and modified (line 152-153: config = monitor_environment.monitor.config.copy() and the schedule_type override), but the returned dict uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This is a pre-existing bug that was faithfully moved from mark_failed.py, but it means the schedule_type display override is silently discarded.

Evidence:

  • Line 152: config = monitor_environment.monitor.config.copy()
  • Line 153-154: modifies config['schedule_type'] to the display value
  • Line 159: returns 'config': monitor_environment.monitor.config instead of 'config': config
  • This was also present in the original mark_failed.py (same bug), so the refactor faithfully preserved it, but it is still a logic bug

Agent: logic
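
A minimal sketch of the discarded-copy behavior, using a hypothetical `FakeMonitor` in place of the real model. The config values and the `"crontab"` display string are assumptions chosen for illustration.

```python
class FakeMonitor:
    """Hypothetical stand-in for the Monitor model."""

    def __init__(self):
        self.config = {"schedule_type": 1, "schedule": "* * * * *"}

    def get_schedule_type_display(self):
        return "crontab"


def context_as_written(monitor):
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    # Returns the original dict, so the override above is discarded.
    return {"config": monitor.config}


def context_with_fix(monitor):
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    # Returns the modified copy; the stored config stays untouched.
    return {"config": config}
```

In this sketch `context_as_written` still exposes the raw `schedule_type`, while `context_with_fix` exposes the display value without mutating the monitor's stored config.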

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

In get_monitor_environment_context, the local variable config is computed and modified (line 152-153: config = monitor_environment.monitor.config.copy() and the schedule_type override), but the returned dict uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This is a pre-existing bug that was faithfully moved from mark_failed.py, but it means the schedule_type display override is silently discarded.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%

config = monitor_environment.monitor.config.copy()

🟡 MINOR — Code Organization (confidence: 100%)

Inconsistent handling of monitor config in get_monitor_environment_context: the function modifies a shallow copy of config but returns the original unmodified config in the dict

Evidence:

  • Line 156-157: config = monitor_environment.monitor.config.copy() followed by conditional mutation of config['schedule_type']
  • Line 162: 'config': monitor_environment.monitor.config, returns the original, not the modified config variable
  • This means the schedule_type transformation is lost in the returned context

Agent: style

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Inconsistent handling of monitor config in get_monitor_environment_context: the function modifies a shallow copy of config but returns the original unmodified config in the dict

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%

if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

🔴 CRITICAL — Data flow / Bug (confidence: 100%)

get_monitor_environment_context builds a local config variable with the schedule_type overwritten, but then the returned dictionary uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.

Evidence:

  • Line 158: config = monitor_environment.monitor.config.copy()
  • Line 159-160: if 'schedule_type' in config: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display()
  • Line 165: 'config': monitor_environment.monitor.config, — uses the original config, not the local config variable

Agent: logic

🟠 MAJOR — Logic Error / Data Integrity (confidence: 100%)

The config copy with humanized schedule_type is computed but never used. The returned dict references monitor_environment.monitor.config (the original dict) instead of the local config variable, so the schedule_type transformation is silently discarded.

Evidence:

  • Line 155: config = monitor_environment.monitor.config.copy() — copy is made and potentially modified
  • Line 157: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display() — written to local config
  • Line 160: 'config': monitor_environment.monitor.config — returns the ORIGINAL unmodified dict, not config
  • This bug exists verbatim in the original mark_failed.py and was faithfully copied, but it means the monitor context sent with every occurrence always has the raw integer schedule_type instead of the human-readable string

Agent: security

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

get_monitor_environment_context builds a local config variable with the schedule_type overwritten, but then the returned dictionary uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • critical • confidence: 100%

[Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))

The config copy with humanized schedule_type is computed but never used. The returned dict references monitor_environment.monitor.config (the original dict) instead of the local config variable, so the schedule_type transformation is silently discarded.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • major • confidence: 100%

🟠 MAJOR — Logic Error / Potential KeyError or Misleading Output (confidence: 100%)

get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed string, ' check-ins detected', with no count or status. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code; it appears the original had a fallback that was not ported.

Evidence:

  • Line 140-144: Counter filters only statuses in HUMAN_FAILURE_STATUS_MAP (ERROR, MISSED, TIMEOUT)
  • Line 146: if sum(status_counts.values()) == 1 — only handles the singular case
  • No else/fallback branch for sum == 0 (empty status_counts)
  • Docstring at line 133 mentions 'A failed check-in was detected' but no code path produces that string — SINGULAR_HUMAN_FAILURE_MAP only covers ERROR, MISSED, TIMEOUT
  • If a check-in with status IN_PROGRESS (or any unrecognized status) is in failed_checkins, get_text_list receives an empty list

Agent: security

🟡 MINOR — Logic Error / Stale Data (confidence: 100%)

get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local config copy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.

Evidence:

  • Line 155: config = monitor_environment.monitor.config.copy() — a copy is made and potentially mutated
  • Line 157: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display() — human-readable value stored in copy
  • Line 163: 'config': monitor_environment.monitor.config — returns the original raw dict, not the transformed copy
  • The caller in create_incident_occurrence at line 86 embeds this context in event_data sent to Kafka, so consumers see the raw schedule_type integer instead of the display string

Agent: security

[Grapple PR] Auto-fix — security agent (Small fix (3 lines, 1 file))

get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed string, ' check-ins detected', with no count or status. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code; it appears the original had a fallback that was not ported.

Suggested change
if sum(status_counts.values()) == 1:
if sum(status_counts.values()) == 0:
return _("A failed check-in was detected")

if sum(status_counts.values()) == 1:

🤖 Grapple PR auto-fix • major • confidence: 100%

[Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))

get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local config copy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%

"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}

🟠 MAJOR — Logic Error - Config Context Bug Preserved (confidence: 100%)

The get_monitor_environment_context function builds a config copy and conditionally modifies schedule_type, but then returns monitor_environment.monitor.config (the original) instead of the modified config copy. This bug existed in the original mark_failed.py and was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.

Evidence:

  • Line 163: config = monitor_environment.monitor.config.copy()
  • Line 164-165: if 'schedule_type' in config: config['schedule_type'] = ... modifies the copy
  • Line 170: 'config': monitor_environment.monitor.config returns the ORIGINAL, not the modified copy
  • This means the schedule_type display transformation is silently discarded

Agent: logic

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

The get_monitor_environment_context function builds a config copy and conditionally modifies schedule_type, but then returns monitor_environment.monitor.config (the original) instead of the modified config copy. This bug existed in the original mark_failed.py and was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

🤖 Grapple PR auto-fix • major • confidence: 100%

104 changes: 104 additions & 0 deletions src/sentry/monitors/logic/incidents.py
@@ -0,0 +1,104 @@
from __future__ import annotations

import logging
from datetime import datetime
from typing import cast

from sentry.monitors.logic.incident_occurrence import create_incident_occurrence

🟡 MINOR — Module Boundaries / Import Direction (confidence: 100%)

incidents.py imports directly from incident_occurrence.py, creating a fixed coupling direction. This is acceptable for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency will result. Consider whether the call to create_incident_occurrence should instead be injected or orchestrated from a higher-level caller (like mark_failed.py) to keep the two new modules independent.

Evidence:

  • Line 7: from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
  • incidents.py calls create_incident_occurrence at line 96
  • If incident_occurrence.py later needs incident state info, circular import will occur

Agent: architecture
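
As a rough illustration of the injection idea, the sketch below passes the occurrence creator in as a callable instead of importing it. The function name, its parameters, and the one-argument `OccurrenceCreator` signature are hypothetical simplifications; the real `try_incident_threshold` and `create_incident_occurrence` take different arguments.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical signature: the real occurrence creator takes more arguments.
OccurrenceCreator = Callable[[str], None]


def try_incident_threshold_sketch(
    incident_id: Optional[str],
    is_muted: bool,
    create_occurrence: Optional[OccurrenceCreator] = None,
) -> bool:
    """Return True when an occurrence was dispatched through the injected callable."""
    if incident_id is None or is_muted or create_occurrence is None:
        return False
    create_occurrence(incident_id)
    return True


# Usage: the orchestrating caller wires in the dependency.
created: list[str] = []
dispatched = try_incident_threshold_sketch("incident-1", False, created.append)
```

One trade-off worth weighing: with a `None` default, a caller that forgets to pass the callable skips occurrence creation without any error, so a required parameter may be the safer choice.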

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (12 lines, 1 file) — review recommended)

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,11 +1,11 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
-from typing import cast
+from typing import Callable, cast
 
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
 from sentry.monitors.types import SimpleCheckIn
 
 logger = logging.getLogger(__name__)
 
 
 def try_incident_threshold(
     failed_checkin: MonitorCheckIn,
     failure_issue_threshold: int,
     received: datetime | None,
+    # Injected to avoid a coupling dependency between incidents.py and
+    # incident_occurrence.py. If incident_occurrence.py ever needs to reference
+    # incident state, a circular import would result. The orchestrating caller
+    # (mark_failed.py) is responsible for wiring this dependency.
+    create_occurrence: Callable | None = None,
 ) -> bool:
@@ -88,10 +88,13 @@
     # Only create an occurrence if:
     # - We have an active incident and fingerprint
     # - The monitor and env are not muted
-    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident and create_occurrence is not None:
         checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
         for checkin in checkins:
-            create_incident_occurrence(
+            create_occurrence(
                 previous_checkins,
                 checkin,
                 incident,
                 received=received,
             )

🤖 Grapple PR auto-fix • minor • Review this diff before applying
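
The injection idea in the diff above can be sketched in isolation. This is an illustrative stand-in rather than the PR's code: `notify_for_checkins`, `OccurrenceCreator`, and the dict-shaped check-ins are hypothetical names, and the optional callable is annotated as optional so that the `None` default type-checks:

```python
from collections.abc import Callable, Sequence
from typing import Any, Optional

# Hypothetical signature for the injected occurrence creator.
OccurrenceCreator = Callable[[Sequence[dict[str, Any]], dict[str, Any]], None]


def notify_for_checkins(
    previous_checkins: Sequence[dict[str, Any]],
    create_occurrence: Optional[OccurrenceCreator] = None,
) -> int:
    """Call the injected creator once per check-in; return how many calls were made."""
    if create_occurrence is None:
        # No creator wired in by the caller, so nothing is reported.
        return 0
    for checkin in previous_checkins:
        create_occurrence(previous_checkins, checkin)
    return len(previous_checkins)


recorded: list[int] = []


def record_occurrence(all_checkins: Sequence[dict[str, Any]], checkin: dict[str, Any]) -> None:
    # Stand-in for the real occurrence creation: just remember the check-in id.
    recorded.append(checkin["id"])


sample = [{"id": 1, "status": "error"}, {"id": 2, "status": "timeout"}]
calls_with_creator = notify_for_checkins(sample, record_occurrence)
calls_without_creator = notify_for_checkins(sample)
```

One trade-off worth weighing: with a `None` default, a caller that forgets to pass the creator gets no occurrences and no error, so a required parameter may be the safer choice.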

🟡 MINOR — Module boundaries / Dependency direction (confidence: 84%)

incidents.py directly imports from incident_occurrence.py, establishing a one-way dependency. This is architecturally fine for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency would result. Consider whether the create_incident_occurrence call should instead be injected or dispatched from mark_failed.py (the orchestrator) to keep both new modules independent of each other.

Evidence:

  • incidents.py line 7: from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
  • mark_failed.py already serves as the orchestrator; it could call both modules sequentially rather than having incidents.py call incident_occurrence.py

Agent: architecture

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (33 lines, 1 file) — review recommended)

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,10 +1,9 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
 from typing import cast
 
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
 from sentry.monitors.types import SimpleCheckIn
 
@@ -14,7 +13,14 @@ logger = logging.getLogger(__name__)
 
 def try_incident_threshold(
     failed_checkin: MonitorCheckIn,
     failure_issue_threshold: int,
     received: datetime | None,
-) -> bool:
+) -> tuple[bool, list[SimpleCheckIn], MonitorIncident | None]:
+    """
+    Evaluate whether the failure threshold has been reached and return the
+    incident context needed to create occurrences.
+
+    Returns a 3-tuple of (threshold_reached, previous_checkins, incident).
+    The caller (mark_failed.py orchestrator) is responsible for calling
+    create_incident_occurrence so that incidents.py and incident_occurrence.py
+    remain independent of each other.
+    """
     from sentry.signals import monitor_environment_failed
 
     monitor_env = failed_checkin.monitor_environment
 
     if monitor_env is None:
-        return False
+        return False, [], None
 
     # check to see if we need to update the status
     if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
@@ -63,7 +69,7 @@ def try_incident_threshold(
         # get the active incident from the monitor environment
         incident = monitor_env.active_incident
     else:
         # don't send occurrence for other statuses
-        return False
+        return False, [], None
 
     # Only create an occurrence if:
     # - We have an active incident and fingerprint
     # - The monitor and env are not muted
-    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
-        checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
-        for checkin in checkins:
-            create_incident_occurrence(
-                previous_checkins,
-                checkin,
-                incident,
-                received=received,
-            )
-
     monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
 
-    return True
+    # Return context to the orchestrator so it can call create_incident_occurrence
+    # without incidents.py taking a direct dependency on incident_occurrence.py.
+    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+        return True, previous_checkins, incident
+
+    return True, [], None

🤖 Grapple PR auto-fix • minor • Review this diff before applying
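
The orchestrator approach can also be sketched without Django. The example below is an assumption-laden illustration: `ThresholdResult`, `evaluate_threshold`, and `orchestrate` are hypothetical names, string statuses replace `CheckInStatus`, and a named tuple is used in place of the bare 3-tuple from the diff so that call sites read by field name:

```python
from typing import NamedTuple, Optional


class ThresholdResult(NamedTuple):
    """Hypothetical result type; the suggested diff returns a bare 3-tuple."""

    threshold_reached: bool
    previous_checkins: list[dict]
    incident: Optional[str]


def evaluate_threshold(checkins: list[dict], muted: bool) -> ThresholdResult:
    # Stand-in for the threshold logic: any "ok" check-in means no incident.
    if any(c["status"] == "ok" for c in checkins):
        return ThresholdResult(False, [], None)
    if muted:
        # Threshold reached, but nothing should be reported.
        return ThresholdResult(True, [], None)
    return ThresholdResult(True, checkins, "incident-1")


def orchestrate(checkins: list[dict], muted: bool) -> list[int]:
    # The orchestrator owns the occurrence step, so the two helper
    # modules never need to import each other.
    result = evaluate_threshold(checkins, muted)
    created: list[int] = []
    if result.incident is not None:
        for checkin in result.previous_checkins:
            created.append(checkin["id"])  # stand-in for creating an occurrence
    return created


failing = [{"id": 1, "status": "error"}, {"id": 2, "status": "timeout"}]
```

Changing the return type from `bool` to a tuple affects every existing caller of the function, so this variant has a larger blast radius than the injection variant.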

from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus

🔵 INFO — Import Hygiene (confidence: 85%)

All four names imported from sentry.monitors.models on line 8 (CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus) are used in incidents.py. MonitorIncident appears both as the type annotation for the incident variable and in the MonitorIncident.objects.get_or_create call. No import changes are needed.

Evidence:

  • Line 8: from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
  • All four are used within the function body

Agent: logic

from sentry.monitors.types import SimpleCheckIn

logger = logging.getLogger(__name__)


def try_incident_threshold(

🟠 MAJOR — Function rename changes public API (confidence: 94%)

The original function was named mark_failed_threshold and has been renamed to try_incident_threshold. The original occurrence creation function was named create_issue_platform_occurrence and has been renamed to create_incident_occurrence. These are not just internal refactors — any code referencing the old names (tests, signal handlers, external modules) will break.

Evidence:

  • Original mark_failed.py had def mark_failed_threshold(...) — now def try_incident_threshold(...) in incidents.py
  • Original mark_failed.py had def create_issue_platform_occurrence(...) — now def create_incident_occurrence(...) in incident_occurrence.py
  • Acceptance criteria: 'The net behavior of mark_failed.py is identical before and after the refactor'

Agent: logic

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (22 lines, 1 file) — review recommended)

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import logging
+import warnings
 from datetime import datetime
 from typing import cast
 
@@ -101,3 +103,23 @@ def try_incident_threshold(
     monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
 
     return True
+
+
+def mark_failed_threshold(
+    failed_checkin: MonitorCheckIn,
+    failure_issue_threshold: int,
+    received: datetime | None,
+) -> bool:
+    """
+    Deprecated: Use `try_incident_threshold` instead.
+
+    Backward-compatible alias preserved because external callers (tests,
+    signal handlers, other modules) may still reference the original name
+    from mark_failed.py. This alias ensures the refactor does not break
+    the public API contract while callers are migrated.
+    """
+    warnings.warn(
+        "mark_failed_threshold is deprecated; use try_incident_threshold instead.",
+        DeprecationWarning,
+        stacklevel=2,
+    )
+    return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

🤖 Grapple PR auto-fix • major • Review this diff before applying
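
To check that a deprecation wrapper of this shape behaves as intended, the warning can be captured with the standard `warnings` module. This is a minimal sketch with simplified one-argument stand-ins for both functions; the real signatures take a check-in, a threshold, and a timestamp:

```python
import warnings


def try_incident_threshold(failure_count: int) -> bool:
    # Simplified stand-in for the real function body.
    return failure_count > 0


def mark_failed_threshold(failure_count: int) -> bool:
    # Deprecated wrapper: warn, then delegate to the new name.
    warnings.warn(
        "mark_failed_threshold is deprecated; use try_incident_threshold instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return try_incident_threshold(failure_count)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning, including ones Python would normally filter out.
    warnings.simplefilter("always")
    result = mark_failed_threshold(1)

warning_categories = [w.category for w in caught]
```

Note that Python's default filters hide `DeprecationWarning` unless it is triggered by code running in `__main__`, so callers are most likely to see it under a test runner or with `-W default`.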

🟠 MAJOR — Import path changes breaking test mocks (confidence: 94%)

The function mark_failed_threshold was renamed to try_incident_threshold and create_issue_platform_occurrence was renamed to create_incident_occurrence. Any existing tests that patch these functions by their original names and module paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold') or @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will fail with AttributeError when the patch starts, because mock.patch requires the target attribute to exist unless create=True is passed. With create=True, the patch would apply to a name that mark_failed() no longer calls, and tests could pass or fail for the wrong reasons. The acceptance criteria state 'All existing tests for mark_failed.py continue to pass without modification', but this rename makes that unlikely if any tests reference the old function names or paths.

Evidence:

  • Original function in mark_failed.py was named mark_failed_threshold, renamed to try_incident_threshold in incidents.py
  • Original function create_issue_platform_occurrence renamed to create_incident_occurrence in incident_occurrence.py
  • The intent specification says 'No logic is lost or altered during the extraction' and 'All existing tests for mark_failed.py continue to pass without modification'
  • Test files are not included in this PR, so we cannot confirm whether mocks referencing old paths exist

Agent: logic
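
How `unittest.mock` reacts to a patch target that no longer exists can be checked in isolation. This is a minimal sketch, assuming a `types.SimpleNamespace` as a hypothetical stand-in for the refactored `mark_failed` module, which now exposes only the new name:

```python
import types
from unittest import mock

# Hypothetical stand-in for mark_failed.py after the rename.
refactored_module = types.SimpleNamespace(try_incident_threshold=lambda: True)

patch_error = None
try:
    # The old name is gone, so entering the patch raises AttributeError.
    with mock.patch.object(refactored_module, "mark_failed_threshold"):
        pass
except AttributeError as exc:
    patch_error = exc

# create=True suppresses the error, but the mock then sits on a name nobody calls.
with mock.patch.object(refactored_module, "mark_failed_threshold", create=True) as fake:
    still_real = refactored_module.try_incident_threshold()
    fake_called = fake.called
```

The first case is the loud failure a stale patch path produces by default; the second shows the quieter risk, where the real function still runs while the mock is never touched.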

    failed_checkin: MonitorCheckIn,
    failure_issue_threshold: int,
    received: datetime | None,
) -> bool:

🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 100%)

The extracted function was renamed from mark_failed_threshold to try_incident_threshold, which changes the internal API surface. While mark_failed.py is updated to call the new name, any other callers in the codebase that directly imported or called mark_failed_threshold will break. The blast radius analysis only covers the 4 changed files, but a codebase-wide search for mark_failed_threshold references (tests, tasks, other modules) is needed to confirm no breakage.

Evidence:

  • Original function in mark_failed.py was named mark_failed_threshold
  • New function in incidents.py is named try_incident_threshold
  • No re-export or alias is provided in mark_failed.py for backward compatibility
  • Tests that directly import mark_failed_threshold from mark_failed will fail

Agent: architecture

🟡 [Grapple PR] Suggested fix — architecture agent (Small fix (7 lines, 1 file))

--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
 
+from sentry.monitors.logic.incidents import try_incident_threshold
+
 logger = logging.getLogger(__name__)
@@ -10,6 +12,10 @@ logger = logging.getLogger(__name__)
+# Backward-compatibility alias: `mark_failed_threshold` was extracted and
+# renamed to `try_incident_threshold` in incidents.py. Any existing callers
+# (tests, tasks, other modules) that imported `mark_failed_threshold` from
+# this module will continue to work without modification.
+mark_failed_threshold = try_incident_threshold

🤖 Grapple PR auto-fix • major • Review this diff before applying
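
A plain alias keeps old imports working, but it does not keep old patch targets effective. The sketch below illustrates this under stated assumptions: a throwaway module object named `fake_mark_failed` (hypothetical) holds simplified zero-argument versions of the functions, and `mark_failed` calls the new name, as the refactored code does:

```python
import types
from unittest import mock

source = '''
def try_incident_threshold():
    return "real"

# Backward-compatibility alias under the old name.
mark_failed_threshold = try_incident_threshold

def mark_failed():
    # The refactored caller looks up the new name.
    return try_incident_threshold()
'''

# Build a throwaway module so name lookups behave like a real module's globals.
fake_module = types.ModuleType("fake_mark_failed")
exec(source, fake_module.__dict__)

# Old-style direct use still works through the alias.
from_old_name = fake_module.mark_failed_threshold()

# Patching the alias does not intercept the call made inside mark_failed().
with mock.patch.object(fake_module, "mark_failed_threshold", return_value="mocked"):
    patched_alias_result = fake_module.mark_failed()

# Patching the name that mark_failed() actually looks up does.
with mock.patch.object(fake_module, "try_incident_threshold", return_value="mocked"):
    patched_new_name_result = fake_module.mark_failed()
```

So an alias addresses import breakage only; tests that patch the old path would still need their patch targets updated to the new name.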

    from sentry.signals import monitor_environment_failed

🟡 MINOR — Naming conventions (confidence: 95%)

Function name try_incident_threshold uses a 'try_' prefix which typically suggests an operation that may fail gracefully and returns a success/failure boolean. However, the name doesn't clearly indicate it may create incidents. Consider if check_incident_threshold or handle_incident_threshold would be more precise.

Evidence:

  • The function both checks thresholds AND creates incidents/occurrences with side effects (database writes, Kafka calls)
  • The 'try_' prefix typically suggests non-throwing error handling, but this function performs substantial work
  • This naming pattern could be confusing to future maintainers about the function's scope

Agent: style

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Suggested change
def handle_incident_threshold(

🤖 Grapple PR auto-fix • minor • confidence: 95%

    monitor_env = failed_checkin.monitor_environment

    if monitor_env is None:
        return False

    # check to see if we need to update the status
    if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
        if failure_issue_threshold == 1:
            previous_checkins: list[SimpleCheckIn] = [
                {
                    "id": failed_checkin.id,
                    "date_added": failed_checkin.date_added,
                    "status": failed_checkin.status,
                }
            ]
        else:
            previous_checkins = cast(
                list[SimpleCheckIn],
                # Using .values for performance reasons
                MonitorCheckIn.objects.filter(
                    monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
                )
                .order_by("-date_added")
                .values("id", "date_added", "status"),
            )

🟡 MINOR — Logic Error / Missing Filter (confidence: 100%)

When failure_issue_threshold > 1, the query is built with no limit and the slice [:failure_issue_threshold] appears a few lines later inside list(reversed(...)), which makes it look as if every matching check-in is loaded before slicing. That is not what happens at runtime: cast() has no runtime effect, so previous_checkins is still a lazy QuerySet when it is sliced, and Django translates the slice into a SQL LIMIT before the query executes. The concern is therefore readability rather than performance. Applying the slice directly to the queryset expression would make the DB-level limit explicit and match the intent of using .values() 'for performance reasons'.

Evidence:

  • Lines 40-46: MonitorCheckIn.objects.filter(...).order_by('-date_added').values(...) builds the queryset with no slice on the expression itself
  • Line 48: previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold])) slices the still-unevaluated queryset, so the limit is applied in SQL
  • The cast to list[SimpleCheckIn] hides the fact that the value is a queryset at the point where it is sliced
  • Original code in mark_failed.py had the same pattern, faithfully copied, so behavior is unchanged by this refactor

Agent: security

[Grapple PR] Auto-fix — security agent (Small fix (4 lines, 1 file))

Suggested change
.values("id", "date_added", "status")[:failure_issue_threshold],
previous_checkins = list(reversed(previous_checkins))

🤖 Grapple PR auto-fix • minor • confidence: 100%

            # reverse the list after slicing in order to start with oldest check-in
            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))

            # If we have any successful check-ins within the threshold of
            # commits we have NOT reached an incident state
            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
                return False

        # change monitor status + update fingerprint timestamp
        monitor_env.status = MonitorStatus.ERROR
        monitor_env.save(update_fields=("status",))

        starting_checkin = previous_checkins[0]

🟠 MAJOR — Logic Error / Potential IndexError (confidence: 100%)

When failure_issue_threshold > 1, previous_checkins is populated from a DB query. If the query returns 0 rows (race condition, data deleted), the subsequent previous_checkins[0] at line 60 raises an IndexError, causing an unhandled exception. Also, the query does not filter out OK check-ins before slicing — it fetches the N most recent and then checks if any are OK, but if there are fewer than N check-ins total, the threshold check is applied to fewer items than intended.

Evidence:

  • Lines 42-50: DB query fetches check-ins with date_added__lte=failed_checkin.date_added, no guarantee of non-empty result
  • Line 52: previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold])) — empty if DB returns nothing
  • Line 60: starting_checkin = previous_checkins[0] — IndexError if list is empty
  • No guard between lines 56 and 60 to handle empty previous_checkins
  • This was present in original code but is a real bug that should be fixed during this refactor

Agent: security

🟡 [Grapple PR] Suggested fix — security agent (Larger fix (12 lines, 1 file) — review recommended)

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -37,16 +37,25 @@
         else:
             previous_checkins = cast(
                 list[SimpleCheckIn],
                 # Using .values for performance reasons
                 MonitorCheckIn.objects.filter(
                     monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
                 )
+                .exclude(status=CheckInStatus.OK)
                 .order_by("-date_added")
+                [: failure_issue_threshold]
                 .values("id", "date_added", "status"),
             )
 
             # reverse the list after slicing in order to start with oldest check-in
-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+            previous_checkins = list(reversed(previous_checkins))
 
-            # If we have any successful check-ins within the threshold of
-            # commits we have NOT reached an incident state
-            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+            # If we don't have enough consecutive failures to meet the threshold,
+            # we have NOT reached an incident state.
+            # Also guard against an empty result (race condition / data deleted)
+            # which would cause IndexError at previous_checkins[0] below.
+            if len(previous_checkins) < failure_issue_threshold:
                 return False
 
         # change monitor status + update fingerprint timestamp

🤖 Grapple PR auto-fix • major • Review this diff before applying
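
One point to review in the diff above: excluding OK check-ins in the query means the N most recent failures are counted even when successful check-ins sit between them, so the check no longer requires consecutive failures. The sketch below contrasts the two behaviors on plain dicts. It is an illustration under assumptions: both function names are hypothetical, string statuses replace `CheckInStatus`, and whether a history shorter than the threshold should count as an incident is a product decision this sketch does not settle (here it returns `False`):

```python
def consecutive_failures_reached(checkins_newest_first: list[dict], threshold: int) -> bool:
    """True only if the newest `threshold` check-ins exist and none of them is OK."""
    window = checkins_newest_first[:threshold]
    # Fewer rows than the threshold (including none at all) is not an incident;
    # this guard also protects any later window[0] access.
    if len(window) < threshold:
        return False
    # Any OK check-in inside the window breaks the failure streak.
    return all(c["status"] != "ok" for c in window)


def exclude_ok_variant(checkins_newest_first: list[dict], threshold: int) -> bool:
    """Mirrors filtering OK rows out before slicing, as the suggested diff does."""
    failures = [c for c in checkins_newest_first if c["status"] != "ok"][:threshold]
    return len(failures) >= threshold


# Newest first: a failure, then a success, then two older failures.
history = [
    {"id": 4, "status": "error"},
    {"id": 3, "status": "ok"},
    {"id": 2, "status": "error"},
    {"id": 1, "status": "error"},
]
```

With a threshold of 3, the first function reports no incident for `history` because of the intervening success, while the exclude-OK variant reports one.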

        incident: MonitorIncident | None
        incident, _ = MonitorIncident.objects.get_or_create(
            monitor_environment=monitor_env,
            resolving_checkin=None,
            defaults={
                "monitor": monitor_env.monitor,
                "starting_checkin_id": starting_checkin["id"],
                "starting_timestamp": starting_checkin["date_added"],
            },
        )

    elif monitor_env.status == MonitorStatus.ERROR:
        # if monitor environment has a failed status, use the failed
        # check-in and send occurrence
        previous_checkins = [
            {
                "id": failed_checkin.id,
                "date_added": failed_checkin.date_added,
                "status": failed_checkin.status,
            }
        ]

        # get the active incident from the monitor environment
        incident = monitor_env.active_incident
    else:
        # don't send occurrence for other statuses
        return False

    # Only create an occurrence if:
    # - We have an active incident and fingerprint
    # - The monitor and env are not muted
    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
        checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
        for checkin in checkins:
            create_incident_occurrence(
                previous_checkins,
                checkin,
                incident,
                received=received,
            )

    monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))

    return True

🔵 INFO — Refactor Completeness (confidence: 100%)

The function was renamed from mark_failed_threshold to try_incident_threshold during extraction. While this is a reasonable rename for clarity, any external callers or tests that directly imported mark_failed_threshold from mark_failed.py would break. The intent says 'The public API surface consumed by callers of mark_failed.py should remain unchanged or be backward compatible.' If mark_failed_threshold was only called internally within mark_failed, this is fine. But if any test or module imported it directly, a backward-compatible alias should be provided.

Evidence:

  • Original: def mark_failed_threshold(...) in mark_failed.py
  • New: def try_incident_threshold(...) in incidents.py
  • mark_failed.py now calls try_incident_threshold instead of the old internal function
  • The top-level mark_failed() function signature is preserved, which is the primary public API

Agent: logic
