Skip to content

ref(crons): Reorganize incident creation / issue occurrence logic#3

Open
Frawatson wants to merge 1 commit into
monitor-incident-refactor-beforefrom
monitor-incident-refactor-after
Open

ref(crons): Reorganize incident creation / issue occurrence logic#3
Frawatson wants to merge 1 commit into
monitor-incident-refactor-beforefrom
monitor-incident-refactor-after

Conversation

@Frawatson

Copy link
Copy Markdown
Owner

Benchmark test PR #8 - Replicating Greptile benchmark for Grapple PR evaluation.

…0528)

Since we'll be doing more with issue occurrences split out the concept
of incidents into it's own logic module, as well as incident_occurrence
into it's own module

Part of GH-80527
@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

🤖 Grapple PR Review

Summary

This PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py. This is a structural refactor to improve separation of concerns in the monitors subsystem.

Changes: 4 files (+289 / -264 lines)

  • 🟢 2 files added: src/sentry/monitors/logic/incident_occurrence.py, src/sentry/monitors/logic/incidents.py
  • 🟡 2 files modified: src/sentry/monitors/logic/mark_failed.py, src/sentry/monitors/types.py

Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)

CI: ⏳ 0 passed, 0 failed

Impact Diagram

graph TD
  pr_0[/"🔀 PR Changes: 4 files"/]
  f1["🟢 incident_occurrence.py +171/-0"]
  pr_0 --> f1
  f2["🟢 incidents.py +104/-0"]
  pr_0 --> f2
  f3["🟡 mark_failed.py +4/-264"]
  pr_0 --> f3
  f4["🟡 types.py +10/-0"]
  pr_0 --> f4
  f1 -.- f2
  f2 -.- f3
  style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
Loading

3 major issues found — review recommended.

Severity Count
🟠 Major 3
🟡 Minor 6
🔵 Info 5
Review details
  • Reviewed at: 2026-04-07T18:46:24.115Z
  • Agents used: security, logic, style, architecture, performance, verification
  • Context level: full
  • Execution time: 57.1s
  • Low-confidence findings suppressed: 1
  • Blast radius: 0 (contained)

Powered by Grapple PR — AI that understands your code.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MAJORsrc/sentry/monitors/logic/incidents.py:1

Two new modules are added under src/sentry/monitors/logic/ but there is no change to an __init__.py to export them. If the logic package's __init__.py has explicit __all__ or explicit imports that control the public surface, these new modules won't be discoverable. This is a minor concern — Python can still import them directly — but it could violate existing package conventions.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MAJORsrc/sentry/monitors/logic/incidents.py:79

A separate database query is issued for each check-in ID collected in previous_checkins, followed by a loop that calls create_incident_occurrence per check-in. When failure_issue_threshold is large, this results in N individual Kafka produce calls and an extra DB round-trip. The original code had the same pattern, so this is not a regression introduced by the refactor, but it is worth flagging.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MAJORsrc/sentry/monitors/logic/mark_failed.py:6

After the refactor, MonitorEnvironment is still imported in mark_failed.py. While this may be used in the remaining code (lines not shown in the diff context), it's worth verifying. More importantly, MonitorStatus was removed from the import but is likely still needed by the remaining mark_failed function body which checks/sets monitor environment statuses. However, looking at the remaining code more carefully, the status checks appear to have been moved to incidents.py, so this is likely fine.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/incidents.py:14

The function was renamed from mark_failed_threshold to try_incident_threshold during the refactor. Any external callers or tests that imported mark_failed_threshold directly from mark_failed.py will break. While the intent says the public API of mark_failed (the top-level function) is preserved, the internal helper had a known name. If any test or module references mark_failed_threshold by name, this is a breaking change.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/incident_occurrence.py:108

get_monitor_environment_context copies monitor_environment.monitor.config and conditionally replaces schedule_type, but the result is never stored in occurrence — it is only placed in the event_data dict. The copy is cheap, but config is assigned into the local config variable yet the returned dict still uses monitor_environment.monitor.config (the original), not the modified copy. This is a logic issue that also represents unnecessary work.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/incident_occurrence.py:29

The function was renamed from create_issue_platform_occurrence to create_incident_occurrence during the refactor. Any external callers or tests that referenced the original name will break.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/mark_failed.py:7

After the refactor, mark_failed.py still imports MonitorEnvironment but it is unclear from the visible diff whether the remaining code (lines 1-76) still uses it. If the remaining mark_failed function body references MonitorEnvironment, this is fine; otherwise it's an unused import.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/incident_occurrence.py:20

incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is the new canonical location. This is correct. However, incidents.py defines its own import of SimpleCheckIn also from sentry.monitors.types. Both new modules correctly import from the new types.py location - this is fine.

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

MINORsrc/sentry/monitors/logic/incident_occurrence.py:130

Typo in docstring: 'humam readible' should be 'human readable'

@grapple-pr

grapple-pr Bot commented Apr 7, 2026

Copy link
Copy Markdown

INFOsrc/sentry/monitors/logic/incidents.py:7

The dependency direction is mark_failed.py → incidents.py → incident_occurrence.py, which forms a clean unidirectional chain with no circular imports. types.py is a leaf dependency used by both new modules. This is a well-structured decomposition.

@Frawatson Frawatson closed this Apr 7, 2026
@Frawatson Frawatson reopened this Apr 7, 2026

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Grapple PR Review

Summary

This PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py, reducing mark_failed.py by ~264 lines.

Changes: 4 files (+289 / -264 lines)

  • 🟢 2 files added: src/sentry/monitors/logic/incident_occurrence.py, src/sentry/monitors/logic/incidents.py
  • 🟡 2 files modified: src/sentry/monitors/logic/mark_failed.py, src/sentry/monitors/types.py

Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)

CI: ⏳ 0 passed, 0 failed

Impact Diagram

graph TD
  pr_0[/"🔀 PR Changes: 4 files"/]
  f1["🟢 incident_occurrence.py +171/-0"]
  pr_0 --> f1
  f2["🟢 incidents.py +104/-0"]
  pr_0 --> f2
  f3["🟡 mark_failed.py +4/-264"]
  pr_0 --> f3
  f4["🟡 types.py +10/-0"]
  pr_0 --> f4
  f1 -.- f2
  f2 -.- f3
  style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
Loading

2 major issues found — review recommended.

Severity Count
🟠 Major 2
🟡 Minor 3
🔵 Info 1
Review details
  • Reviewed at: 2026-04-07T21:07:19.372Z
  • Agents used: security, logic, style, architecture, performance, verification
  • Context level: full
  • Execution time: 40.2s
  • Low-confidence findings suppressed: 1
  • Blast radius: 0 (contained)

Powered by Grapple PR — AI that understands your code.

"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}
return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Broken external imports (confidence: 100%)

The functions mark_failed_threshold, create_issue_platform_occurrence, get_failure_reason, get_monitor_environment_context, and the type SimpleCheckIn were previously importable from mark_failed.py. Any external callers importing these symbols (e.g., from sentry.monitors.logic.mark_failed import mark_failed_threshold or from sentry.monitors.logic.mark_failed import SimpleCheckIn) will now get ImportError. The PR does not re-export these symbols from mark_failed.py.

Evidence:

  • The original mark_failed.py defined mark_failed_threshold, SimpleCheckIn, create_issue_platform_occurrence, get_failure_reason, get_monitor_environment_context as module-level names.
  • After the refactor, none of these names are re-exported or aliased in mark_failed.py.
  • The acceptance criteria state: 'All public functions/methods previously accessible via mark_failed.py remain importable or are re-exported correctly'.
  • Test files or other modules in the codebase that import these symbols directly from mark_failed.py will break.

Agent: logic

logger = logging.getLogger(__name__)


def try_incident_threshold(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Function rename changes public API (confidence: 94%)

The original function was named mark_failed_threshold and has been renamed to try_incident_threshold. The original occurrence creation function was named create_issue_platform_occurrence and has been renamed to create_incident_occurrence. These are not just internal refactors — any code referencing the old names (tests, signal handlers, external modules) will break.

Evidence:

  • Original mark_failed.py had def mark_failed_threshold(...) — now def try_incident_threshold(...) in incidents.py
  • Original mark_failed.py had def create_issue_platform_occurrence(...) — now def create_incident_occurrence(...) in incident_occurrence.py
  • Acceptance criteria: 'The net behavior of mark_failed.py is identical before and after the refactor'

Agent: logic



def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Code Organization (confidence: 100%)

Inconsistent handling of monitor config in get_monitor_environment_context: the function modifies a shallow copy of config but returns the original unmodified config in the dict

Evidence:

  • Line 156-157: config = monitor_environment.monitor.config.copy() followed by conditional mutation of config['schedule_type']
  • Line 162: 'config': monitor_environment.monitor.config, returns the original, not the modified config variable
  • This means the schedule_type transformation is lost in the returned context

Agent: style

MonitorEnvironment,
MonitorIncident,
)
from sentry.monitors.types import SimpleCheckIn

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Unused import in types.py consumer (confidence: 88%)

incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.

Evidence:

  • incident_occurrence.py line 20: from sentry.monitors.types import SimpleCheckIn
  • incidents.py line 9: from sentry.monitors.types import SimpleCheckIn
  • Both modules use SimpleCheckIn in type annotations and dict construction

Agent: logic

"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Documentation (confidence: 76%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • The function get_failure_reason has a docstring with a spelling error: 'Builds a humam readible string...'
  • This is a public function that may be referenced in documentation or by other developers

Agent: style

MonitorIncident,
MonitorStatus,
)
from sentry.monitors.logic.incidents import try_incident_threshold

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Dead import after refactor (confidence: 75%)

MonitorEnvironment is still imported in mark_failed.py. Verify it is still used in the remaining mark_failed() function body (lines not shown in the diff but presumably still reference it). If not used, it's a dead import.

Evidence:

  • Line 8: from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment
  • The remaining mark_failed function likely still uses MonitorEnvironment for the query at the top of the function

Agent: logic

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 [Grapple PR] 3 auto-fix suggestions available

Apply individually or use GitHub's "Add suggestion to batch""Commit suggestions" to apply all at once.

Commits from Grapple PR are prefixed with [Grapple PR] for easy identification in merge conflicts.



def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Inconsistent handling of monitor config in get_monitor_environment_context: the function modifies a shallow copy of config but returns the original unmodified config in the dict

Suggested change
config = monitor_environment.monitor.config.copy()
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%

MonitorEnvironment,
MonitorIncident,
)
from sentry.monitors.types import SimpleCheckIn

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.

Suggested change
from sentry.monitors.types import SimpleCheckIn
from sentry.monitors.types import SimpleCheckIn

🤖 Grapple PR auto-fix • minor • confidence: 88%

"3 missed check-ins detected"
"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Typo in docstring: 'humam readible' should be 'human readable'

Suggested change
"""
Builds a human readable string from a list of failed check-ins.

🤖 Grapple PR auto-fix • minor • confidence: 76%

"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}
return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (33 lines, 1 file) — review recommended)

The functions mark_failed_threshold, create_issue_platform_occurrence, get_failure_reason, get_monitor_environment_context, and the type SimpleCheckIn were previously importable from mark_failed.py. Any external callers importing these symbols (e.g., from sentry.monitors.logic.mark_failed import mark_failed_threshold or from sentry.monitors.logic.mark_failed import SimpleCheckIn) will now get ImportError. The PR does not re-export these symbols from mark_failed.py.

--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import logging
+from datetime import datetime  # noqa: F401 - re-exported for backward compatibility
 from datetime import datetime
 
 from django.db.models import Q
@@ -10,6 +11,28 @@ from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnviron
 
 logger = logging.getLogger(__name__)
 
+# ---------------------------------------------------------------------------
+# Backward-compatibility re-exports
+#
+# These symbols were previously defined in this module. They have been moved
+# to dedicated sub-modules as part of a refactor. We re-export them here so
+# that any existing import sites of the form:
+#
+#   from sentry.monitors.logic.mark_failed import mark_failed_threshold
+#   from sentry.monitors.logic.mark_failed import SimpleCheckIn
+#   from sentry.monitors.logic.mark_failed import create_issue_platform_occurrence
+#   from sentry.monitors.logic.mark_failed import get_failure_reason
+#   from sentry.monitors.logic.mark_failed import get_monitor_environment_context
+#
+# continue to work without modification.
+# ---------------------------------------------------------------------------
+from sentry.monitors.logic.incident_occurrence import (  # noqa: F401, E402
+    SimpleCheckIn,
+    create_issue_platform_occurrence,
+    get_failure_reason,
+    get_monitor_environment_context,
+)
+from sentry.monitors.logic.incidents import try_incident_threshold as mark_failed_threshold  # noqa: F401, E402
+
+__all__ = [
+    "mark_failed",
+    "mark_failed_threshold",
+    "SimpleCheckIn",
+    "create_issue_platform_occurrence",
+    "get_failure_reason",
+    "get_monitor_environment_context",
+]
+
 
 def mark_failed(

🤖 Grapple PR auto-fix • major • Review this diff before applying

logger = logging.getLogger(__name__)


def try_incident_threshold(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (22 lines, 1 file) — review recommended)

The original function was named mark_failed_threshold and has been renamed to try_incident_threshold. The original occurrence creation function was named create_issue_platform_occurrence and has been renamed to create_incident_occurrence. These are not just internal refactors — any code referencing the old names (tests, signal handlers, external modules) will break.

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import logging
+import warnings
 from datetime import datetime
 from typing import cast
 
@@ -101,3 +103,23 @@ def try_incident_threshold(
     monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
 
     return True
+
+
+def mark_failed_threshold(
+    failed_checkin: MonitorCheckIn,
+    failure_issue_threshold: int,
+    received: datetime | None,
+) -> bool:
+    """
+    Deprecated: Use `try_incident_threshold` instead.
+
+    Backward-compatible alias preserved because external callers (tests,
+    signal handlers, other modules) may still reference the original name
+    from mark_failed.py. This alias ensures the refactor does not break
+    the public API contract while callers are migrated.
+    """
+    warnings.warn(
+        "mark_failed_threshold is deprecated; use try_incident_threshold instead.",
+        DeprecationWarning,
+        stacklevel=2,
+    )
+    return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

🤖 Grapple PR auto-fix • major • Review this diff before applying

@Frawatson Frawatson closed this Apr 7, 2026
@Frawatson Frawatson reopened this Apr 7, 2026

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Grapple PR Review

Summary

This PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py, reducing mark_failed.py by 264 lines while adding 275 lines across the new files.

Changes: 4 files (+289 / -264 lines)

  • 🟢 2 files added: src/sentry/monitors/logic/incident_occurrence.py, src/sentry/monitors/logic/incidents.py
  • 🟡 2 files modified: src/sentry/monitors/logic/mark_failed.py, src/sentry/monitors/types.py

Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)

CI: ⏳ 0 passed, 0 failed

Impact Diagram

graph TD
  pr_0[/"🔀 PR Changes: 4 files"/]
  f1["🟢 incident_occurrence.py +171/-0"]
  pr_0 --> f1
  f2["🟢 incidents.py +104/-0"]
  pr_0 --> f2
  f3["🟡 mark_failed.py +4/-264"]
  pr_0 --> f3
  f4["🟡 types.py +10/-0"]
  pr_0 --> f4
  f1 -.- f2
  f2 -.- f3
  style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
Loading

1 critical issue found — please address before merging.

Severity Count
🔴 Critical 1
🟠 Major 5
🟡 Minor 6
🔵 Info 3
Review details
  • Reviewed at: 2026-04-08T00:47:12.319Z
  • Agents used: security, logic, style, architecture, performance, verification
  • Context level: full
  • Execution time: 104.7s
  • Low-confidence findings suppressed: 4
  • Blast radius: 0 (contained)

Powered by Grapple PR — AI that understands your code.

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Data flow / Bug (confidence: 100%)

get_monitor_environment_context builds a local config variable with the schedule_type overwritten, but then the returned dictionary uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.

Evidence:

  • Line 158: config = monitor_environment.monitor.config.copy()
  • Line 159-160: if 'schedule_type' in config: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display()
  • Line 165: 'config': monitor_environment.monitor.config, — uses the original config, not the local config variable

Agent: logic

failed_checkin: MonitorCheckIn,
failure_issue_threshold: int,
received: datetime | None,
) -> bool:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 100%)

The extracted function was renamed from mark_failed_threshold to try_incident_threshold, which changes the internal API surface. While mark_failed.py is updated to call the new name, any other callers in the codebase that directly imported or called mark_failed_threshold will break. The blast radius analysis only covers the 4 changed files, but a codebase-wide search for mark_failed_threshold references (tests, tasks, other modules) is needed to confirm no breakage.

Evidence:

  • Original function in mark_failed.py was named mark_failed_threshold
  • New function in incidents.py is named try_incident_threshold
  • No re-export or alias is provided in mark_failed.py for backward compatibility
  • Tests that directly import mark_failed_threshold from mark_failed will fail

Agent: architecture

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / Data Integrity (confidence: 100%)

The config copy with humanized schedule_type is computed but never used. The returned dict references monitor_environment.monitor.config (the original dict) instead of the local config variable, so the schedule_type transformation is silently discarded.

Evidence:

  • Line 155: config = monitor_environment.monitor.config.copy() — copy is made and potentially modified
  • Line 157: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display() — written to local config
  • Line 160: 'config': monitor_environment.monitor.config — returns the ORIGINAL unmodified dict, not config
  • This bug exists verbatim in the original mark_failed.py and was faithfully copied, but it means the monitor context sent with every occurrence always has the raw integer schedule_type instead of the human-readable string

Agent: security

status_counts = Counter(
checkin["status"]
for checkin in failed_checkins
if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Edge case / KeyError risk (confidence: 100%)

In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.

Evidence:

  • Lines 136-139: status_counts = Counter(checkin['status'] for checkin in failed_checkins if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys())
  • If no check-ins match, status_counts is empty and sum(status_counts.values()) == 0
  • Falls through to get_text_list([], last_word=_('and')) which returns ''
  • Returns _('%(problem_checkins)s check-ins detected') % {'problem_checkins': ''}' check-ins detected'
  • This is then surfaced as the 'Failure reason' evidence in the issue occurrence shown to users
  • HUMAN_FAILURE_STATUS_MAP has keys: ERROR, MISSED, TIMEOUT
  • SINGULAR_HUMAN_FAILURE_MAP has keys: ERROR, MISSED, TIMEOUT
  • If these maps ever diverge, line 142 would raise KeyError
  • Also: if all check-ins are filtered out (status_counts is empty), sum == 0, so the code goes to get_text_list with an empty list, producing a malformed string like ' check-ins detected'

Agent: logic

incident: MonitorIncident,
received: datetime | None,
) -> None:
from sentry.issues.issue_occurrence import IssueEvidence, IssueOccurrence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 99%)

The occurrence creation function was renamed from create_issue_platform_occurrence to create_incident_occurrence. Any code outside mark_failed.py that imports or references create_issue_platform_occurrence by name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.

Evidence:

  • Original function was create_issue_platform_occurrence in mark_failed.py
  • New function is create_incident_occurrence in incident_occurrence.py
  • Tests commonly mock internal functions like create_issue_platform_occurrence by path; those mocks will silently pass (no-op) or fail

Agent: architecture

received: datetime | None,
) -> bool:
from sentry.signals import monitor_environment_failed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Naming conventions (confidence: 95%)

Function name try_incident_threshold uses a 'try_' prefix which typically suggests an operation that may fail gracefully and returns a success/failure boolean. However, the name doesn't clearly indicate it may create incidents. Consider if check_incident_threshold or handle_incident_threshold would be more precise.

Evidence:

  • The function both checks thresholds AND creates incidents/occurrences with side effects (database writes, Kafka calls)
  • The 'try_' prefix typically suggests non-throwing error handling, but this function performs substantial work
  • This naming pattern could be confusing to future maintainers about the function's scope

Agent: style


id: int
date_added: datetime
status: int

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Type Definition Location (confidence: 81%)

SimpleCheckIn is correctly defined in types.py but is also still used by incidents.py which imports it from types.py. Verify that no residual definition of SimpleCheckIn exists in the old mark_failed.py. The diff confirms it was removed — this is good. However, the docstring says 'A stripped down check in object' which could be more descriptive about its purpose (it represents check-ins used for incident threshold evaluation).

Evidence:

  • SimpleCheckIn moved from mark_failed.py to types.py
  • Imported correctly in both incidents.py (line 9) and incident_occurrence.py (line 20)

Agent: architecture


monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))

return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Refactor Completeness (confidence: 100%)

The function was renamed from mark_failed_threshold to try_incident_threshold during extraction. While this is a reasonable rename for clarity, any external callers or tests that directly imported mark_failed_threshold from mark_failed.py would break. The intent says 'The public API surface consumed by callers of mark_failed.py should remain unchanged or be backward compatible.' If mark_failed_threshold was only called internally within mark_failed, this is fine. But if any test or module imported it directly, a backward-compatible alias should be provided.

Evidence:

  • Original: def mark_failed_threshold(...) in mark_failed.py
  • New: def try_incident_threshold(...) in incidents.py
  • mark_failed.py now calls try_incident_threshold instead of the old internal function
  • The top-level mark_failed() function signature is preserved, which is the primary public API

Agent: logic

from typing import cast

from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Import Hygiene (confidence: 85%)

The MonitorStatus import is used in incidents.py but not explicitly imported in the file's imports section — actually it is imported on line 8. However, MonitorIncident is imported but only used as a type annotation for incident variable and in MonitorIncident.objects.get_or_create. This is fine, but note that CheckInStatus and MonitorCheckIn are also imported and used. All imports appear valid.

Evidence:

  • Line 8: from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
  • All four are used within the function body

Agent: logic

"2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
"A failed check-in was detected"
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Documentation (confidence: 82%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • Line 130 contains: 'Builds a humam readible string from a list of failed check-ins.'
  • This is a documentation typo that should be corrected for clarity

Agent: style

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 [Grapple PR] 8 auto-fix suggestions available

Apply individually or use GitHub's "Add suggestion to batch""Commit suggestions" to apply all at once.

Commits from Grapple PR are prefixed with [Grapple PR] for easy identification in merge conflicts.

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

get_monitor_environment_context builds a local config variable with the schedule_type overwritten, but then the returned dictionary uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.

Suggested change
"id": str(monitor_environment.monitor.guid),
"config": config,

🤖 Grapple PR auto-fix • critical • confidence: 100%

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))

The config copy with humanized schedule_type is computed but never used. The returned dict references monitor_environment.monitor.config (the original dict) instead of the local config variable, so the schedule_type transformation is silently discarded.

Suggested change
"id": str(monitor_environment.monitor.guid),
"config": config,

🤖 Grapple PR auto-fix • major • confidence: 100%

from datetime import datetime, timezone
from typing import TYPE_CHECKING, TypedDict, cast
from datetime import datetime

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

The MonitorEnvironment import is retained in mark_failed.py but after the refactor it is no longer used directly in this file. The mark_failed function receives MonitorCheckIn and accesses monitor_environment through it; MonitorEnvironment is not referenced as a type annotation or used anywhere in the remaining code of mark_failed.py.

Suggested change
from sentry.monitors.models import CheckInStatus, MonitorCheckIn

🤖 Grapple PR auto-fix • major • confidence: 89%

return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}


def get_monitor_environment_context(monitor_environment: MonitorEnvironment):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

In get_monitor_environment_context, the local variable config is computed and modified (line 152-153: config = monitor_environment.monitor.config.copy() and the schedule_type override), but the returned dict uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This is a pre-existing bug that was faithfully moved from mark_failed.py, but it means the schedule_type display override is silently discarded.

Suggested change
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%



def mark_failed(
failed_checkin: MonitorCheckIn,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (5 lines, 1 file))

SimpleCheckIn TypedDict was previously exported from sentry.monitors.logic.mark_failed and is now only defined in sentry.monitors.types. Any code outside this PR that imported SimpleCheckIn from mark_failed will get an ImportError at runtime. No re-export shim is provided.

Suggested change
failed_checkin: MonitorCheckIn,
# Re-exported for backward compatibility: SimpleCheckIn was previously defined
# in this module. It has been moved to sentry.monitors.types as part of the
# incident logic refactor. External callers importing from this path will
# continue to work without modification.
from sentry.monitors.types import SimpleCheckIn as SimpleCheckIn # noqa: F401

🤖 Grapple PR auto-fix • minor • confidence: 100%

.order_by("-date_added")
.values("id", "date_added", "status"),
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (4 lines, 1 file))

When failure_issue_threshold > 1, the DB query fetches check-ins filtered only by date_added__lte=failed_checkin.date_added with no upper bound on count before the query executes. The slice [:failure_issue_threshold] is applied in Python after the full queryset is evaluated via list(reversed(...)), meaning Django will fetch ALL matching check-ins into memory before slicing. This is a performance regression risk for monitors with long histories, and diverges from the intent of using .values() 'for performance reasons'.

Suggested change
.values("id", "date_added", "status")[:failure_issue_threshold],
previous_checkins = list(reversed(previous_checkins))

🤖 Grapple PR auto-fix • minor • confidence: 100%

received: datetime | None,
) -> bool:
from sentry.signals import monitor_environment_failed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Function name try_incident_threshold uses a 'try_' prefix which typically suggests an operation that may fail gracefully and returns a success/failure boolean. However, the name doesn't clearly indicate it may create incidents. Consider if check_incident_threshold or handle_incident_threshold would be more precise.

Suggested change
def handle_incident_threshold(

🤖 Grapple PR auto-fix • minor • confidence: 95%


id: int
date_added: datetime
status: int

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — architecture agent (Small fix (4 lines, 1 file))

SimpleCheckIn is correctly defined in types.py but is also still used by incidents.py which imports it from types.py. Verify that no residual definition of SimpleCheckIn exists in the old mark_failed.py. The diff confirms it was removed — this is good. However, the docstring says 'A stripped down check in object' which could be more descriptive about its purpose (it represents check-ins used for incident threshold evaluation).

Suggested change
status: int
A minimal check-in representation used for incident threshold evaluation.
Contains only the fields required to assess monitor failure and recovery
conditions in incident and occurrence processing logic.

🤖 Grapple PR auto-fix • minor • confidence: 81%

failed_checkin: MonitorCheckIn,
failure_issue_threshold: int,
received: datetime | None,
) -> bool:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Small fix (7 lines, 1 file))

The extracted function was renamed from mark_failed_threshold to try_incident_threshold, which changes the internal API surface. While mark_failed.py is updated to call the new name, any other callers in the codebase that directly imported or called mark_failed_threshold will break. The blast radius analysis only covers the 4 changed files, but a codebase-wide search for mark_failed_threshold references (tests, tasks, other modules) is needed to confirm no breakage.

--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
 
+from sentry.monitors.logic.incidents import try_incident_threshold
+
 logger = logging.getLogger(__name__)
@@ -10,6 +12,10 @@ logger = logging.getLogger(__name__)
+# Backward-compatibility alias: `mark_failed_threshold` was extracted and
+# renamed to `try_incident_threshold` in incidents.py. Any existing callers
+# (tests, tasks, other modules) that imported `mark_failed_threshold` from
+# this module will continue to work without modification.
+mark_failed_threshold = try_incident_threshold

🤖 Grapple PR auto-fix • major • Review this diff before applying

status_counts = Counter(
checkin["status"]
for checkin in failed_checkins
if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (16 lines, 1 file) — review recommended)

In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.

--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -139,10 +139,18 @@ def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
     status_counts = Counter(
         checkin["status"]
         for checkin in failed_checkins
         if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
     )
 
+    total = sum(status_counts.values())
+
+    # No check-ins with recognized failure statuses; return a generic fallback
+    # rather than producing a malformed string like ' check-ins detected'.
+    if total == 0:
+        return _("A failed check-in was detected")
+
-    if sum(status_counts.values()) == 1:
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+    if total == 1:
+        status_key = list(status_counts.keys())[0]
+        # Defensive .get() in case SINGULAR_HUMAN_FAILURE_MAP and
+        # HUMAN_FAILURE_STATUS_MAP ever diverge (they currently share keys).
+        return SINGULAR_HUMAN_FAILURE_MAP.get(
+            status_key, _("A failed check-in was detected")
+        )
 
     human_status = get_text_list(
         [

🤖 Grapple PR auto-fix • major • Review this diff before applying

incident: MonitorIncident,
received: datetime | None,
) -> None:
from sentry.issues.issue_occurrence import IssueEvidence, IssueOccurrence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (11 lines, 1 file) — review recommended)

The occurrence creation function was renamed from create_issue_platform_occurrence to create_incident_occurrence. Any code outside mark_failed.py that imports or references create_issue_platform_occurrence by name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.

--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -168,3 +168,13 @@
         "type": monitor_environment.monitor.get_type_display(),
     }
+
+
+# Backward-compatible alias for the previous name of this function when it lived
+# in mark_failed.py. Tests that mock or import `create_issue_platform_occurrence`
+# by path (e.g. via `mock.patch("...incident_occurrence.create_issue_platform_occurrence")`)
+# will continue to work without modification.
+#
+# Deprecated: use `create_incident_occurrence` directly.
+# TODO: Remove this alias once all references to `create_issue_platform_occurrence`
+#       have been updated across tests and callers.
+create_issue_platform_occurrence = create_incident_occurrence

🤖 Grapple PR auto-fix • major • Review this diff before applying

from datetime import datetime
from typing import cast

from sentry.monitors.logic.incident_occurrence import create_incident_occurrence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (12 lines, 1 file) — review recommended)

incidents.py imports directly from incident_occurrence.py, creating a fixed coupling direction. This is acceptable for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency will result. Consider whether the call to create_incident_occurrence should instead be injected or orchestrated from a higher-level caller (like mark_failed.py) to keep the two new modules independent.

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,11 +1,11 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
-from typing import cast
+from typing import Callable, cast
 
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
 from sentry.monitors.types import SimpleCheckIn
 
 logger = logging.getLogger(__name__)
 
 
 def try_incident_threshold(
     failed_checkin: MonitorCheckIn,
     failure_issue_threshold: int,
     received: datetime | None,
+    # Injected to avoid a coupling dependency between incidents.py and
+    # incident_occurrence.py. If incident_occurrence.py ever needs to reference
+    # incident state, a circular import would result. The orchestrating caller
+    # (mark_failed.py) is responsible for wiring this dependency.
+    create_occurrence: Callable = None,
 ) -> bool:
@@ -88,10 +88,13 @@
     # Only create an occurrence if:
     # - We have an active incident and fingerprint
     # - The monitor and env are not muted
-    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident and create_occurrence is not None:
         checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
         for checkin in checkins:
-            create_incident_occurrence(
+            create_occurrence(
                 previous_checkins,
                 checkin,
                 incident,
                 received=received,
             )

🤖 Grapple PR auto-fix • minor • Review this diff before applying

@Frawatson

Copy link
Copy Markdown
Owner Author

@grapple-pr review

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Grapple PR Review

Summary

This PR reorganizes the cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py. This is a refactoring change that improves code organization and separation of concerns.

Changes: 4 files (+289 / -264 lines)

  • 🟢 2 files added: src/sentry/monitors/logic/incident_occurrence.py, src/sentry/monitors/logic/incidents.py
  • 🟡 2 files modified: src/sentry/monitors/logic/mark_failed.py, src/sentry/monitors/types.py

Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)

CI: ⏳ 0 passed, 0 failed

Impact Diagram

graph TD
  pr_0[/"🔀 PR Changes: 4 files"/]
  f1["🟢 incident_occurrence.py +171/-0"]
  pr_0 --> f1
  f2["🟢 incidents.py +104/-0"]
  pr_0 --> f2
  f3["🟡 mark_failed.py +4/-264"]
  pr_0 --> f3
  f4["🟡 types.py +10/-0"]
  pr_0 --> f4
  f1 -.- f2
  f2 -.- f3
  style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
  style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
Loading

5 major issues found — review recommended.

Severity Count
🟠 Major 5
🟡 Minor 5
🔵 Info 1
Review details
  • Reviewed at: 2026-04-12T03:17:25.375Z
  • Agents used: security, logic, style, architecture, performance, verification
  • Context level: full
  • Execution time: 84.2s
  • Low-confidence findings suppressed: 3
  • Blast radius: 0 (contained)

Powered by Grapple PR — AI that understands your code.

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / Potential KeyError or Misleading Output (confidence: 100%)

get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed/empty string like '0 check-ins detected'. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code — it appears the original had a fallback that was not ported.

Evidence:

  • Line 140-144: Counter filters only statuses in HUMAN_FAILURE_STATUS_MAP (ERROR, MISSED, TIMEOUT)
  • Line 146: if sum(status_counts.values()) == 1 — only handles the singular case
  • No else/fallback branch for sum == 0 (empty status_counts)
  • Docstring at line 133 mentions 'A failed check-in was detected' but no code path produces that string — SINGULAR_HUMAN_FAILURE_MAP only covers ERROR, MISSED, TIMEOUT
  • If a check-in with status IN_PROGRESS (or any unrecognized status) is in failed_checkins, get_text_list receives an empty list

Agent: security

monitor_env.save(update_fields=("status",))

starting_checkin = previous_checkins[0]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / Potential IndexError (confidence: 100%)

When failure_issue_threshold > 1, previous_checkins is populated from a DB query. If the query returns 0 rows (race condition, data deleted), the subsequent previous_checkins[0] at line 60 raises an IndexError, causing an unhandled exception. Also, the query does not filter out OK check-ins before slicing — it fetches the N most recent and then checks if any are OK, but if there are fewer than N check-ins total, the threshold check is applied to fewer items than intended.

Evidence:

  • Lines 42-50: DB query fetches check-ins with date_added__lte=failed_checkin.date_added, no guarantee of non-empty result
  • Line 52: previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold])) — empty if DB returns nothing
  • Line 60: starting_checkin = previous_checkins[0] — IndexError if list is empty
  • No guard between lines 56 and 60 to handle empty previous_checkins
  • This was present in original code but is a real bug that should be fixed during this refactor

Agent: security

"config": monitor_environment.monitor.config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error - Config Context Bug Preserved (confidence: 100%)

The get_monitor_environment_context function builds a config copy and conditionally modifies schedule_type, but then returns monitor_environment.monitor.config (the original) instead of the modified config copy. This bug existed in the original mark_failed.py and was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.

Evidence:

  • Line 163: config = monitor_environment.monitor.config.copy()
  • Line 164-165: if 'schedule_type' in config: config['schedule_type'] = ... modifies the copy
  • Line 170: 'config': monitor_environment.monitor.config returns the ORIGINAL, not the modified copy
  • This means the schedule_type display transformation is silently discarded

Agent: logic

logger = logging.getLogger(__name__)


def try_incident_threshold(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Import path changes breaking test mocks (confidence: 94%)

The function mark_failed_threshold was renamed to try_incident_threshold and create_issue_platform_occurrence was renamed to create_incident_occurrence. Any existing tests that mock or patch these functions by their original names and module paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold') or @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop patching and tests may pass or fail for the wrong reasons. The acceptance criteria states 'All existing tests for mark_failed.py continue to pass without modification', but this rename makes that unlikely if any tests reference the old function names/paths.

Evidence:

  • Original function in mark_failed.py was named mark_failed_threshold, renamed to try_incident_threshold in incidents.py
  • Original function create_issue_platform_occurrence renamed to create_incident_occurrence in incident_occurrence.py
  • The intent specification says 'No logic is lost or altered during the extraction' and 'All existing tests for mark_failed.py continue to pass without modification'
  • Test files are not included in this PR, so we cannot confirm whether mocks referencing old paths exist

Agent: logic

"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}
return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — API contract / Test mock breakage (confidence: 94%)

The function mark_failed_threshold has been renamed to try_incident_threshold and moved to a new module. Similarly, create_issue_platform_occurrence has been renamed to create_incident_occurrence and moved. Any existing tests or callers that mock or patch these functions at their old paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold'), @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop intercepting the real function and tests will either fail or pass incorrectly. The function renaming (not just moving) doubles the risk since even callers doing from sentry.monitors.logic.mark_failed import create_issue_platform_occurrence will get an ImportError.

Evidence:

  • Original: mark_failed_threshold in mark_failed.py → now try_incident_threshold in incidents.py
  • Original: create_issue_platform_occurrence in mark_failed.py → now create_incident_occurrence in incident_occurrence.py
  • The intent specification states 'All existing tests for mark_failed.py continue to pass without modification', but function renaming means any mock targeting old names/paths will break
  • No re-exports of old names are provided in mark_failed.py for backward compatibility

Agent: architecture

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Logic Error / Stale Data (confidence: 100%)

get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local config copy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.

Evidence:

  • Line 155: config = monitor_environment.monitor.config.copy() — a copy is made and potentially mutated
  • Line 157: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display() — human-readable value stored in copy
  • Line 163: 'config': monitor_environment.monitor.config — returns the original raw dict, not the transformed copy
  • The caller in create_incident_occurrence at line 86 embeds this context in event_data sent to Kafka, so consumers see the raw schedule_type integer instead of the display string

Agent: security

if sum(status_counts.values()) == 1:
return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]

human_status = get_text_list(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Edge Case - Empty failed_checkins (confidence: 100%)

The get_failure_reason function will raise a KeyError if failed_checkins contains only check-ins with statuses NOT in HUMAN_FAILURE_STATUS_MAP (e.g., all filtered out). In that case status_counts would be empty, sum(status_counts.values()) would be 0, which is not == 1, so it falls through to the get_text_list path with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.

Evidence:

  • If all check-in statuses are filtered out by the if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys() guard, status_counts is an empty Counter
  • sum of empty Counter values is 0, which != 1, so it proceeds to format an empty list
  • get_text_list with an empty list returns '' which then gets interpolated into the message as ' check-ins detected'

Agent: logic

"A failed check-in was detected"
"""

status_counts = Counter(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Documentation (confidence: 86%)

Typo in docstring: 'humam readible' should be 'human readable'

Evidence:

  • The docstring for get_failure_reason contains a spelling error that reduces code clarity
  • This is a public function with a docstring that should follow proper English conventions

Agent: style

from datetime import datetime
from typing import cast

from sentry.monitors.logic.incident_occurrence import create_incident_occurrence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Module boundaries / Dependency direction (confidence: 84%)

incidents.py directly imports from incident_occurrence.py, establishing a one-way dependency. This is architecturally fine for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency would result. Consider whether the create_incident_occurrence call should instead be injected or dispatched from mark_failed.py (the orchestrator) to keep both new modules independent of each other.

Evidence:

  • incidents.py line 7: from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
  • mark_failed.py already serves as the orchestrator; it could call both modules sequentially rather than having incidents.py call incident_occurrence.py

Agent: architecture


id: int
date_added: datetime
status: int

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Type definition placement (confidence: 85%)

The SimpleCheckIn TypedDict was correctly moved from mark_failed.py to the shared types.py module. Both incidents.py and incident_occurrence.py import it from types.py, which is the right pattern for shared type definitions. Good architectural choice.

Evidence:

  • incident_occurrence.py line 20: from sentry.monitors.types import SimpleCheckIn
  • incidents.py line 9: from sentry.monitors.types import SimpleCheckIn

Agent: architecture

@grapple-pr grapple-pr Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 [Grapple PR] 6 auto-fix suggestions available

Apply individually or use GitHub's "Add suggestion to batch""Commit suggestions" to apply all at once.

Commits from Grapple PR are prefixed with [Grapple PR] for easy identification in merge conflicts.

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (3 lines, 1 file))

get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed/empty string like '0 check-ins detected'. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code — it appears the original had a fallback that was not ported.

Suggested change
"id": str(monitor_environment.monitor.guid),
if sum(status_counts.values()) == 0:
return _("A failed check-in was detected")

🤖 Grapple PR auto-fix • major • confidence: 100%

"config": monitor_environment.monitor.config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

The get_monitor_environment_context function builds a config copy and conditionally modifies schedule_type, but then returns monitor_environment.monitor.config (the original) instead of the modified config copy. This bug existed in the original mark_failed.py and was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.

Suggested change
}
"config": config,

🤖 Grapple PR auto-fix • major • confidence: 100%

MonitorEnvironment,
MonitorIncident,
)
from sentry.monitors.types import SimpleCheckIn

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (6 lines, 1 file))

The SimpleCheckIn type is imported from sentry.monitors.types which is correct, but any external code that previously imported SimpleCheckIn from sentry.monitors.logic.mark_failed will break since it has been removed from that module and no re-export is provided.

Suggested change
from sentry.monitors.types import SimpleCheckIn
# Re-export SimpleCheckIn for backward compatibility.
# The canonical definition has moved to sentry.monitors.types as part of the
# incident_occurrence.py refactor. External code importing SimpleCheckIn from
# this module will continue to work without modification.
from sentry.monitors.types import SimpleCheckIn as SimpleCheckIn # noqa: F401

🤖 Grapple PR auto-fix • minor • confidence: 100%

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))

get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local config copy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.

Suggested change
"id": str(monitor_environment.monitor.guid),
"config": config,

🤖 Grapple PR auto-fix • minor • confidence: 100%

if sum(status_counts.values()) == 1:
return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]

human_status = get_text_list(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (7 lines, 1 file))

The get_failure_reason function will raise a KeyError if failed_checkins contains only check-ins with statuses NOT in HUMAN_FAILURE_STATUS_MAP (e.g., all filtered out). In that case status_counts would be empty, sum(status_counts.values()) would be 0, which is not == 1, so it falls through to the get_text_list path with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.

Suggested change
human_status = get_text_list(
# Guard against the edge case where all check-ins have statuses not present
# in HUMAN_FAILURE_STATUS_MAP (e.g. unexpected/future statuses). Without this,
# get_text_list([]) returns '' and we'd produce the malformed string
# " check-ins detected".
if not status_counts:
return _("A failed check-in was detected")

🤖 Grapple PR auto-fix • minor • confidence: 100%

"A failed check-in was detected"
"""

status_counts = Counter(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))

Typo in docstring: 'humam readible' should be 'human readable'

Suggested change
status_counts = Counter(
Builds a human readable string from a list of failed check-ins.

🤖 Grapple PR auto-fix • minor • confidence: 86%

monitor_env.save(update_fields=("status",))

starting_checkin = previous_checkins[0]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — security agent (Larger fix (12 lines, 1 file) — review recommended)

When failure_issue_threshold > 1, previous_checkins is populated from a DB query. If the query returns 0 rows (race condition, data deleted), the subsequent previous_checkins[0] at line 60 raises an IndexError, causing an unhandled exception. Also, the query does not filter out OK check-ins before slicing — it fetches the N most recent and then checks if any are OK, but if there are fewer than N check-ins total, the threshold check is applied to fewer items than intended.

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -37,16 +37,25 @@
         else:
             previous_checkins = cast(
                 list[SimpleCheckIn],
                 # Using .values for performance reasons
                 MonitorCheckIn.objects.filter(
                     monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
                 )
+                .exclude(status=CheckInStatus.OK)
                 .order_by("-date_added")
+                [: failure_issue_threshold]
                 .values("id", "date_added", "status"),
             )
 
             # reverse the list after slicing in order to start with oldest check-in
-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+            previous_checkins = list(reversed(previous_checkins))
 
-            # If we have any successful check-ins within the threshold of
-            # commits we have NOT reached an incident state
-            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+            # If we don't have enough consecutive failures to meet the threshold,
+            # we have NOT reached an incident state.
+            # Also guard against an empty result (race condition / data deleted)
+            # which would cause IndexError at previous_checkins[0] below.
+            if len(previous_checkins) < failure_issue_threshold:
                 return False
 
         # change monitor status + update fingerprint timestamp

🤖 Grapple PR auto-fix • major • Review this diff before applying

@grapple-pr

grapple-pr Bot commented Apr 12, 2026

Copy link
Copy Markdown

🔴 [Grapple PR] Partial fix — logic agent (Validation failed: No valid hunks found in patch)

The function mark_failed_threshold was renamed to try_incident_threshold and create_issue_platform_occurrence was renamed to create_incident_occurrence. Any existing tests that mock or patch these functions by their original names and module paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold') or @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop patching and tests may pass or fail for the wrong reasons. The acceptance criteria states 'All existing tests for mark_failed.py continue to pass without modification', but this rename makes that unlikely if any tests reference the old function names/paths.

View suggested fix (needs manual review)
--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
 from typing import cast
 
 from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
 from sentry.monitors.types import SimpleCheckIn
 
 logger = logging.getLogger(__name__)
 
 
 def try_incident_threshold(
     failed_checkin: MonitorCheckIn,
     failure_issue_threshold: int,
     received: datetime | None,
 ) -> bool:

⚠️ Validation failed: No valid hunks found in patch

🤖 Grapple PR auto-fix • major • This fix needs manual verification

"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}
return try_incident_threshold(failed_checkin, failure_issue_threshold, received)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (20 lines, 1 file) — review recommended)

The function mark_failed_threshold has been renamed to try_incident_threshold and moved to a new module. Similarly, create_issue_platform_occurrence has been renamed to create_incident_occurrence and moved. Any existing tests or callers that mock or patch these functions at their old paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold'), @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop intercepting the real function and tests will either fail or pass incorrectly. The function renaming (not just moving) doubles the risk since even callers doing from sentry.monitors.logic.mark_failed import create_issue_platform_occurrence will get an ImportError.

--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,7 +1,7 @@
 from __future__ import annotations
 
 import logging
-from datetime import datetime
+from datetime import datetime  # noqa: F401
 
 from django.db.models import Q
 
-from sentry.monitors.logic.incidents import try_incident_threshold
+from sentry.monitors.logic.incidents import try_incident_threshold
+from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment
 
 logger = logging.getLogger(__name__)
 
+# Backward-compatible re-exports: these old names are preserved so that any
+# existing test mocks patching the old paths continue to work correctly:
+#
+#   @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold')
+#   @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')
+#
+# Without these aliases, mocks targeting the old names would silently stop
+# intercepting the real function, causing tests to either fail or pass
+# incorrectly (depending on whether the mock was expected to prevent side effects
+# or assert call counts).
+#
+# When all callers and test mocks have been updated to use the new names and
+# module paths, these aliases can be safely removed.
+mark_failed_threshold = try_incident_threshold
+create_issue_platform_occurrence = create_incident_occurrence

🤖 Grapple PR auto-fix • major • Review this diff before applying

from datetime import datetime
from typing import cast

from sentry.monitors.logic.incident_occurrence import create_incident_occurrence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (33 lines, 1 file) — review recommended)

incidents.py directly imports from incident_occurrence.py, establishing a one-way dependency. This is architecturally fine for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency would result. Consider whether the create_incident_occurrence call should instead be injected or dispatched from mark_failed.py (the orchestrator) to keep both new modules independent of each other.

--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,10 +1,9 @@
 from __future__ import annotations
 
 import logging
 from datetime import datetime
 from typing import cast
 
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
 from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
 from sentry.monitors.types import SimpleCheckIn
 
@@ -14,7 +13,14 @@ logger = logging.getLogger(__name__)
 
 def try_incident_threshold(
     failed_checkin: MonitorCheckIn,
     failure_issue_threshold: int,
     received: datetime | None,
-) -> bool:
+) -> tuple[bool, list[SimpleCheckIn], MonitorIncident | None]:
+    """
+    Evaluate whether the failure threshold has been reached and return the
+    incident context needed to create occurrences.
+
+    Returns a 3-tuple of (threshold_reached, previous_checkins, incident).
+    The caller (mark_failed.py orchestrator) is responsible for calling
+    create_incident_occurrence so that incidents.py and incident_occurrence.py
+    remain independent of each other.
+    """
     from sentry.signals import monitor_environment_failed
 
     monitor_env = failed_checkin.monitor_environment
 
     if monitor_env is None:
-        return False
+        return False, [], None
 
     # check to see if we need to update the status
     if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
@@ -63,7 +69,7 @@ def try_incident_threshold(
         # get the active incident from the monitor environment
         incident = monitor_env.active_incident
     else:
         # don't send occurrence for other statuses
-        return False
+        return False, [], None
 
     # Only create an occurrence if:
     # - We have an active incident and fingerprint
     # - The monitor and env are not muted
-    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
-        checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
-        for checkin in checkins:
-            create_incident_occurrence(
-                previous_checkins,
-                checkin,
-                incident,
-                received=received,
-            )
-
     monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
 
-    return True
+    # Return context to the orchestrator so it can call create_incident_occurrence
+    # without incidents.py taking a direct dependency on incident_occurrence.py.
+    if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+        return True, previous_checkins, incident
+
+    return True, [], None

🤖 Grapple PR auto-fix • minor • Review this diff before applying

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants