ref(crons): Reorganize incident creation / issue occurrence logic#3
ref(crons): Reorganize incident creation / issue occurrence logic#3Frawatson wants to merge 1 commit into
Conversation
…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527
🤖 Grapple PR ReviewSummaryThis PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py. This is a structural refactor to improve separation of concerns in the monitors subsystem. Changes: 4 files (+289 / -264 lines)
Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations) CI: ⏳ 0 passed, 0 failed Impact Diagramgraph TD
pr_0[/"🔀 PR Changes: 4 files"/]
f1["🟢 incident_occurrence.py +171/-0"]
pr_0 --> f1
f2["🟢 incidents.py +104/-0"]
pr_0 --> f2
f3["🟡 mark_failed.py +4/-264"]
pr_0 --> f3
f4["🟡 types.py +10/-0"]
pr_0 --> f4
f1 -.- f2
f2 -.- f3
style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
Review details
Powered by Grapple PR — AI that understands your code. |
|
MAJOR — Two new modules are added under |
|
MAJOR — A separate database query is issued for each check-in ID collected in |
|
MAJOR — After the refactor, |
|
MINOR — The function was renamed from |
|
MINOR —
|
|
MINOR — The function was renamed from |
|
MINOR — After the refactor, |
|
MINOR — incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is the new canonical location. This is correct. However, incidents.py defines its own import of SimpleCheckIn also from sentry.monitors.types. Both new modules correctly import from the new types.py location - this is fine. |
|
MINOR — Typo in docstring: 'humam readible' should be 'human readable' |
|
INFO — The dependency direction is |
There was a problem hiding this comment.
🤖 Grapple PR Review
Summary
This PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py, reducing mark_failed.py by ~264 lines.
Changes: 4 files (+289 / -264 lines)
- 🟢 2 files added:
src/sentry/monitors/logic/incident_occurrence.py,src/sentry/monitors/logic/incidents.py - 🟡 2 files modified:
src/sentry/monitors/logic/mark_failed.py,src/sentry/monitors/types.py
Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)
CI: ⏳ 0 passed, 0 failed
Impact Diagram
graph TD
pr_0[/"🔀 PR Changes: 4 files"/]
f1["🟢 incident_occurrence.py +171/-0"]
pr_0 --> f1
f2["🟢 incidents.py +104/-0"]
pr_0 --> f2
f3["🟡 mark_failed.py +4/-264"]
pr_0 --> f3
f4["🟡 types.py +10/-0"]
pr_0 --> f4
f1 -.- f2
f2 -.- f3
style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
2 major issues found — review recommended.
| Severity | Count |
|---|---|
| 🟠 Major | 2 |
| 🟡 Minor | 3 |
| 🔵 Info | 1 |
Review details
- Reviewed at: 2026-04-07T21:07:19.372Z
- Agents used: security, logic, style, architecture, performance, verification
- Context level: full
- Execution time: 40.2s
- Low-confidence findings suppressed: 1
- Blast radius: 0 (contained)
Powered by Grapple PR — AI that understands your code.
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } | ||
| return try_incident_threshold(failed_checkin, failure_issue_threshold, received) |
There was a problem hiding this comment.
🟠 MAJOR — Broken external imports (confidence: 100%)
The functions mark_failed_threshold, create_issue_platform_occurrence, get_failure_reason, get_monitor_environment_context, and the type SimpleCheckIn were previously importable from mark_failed.py. Any external callers importing these symbols (e.g., from sentry.monitors.logic.mark_failed import mark_failed_threshold or from sentry.monitors.logic.mark_failed import SimpleCheckIn) will now get ImportError. The PR does not re-export these symbols from mark_failed.py.
Evidence:
- The original
mark_failed.pydefinedmark_failed_threshold,SimpleCheckIn,create_issue_platform_occurrence,get_failure_reason,get_monitor_environment_contextas module-level names. - After the refactor, none of these names are re-exported or aliased in
mark_failed.py. - The acceptance criteria state: 'All public functions/methods previously accessible via mark_failed.py remain importable or are re-exported correctly'.
- Test files or other modules in the codebase that import these symbols directly from mark_failed.py will break.
Agent: logic
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def try_incident_threshold( |
There was a problem hiding this comment.
🟠 MAJOR — Function rename changes public API (confidence: 94%)
The original function was named mark_failed_threshold and has been renamed to try_incident_threshold. The original occurrence creation function was named create_issue_platform_occurrence and has been renamed to create_incident_occurrence. These are not just internal refactors — any code referencing the old names (tests, signal handlers, external modules) will break.
Evidence:
- Original mark_failed.py had
def mark_failed_threshold(...)— nowdef try_incident_threshold(...)in incidents.py - Original mark_failed.py had
def create_issue_platform_occurrence(...)— nowdef create_incident_occurrence(...)in incident_occurrence.py - Acceptance criteria: 'The net behavior of mark_failed.py is identical before and after the refactor'
Agent: logic
|
|
||
|
|
||
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | ||
| config = monitor_environment.monitor.config.copy() |
There was a problem hiding this comment.
🟡 MINOR — Code Organization (confidence: 100%)
Inconsistent handling of monitor config in get_monitor_environment_context: the function modifies a shallow copy of config but returns the original unmodified config in the dict
Evidence:
- Line 156-157:
config = monitor_environment.monitor.config.copy()followed by conditional mutation ofconfig['schedule_type'] - Line 162:
'config': monitor_environment.monitor.config,returns the original, not the modifiedconfigvariable - This means the schedule_type transformation is lost in the returned context
Agent: style
| MonitorEnvironment, | ||
| MonitorIncident, | ||
| ) | ||
| from sentry.monitors.types import SimpleCheckIn |
There was a problem hiding this comment.
🟡 MINOR — Unused import in types.py consumer (confidence: 88%)
incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.
Evidence:
- incident_occurrence.py line 20:
from sentry.monitors.types import SimpleCheckIn - incidents.py line 9:
from sentry.monitors.types import SimpleCheckIn - Both modules use SimpleCheckIn in type annotations and dict construction
Agent: logic
| "3 missed check-ins detected" | ||
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | ||
| "A failed check-in was detected" | ||
| """ |
There was a problem hiding this comment.
🟡 MINOR — Documentation (confidence: 76%)
Typo in docstring: 'humam readible' should be 'human readable'
Evidence:
- The function
get_failure_reasonhas a docstring with a spelling error: 'Builds a humam readible string...' - This is a public function that may be referenced in documentation or by other developers
Agent: style
| MonitorIncident, | ||
| MonitorStatus, | ||
| ) | ||
| from sentry.monitors.logic.incidents import try_incident_threshold |
There was a problem hiding this comment.
🔵 INFO — Dead import after refactor (confidence: 75%)
MonitorEnvironment is still imported in mark_failed.py. Verify it is still used in the remaining mark_failed() function body (lines not shown in the diff but presumably still reference it). If not used, it's a dead import.
Evidence:
- Line 8:
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment - The remaining mark_failed function likely still uses MonitorEnvironment for the query at the top of the function
Agent: logic
|
|
||
|
|
||
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | ||
| config = monitor_environment.monitor.config.copy() |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))
Inconsistent handling of monitor config in
get_monitor_environment_context: the function modifies a shallow copy ofconfigbut returns the original unmodifiedconfigin the dict
| config = monitor_environment.monitor.config.copy() | |
| "config": config, |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| MonitorEnvironment, | ||
| MonitorIncident, | ||
| ) | ||
| from sentry.monitors.types import SimpleCheckIn |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))
incident_occurrence.py imports SimpleCheckIn from sentry.monitors.types, which is correct. However, incidents.py imports SimpleCheckIn from sentry.monitors.types as well (line 9). Both new modules correctly use the new type from types.py. This is consistent — noting for completeness that the type is not orphaned.
| from sentry.monitors.types import SimpleCheckIn | |
| from sentry.monitors.types import SimpleCheckIn |
🤖 Grapple PR auto-fix • minor • confidence: 88%
| "3 missed check-ins detected" | ||
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | ||
| "A failed check-in was detected" | ||
| """ |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))
Typo in docstring: 'humam readible' should be 'human readable'
| """ | |
| Builds a human readable string from a list of failed check-ins. |
🤖 Grapple PR auto-fix • minor • confidence: 76%
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } | ||
| return try_incident_threshold(failed_checkin, failure_issue_threshold, received) |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (33 lines, 1 file) — review recommended)
The functions
mark_failed_threshold,create_issue_platform_occurrence,get_failure_reason,get_monitor_environment_context, and the typeSimpleCheckInwere previously importable frommark_failed.py. Any external callers importing these symbols (e.g.,from sentry.monitors.logic.mark_failed import mark_failed_thresholdorfrom sentry.monitors.logic.mark_failed import SimpleCheckIn) will now get ImportError. The PR does not re-export these symbols frommark_failed.py.
--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,6 +1,7 @@
from __future__ import annotations
import logging
+from datetime import datetime # noqa: F401 - re-exported for backward compatibility
from datetime import datetime
from django.db.models import Q
@@ -10,6 +11,28 @@ from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnviron
logger = logging.getLogger(__name__)
+# ---------------------------------------------------------------------------
+# Backward-compatibility re-exports
+#
+# These symbols were previously defined in this module. They have been moved
+# to dedicated sub-modules as part of a refactor. We re-export them here so
+# that any existing import sites of the form:
+#
+# from sentry.monitors.logic.mark_failed import mark_failed_threshold
+# from sentry.monitors.logic.mark_failed import SimpleCheckIn
+# from sentry.monitors.logic.mark_failed import create_issue_platform_occurrence
+# from sentry.monitors.logic.mark_failed import get_failure_reason
+# from sentry.monitors.logic.mark_failed import get_monitor_environment_context
+#
+# continue to work without modification.
+# ---------------------------------------------------------------------------
+from sentry.monitors.logic.incident_occurrence import ( # noqa: F401, E402
+ SimpleCheckIn,
+ create_issue_platform_occurrence,
+ get_failure_reason,
+ get_monitor_environment_context,
+)
+from sentry.monitors.logic.incidents import try_incident_threshold as mark_failed_threshold # noqa: F401, E402
+
+__all__ = [
+ "mark_failed",
+ "mark_failed_threshold",
+ "SimpleCheckIn",
+ "create_issue_platform_occurrence",
+ "get_failure_reason",
+ "get_monitor_environment_context",
+]
+
def mark_failed(🤖 Grapple PR auto-fix • major • Review this diff before applying
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def try_incident_threshold( |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (22 lines, 1 file) — review recommended)
The original function was named
mark_failed_thresholdand has been renamed totry_incident_threshold. The original occurrence creation function was namedcreate_issue_platform_occurrenceand has been renamed tocreate_incident_occurrence. These are not just internal refactors — any code referencing the old names (tests, signal handlers, external modules) will break.
--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,6 +1,8 @@
from __future__ import annotations
import logging
+import warnings
from datetime import datetime
from typing import cast
@@ -101,3 +103,23 @@ def try_incident_threshold(
monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
return True
+
+
+def mark_failed_threshold(
+ failed_checkin: MonitorCheckIn,
+ failure_issue_threshold: int,
+ received: datetime | None,
+) -> bool:
+ """
+ Deprecated: Use `try_incident_threshold` instead.
+
+ Backward-compatible alias preserved because external callers (tests,
+ signal handlers, other modules) may still reference the original name
+ from mark_failed.py. This alias ensures the refactor does not break
+ the public API contract while callers are migrated.
+ """
+ warnings.warn(
+ "mark_failed_threshold is deprecated; use try_incident_threshold instead.",
+ DeprecationWarning,
+ stacklevel=2,
+ )
+ return try_incident_threshold(failed_checkin, failure_issue_threshold, received)🤖 Grapple PR auto-fix • major • Review this diff before applying
There was a problem hiding this comment.
🤖 Grapple PR Review
Summary
This PR reorganizes cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py, reducing mark_failed.py by 264 lines while adding 275 lines across the new files.
Changes: 4 files (+289 / -264 lines)
- 🟢 2 files added:
src/sentry/monitors/logic/incident_occurrence.py,src/sentry/monitors/logic/incidents.py - 🟡 2 files modified:
src/sentry/monitors/logic/mark_failed.py,src/sentry/monitors/types.py
Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)
CI: ⏳ 0 passed, 0 failed
Impact Diagram
graph TD
pr_0[/"🔀 PR Changes: 4 files"/]
f1["🟢 incident_occurrence.py +171/-0"]
pr_0 --> f1
f2["🟢 incidents.py +104/-0"]
pr_0 --> f2
f3["🟡 mark_failed.py +4/-264"]
pr_0 --> f3
f4["🟡 types.py +10/-0"]
pr_0 --> f4
f1 -.- f2
f2 -.- f3
style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
1 critical issue found — please address before merging.
| Severity | Count |
|---|---|
| 🔴 Critical | 1 |
| 🟠 Major | 5 |
| 🟡 Minor | 6 |
| 🔵 Info | 3 |
Review details
- Reviewed at: 2026-04-08T00:47:12.319Z
- Agents used: security, logic, style, architecture, performance, verification
- Context level: full
- Execution time: 104.7s
- Low-confidence findings suppressed: 4
- Blast radius: 0 (contained)
Powered by Grapple PR — AI that understands your code.
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
🔴 CRITICAL — Data flow / Bug (confidence: 100%)
get_monitor_environment_context builds a local config variable with the schedule_type overwritten, but then the returned dictionary uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.
Evidence:
- Line 158:
config = monitor_environment.monitor.config.copy() - Line 159-160:
if 'schedule_type' in config: config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display() - Line 165:
'config': monitor_environment.monitor.config,— uses the original config, not the localconfigvariable
Agent: logic
| failed_checkin: MonitorCheckIn, | ||
| failure_issue_threshold: int, | ||
| received: datetime | None, | ||
| ) -> bool: |
There was a problem hiding this comment.
🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 100%)
The extracted function was renamed from mark_failed_threshold to try_incident_threshold, which changes the internal API surface. While mark_failed.py is updated to call the new name, any other callers in the codebase that directly imported or called mark_failed_threshold will break. The blast radius analysis only covers the 4 changed files, but a codebase-wide search for mark_failed_threshold references (tests, tasks, other modules) is needed to confirm no breakage.
Evidence:
- Original function in mark_failed.py was named
mark_failed_threshold - New function in incidents.py is named
try_incident_threshold - No re-export or alias is provided in mark_failed.py for backward compatibility
- Tests that directly import
mark_failed_thresholdfrommark_failedwill fail
Agent: architecture
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
🟠 MAJOR — Logic Error / Data Integrity (confidence: 100%)
The config copy with humanized schedule_type is computed but never used. The returned dict references monitor_environment.monitor.config (the original dict) instead of the local config variable, so the schedule_type transformation is silently discarded.
Evidence:
- Line 155:
config = monitor_environment.monitor.config.copy()— copy is made and potentially modified - Line 157:
config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display()— written to localconfig - Line 160:
'config': monitor_environment.monitor.config— returns the ORIGINAL unmodified dict, notconfig - This bug exists verbatim in the original
mark_failed.pyand was faithfully copied, but it means the monitor context sent with every occurrence always has the raw integerschedule_typeinstead of the human-readable string
Agent: security
| status_counts = Counter( | ||
| checkin["status"] | ||
| for checkin in failed_checkins | ||
| if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() |
There was a problem hiding this comment.
🟠 MAJOR — Edge case / KeyError risk (confidence: 100%)
In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.
Evidence:
- Lines 136-139:
status_counts = Counter(checkin['status'] for checkin in failed_checkins if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys()) - If no check-ins match,
status_countsis empty andsum(status_counts.values()) == 0 - Falls through to
get_text_list([], last_word=_('and'))which returns'' - Returns
_('%(problem_checkins)s check-ins detected') % {'problem_checkins': ''}→' check-ins detected' - This is then surfaced as the 'Failure reason' evidence in the issue occurrence shown to users
- HUMAN_FAILURE_STATUS_MAP has keys: ERROR, MISSED, TIMEOUT
- SINGULAR_HUMAN_FAILURE_MAP has keys: ERROR, MISSED, TIMEOUT
- If these maps ever diverge, line 142 would raise KeyError
- Also: if all check-ins are filtered out (status_counts is empty), sum == 0, so the code goes to get_text_list with an empty list, producing a malformed string like ' check-ins detected'
Agent: logic
| incident: MonitorIncident, | ||
| received: datetime | None, | ||
| ) -> None: | ||
| from sentry.issues.issue_occurrence import IssueEvidence, IssueOccurrence |
There was a problem hiding this comment.
🟠 MAJOR — API Contract / Naming Inconsistency (confidence: 99%)
The occurrence creation function was renamed from create_issue_platform_occurrence to create_incident_occurrence. Any code outside mark_failed.py that imports or references create_issue_platform_occurrence by name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.
Evidence:
- Original function was
create_issue_platform_occurrencein mark_failed.py - New function is
create_incident_occurrencein incident_occurrence.py - Tests commonly mock internal functions like
create_issue_platform_occurrenceby path; those mocks will silently pass (no-op) or fail
Agent: architecture
| received: datetime | None, | ||
| ) -> bool: | ||
| from sentry.signals import monitor_environment_failed | ||
|
|
There was a problem hiding this comment.
🟡 MINOR — Naming conventions (confidence: 95%)
Function name try_incident_threshold uses a 'try_' prefix which typically suggests an operation that may fail gracefully and returns a success/failure boolean. However, the name doesn't clearly indicate it may create incidents. Consider if check_incident_threshold or handle_incident_threshold would be more precise.
Evidence:
- The function both checks thresholds AND creates incidents/occurrences with side effects (database writes, Kafka calls)
- The 'try_' prefix typically suggests non-throwing error handling, but this function performs substantial work
- This naming pattern could be confusing to future maintainers about the function's scope
Agent: style
|
|
||
| id: int | ||
| date_added: datetime | ||
| status: int |
There was a problem hiding this comment.
🟡 MINOR — Type Definition Location (confidence: 81%)
SimpleCheckIn is correctly defined in types.py but is also still used by incidents.py which imports it from types.py. Verify that no residual definition of SimpleCheckIn exists in the old mark_failed.py. The diff confirms it was removed — this is good. However, the docstring says 'A stripped down check in object' which could be more descriptive about its purpose (it represents check-ins used for incident threshold evaluation).
Evidence:
- SimpleCheckIn moved from mark_failed.py to types.py
- Imported correctly in both incidents.py (line 9) and incident_occurrence.py (line 20)
Agent: architecture
|
|
||
| monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env)) | ||
|
|
||
| return True |
There was a problem hiding this comment.
🔵 INFO — Refactor Completeness (confidence: 100%)
The function was renamed from mark_failed_threshold to try_incident_threshold during extraction. While this is a reasonable rename for clarity, any external callers or tests that directly imported mark_failed_threshold from mark_failed.py would break. The intent says 'The public API surface consumed by callers of mark_failed.py should remain unchanged or be backward compatible.' If mark_failed_threshold was only called internally within mark_failed, this is fine. But if any test or module imported it directly, a backward-compatible alias should be provided.
Evidence:
- Original:
def mark_failed_threshold(...)in mark_failed.py - New:
def try_incident_threshold(...)in incidents.py - mark_failed.py now calls
try_incident_thresholdinstead of the old internal function - The top-level
mark_failed()function signature is preserved, which is the primary public API
Agent: logic
| from typing import cast | ||
|
|
||
| from sentry.monitors.logic.incident_occurrence import create_incident_occurrence | ||
| from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus |
There was a problem hiding this comment.
🔵 INFO — Import Hygiene (confidence: 85%)
The MonitorStatus import is used in incidents.py but not explicitly imported in the file's imports section — actually it is imported on line 8. However, MonitorIncident is imported but only used as a type annotation for incident variable and in MonitorIncident.objects.get_or_create. This is fine, but note that CheckInStatus and MonitorCheckIn are also imported and used. All imports appear valid.
Evidence:
- Line 8:
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus - All four are used within the function body
Agent: logic
| "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" | ||
| "A failed check-in was detected" | ||
| """ | ||
|
|
There was a problem hiding this comment.
🔵 INFO — Documentation (confidence: 82%)
Typo in docstring: 'humam readible' should be 'human readable'
Evidence:
- Line 130 contains: 'Builds a humam readible string from a list of failed check-ins.'
- This is a documentation typo that should be corrected for clarity
Agent: style
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))
get_monitor_environment_context builds a local
configvariable with the schedule_type overwritten, but then the returned dictionary usesmonitor_environment.monitor.config(the original, unmodified config) instead of the localconfigvariable. This means theschedule_typetransformation is dead code and the context will always contain the raw schedule_type value instead of the display value. This bug existed in the original code and was faithfully copied, but it is still a logic error.
| "id": str(monitor_environment.monitor.guid), | |
| "config": config, |
🤖 Grapple PR auto-fix • critical • confidence: 100%
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))
The
configcopy with humanizedschedule_typeis computed but never used. The returned dict referencesmonitor_environment.monitor.config(the original dict) instead of the localconfigvariable, so theschedule_typetransformation is silently discarded.
| "id": str(monitor_environment.monitor.guid), | |
| "config": config, |
🤖 Grapple PR auto-fix • major • confidence: 100%
| from datetime import datetime, timezone | ||
| from typing import TYPE_CHECKING, TypedDict, cast | ||
| from datetime import datetime | ||
|
|
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))
The
MonitorEnvironmentimport is retained inmark_failed.pybut after the refactor it is no longer used directly in this file. Themark_failedfunction receivesMonitorCheckInand accessesmonitor_environmentthrough it;MonitorEnvironmentis not referenced as a type annotation or used anywhere in the remaining code ofmark_failed.py.
| from sentry.monitors.models import CheckInStatus, MonitorCheckIn |
🤖 Grapple PR auto-fix • major • confidence: 89%
| return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status} | ||
|
|
||
|
|
||
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))
In
get_monitor_environment_context, the local variableconfigis computed and modified (line 152-153:config = monitor_environment.monitor.config.copy()and theschedule_typeoverride), but the returned dict usesmonitor_environment.monitor.config(the original, unmodified config) instead of the localconfigvariable. This is a pre-existing bug that was faithfully moved frommark_failed.py, but it means theschedule_typedisplay override is silently discarded.
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | |
| "config": config, |
🤖 Grapple PR auto-fix • minor • confidence: 100%
|
|
||
|
|
||
| def mark_failed( | ||
| failed_checkin: MonitorCheckIn, |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — security agent (Small fix (5 lines, 1 file))
SimpleCheckInTypedDict was previously exported fromsentry.monitors.logic.mark_failedand is now only defined insentry.monitors.types. Any code outside this PR that importedSimpleCheckInfrommark_failedwill get anImportErrorat runtime. No re-export shim is provided.
| failed_checkin: MonitorCheckIn, | |
| # Re-exported for backward compatibility: SimpleCheckIn was previously defined | |
| # in this module. It has been moved to sentry.monitors.types as part of the | |
| # incident logic refactor. External callers importing from this path will | |
| # continue to work without modification. | |
| from sentry.monitors.types import SimpleCheckIn as SimpleCheckIn # noqa: F401 |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| .order_by("-date_added") | ||
| .values("id", "date_added", "status"), | ||
| ) | ||
|
|
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — security agent (Small fix (4 lines, 1 file))
When
failure_issue_threshold > 1, the DB query fetches check-ins filtered only bydate_added__lte=failed_checkin.date_addedwith no upper bound on count before the query executes. The slice[:failure_issue_threshold]is applied in Python after the full queryset is evaluated vialist(reversed(...)), meaning Django will fetch ALL matching check-ins into memory before slicing. This is a performance regression risk for monitors with long histories, and diverges from the intent of using.values()'for performance reasons'.
| .values("id", "date_added", "status")[:failure_issue_threshold], | |
| previous_checkins = list(reversed(previous_checkins)) |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| received: datetime | None, | ||
| ) -> bool: | ||
| from sentry.signals import monitor_environment_failed | ||
|
|
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))
Function name
try_incident_thresholduses a 'try_' prefix which typically suggests an operation that may fail gracefully and returns a success/failure boolean. However, the name doesn't clearly indicate it may create incidents. Consider ifcheck_incident_thresholdorhandle_incident_thresholdwould be more precise.
| def handle_incident_threshold( |
🤖 Grapple PR auto-fix • minor • confidence: 95%
|
|
||
| id: int | ||
| date_added: datetime | ||
| status: int |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — architecture agent (Small fix (4 lines, 1 file))
SimpleCheckIn is correctly defined in types.py but is also still used by incidents.py which imports it from types.py. Verify that no residual definition of SimpleCheckIn exists in the old mark_failed.py. The diff confirms it was removed — this is good. However, the docstring says 'A stripped down check in object' which could be more descriptive about its purpose (it represents check-ins used for incident threshold evaluation).
| status: int | |
| A minimal check-in representation used for incident threshold evaluation. | |
| Contains only the fields required to assess monitor failure and recovery | |
| conditions in incident and occurrence processing logic. |
🤖 Grapple PR auto-fix • minor • confidence: 81%
| failed_checkin: MonitorCheckIn, | ||
| failure_issue_threshold: int, | ||
| received: datetime | None, | ||
| ) -> bool: |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — architecture agent (Small fix (7 lines, 1 file))
The extracted function was renamed from
mark_failed_thresholdtotry_incident_threshold, which changes the internal API surface. Whilemark_failed.pyis updated to call the new name, any other callers in the codebase that directly imported or calledmark_failed_thresholdwill break. The blast radius analysis only covers the 4 changed files, but a codebase-wide search formark_failed_thresholdreferences (tests, tasks, other modules) is needed to confirm no breakage.
--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,6 +1,8 @@
from __future__ import annotations
import logging
from datetime import datetime
+from sentry.monitors.logic.incidents import try_incident_threshold
+
logger = logging.getLogger(__name__)
@@ -10,6 +12,10 @@ logger = logging.getLogger(__name__)
+# Backward-compatibility alias: `mark_failed_threshold` was extracted and
+# renamed to `try_incident_threshold` in incidents.py. Any existing callers
+# (tests, tasks, other modules) that imported `mark_failed_threshold` from
+# this module will continue to work without modification.
+mark_failed_threshold = try_incident_threshold🤖 Grapple PR auto-fix • major • Review this diff before applying
| status_counts = Counter( | ||
| checkin["status"] | ||
| for checkin in failed_checkins | ||
| if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (16 lines, 1 file) — review recommended)
In get_failure_reason, when sum(status_counts.values()) == 1, the code accesses SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]. If a failed check-in has a status that is NOT in HUMAN_FAILURE_STATUS_MAP (filtered out on line 138), status_counts could be empty (sum == 0), and the code falls through to the get_text_list path which would produce an empty string. However, if a check-in has a status that IS in HUMAN_FAILURE_STATUS_MAP but NOT in SINGULAR_HUMAN_FAILURE_MAP, this would raise a KeyError. Currently the maps have identical keys so this is safe, but there's no defensive check.
--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -139,10 +139,18 @@ def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
status_counts = Counter(
checkin["status"]
for checkin in failed_checkins
if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
)
+ total = sum(status_counts.values())
+
+ # No check-ins with recognized failure statuses; return a generic fallback
+ # rather than producing a malformed string like ' check-ins detected'.
+ if total == 0:
+ return _("A failed check-in was detected")
+
- if sum(status_counts.values()) == 1:
- return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+ if total == 1:
+ status_key = list(status_counts.keys())[0]
+ # Defensive .get() in case SINGULAR_HUMAN_FAILURE_MAP and
+ # HUMAN_FAILURE_STATUS_MAP ever diverge (they currently share keys).
+ return SINGULAR_HUMAN_FAILURE_MAP.get(
+ status_key, _("A failed check-in was detected")
+ )
human_status = get_text_list(
[🤖 Grapple PR auto-fix • major • Review this diff before applying
| incident: MonitorIncident, | ||
| received: datetime | None, | ||
| ) -> None: | ||
| from sentry.issues.issue_occurrence import IssueEvidence, IssueOccurrence |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (11 lines, 1 file) — review recommended)
The occurrence creation function was renamed from
create_issue_platform_occurrencetocreate_incident_occurrence. Any code outsidemark_failed.pythat imports or referencescreate_issue_platform_occurrenceby name (e.g., tests that mock it, or other callers) will break without a backward-compatible alias.
--- a/src/sentry/monitors/logic/incident_occurrence.py
+++ b/src/sentry/monitors/logic/incident_occurrence.py
@@ -168,3 +168,13 @@
"type": monitor_environment.monitor.get_type_display(),
}
+
+
+# Backward-compatible alias for the previous name of this function when it lived
+# in mark_failed.py. Tests that mock or import `create_issue_platform_occurrence`
+# by path (e.g. via `mock.patch("...incident_occurrence.create_issue_platform_occurrence")`)
+# will continue to work without modification.
+#
+# Deprecated: use `create_incident_occurrence` directly.
+# TODO: Remove this alias once all references to `create_issue_platform_occurrence`
+# have been updated across tests and callers.
+create_issue_platform_occurrence = create_incident_occurrence🤖 Grapple PR auto-fix • major • Review this diff before applying
| from datetime import datetime | ||
| from typing import cast | ||
|
|
||
| from sentry.monitors.logic.incident_occurrence import create_incident_occurrence |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (12 lines, 1 file) — review recommended)
incidents.py imports directly from incident_occurrence.py, creating a fixed coupling direction. This is acceptable for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency will result. Consider whether the call to
create_incident_occurrenceshould instead be injected or orchestrated from a higher-level caller (likemark_failed.py) to keep the two new modules independent.
--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,11 +1,11 @@
from __future__ import annotations
import logging
from datetime import datetime
-from typing import cast
+from typing import Callable, cast
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
from sentry.monitors.types import SimpleCheckIn
logger = logging.getLogger(__name__)
def try_incident_threshold(
failed_checkin: MonitorCheckIn,
failure_issue_threshold: int,
received: datetime | None,
+ # Injected to avoid a coupling dependency between incidents.py and
+ # incident_occurrence.py. If incident_occurrence.py ever needs to reference
+ # incident state, a circular import would result. The orchestrating caller
+ # (mark_failed.py) is responsible for wiring this dependency.
+ create_occurrence: Callable = None,
) -> bool:
@@ -88,10 +88,13 @@
# Only create an occurrence if:
# - We have an active incident and fingerprint
# - The monitor and env are not muted
- if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+ if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident and create_occurrence is not None:
checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
for checkin in checkins:
- create_incident_occurrence(
+ create_occurrence(
previous_checkins,
checkin,
incident,
received=received,
)🤖 Grapple PR auto-fix • minor • Review this diff before applying
|
@grapple-pr review |
There was a problem hiding this comment.
🤖 Grapple PR Review
Summary
This PR reorganizes the cron monitor incident creation and issue occurrence logic by extracting it from mark_failed.py into two new dedicated modules: incident_occurrence.py and incidents.py. This is a refactoring change that improves code organization and separation of concerns.
Changes: 4 files (+289 / -264 lines)
- 🟢 2 files added:
src/sentry/monitors/logic/incident_occurrence.py,src/sentry/monitors/logic/incidents.py - 🟡 2 files modified:
src/sentry/monitors/logic/mark_failed.py,src/sentry/monitors/types.py
Resolves: #8 (feat(upsampling) - Support upsampled error count with performance optimizations)
CI: ⏳ 0 passed, 0 failed
Impact Diagram
graph TD
pr_0[/"🔀 PR Changes: 4 files"/]
f1["🟢 incident_occurrence.py +171/-0"]
pr_0 --> f1
f2["🟢 incidents.py +104/-0"]
pr_0 --> f2
f3["🟡 mark_failed.py +4/-264"]
pr_0 --> f3
f4["🟡 types.py +10/-0"]
pr_0 --> f4
f1 -.- f2
f2 -.- f3
style f1 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f2 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f3 fill:#7c3aed,color:#fff,stroke:#7c3aed
style f4 fill:#7c3aed,color:#fff,stroke:#7c3aed
5 major issues found — review recommended.
| Severity | Count |
|---|---|
| 🟠 Major | 5 |
| 🟡 Minor | 5 |
| 🔵 Info | 1 |
Review details
- Reviewed at: 2026-04-12T03:17:25.375Z
- Agents used: security, logic, style, architecture, performance, verification
- Context level: full
- Execution time: 84.2s
- Low-confidence findings suppressed: 3
- Blast radius: 0 (contained)
Powered by Grapple PR — AI that understands your code.
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
🟠 MAJOR — Logic Error / Potential KeyError or Misleading Output (confidence: 100%)
get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed/empty string like '0 check-ins detected'. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code — it appears the original had a fallback that was not ported.
Evidence:
- Line 140-144: Counter filters only statuses in HUMAN_FAILURE_STATUS_MAP (ERROR, MISSED, TIMEOUT)
- Line 146:
if sum(status_counts.values()) == 1— only handles the singular case - No else/fallback branch for sum == 0 (empty status_counts)
- Docstring at line 133 mentions 'A failed check-in was detected' but no code path produces that string — SINGULAR_HUMAN_FAILURE_MAP only covers ERROR, MISSED, TIMEOUT
- If a check-in with status IN_PROGRESS (or any unrecognized status) is in failed_checkins, get_text_list receives an empty list
Agent: security
| monitor_env.save(update_fields=("status",)) | ||
|
|
||
| starting_checkin = previous_checkins[0] | ||
|
|
There was a problem hiding this comment.
🟠 MAJOR — Logic Error / Potential IndexError (confidence: 100%)
When failure_issue_threshold > 1, previous_checkins is populated from a DB query. If the query returns 0 rows (race condition, data deleted), the subsequent previous_checkins[0] at line 60 raises an IndexError, causing an unhandled exception. Also, the query does not filter out OK check-ins before slicing — it fetches the N most recent and then checks if any are OK, but if there are fewer than N check-ins total, the threshold check is applied to fewer items than intended.
Evidence:
- Lines 42-50: DB query fetches check-ins with
date_added__lte=failed_checkin.date_added, no guarantee of non-empty result - Line 52:
previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))— empty if DB returns nothing - Line 60:
starting_checkin = previous_checkins[0]— IndexError if list is empty - No guard between lines 56 and 60 to handle empty previous_checkins
- This was present in original code but is a real bug that should be fixed during this refactor
Agent: security
| "config": monitor_environment.monitor.config, | ||
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } |
There was a problem hiding this comment.
🟠 MAJOR — Logic Error - Config Context Bug Preserved (confidence: 100%)
The get_monitor_environment_context function builds a config copy and conditionally modifies schedule_type, but then returns monitor_environment.monitor.config (the original) instead of the modified config copy. This bug existed in the original mark_failed.py and was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.
Evidence:
- Line 163:
config = monitor_environment.monitor.config.copy() - Line 164-165:
if 'schedule_type' in config: config['schedule_type'] = ...modifies the copy - Line 170:
'config': monitor_environment.monitor.configreturns the ORIGINAL, not the modified copy - This means the
schedule_typedisplay transformation is silently discarded
Agent: logic
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def try_incident_threshold( |
There was a problem hiding this comment.
🟠 MAJOR — Import path changes breaking test mocks (confidence: 94%)
The function mark_failed_threshold was renamed to try_incident_threshold and create_issue_platform_occurrence was renamed to create_incident_occurrence. Any existing tests that mock or patch these functions by their original names and module paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold') or @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop patching and tests may pass or fail for the wrong reasons. The acceptance criteria states 'All existing tests for mark_failed.py continue to pass without modification', but this rename makes that unlikely if any tests reference the old function names/paths.
Evidence:
- Original function in mark_failed.py was named
mark_failed_threshold, renamed totry_incident_thresholdin incidents.py - Original function
create_issue_platform_occurrencerenamed tocreate_incident_occurrencein incident_occurrence.py - The intent specification says 'No logic is lost or altered during the extraction' and 'All existing tests for mark_failed.py continue to pass without modification'
- Test files are not included in this PR, so we cannot confirm whether mocks referencing old paths exist
Agent: logic
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } | ||
| return try_incident_threshold(failed_checkin, failure_issue_threshold, received) |
There was a problem hiding this comment.
🟠 MAJOR — API contract / Test mock breakage (confidence: 94%)
The function mark_failed_threshold has been renamed to try_incident_threshold and moved to a new module. Similarly, create_issue_platform_occurrence has been renamed to create_incident_occurrence and moved. Any existing tests or callers that mock or patch these functions at their old paths (e.g., @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold'), @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop intercepting the real function and tests will either fail or pass incorrectly. The function renaming (not just moving) doubles the risk since even callers doing from sentry.monitors.logic.mark_failed import create_issue_platform_occurrence will get an ImportError.
Evidence:
- Original:
mark_failed_thresholdin mark_failed.py → nowtry_incident_thresholdin incidents.py - Original:
create_issue_platform_occurrencein mark_failed.py → nowcreate_incident_occurrencein incident_occurrence.py - The intent specification states 'All existing tests for mark_failed.py continue to pass without modification', but function renaming means any mock targeting old names/paths will break
- No re-exports of old names are provided in mark_failed.py for backward compatibility
Agent: architecture
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
🟡 MINOR — Logic Error / Stale Data (confidence: 100%)
get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local config copy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.
Evidence:
- Line 155:
config = monitor_environment.monitor.config.copy()— a copy is made and potentially mutated - Line 157:
config['schedule_type'] = monitor_environment.monitor.get_schedule_type_display()— human-readable value stored in copy - Line 163:
'config': monitor_environment.monitor.config— returns the original raw dict, not the transformed copy - The caller in create_incident_occurrence at line 86 embeds this context in event_data sent to Kafka, so consumers see the raw schedule_type integer instead of the display string
Agent: security
| if sum(status_counts.values()) == 1: | ||
| return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] | ||
|
|
||
| human_status = get_text_list( |
There was a problem hiding this comment.
🟡 MINOR — Edge Case - Empty failed_checkins (confidence: 100%)
The get_failure_reason function will raise a KeyError if failed_checkins contains only check-ins with statuses NOT in HUMAN_FAILURE_STATUS_MAP (e.g., all filtered out). In that case status_counts would be empty, sum(status_counts.values()) would be 0, which is not == 1, so it falls through to the get_text_list path with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.
Evidence:
- If all check-in statuses are filtered out by the
if checkin['status'] in HUMAN_FAILURE_STATUS_MAP.keys()guard,status_countsis an empty Counter - sum of empty Counter values is 0, which != 1, so it proceeds to format an empty list
- get_text_list with an empty list returns '' which then gets interpolated into the message as ' check-ins detected'
Agent: logic
| "A failed check-in was detected" | ||
| """ | ||
|
|
||
| status_counts = Counter( |
There was a problem hiding this comment.
🟡 MINOR — Documentation (confidence: 86%)
Typo in docstring: 'humam readible' should be 'human readable'
Evidence:
- The docstring for
get_failure_reasoncontains a spelling error that reduces code clarity - This is a public function with a docstring that should follow proper English conventions
Agent: style
| from datetime import datetime | ||
| from typing import cast | ||
|
|
||
| from sentry.monitors.logic.incident_occurrence import create_incident_occurrence |
There was a problem hiding this comment.
🟡 MINOR — Module boundaries / Dependency direction (confidence: 84%)
incidents.py directly imports from incident_occurrence.py, establishing a one-way dependency. This is architecturally fine for now, but if incident_occurrence.py ever needs to reference incident logic, a circular dependency would result. Consider whether the create_incident_occurrence call should instead be injected or dispatched from mark_failed.py (the orchestrator) to keep both new modules independent of each other.
Evidence:
- incidents.py line 7:
from sentry.monitors.logic.incident_occurrence import create_incident_occurrence - mark_failed.py already serves as the orchestrator; it could call both modules sequentially rather than having incidents.py call incident_occurrence.py
Agent: architecture
|
|
||
| id: int | ||
| date_added: datetime | ||
| status: int |
There was a problem hiding this comment.
🔵 INFO — Type definition placement (confidence: 85%)
The SimpleCheckIn TypedDict was correctly moved from mark_failed.py to the shared types.py module. Both incidents.py and incident_occurrence.py import it from types.py, which is the right pattern for shared type definitions. Good architectural choice.
Evidence:
- incident_occurrence.py line 20:
from sentry.monitors.types import SimpleCheckIn - incidents.py line 9:
from sentry.monitors.types import SimpleCheckIn
Agent: architecture
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — security agent (Small fix (3 lines, 1 file))
get_failure_reason has two edge cases: (1) If all check-ins in failed_checkins have statuses not in HUMAN_FAILURE_STATUS_MAP, status_counts is empty, sum is 0, and the function falls through to produce a malformed/empty string like '0 check-ins detected'. (2) The docstring mentions 'A failed check-in was detected' as a possible output but this string does not exist in the code — it appears the original had a fallback that was not ported.
| "id": str(monitor_environment.monitor.guid), | |
| if sum(status_counts.values()) == 0: | |
| return _("A failed check-in was detected") | |
🤖 Grapple PR auto-fix • major • confidence: 100%
| "config": monitor_environment.monitor.config, | ||
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))
The
get_monitor_environment_contextfunction builds aconfigcopy and conditionally modifiesschedule_type, but then returnsmonitor_environment.monitor.config(the original) instead of the modifiedconfigcopy. This bug existed in the originalmark_failed.pyand was faithfully copied, but it's worth noting as a pre-existing logic error where the config copy is created but never used in the return value.
| } | |
| "config": config, |
🤖 Grapple PR auto-fix • major • confidence: 100%
| MonitorEnvironment, | ||
| MonitorIncident, | ||
| ) | ||
| from sentry.monitors.types import SimpleCheckIn |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (6 lines, 1 file))
The
SimpleCheckIntype is imported fromsentry.monitors.typeswhich is correct, but any external code that previously importedSimpleCheckInfromsentry.monitors.logic.mark_failedwill break since it has been removed from that module and no re-export is provided.
| from sentry.monitors.types import SimpleCheckIn | |
| # Re-export SimpleCheckIn for backward compatibility. | |
| # The canonical definition has moved to sentry.monitors.types as part of the | |
| # incident_occurrence.py refactor. External code importing SimpleCheckIn from | |
| # this module will continue to work without modification. | |
| from sentry.monitors.types import SimpleCheckIn as SimpleCheckIn # noqa: F401 | |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — security agent (Small fix (2 lines, 1 file))
get_monitor_environment_context returns monitor_environment.monitor.config (the original dict) instead of the local
configcopy that has schedule_type transformed to a display string. The copy is built and modified but then discarded.
| "id": str(monitor_environment.monitor.guid), | |
| "config": config, |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| if sum(status_counts.values()) == 1: | ||
| return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] | ||
|
|
||
| human_status = get_text_list( |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — logic agent (Small fix (7 lines, 1 file))
The
get_failure_reasonfunction will raise aKeyErroriffailed_checkinscontains only check-ins with statuses NOT inHUMAN_FAILURE_STATUS_MAP(e.g., all filtered out). In that casestatus_countswould be empty,sum(status_counts.values())would be 0, which is not == 1, so it falls through to theget_text_listpath with an empty list. While this is a pre-existing edge case, the extraction is a good opportunity to note it.
| human_status = get_text_list( | |
| # Guard against the edge case where all check-ins have statuses not present | |
| # in HUMAN_FAILURE_STATUS_MAP (e.g. unexpected/future statuses). Without this, | |
| # get_text_list([]) returns '' and we'd produce the malformed string | |
| # " check-ins detected". | |
| if not status_counts: | |
| return _("A failed check-in was detected") | |
🤖 Grapple PR auto-fix • minor • confidence: 100%
| "A failed check-in was detected" | ||
| """ | ||
|
|
||
| status_counts = Counter( |
There was a problem hiding this comment.
✅ [Grapple PR] Auto-fix — style agent (Small fix (2 lines, 1 file))
Typo in docstring: 'humam readible' should be 'human readable'
| status_counts = Counter( | |
| Builds a human readable string from a list of failed check-ins. |
🤖 Grapple PR auto-fix • minor • confidence: 86%
| monitor_env.save(update_fields=("status",)) | ||
|
|
||
| starting_checkin = previous_checkins[0] | ||
|
|
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — security agent (Larger fix (12 lines, 1 file) — review recommended)
When failure_issue_threshold > 1, previous_checkins is populated from a DB query. If the query returns 0 rows (race condition, data deleted), the subsequent
previous_checkins[0]at line 60 raises an IndexError, causing an unhandled exception. Also, the query does not filter out OK check-ins before slicing — it fetches the N most recent and then checks if any are OK, but if there are fewer than N check-ins total, the threshold check is applied to fewer items than intended.
--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -37,16 +37,25 @@
else:
previous_checkins = cast(
list[SimpleCheckIn],
# Using .values for performance reasons
MonitorCheckIn.objects.filter(
monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
)
+ .exclude(status=CheckInStatus.OK)
.order_by("-date_added")
+ [: failure_issue_threshold]
.values("id", "date_added", "status"),
)
# reverse the list after slicing in order to start with oldest check-in
- previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+ previous_checkins = list(reversed(previous_checkins))
- # If we have any successful check-ins within the threshold of
- # commits we have NOT reached an incident state
- if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+ # If we don't have enough consecutive failures to meet the threshold,
+ # we have NOT reached an incident state.
+ # Also guard against an empty result (race condition / data deleted)
+ # which would cause IndexError at previous_checkins[0] below.
+ if len(previous_checkins) < failure_issue_threshold:
return False
# change monitor status + update fingerprint timestamp🤖 Grapple PR auto-fix • major • Review this diff before applying
|
🔴
View suggested fix (needs manual review)--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,6 +1,7 @@
from __future__ import annotations
import logging
from datetime import datetime
from typing import cast
from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
from sentry.monitors.types import SimpleCheckIn
logger = logging.getLogger(__name__)
def try_incident_threshold(
failed_checkin: MonitorCheckIn,
failure_issue_threshold: int,
received: datetime | None,
) -> bool:
🤖 Grapple PR auto-fix • major • This fix needs manual verification |
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } | ||
| return try_incident_threshold(failed_checkin, failure_issue_threshold, received) |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (20 lines, 1 file) — review recommended)
The function
mark_failed_thresholdhas been renamed totry_incident_thresholdand moved to a new module. Similarly,create_issue_platform_occurrencehas been renamed tocreate_incident_occurrenceand moved. Any existing tests or callers that mock or patch these functions at their old paths (e.g.,@mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold'),@mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')) will silently stop intercepting the real function and tests will either fail or pass incorrectly. The function renaming (not just moving) doubles the risk since even callers doingfrom sentry.monitors.logic.mark_failed import create_issue_platform_occurrencewill get an ImportError.
--- a/src/sentry/monitors/logic/mark_failed.py
+++ b/src/sentry/monitors/logic/mark_failed.py
@@ -1,7 +1,7 @@
from __future__ import annotations
import logging
-from datetime import datetime
+from datetime import datetime # noqa: F401
from django.db.models import Q
-from sentry.monitors.logic.incidents import try_incident_threshold
+from sentry.monitors.logic.incidents import try_incident_threshold
+from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment
logger = logging.getLogger(__name__)
+# Backward-compatible re-exports: these old names are preserved so that any
+# existing test mocks patching the old paths continue to work correctly:
+#
+# @mock.patch('sentry.monitors.logic.mark_failed.mark_failed_threshold')
+# @mock.patch('sentry.monitors.logic.mark_failed.create_issue_platform_occurrence')
+#
+# Without these aliases, mocks targeting the old names would silently stop
+# intercepting the real function, causing tests to either fail or pass
+# incorrectly (depending on whether the mock was expected to prevent side effects
+# or assert call counts).
+#
+# When all callers and test mocks have been updated to use the new names and
+# module paths, these aliases can be safely removed.
+mark_failed_threshold = try_incident_threshold
+create_issue_platform_occurrence = create_incident_occurrence🤖 Grapple PR auto-fix • major • Review this diff before applying
| from datetime import datetime | ||
| from typing import cast | ||
|
|
||
| from sentry.monitors.logic.incident_occurrence import create_incident_occurrence |
There was a problem hiding this comment.
🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (33 lines, 1 file) — review recommended)
incidents.pydirectly imports fromincident_occurrence.py, establishing a one-way dependency. This is architecturally fine for now, but ifincident_occurrence.pyever needs to reference incident logic, a circular dependency would result. Consider whether thecreate_incident_occurrencecall should instead be injected or dispatched frommark_failed.py(the orchestrator) to keep both new modules independent of each other.
--- a/src/sentry/monitors/logic/incidents.py
+++ b/src/sentry/monitors/logic/incidents.py
@@ -1,10 +1,9 @@
from __future__ import annotations
import logging
from datetime import datetime
from typing import cast
-from sentry.monitors.logic.incident_occurrence import create_incident_occurrence
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorIncident, MonitorStatus
from sentry.monitors.types import SimpleCheckIn
@@ -14,7 +13,14 @@ logger = logging.getLogger(__name__)
def try_incident_threshold(
failed_checkin: MonitorCheckIn,
failure_issue_threshold: int,
received: datetime | None,
-) -> bool:
+) -> tuple[bool, list[SimpleCheckIn], MonitorIncident | None]:
+ """
+ Evaluate whether the failure threshold has been reached and return the
+ incident context needed to create occurrences.
+
+ Returns a 3-tuple of (threshold_reached, previous_checkins, incident).
+ The caller (mark_failed.py orchestrator) is responsible for calling
+ create_incident_occurrence so that incidents.py and incident_occurrence.py
+ remain independent of each other.
+ """
from sentry.signals import monitor_environment_failed
monitor_env = failed_checkin.monitor_environment
if monitor_env is None:
- return False
+ return False, [], None
# check to see if we need to update the status
if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]:
@@ -63,7 +69,7 @@ def try_incident_threshold(
# get the active incident from the monitor environment
incident = monitor_env.active_incident
else:
# don't send occurrence for other statuses
- return False
+ return False, [], None
# Only create an occurrence if:
# - We have an active incident and fingerprint
# - The monitor and env are not muted
- if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
- checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
- for checkin in checkins:
- create_incident_occurrence(
- previous_checkins,
- checkin,
- incident,
- received=received,
- )
-
monitor_environment_failed.send(monitor_environment=monitor_env, sender=type(monitor_env))
- return True
+ # Return context to the orchestrator so it can call create_incident_occurrence
+ # without incidents.py taking a direct dependency on incident_occurrence.py.
+ if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
+ return True, previous_checkins, incident
+
+ return True, [], None🤖 Grapple PR auto-fix • minor • Review this diff before applying
Benchmark test PR #8 - Replicating Greptile benchmark for Grapple PR evaluation.