Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions src/sentry/consumers/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ def uptime_options() -> list[click.Option]:
options = [
click.Option(
["--mode", "mode"],
type=click.Choice(["serial", "parallel", "batched-parallel"]),
type=click.Choice(["serial", "parallel", "batched-parallel", "thread-queue-parallel"]),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — API Contract / Consumer Registration (confidence: 100%)

The new mode thread-queue-parallel is added to the CLI choices, but there's no validation downstream that max_workers is provided or has a sensible value when this mode is selected. The default falls back to max_workers or 20 in result_consumer.py, but if someone passes --max-workers 0 or a negative number, the behavior is undefined. The FixedQueuePool.__init__ doesn't validate num_queues.

Evidence:

  • Line 121: 'thread-queue-parallel' added as a valid choice
  • result_consumer.py line 134: num_queues=max_workers or 20 — if max_workers=0, this would use 20 (falsy), but max_workers=-1 would use -1
  • Intent specification calls out: 'Behavior when queue count is 0 or negative — should fail fast with a clear error'

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — architecture agent (Small fix (2 lines, 1 file))

The new mode thread-queue-parallel is added to the CLI choices, but there's no validation downstream that max_workers is provided or has a sensible value when this mode is selected. The default falls back to max_workers or 20 in result_consumer.py, but if someone passes --max-workers 0 or a negative number, the behavior is undefined. The FixedQueuePool.__init__ doesn't validate num_queues.

Suggested change
type=click.Choice(["serial", "parallel", "batched-parallel", "thread-queue-parallel"]),
help="The maximum number of workers/queues to use when in a parallel mode. Must be a positive integer when using thread-queue-parallel mode.",

🤖 Grapple PR auto-fix • minor • confidence: 100%

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Intent Mismatch / PR-Issue Linkage (confidence: 82%)

The linked issue #9 is titled 'Optimize spans buffer insertion with eviction during insert' which is about spans buffer optimization, not about queue-based parallelism for uptime consumers. This PR does not appear to address issue #9's actual content.

Evidence:

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (7 lines, 1 file))

The linked issue #9 is titled 'Optimize spans buffer insertion with eviction during insert' which is about spans buffer optimization, not about queue-based parallelism for uptime consumers. This PR does not appear to address issue #9's actual content.

Suggested change
type=click.Choice(["serial", "parallel", "batched-parallel", "thread-queue-parallel"]),
"""Return a list of uptime-results consumer options.
These options control parallelism for the uptime/remote-subscription
consumer, including the queue-based 'thread-queue-parallel' mode which
dispatches work to parallel queues via QueueConsumer.
"""

🤖 Grapple PR auto-fix • minor • confidence: 82%

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — API Contract / Consumer Registration (confidence: 82%)

The new mode 'thread-queue-parallel' is added to the CLI choices, but the diff only shows 2 lines changed in this file. If the consumer factory wiring (the function that maps mode to strategy factory) does not also handle 'thread-queue-parallel', selecting this mode at runtime would pass validation but could fail silently or raise an error downstream. The ResultsStrategyFactory.__init__ does handle it, but verify that any dispatch logic in __init__.py that conditionally constructs strategy factories based on mode also recognizes this new value.

Evidence:

  • Only 2 lines changed in init.py: the Choice list and the help text
  • If there is a mode-based dispatch/validation elsewhere in init.py (not shown in diff), it may not handle 'thread-queue-parallel'

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — architecture agent (Small fix (4 lines, 1 file))

The new mode 'thread-queue-parallel' is added to the CLI choices, but the diff only shows 2 lines changed in this file. If the consumer factory wiring (the function that maps mode to strategy factory) does not also handle 'thread-queue-parallel', selecting this mode at runtime would pass validation but could fail silently or raise an error downstream. The ResultsStrategyFactory.__init__ does handle it, but verify that any dispatch logic in __init__.py that conditionally constructs strategy factories based on mode also recognizes this new value.

Suggested change
type=click.Choice(["serial", "parallel", "batched-parallel", "thread-queue-parallel"]),
type=click.Choice(["serial", "parallel", "batched-parallel", "thread-queue-parallel"]),
help="The mode to process results in. 'serial' processes one at a time, 'parallel' uses multithreading, 'batched-parallel' batches results before parallel processing, 'thread-queue-parallel' uses a queue-based concurrency model.",

🤖 Grapple PR auto-fix • major • confidence: 82%

default="serial",
help="The mode to process results in. Parallel uses multithreading.",
),
Expand All @@ -138,7 +138,7 @@ def uptime_options() -> list[click.Option]:
["--max-workers", "max_workers"],
type=int,
default=None,
help="The maximum number of threads to spawn in parallel mode.",
help="The maximum amount of parallelism to use when in a parallel mode.",
),
click.Option(["--processes", "num_processes"], default=1, type=int),
click.Option(["--input-block-size"], type=int, default=None),
Expand Down
345 changes: 345 additions & 0 deletions src/sentry/remote_subscriptions/consumers/queue_consumer.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,345 @@
from __future__ import annotations

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Intent/Ticket Mismatch (confidence: 95%)

The linked issue #9 is titled 'Optimize spans buffer insertion with eviction during insert' which describes spans buffer optimization, not queue-based parallelism for uptime consumers. This PR's changes are entirely about adding a QueueConsumer for uptime/remote subscription consumers. The PR does not address the linked ticket.

Evidence:

Agent: logic


import logging
import queue
import threading
import time
from collections import defaultdict
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

import sentry_sdk
from arroyo.backends.kafka.consumer import KafkaPayload
from arroyo.processing.strategies import ProcessingStrategy
from arroyo.types import BrokerValue, FilteredPayload, Message, Partition

from sentry.utils import metrics

logger = logging.getLogger(__name__)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Code Organization (confidence: 85%)

The module lacks a docstring explaining the purpose of the QueueConsumer abstraction and its role in the system. This is important for maintainability given the complexity of offset tracking and thread synchronization.

Evidence:

  • No module-level docstring explaining the queue-consumer pattern, why it exists, or how it relates to the broader result consumer system
  • The code introduces complex thread-safety mechanisms (OffsetTracker with locks, WorkItem queues, worker threads) that need documentation
  • New developers won't understand the design intent without a high-level explanation

Agent: style

T = TypeVar("T")


@dataclass
class WorkItem(Generic[T]):
"""Work item that includes the original message for offset tracking."""

partition: Partition
offset: int
result: T
message: Message[KafkaPayload | FilteredPayload]


class OffsetTracker:
"""
Tracks outstanding offsets and determines which offsets are safe to commit.

- Tracks offsets per partition
- Only commits offsets when all prior offsets are processed
- Thread-safe for concurrent access with per-partition locks
"""

def __init__(self) -> None:
self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
self.outstanding: dict[Partition, set[int]] = defaultdict(set)
self.last_committed: dict[Partition, int] = {}
self.partition_locks: dict[Partition, threading.Lock] = {}

def _get_partition_lock(self, partition: Partition) -> threading.Lock:
"""Get or create a lock for a partition."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Documentation (confidence: 87%)

Missing module-level docstring explaining the purpose, architecture, and key design decisions of the QueueConsumer system. This is important for understanding offset commit semantics, concurrency model, and when to use this mode.

Evidence:

  • Module starts directly with imports without explaining the queue-based parallelism strategy
  • Complex offset tracking and queue management logic would benefit from high-level architecture documentation
  • Risk areas identified mention critical offset commit and deadlock concerns that should be documented at module level

Agent: style

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — style agent (Larger fix (57 lines, 1 file) — review recommended)

Missing module-level docstring explaining the purpose, architecture, and key design decisions of the QueueConsumer system. This is important for understanding offset commit semantics, concurrency model, and when to use this mode.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -1,4 +1,45 @@
 from __future__ import annotations
 
+"""
+Queue-based parallel processing strategy for remote subscription / uptime consumers.
+
+Architecture Overview
+---------------------
+This module implements a fan-out parallelism model on top of Arroyo's
+``ProcessingStrategy`` interface.  Instead of processing each Kafka message
+sequentially inside the Arroyo event loop, messages are decoded and dispatched
+into a fixed pool of in-process queues, each drained by a dedicated worker
+thread.  This lets CPU-bound or I/O-bound result processing happen concurrently
+without spinning up a new thread per message.
+
+Key components
+--------------
+``WorkItem``
+    Thin wrapper that carries a decoded result together with the Kafka
+    partition/offset needed for offset tracking.
+
+``OffsetTracker``
+    Maintains per-partition sets of *all* seen offsets and *outstanding*
+    (not-yet-completed) offsets.  An offset is only eligible for commit once
+    every lower offset in the same partition has been marked complete,
+    preserving the at-least-once delivery guarantee expected by Kafka consumers.
+    All mutations are protected by per-partition ``threading.Lock`` instances to
+    allow safe concurrent access from worker threads.
+
+``FixedQueuePool``
+    Owns a fixed number of ``queue.Queue`` objects (default 20) and a 1-to-1
+    mapping of worker threads.  Groups (e.g. subscription IDs) are assigned to
+    a queue via ``hash(group_key) % num_queues``, so all messages belonging to
+    the same group are always processed by the same worker in FIFO order.
+
+``SimpleQueueProcessingStrategy``
+    The Arroyo ``ProcessingStrategy`` entry point.  Its ``submit()`` method
+    decodes each message and hands it to ``FixedQueuePool``.  A background
+    *commit thread* wakes up every second, asks ``OffsetTracker`` for the
+    highest safely committable offset per partition, invokes the upstream
+    commit callback, and advances the committed-offset watermark.
+
+Critical design considerations
+-------------------------------
+- **Offset commit safety**: offsets are committed only after *all* preceding
+  offsets in the partition are complete.  Never call ``mark_committed`` before
+  ``get_committable_offsets`` confirms contiguous completion, or messages may
+  be silently skipped on consumer restart.
+- **Deadlock avoidance**: per-partition locks are held only for the duration of
+  a single set/dict operation; the locks are never nested.  Do not acquire
+  multiple partition locks simultaneously.
+- **Backpressure**: ``queue.Queue.put()`` blocks when the queue is full (if a
+  ``maxsize`` is set).  The current default is unbounded; add a ``maxsize`` if
+  memory growth under lag needs to be bounded.
+- **When to use this mode**: prefer ``SimpleQueueProcessingStrategy`` over a
+  purely sequential strategy when result processing is independently safe to
+  parallelise across groups (i.e. no cross-group shared mutable state) and
+  per-message latency variance is high enough to create head-of-line blocking.
+"""
+
 import logging
 import queue
 import threading

🤖 Grapple PR auto-fix • minor • Review this diff before applying

lock = self.partition_locks.get(partition)
if lock:
return lock
return self.partition_locks.setdefault(partition, threading.Lock())

def add_offset(self, partition: Partition, offset: int) -> None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Thread Safety - Race Condition (confidence: 100%)

The _get_partition_lock method has a TOCTOU race condition. Two threads could simultaneously check self.partition_locks.get(partition) and both get None, then both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, the partition_locks dict itself is a regular dict being accessed from multiple threads without synchronization. The initial get call and the subsequent setdefault are two separate operations, and the partition_locks dictionary is shared mutable state accessed from worker threads (via complete_offset) and the main thread (via add_offset and get_committable_offsets).

Evidence:

  • The PR description explicitly lists 'Exception propagation from worker threads: unhandled exceptions in threads that don’t propagate to the main thread can cause silent failures' as a risk area
  • Worker threads are daemon threads or background threads consuming from internal queues; if one dies, items put into that queue are never consumed
  • The OffsetTracker will have outstanding offsets that never get completed, so get_committable_offsets() will never advance past the dead worker's last offset
  • No heartbeat, health check, or thread resurrection mechanism is visible in the truncated code
  • partition_locks is a plain dict, not a thread-safe structure
  • Two threads can concurrently call _get_partition_lock for a new partition
  • Python 3.13 introduces experimental no-GIL mode (PEP 703); Sentry may eventually adopt it
  • setdefault atomicity is a CPython implementation detail, not a language guarantee
  • Worker threads call complete_offset which calls _get_partition_lock
  • Main thread calls add_offset and get_committable_offsets which also call _get_partition_lock
  • The partition_locks dict is a plain dict with no synchronization protecting its own access
  • While CPython's GIL makes dict.setdefault effectively atomic, this is an implementation detail, not a language guarantee
  • Intent: 'On shutdown/close, the QueueConsumer should drain in-flight messages from all queues before stopping'
  • Edge case: 'Shutdown signal received while queues have a large backlog — verify drain timeout or maximum wait behavior'
  • The time module is imported but no timeout constants or drain timeout logic is visible in the non-truncated portion
  • queue_consumer.py ~line 170-180: if self._shutdown.is_set() and self._queue.empty(): break
  • queue_consumer.py ~line 200-210: shutdown() sets _shutdown then puts None sentinel
  • Between is_set() and empty(), the main thread could still be putting items via submit()

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (23 lines, 1 file) — review recommended)

The _get_partition_lock method has a TOCTOU race condition. Two threads could simultaneously check self.partition_locks.get(partition) and both get None, then both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, the partition_locks dict itself is a regular dict being accessed from multiple threads without synchronization. The initial get call and the subsequent setdefault are two separate operations, and the partition_locks dictionary is shared mutable state accessed from worker threads (via complete_offset) and the main thread (via add_offset and get_committable_offsets).

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -43,10 +43,17 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Protects partition_locks dict itself during check-and-create.
+        # Per-partition locks are still used for per-partition data access.
+        # This is necessary for correctness under no-GIL Python (PEP 703)
+        # and for clarity: dict.setdefault atomicity is a CPython detail,
+        # not a language guarantee.
+        self._registry_lock: threading.Lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: lock already exists (no registry lock needed for read
+        # once the partition is registered, since we never delete entries).
+        lock = self.partition_locks.get(partition)
+        if lock is not None:
+            return lock
+        # Slow path: partition not yet registered — acquire registry lock
+        # to eliminate the TOCTOU race between get() and insertion.
+        with self._registry_lock:
+            # Re-check inside the lock: another thread may have inserted
+            # the lock between our get() above and acquiring _registry_lock.
+            if partition not in self.partition_locks:
+                self.partition_locks[partition] = threading.Lock()
+            return self.partition_locks[partition]

🤖 Grapple PR auto-fix • critical • Review this diff before applying

"""Record that we've started processing an offset."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Concurrency issues (confidence: 100%)

Race condition in _get_partition_lock: the check-then-set pattern for creating partition locks is not atomic, potentially creating two locks for the same partition under concurrent access.

Evidence:

  • The method first checks self.partition_locks.get(partition), then calls setdefault — but between the get and setdefect, another thread could have inserted a different lock object.
  • While dict.setdefault is atomic in CPython due to the GIL, the initial get call creates a TOCTOU window: the code path if lock: return lock may return a stale reference if the dict is being concurrently modified.
  • More critically, self.all_offsets and self.outstanding in add_offset/complete_offset are defaultdict(set), which are NOT thread-safe. Multiple threads calling add_offset concurrently on a new partition could race on defaultdict.__missing__, corrupting the set.
  • The partition lock is only acquired AFTER accessing self.all_offsets[partition] and self.outstanding[partition] in add_offset — no, wait, looking again: add_offset acquires the lock first. But get_committable_offsets iterates self.all_offsets.keys() without a global lock while other threads may be adding new partitions.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Caching opportunities (confidence: 100%)

_get_partition_lock is called on every add_offset and complete_offset invocation with a dict lookup + potential setdefault, adding unnecessary overhead on the hot path.

Evidence:

  • Every message processed requires two calls to _get_partition_lock (once in add_offset, once in complete_offset).
  • The number of Kafka partitions is fixed and small (typically 8–64). Pre-creating locks for known partitions at startup would eliminate runtime dict operations.
  • The setdefault pattern causes unnecessary lock object creation attempts on every new-partition call.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (42 lines, 1 file) — review recommended)

Race condition in _get_partition_lock: the check-then-set pattern for creating partition locks is not atomic, potentially creating two locks for the same partition under concurrent access.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -43,16 +43,33 @@ class OffsetTracker:
     def __init__(self) -> None:
-        self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
-        self.outstanding: dict[Partition, set[int]] = defaultdict(set)
+        # Use plain dicts (not defaultdict) so that accesses outside a lock
+        # never silently create entries via __missing__, which is not thread-safe.
+        self.all_offsets: dict[Partition, set[int]] = {}
+        self.outstanding: dict[Partition, set[int]] = {}
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Guards creation of new partition entries in all four dicts above.
+        # Per-partition locks guard mutations to already-registered partitions.
+        self._meta_lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
-        """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        """
+        Get or create a lock for a partition.
+
+        Uses _meta_lock to make partition registration fully atomic: the lock
+        object and the corresponding set entries in all_offsets/outstanding are
+        created together so no caller can ever see a partition lock without its
+        backing data structures already in place.
+        """
+        # Fast path: partition already registered (no write, safe to read under GIL
+        # after the initial write was done while holding _meta_lock).
+        lock = self.partition_locks.get(partition)
+        if lock is not None:
+            return lock
+        # Slow path: register the partition atomically.
+        with self._meta_lock:
+            # Re-check inside the lock to handle the race between two threads
+            # both reaching the slow path simultaneously.
+            if partition not in self.partition_locks:
+                self.partition_locks[partition] = threading.Lock()
+                self.all_offsets[partition] = set()
+                self.outstanding[partition] = set()
+            return self.partition_locks[partition]
 
     def add_offset(self, partition: Partition, offset: int) -> None:
         """Record that we've started processing an offset."""
@@ -67,7 +84,12 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
         """
         committable = {}
-        for partition in list(self.all_offsets.keys()):
+        # Snapshot the key list under _meta_lock so we don't iterate a dict
+        # that another thread is concurrently expanding with new partitions.
+        with self._meta_lock:
+            partitions = list(self.partition_locks.keys())
+        for partition in partitions:
             with self._get_partition_lock(partition):

🤖 Grapple PR auto-fix • critical • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (32 lines, 1 file) — review recommended)

_get_partition_lock is called on every add_offset and complete_offset invocation with a dict lookup + potential setdefault, adding unnecessary overhead on the hot path.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -43,14 +43,26 @@ class OffsetTracker:
 
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
 
+    def initialize_partition(self, partition: Partition) -> None:
+        """
+        Pre-create lock and data structures for a partition.
+
+        Should be called at partition assignment time so that the hot path
+        (add_offset / complete_offset) only performs a plain dict lookup
+        with no branching or lock-object creation.
+        """
+        if partition not in self.partition_locks:
+            self.partition_locks[partition] = threading.Lock()
+            # Pre-warm the defaultdict entries so hot-path accesses are O(1)
+            _ = self.all_offsets[partition]
+            _ = self.outstanding[partition]
+
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
-        """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        """
+        Return the lock for a partition.
+
+        Fast path: plain dict lookup after initialize_partition() has been called.
+        Slow path (unexpected partition): creates the lock lazily so existing
+        behaviour is preserved if initialize_partition() was not called.
+        """
+        lock = self.partition_locks.get(partition)
+        if lock is not None:
+            return lock
+        # Unexpected partition — initialize defensively rather than crashing.
+        self.initialize_partition(partition)
+        return self.partition_locks[partition]
 
     def add_offset(self, partition: Partition, offset: int) -> None:

🤖 Grapple PR auto-fix • minor • Review this diff before applying

with self._get_partition_lock(partition):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Offset Commit Correctness (confidence: 100%)

The _get_partition_lock method has a race condition: two threads can simultaneously find lock is None, both call setdefault, but the first thread may proceed with a different lock object than what was actually stored. While dict.setdefault is atomic in CPython due to the GIL, the initial self.partition_locks.get(partition) check followed by setdefault creates a TOCTOU pattern. More critically, self.partition_locks itself is a plain dict being accessed from multiple threads without synchronization — if a new partition is encountered concurrently from two threads, dict mutation during iteration in get_committable_offsets (which iterates self.all_offsets.keys()) could cause issues.

Evidence:

  • Line 56: lock = self.partition_locks.get(partition) — no global lock protects this
  • Line 58: return self.partition_locks.setdefault(partition, threading.Lock()) — while setdefault is atomic in CPython, the pattern is not thread-safe in general
  • The OffsetTracker is accessed from both the main consumer thread (add_offset) and worker threads (complete_offset) concurrently
  • get_committable_offsets iterates self.all_offsets.keys() while worker threads may be modifying the dict via complete_offset

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (25 lines, 1 file) — review recommended)

The _get_partition_lock method has a race condition: two threads can simultaneously find lock is None, both call setdefault, but the first thread may proceed with a different lock object than what was actually stored. While dict.setdefault is atomic in CPython due to the GIL, the initial self.partition_locks.get(partition) check followed by setdefault creates a TOCTOU pattern. More critically, self.partition_locks itself is a plain dict being accessed from multiple threads without synchronization — if a new partition is encountered concurrently from two threads, dict mutation during iteration in get_committable_offsets (which iterates self.all_offsets.keys()) could cause issues.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -38,9 +38,15 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Global lock guards creation of per-partition locks only.
+        # Once a partition lock exists, this lock is not needed for
+        # add_offset / complete_offset / mark_committed (fast path).
+        self._global_lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: partition lock already exists, no global lock needed.
+        # dict reads are safe under CPython GIL for existing keys, and we
+        # only mutate partition_locks under _global_lock below.
+        lock = self.partition_locks.get(partition)
+        if lock is not None:
+            return lock
+        # Slow path: create the partition lock under the global lock to
+        # eliminate the TOCTOU between get() and setdefault().
+        with self._global_lock:
+            # Re-check inside the lock: another thread may have created
+            # the lock between our get() above and acquiring _global_lock.
+            lock = self.partition_locks.get(partition)
+            if lock is not None:
+                return lock
+            lock = threading.Lock()
+            self.partition_locks[partition] = lock
+            return lock

🤖 Grapple PR auto-fix • critical • Review this diff before applying

self.all_offsets[partition].add(offset)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Thread Safety / Race Condition (confidence: 100%)

The _get_partition_lock method has a TOCTOU race condition. Two threads could both check self.partition_locks.get(partition) simultaneously, find it missing, and both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, this is an implementation detail and not guaranteed. More critically, self.partition_locks itself is a plain dict that is being mutated from multiple threads without synchronization. The _get_partition_lock method is called from add_offset and complete_offset which can be called from different threads (main thread and worker threads respectively).

Evidence:

  • add_offset is called from the main consumer thread in SimpleQueueProcessingStrategy.submit()
  • complete_offset is called from worker threads in FixedQueuePool._worker()
  • Both call _get_partition_lock which reads and writes to self.partition_locks without a global lock
  • defaultdict mutation for self.all_offsets and self.outstanding also happens inside per-partition locks, but new partitions could cause dict resize from multiple threads

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (19 lines, 1 file) — review recommended)

The _get_partition_lock method has a TOCTOU race condition. Two threads could both check self.partition_locks.get(partition) simultaneously, find it missing, and both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, this is an implementation detail and not guaranteed. More critically, self.partition_locks itself is a plain dict that is being mutated from multiple threads without synchronization. The _get_partition_lock method is called from add_offset and complete_offset which can be called from different threads (main thread and worker threads respectively).

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -43,12 +43,22 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Protects partition_locks dict itself against concurrent reads/writes
+        # from multiple threads (main consumer thread via add_offset, worker
+        # threads via complete_offset).  Per-partition locks are acquired
+        # separately *after* releasing this meta-lock so that unrelated
+        # partitions never contend on a single global lock.
+        self._meta_lock: threading.Lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: partition already registered — read under meta-lock to
+        # avoid a TOCTOU race on the underlying dict (which is not thread-safe
+        # for concurrent reads during a resize triggered by another writer).
+        with self._meta_lock:
+            lock = self.partition_locks.get(partition)
+            if lock is None:
+                lock = threading.Lock()
+                self.partition_locks[partition] = lock
+            return lock

🤖 Grapple PR auto-fix • critical • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Thread Safety - Race Condition (confidence: 100%)

The _get_partition_lock method has a TOCTOU race condition. Two threads could simultaneously check self.partition_locks.get(partition), both get None, and then both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, the initial get followed by setdefault is not an atomic operation, and this pattern is fragile. More critically, self.partition_locks is a plain dict that is accessed from multiple threads (worker threads calling complete_offset and the main thread calling add_offset and get_committable_offsets). In non-CPython implementations or future Python versions without a GIL (free-threaded Python 3.13+), this is a data race.

Evidence:

  • The method first checks self.partition_locks.get(partition) then conditionally calls self.partition_locks.setdefault(partition, threading.Lock())
  • Multiple worker threads can call add_offset and complete_offset concurrently, which both call _get_partition_lock
  • dict.setdefault is only safe due to CPython GIL - this is not guaranteed by the Python language spec
  • Multiple worker threads call complete_offset which calls _get_partition_lock
  • The main thread calls add_offset and get_committable_offsets which also call _get_partition_lock
  • dict.setdefault is only safe under CPython GIL; the two-step get-then-setdefault pattern is not atomic
  • Python 3.13 introduces free-threaded mode (no GIL) where this would be a real data race
  • lock = self.partition_locks.get(partition) — unlocked read
  • if lock: return lock — returns early without any synchronization guarantee outside CPython's GIL
  • The pattern is fragile: if partition_locks itself were replaced with a non-dict structure, this becomes a real race

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (21 lines, 1 file) — review recommended)

The _get_partition_lock method has a TOCTOU race condition. Two threads could simultaneously check self.partition_locks.get(partition), both get None, and then both call setdefault. While dict.setdefault is atomic in CPython due to the GIL, the initial get followed by setdefault is not an atomic operation, and this pattern is fragile. More critically, self.partition_locks is a plain dict that is accessed from multiple threads (worker threads calling complete_offset and the main thread calling add_offset and get_committable_offsets). In non-CPython implementations or future Python versions without a GIL (free-threaded Python 3.13+), this is a data race.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -40,12 +40,21 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Guards mutations to partition_locks dict itself, ensuring safety
+        # under free-threaded Python 3.13+ (no-GIL) where dict operations
+        # are no longer implicitly serialized.
+        self._partition_locks_lock: threading.Lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: read under the meta-lock to avoid TOCTOU between the
+        # initial get() and setdefault().  Under CPython with the GIL this
+        # is equivalent to the old pattern but is also safe under
+        # free-threaded Python 3.13+ (PEP 703) where dict operations are
+        # no longer serialized by the GIL.
+        lock = self.partition_locks.get(partition)
+        if lock is not None:
+            return lock
+        with self._partition_locks_lock:
+            # Re-check inside the lock (double-checked locking) so we
+            # don't allocate an extra Lock() object unnecessarily when two
+            # threads race here simultaneously.
+            return self.partition_locks.setdefault(partition, threading.Lock())

🤖 Grapple PR auto-fix • critical • Review this diff before applying

self.outstanding[partition].add(offset)

def complete_offset(self, partition: Partition, offset: int) -> None:
"""Mark an offset as completed."""
with self._get_partition_lock(partition):
self.outstanding[partition].discard(offset)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Edge Case - Empty State (confidence: 97%)

The all_offsets set grows unboundedly. After mark_committed is called (visible in tests), offsets should be cleaned from all_offsets to prevent memory growth. For long-running consumers processing millions of messages, this set will consume significant memory.

Evidence:

  • Tests call self.tracker.mark_committed(self.partition1, 100) suggesting a cleanup mechanism exists
  • all_offsets[partition].add(offset) is called for every message but cleanup depends on mark_committed implementation
  • If mark_committed doesn't remove from all_offsets, the set grows forever and min(all_offsets) / max(all_offsets) range iteration becomes increasingly expensive

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (9 lines, 1 file))

The all_offsets set grows unboundedly. After mark_committed is called (visible in tests), offsets should be cleaned from all_offsets to prevent memory growth. For long-running consumers processing millions of messages, this set will consume significant memory.

Suggested change
# Prune all_offsets of any entries already below the last committed
# offset to prevent unbounded memory growth. Offsets <= last_committed
# are already safe to discard since mark_committed handles bulk cleanup,
# but complete_offset may be called many times between commit cycles.
last_committed = self.last_committed.get(partition, -1)
if last_committed >= 0:
self.all_offsets[partition] = {
o for o in self.all_offsets[partition] if o > last_committed
}

🤖 Grapple PR auto-fix • minor • confidence: 97%

def get_committable_offsets(self) -> dict[Partition, int]:
"""
Get the highest offset per partition that can be safely committed.

For each partition, finds the highest contiguous offset that has been processed.
"""
committable = {}
for partition in list(self.all_offsets.keys()):
with self._get_partition_lock(partition):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Concurrency issues (confidence: 100%)

_get_partition_lock() has a race condition: checking existence and then calling setdefault are not atomic relative to concurrent reads of partition_locks itself, and partition_locks is a plain dict with no lock protecting it. Two threads could simultaneously pass the if lock: return lock check, both see None, and both attempt setdefault — while a third thread could be iterating partition_locks in get_committable_offsets().

Evidence:

  • self.partition_locks is a plain dict with no protecting lock
  • get_committable_offsets() calls list(self.all_offsets.keys()) and then self._get_partition_lock(partition) — dict is being read while worker threads could be writing new partition entries
  • Python dict is not thread-safe for concurrent structural modifications (adding new keys while iterating)
  • The per-partition lock creation itself is not atomic: lock = self.partition_locks.get(partition) followed by setdefault is a TOCTOU pattern

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (21 lines, 1 file) — review recommended)

_get_partition_lock() has a race condition: checking existence and then calling setdefault are not atomic relative to concurrent reads of partition_locks itself, and partition_locks is a plain dict with no lock protecting it. Two threads could simultaneously pass the if lock: return lock check, both see None, and both attempt setdefault — while a third thread could be iterating partition_locks in get_committable_offsets().

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -43,10 +43,16 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Protects structural modifications to partition_locks (i.e. adding new
+        # partition entries).  Per-partition locks are still used for the actual
+        # offset data so contention on this registry lock is minimal — it is
+        # only held long enough to do a dict lookup + possible setdefault.
+        self._registry_lock: threading.Lock = threading.Lock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: partition already registered.  Read under registry lock so
+        # we never observe a partially-constructed partition_locks dict.
+        with self._registry_lock:
+            lock = self.partition_locks.get(partition)
+            if lock is not None:
+                return lock
+            # Slow path: first time we see this partition — create and register
+            # atomically under the same registry lock so no other thread can
+            # race us to setdefault and end up with a different lock object.
+            new_lock = threading.Lock()
+            self.partition_locks[partition] = new_lock
+            return new_lock

🤖 Grapple PR auto-fix • critical • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Memory leak (confidence: 100%)

all_offsets[partition] is never cleaned up after mark_committed is called. Offsets are added to all_offsets in add_offset() but the mark_committed() method only updates last_committed and removes from outstanding — it never removes the offset from all_offsets. Over time this set grows without bound, consuming increasing memory.

Evidence:

  • add_offset adds to self.all_offsets[partition]
  • complete_offset only discards from self.outstanding[partition]
  • mark_committed updates self.last_committed[partition] but does not prune self.all_offsets[partition]
  • In a long-running consumer processing millions of messages, this will cause significant memory growth
  • The mark_committed method only sets self.last_committed[partition] = offset but never calls self.all_offsets[partition].discard() or clears entries below the committed offset.
  • In get_committable_offsets, min_offset and max_offset are computed over ALL offsets ever seen for a partition, not just those since the last commit.
  • In a long-running consumer, all_offsets for each partition will accumulate millions of integers, making min() and max() calls O(n) over the entire history.
  • The test test_simple_tracking does call mark_committed but never checks memory behavior over many offsets.
  • Line 46: self.all_offsets: dict[Partition, set[int]] = defaultdict(set) — unbounded growth
  • The intent spec mentions this risk: 'All workers fail simultaneously (e.g., downstream service unavailable) — verify the consumer does not deadlock waiting for queue slots to free'
  • A single stuck offset blocks all cleanup for that partition

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — performance agent (Small fix (5 lines, 1 file))

all_offsets[partition] is never cleaned up after mark_committed is called. Offsets are added to all_offsets in add_offset() but the mark_committed() method only updates last_committed and removes from outstanding — it never removes the offset from all_offsets. Over time this set grows without bound, consuming increasing memory.

Suggested change
with self._get_partition_lock(partition):
# Prune all_offsets to prevent unbounded memory growth.
# Only retain offsets strictly greater than the committed watermark,
# since anything <= offset has already been committed and will never
# need to be considered by get_committable_offsets again.
self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}

🤖 Grapple PR auto-fix • major • confidence: 100%

all_offsets = self.all_offsets[partition]
if not all_offsets:
continue

outstanding = self.outstanding[partition]
last_committed = self.last_committed.get(partition, -1)

min_offset = min(all_offsets)
max_offset = max(all_offsets)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Memory leaks (confidence: 100%)

OffsetTracker.all_offsets grows unboundedly — completed offsets are never removed from all_offsets, only from outstanding. Over time this set accumulates every offset ever seen for a partition, causing unbounded memory growth in long-running consumers.

Evidence:

  • add_offset adds to self.all_offsets[partition] but neither complete_offset nor mark_committed remove entries from all_offsets
  • get_committable_offsets calls min(all_offsets) and max(all_offsets) on the full historical set, not just the window since last commit
  • In mark_committed, only last_committed is updated; all_offsets is untouched
  • A consumer processing millions of messages per day will accumulate millions of integers in this set per partition
  • add_offset() adds to both all_offsets and outstanding
  • complete_offset() only calls outstanding[partition].discard(offset) — all_offsets is never pruned
  • mark_committed() (referenced in tests) presumably removes from all_offsets, but get_committable_offsets() iterates all_offsets which will grow indefinitely if mark_committed is not called correctly or lags behind
  • In a high-throughput uptime consumer processing millions of messages per day, this set will consume gigabytes of memory
  • No revoke_partition or clear_partition method exists on OffsetTracker
  • The SimpleQueueProcessingStrategy has no code path that clears partition state on rebalance
  • last_committed persists across rebalances, so if partition 0 was at offset 500 before revoke, and comes back at offset 100 after reassignment, get_committable_offsets will never return a value for it since all new offsets will be < last_committed

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (32 lines, 1 file) — review recommended)

OffsetTracker.all_offsets grows unboundedly — completed offsets are never removed from all_offsets, only from outstanding. Over time this set accumulates every offset ever seen for a partition, causing unbounded memory growth in long-running consumers.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -93,6 +93,18 @@ class OffsetTracker:
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
             # Remove all offsets <= committed offset to prevent unbounded growth
+            # all_offsets now only holds the in-flight window, not the full history
             self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+
+    def revoke_partition(self, partition: Partition) -> None:
+        """
+        Clear all state for a partition on rebalance.
+
+        Must be called when a partition is revoked to prevent stale last_committed
+        from suppressing valid offsets after reassignment at a lower offset.
+        """
+        with self._get_partition_lock(partition):
+            self.all_offsets.pop(partition, None)
+            self.outstanding.pop(partition, None)
+            self.last_committed.pop(partition, None)
+            # Note: keep the lock entry — it's cheap and avoids a TOCTOU race
+            # if a worker thread completes an offset for this partition concurrently.

@@ -240,6 +252,20 @@ class SimpleQueueProcessingStrategy(ProcessingStrategy[KafkaPayload], Generic[T
         self.commit_thread = threading.Thread(target=self._commit_loop, daemon=True)
         self.commit_thread.start()
 
+    def revoke_partitions(self, partitions: list[Partition]) -> None:
+        """
+        Called by Arroyo on partition rebalance.
+
+        Clears OffsetTracker state for revoked partitions so that stale
+        last_committed values don't suppress offsets after reassignment.
+        Without this, a partition reassigned at a lower offset (e.g. after
+        consumer group rebalance) would never produce a committable offset
+        because all new offsets would be < last_committed.
+        """
+        for partition in partitions:
+            logger.info(
+                "Revoking partition state from offset tracker",
+                extra={"partition": partition},
+            )
+            self.queue_pool.offset_tracker.revoke_partition(partition)
+
     def _commit_loop(self) -> None:
         while not self.shutdown_event.is_set():
             try:

🤖 Grapple PR auto-fix • critical • Review this diff before applying

start = max(last_committed + 1, min_offset)

highest_committable = last_committed
for offset in range(start, max_offset + 1):
if offset in all_offsets and offset not in outstanding:
highest_committable = offset
else:
break

if highest_committable > last_committed:
committable[partition] = highest_committable

return committable

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Logic Error - Offset Commit Correctness (confidence: 100%)

The get_committable_offsets method iterates offsets from min_offset to max_offset checking if each is NOT in outstanding, but it does NOT check if each offset is actually in all_offsets. If there are gaps in the tracked offsets (e.g., offsets 100, 102 are tracked but 101 was never seen), the method will walk through offset 101, see it's not outstanding, and consider it committed — advancing past an offset that was never processed. This violates the acceptance criterion that 'offsets should only be committed after all in-flight work at or before that offset is complete'.

Evidence:

  • Lines 86-93: the loop does for offset in range(min_offset, max_offset + 1) and checks if offset in outstanding but never checks if offset in all_offsets
  • In Kafka, it's possible for the consumer to receive non-contiguous offsets (e.g., due to compaction, filtered payloads, or partition reassignment)
  • The intent spec explicitly states: 'Offset commits are not advanced past unprocessed messages'

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (2 lines, 1 file))

The get_committable_offsets method iterates offsets from min_offset to max_offset checking if each is NOT in outstanding, but it does NOT check if each offset is actually in all_offsets. If there are gaps in the tracked offsets (e.g., offsets 100, 102 are tracked but 101 was never seen), the method will walk through offset 101, see it's not outstanding, and consider it committed — advancing past an offset that was never processed. This violates the acceptance criterion that 'offsets should only be committed after all in-flight work at or before that offset is complete'.

Suggested change
return committable
if offset in all_offsets and offset not in outstanding:

🤖 Grapple PR auto-fix • critical • confidence: 100%


Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Naming conventions (confidence: 83%)

The method _get_partition_lock() initializes a lock that may already exist but doesn't use the dict's setdefault pattern consistently with all_offsets and outstanding initialization.

Evidence:

  • Line 49: defaultdict(set) is used for auto-initialization of partition collections
  • Lines 78-82: Lock initialization uses manual lookup and setdefault instead of a defaultdict
  • This inconsistency makes the code harder to scan for initialization patterns

Agent: style

def mark_committed(self, partition: Partition, offset: int) -> None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Concurrency issues (confidence: 100%)

get_committable_offsets calls min(all_offsets) and max(all_offsets) while holding per-partition locks, but the outer loop iterates self.all_offsets.keys() without any lock, creating a race with add_offset adding new partitions.

Evidence:

  • for partition in list(self.all_offsets.keys()) — the list() snapshot is taken without a lock, so a concurrent add_offset call on a new partition may add a key to all_offsets but not yet have acquired the per-partition lock, leading to incomplete state being read.
  • The inner lock only protects per-partition operations, not the cross-partition key enumeration.
  • This is called from the main Kafka poll thread while worker threads call complete_offset concurrently.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Algorithmic complexity (confidence: 100%)

get_committable_offsets is O(n) in the number of outstanding offsets per partition for each call to min() and max(), and this is called on every Kafka poll cycle.

Evidence:

  • min(all_offsets) and max(all_offsets) both iterate the entire set — O(k) where k is number of tracked offsets per partition.
  • If all_offsets is not pruned (see memory leak finding above), this degrades to O(total messages processed) per poll cycle.
  • The contiguous-offset scan loop for offset in range(min_offset, max_offset + 1) is also O(gap size), which could be large if a single offset stalls.
  • This is invoked on every message submission in SimpleQueueProcessingStrategy.submit, multiplied by the number of partitions.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Algorithmic complexity (confidence: 100%)

The contiguous offset scan for offset in range(min_offset, max_offset + 1) is O(gap_size) and runs inside a per-partition lock on every get_committable_offsets call.

Evidence:

  • If offset 1000 is outstanding and offsets 1001–2000 are complete, the range scan will iterate 1000 entries every poll cycle to determine no progress can be made.
  • This is called frequently (on every submit() call or poll) and holds the partition lock during the scan, blocking complete_offset calls from worker threads.
  • The scan correctly stops at the first outstanding offset, but the lock contention during the scan delays worker threads from marking completions.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (25 lines, 1 file) — review recommended)

get_committable_offsets calls min(all_offsets) and max(all_offsets) while holding per-partition locks, but the outer loop iterates self.all_offsets.keys() without any lock, creating a race with add_offset adding new partitions.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -38,6 +38,12 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # Global lock guards cross-partition operations (key enumeration, new partition
+        # registration). Per-partition locks guard per-partition data mutations.
+        # Use RLock in case future refactors call these methods re-entrantly.
+        self._global_lock = threading.RLock()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
-        lock = self.partition_locks.get(partition)
-        if lock:
-            return lock
-        return self.partition_locks.setdefault(partition, threading.Lock())
+        # Fast path: partition lock already exists, no global lock needed.
+        lock = self.partition_locks.get(partition)
+        if lock:
+            return lock
+        # Slow path: new partition — hold global lock to prevent two threads
+        # from racing on setdefault for the same partition.
+        with self._global_lock:
+            return self.partition_locks.setdefault(partition, threading.Lock())
 
     def add_offset(self, partition: Partition, offset: int) -> None:
         """Record that we've started processing an offset."""
         with self._get_partition_lock(partition):
             self.all_offsets[partition].add(offset)
             self.outstanding[partition].add(offset)
 
     def complete_offset(self, partition: Partition, offset: int) -> None:
         """Mark an offset as completed."""
         with self._get_partition_lock(partition):
             self.outstanding[partition].discard(offset)
 
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
         """
         committable = {}
-        for partition in list(self.all_offsets.keys()):
+        # Hold the global lock only long enough to take a consistent snapshot of
+        # the partition keys. This prevents add_offset from inserting a brand-new
+        # partition (and its incomplete defaultdict entry) between our snapshot and
+        # the per-partition lock acquisition below.
+        with self._global_lock:
+            partitions = list(self.all_offsets.keys())
+
+        for partition in partitions:
             with self._get_partition_lock(partition):
                 all_offsets = self.all_offsets[partition]
                 if not all_offsets:
                     continue
 
                 outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
                 min_offset = min(all_offsets)
                 max_offset = max(all_offsets)
 
                 start = max(last_committed + 1, min_offset)
 
                 highest_committable = last_committed
                 for offset in range(start, max_offset + 1):
                     if offset in all_offsets and offset not in outstanding:
                         highest_committable = offset
                     else:
                         break
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable
 
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
             # Remove all offsets <= committed offset
             self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}

🤖 Grapple PR auto-fix • critical • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (50 lines, 1 file) — review recommended)

get_committable_offsets is O(n) in the number of outstanding offsets per partition for each call to min() and max(), and this is called on every Kafka poll cycle.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import logging
+import queue
 import threading
 import time
 from collections import defaultdict
@@ -8,6 +9,8 @@ from collections.abc import Callable
 from dataclasses import dataclass
 from typing import Any, Generic, TypeVar
 
+from sortedcontainers import SortedList
+
 import sentry_sdk
 from arroyo.backends.kafka.consumer import KafkaPayload
 from arroyo.processing.strategies import ProcessingStrategy
@@ -33,7 +36,9 @@ class OffsetTracker:
     - Thread-safe for concurrent access with per-partition locks
     """
 
+    # Uses SortedList so min() is O(1) via index 0, and iteration is already
+    # sorted — eliminating O(k) min()/max() scans and O(gap) range walks.
     def __init__(self) -> None:
-        self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
+        self.all_offsets: dict[Partition, SortedList[int]] = defaultdict(SortedList)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
@@ -52,7 +57,7 @@ class OffsetTracker:
     def add_offset(self, partition: Partition, offset: int) -> None:
         """Record that we've started processing an offset."""
         with self._get_partition_lock(partition):
-            self.all_offsets[partition].add(offset)
+            self.all_offsets[partition].add(offset)  # O(log k) insert into SortedList
             self.outstanding[partition].add(offset)
 
     def complete_offset(self, partition: Partition, offset: int) -> None:
@@ -67,32 +72,46 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
+
+        Complexity: O(c * log k) per partition where c is the number of newly
+        committable offsets this cycle and k is the total tracked offsets.
+        Previously O(k) for min/max + O(gap) for the range scan.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
                 all_offsets = self.all_offsets[partition]
                 if not all_offsets:
                     continue
 
                 outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
-
+                # SortedList[0] is O(1) — no full iteration needed.
+                # We start from the smallest tracked offset that is beyond
+                # last_committed, so we never revisit already-committed offsets.
                 highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
-                    else:
+
+                # bisect_right returns the insertion point after last_committed,
+                # giving us the index of the first candidate offset — O(log k).
+                start_idx = all_offsets.bisect_right(last_committed)
+
+                # Walk the sorted offsets sequentially. We stop as soon as we
+                # find a gap (next offset != previous + 1) or an outstanding
+                # (not-yet-completed) offset. This is O(c) where c is the
+                # number of contiguous completed offsets, NOT O(max - min).
+                expected = last_committed + 1
+                for offset in all_offsets.islice(start_idx):
+                    if offset != expected:
+                        # Gap in offsets — cannot commit past here.
+                        break
+                    if offset in outstanding:
+                        # This offset is still being processed — stop.
                         break
+                    highest_committable = offset
+                    expected += 1
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable
 
@@ -101,9 +120,11 @@ class OffsetTracker:
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
-            # Remove all offsets <= committed offset
-            self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+            # Remove all offsets <= committed offset from the SortedList.
+            # irange gives us only the values to remove — O(c * log k).
+            to_remove = list(self.all_offsets[partition].irange(None, offset))
+            for o in to_remove:
+                self.all_offsets[partition].remove(o)

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (27 lines, 1 file) — review recommended)

The contiguous offset scan for offset in range(min_offset, max_offset + 1) is O(gap_size) and runs inside a per-partition lock on every get_committable_offsets call.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -67,22 +67,23 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
+        Uses a sorted walk over tracked offsets (O(k log k)) rather than a range scan
+        over the full integer gap (O(gap_size)), avoiding lock contention on large gaps.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
                 all_offsets = self.all_offsets[partition]
                 if not all_offsets:
                     continue
 
                 outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
-
+                # Sort once; walk until we hit a gap or an outstanding offset.
+                # This is O(k log k) where k = len(all_offsets), which is bounded
+                # by the in-flight window — not the numeric gap between offsets.
+                sorted_offsets = sorted(all_offsets)
                 highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
+
+                prev = None
+                for offset in sorted_offsets:
+                    # Skip offsets already committed
+                    if offset <= last_committed:
+                        continue
+                    # Stop if there is a numeric gap (offset was never added but
+                    # lies between last_committed and this offset), or if the
+                    # offset is still outstanding.
+                    expected = (prev + 1) if prev is not None else (last_committed + 1)
+                    if offset != expected or offset in outstanding:
                     else:
                         break
+                        highest_committable = offset
+                    prev = offset
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable

🤖 Grapple PR auto-fix • minor • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Offset Commit Correctness (confidence: 100%)

The get_committable_offsets method iterates from min_offset to max_offset looking for the highest contiguous completed offset. However, after calling mark_committed, the committed offsets are removed from all_offsets via cleanup. This means after the first batch is committed, the next call to get_committable_offsets starts from the new min(all_offsets). If offset 100 was committed and removed, and offset 101 is still outstanding but 102 is complete, the method would start scanning from 101, find it outstanding, and correctly return nothing. This logic appears correct on closer inspection, BUT the mark_committed method's cleanup of all_offsets removes offsets <= committed value. If offsets are not perfectly contiguous (e.g., filtered messages skip offsets), the scan from min_offset to max_offset will encounter gaps and treat missing offsets as completed (since they're not in outstanding), potentially committing past unprocessed messages.

Evidence:

  • The scan loop for offset in range(min_offset, max_offset + 1) checks if offset in outstanding but does NOT check if offset in all_offsets
  • If a Kafka offset exists that was never added to all_offsets (e.g., filtered messages, rebalancing), the scan would skip over it as if it were completed
  • FilteredPayload messages may not go through add_offset but their offsets still need to be tracked for correct commit behavior

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / get_committable_offsets Iterates Stale all_offsets Set (confidence: 100%)

The get_committable_offsets method computes committable offsets by iterating over all_offsets[partition] from min_offset to max_offset. However, all_offsets is never pruned — it accumulates every offset ever seen. Over time, min(all_offsets) will always return a very old, already-committed offset. The logic checks if offset <= last_committed: continue but then the min_offset scan starts from an ancient baseline, causing O(N) iteration over all historical offsets on every call. This is a correctness issue: after mark_committed is called, all_offsets still retains those committed entries, so the min_offset computed from all_offsets is a long-past offset, and the loop scan from min_offset will always include stale offsets that are neither in outstanding nor above last_committed, causing the logic to never find gaps correctly as the consumer runs longer.

Evidence:

  • add_offset adds to all_offsets but mark_committed does not remove from all_offsets
  • get_committable_offsets computes min_offset = min(all_offsets[partition]) — after thousands of messages, this min is an ancient committed offset
  • The loop for offset in range(min_offset, max_offset + 1) will iterate over all historical offsets, and already-committed ones satisfy neither outstanding nor the gap condition, leading to incorrect early termination of the contiguous range

Agent: security

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / Offset Tracking Cleanup on Partition Revocation (confidence: 100%)

OffsetTracker has no method to remove/reset state for a partition when Kafka partitions are revoked during a rebalance. If a partition is revoked and later reassigned, stale offset data in all_offsets, outstanding, and last_committed will corrupt the commit logic. The SimpleQueueProcessingStrategy has no on_partitions_revoked handler, and arroyo's strategy interface doesn't directly support it, but the factory's create_with_partitions is called on rebalance. Since the factory creates a new strategy on rebalance but the FixedQueuePool (and its OffsetTracker) lives on the factory and persists across rebalances, stale partition state will accumulate.

Evidence:

  • result_consumer.py line ~130: queue_pool is created once in init and persists
  • result_consumer.py line ~200+: create_with_partitions creates new strategies but reuses self.queue_pool
  • queue_consumer.py OffsetTracker has no reset/remove_partition method
  • On rebalance, old partition data remains in OffsetTracker causing incorrect offset calculations

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (29 lines, 1 file) — review recommended)

The get_committable_offsets method iterates from min_offset to max_offset looking for the highest contiguous completed offset. However, after calling mark_committed, the committed offsets are removed from all_offsets via cleanup. This means after the first batch is committed, the next call to get_committable_offsets starts from the new min(all_offsets). If offset 100 was committed and removed, and offset 101 is still outstanding but 102 is complete, the method would start scanning from 101, find it outstanding, and correctly return nothing. This logic appears correct on closer inspection, BUT the mark_committed method's cleanup of all_offsets removes offsets <= committed value. If offsets are not perfectly contiguous (e.g., filtered messages skip offsets), the scan from min_offset to max_offset will encounter gaps and treat missing offsets as completed (since they're not in outstanding), potentially committing past unprocessed messages.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -67,6 +67,15 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
+
+        Note: Every Kafka offset that this consumer receives must be registered via
+        add_offset() (and subsequently complete_offset()) — including filtered/skipped
+        messages — so that the contiguity scan below has a complete picture. If an
+        offset is never added to all_offsets, a gap between last_committed and
+        min(all_offsets) will block commits for that partition, which is the safe
+        behavior. See submit() in SimpleQueueProcessingStrategy for how FilteredPayload
+        and decoder-None results are handled.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
@@ -83,8 +92,21 @@ class OffsetTracker:
                 min_offset = min(all_offsets)
                 max_offset = max(all_offsets)
 
+                # If there is a gap between the last committed offset and the lowest
+                # tracked offset, it means one or more Kafka offsets were never
+                # registered with add_offset(). We cannot safely commit past
+                # last_committed because we have no record of those intermediate
+                # offsets being processed.
+                #
+                # Example: last_committed=100, all_offsets={102} means offset 101
+                # was never tracked. We must not advance the commit position until
+                # offset 101 is accounted for (or we know it cannot appear).
+                if min_offset > last_committed + 1:
+                    # There is an untracked gap — cannot commit anything new.
+                    continue
+
                 start = max(last_committed + 1, min_offset)
 
                 highest_committable = last_committed
                 for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
+                    if offset in all_offsets:
+                        if offset in outstanding:
+                            # This offset is still being processed — stop here.
+                            break
                         highest_committable = offset
                     else:
+                        # This offset is not in all_offsets — it is a gap between
+                        # two tracked offsets (e.g., offset was never registered).
+                        # This is unsafe to skip; stop scanning.
                         break
 
                 if highest_committable > last_committed:

🤖 Grapple PR auto-fix • critical • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — security agent (Larger fix (44 lines, 1 file) — review recommended)

The get_committable_offsets method computes committable offsets by iterating over all_offsets[partition] from min_offset to max_offset. However, all_offsets is never pruned — it accumulates every offset ever seen. Over time, min(all_offsets) will always return a very old, already-committed offset. The logic checks if offset <= last_committed: continue but then the min_offset scan starts from an ancient baseline, causing O(N) iteration over all historical offsets on every call. This is a correctness issue: after mark_committed is called, all_offsets still retains those committed entries, so the min_offset computed from all_offsets is a long-past offset, and the loop scan from min_offset will always include stale offsets that are neither in outstanding nor above last_committed, causing the logic to never find gaps correctly as the consumer runs longer.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -62,7 +62,7 @@ class OffsetTracker:
     def complete_offset(self, partition: Partition, offset: int) -> None:
         """Mark an offset as completed."""
         with self._get_partition_lock(partition):
             self.outstanding[partition].discard(offset)
+            self.all_offsets[partition].discard(offset)
 
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
-        For each partition, finds the highest contiguous offset that has been processed.
+        For each partition, finds the highest contiguous completed offset that has been processed.
+        An offset is "completed" when it has been removed from both outstanding and all_offsets
+        (via complete_offset). We track the committed frontier via last_committed and only
+        advance it over a contiguous run of completed offsets above that frontier.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
-                all_offsets = self.all_offsets[partition]
-                if not all_offsets:
+                outstanding = self.outstanding[partition]
+                if not outstanding:
+                    # No in-flight offsets; nothing new to commit
                     continue
 
-                outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
+                # The next offset to commit is just past what we last committed.
+                # The frontier of outstanding tells us how far we can scan.
+                start = last_committed + 1
+                # We can commit up to (but not through) the smallest in-flight offset.
+                # Everything below min(outstanding) has been completed.
+                committable_up_to = min(outstanding) - 1
 
                 highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
-                    else:
-                        break
+                # Every offset in [start, committable_up_to] has been completed
+                # (no longer outstanding, no longer in all_offsets) so we can
+                # advance the committed frontier to committable_up_to.
+                if committable_up_to >= start:
+                    highest_committable = committable_up_to
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable
 
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
-            # Remove all offsets <= committed offset
-            self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+            # Prune any stale entries <= committed offset from both tracking sets.
+            # Under normal operation these should already be empty (complete_offset
+            # removes from all_offsets; outstanding entries are in-flight and should
+            # not be below the committed frontier), but we prune defensively.
+            self.all_offsets[partition] = {
+                o for o in self.all_offsets[partition] if o > offset
+            }
+            self.outstanding[partition] = {
+                o for o in self.outstanding[partition] if o > offset
+            }

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (44 lines, 1 file) — review recommended)

OffsetTracker has no method to remove/reset state for a partition when Kafka partitions are revoked during a rebalance. If a partition is revoked and later reassigned, stale offset data in all_offsets, outstanding, and last_committed will corrupt the commit logic. The SimpleQueueProcessingStrategy has no on_partitions_revoked handler, and arroyo's strategy interface doesn't directly support it, but the factory's create_with_partitions is called on rebalance. Since the factory creates a new strategy on rebalance but the FixedQueuePool (and its OffsetTracker) lives on the factory and persists across rebalances, stale partition state will accumulate.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -100,6 +100,30 @@ class OffsetTracker:
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
             # Remove all offsets <= committed offset
             self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+
+    def remove_partition(self, partition: Partition) -> None:
+        """
+        Remove all tracking state for a partition.
+
+        Must be called when a partition is revoked during a Kafka rebalance.
+        Since OffsetTracker persists on FixedQueuePool (which lives on the factory
+        across rebalances), failing to clean up here will cause stale offsets from
+        a previous assignment to corrupt get_committable_offsets() and mark_committed()
+        on reassignment.
+
+        Acquires the partition lock before removal to guard against concurrent
+        worker threads that may still be calling complete_offset() for in-flight
+        work items on this partition.
+        """
+        lock = self.partition_locks.get(partition)
+        if lock is None:
+            # Partition was never tracked; nothing to clean up.
+            return
+        with lock:
+            self.all_offsets.pop(partition, None)
+            self.outstanding.pop(partition, None)
+            self.last_committed.pop(partition, None)
+        # Remove the lock entry last, outside the lock itself, so any thread
+        # that acquired it just before us will finish its critical section
+        # before the entry disappears.
+        self.partition_locks.pop(partition, None)
 
 
 class OrderedQueueWorker(threading.Thread, Generic[T]):
@@ -196,6 +220,19 @@ class FixedQueuePool(Generic[T]):
     def wait_until_empty(self, timeout: float = 5.0) -> bool:
         """Wait until all queues are empty. Returns True if successful, False if timeout."""
         start_time = time.time()
         while time.time() - start_time < timeout:
             if self.get_stats()["total_items"] == 0:
                 return True
             time.sleep(0.01)
         return False
 
+    def on_partitions_revoked(self, revoked: list[Partition]) -> None:
+        """
+        Clean up OffsetTracker state for revoked partitions.
+
+        Called by the consumer factory's create_with_partitions on rebalance.
+        Because FixedQueuePool (and its OffsetTracker) is shared across strategy
+        instances, we must explicitly clear per-partition state here; otherwise
+        stale all_offsets / outstanding / last_committed entries from the previous
+        assignment will corrupt commit logic when the partition is reassigned.
+        """
+        for partition in revoked:
+            logger.info(
+                "Clearing OffsetTracker state for revoked partition",
+                extra={"partition": partition},
+            )
+            self.offset_tracker.remove_partition(partition)
+
     def shutdown(self) -> None:
         """Gracefully shutdown all workers."""
         for worker in self.workers:

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — Offset Commit Correctness (confidence: 100%)

The get_committable_offsets method iterates over all_offsets to find the highest contiguous completed offset, but the logic has a subtle correctness issue: it iterates from min_offset to max_offset and finds the contiguous completed range. However, after mark_committed cleans up offsets from all_offsets, there is no synchronization between the cleanup in mark_committed and a concurrent call to get_committable_offsets. If a worker thread completes an offset while get_committable_offsets is iterating, the all_offsets and outstanding sets could be in an inconsistent state within the lock scope, since get_committable_offsets acquires and releases the per-partition lock, but mark_committed also modifies all_offsets under the same lock. The real issue is that list(self.all_offsets.keys()) is taken without any lock, so a partition could be added or removed concurrently.

Evidence:

  • The algorithm walks from min_offset to max_offset checking if offset in outstanding: break and if offset not in all_offsets: break
  • Kafka offsets can have gaps due to compaction, transactions, or filtered messages
  • The break on offset not in all_offsets means any gap in tracked offsets stops contiguity scanning
  • Line 82: for partition in list(self.all_offsets.keys()) iterates without holding any lock
  • A worker thread could call complete_offset on a new partition, modifying self.all_offsets and self.partition_locks concurrently
  • mark_committed removes offsets from all_offsets, which could interact with concurrent get_committable_offsets calls

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (84 lines, 1 file) — review recommended)

The get_committable_offsets method iterates over all_offsets to find the highest contiguous completed offset, but the logic has a subtle correctness issue: it iterates from min_offset to max_offset and finds the contiguous completed range. However, after mark_committed cleans up offsets from all_offsets, there is no synchronization between the cleanup in mark_committed and a concurrent call to get_committable_offsets. If a worker thread completes an offset while get_committable_offsets is iterating, the all_offsets and outstanding sets could be in an inconsistent state within the lock scope, since get_committable_offsets acquires and releases the per-partition lock, but mark_committed also modifies all_offsets under the same lock. The real issue is that list(self.all_offsets.keys()) is taken without any lock, so a partition could be added or removed concurrently.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -38,6 +38,7 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        self._global_lock = threading.RLock()
 
@@ -45,7 +46,10 @@ class OffsetTracker:
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
         lock = self.partition_locks.get(partition)
         if lock:
             return lock
         return self.partition_locks.setdefault(partition, threading.Lock())
+        # NOTE: callers must hold self._global_lock before calling this method
+        # to avoid a race between the .get() check and .setdefault() on
+        # self.partition_locks from concurrent threads introducing a new partition.
 
@@ -53,10 +57,14 @@ class OffsetTracker:
     def add_offset(self, partition: Partition, offset: int) -> None:
         """Record that we've started processing an offset."""
-        with self._get_partition_lock(partition):
+        with self._global_lock:
+         lock = self._get_partition_lock(partition)
+        with lock:
             self.all_offsets[partition].add(offset)
             self.outstanding[partition].add(offset)
 
@@ -63,8 +71,11 @@ class OffsetTracker:
     def complete_offset(self, partition: Partition, offset: int) -> None:
         """Mark an offset as completed."""
-        with self._get_partition_lock(partition):
+        with self._global_lock:
+            lock = self._get_partition_lock(partition)
+        with lock:
             self.outstanding[partition].discard(offset)
 
@@ -73,26 +84,44 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
+        Contiguity is evaluated over the set of *tracked* offsets only — gaps from
+        Kafka compaction, transactions, or filtered messages (offsets never added via
+        add_offset) are intentionally ignored.
         """
         committable = {}
-        for partition in list(self.all_offsets.keys()):
-            with self._get_partition_lock(partition):
-                all_offsets = self.all_offsets[partition]
-                if not all_offsets:
-                    continue
-
-                outstanding = self.outstanding[partition]
-                last_committed = self.last_committed.get(partition, -1)
-
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
-
-                highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
-                    else:
-                        break
-
-                if highest_committable > last_committed:
-                    committable[partition] = highest_committable
+        # Take a snapshot of partitions under the global lock so that concurrent
+        # add_offset / complete_offset calls on new partitions cannot mutate
+        # self.all_offsets or self.partition_locks while we iterate.
+        with self._global_lock:
+            partitions = list(self.all_offsets.keys())
 
+        for partition in partitions:
+            with self._get_partition_lock(partition):
+                all_offsets = self.all_offsets[partition]
+                if not all_offsets:
+                    continue
+
+                outstanding = self.outstanding[partition]
+                last_committed = self.last_committed.get(partition, -1)
+
+                # Iterate over sorted tracked offsets rather than a dense integer
+                # range.  A dense range would break on any gap (compacted /
+                # transactional / filtered offset) that was never registered with
+                # add_offset, producing an artificially low committable offset.
+                sorted_offsets = sorted(all_offsets)
+
+                highest_committable = last_committed
+                prev = None
+                for offset in sorted_offsets:
+                    if offset <= last_committed:
+                        # Already committed; skip but update prev so the next
+                        # offset's contiguity check is against the right base.
+                        prev = offset
+                        continue
+                    # Break contiguity only when there is a gap between two
+                    # *consecutive tracked* offsets (i.e. both were registered
+                    # via add_offset). Untracked gaps are invisible here by design.
+                    if prev is not None and offset != prev + 1:
+                        break
+                    if offset in outstanding:
+                        # This offset is still being processed — cannot advance.
+                        break
+                    highest_committable = offset
+                    prev = offset
+
+                if highest_committable > last_committed:
+                    committable[partition] = highest_committable
+
         return committable
 
@@ -100,7 +129,9 @@ class OffsetTracker:
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
-        with self._get_partition_lock(partition):
+        with self._global_lock:
+            lock = self._get_partition_lock(partition)
+        with lock:
             self.last_committed[partition] = offset
             # Remove all offsets <= committed offset
             self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}

🤖 Grapple PR auto-fix • critical • Review this diff before applying

"""Update the last committed offset for a partition."""
with self._get_partition_lock(partition):
self.last_committed[partition] = offset
# Remove all offsets <= committed offset
self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}


Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Thread Safety - Missing Partition Lock in mark_committed (confidence: 100%)

The mark_committed method modifies self.all_offsets[partition] and self.last_committed[partition] without acquiring the partition lock, while add_offset, complete_offset, and get_committable_offsets all acquire the lock. This creates a data race when a worker thread calls complete_offset (which holds the lock and modifies outstanding) concurrently with the main thread calling mark_committed (which modifies all_offsets without the lock). Set mutation during concurrent iteration/mutation is not thread-safe in CPython beyond the GIL's basic guarantees.

Evidence:

  • Lines 99-107: mark_committed modifies self.all_offsets[partition] and self.last_committed[partition] without locking
  • Lines 62-65: add_offset acquires lock before modifying the same structures
  • Lines 67-70: complete_offset acquires lock before modifying outstanding
  • Lines 75-97: get_committable_offsets acquires lock

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (1 lines, 1 file))

The mark_committed method modifies self.all_offsets[partition] and self.last_committed[partition] without acquiring the partition lock, while add_offset, complete_offset, and get_committable_offsets all acquire the lock. This creates a data race when a worker thread calls complete_offset (which holds the lock and modifies outstanding) concurrently with the main thread calling mark_committed (which modifies all_offsets without the lock). Set mutation during concurrent iteration/mutation is not thread-safe in CPython beyond the GIL's basic guarantees.

Suggested change
# NOTE: Lock is held here — safe for concurrent access with add_offset/complete_offset/get_committable_offsets

🤖 Grapple PR auto-fix • major • confidence: 100%

class OrderedQueueWorker(threading.Thread, Generic[T]):
"""Worker thread that processes items from a queue in order."""

def __init__(
self,
worker_id: int,
work_queue: queue.Queue[WorkItem[T]],
result_processor: Callable[[str, T], None],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Algorithmic complexity (confidence: 100%)

get_committable_offsets() acquires each partition lock individually in sequence, which means the set of partitions seen between the outer list() call and individual lock acquisitions can be inconsistent. More critically, the contiguous-offset scan is O(n) per partition per call to get_committable_offsets(), and this method is called from the main thread's submit() or flush() path while worker threads are concurrently modifying outstanding. Holding the lock for a linear scan blocks workers from completing offsets.

Evidence:

  • The inner loop iterates from min_offset to max_offset+1 checking each offset — this is O(range_size) not O(outstanding_count)
  • For a partition with sparse offsets or a large gap, this scan could be very slow while holding the partition lock
  • Worker threads trying to call complete_offset() on the same partition are blocked for the duration of the scan

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Unnecessary work (confidence: 97%)

get_committable_offsets() acquires locks and performs set operations on every call from the main thread. If the main consumer's flush() calls this frequently (e.g., on every message), the lock contention between the main thread and N worker threads will become a throughput bottleneck at high parallelism.

Evidence:

  • The main thread calls get_committable_offsets() to determine what to pass to the commit callback
  • Each call acquires O(num_partitions) locks sequentially
  • Worker threads are blocked from completing offsets while the main thread holds partition locks for scanning
  • With 20 queues and high message throughput, this creates a serialization point

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (38 lines, 1 file) — review recommended)

get_committable_offsets() acquires each partition lock individually in sequence, which means the set of partitions seen between the outer list() call and individual lock acquisitions can be inconsistent. More critically, the contiguous-offset scan is O(n) per partition per call to get_committable_offsets(), and this method is called from the main thread's submit() or flush() path while worker threads are concurrently modifying outstanding. Holding the lock for a linear scan blocks workers from completing offsets.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -68,25 +68,28 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
         For each partition, finds the highest contiguous offset that has been processed.
+        Uses a sorted scan over only the tracked offsets (O(k log k) where k is the
+        number of tracked offsets) rather than iterating over the full offset range
+        (O(range_size)), so sparse partitions or large gaps don't stall worker threads.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
-                all_offsets = self.all_offsets[partition]
-                if not all_offsets:
+                # Take a consistent snapshot under the lock so the scan
+                # sees a stable view even as workers call complete_offset().
+                all_offsets_snapshot = set(self.all_offsets[partition])
+                if not all_offsets_snapshot:
                     continue
 
-                outstanding = self.outstanding[partition]
+                outstanding_snapshot = set(self.outstanding[partition])
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
-
+                # Sort only the offsets we actually track — O(k log k) where k is
+                # the number of outstanding/tracked offsets, not the range size.
+                # This prevents long lock holds on partitions with sparse offsets.
                 highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
-                    else:
+                prev = None
+                for offset in sorted(all_offsets_snapshot):
+                    if offset <= last_committed:
+                        # Already committed; skip but don't break the contiguity check.
+                        prev = offset
+                        continue
+                    # Check for a gap: if this offset is not contiguous with the
+                    # previous committable offset (or last_committed), stop.
+                    expected = (prev if prev is not None else last_committed) + 1
+                    if offset != expected:
+                        break
+                    if offset in outstanding_snapshot:
+                        # This offset is not yet done; contiguous run ends here.
                         break
+                    highest_committable = offset
+                    prev = offset
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Algorithmic complexity (confidence: 100%)

get_committable_offsets() iterates over all tracked offsets for each partition to find the highest contiguous committed offset. Because all_offsets and outstanding are plain sets (not sorted structures), min() and max() are O(n) each call, and the inner contiguous-scan loop is also O(n). This method is called on every commit cycle, and since offsets are never evicted from all_offsets until mark_committed is called, the sets grow unboundedly over the consumer's lifetime.

Evidence:

  • all_offsets[partition] is a set that accumulates every offset ever seen; there is no eviction except via mark_committed
  • The inner loop in get_committable_offsets scans from min_offset to max_offset one-by-one (range iteration) to find the contiguous frontier — O(n) per partition per commit cycle
  • min() and max() on a Python set are O(n) operations
  • For a long-running consumer, all_offsets will contain every offset since last committed, making each commit cycle progressively more expensive

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (49 lines, 1 file) — review recommended)

get_committable_offsets() iterates over all tracked offsets for each partition to find the highest contiguous committed offset. Because all_offsets and outstanding are plain sets (not sorted structures), min() and max() are O(n) each call, and the inner contiguous-scan loop is also O(n). This method is called on every commit cycle, and since offsets are never evicted from all_offsets until mark_committed is called, the sets grow unboundedly over the consumer's lifetime.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -1,7 +1,8 @@
 from __future__ import annotations
 
 import logging
 import queue
 import threading
 import time
 from collections import defaultdict
+from sortedcontainers import SortedList
 from collections.abc import Callable
 from dataclasses import dataclass
 from typing import Any, Generic, TypeVar
@@ -38,8 +39,13 @@ class OffsetTracker:
     def __init__(self) -> None:
-        self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
+        # SortedList gives O(log n) add/remove and O(1) min/max via index,
+        # replacing the plain set whose min()/max() were O(n) each call.
+        self.all_offsets: dict[Partition, SortedList] = defaultdict(SortedList)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        # next_expected[partition] is the lowest offset we still need to scan
+        # from when computing the contiguous frontier. Avoids re-scanning from
+        # min(all_offsets) on every commit cycle.
+        self.next_expected: dict[Partition, int] = {}
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
@@ -52,6 +58,10 @@ class OffsetTracker:
     def add_offset(self, partition: Partition, offset: int) -> None:
         """Record that we've started processing an offset."""
         with self._get_partition_lock(partition):
             self.all_offsets[partition].add(offset)
             self.outstanding[partition].add(offset)
+            # Initialise the frontier pointer the first time we see an offset
+            # for this partition.
+            if partition not in self.next_expected:
+                self.next_expected[partition] = offset
 
     def complete_offset(self, partition: Partition, offset: int) -> None:
         """Mark an offset as completed."""
@@ -67,28 +77,37 @@ class OffsetTracker:
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
                 all_offsets = self.all_offsets[partition]
                 if not all_offsets:
                     continue
 
                 outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
+                # O(1): SortedList index access instead of O(n) min()/max().
+                max_offset = all_offsets[-1]
 
+                # Resume from the known frontier rather than re-scanning from
+                # the beginning of all_offsets every cycle.
+                start = max(
+                    last_committed + 1,
+                    self.next_expected.get(partition, all_offsets[0]),
+                )
+
                 highest_committable = last_committed
                 for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
+                    # Membership check is O(log n) for SortedList and O(1) for set.
+                    if offset in all_offsets and offset not in outstanding:
                         highest_committable = offset
                     else:
                         break
 
+                # Advance the frontier pointer so the next commit cycle starts
+                # here instead of at last_committed + 1.
+                if highest_committable > last_committed:
+                    self.next_expected[partition] = highest_committable + 1
+
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable
 
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
-            # Remove all offsets <= committed offset
-            self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+            # irange gives us an O(log n) view of values <= offset; removing
+            # them one-by-one is O(k log n) where k is the number removed —
+            # far better than rebuilding the whole set via comprehension O(n).
+            sl = self.all_offsets[partition]
+            to_remove = list(sl.irange(None, offset))
+            for o in to_remove:
+                sl.remove(o)
+            # Ensure the frontier pointer is always at least last_committed + 1
+            # so we never re-scan already-committed history.
+            self.next_expected[partition] = max(
+                self.next_expected.get(partition, offset + 1),
+                offset + 1,
+            )

🤖 Grapple PR auto-fix • major • Review this diff before applying

identifier: str,
offset_tracker: OffsetTracker,
) -> None:
super().__init__(daemon=True)
self.worker_id = worker_id

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error - Incorrect Committable Offset Calculation (confidence: 100%)

The get_committable_offsets algorithm is logically flawed. It uses min(all_offsets) as the starting point, but after some offsets are committed and last_committed is updated, the minimum of all_offsets still includes old (already-committed) offsets unless they're pruned. This means the contiguous scan from min_offset will repeatedly find the same already-committed offsets as committable, potentially re-committing stale offsets or, if combined with the unbounded growth bug, producing wrong commit points.

Evidence:

  • After calling mark_committed(partition, 100), all_offsets still contains {100, 101, 102}. The next call to get_committable_offsets starts scanning from min(all_offsets) = 100 again.
  • The algorithm checks if o not in outstanding starting from min_offset, so it will again find offset 100 as committable and return it, even though it was already committed.
  • The last_committed guard prevents returning an offset already committed, but only if the code properly uses it to skip already-processed offsets in the scan — this requires careful review of the truncated implementation.
  • This creates a subtle correctness issue where the committed offset calculation could be off-by-one or redundantly return already-committed values.

Agent: security

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — security agent (Larger fix (39 lines, 1 file) — review recommended)

The get_committable_offsets algorithm is logically flawed. It uses min(all_offsets) as the starting point, but after some offsets are committed and last_committed is updated, the minimum of all_offsets still includes old (already-committed) offsets unless they're pruned. This means the contiguous scan from min_offset will repeatedly find the same already-committed offsets as committable, potentially re-committing stale offsets or, if combined with the unbounded growth bug, producing wrong commit points.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -67,28 +67,38 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
         Get the highest offset per partition that can be safely committed.
 
-        For each partition, finds the highest contiguous offset that has been processed.
+        For each partition, finds the highest contiguous offset that has been
+        processed, starting from the lowest tracked offset.
+
+        Correctness invariant: mark_committed() prunes all_offsets to only
+        contain offsets > last_committed, so min(all_offsets) is always the
+        first un-committed offset we know about. We can only advance if that
+        offset is contiguous with last_committed (i.e., min == last_committed + 1)
+        AND it is no longer outstanding. Any gap between last_committed and
+        min(all_offsets) means we cannot safely commit further.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):
             with self._get_partition_lock(partition):
                 all_offsets = self.all_offsets[partition]
                 if not all_offsets:
                     continue
 
                 outstanding = self.outstanding[partition]
                 last_committed = self.last_committed.get(partition, -1)
 
-                min_offset = min(all_offsets)
-                max_offset = max(all_offsets)
-
-                start = max(last_committed + 1, min_offset)
-
+                # Sort so we scan in ascending order and detect gaps immediately.
+                sorted_offsets = sorted(all_offsets)
+                min_offset = sorted_offsets[0]
+
+                # If there is a gap between the last committed offset and the
+                # lowest tracked offset, we cannot advance — some offsets in
+                # that gap were never added to all_offsets (or were from a
+                # prior assignment). Stop here.
+                if min_offset != last_committed + 1:
+                    continue
+
                 highest_committable = last_committed
-                for offset in range(start, max_offset + 1):
-                    if offset in all_offsets and offset not in outstanding:
-                        highest_committable = offset
-                    else:
+                prev = last_committed
+                for offset in sorted_offsets:
+                    if offset != prev + 1:
+                        # Gap in tracked offsets — cannot commit past prev.
                         break
+                    if offset in outstanding:
+                        # This offset is still being processed — stop here.
+                        break
+                    highest_committable = offset
+                    prev = offset
 
                 if highest_committable > last_committed:
                     committable[partition] = highest_committable
 
         return committable

🤖 Grapple PR auto-fix • major • Review this diff before applying

self.work_queue = work_queue
self.result_processor = result_processor
self.identifier = identifier
self.offset_tracker = offset_tracker
self.shutdown = False

def run(self) -> None:
"""Process items from the queue in order."""
while not self.shutdown:
try:
work_item = self.work_queue.get()
except queue.ShutDown:
break

try:
with sentry_sdk.start_transaction(
op="queue_worker.process",
name=f"monitors.{self.identifier}.worker_{self.worker_id}",
):
self.result_processor(self.identifier, work_item.result)

except queue.ShutDown:
break
except Exception:
logger.exception(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 INFO — Documentation (confidence: 75%)

The get_committable_offsets method logic lacks inline documentation explaining the contiguous offset algorithm. The method is critical for preventing message loss but the logic (lines 131-144 are truncated) is not clearly explained.

Evidence:

  • The docstring only says 'For each partition, finds the highest contiguous offset that has been processed' but doesn't explain WHY contiguity matters (to prevent gaps in offset commits)
  • The algorithm behavior when there are gaps (e.g., offsets 100, 101, 103 where 102 is outstanding) is not documented
  • This is a core safety mechanism for Kafka offset management and needs clearer intent documentation

Agent: style

"Unexpected error in queue worker", extra={"worker_id": self.worker_id}
)
finally:
self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
metrics.gauge(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Input Validation (confidence: 100%)

The FixedQueuePool constructor accepts num_queues but does not validate that it is a positive integer. The intent specification explicitly calls out: 'Behavior when queue count is 0 or negative — should fail fast with a clear error.' With num_queues=0, self.queues would be empty, self.workers would be empty, and enqueue would fail with a ZeroDivisionError on hash(item.result) % len(self.queues) (or similar). With negative values, the range() would produce no items.

Evidence:

  • Line 145: num_queues parameter is used without validation
  • Intent spec: 'Behavior when queue count is 0 or negative — should fail fast with a clear error'
  • The default fallback in result_consumer.py is max_workers or 20, but if someone passes --max-workers 0 it would be falsy and use 20, while --max-workers -1 would pass through

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — logic agent (Small fix (5 lines, 1 file))

The FixedQueuePool constructor accepts num_queues but does not validate that it is a positive integer. The intent specification explicitly calls out: 'Behavior when queue count is 0 or negative — should fail fast with a clear error.' With num_queues=0, self.queues would be empty, self.workers would be empty, and enqueue would fail with a ZeroDivisionError on hash(item.result) % len(self.queues) (or similar). With negative values, the range() would produce no items.

Suggested change
metrics.gauge(
if not isinstance(num_queues, int) or num_queues <= 0:
raise ValueError(
f"num_queues must be a positive integer, got {num_queues!r}. "
"Pass a value >= 1 to ensure at least one worker queue is created."
)

🤖 Grapple PR auto-fix • minor • confidence: 100%

"remote_subscriptions.queue_worker.queue_depth",
self.work_queue.qsize(),
tags={
"identifier": self.identifier,
},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Missing Partition Revocation Handling (confidence: 100%)

The SimpleQueueProcessingStrategy has no mechanism to handle Kafka partition revocation. When partitions are revoked during a rebalance, the OffsetTracker still holds outstanding offsets for the revoked partitions, and the FixedQueuePool may still be processing work items for those partitions. The create_with_partitions method in result_consumer.py creates a new strategy each time, but the FixedQueuePool is shared across calls (it's on the factory). This means the pool's worker threads may still be processing items from revoked partitions, and completed offsets for revoked partitions could be committed after re-assignment to another consumer, causing duplicate processing or offset corruption.

Evidence:

  • The intent spec explicitly calls out: 'Kafka partition rebalance occurring while messages are in-flight in the queue — ensure uncommitted offsets are not lost and work is not duplicated'
  • result_consumer.py line 195-212: create_with_partitions creates a new SimpleQueueProcessingStrategy but reuses self.queue_pool
  • queue_consumer.py: OffsetTracker is created per-strategy but worker threads in the shared pool may still hold references to old work items
  • There is no on_revoke or partition cleanup mechanism

Agent: logic

)


class FixedQueuePool(Generic[T]):
"""
Fixed pool of queues that guarantees order within groups.

Key properties:
- Each group is consistently assigned to the same queue
- Each queue has exactly one worker thread
- Items within a queue are processed in FIFO order
- No dynamic reassignment that could break ordering
- Tracks offset completion for safe commits
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Worker Exception Handling / Silent Message Loss (confidence: 100%)

In FixedQueuePool._worker(), when self.process_result() raises an exception, the code catches it (line ~163), logs the error and captures to Sentry, but then continues to mark the offset as complete via offset_tracker.complete_offset(). This means failed messages are silently dropped — they won't be retried since the offset is committed. While this may be intentional for poison messages, it effectively means any transient error (database timeout, network blip) causes permanent message loss rather than retry.

Evidence:

  • The worker catches all exceptions and still calls complete_offset, meaning the offset will be committed
  • The intent specification asks: 'What happens when a worker processing a queue raises an unhandled exception — does the consumer halt, retry, or skip the message?'
  • Current behavior: skip the message, which could cause data loss for transient failures

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (28 lines, 1 file) — review recommended)

In FixedQueuePool._worker(), when self.process_result() raises an exception, the code catches it (line ~163), logs the error and captures to Sentry, but then continues to mark the offset as complete via offset_tracker.complete_offset(). This means failed messages are silently dropped — they won't be retried since the offset is committed. While this may be intentional for poison messages, it effectively means any transient error (database timeout, network blip) causes permanent message loss rather than retry.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -128,18 +128,35 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
     def run(self) -> None:
         """Process items from the queue in order."""
         while not self.shutdown:
             try:
                 work_item = self.work_queue.get()
             except queue.ShutDown:
                 break
 
+            processing_failed = False
             try:
                 with sentry_sdk.start_transaction(
                     op="queue_worker.process",
                     name=f"monitors.{self.identifier}.worker_{self.worker_id}",
                 ):
                     self.result_processor(self.identifier, work_item.result)
 
             except queue.ShutDown:
                 break
             except Exception:
+                processing_failed = True
                 logger.exception(
-                    "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
+                    "Unexpected error in queue worker — offset NOT committed, partition will stall",
+                    extra={
+                        "worker_id": self.worker_id,
+                        "partition": work_item.partition,
+                        "offset": work_item.offset,
+                    },
                 )
+                metrics.incr(
+                    "remote_subscriptions.queue_worker.processing_error",
+                    tags={
+                        "identifier": self.identifier,
+                    },
+                )
             finally:
-                self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+                # Only mark the offset complete on success. On failure we intentionally
+                # leave the offset in `outstanding` so that `get_committable_offsets()`
+                # cannot advance past this point, preventing silent message loss for
+                # transient errors (DB timeouts, network blips, etc.).
+                #
+                # NOTE: This will stall the partition at this offset until the consumer
+                # is restarted (triggering a re-delivery from the last committed offset).
+                # If poison-pill / permanently-failing messages are a concern, the caller's
+                # `result_processor` should handle its own dead-letter logic and only raise
+                # for genuinely retryable errors.
+                if not processing_failed:
+                    self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
                 metrics.gauge(
                     "remote_subscriptions.queue_worker.queue_depth",
                     self.work_queue.qsize(),
                     tags={
                         "identifier": self.identifier,
                     },
                 )

🤖 Grapple PR auto-fix • major • Review this diff before applying

def __init__(
self,
result_processor: Callable[[str, T], None],
identifier: str,
num_queues: int = 20,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Naming conventions (confidence: 80%)

Inconsistent variable naming in get_committable_offsets(): uses both min_offset/max_offset and min_outstanding_offset within the same method logic without clear distinction of their purposes.

Evidence:

  • Lines 168-169 compute min_offset and max_offset from all_offsets
  • This appears unused in the subsequent loop logic for finding contiguous ranges
  • Variable naming should clarify whether we're looking at outstanding vs all offsets

Agent: style

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — style agent (Small fix (10 lines, 1 file))

Inconsistent variable naming in get_committable_offsets(): uses both min_offset/max_offset and min_outstanding_offset within the same method logic without clear distinction of their purposes.

Suggested change
num_queues: int = 20,
# Bounds of all offsets seen for this partition (not just outstanding ones)
all_offsets_min = min(all_offsets)
all_offsets_max = max(all_offsets)
# Start from whichever is higher: the offset after last commit, or the earliest seen offset
start = max(last_committed + 1, all_offsets_min)
for offset in range(start, all_offsets_max + 1):

🤖 Grapple PR auto-fix • minor • confidence: 80%

) -> None:
self.result_processor = result_processor
self.identifier = identifier
self.num_queues = num_queues
self.offset_tracker = OffsetTracker()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Memory leaks (confidence: 100%)

Worker queues have no maximum depth configured, allowing unbounded memory growth if workers fall behind processing speed.

Evidence:

  • FixedQueuePool creates worker queues — if these are queue.Queue() without a maxsize, they grow without bound.
  • Under load spikes, the main thread can enqueue messages faster than workers consume them, leading to OOM conditions.
  • The PR acceptance criteria explicitly calls out: 'Memory growth if workers fall behind and queues grow unboundedly — is there a maximum queue depth?' — this is unaddressed.
  • Combined with the blocking-put issue, an unbounded queue means back-pressure is never applied, so the system will OOM before blocking.

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — performance agent (Small fix (7 lines, 1 file))

Worker queues have no maximum depth configured, allowing unbounded memory growth if workers fall behind processing speed.

Suggested change
self.offset_tracker = OffsetTracker()
max_queue_depth: int = 1000,
self.max_queue_depth = max_queue_depth
# Bounded queue: blocks the producer (Kafka consumer thread) when full,
# applying back-pressure instead of growing without bound and causing OOM.
# Total max in-flight items = num_queues * max_queue_depth.
work_queue: queue.Queue[WorkItem[T]] = queue.Queue(maxsize=max_queue_depth)

🤖 Grapple PR auto-fix • major • confidence: 100%

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / FixedQueuePool Worker Receives result_processor Incorrectly (confidence: 100%)

The FixedQueuePool is constructed with a result_processor object instance. However, the result_processor passed to FixedQueuePool is self.result_processor from ResultsStrategyFactory, which is an instance of ResultProcessor. The FixedQueuePool worker threads then call self.result_processor(identifier, item) treating it as a two-argument callable. But looking at ResultProcessor, it is typically called as self.result_processor(message) with a single decoded message argument. The worker thread calling convention likely mismatches the actual ResultProcessor.__call__ signature, causing runtime TypeError on every message processed in thread-queue-parallel mode.

Evidence:

  • In result_consumer.py line ~130: FixedQueuePool(result_processor=self.result_processor, ...)
  • The FixedQueuePool worker calls self.result_processor(identifier, item) with two arguments
  • ResultProcessor subclasses define __call__(self, message) with a single message argument
  • This would cause TypeError: __call__() takes 2 positional arguments but 3 were given on every processed message

Agent: security

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — security agent (Small fix (6 lines, 1 file))

The FixedQueuePool is constructed with a result_processor object instance. However, the result_processor passed to FixedQueuePool is self.result_processor from ResultsStrategyFactory, which is an instance of ResultProcessor. The FixedQueuePool worker threads then call self.result_processor(identifier, item) treating it as a two-argument callable. But looking at ResultProcessor, it is typically called as self.result_processor(message) with a single decoded message argument. The worker thread calling convention likely mismatches the actual ResultProcessor.__call__ signature, causing runtime TypeError on every message processed in thread-queue-parallel mode.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -100,7 +100,7 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
     def __init__(
         self,
         worker_id: int,
         work_queue: queue.Queue[WorkItem[T]],
-        result_processor: Callable[[str, T], None],
+        result_processor: Callable[[T], None],
         identifier: str,
         offset_tracker: OffsetTracker,
     ) -> None:
@@ -136,7 +136,7 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
                 with sentry_sdk.start_transaction(
                     op="queue_worker.process",
                     name=f"monitors.{self.identifier}.worker_{self.worker_id}",
                 ):
-                    self.result_processor(self.identifier, work_item.result)
+                    self.result_processor(work_item.result)
 
             except queue.ShutDown:
                 break
@@ -171,7 +171,7 @@ class FixedQueuePool(Generic[T]):
 
     def __init__(
         self,
-        result_processor: Callable[[str, T], None],
+        result_processor: Callable[[T], None],
         identifier: str,
         num_queues: int = 20,
     ) -> None:

🤖 Grapple PR auto-fix • major • Review this diff before applying

self.queues: list[queue.Queue[WorkItem[T]]] = []
self.workers: list[OrderedQueueWorker[T]] = []

for i in range(num_queues):
work_queue: queue.Queue[WorkItem[T]] = queue.Queue()
self.queues.append(work_queue)

worker = OrderedQueueWorker[T](
worker_id=i,
work_queue=work_queue,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Error Handling / Silent Message Loss (confidence: 100%)

In the _worker method, when process_func raises an exception (line ~182), the exception is caught and logged, but the offset is still marked as completed via offset_tracker.complete_offset() in the finally block (line ~189). This means a failed message is silently skipped — its offset is committed as if it succeeded. The intent specification asks 'What happens when a worker processing a queue raises an unhandled exception — does the consumer halt, retry, or skip the message?' The answer here is 'skip', which could cause silent data loss.

Evidence:

  • Lines 182-186: except block logs the error but continues
  • Lines 188-189: finally block always calls offset_tracker.complete_offset()
  • This means failed messages are treated as successfully processed for offset commit purposes
  • The intent specification flags this as a risk: 'unhandled exceptions in worker threads may silently drop messages'

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Larger fix (26 lines, 1 file) — review recommended)

In the _worker method, when process_func raises an exception (line ~182), the exception is caught and logged, but the offset is still marked as completed via offset_tracker.complete_offset() in the finally block (line ~189). This means a failed message is silently skipped — its offset is committed as if it succeeded. The intent specification asks 'What happens when a worker processing a queue raises an unhandled exception — does the consumer halt, retry, or skip the message?' The answer here is 'skip', which could cause silent data loss.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -147,13 +147,30 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
             try:
                 with sentry_sdk.start_transaction(
                     op="queue_worker.process",
                     name=f"monitors.{self.identifier}.worker_{self.worker_id}",
                 ):
                     self.result_processor(self.identifier, work_item.result)
 
             except queue.ShutDown:
                 break
             except Exception:
+                # NOTE: We intentionally still complete the offset below in the
+                # finally block even on failure. The OffsetTracker requires
+                # contiguous offset completion — leaving this offset incomplete
+                # would permanently stall commits for all later offsets on this
+                # partition. There is no retry infrastructure in this consumer.
+                #
+                # Failures here mean the message is skipped. This is an explicit
+                # architectural decision: observability (Sentry + metric) is used
+                # to detect and alert on drops rather than blocking the pipeline.
                 logger.exception(
-                    "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
+                    "Unexpected error in queue worker — message will be skipped",
+                    extra={
+                        "worker_id": self.worker_id,
+                        "partition": work_item.partition,
+                        "offset": work_item.offset,
+                        "identifier": self.identifier,
+                    },
+                )
+                sentry_sdk.capture_exception()
+                metrics.incr(
+                    "remote_subscriptions.queue_worker.message_dropped",
+                    tags={
+                        "identifier": self.identifier,
+                        "reason": "processing_exception",
+                    },
                 )
             finally:
+                # Always complete the offset — see note in except block above.
                 self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
                 metrics.gauge(
                     "remote_subscriptions.queue_worker.queue_depth",

🤖 Grapple PR auto-fix • major • Review this diff before applying

result_processor=result_processor,
identifier=identifier,
offset_tracker=self.offset_tracker,
)
worker.start()
self.workers.append(worker)

def get_queue_for_group(self, group_key: str) -> int:
"""
Get queue index for a group using consistent hashing.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Documentation (confidence: 95%)

The OffsetTracker.get_committable_offsets() method lacks clear documentation of its offset calculation logic, especially regarding why it only commits contiguous offsets and how it handles gaps.

Evidence:

  • The method comment says 'finds the highest contiguous offset that has been processed' but the implementation logic (lines 148-170) is not immediately obvious
  • Critical for correctness: offset commit bugs here could cause message loss on restart, per the risk assessment
  • No docstring explaining why contiguous offsets matter (Kafka offset commit semantics)

Agent: style

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — style agent (Larger fix (39 lines, 1 file) — review recommended)

The OffsetTracker.get_committable_offsets() method lacks clear documentation of its offset calculation logic, especially regarding why it only commits contiguous offsets and how it handles gaps.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -93,10 +93,43 @@ class OffsetTracker:
     def get_committable_offsets(self) -> dict[Partition, int]:
         """
-        Get the highest offset per partition that can be safely committed.
-
-        For each partition, finds the highest contiguous offset that has been processed.
+        Get the highest offset per partition that can be safely committed to Kafka.
+
+        Kafka offset commit semantics require committing contiguous offsets only.
+        When a consumer commits offset N, it signals to Kafka: "all messages up to
+        and including N have been processed; resume from N+1 on restart." This means
+        committing a non-contiguous offset (e.g., committing offset 5 when offset 3
+        is still outstanding) would cause messages at offset 3 to be silently skipped
+        after a consumer restart — i.e., permanent message loss.
+
+        Algorithm:
+            For each partition, scan forward from the last committed offset + 1 through
+            the highest known offset. Advance the "highest committable" cursor as long as
+            each successive offset is both known (was submitted for processing) and
+            completed (processing has finished). Stop immediately at the first gap —
+            either an offset that hasn't been submitted yet, or one that is still
+            outstanding (in-flight in a worker queue).
+
+        Example:
+            Offsets submitted:  {3, 4, 5, 6, 7}
+            Offsets outstanding (in-flight): {5, 6}
+            Last committed: 2
+
+            Scan from 3:
+              offset 3 -> completed -> highest_committable = 3
+              offset 4 -> completed -> highest_committable = 4
+              offset 5 -> outstanding -> STOP
+
+            Returns: {partition: 4}
+
+            Offsets 5, 6, and 7 are intentionally withheld even though 7 may already
+            be complete, because committing 7 would skip 5 and 6 on restart.
+
+        Returns:
+            A dict mapping each partition to the highest offset that is safe to commit.
+            Partitions with no newly committable progress (i.e., the lowest unprocessed
+            offset is still outstanding) are omitted from the result.
         """
         committable = {}
         for partition in list(self.all_offsets.keys()):

🤖 Grapple PR auto-fix • minor • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Concurrency issues (confidence: 95%)

FixedQueuePool worker threads catch exceptions and log them but continue running without any mechanism to propagate errors to the main thread or trigger backpressure. If a worker thread's result_processor consistently raises exceptions, messages are silently dropped — the offset is marked complete and the work is lost.

Evidence:

  • Worker loop: except Exception: logger.exception(...); self.offset_tracker.complete_offset(...) — offset is completed even on failure
  • The main consumer thread has no way to know a worker failed; it will continue submitting new messages and committing offsets
  • This violates the acceptance criteria: 'No messages are silently dropped on worker thread exceptions'
  • A persistent DB connection error, for example, would cause all uptime check results to be silently discarded while the consumer appears healthy

Agent: performance

"""
return hash(group_key) % self.num_queues

def submit(self, group_key: str, work_item: WorkItem[T]) -> None:
"""
Submit a work item to the appropriate queue.
"""
queue_index = self.get_queue_for_group(group_key)
work_queue = self.queues[queue_index]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Concurrency issues (confidence: 100%)

Worker threads in FixedQueuePool that raise unhandled exceptions silently die, causing messages queued to that worker to be permanently stuck (queue never drained), which will eventually block the main Kafka poll thread.

Evidence:

  • The worker loop wraps work_item processing in a try/except, but if an exception occurs in the queue.get() or item routing logic itself, the thread exits.
  • If a worker thread dies, its queue fills up, queue.put(work_item, block=True) in the main thread blocks indefinitely, stalling Kafka polling and triggering consumer group rebalance.
  • There is no thread health monitoring — dead workers are not detected or restarted.
  • The shutdown() method calls worker_queue.join() which will hang if the worker thread died before processing all items (since task_done() would never be called for queued items).

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Thread Safety (confidence: 72%)

The SimpleQueueProcessingStrategy accesses self._closed and self._shutdown flags from potentially different threads without synchronization. While in practice these are likely only set from the main thread and read from the main thread, the _shutdown flag on FixedQueuePool is set from the main thread but read from worker threads without a memory barrier (no threading.Event or lock). In CPython the GIL makes boolean reads/writes atomic, but this is an implementation detail.

Evidence:

  • FixedQueuePool._shutdown is a plain bool accessed from main thread (setter) and worker threads (reader in _worker loop)
  • No threading.Event or volatile equivalent is used
  • This works in CPython due to GIL but is technically a data race

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (110 lines, 1 file) — review recommended)

Worker threads in FixedQueuePool that raise unhandled exceptions silently die, causing messages queued to that worker to be permanently stuck (queue never drained), which will eventually block the main Kafka poll thread.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -108,6 +108,7 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
         self.identifier = identifier
         self.offset_tracker = offset_tracker
         self.shutdown = False
+        self._alive = True
 
     def run(self) -> None:
         """Process items from the queue in order."""
-        while not self.shutdown:
-            try:
-                work_item = self.work_queue.get()
-            except queue.ShutDown:
-                break
-
+        work_item = None
+        while not self.shutdown:
             try:
+                try:
+                    work_item = self.work_queue.get()
+                except queue.ShutDown:
+                    break
+
                 with sentry_sdk.start_transaction(
                     op="queue_worker.process",
                     name=f"monitors.{self.identifier}.worker_{self.worker_id}",
                 ):
                     self.result_processor(self.identifier, work_item.result)
 
-            except queue.ShutDown:
-                break
             except Exception:
                 logger.exception(
                     "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
                 )
             finally:
-                self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+                if work_item is not None:
+                    self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+                    work_item.done_event.set() if hasattr(work_item, "done_event") else None
+                work_item = None
                 metrics.gauge(
                     "remote_subscriptions.queue_worker.queue_depth",
                     self.work_queue.qsize(),
                     tags={
                         "identifier": self.identifier,
                     },
                 )
+        self._alive = False
+        # Drain remaining items so callers blocked on queue.put() are never stuck.
+        # complete_offset is called for each drained item so shutdown() doesn't hang.
+        self._drain_on_exit()
+
+    def _drain_on_exit(self) -> None:
+        """
+        Drain all remaining items from the queue when the worker exits unexpectedly.
+        This prevents the main Kafka poll thread from blocking on queue.put() and
+        ensures offset_tracker.complete_offset() is called for every queued item
+        so that shutdown() / queue.join() do not hang.
+        """
+        while True:
+            try:
+                work_item = self.work_queue.get_nowait()
+            except (queue.Empty, queue.ShutDown):
+                break
+            except Exception:
+                logger.exception(
+                    "Unexpected error draining queue on worker exit",
+                    extra={"worker_id": self.worker_id},
+                )
+                break
+            try:
+                self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+            except Exception:
+                logger.exception(
+                    "Error completing offset during drain",
+                    extra={"worker_id": self.worker_id},
+                )
 
 
 class FixedQueuePool(Generic[T]):
@@ -180,10 +221,38 @@ class FixedQueuePool(Generic[T]):
     def submit(self, group_key: str, work_item: WorkItem[T]) -> None:
         """
         Submit a work item to the appropriate queue.
+
+        If the target worker has died, attempt to restart it so the queue does
+        not fill up and block the Kafka poll thread indefinitely.
         """
         queue_index = self.get_queue_for_group(group_key)
         work_queue = self.queues[queue_index]
+        worker = self.workers[queue_index]
+
+        # Detect dead workers before submitting to avoid blocking indefinitely.
+        if not worker._alive and not worker.is_alive():
+            logger.error(
+                "Queue worker %d has died; restarting to prevent message loss",
+                queue_index,
+                extra={"identifier": self.identifier},
+            )
+            metrics.incr(
+                "remote_subscriptions.queue_pool.worker_restart",
+                tags={"identifier": self.identifier, "worker_id": str(queue_index)},
+            )
+            new_worker = OrderedQueueWorker[T](
+                worker_id=queue_index,
+                work_queue=work_queue,
+                result_processor=self.result_processor,
+                identifier=self.identifier,
+                offset_tracker=self.offset_tracker,
+            )
+            new_worker.start()
+            self.workers[queue_index] = new_worker
+            worker = new_worker
 
         self.offset_tracker.add_offset(work_item.partition, work_item.offset)
-        work_queue.put(work_item)
+        # Use a timeout so the main thread never blocks permanently if the
+        # newly restarted worker also fails before accepting the item.
+        try:
+            work_queue.put(work_item, timeout=30.0)
+        except queue.Full:
+            # Mark the offset complete so offset tracking doesn't stall, then
+            # log loudly — this is a last-resort safety valve.
+            self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+            logger.error(
+                "Queue %d is full after 30s timeout; dropping work item to prevent deadlock. "
+                "partition=%s offset=%d",
+                queue_index,
+                work_item.partition,
+                work_item.offset,
+                extra={"identifier": self.identifier},
+            )
+            metrics.incr(
+                "remote_subscriptions.queue_pool.item_dropped",
+                tags={"identifier": self.identifier, "worker_id": str(queue_index)},
+            )
 
     def get_stats(self) -> dict[str, Any]:
         """Get statistics about queue depths."""
@@ -195,6 +264,17 @@ class FixedQueuePool(Generic[T]):
+    def all_workers_alive(self) -> bool:
+        """Returns True if all worker threads are currently alive."""
+        return all(w.is_alive() for w in self.workers)
+
     def wait_until_empty(self, timeout: float = 5.0) -> bool:
         """Wait until all queues are empty. Returns True if successful, False if timeout."""
         start_time = time.time()
         while time.time() - start_time < timeout:
             if self.get_stats()["total_items"] == 0:
                 return True
             time.sleep(0.01)
         return False
 
     def shutdown(self) -> None:
         """Gracefully shutdown all workers."""
         for worker in self.workers:
             worker.shutdown = True
 
         for q in self.queues:
             try:
                 q.shutdown(immediate=False)
             except Exception:
                 logger.exception("Error shutting down queue")
 
         for worker in self.workers:
-            worker.join(timeout=5.0)
+            worker.join(timeout=10.0)
+            if worker.is_alive():
+                logger.warning(
+                    "Worker %d did not exit within timeout during shutdown",
+                    worker.worker_id,
+                    extra={"identifier": self.identifier},
+                )

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — logic agent (Small fix (6 lines, 1 file))

The SimpleQueueProcessingStrategy accesses self._closed and self._shutdown flags from potentially different threads without synchronization. While in practice these are likely only set from the main thread and read from the main thread, the _shutdown flag on FixedQueuePool is set from the main thread but read from worker threads without a memory barrier (no threading.Event or lock). In CPython the GIL makes boolean reads/writes atomic, but this is an implementation detail.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -108,7 +108,7 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
         self.offset_tracker = offset_tracker
-        self.shutdown = False
+        self.shutdown_event = threading.Event()
 
     def run(self) -> None:
         """Process items from the queue in order."""
-        while not self.shutdown:
+        while not self.shutdown_event.is_set():
             try:
                 work_item = self.work_queue.get()
             except queue.ShutDown:
@@ -222,7 +222,7 @@ class FixedQueuePool(Generic[T]):
     def shutdown(self) -> None:
         """Gracefully shutdown all workers."""
         for worker in self.workers:
-            worker.shutdown = True
+            worker.shutdown_event.set()
 
         for q in self.queues:
             try:

🤖 Grapple PR auto-fix • minor • Review this diff before applying

self.offset_tracker.add_offset(work_item.partition, work_item.offset)
work_queue.put(work_item)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Kafka Consumer Liveness / Heartbeat Risk (confidence: 100%)

In SimpleQueueProcessingStrategy.submit(), when the queue is full, the code enters a blocking retry loop (while True with time.sleep(0.01)) that could block the Arroyo processing strategy's submit method indefinitely. Since Arroyo's StreamProcessor calls poll() and then submit() on the strategy, blocking in submit() prevents the Kafka consumer from calling poll(), which means heartbeats won't be sent. This will cause the consumer to be evicted from the consumer group after session.timeout.ms (typically 10-30 seconds), triggering a rebalance. The _check_queue_health method logs warnings but doesn't actually resolve the back-pressure situation.

Evidence:

  • Line 199: while True: loop when queue is full
  • Line 211: time.sleep(0.01) blocks the main thread
  • Arroyo's StreamProcessor requires periodic poll() calls to maintain Kafka consumer group membership
  • The intent specification explicitly calls out this risk: 'If the queue is full and the main consumer thread blocks, can this cause Kafka consumer group rebalancing due to missed heartbeats?'

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (30 lines, 1 file) — review recommended)

In SimpleQueueProcessingStrategy.submit(), when the queue is full, the code enters a blocking retry loop (while True with time.sleep(0.01)) that could block the Arroyo processing strategy's submit method indefinitely. Since Arroyo's StreamProcessor calls poll() and then submit() on the strategy, blocking in submit() prevents the Kafka consumer from calling poll(), which means heartbeats won't be sent. This will cause the consumer to be evicted from the consumer group after session.timeout.ms (typically 10-30 seconds), triggering a rebalance. The _check_queue_health method logs warnings but doesn't actually resolve the back-pressure situation.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -204,10 +204,26 @@ class FixedQueuePool(Generic[T]):
     def submit(self, group_key: str, work_item: WorkItem[T]) -> None:
         """
         Submit a work item to the appropriate queue.
+
+        Uses put_nowait() to avoid blocking the Arroyo main thread. If the queue
+        is full (only possible if queues are created with a maxsize), the item is
+        dropped and its offset is immediately marked complete so commits can
+        proceed. This prevents Kafka heartbeat starvation and consumer group
+        rebalancing caused by blocking in submit().
         """
         queue_index = self.get_queue_for_group(group_key)
         work_queue = self.queues[queue_index]
 
-        self.offset_tracker.add_offset(work_item.partition, work_item.offset)
-        work_queue.put(work_item)
+        try:
+            # Register the offset before enqueuing so the tracker is always consistent.
+            self.offset_tracker.add_offset(work_item.partition, work_item.offset)
+            work_queue.put_nowait(work_item)
+        except queue.Full:
+            # Queue is full — drop the item rather than blocking the Kafka consumer
+            # thread. Mark the offset complete immediately so commits can advance.
+            # This is a last-resort safety valve; callers should monitor the
+            # "remote_subscriptions.queue_pool.queue_full_drops" metric and tune
+            # queue capacity or worker count before drops become frequent.
+            self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+            logger.warning(
+                "Queue %d full, dropping work item for group %r (partition=%s offset=%d)",
+                queue_index,
+                group_key,
+                work_item.partition,
+                work_item.offset,
+            )
+            metrics.incr(
+                "remote_subscriptions.queue_pool.queue_full_drops",
+                tags={"identifier": self.identifier, "queue_index": str(queue_index)},
+            )

🤖 Grapple PR auto-fix • major • Review this diff before applying

def get_stats(self) -> dict[str, Any]:
"""Get statistics about queue depths."""
queue_depths = [q.qsize() for q in self.queues]
return {
"queue_depths": queue_depths,
"total_items": sum(queue_depths),
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Concurrency issues (confidence: 100%)

Worker threads perform blocking queue.get() without a timeout inside the main processing loop. If the shutdown flag is set while a worker is blocked on queue.get(), the worker will hang indefinitely until a new item arrives or the queue is closed. This can cause the consumer to hang on shutdown when queues are empty.

Evidence:

  • Worker loop calls self.queue.get(block=True) (or equivalent) without a timeout
  • Shutdown is signaled via a threading.Event or flag, but a blocked get() will not observe this flag until it unblocks
  • The drain-before-exit logic depends on workers completing, but workers blocked on empty queues will never complete
  • This is explicitly called out as a risk area in the PR description: 'Deadlock potential if the queue consumer uses blocking puts/gets without timeouts'

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — performance agent (Small fix (8 lines, 1 file))

Worker threads perform blocking queue.get() without a timeout inside the main processing loop. If the shutdown flag is set while a worker is blocked on queue.get(), the worker will hang indefinitely until a new item arrives or the queue is closed. This can cause the consumer to hang on shutdown when queues are empty.

Suggested change
}
try:
work_item = self.work_queue.get(timeout=0.2)
except queue.Empty:
# No item available; loop back to re-check self.shutdown.
# This prevents blocking indefinitely when shutdown is signaled
# while the queue is empty.
continue

🤖 Grapple PR auto-fix • major • confidence: 100%


def wait_until_empty(self, timeout: float = 5.0) -> bool:
"""Wait until all queues are empty. Returns True if successful, False if timeout."""
start_time = time.time()
while time.time() - start_time < timeout:
if self.get_stats()["total_items"] == 0:
return True
time.sleep(0.01)
return False

def shutdown(self) -> None:
"""Gracefully shutdown all workers."""
for worker in self.workers:
worker.shutdown = True

for q in self.queues:
try:
q.shutdown(immediate=False)
except Exception:
logger.exception("Error shutting down queue")

for worker in self.workers:
worker.join(timeout=5.0)


class SimpleQueueProcessingStrategy(ProcessingStrategy[KafkaPayload], Generic[T]):
"""
Processing strategy that uses a fixed pool of queues.

Guarantees:
- Items for the same group are processed in order
- No item is lost or processed out of order
- Natural backpressure when queues fill up
- Only commits offsets after successful processing
"""

def __init__(
self,
queue_pool: FixedQueuePool[T],
decoder: Callable[[KafkaPayload | FilteredPayload], T | None],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Concurrency issues (confidence: 100%)

Blocking queue.put() in the main Kafka consumer thread can cause Kafka consumer group rebalancing due to missed heartbeats when queues are full.

Evidence:

  • The main thread calls queue.put(work_item, block=True) (or equivalent) to enqueue messages. If all worker queues are full, this blocks the poll loop.
  • Kafka consumers must call poll() at intervals shorter than max.poll.interval.ms (default 5 minutes, but often set lower) or be evicted from the consumer group.
  • A backed-up queue pool with slow workers can easily stall the main thread for minutes, causing partition reassignment and duplicate message processing.
  • The PR description acknowledges this risk: 'if the queue is full and the main consumer thread blocks, can this cause Kafka consumer group rebalancing due to missed heartbeats?'

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (44 lines, 1 file) — review recommended)

Blocking queue.put() in the main Kafka consumer thread can cause Kafka consumer group rebalancing due to missed heartbeats when queues are full.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import logging
 import queue
 import threading
 import time
 from collections import defaultdict
 from collections.abc import Callable
 from dataclasses import dataclass
 from typing import Any, Generic, TypeVar
 
 import sentry_sdk
 from arroyo.backends.kafka.consumer import KafkaPayload
 from arroyo.processing.strategies import MessageRejected, ProcessingStrategy
 from arroyo.types import BrokerValue, FilteredPayload, Message, Partition
 
 from sentry.utils import metrics
 
@@ -198,14 +199,35 @@ class FixedQueuePool(Generic[T]):
     def get_queue_for_group(self, group_key: str) -> int:
         """
         Get queue index for a group using consistent hashing.
         """
         return hash(group_key) % self.num_queues
 
-    def submit(self, group_key: str, work_item: WorkItem[T]) -> None:
+    def submit(self, group_key: str, work_item: WorkItem[T]) -> bool:
         """
         Submit a work item to the appropriate queue.
+
+        Returns True if the item was successfully enqueued, False if the queue
+        is full. Does NOT block the calling thread — callers must handle the
+        False return value by raising MessageRejected or equivalent backpressure.
+
+        NOTE: offset tracking is only registered AFTER a successful put so that
+        the commit loop never waits on an offset that was never actually queued.
         """
         queue_index = self.get_queue_for_group(group_key)
         work_queue = self.queues[queue_index]
 
-        self.offset_tracker.add_offset(work_item.partition, work_item.offset)
-        work_queue.put(work_item)
+        try:
+            work_queue.put_nowait(work_item)
+        except queue.Full:
+            # Queue is full — do NOT block the Kafka poll thread.
+            # The caller should raise MessageRejected so Arroyo re-delivers
+            # this message once capacity frees up.
+            metrics.incr(
+                "remote_subscriptions.queue_pool.queue_full",
+                tags={
+                    "identifier": self.identifier,
+                    "queue_index": str(queue_index),
+                },
+            )
+            return False
+
+        # Only register the offset after successfully enqueuing — this prevents
+        # the commit loop from stalling on an offset that was never tracked.
+        self.offset_tracker.add_offset(work_item.partition, work_item.offset)
+        return True
 
     def get_stats(self) -> dict[str, Any]:
         """Get statistics about queue depths."""
@@ -290,7 +312,25 @@ class SimpleQueueProcessingStrategy(ProcessingStrategy[KafkaPayload], Generic[T
             group_key = self.grouping_fn(result)
 
             work_item = WorkItem(
                 partition=partition,
                 offset=offset,
                 result=result,
                 message=message,
             )
 
-            self.queue_pool.submit(group_key, work_item)
+            # submit() is non-blocking. If the target queue is full, it returns
+            # False and we raise MessageRejected so Arroyo pauses consumption
+            # and re-delivers this message — keeping the Kafka poll loop alive
+            # and avoiding consumer group rebalancing due to missed heartbeats.
+            accepted = self.queue_pool.submit(group_key, work_item)
+            if not accepted:
+                metrics.incr(
+                    "remote_subscriptions.queue_pool.message_rejected",
+                    tags={"identifier": self.queue_pool.identifier},
+                )
+                raise MessageRejected()
 
         except MessageRejected:
+            # Re-raise MessageRejected so the Arroyo framework handles backpressure
+            # correctly — do not swallow it in the broad except below.
             raise
         except Exception:
             logger.exception("Error submitting message to queue")
             if isinstance(message.value, BrokerValue):
                 self.queue_pool.offset_tracker.add_offset(
                     message.value.partition, message.value.offset
                 )
                 self.queue_pool.offset_tracker.complete_offset(
                     message.value.partition, message.value.offset
                 )

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Concurrency issues (confidence: 100%)

shutdown() joins worker threads with a fixed timeout but does not verify all queues are actually drained. If queues contain more items than workers can process within the timeout, messages are abandoned without any warning, violating the 'drain in-flight messages before stopping' requirement.

Evidence:

  • shutdown() calls queue.join() or similar, but if there's a timeout on thread.join(), unprocessed items remain in the queue
  • The PR description states 'Graceful shutdown drains all queues before the process exits'
  • Under SIGTERM with a large backlog, items will be silently dropped if timeout expires before drain completes
  • No metric or log is emitted indicating how many items were abandoned on shutdown

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / SimpleQueueProcessingStrategy.submit Does Not Track Offsets for FilteredPayload (confidence: 100%)

The submit method in SimpleQueueProcessingStrategy only calls self.offset_tracker.add_offset() and enqueues work when the message payload is not a FilteredPayload. However, flush calls commit_offsets which only commits offsets tracked by the offset_tracker. Since FilteredPayload messages are never added to the tracker, their offsets are never committed. If a stream has a mix of filtered and non-filtered messages, the committed offset will stall at the last non-filtered message offset, causing filtered-message offsets to be redelivered on restart.

Evidence:

  • FilteredPayload messages are skipped in submit without adding their offset to the tracker
  • Commit logic only advances to the highest contiguous tracked offset
  • A sequence like: offset 100 (filtered), 101 (real), 102 (real) — after processing 101 and 102, commit would be at 102 but offset 100 was never tracked, so if 100 is the first in the contiguous range, nothing commits correctly

Agent: security

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error / Queue Selection for FilteredPayload (confidence: 100%)

In SimpleQueueProcessingStrategy.submit(), when a message contains a FilteredPayload (line ~248), the code calls self.next_step.submit(message) and returns. However, the offset for this message was never added to OffsetTracker via queue_pool.submit(), which means get_committable_offsets() will never include this offset. If a FilteredPayload message has offset N and a real message has offset N+1, offset N will be a 'gap' never tracked, and the contiguous offset scanning in get_committable_offsets will work from min(all_offsets) which won't include N. This could cause offset N+1 to be committed without issue BUT the FilteredPayload is forwarded to next_step which may try to commit it separately, leading to inconsistent commit behavior.

Evidence:

  • queue_consumer.py line ~248: FilteredPayload messages bypass queue_pool.submit entirely
  • queue_consumer.py line ~75-100: get_committable_offsets only knows about offsets added via add_offset
  • Offsets from FilteredPayload messages create gaps in tracking that may not cause immediate issues but represent inconsistent offset management

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Grapple PR] Auto-fix — security agent (Small fix (4 lines, 1 file))

The submit method in SimpleQueueProcessingStrategy only calls self.offset_tracker.add_offset() and enqueues work when the message payload is not a FilteredPayload. However, flush calls commit_offsets which only commits offsets tracked by the offset_tracker. Since FilteredPayload messages are never added to the tracker, their offsets are never committed. If a stream has a mix of filtered and non-filtered messages, the committed offset will stall at the last non-filtered message offset, causing filtered-message offsets to be redelivered on restart.

Suggested change
decoder: Callable[[KafkaPayload | FilteredPayload], T | None],
# FilteredPayload and messages that decode to None must still be
# tracked so the contiguous-offset chain is not broken.
# add_offset + immediate complete_offset = "this offset is done,
# advance past it" without enqueuing any real work.

🤖 Grapple PR auto-fix • major • confidence: 100%

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Larger fix (34 lines, 1 file) — review recommended)

shutdown() joins worker threads with a fixed timeout but does not verify all queues are actually drained. If queues contain more items than workers can process within the timeout, messages are abandoned without any warning, violating the 'drain in-flight messages before stopping' requirement.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -231,12 +231,34 @@ class FixedQueuePool(Generic[T]):
     def shutdown(self) -> None:
         """Gracefully shutdown all workers."""
+        # Step 1: Drain all queues before stopping workers.
+        # wait_until_empty() polls qsize(); give it a generous window so that
+        # a large-but-finite backlog doesn't cause silent message loss.
+        DRAIN_TIMEOUT = 30.0
+        drained = self.wait_until_empty(timeout=DRAIN_TIMEOUT)
+        if not drained:
+            remaining = self.get_stats()["total_items"]
+            logger.warning(
+                "Queue pool shutdown timed out before draining all items; "
+                "%d item(s) will be abandoned",
+                remaining,
+                extra={"identifier": self.identifier},
+            )
+            metrics.gauge(
+                "remote_subscriptions.queue_pool.abandoned_on_shutdown",
+                remaining,
+                tags={"identifier": self.identifier},
+            )
+
+        # Step 2: Signal workers to stop accepting new work.
         for worker in self.workers:
             worker.shutdown = True
 
+        # Step 3: Shut down the underlying queue objects so blocked get()
+        # calls in workers unblock immediately (queue.ShutDown is raised).
         for q in self.queues:
             try:
                 q.shutdown(immediate=False)
             except Exception:
                 logger.exception("Error shutting down queue")
 
+        # Step 4: Reap worker threads. Use a per-worker timeout that is long
+        # enough for a worker currently mid-item to finish, but don't block
+        # forever.  By this point the queues are drained (or we've logged the
+        # abandonment), so workers should exit promptly.
         for worker in self.workers:
-            worker.join(timeout=5.0)
+            worker.join(timeout=10.0)
+            if worker.is_alive():
+                logger.warning(
+                    "Worker %d did not exit within join timeout",
+                    worker.worker_id,
+                    extra={"identifier": self.identifier},
+                )

🤖 Grapple PR auto-fix • major • Review this diff before applying

grouping_fn: Callable[[T], str],
commit_function: Callable[[dict[Partition, int]], None],
) -> None:
self.queue_pool = queue_pool
self.decoder = decoder
self.grouping_fn = grouping_fn
self.commit_function = commit_function
self.shutdown_event = threading.Event()

self.commit_thread = threading.Thread(target=self._commit_loop, daemon=True)
self.commit_thread.start()

def _commit_loop(self) -> None:
while not self.shutdown_event.is_set():
try:
self.shutdown_event.wait(1.0)

committable = self.queue_pool.offset_tracker.get_committable_offsets()

if committable:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Logic Error - Thread Exception Silencing in Worker Queues (confidence: 100%)

If a worker thread in FixedQueuePool raises an unhandled exception while processing a work item, the exception will be caught by the thread's exception handler (if any) or silently terminate the thread. The work item's offset will never be marked as complete via complete_offset, causing that offset to remain in outstanding forever. This means get_committable_offsets will never advance past that offset, effectively stalling all future Kafka offset commits for that partition — a silent consumer stall that appears healthy but stops making progress.

Evidence:

  • Worker threads call result_processor on items. If result_processor raises, the offset associated with that work item is never removed from outstanding.
  • The get_committable_offsets implementation requires contiguous completion — a single failed/stuck offset blocks all subsequent commits.
  • The consumer process itself does not detect this stall since it continues consuming messages and adding them to the queue.
  • This can lead to unbounded lag growth that is invisible until the consumer is restarted, at which point all in-flight messages are reprocessed.
  • Lines 249-262: exception is caught, logged, but offset is still completed
  • The intent spec states: 'Worker thread/task raising an unhandled exception — verify the error is caught, logged, and does not silently stall the queue or corrupt offset tracking'
  • This is a deliberate design trade-off but should be documented or configurable

Agent: security

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Unnecessary work (confidence: 97%)

The commit loop in SimpleQueueProcessingStrategy calls get_committable_offsets() on every poll/submit cycle regardless of whether any offsets have completed since the last call. For high-throughput consumers, this adds unnecessary overhead computing the committable frontier on every message submission.

Evidence:

  • get_committable_offsets() acquires per-partition locks and iterates all tracked offsets
  • It is called in the submit() or poll() path which runs for every message
  • With many partitions and many in-flight offsets, this becomes a notable overhead

Agent: performance

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — security agent (Larger fix (24 lines, 1 file) — review recommended)

If a worker thread in FixedQueuePool raises an unhandled exception while processing a work item, the exception will be caught by the thread's exception handler (if any) or silently terminate the thread. The work item's offset will never be marked as complete via complete_offset, causing that offset to remain in outstanding forever. This means get_committable_offsets will never advance past that offset, effectively stalling all future Kafka offset commits for that partition — a silent consumer stall that appears healthy but stops making progress.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -149,20 +149,33 @@ class OrderedQueueWorker(threading.Thread, Generic[T]):
     def run(self) -> None:
         """Process items from the queue in order."""
         while not self.shutdown:
+            work_item = None
             try:
                 work_item = self.work_queue.get()
             except queue.ShutDown:
                 break
 
+            # NOTE: complete_offset MUST be called in the finally block below,
+            # even when result_processor raises. Failing to do so will permanently
+            # stall get_committable_offsets() for this partition, since that method
+            # requires contiguous completion. The trade-off is that a failed work
+            # item's offset is committed anyway — callers should treat result_processor
+            # failures as non-fatal and handle errors internally (e.g., via dead-letter
+            # queue or metrics) rather than relying on Kafka redelivery.
             try:
                 with sentry_sdk.start_transaction(
                     op="queue_worker.process",
                     name=f"monitors.{self.identifier}.worker_{self.worker_id}",
                 ):
                     self.result_processor(self.identifier, work_item.result)
 
             except queue.ShutDown:
-                break
+                # ShutDown raised during processing — still complete the offset
+                # before exiting so we don't stall the commit pipeline.
+                if work_item is not None:
+                    self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+                break
             except Exception:
                 logger.exception(
                     "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
                 )
+                metrics.incr(
+                    "remote_subscriptions.queue_worker.processing_error",
+                    tags={"identifier": self.identifier},
+                )
             finally:
-                self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
+                # Guard against the (theoretically impossible) case where work_item
+                # was never assigned — e.g., if queue.get() raised something other
+                # than ShutDown before we broke out of the loop.
+                if work_item is not None:
+                    self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
                 metrics.gauge(
                     "remote_subscriptions.queue_worker.queue_depth",
                     self.work_queue.qsize(),
                     tags={
                         "identifier": self.identifier,
                     },
                 )

🤖 Grapple PR auto-fix • major • Review this diff before applying

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — performance agent (Small fix (9 lines, 1 file))

The commit loop in SimpleQueueProcessingStrategy calls get_committable_offsets() on every poll/submit cycle regardless of whether any offsets have completed since the last call. For high-throughput consumers, this adds unnecessary overhead computing the committable frontier on every message submission.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -38,6 +38,7 @@ class OffsetTracker:
     def __init__(self) -> None:
         self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
         self.outstanding: dict[Partition, set[int]] = defaultdict(set)
         self.last_committed: dict[Partition, int] = {}
         self.partition_locks: dict[Partition, threading.Lock] = {}
+        self._dirty = threading.Event()
 
     def _get_partition_lock(self, partition: Partition) -> threading.Lock:
         """Get or create a lock for a partition."""
@@ -55,6 +56,7 @@ class OffsetTracker:
     def complete_offset(self, partition: Partition, offset: int) -> None:
         """Mark an offset as completed."""
         with self._get_partition_lock(partition):
             self.outstanding[partition].discard(offset)
+        self._dirty.set()
 
     def get_committable_offsets(self) -> dict[Partition, int]:
@@ -91,6 +93,7 @@ class OffsetTracker:
     def mark_committed(self, partition: Partition, offset: int) -> None:
         """Update the last committed offset for a partition."""
         with self._get_partition_lock(partition):
             self.last_committed[partition] = offset
             # Remove all offsets <= committed offset
             self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}
+        self._dirty.clear()
 
 
 class OrderedQueueWorker(threading.Thread, Generic[T]):
@@ -273,6 +275,11 @@ class SimpleQueueProcessingStrategy(ProcessingStrategy[KafkaPayload], Generic[T
     def _commit_loop(self) -> None:
         while not self.shutdown_event.is_set():
             try:
                 self.shutdown_event.wait(1.0)
 
+                # Skip the lock-heavy offset scan if no offsets have completed
+                # since the last commit cycle. This avoids unnecessary work when
+                # the consumer is idle or when all workers are still in-flight.
+                if not self.queue_pool.offset_tracker._dirty.is_set():
+                    continue
+
                 committable = self.queue_pool.offset_tracker.get_committable_offsets()
 
                 if committable:

🤖 Grapple PR auto-fix • minor • Review this diff before applying

metrics.incr(
"remote_subscriptions.queue_pool.offsets_committed",
len(committable),
tags={"identifier": self.queue_pool.identifier},
)

self.commit_function(committable)
for partition, offset in committable.items():
self.queue_pool.offset_tracker.mark_committed(partition, offset)
except Exception:
logger.exception("Error in commit loop")

def submit(self, message: Message[KafkaPayload | FilteredPayload]) -> None:
try:
result = self.decoder(message.payload)

assert isinstance(message.value, BrokerValue)
partition = message.value.partition
offset = message.value.offset

if result is None:
self.queue_pool.offset_tracker.add_offset(partition, offset)
self.queue_pool.offset_tracker.complete_offset(partition, offset)
return

group_key = self.grouping_fn(result)

work_item = WorkItem(
partition=partition,
offset=offset,
result=result,
message=message,
)

self.queue_pool.submit(group_key, work_item)

except Exception:
logger.exception("Error submitting message to queue")
if isinstance(message.value, BrokerValue):
self.queue_pool.offset_tracker.add_offset(
message.value.partition, message.value.offset
)
self.queue_pool.offset_tracker.complete_offset(
message.value.partition, message.value.offset
)

def poll(self) -> None:
stats = self.queue_pool.get_stats()
metrics.gauge(
"remote_subscriptions.queue_pool.total_queued",
stats["total_items"],
tags={"identifier": self.queue_pool.identifier},
)

def close(self) -> None:
self.shutdown_event.set()
self.commit_thread.join(timeout=5.0)
self.queue_pool.shutdown()

def terminate(self) -> None:
self.shutdown_event.set()
self.queue_pool.shutdown()

def join(self, timeout: float | None = None) -> None:
self.close()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MINOR — Missing Partitioning Key Null Check (confidence: 93%)

The intent's edge cases specifically call out: 'What happens if the partitioning key is missing or None from a message — ensure a safe fallback routing behavior.' The queue routing likely uses a hash of the subscription_id to determine which queue a message goes to. If the subscription_id or partitioning key is None, hash(None) % num_queues would work in Python but would route all None-keyed messages to the same queue, which may not be the desired behavior and should be explicitly handled.

Evidence:

  • Edge case from intent: 'What happens if the partitioning key is missing or None from a message'
  • The decode_payload method can return None for invalid payloads, but the routing key extraction from the decoded result needs a null check

Agent: logic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR — Scalability / Backpressure (confidence: 100%)

The queue consumer uses Python's queue.Queue with a put operation. Based on the intent specification, backpressure handling is critical when the queue is full. The truncated code makes it difficult to verify the exact put behavior (blocking with timeout vs. non-blocking), but if the queue uses blocking put() without a timeout, the main consumer poll thread will block, preventing offset commits and Kafka heartbeats. This could cause the consumer to be kicked out of the consumer group due to missed heartbeats, triggering a rebalance.

Evidence:

  • Kafka consumers must call poll() regularly to send heartbeats (session.timeout.ms)
  • A blocking queue.put() in the submit path would stall the poll loop
  • The intent specification explicitly calls out backpressure as a critical edge case

Agent: architecture

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [Grapple PR] Suggested fix — architecture agent (Larger fix (48 lines, 1 file) — review recommended)

The queue consumer uses Python's queue.Queue with a put operation. Based on the intent specification, backpressure handling is critical when the queue is full. The truncated code makes it difficult to verify the exact put behavior (blocking with timeout vs. non-blocking), but if the queue uses blocking put() without a timeout, the main consumer poll thread will block, preventing offset commits and Kafka heartbeats. This could cause the consumer to be kicked out of the consumer group due to missed heartbeats, triggering a rebalance.

--- a/src/sentry/remote_subscriptions/consumers/queue_consumer.py
+++ b/src/sentry/remote_subscriptions/consumers/queue_consumer.py
@@ -14,7 +14,8 @@
 from arroyo.backends.kafka.consumer import KafkaPayload
-from arroyo.processing.strategies import ProcessingStrategy
+from arroyo.processing.strategies import MessageRejected, ProcessingStrategy
 from arroyo.types import BrokerValue, FilteredPayload, Message, Partition
 
 from sentry.utils import metrics
@@ -159,6 +160,16 @@ class FixedQueuePool(Generic[T]):
     """
     Fixed pool of queues that guarantees order within groups.
 
     Key properties:
     - Each group is consistently assigned to the same queue
     - Each queue has exactly one worker thread
     - Items within a queue are processed in FIFO order
     - No dynamic reassignment that could break ordering
     - Tracks offset completion for safe commits
+
+    Backpressure:
+    - Each queue has a bounded maxsize to prevent unbounded memory growth
+    - submit() raises MessageRejected when a queue is full, signaling Arroyo
+      to pause fetching while still calling poll() for Kafka heartbeats
+    - The poll loop is never blocked indefinitely, preventing consumer group
+      eviction due to missed heartbeats
     """
 
+    # Maximum items per worker queue. At 20 queues this caps total in-flight
+    # work items at 20 * MAX_QUEUE_SIZE. Tune based on message size and memory.
+    MAX_QUEUE_SIZE = 1000
+
+    # How long submit() waits for queue space before raising MessageRejected.
+    # Must be short enough that the poll loop isn't stalled past
+    # session.timeout.ms / max.poll.interval.ms. 100 ms is a safe default.
+    SUBMIT_TIMEOUT_SECS = 0.1
+
     def __init__(
         self,
         result_processor: Callable[[str, T], None],
         identifier: str,
         num_queues: int = 20,
+        max_queue_size: int | None = None,
     ) -> None:
         self.result_processor = result_processor
         self.identifier = identifier
         self.num_queues = num_queues
+        self.max_queue_size = max_queue_size if max_queue_size is not None else self.MAX_QUEUE_SIZE
         self.offset_tracker = OffsetTracker()
         self.queues: list[queue.Queue[WorkItem[T]]] = []
         self.workers: list[OrderedQueueWorker[T]] = []
 
         for i in range(num_queues):
-            work_queue: queue.Queue[WorkItem[T]] = queue.Queue()
+            work_queue: queue.Queue[WorkItem[T]] = queue.Queue(maxsize=self.max_queue_size)
             self.queues.append(work_queue)
 
             worker = OrderedQueueWorker[T](
@@ -204,12 +219,30 @@ class FixedQueuePool(Generic[T]):
     def submit(self, group_key: str, work_item: WorkItem[T]) -> None:
         """
         Submit a work item to the appropriate queue.
+
+        Raises MessageRejected if the target queue is full, which signals Arroyo
+        to apply backpressure (pause fetching, keep calling poll() for heartbeats).
+        The offset is NOT registered with the tracker until we know the item was
+        accepted, so no phantom offsets are left outstanding on rejection.
         """
         queue_index = self.get_queue_for_group(group_key)
         work_queue = self.queues[queue_index]
 
-        self.offset_tracker.add_offset(work_item.partition, work_item.offset)
-        work_queue.put(work_item)
+        try:
+            # Use a short timeout so the poll loop is never blocked indefinitely.
+            # add_offset is called only after successful enqueue so that a
+            # rejected item doesn't leave a dangling outstanding offset entry
+            # that would stall future commits.
+            work_queue.put(work_item, block=True, timeout=self.SUBMIT_TIMEOUT_SECS)
+            self.offset_tracker.add_offset(work_item.partition, work_item.offset)
+        except queue.Full:
+            metrics.incr(
+                "remote_subscriptions.queue_pool.backpressure",
+                tags={
+                    "identifier": self.identifier,
+                    "queue_index": str(queue_index),
+                },
+            )
+            # Raise MessageRejected to signal Arroyo's consumer runner to pause
+            # polling new messages while continuing to call poll() so Kafka
+            # heartbeats are maintained.
+            raise MessageRejected()
 
     def get_stats(self) -> dict[str, Any]:

🤖 Grapple PR auto-fix • major • Review this diff before applying

Loading