feat(uptime): Add ability to use queues to manage parallelism#9
ShashankFC wants to merge 1 commit into
One potential problem with batch processing is that any one slow item will clog up the whole batch. This PR implements a queueing method instead, where we keep N queues that each have their own workers. There's still a chance of individual items backlogging a queue, but we can try increased concurrency here to reduce the chances of that happening.
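As a rough illustration of the idea described above, here is a minimal sketch of N queues that each have their own worker. It is not the code from this PR: `QueuePool`, `queue_index`, `submit`, and `wait_and_shutdown` are hypothetical names, and routing by a CRC32 hash of a string key is an assumption made for the example. The point it shows is that a slow item only delays the queue it was routed to, while items that share a key stay in order.

```python
import logging
import queue
import threading
import zlib
from typing import Callable

logger = logging.getLogger(__name__)


class QueuePool:
    """Hypothetical pool: N FIFO queues, one worker thread per queue."""

    def __init__(self, num_queues: int, handler: Callable[[str, str], None]) -> None:
        self.handler = handler
        self.queues: list[queue.Queue] = [queue.Queue() for _ in range(num_queues)]
        self.workers = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self.queues
        ]
        for worker in self.workers:
            worker.start()

    def queue_index(self, key: str) -> int:
        # Assumed routing rule: a stable hash, so the same key always maps
        # to the same queue and is handled by the same worker.
        return zlib.crc32(key.encode()) % len(self.queues)

    def submit(self, key: str, item: str) -> None:
        self.queues[self.queue_index(key)].put((key, item))

    def _run(self, q: queue.Queue) -> None:
        while True:
            work = q.get()
            try:
                if work is None:  # sentinel: stop this worker
                    return
                self.handler(*work)
            except Exception:
                # Keep the worker alive so one bad item cannot stall its queue.
                logger.exception("Error processing item")
            finally:
                q.task_done()

    def wait_and_shutdown(self) -> None:
        for q in self.queues:
            q.join()  # wait until every submitted item has been handled
        for q in self.queues:
            q.put(None)
        for worker in self.workers:
            worker.join()
```

With this shape, raising `num_queues` spreads keys over more workers, which is one way to read the "increased concurrency" mentioned in the description; a single slow key can still back up the one queue it hashes to.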
Claude finished @ShashankFC's task in 1m 43s
Code Review Complete
Summary
This PR adds a new

Critical Issues

1. Race Condition in Lock Acquisition (src/sentry/remote_subscriptions/consumers/queue_consumer.py:49-54)

    def _get_partition_lock(self, partition: Partition) -> threading.Lock:
        """Get or create a lock for a partition."""
        lock = self.partition_locks.get(partition)
        if lock:
            return lock
        return self.partition_locks.setdefault(partition, threading.Lock())

Issue: Classic check-then-act race condition. If two threads call this simultaneously for a new partition, both could pass the

Impact: This could lead to data corruption in offset tracking since different threads might use different locks for the same partition, breaking the mutual exclusion guarantee.

Recommendation: Use a global lock for partition lock creation:

    def __init__(self) -> None:
        self.all_offsets: dict[Partition, set[int]] = defaultdict(set)
        self.outstanding: dict[Partition, set[int]] = defaultdict(set)
        self.last_committed: dict[Partition, int] = {}
        self.partition_locks: dict[Partition, threading.Lock] = {}
        self._partition_locks_lock = threading.Lock()  # Add this

    def _get_partition_lock(self, partition: Partition) -> threading.Lock:
        with self._partition_locks_lock:
            return self.partition_locks.setdefault(partition, threading.Lock())

2. Inefficient Offset Commit Algorithm (src/sentry/remote_subscriptions/consumers/queue_consumer.py:67-98)
The

    for offset in range(start, max_offset + 1):  # Could iterate millions of times
        if offset in all_offsets and offset not in outstanding:
            highest_committable = offset
        else:
            break

Impact: If there's a gap of 1 million offsets, this iterates 1 million times checking set membership on each iteration.

Recommendation: Use sorted offset tracking:

    def get_committable_offsets(self) -> dict[Partition, int]:
        committable = {}
        for partition in list(self.all_offsets.keys()):
            with self._get_partition_lock(partition):
                all_offsets = sorted(self.all_offsets[partition])
                if not all_offsets:
                    continue
                outstanding = self.outstanding[partition]
                last_committed = self.last_committed.get(partition, -1)
                highest_committable = last_committed
                for offset in all_offsets:
                    if offset <= last_committed:
                        continue
                    if offset in outstanding:
                        break
                    if offset != highest_committable + 1:
                        break
                    highest_committable = offset
                if highest_committable > last_committed:
                    committable[partition] = highest_committable
        return committable

3. Missing Error Handling in Worker Thread (src/sentry/remote_subscriptions/consumers/queue_consumer.py:127-157)
The worker thread catches exceptions but continues running even after

    def run(self) -> None:
        while not self.shutdown:
            try:
                work_item = self.work_queue.get()
            except queue.ShutDown:
                break
            try:
                ...  # processing code elided
            except queue.ShutDown:  # This is unreachable!
                break

Recommendation: Restructure exception handling:

    def run(self) -> None:
        while not self.shutdown:
            try:
                work_item = self.work_queue.get(timeout=1.0)
            except queue.Empty:
                continue
            except queue.ShutDown:
                break
            try:
                with sentry_sdk.start_transaction(...):
                    self.result_processor(self.identifier, work_item.result)
            except Exception:
                logger.exception(...)
            finally:
                self.offset_tracker.complete_offset(...)

High-Severity Issues

4. Potential Memory Leak in Offset Tracking (src/sentry/remote_subscriptions/consumers/queue_consumer.py:100-106)
The

    def mark_committed(self, partition: Partition, offset: int) -> None:
        with self._get_partition_lock(partition):
            self.last_committed[partition] = offset
            # Remove all offsets <= committed offset
            self.all_offsets[partition] = {o for o in self.all_offsets[partition] if o > offset}

Recommendation: Add periodic cleanup or bounded set size with logging when threshold is exceeded.

5. Busy Loop in Commit Thread (src/sentry/remote_subscriptions/consumers/queue_consumer.py:273-291)
The commit loop uses

Medium-Severity Issues

6. Hardcoded Number of Queues (src/sentry/remote_subscriptions/consumers/result_consumer.py:136)

    num_queues=max_workers or 20,  # Number of parallel queues

The default of 20 queues seems arbitrary and isn't documented. This should either:
7. Silent Failure on Decoder Error (src/sentry/remote_subscriptions/consumers/queue_consumer.py:317-325)
When message decoding fails, the exception is logged but the offset is still marked as complete. While this prevents poison messages from blocking the consumer, it could hide systematic issues.

Recommendation: Add metrics to track decoder failures:

    except Exception:
        logger.exception("Error submitting message to queue")
        metrics.incr("queue_consumer.decoder_error", tags={"identifier": self.queue_pool.identifier})
        if isinstance(message.value, BrokerValue):
            # ... existing code

8. Missing Timeout in
Test 9

---
Replicated from ai-code-review-evaluation/sentry-greptile#9