This PR refactors the SpanFlusher from a single-process architecture to a multiprocess architecture with configurable process limits. The main enhancement is the ability to distribute shard processing across multiple processes for better parallelization and resource utilization. Overall, the implementation is solid, but there are several issues that need attention.
The comment says "Optimistically reset healthy_since to avoid a race between the starting process and the next flush cycle", but this creates a problematic race condition:
Problem: If a process crashes immediately after being created but before _create_process_for_shards returns, the health check in _ensure_processes_alive() will think the process is healthy for up to max-unhealthy-seconds. This could delay detection of a crashing process.
Recommendation: Consider setting healthy_since to 0 initially and updating it only once the process has successfully started its main loop. Alternatively, add a grace period constant and handle newly started processes differently.
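One way to combine the two alternatives above is sketched here. It assumes healthy_since holds a Unix timestamp that the worker refreshes from its main loop, with 0 meaning "has not reported yet". The names is_process_unhealthy, started_at, and STARTUP_GRACE_SECONDS are hypothetical and do not come from the PR.

```python
STARTUP_GRACE_SECONDS = 5.0  # hypothetical constant, not taken from the PR


def is_process_unhealthy(
    healthy_since: float,
    started_at: float,
    now: float,
    max_unhealthy_seconds: float,
    startup_grace_seconds: float = STARTUP_GRACE_SECONDS,
) -> bool:
    """Return True if the worker should be treated as unhealthy.

    Assumes healthy_since == 0 means the worker has not reached its main loop yet.
    """
    if healthy_since == 0:
        # Never reported in: allow only the short startup grace period.
        return now - started_at > startup_grace_seconds
    # Reported at least once: apply the regular staleness threshold.
    return now - healthy_since > max_unhealthy_seconds
```

With this split, a process that dies before its first loop iteration is flagged after the grace period rather than after the full max-unhealthy-seconds window.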
2. Inconsistent Exception Handling in Process Termination 🔴
If a process hangs and doesn't respond to the stopped flag, you'll wait until the deadline, then call terminate(), but you don't wait for terminate() to complete.
You're calling terminate() on all processes after the deadline, even those that already exited gracefully
You should probably call process.join(timeout=...) instead of manually polling is_alive()
Recommendation:
    for process_index, process in self.processes.items():
        if deadline is not None:
            remaining_time = max(0, deadline - time.time())
        else:
            remaining_time = None

        # Use join with timeout instead of polling
        if isinstance(process, multiprocessing.Process):
            process.join(timeout=remaining_time)
            if process.is_alive():
                process.terminate()
                process.join(timeout=1.0)  # Wait briefly for terminate
        else:
            # For threads
            process.join(timeout=remaining_time)
    def _create_process_for_shard(self, shard: int):
        # Find which process this shard belongs to and restart that process
        for process_index, shards in self.process_to_shards_map.items():
            if shard in shards:
                self._create_process_for_shards(process_index, shards)
                break
Issue: This method is defined but never called anywhere in the codebase. It seems like it was intended for future use but isn't currently needed.
Recommendation: Remove it unless there's a plan to use it soon. Dead code increases maintenance burden.
    step.poll()
    # Give flusher threads time to process after drift change
    time.sleep(0.1)
Issue: Using a fixed sleep of 0.1 seconds in tests can be flaky, especially on slower CI machines. The test might fail intermittently.
Recommendation: Use a retry loop with timeout:
    step.poll()
    # Wait for flusher to process after drift change
    max_wait = 2.0
    start = time.time()
    while not messages and time.time() - start < max_wait:
        time.sleep(0.01)
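If the same wait pattern is needed in several tests, it could be factored into a small helper. The sketch below is one possible shape; wait_until is a hypothetical name, not an existing helper in the codebase, and it uses time.monotonic() so that wall-clock adjustments do not affect the timeout.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 2.0,
    interval: float = 0.01,
) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met during the final sleep is not missed.
    return predicate()
```

A test could then write assert wait_until(lambda: bool(messages)), which fails with a clear assertion instead of silently continuing when the timeout expires.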
The __init__ method now has a new max_processes parameter but no docstring explaining it.
Recommendation: Add parameter documentation:
    def __init__(
        self,
        buffer: SpansBuffer,
        next_step: ProcessingStrategy[FilteredPayload | int],
        max_processes: int | None = None,  # Document this!
        produce_to_pipe: Callable[[KafkaPayload], None] | None = None,
    ):
        """
        ...existing docstring...

        :param max_processes: Maximum number of processes to spawn for flushing.
            Defaults to the number of assigned shards.
        """
10. CLAUDE.md Change is Unrelated 🟢
Location: CLAUDE.md:448-456
The addition of the hasattr() vs isinstance() anti-pattern is good style guidance, but it's not related to this PR's purpose (multiprocess span flusher enhancement).
Recommendation: This should ideally be in a separate commit or PR. However, since it's already here and it's a documentation change that doesn't affect functionality, it's acceptable to leave it.
Positive Observations ✅
Good Process Distribution Logic: The round-robin distribution of shards across processes (lines 64-66) is clean and ensures even load distribution.
Proper Sentry SDK Tagging: The addition of sentry_spans_buffer_shards tag helps with observability in production.
Comprehensive Test Coverage: The new test test_flusher_processes_limit() properly validates the multiprocess behavior.
Health Monitoring per Process: The refactor from single healthy_since to per-process tracking is a significant improvement.
Memory Backpressure Handling: The aggregation of memory info across all buffers is the correct approach.
Error Context Improvement: The error message now includes shard information: f"flusher process for shards {shards} crashed repeatedly" which will help debugging.
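For readers unfamiliar with the round-robin distribution noted in the first observation above, a minimal standalone sketch follows. It is not the PR's code: distribute_shards is a hypothetical function, and capping the process count at the number of shards is an assumption made for this illustration.

```python
from typing import Dict, List, Optional


def distribute_shards(
    shards: List[int], max_processes: Optional[int] = None
) -> Dict[int, List[int]]:
    """Assign shards to process indices in round-robin order."""
    if not shards:
        return {}
    # Default to one process per shard, and never use more processes than shards.
    # (A max_processes of 0 or None falls back to the default in this sketch.)
    num_processes = min(max_processes or len(shards), len(shards))
    mapping: Dict[int, List[int]] = {i: [] for i in range(num_processes)}
    for position, shard in enumerate(shards):
        mapping[position % num_processes].append(shard)
    return mapping
```

Because each shard goes to the next process index in turn, the number of shards per process differs by at most one.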
The refactoring from single-process to multi-process is architecturally sound. The design properly:
Maintains backward compatibility with max_processes defaulting to the number of shards
Handles both multiprocessing and threading (for tests)
Preserves health monitoring per process
Aggregates metrics and memory info correctly
The approach aligns well with Sentry's architecture patterns shown in CLAUDE.md. Great work overall! Just address the critical and major issues before merging.
Test 6
Replicated from ai-code-review-evaluation/sentry-greptile#6