Span Buffer Multiprocess Enhancement with Health Monitoring #10

Open
akshayutture-augment wants to merge 1 commit into span-flusher-stable from span-flusher-multiprocess

@akshayutture-augment

@akshayutture-augment akshayutture-augment commented Nov 14, 2025

Test 6

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable span flusher process limits via new --flusher-processes CLI option
    • Enabled multi-process span flushing for improved parallelism and shard distribution across processes
  • Documentation

    • Updated code examples clarifying proper type checking patterns
  • Tests

    • Added test coverage for process limit configuration
    • Updated span processing tests with enhanced validation

Replicated from ai-code-review-evaluation/sentry-coderabbit#6

@akshayutture-augment

@coderabbitai review

@akshayutture-augment

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Nov 18, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Nov 18, 2025

Walkthrough

The changes implement multi-process shard parallelism for the span flusher consumer. A new --flusher-processes configuration option is added, allowing the factory to create multiple flusher processes. The flusher now manages per-process/shard state tracking, metrics tagging, and health monitoring.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Configuration (CLAUDE.md, src/sentry/consumers/__init__.py) | Added example code blocks to CLAUDE.md illustrating hasattr vs isinstance usage for unions. Added --flusher-processes click option (int, default 1) to process-spans consumer configuration. |
| Factory Integration (src/sentry/spans/consumers/process/factory.py) | Added optional flusher_processes parameter (default None) to ProcessSpansStrategyFactory.__init__, propagated to SpanFlusher via max_processes argument. |
| Core Flusher Multi-Process Support (src/sentry/spans/consumers/process/flusher.py) | Major refactoring: added max_processes parameter to SpanFlusher.__init__; introduced process-to-shards mapping, per-process state trackers (healthy_since, backpressure_since, restarts), and per-process buffers registry. Added _create_processes, _create_process_for_shards, _create_process_for_shard helpers. Updated main signature to include shards parameter; added shard-aware tagging and metrics. Refactored process lifecycle management and join/termination logic to handle multiple processes. |
| Tests (tests/sentry/spans/consumers/process/test_consumer.py, tests/sentry/spans/consumers/process/test_flusher.py) | Added test_flusher_processes_limit to verify flusher respects max process limit and distributes shards correctly. Updated test_basic to use transaction=True django_db marker. Updated backpressure assertion in test_flusher to inspect per-process state dictionary. |

Sequence Diagram(s)

sequenceDiagram
    participant Config as CLI Config
    participant Factory as ProcessSpansStrategyFactory
    participant Flusher as SpanFlusher
    participant Processes as Worker Processes

    Config->>Factory: Create with flusher_processes=N
    Factory->>Flusher: __init__(max_processes=N)
    activate Flusher
    Flusher->>Flusher: Compute shard-to-process mapping
    Flusher->>Flusher: Create process_to_shards_map<br/>per-process state trackers
    deactivate Flusher
    
    rect rgb(200, 240, 255)
        note over Factory,Flusher: Process Creation Phase
        loop For each shard subset
            Flusher->>Processes: _create_process_for_shards(shards)
            Processes->>Processes: Start worker (main method)
        end
    end
    
    rect rgb(240, 200, 255)
        note over Flusher,Processes: Runtime: Per-Process Shard Handling
        Flusher->>Processes: submit() accumulates per-process buffers
        Processes->>Processes: main(shards=[...]) processes shard subset
        Processes->>Processes: Tag metrics with shard_tag
        Processes->>Flusher: Report per-process backpressure/health
    end
    
    rect rgb(240, 220, 200)
        note over Flusher,Processes: Lifecycle Management
        Flusher->>Processes: _ensure_processes_alive checks all processes
        Flusher->>Processes: On failure: restart via _create_process_for_shards
        Flusher->>Processes: join() terminates all worker processes
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

Areas requiring close attention:

  • src/sentry/spans/consumers/process/flusher.py: Extensive refactoring of process lifecycle management, particularly:
    • Correctness of shard-to-process distribution logic in _create_processes and related helpers
    • Per-process state initialization and propagation (process_to_shards_map, process_backpressure_since, etc.)
    • Multi-process synchronization and IPC data flow consistency
    • Restart/failure handling for multiple processes
  • Per-process metrics and backpressure aggregation: Ensure shard tagging and per-process state checks are correctly applied across all metrics calls and health monitoring paths
  • Test coverage: Verify test assertions correctly validate shard distribution and multi-process behavior

Poem

🐰 Through fields of shards, our workers hop,
No longer one, but many-top!
Each process claims its shard domain,
While flushing streams flow down the lane.
Parallelism blooms, our spans now race,
Across the cluster, process-pace!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main change: introducing multiprocess capabilities with health monitoring to the span buffer/flusher system. |

@akshayutture-augment

akshayutture-augment commented Nov 19, 2025

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Nov 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (7)
src/sentry/spans/consumers/process/flusher.py (5)

43-80: Max-processes and shard mapping logic mostly sound; consider minor edge-case guard

The per-process shard mapping and state dictionaries look coherent, and num_processes = min(max_processes, len(buffer.assigned_shards)) correctly bounds process count by shard count.

One small defensive improvement you might want to consider:

  • If buffer.assigned_shards were ever empty (e.g. factory called with an empty partition set), self.max_processes and self.num_processes would both become 0 and _create_processes() would create no workers. If that state is possible, explicitly asserting or handling the “no shards” case at construction time (e.g. early-return or a clear error) would make the behavior easier to reason about.

Otherwise, the mapping and per-process state initializations (process_* dicts, self.buffers) are consistent with the multi-process design.
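
To make the edge case concrete, here is a minimal sketch of a bounded shard distribution with an explicit "no shards" guard. The helper name assign_shards and the round-robin placement are assumptions for illustration; the PR's actual distribution logic in _create_processes may differ.

```python
from __future__ import annotations

from collections import defaultdict


def assign_shards(assigned_shards: list[int], max_processes: int | None) -> dict[int, list[int]]:
    """Distribute shards round-robin over at most max_processes workers.

    Hypothetical helper, not the PR's code.
    """
    if not assigned_shards:
        # Fail loudly instead of silently creating zero workers.
        raise ValueError("no shards assigned; cannot start any flusher process")
    if max_processes is not None and max_processes < 1:
        raise ValueError("max_processes must be >= 1")

    # None means "one process per shard"; never exceed the shard count.
    limit = max_processes or len(assigned_shards)
    num_processes = min(limit, len(assigned_shards))

    mapping: dict[int, list[int]] = defaultdict(list)
    for i, shard in enumerate(assigned_shards):
        mapping[i % num_processes].append(shard)
    return dict(mapping)


two_workers = assign_shards([0, 1, 2, 3], max_processes=2)
capped_by_shards = assign_shards([0, 1], max_processes=5)
```

Raising at construction time is only one option; an early return with a logged warning would serve the same purpose if an empty partition set is a legitimate state.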


81-133: Process/thread creation helpers are correct; _create_process_for_shard appears unused

The split into _create_processes, _create_process_for_shards, and _create_process_for_shard makes the lifecycle clearer, and the way buffer/shards are bound into the target callable via run_with_initialized_sentry vs functools.partial is correct for both process and thread modes.

Two small observations:

  • _create_process_for_shard is not referenced anywhere in this file; unless it’s used via indirection elsewhere, it looks like dead code. Removing it (or adding a call site) would reduce confusion.
  • The process_healthy_since reset before (re)starting a worker is a sensible race-avoidance tactic and matches the health checking below.

134-200: Shard-aware tagging and per-process backpressure look correct; align metric tag keys

The updated main signature and usage (buffer, shards, shared Values) is consistent, and the shard tag construction plus:

  • per-process backpressure (backpressure_since) and
  • per-process health (healthy_since)

all line up with the new dictionaries on the parent side.

Minor metrics nits:

  • You use tags={"shard": shard_tag} for "spans.buffer.flusher.produce" and "spans.buffer.segment_size_bytes", but tags={"shards": shard_tag} for "spans.buffer.flusher.wait_produce". For downstream metrics consumers, it will likely be simpler if the tag key is consistent across all these timers (either all shard or all shards).
  • shard_tag itself is a comma-separated list of shards, which is fine, but the mix of singular/plural keys can be confusing when this value represents multiple shards.

These are cosmetic, but tightening them now will avoid metrics churn later.
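
One way to keep the key consistent is to build the tag dict once per worker and reuse it for every metric call. The sketch below assumes the plural key "shards" and a hypothetical helper name shard_tags; either key works as long as it is the same everywhere.

```python
from __future__ import annotations


def shard_tags(shards: list[int]) -> dict[str, str]:
    """Build a single tag dict to pass to every flusher metric call.

    Hypothetical helper; the key name is a choice, not the PR's convention.
    """
    return {"shards": ",".join(str(shard) for shard in shards)}


tags = shard_tags([0, 3, 7])
```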


267-315: submit/backpressure/memory logic matches new multi-buffer design

The changes in submit are aligned with the per-process structure:

  • _ensure_processes_alive() is called before any work, which is a reasonable place for health checks.
  • buffer.record_stored_segments() and buffer.get_memory_info() are invoked across self.buffers.values(), covering all shard groups.
  • Backpressure is now computed across self.process_backpressure_since.values(), which correctly generalizes the previous single-Value approach.

One subtle point on memory accounting:

  • If SpansBuffer.get_memory_info() returns per-service (ServiceMemory) entries that are identical across buffers (e.g., each buffer returns overall Redis memory for the same service), summing used and available over all buffers will scale both numerator and denominator by the same factor, leaving the used / available ratio unchanged. That’s logically safe but does redundant work.
  • If get_memory_info() is truly per-shard or per-service, the aggregate sum is also fine.

This is more about efficiency than correctness; if this path is hot and get_memory_info() is expensive, you might later want to deduplicate calls per underlying service/cluster.
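
A small worked example of the first case, assuming every buffer reports the same service-wide numbers. ServiceMemory here is a stand-in with assumed used/available fields, and usage_ratio is a hypothetical helper, not the PR's code.

```python
from typing import NamedTuple


class ServiceMemory(NamedTuple):
    # Stand-in type; the real ServiceMemory may carry more fields.
    used: int
    available: int


def usage_ratio(infos: list[ServiceMemory]) -> float:
    """Sum used and available across all entries, then take the ratio."""
    used = sum(info.used for info in infos)
    available = sum(info.available for info in infos)
    return used / available if available else 0.0


single_buffer = [ServiceMemory(used=300, available=1000)]
# Four buffers that each report the same service-wide numbers.
four_buffers = single_buffer * 4

ratio_single = usage_ratio(single_buffer)  # 300 / 1000
ratio_four = usage_ratio(four_buffers)     # 1200 / 4000
```

Both ratios come out identical, so the backpressure decision is unaffected; only the number of get_memory_info() calls grows with the buffer count.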


328-347: join implementation works but can be simplified and made more lint‑friendly

The new join correctly:

  • sets stopped before joining next_step, and
  • then waits for each background worker to finish within the optional timeout.

A couple of minor cleanups:

  • You don’t use process_index in for process_index, process in self.processes.items():, which is what Ruff is flagging. You can iterate values directly:

    for process in self.processes.values():
        ...
  • deadline = time.time() + timeout if timeout else None treats timeout=0 as “no deadline”; if consumers rely on timeout=0 meaning “don’t wait”, they won’t get that behavior. If this is a public API and 0 is a realistic input, consider distinguishing None from 0 explicitly.

These are small polish items; behavior as written is otherwise consistent.
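
The timeout distinction can be sketched as follows. Both function names are hypothetical, and the current time is passed in as now so the example stays deterministic.

```python
from __future__ import annotations


def truthy_deadline(timeout: float | None, now: float) -> float | None:
    # Mirrors the pattern in join: timeout=0 is falsy, so no deadline is set.
    return now + timeout if timeout else None


def explicit_deadline(timeout: float | None, now: float) -> float | None:
    # Only None means "wait forever"; 0 yields a deadline that is already due.
    return now + timeout if timeout is not None else None


zero_truthy = truthy_deadline(0, now=100.0)
zero_explicit = explicit_deadline(0, now=100.0)
```

With the explicit check, timeout=0 produces a deadline equal to the current time, so a join loop that compares against the deadline returns without waiting.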

tests/sentry/spans/consumers/process/test_consumer.py (2)

12-63: Basic consumer test adjustments: behavior OK, but sleep comment is misleading

  • Marking test_basic as @pytest.mark.django_db(transaction=True) is consistent with the use of real DB-backed components in this path.
  • The extra step.poll() after updating current_drift is reasonable to give the pipeline another chance to drive the flusher.

However:

# Give flusher threads time to process after drift change
time.sleep(0.1)

is a no-op because you previously did:

monkeypatch.setattr("time.sleep", lambda _: None)

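So the comment doesn’t match reality, and the test is not actually sleeping here. If the test does not need real time to elapse, you can simply drop the comment and the time.sleep call. If you do need actual sleeping, narrow the patch so it no longer replaces the global time.sleep. Note that targeting sentry.spans.consumers.process.flusher.time.sleep is not enough on its own, because flusher.time is the same module object as time; rebinding the time name inside the module under test, or saving the real time.sleep before patching and calling that, keeps the patch scoped.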
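
The scope of a time.sleep patch can be checked with a small standalone sketch. The worker module below is a hypothetical stand-in for a module under test that did `import time`, and unittest.mock is used in place of pytest's monkeypatch so the snippet runs on its own.

```python
import time
import types
from unittest import mock

# Hypothetical stand-in for a module under test that ran `import time`.
worker = types.ModuleType("worker")
worker.time = time

real_sleep = time.sleep

# Patching `worker.time.sleep` resolves to the shared `time` module, so the
# replacement is visible everywhere, not just inside `worker`.
with mock.patch.object(worker.time, "sleep", lambda _: None):
    patched_globally = time.sleep is not real_sleep

# Rebinding the module's own `time` name leaves the real module untouched.
fake_time = types.SimpleNamespace(sleep=lambda _: None, time=time.time)
with mock.patch.object(worker, "time", fake_time):
    global_sleep_intact = time.sleep is real_sleep
    worker_sees_fake = worker.time.sleep is not real_sleep
```

In a pytest test the second form would correspond to monkeypatch.setattr(module_under_test, "time", fake_time), which is undone automatically at teardown.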


84-123: Rename force to _force to suppress Ruff ARG001 lint warning in test_flusher_processes_limit

The test correctly verifies flusher process limits and shard distribution. Renaming the unused force parameter to _force in the add_commit function (line 106) is a valid fix: the parameter is never used in the function body, and the underscore prefix is the standard Python convention for intentionally unused parameters. The rename suppresses the lint warning and stays compatible with callers that pass the argument positionally; a caller passing force=... by keyword would need updating.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2c41b0 and 9bf83b0.

📒 Files selected for processing (6)
  • CLAUDE.md (1 hunks)
  • src/sentry/consumers/__init__.py (1 hunks)
  • src/sentry/spans/consumers/process/factory.py (3 hunks)
  • src/sentry/spans/consumers/process/flusher.py (8 hunks)
  • tests/sentry/spans/consumers/process/test_consumer.py (4 hunks)
  • tests/sentry/spans/consumers/process/test_flusher.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.5)
src/sentry/spans/consumers/process/flusher.py

248-250: Avoid specifying long messages outside the exception class

(TRY003)


337-337: Loop control variable process_index not used within loop body

Rename unused process_index to _process_index

(B007)

tests/sentry/spans/consumers/process/test_consumer.py

106-106: Unused function argument: force

(ARG001)

🔇 Additional comments (4)
CLAUDE.md (1)

448-457: Type-checking example is correct and reinforces best practice

The new hasattr vs isinstance example is accurate and aligns with how unions should be handled; good addition to the anti-patterns section.

tests/sentry/spans/consumers/process/test_flusher.py (1)

78-83: Backpressure assertion correctly updated for per-process tracking

Switching the assertion to:

assert any(x.value for x in flusher.process_backpressure_since.values())

matches the new process_backpressure_since: dict[int, Value] shape and still validates that at least one flusher worker has entered backpressure. This is the right adaptation of the test.

src/sentry/spans/consumers/process/factory.py (1)

34-76: Factory wiring of flusher_processes into SpanFlusher looks correct

  • The new flusher_processes: int | None = None parameter is stored as self.flusher_processes.
  • In create_with_partitions, you pass max_processes=self.flusher_processes into SpanFlusher, preserving prior behavior when it is None.
  • Because SpanFlusher internally bounds num_processes by len(buffer.assigned_shards), you won’t oversubscribe workers relative to shards even if a higher flusher_processes is configured.

This is a clean, backwards-compatible extension of the factory.

src/sentry/consumers/__init__.py (1)

430-438: All parameter wiring verified and correct.

The verification confirms that the --flusher-processes parameter is properly implemented:

  • ProcessSpansStrategyFactory correctly accepts flusher_processes: int | None = None (factory.py, line 41)
  • The parameter is properly passed through to SpanFlusher instantiation via max_processes=self.flusher_processes (factory.py, line 74)
  • The implementation is complete and correctly wired from CLI option to flusher

The original review's suggestions for validation and clarified help text remain as optional enhancements—the code functions correctly as-is.

Comment on lines +218 to 260
     def _ensure_processes_alive(self) -> None:
         max_unhealthy_seconds = options.get("spans.buffer.flusher.max-unhealthy-seconds")
-        if not self.process.is_alive():
-            exitcode = getattr(self.process, "exitcode", "unknown")
-            cause = f"no_process_{exitcode}"
-        elif int(time.time()) - self.healthy_since.value > max_unhealthy_seconds:
-            cause = "hang"
-        else:
-            return  # healthy
-
-        metrics.incr("spans.buffer.flusher_unhealthy", tags={"cause": cause})
-        if self.process_restarts > MAX_PROCESS_RESTARTS:
-            raise RuntimeError(f"flusher process crashed repeatedly ({cause}), restarting consumer")
+        for process_index, process in self.processes.items():
+            if not process:
+                continue
+
+            shards = self.process_to_shards_map[process_index]
+
+            cause = None
+            if not process.is_alive():
+                exitcode = getattr(process, "exitcode", "unknown")
+                cause = f"no_process_{exitcode}"
+            elif (
+                int(time.time()) - self.process_healthy_since[process_index].value
+                > max_unhealthy_seconds
+            ):
+                # Check if any shard handled by this process is unhealthy
+                cause = "hang"
+
+            if cause is None:
+                continue  # healthy
+
+            # Report unhealthy for all shards handled by this process
+            for shard in shards:
+                metrics.incr(
+                    "spans.buffer.flusher_unhealthy", tags={"cause": cause, "shard": shard}
+                )

-        try:
-            self.process.kill()
-        except ValueError:
-            pass  # Process already closed, ignore
+            if self.process_restarts[process_index] > MAX_PROCESS_RESTARTS:
+                raise RuntimeError(
+                    f"flusher process for shards {shards} crashed repeatedly ({cause}), restarting consumer"
+                )
+            self.process_restarts[process_index] += 1

-        self.process_restarts += 1
-        self._create_process()
+            try:
+                if isinstance(process, multiprocessing.Process):
+                    process.kill()
+            except (ValueError, AttributeError):
+                pass  # Process already closed, ignore
+
+            self._create_process_for_shards(process_index, shards)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Health monitoring and restart logic: watch process type check and thread-mode restarts

The per-process health logic is a good upgrade, but a couple of details are worth revisiting:

  1. Process type check before kill()

    if isinstance(process, multiprocessing.Process):
        process.kill()

    Here process instances are created via self.mp_context.Process(...), whose type is multiprocessing.context.SpawnProcess. Relying on isinstance(process, multiprocessing.Process) may or may not match depending on the exact class hierarchy and could silently skip the kill step in some environments.

    To make this robust and independent of the concrete multiprocessing implementation, consider checking capabilities instead of type, e.g.:

    if hasattr(process, "kill"):
        process.kill()

    while explicitly excluding threads if you want:

    import threading
    
    if not isinstance(process, threading.Thread) and hasattr(process, "kill"):
        process.kill()
  2. Thread mode (produce_to_pipe) restart behavior

    In thread mode, process is a threading.Thread, so the kill() branch is skipped. If a thread is considered “hung” (health exceeds threshold), _create_process_for_shards will start an additional thread for the same shard set without stopping the old one. That’s acceptable in tests but could lead to multiple flusher threads per shard if produce_to_pipe is ever used outside unit tests.

    If you expect thread mode beyond tests, it would be safer to special-case threads (e.g., don’t auto-restart on hang, or track/join and only create one replacement).

Given the current docstring says produce_to_pipe is for unit testing, treating these as future-proofing rather than critical bugs seems reasonable.
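
One possible shape for special-casing threads is sketched below. The helper name stop_worker and its "safe to replace" return value are assumptions for illustration, not part of the PR; the idea is that processes are killed before being replaced, while a thread is only replaced once it has already exited.

```python
from __future__ import annotations

import threading
from multiprocessing.process import BaseProcess


def stop_worker(worker: BaseProcess | threading.Thread) -> bool:
    """Try to stop a worker; return True when starting a replacement is safe.

    Hypothetical helper, not the PR's code.
    """
    if isinstance(worker, BaseProcess):
        try:
            worker.kill()
        except (ValueError, AttributeError):
            pass  # Process object already closed, or never started
        return True
    # Threads cannot be killed; only replace one that has already exited.
    return not worker.is_alive()


# A thread that blocks until released stands in for a "hung" flusher thread.
release = threading.Event()
hung_thread = threading.Thread(target=release.wait, daemon=True)
hung_thread.start()

replace_while_running = stop_worker(hung_thread)

release.set()
hung_thread.join()
replace_after_exit = stop_worker(hung_thread)
```

With this shape, the restart path would skip _create_process_for_shards whenever the helper returns False, which avoids two flusher threads running for the same shard set.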


🏁 Script executed:

#!/bin/bash
# Check actual types used for SpanFlusher processes to validate the isinstance check.
rg -n "class SpanFlusher" -C3
rg -n "mp_context.Process" -C3
python - << 'PY'
import multiprocessing as mp, inspect
ctx = mp.get_context("spawn")
p = ctx.Process(target=print, args=("x",))
print("Process type:", type(p))
print("Is instance of multiprocessing.Process:", isinstance(p, mp.Process))
PY

Length of output: 1908


🏁 Script executed:

#!/bin/bash
# Find _create_process_for_shards implementation and process creation/tracking logic
rg -n "_create_process_for_shards" -A20 src/sentry/spans/consumers/process/flusher.py

# Check how processes are stored and if there's any cleanup/replacement logic
rg -n "self.processes\[" -C2 src/sentry/spans/consumers/process/flusher.py

# Look for thread-specific handling or guards
rg -n "threading.Thread" -C3 src/sentry/spans/consumers/process/flusher.py

Length of output: 4162


Fix broken isinstance check for process type and implement proper thread cleanup

Your concerns are confirmed by the code. Both issues are real:

  1. Line 249: isinstance(process, multiprocessing.Process) fails for SpawnProcess

    The Python test confirms that multiprocessing.get_context("spawn").Process(...) returns a SpawnProcess, which is not an instance of multiprocessing.Process (both derive from multiprocessing.process.BaseProcess). This means process.kill() at line 253 is never executed, so a hung worker process is not terminated and keeps running alongside its replacement.

    Fix: Use hasattr(process, "kill") instead:

    if hasattr(process, "kill"):
        process.kill()
  2. Line 124 assigns new thread without stopping old one

    In thread mode, _create_process_for_shards creates a new thread and overwrites the dict entry at line 124 without stopping the old thread. Both threads continue running for the same shard set, creating duplicates.

    Fix: For thread mode, either (a) join the old thread before creating a new one, or (b) prevent auto-restart on hang for threads, or (c) explicitly exclude threads from kill logic and don't auto-restart them.
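
The type relationship behind the first issue can be checked in isolation. The sketch below reflects CPython's class layout; the BaseProcess check is shown as an alternative to hasattr that is not proposed in the review itself, and it also distinguishes processes from threads.

```python
import multiprocessing
import threading
from multiprocessing.process import BaseProcess

ctx = multiprocessing.get_context("spawn")

# Constructed but never started, so no child process is spawned here.
worker = ctx.Process(target=print, args=("x",))
thread = threading.Thread(target=print, args=("x",))

# SpawnProcess and multiprocessing.Process are siblings under BaseProcess.
matches_mp_process = isinstance(worker, multiprocessing.Process)
matches_base_process = isinstance(worker, BaseProcess)
thread_matches_base = isinstance(thread, BaseProcess)
```

Here the check against multiprocessing.Process is False while the check against BaseProcess is True, and the thread matches neither, so a BaseProcess guard covers context-created processes without touching threads.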

🧰 Tools
🪛 Ruff (0.14.5)

248-250: Avoid specifying long messages outside the exception class

(TRY003)
