Span Buffer Multiprocess Enhancement with Health Monitoring by ShashankFC · Pull Request #5 · AI-Code-Review-Evals/codeant-sentry

ShashankFC · 2026-02-13T05:04:29Z

User description

Test 6

Summary by CodeRabbit

New Features
- Added configurable process limit for span flusher, enabling better control over concurrent processing of span data across multiple shards.
Documentation
- Added code examples demonstrating type checking best practices for union types.
Tests
- Added test coverage for span flusher process limiting functionality.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Replicated from ai-code-review-evaluation/sentry-coderabbit#6

CodeAnt-AI Description

Run span flusher with configurable per-shard multiprocessing and per-process health/backpressure

What Changed

Span flusher now runs multiple worker processes/threads and assigns shards across a configurable maximum number of flusher processes (CLI flag and factory option to set max processes).
Health, backpressure, restarts, and memory checks are tracked and enforced per flusher process; unhealthy processes are restarted individually and reported with shard-specific metrics.
Memory/backpressure decisions aggregate across all shard-specific buffers so Redis-based backpressure and hard backpressure events consider all assigned shards; join/shutdown waits for all flusher processes to finish.
Kafka production metrics include shard tags, and unit tests added/updated to verify process-limiting behavior and per-process backpressure reporting.

Impact

✅ Controlled flusher parallelism across shards
✅ Fewer consumer restarts caused by a single flusher crash
✅ Clearer flusher health and backpressure metrics per shard

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

ShashankFC · 2026-02-13T05:45:02Z

@CodeAnt-AI: review

codeant-ai · 2026-02-13T05:45:05Z

CodeAnt AI is running the review.

codeant-ai · 2026-02-13T05:46:06Z

Sequence Diagram

Shows how the consumer/factory configures a multiprocess SpanFlusher, how worker processes flush span segments to Kafka, and how the manager monitors health and enforces backpressure/restarts when needed.

sequenceDiagram
    participant Consumer
    participant Factory
    participant SpanFlusher
    participant FlusherProcess as WorkerProcess
    participant Kafka

    Consumer->>Factory: create_with_partitions(commit, partitions, flusher_processes=N)
    Factory->>SpanFlusher: instantiate manager (max_processes=N) and start N worker processes
    SpanFlusher->>WorkerProcess: worker polls assigned shard buffer and flushes segments
    WorkerProcess->>Kafka: produce flushed segment payloads
    WorkerProcess-->>SpanFlusher: update healthy/backpressure flags
    SpanFlusher->>Consumer: on unhealthy/backpressure -> raise MessageRejected (pause consumer) and restart worker process

Generated by CodeAnt AI

codeant-ai · 2026-02-13T05:53:01Z

                        produce(kafka_payload)

-                with metrics.timer("spans.buffer.flusher.wait_produce"):
+                with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}):


Suggestion: The metrics for flusher wait time use the tag key "shards" while the corresponding produce timer uses "shard", so dashboards and alerts that expect a consistent tag name per shard will not be able to correlate produce and wait metrics correctly; the tag key should be made consistent (e.g. always "shard"). [logic error]

Severity Level: Major ⚠️

- ⚠️ Harder correlating wait and produce metrics per shard. - ⚠️ Dashboards grouping by shard miss wait_produce timer.

Suggested change

with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}):

with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shard": shard_tag}):

Steps of Reproduction ✅

1. Start the spans consumer using `ProcessSpansStrategyFactory.create_with_partitions` in `src/sentry/spans/consumers/process/factory.py:58-76`, which constructs a `SpanFlusher` that manages shard-specific buffers and child processes. 2. Allow spans to be ingested so that `SpanFlusher.main` (child process entrypoint at `src/sentry/spans/consumers/process/flusher.py:134-213`) begins flushing segments. For each non-empty `flushed_segments` batch, it records a produce timer at `flusher.py:185`: `with metrics.timer("spans.buffer.flusher.produce", tags={"shard": shard_tag}): ...`. 3. Immediately after producing Kafka messages, it records a wait timer at `flusher.py:199`: `with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}): ...`. Note the different tag keys: `"shard"` for the produce timer vs `"shards"` for the wait timer. 4. In the metrics backend used by `sentry.utils.metrics` (invoked in this same module), metric series are distinguished by tag keys. As a result, when an operator groups `spans.buffer.flusher.*` metrics by `shard`, the produce timer (`spans.buffer.flusher.produce` with `{"shard": shard_tag}`) will be grouped correctly per shard, but the wait timer (`spans.buffer.flusher.wait_produce` with `{"shards": shard_tag}`) will not appear under the same `shard` dimension. This makes per-shard correlation between produce time and wait time difficult or impossible without custom handling, demonstrating the impact of the inconsistent tag key.

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** src/sentry/spans/consumers/process/flusher.py **Line:** 199:199 **Comment:** *Logic Error: The metrics for flusher wait time use the tag key `"shards"` while the corresponding produce timer uses `"shard"`, so dashboards and alerts that expect a consistent tag name per shard will not be able to correlate produce and wait metrics correctly; the tag key should be made consistent (e.g. always `"shard"`). Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

👍 | 👎

codeant-ai · 2026-02-13T05:53:01Z


+    step.poll()
+    # Give flusher threads time to process after drift change
+    time.sleep(0.1)


Suggestion: The call to time.sleep(0.1) is rendered a no-op by the earlier monkeypatch.setattr("time.sleep", lambda _: None), so there is no guaranteed delay between advancing the flusher's drift and calling step.join(), which can cause a race where the flusher thread has not yet flushed any spans and the subsequent unpacking of messages fails intermittently. Replacing this with a wait mechanism that does not rely on the patched time.sleep ensures the background flusher has a real opportunity to run before join() shuts it down. [race condition]

Severity Level: Major ⚠️

- ❌ `test_basic` for span consumer can fail intermittently. - ⚠️ CI reliability reduced due to nondeterministic flakiness. - ⚠️ Flusher timing behavior insufficiently validated under concurrent conditions.

Suggested change

time.sleep(0.1)

start = time.time()

while time.time() - start < 0.1:

pass

Steps of Reproduction ✅

1. Run the test directly with pytest: `pytest tests/sentry/spans/consumers/process/test_consumer.py::test_basic` (entry point in `tests/sentry/spans/consumers/process/test_consumer.py:12`). 2. In `test_basic` (`test_consumer.py:13-16`), `monkeypatch.setattr("time.sleep", lambda _: None)` globally replaces `time.sleep` with a no-op, affecting both the test and `SpanFlusher` implementation (`src/sentry/spans/consumers/process/flusher.py:1-214`). 3. The test submits a span and calls `step.poll()` then manually advances the flusher clock via `fac._flusher.current_drift.value = 9000` (`test_consumer.py:57-58`), followed by another `step.poll()` and the comment-intended delay `time.sleep(0.1)` at `test_consumer.py:60-62`. Because of the monkeypatch, that `time.sleep(0.1)` does not actually wait. 4. Immediately after the no-op sleep, `step.join()` is invoked (`test_consumer.py:64`), which ultimately calls `SpanFlusher.join()` (`flusher.py:328-345`), setting `self.stopped.value = True` before waiting for background flusher threads to stop. If, at this point, a flusher thread is in the 1-second sleep path of its main loop (`time.sleep(1)` at `flusher.py:181-183`, also monkeypatched to no-op) and has not yet executed `buffer.flush_segments(now=now)` with the updated `current_drift`, it will exit without ever producing to `messages.append`. The subsequent destructuring `(msg,) = messages` at `test_consumer.py:66` then raises `ValueError: not enough values to unpack (expected 1, got 0)`, making `test_basic` flaky depending on timing between the drift update and `step.join()`.

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** tests/sentry/spans/consumers/process/test_consumer.py **Line:** 62:62 **Comment:** *Race Condition: The call to `time.sleep(0.1)` is rendered a no-op by the earlier `monkeypatch.setattr("time.sleep", lambda _: None)`, so there is no guaranteed delay between advancing the flusher's drift and calling `step.join()`, which can cause a race where the flusher thread has not yet flushed any spans and the subsequent unpacking of `messages` fails intermittently. Replacing this with a wait mechanism that does not rely on the patched `time.sleep` ensures the background flusher has a real opportunity to run before `join()` shuts it down. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

👍 | 👎

codeant-ai · 2026-02-13T05:53:05Z

CodeAnt AI finished running the review.

ref(span-buffer): Introduce multiprocessed flusher (#93824)

9bf83b0

codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label Feb 13, 2026

codeant-ai Bot reviewed Feb 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Span Buffer Multiprocess Enhancement with Health Monitoring#5

Span Buffer Multiprocess Enhancement with Health Monitoring#5
ShashankFC wants to merge 1 commit into
span-flusher-stablefrom
span-flusher-multiprocess

ShashankFC commented Feb 13, 2026 •

edited by codeant-ai Bot

Loading

Uh oh!

ShashankFC commented Feb 13, 2026

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Uh oh!

codeant-ai Bot Feb 13, 2026

Uh oh!

codeant-ai Bot Feb 13, 2026

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}):
	with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shard": shard_tag}):

-    time.sleep(0.1)
+    start = time.time()
+    while time.time() - start < 0.1:
+        pass

Conversation

ShashankFC commented Feb 13, 2026 • edited by codeant-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Summary by CodeRabbit

CodeAnt-AI Description

What Changed

Impact

Checking Your Pull Request

Talking to CodeAnt AI

Example

Preserve Org Learnings with CodeAnt

Example

Retrigger review

Check Your Repository Health

Uh oh!

ShashankFC commented Feb 13, 2026

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Sequence Diagram

Uh oh!

codeant-ai Bot Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ShashankFC commented Feb 13, 2026 •

edited by codeant-ai Bot

Loading