Skip to content

Span Buffer Multiprocess Enhancement with Health Monitoring#12

Open
zaibkhan wants to merge 1 commit into
span-flusher-stablefrom
span-flusher-multiprocess
Open

Span Buffer Multiprocess Enhancement with Health Monitoring#12
zaibkhan wants to merge 1 commit into
span-flusher-stablefrom
span-flusher-multiprocess

Conversation

@zaibkhan

@zaibkhan zaibkhan commented Sep 9, 2025

Copy link
Copy Markdown

Mirrors ai-code-review-evaluation#6 for like-for-like benchmarking.

  • Base: span-flusher-stable
  • Head: span-flusher-multiprocess

Original PR excerpt:

Test 6

@codoki-pr-intelligence

codoki-pr-intelligence Bot commented Sep 9, 2025

Copy link
Copy Markdown

Codoki PR Review

Summary: Fix futures leak, standardize metric tags
What’s good: Solid multiprocess refactor: shard-to-process mapping, per-process health/backpressure, and memory checks are introduced without altering the external consumer interface.
Review Status: ❌ Requires changes
Overall Priority: High

Issues (Critical & High only)

Severity Issue Why it matters
High Performance — Unbounded producer_futures growth causes memory leak …/process/flusher.py
Two issues in this block: (1) producer_futures is never cleared after awaiting, causing unbounded growth and repeated waits across iterations (worse now with multiple processes); (2) the metric tag key uses 'shards' here but 'shard' elsewhere, which fragments observability. Clearing the list after awaiting and standardizing the tag key avoids memory bloat and keeps metrics coherent.

Showing top 1 issues. Critical: 0, High: 1. See inline suggestions for more.

Key Feedback (click to expand)
  • Needs improvement: Clear produced futures after awaiting to prevent unbounded growth and repeated waits, and standardize metric tag keys ('shard' vs 'shards') for consistency in dashboards.
  • Performance: The producer_futures list is never cleared, causing unbounded memory growth and redundant waits across iterations; the multiprocess design amplifies this with one list per process.
  • Open questions: Is 'shards' as a metric tag key intentional, or should all timers use 'shard' to match other metrics? Also, can we safely rely on isinstance(process, multiprocessing.Process) with spawn context, or should we switch to capability checks (hasattr)?

Confidence: 3/5 — Needs work before merge (1 high · status: Requires changes)

React with 👍 or 👎 if you found this review useful.

"spans.buffer.segment_size_bytes",
len(kafka_payload.value),
tags={"shard": shard_tag},
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ High: Two issues in this block: (1) producer_futures is never cleared after awaiting, causing unbounded growth and repeated waits across iterations (worse now with multiple processes); (2) the metric tag key uses 'shards' here but 'shard' elsewhere, which fragments observability. Clearing the list after awaiting and standardizing the tag key avoids memory bloat and keeps metrics coherent.

Suggested change
)
with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shard": shard_tag}):
for future in producer_futures:
future.result()
producer_futures.clear()

raise RuntimeError(
f"flusher process for shards {shards} crashed repeatedly ({cause}), restarting consumer"
)
self.process_restarts[process_index] += 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔷 Medium: Using isinstance(process, multiprocessing.Process) may miss processes created via a specific start method (spawn) depending on the context class, leading to not killing dead processes and potential duplicate restarts; checking for capability (kill) is more robust and also naturally excludes threads.

Suggested change
self.process_restarts[process_index] += 1
if hasattr(process, "kill"):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants