-
Notifications
You must be signed in to change notification settings - Fork 3
HCD-579: Fix hang on group batches #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
HCD-579: Fix hang on group batches #20
Conversation
WalkthroughAdds README documentation for group usage and flushing semantics; introduces Changes
Sequence Diagram(s)sequenceDiagram
participant Handler as Handler/Task
participant Bus
participant GroupReg as GroupRegistry
participant Receiver
Handler->>Bus: flush_and_sync_group(group_id, force=false)
activate Bus
Bus->>Bus: log(start)
Note over Bus: Phase 1 — Non-blocking trigger
Bus->>Receiver: flush_nowait() for each receiver in group
Receiver-->>Bus: returns immediately
Note over Bus: Phase 2 — Blocking stabilization loop (≤32 iters)
loop up to 32
Bus->>Receiver: flush() for each receiver in group
Receiver-->>Bus: completes when processing done
Bus->>GroupReg: get_group_count(group_id)
GroupReg-->>Bus: current_count
alt current_count == 0
Bus->>Bus: stabilized -> break
else current_count > 0
Bus->>Bus: continue
end
end
Note over Bus: Final partial-batch flush and wait
Bus->>Receiver: partial flush for remaining batches
Bus->>GroupReg: wait_idle(group_id)
GroupReg-->>Bus: notify idle
Bus->>Bus: synchronized group operations complete
Bus-->>Handler: return
deactivate Bus
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: HCD-579 - Fix hang on group batchesThank you for this comprehensive fix! This PR addresses critical hanging issues with group batching and adds excellent test coverage. Here's my detailed review: ✅ Strengths1. Excellent Problem AnalysisThe PR correctly identifies and fixes two distinct hanging scenarios:
2. Comprehensive Test CoverageThe addition of 907 lines of tests is outstanding. Key tests added:
These cover edge cases thoroughly. 3. Documentation ImprovementsThe README updates are very helpful, especially:
4. Defensive ProgrammingThe 🔍 Issues & Suggestions1. Magic Number: Fuse Count (src/lib.rs:897)let fuse_count = 32i32;Issue: The value 32 appears arbitrary without justification. Suggestion:
Severity: Minor - works fine but lacks clarity 2. Potential Infinite Loop Risk (src/lib.rs:900-954)loop {
iters += 1;
if iters > fuse_count {
log::warn\!(...);
break; // Only logs warning, doesn't return error
}
// ... flush logic
}Issue: If the fuse triggers, execution continues to the final flush and sync. This could:
Suggestions:
Severity: Moderate - edge case handling could be more robust 3. Logging Verbosity (Throughout)Multiple new Observations:
Suggestion: Review log levels - Severity: Minor - tuning issue 4. Memory Overhead: Duplicate Receiver Iteration (src/lib.rs:911-967)let receiver_ids = self.inner.group_registry.receivers_for_group(group_id);
// First pass
for r in self.inner.receivers.iter() {
if receiver_ids.contains(&r.id()) { ... }
}
// Second pass
for r in self.inner.receivers.iter() {
if receiver_ids.contains(&r.id()) { ... }
}
// ... then again at line 962-967Issue: Iterating all receivers and filtering multiple times. Suggestion: Pre-filter receivers once: let relevant_receivers: Vec<_> = self.inner.receivers.iter()
.filter(|r| receiver_ids.contains(&r.id()))
.collect();
for r in &relevant_receivers {
if \!r.is_idling() {
r.flush_nowait(self);
}
}Severity: Low - likely not a bottleneck unless receiver count is very high 5. Race Condition Consideration (src/lib.rs:922)if receiver_ids.contains(&r.id()) && \!r.is_idling() {
r.flush_nowait(self);
}Question: Is there a TOCTOU (time-of-check-time-of-use) race between checking
Severity: Info - likely not an issue but worth confirming 6. Error Handling in
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
27-145: Update documentation: Batch handlers DO propagate group contextThe claim on line 111 that "Batch handlers don't propagate group context to child messages" is incorrect. The code shows that all handler types—batch and non-batch—use
Bus::with_group_context()to propagate group_id to child messages.Batch handler implementations in
src/receivers/buffer_unordered_batched/execution.rsandsrc/receivers/synchronize_batched/execution.rsexplicitly callBus::with_group_context(gid, ...)before executing the handler, identical to non-batch handlers.Update the batch handler documentation section to reflect that batch handlers do propagate group context, similar to other handler types, though messages are still batched per group_id.
🤖 Fix all issues with AI agents
In `@src/lib.rs`:
- Around line 985-1047: The README text contradicts the implementation of
flush_current_group(): update the docs so they state that flush_current_group()
triggers non-blocking flushes (it calls Receiver::flush_nowait()) and does not
itself wait for child handlers to finish, and clarify that final
synchronization/waiting is performed by flush_and_sync_group()/the outer sync
mechanism which observes group_registry.processing_count() to detect completion;
reference flush_current_group(), flush_nowait(), flush_and_sync_group(), and
group_registry.processing_count() in the updated wording to make the distinction
explicit.
In `@tests/test_groups.rs`:
- Around line 3226-3358: The test spawns OS threads and calls
tokio::runtime::Handle::current() inside them (in
BatchWithCascadeHandler::handle in
test_flush_and_sync_group_batched_with_cascading), which can panic or drop
sends; instead capture a runtime handle in the test thread (let rt =
Handle::current()) and pass/clone that into the batch handler so
BatchWithCascadeHandler::handle uses rt.spawn(async move {
bus.send(...).await.unwrap(); }) rather than std::thread::spawn and
Handle::current() inside the thread; also ensure spawned tasks are awaited (or
tracked) so flush_and_sync_group waits for child processing and add an assertion
checking child_processed.load(...) to make the test deterministic.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
README.mdexamples/demo_groups.rssrc/group.rssrc/lib.rssrc/receiver.rstests/test_groups.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Use appropriate SendOptions (Broadcast, Direct(id), Except(id), Random, or Balanced) when sending messages through the bus to control routing strategy
Files:
examples/demo_groups.rssrc/receiver.rssrc/group.rstests/test_groups.rssrc/lib.rs
🧠 Learnings (4)
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Use `#[derive(Message)]` with appropriate attributes for types sent through the bus, such as `#[message(clone)]` for broadcast-enabled messages, `#[message(shared)]` for remote transport serialization, and `#[type_tag("custom::name")]` or `#[namespace("my_namespace")]` for custom type tags
Applied to files:
examples/demo_groups.rsREADME.md
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Choose the appropriate receiver type (BufferUnorderedAsync/Sync for concurrent processing, BufferUnorderedBatchedAsync/Sync for batched concurrent processing, SynchronizedAsync/Sync for sequential processing, or SynchronizedBatchedAsync/Sync for batched sequential processing) based on whether message processing should be concurrent or sequential
Applied to files:
src/receiver.rstests/test_groups.rssrc/lib.rsREADME.md
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Implement one of the 12 handler traits (Handler, AsyncHandler, BatchHandler, AsyncBatchHandler, SynchronizedHandler, AsyncSynchronizedHandler, BatchSynchronizedHandler, AsyncBatchSynchronizedHandler, LocalHandler, LocalAsyncHandler, LocalBatchHandler, or LocalAsyncBatchHandler) based on the required concurrency model (thread-safe, synchronized, or local) and whether messages should be batched or processed individually
Applied to files:
tests/test_groups.rssrc/lib.rsREADME.md
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to **/*.rs : Use appropriate SendOptions (Broadcast, Direct(id), Except(id), Random, or Balanced) when sending messages through the bus to control routing strategy
Applied to files:
README.md
🧬 Code graph analysis (3)
src/receiver.rs (1)
tests/test_idle.rs (1)
bus(48-48)
src/group.rs (3)
src/envelop.rs (5)
group_id(180-180)group_id(308-310)group_id(455-457)group_id(522-524)group_id(591-593)src/error.rs (1)
group_id(422-424)src/receiver.rs (1)
drop(552-556)
src/lib.rs (3)
src/group.rs (1)
processing_count(261-267)src/receiver.rs (3)
id(109-109)id(418-420)id(789-791)src/relay.rs (1)
id(128-130)
🔇 Additional comments (14)
examples/demo_groups.rs (1)
6-6: LGTM!The documentation update accurately reflects the usage of
flush_groupfor waiting on group completion, aligning with the PR's enhancements to group-based flushing semantics.src/receiver.rs (1)
1100-1110: LGTM!The
flush_nowaitmethod is a well-designed addition that complements the existing blockingflushmethod. The non-blocking semantics are essential for the two-pass flush strategy inflush_and_sync_groupand for use within handlers viaflush_current_groupto prevent deadlocks. The warning-level log on failure is appropriate for a fire-and-forget operation.src/group.rs (3)
174-182: LGTM!The trace-level logging in
incrementprovides valuable observability for debugging group counter flow without impacting performance in production.
273-321: LGTM!The debug and trace logging throughout
wait_idleprovides excellent visibility into the idle-waiting flow. The refactor to storecountin a local variable (line 296) before the comparison is a minor improvement that makes the log message more accurate.
215-244: Good addition of observability logging.The trace-level logging for successful decrements is appropriate. The warning for unknown
group_idwill be triggered whenforce_remove_groupis called while handlers are still completing—this is expected behavior and actually helpful for catching unintended concurrent removal of groups. Each active handler'sGroupGuard::dropwill log the warning as it attempts to decrement a group that no longer exists in the registry, alerting developers to this race condition if it occurs unexpectedly.src/lib.rs (2)
843-853: LGTM - Key fix for partial batch hangs.The updated
flush_groupnow proactively flushes receivers belonging to the group before waiting for idle. This correctly addresses the hanging issue where partial batches (fewer messages thanbatch_size) would never be processed. The implementation:
- Retrieves receiver IDs for the group
- Flushes only those receivers (targeted flush)
- Then waits for the group to become idle
883-983: Well-designed two-pass flush strategy for cascading messages.The enhanced
flush_and_sync_groupimplementation addresses deadlocks with cascading messages effectively:
- First pass (non-blocking): Triggers all receivers to start processing without waiting, preventing upstream handlers from blocking on downstream buffers.
- Second pass (blocking): Waits for processing to complete.
- Stabilization loop: Continues until no receivers need flushing, with a 32-iteration fuse.
- Final partial-batch flush: Ensures any remaining buffered messages are processed.
The logging is well-placed for debugging production issues.
tests/test_groups.rs (7)
2831-2919: Good regression coverage for partial batches inflush_and_sync_group.Timeout guard + processed-count assertions make the hang regression reproducible.
2921-3002: Good coverage forflush_grouppartial-batch flush.Clear reproduction of the pre-fix hang scenario.
3004-3111: Solid cascading-message wait coverage.The test exercises parent/child completion under timeout, which is the core regression.
3113-3224: Deep-nesting wait coverage looks solid.The depth tracking makes the intent clear and verifiable.
3360-3445: Independent-group flush behavior covered well.Nice to see both groups flushed separately and verified.
3447-3490: Empty-group fast-path coverage looks good.The short timeout makes the “no messages” case explicit.
3492-3732: Comprehensive VRBP-pattern regression test.The sequence mirrors the real-world flow and should guard against future hangs.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Pull Request Review: Fix hang on group batchesThis PR addresses critical hanging issues in the group-based task tracking system, specifically with flush_group and flush_and_sync_group when used with batched handlers. Strengths
Issues and Concerns1. Magic Number - Fuse Count (Medium Priority)Location: src/lib.rs:897 The hardcoded iteration limit of 32 lacks justification. Recommend adding a named constant with documentation. 2. Performance - Repeated Receiver Iteration (Low-Medium)Location: src/lib.rs:900-954 The loop iterates over ALL receivers multiple times per iteration. Consider caching filtered receiver list. 3. Excessive Debug Logging (Low Priority)11+ new log statements could create noise. Move iteration details to trace level. 4. Inconsistent Error HandlingLocation: src/receiver.rs:1107-1109 flush_nowait error lacks context. Add receiver ID to warning message. 5. Race Condition (Needs Verification)Location: src/lib.rs:946-953 Add comment explaining why final flush guarantees correctness. VerdictApprove with Minor Revisions This PR successfully fixes critical hanging bugs with a well-reasoned approach. The concerns are mostly minor optimizations that don't block the core fix. Great work on the comprehensive tests and documentation! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib.rs (1)
879-975: Alignforceparameter semantics with actual behavior.The documentation states that
force"skips waiting for idle and proceeds directly to sync," but the implementation unconditionally executes the final flush loop andwait_idle(group_id).awaitregardless of theforceflag. Theforceparameter only skips the stabilization loop, not the subsequent wait operations. Either gate the final flush andwait_idlebehind!force, or update the documentation to clarify thatforceonly skips the stabilization loop.✏️ Doc update option
- /// * `force` - If true, skips waiting for idle and proceeds directly to sync + /// * `force` - If true, skips the stabilization loop; a final flush + wait_idle still occurs
🤖 Fix all issues with AI agents
In `@README.md`:
- Around line 105-111: Update the "Batch Handler Behavior" docs to state that
batch handlers do propagate group context: clarify that batch handlers set
task-local storage so bus.send inherits the batch's group_id for child messages
unless the child explicitly provides its own group_id, while still noting
messages are batched per group_id and each message's group counter is
decremented after processing; mention that batch handlers do not strip group
context when spawning children only if those children override group_id.
In `@tests/test_groups.rs`:
- Around line 3230-3363: Replace the final fixed sleep in
test_flush_and_sync_group_batched_with_cascading with a bounded polling loop
that checks child_processed.load(Ordering::SeqCst) until it equals 2 (the
expected count) and aborts with a timeout; specifically, after calling
bus.flush_and_sync_group(job_id, false) poll child_processed in a loop using a
short sleep (e.g., 5-20ms) between iterations and wrap the whole wait in
tokio::time::timeout (or abort after a max Duration) so the test fails
deterministically if children spawned by BatchWithCascadeHandler do not complete
in time instead of relying on tokio::time::sleep(100ms).
♻️ Duplicate comments (1)
src/lib.rs (1)
985-1013: Documentation misleadingly implies blocking behavior, but the implementation uses non-blocking flush.The doc comment and example say "wait for all child messages to complete," but
flush_current_group()callsflush_nowait()internally to avoid deadlocks when called from within a handler. The code comment explicitly documents this trade-off. Update the documentation to clarify that this triggers a non-blocking flush and direct readers toflush_group()orflush_and_sync_group()if they need to actually block and wait for message processing.Suggested doc updates
- /// This is designed to be called from within a handler to wait for all - /// child messages (sent during this handler's execution) to complete. + /// This is designed to be called from within a handler to trigger a non-blocking + /// flush of child messages (sent during this handler's execution).- /// // Wait for all child messages to complete + /// // Trigger a non-blocking flush of child messages
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
README.mdsrc/lib.rstests/test_groups.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Use appropriate SendOptions (Broadcast, Direct(id), Except(id), Random, or Balanced) when sending messages through the bus to control routing strategy
Files:
tests/test_groups.rssrc/lib.rs
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Implement one of the 12 handler traits (Handler, AsyncHandler, BatchHandler, AsyncBatchHandler, SynchronizedHandler, AsyncSynchronizedHandler, BatchSynchronizedHandler, AsyncBatchSynchronizedHandler, LocalHandler, LocalAsyncHandler, LocalBatchHandler, or LocalAsyncBatchHandler) based on the required concurrency model (thread-safe, synchronized, or local) and whether messages should be batched or processed individually
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Implement one of the 12 handler traits (Handler, AsyncHandler, BatchHandler, AsyncBatchHandler, SynchronizedHandler, AsyncSynchronizedHandler, BatchSynchronizedHandler, AsyncBatchSynchronizedHandler, LocalHandler, LocalAsyncHandler, LocalBatchHandler, or LocalAsyncBatchHandler) based on the required concurrency model (thread-safe, synchronized, or local) and whether messages should be batched or processed individually
Applied to files:
README.mdtests/test_groups.rssrc/lib.rs
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Use `#[derive(Message)]` with appropriate attributes for types sent through the bus, such as `#[message(clone)]` for broadcast-enabled messages, `#[message(shared)]` for remote transport serialization, and `#[type_tag("custom::name")]` or `#[namespace("my_namespace")]` for custom type tags
Applied to files:
README.md
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to messagebus/src/**/*.rs : Choose the appropriate receiver type (BufferUnorderedAsync/Sync for concurrent processing, BufferUnorderedBatchedAsync/Sync for batched concurrent processing, SynchronizedAsync/Sync for sequential processing, or SynchronizedBatchedAsync/Sync for batched sequential processing) based on whether message processing should be concurrent or sequential
Applied to files:
README.mdtests/test_groups.rssrc/lib.rs
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Applies to **/*.rs : Use appropriate SendOptions (Broadcast, Direct(id), Except(id), Random, or Balanced) when sending messages through the bus to control routing strategy
Applied to files:
README.md
📚 Learning: 2026-01-08T15:42:29.191Z
Learnt from: CR
Repo: shakacode/messagebus PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-08T15:42:29.191Z
Learning: Verify examples work correctly by running `cargo run --example <example_name>` for demo_async, demo_sync_batch, and benchmark examples
Applied to files:
tests/test_groups.rs
🧬 Code graph analysis (1)
src/lib.rs (3)
src/group.rs (1)
processing_count(261-267)src/receiver.rs (3)
id(109-109)id(418-420)id(789-791)src/relay.rs (1)
id(128-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: claude-review
🔇 Additional comments (8)
src/lib.rs (1)
822-853: Good fix for partial batch hangs.Flushing receivers in the group before
wait_idleshould prevent partial batches from stalling.tests/test_groups.rs (7)
2838-2919: LGTM — good regression for partial batch hangs inflush_and_sync_group.
2926-3002: LGTM — solid coverage forflush_grouppartial batches.
3009-3111: LGTM — validates cascading child messages are awaited.
3117-3224: LGTM — nice deep-nesting coverage.
3367-3450: LGTM — independent-group flushing coverage looks good.
3454-3495: LGTM — empty-group flush guard is a useful regression test.
3505-3737: LGTM — great end-to-end reproduction of the vrbp pattern.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Pull Request Review: Fix hang on group batchesSummaryThis PR addresses critical hanging issues in the messagebus library's group-based task flushing functionality, particularly when using batched handlers. The fixes are well-designed and the comprehensive test coverage provides strong confidence in the solution. ✅ Strengths1. Excellent Problem Identification and Solution
2. Comprehensive Test CoverageThe PR adds extensive tests covering multiple edge cases:
Tests are well-documented with clear explanations of what bugs they're preventing. 3. Excellent Documentation
4. API Design
|
gonzalog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
HCD-579
Summary
flush_groupandflush_and_sync_groupwith batched handlersBug Fixes
flush_and_sync_group:is_idling()checks and 32-iteration fuse to prevent infinite loopsflush_group:batch_size) no longer cause indefinite hangsDocumentation
flush_current_group()for use within handlersGroupRemovalResult,remove_group, etc.)Tests Added
test_flush_group_flushes_partial_batchestest_flush_and_sync_group_flushes_partial_batchestest_flush_and_sync_group_waits_for_cascading_messagestest_flush_and_sync_group_handles_deep_nestingtest_flush_and_sync_group_batched_with_cascadingTest plan
cargo run --example demo_groupsworks correctlySummary by CodeRabbit
New Features
Enhancements
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.