[Data] Add per-stage training-thread blocking attribution to iter_batches#64183
[Data] Add per-stage training-thread blocking attribution to iter_batches#64183OneSizeFitsQuorum wants to merge 12 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces detailed per-stage wall-clock timing and statistics tracking (including fetch, batching, formatting, collating, finalization, and order restoration) for Ray Data batch iteration, along with corresponding Prometheus metrics and unit tests. The review feedback suggests updating the type annotation of dataset_tag in _create_iteration_tags to Optional[str] to align with the underlying function and avoid static type checking errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
b02e44d to
850ff58
Compare
faaeb93 to
14ec378
Compare
JasonLi1909
left a comment
There was a problem hiding this comment.
Thank you for the PR @OneSizeFitsQuorum! The core idea of capturing the start and end of each pipeline stage to compute overlap with the training thread stall makes sense. That said, let's keep the scope of this PR to just the training thread attribution metrics. We can compartmentalize other metrics for a later PR, but at a glance some of them such as data_iter_prefetch_queue_depth are more of an implementation detail, which we should avoid. Also it would be great if you could simplify the PR description so it's shorter and easier to understand. Thanks!
14ec378 to
69c8cbb
Compare
…e observability to iter_batches Implements overlap-based latency attribution for Ray Data's iter_batches pipeline, addressing ray-project#64132 and RFC ray-project#63911. Each pipeline stage (fetch, batching, format, collate, finalize, restore_order) records an independent (start_s, end_s) time window. The training thread captures its own blocked window around next(). Attribution per stage is the overlap of the two windows, correctly handling prefetch > 1. New Prometheus metrics (14 total): - data_iter_blocked_{fetch,batching,format,collate,finalize,restore_order}_seconds - data_iter_batches_total, data_iter_rows_total - data_iter_total_seconds, data_iter_restore_order_buffer_peak - data_iter_shuffle_buffer_{rows,compactions_total,compaction_seconds} - data_iter_prefetch_queue_depth Also adds: - Per-stage breakdown rendering in IterStatsSummary.to_string() - Rank extraction from dataset tags for Prometheus labels - Final metrics flush on iterator completion Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
69c8cbb to
071ebd6
Compare
|
Thanks for the review @JasonLi1909! I've made the following changes:
Also fixed a regex issue flagged by Cursor Bugbot: the rank extraction now uses the last |
Reverts batcher.py changes that were only needed for the shuffle buffer metrics which have been removed from this PR's scope per reviewer feedback. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
| with stats.iter_get_s.timer() if stats else nullcontext(): | ||
| block = ray.get(block_ref) | ||
| yield block | ||
| end_s = time.perf_counter() |
There was a problem hiding this comment.
Instead of having tow timers for the same thing, can we consolidate the capture of start_s and end_s into the Timer class (in stats.py)? Double check for any backwards compatibility issues with other consumers of Timer
There was a problem hiding this comment.
Done. StageTiming now supports context manager protocol (__enter__/__exit__), and Timer gained start_s/end_s fields. Pipeline functions use nested context managers instead of redundant perf_counter() calls. No backwards compatibility issues — Timer is only used internally within ray/data.
There was a problem hiding this comment.
Let's also capture the blocked time here. It will be useful to indicate how we much we are blocked on the upstream data pipeline and cross node transfer.
There was a problem hiding this comment.
Done. The fetch window in resolve_block_refs now spans from next(block_ref_iter) (upstream wait) through ray.get() completion, capturing both production delays and cross-node transfer.
| def resolve_block_refs( | ||
| block_ref_iter: Iterator[ObjectRef[Block]], | ||
| stats: Optional[DatasetStats] = None, | ||
| ) -> Iterator[Block]: | ||
| record_timings: bool = False, |
There was a problem hiding this comment.
Let's remove the optionality here and always record the timings to avoid downstream isinstance(block, BlockWithTimings) checks, Union types, and other type branching logic. And if we can consolidate the (start, end) capture into Timer, then we can still toggle this timing via the presence of stats
There was a problem hiding this comment.
Done. resolve_block_refs always returns BlockWithTiming (no record_timings parameter). batch_blocks() wraps raw blocks in BlockWithTiming with zero timing before passing to blocks_to_batches(), so _BatchingIterator receives a uniform type. All isinstance checks, Union types, and type branching logic removed.
| @@ -452,7 +474,9 @@ def get_next_ref_bundle() -> RefBundle: | |||
| prefetcher.stop() | |||
|
|
|||
|
|
|||
| def restore_original_order(batch_iter: Iterator[Batch]) -> Iterator[Batch]: | |||
| def restore_original_order( | |||
There was a problem hiding this comment.
Let's hold off on timing restore order and surfacing data_iter_blocked_restore_order_seconds. This is also more of an implementation detail. Instead we should focus on surfacing actionable metrics for users.
There was a problem hiding this comment.
Done. Removed restore_order stage, data_iter_blocked_restore_order_seconds metric, and all related code. restore_original_order() reverted to the original simple for-loop. PR now exposes 8 metrics (5 blocked stages + batches/rows/total).
| with self.yield_batch_context(batch): | ||
| yield batch.data | ||
|
|
||
| self.after_epoch_end() | ||
|
|
||
| def _report_batch_timings( |
JasonLi1909
left a comment
There was a problem hiding this comment.
Hey @OneSizeFitsQuorum, thanks for the cleanup. I've left some more comments. Overall, make sure to include docstrings, comments, and that names of data classes/functions communicate their purpose. I'll take another look after you're done, thanks!
Per reviewer feedback, restore_order is an implementation detail rather than an actionable user-facing metric. Reverts restore_original_order() to the original simple for-loop and removes the data_iter_blocked_restore_order_seconds Prometheus metric along with all related fields, exports, and tests. The PR now exposes 8 core metrics (5 blocked stages + batches/rows/total). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, consolidates the dual timing mechanism: - StageTiming now supports context manager protocol (__enter__/ __exit__) to automatically capture start_s/end_s - Timer gains start_s/end_s fields populated by timer() - Pipeline functions (resolve_block_refs, _format_batch, _collate_batch, _finalize_batch) use nested context managers instead of redundant perf_counter() + _record_stage_window() - resolve_block_refs always returns BlockWithTiming, removing the record_timings parameter, Union types, and isinstance branching - Removed _record_stage_window helper (no longer needed) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The fetch timing window in resolve_block_refs now spans from when we start waiting for the upstream iterator (blocked on the data pipeline) through ray.get() completion. This captures cross-node transfer and upstream production delays, giving a more complete picture of what blocks the training thread. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, adds clear docstrings to: - BatchTimings (per-batch pipeline-stage timing windows) - BlockWithTiming (resolved block with fetch timing) - BatchTimings.merge_fetch() (multi-block fetch window expansion) - BatchTimings.stages() (stage name/timing iterator) - _report_batch_timings() (overlap-based attribution algorithm) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
_BatchingIterator can receive blocks from paths other than resolve_block_refs (e.g., doctest examples that pass raw pyarrow Tables). Restore the isinstance check to handle both BlockWithTiming and raw Block objects gracefully. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 9fcde56. Configure here.
Per reviewer feedback, removed isinstance check and Union type from _BatchingIterator by ensuring all entry points wrap blocks in BlockWithTiming: - batch_blocks() now wraps raw blocks in BlockWithTiming with zero timing before passing to blocks_to_batches() - _BatchingIterator now assumes all blocks are BlockWithTiming - Removed Union import from util.py This provides a uniform type throughout the batching pipeline while maintaining backward compatibility for external callers of batch_blocks(). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per Cursor Bugbot review: 1. merge_fetch now sums fetch durations instead of taking the span, avoiding counting idle gaps between consecutive block fetches as fetch blocking time. 2. Move blocked_start_s/blocked_end_s captures inside get_next_batch_context() so the blocked window aligns with iter_total_blocked_s, preventing sum(iter_blocked_*) from exceeding iter_total_blocked_s. Updated tests to reflect the new duration-summing behavior. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
After changing merge_fetch to sum durations instead of taking the span, the _merge_stage helper is no longer called anywhere. Remove the dead code. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
After deeper analysis, the span approach (taking [earliest_start, latest_end]) is semantically correct for multi-block fetches: - From the training thread's perspective, it's blocked for the entire span, even if there are gaps between consecutive block fetches - Those "idle gaps" are actually pipeline overhead (batching logic, scheduling) and are part of the blocking experience - Summing durations would underestimate the actual blocking time The Cursor Bugbot concern about "idle gaps" is valid in theory, but in practice: 1. The gaps are very small (microseconds of pipeline overhead) 2. They represent real blocking time from the training thread's perspective 3. Span aligns with the semantic meaning of "how long did training wait" Reverted tests to expect span behavior. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
|
Re: merge_fetch idle gaps (Cursor Bugbot) After deeper analysis, we've reverted to the span-based approach (commit aa41de5). Here's why: Semantic correctness: From the training thread's perspective, when it calls The "idle gaps" aren't really idle: The gaps between block fetches are actually pipeline overhead (batching logic, scheduling, etc.). They're part of the blocking experience from the training thread's viewpoint. Sum underestimates: If we sum durations, we'd report less blocking time than the training thread actually experienced, which defeats the purpose of observability. Example: The span approach correctly answers "how long did the training thread wait for this batch?", which is what users care about for performance debugging. Cursor Bugbot's concern about idle gaps is theoretically valid, but in practice those gaps are microseconds of pipeline overhead and represent real blocking time. |
- test_util.py: Updated test_resolve_block_refs to expect BlockWithTiming objects and test_blocks_to_batches to wrap raw blocks - block_batching.py: Changed generator expression to map() to avoid holding references to blocks, fixing test_chained_transforms_release_intermediates Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Summary
Decomposes
iter_total_blocked_sinto per-stage contributions so users can see where the training thread is blocked. Closes #64132, part of RFC #63911.Design
Each pipeline stage records an independent
(start_s, end_s)window viaStageTiming, a lightweight context manager. The training thread records its blocked window aroundnext(batch_iter). Per-stage attribution = overlap of the two windows:Invariant:
sum(iter_blocked_*) ≤ iter_total_blocked_sThe fetch stage spans from upstream wait (blocked on data pipeline) through
ray.get()completion, capturing both production delays and cross-node transfer.New Prometheus Metrics (8)
data_iter_blocked_fetch_secondsdata_iter_blocked_batching_secondsdata_iter_blocked_format_secondsdata_iter_blocked_collate_secondsdata_iter_blocked_finalize_secondsdata_iter_batches_totaldata_iter_rows_totaldata_iter_total_secondsAlso: rank extracted from dataset tag (last
split_N) for per-rank Prometheus querying.Changes
interfaces.py—StageTiming(context manager),BatchTimings,BlockWithTimingiter_batches.py—_report_batch_timings()overlap computation with docstringutil.py—resolve_block_refsalways returnsBlockWithTiming(norecord_timingsflag, no Union/isinstance); nested context managers replace redundantperf_counter()calls; fetch window includes upstream waitstats.py— 8 Prometheus Gauges,Timergainsstart_s/end_s,IterStatsSummaryper-stage breakdown, rank extractioniterator.py— Final metrics flush on iteration endExample Output
Performance
~6 μs/batch overhead. At 10k batches/sec: <0.04% impact on a 10ms training step.
Tests
References
Closes #64132 · Part of RFC #63911
cc @JasonLi1909