
Conversation

400Ping commented Dec 22, 2025

Purpose of PR

  • Add a pinned host buffer pool and wire it into the dual-stream pipeline so each chunk uses double-buffered pinned staging before H2D copies, reducing repeated malloc/free and GPU idle time (see the sketch below).
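
For readers new to the pattern, here is a minimal, CUDA-free sketch of the double-buffering idea; `stage` and `launch` are hypothetical stand-ins for the pinned memcpy and the async H2D/kernel enqueue steps in the actual pipeline, not APIs from this codebase.

// CUDA-free skeleton of the double-buffering idea: index i % 2 ping-pongs
// between two pinned staging buffers so that filling one buffer overlaps with
// the async H2D copy / kernel launched from the other.
fn double_buffered<T>(
    chunks: &[Vec<T>],
    staging: &mut [Vec<T>; 2],
    mut stage: impl FnMut(&mut Vec<T>, &[T]),
    mut launch: impl FnMut(usize, &[T]),
) {
    for (i, chunk) in chunks.iter().enumerate() {
        let slot = i % 2; // alternate between the two staging buffers
        // In the real pipeline, the per-slot CUDA event is awaited here so the
        // buffer is not overwritten while its previous copy is still in flight.
        stage(&mut staging[slot], chunk.as_slice());
        launch(slot, staging[slot].as_slice());
    }
}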

Related Issues or PRs

Closes #703

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@400Ping 400Ping marked this pull request as draft December 22, 2025 13:39
@400Ping 400Ping marked this pull request as ready for review December 22, 2025 13:51
400Ping (Author) commented Dec 22, 2025

@400Ping 400Ping changed the title [QDP] Double-buffered async I/O for read_parquet_batch [QDP] Pinned host buffer + dual-stream event pipeline to overlap copy and compute Dec 22, 2025
rich7420 (Contributor) commented Dec 23, 2025

Thanks @400Ping for the patch!

  1. What's the reason you define an FFI function again?
  2. Some tests failed locally due to a tensor shape problem; either fix that, or update the tests' expected output.

400Ping (Author) commented Dec 23, 2025

Thanks @400Ping for the patch!

  1. What's the reason you define an FFI function again?
  2. Some tests failed locally due to a tensor shape problem; either fix that, or update the tests' expected output.

My bad, just fixed it.

@400Ping 400Ping marked this pull request as draft December 24, 2025 07:54
@400Ping 400Ping marked this pull request as ready for review December 25, 2025 10:18
rich7420 (Contributor) left a comment

@400Ping thanks for the patch!
Left some comments.

@rich7420 rich7420 marked this pull request as draft December 25, 2025 13:24
@400Ping 400Ping changed the title [QDP] Pinned host buffer + dual-stream event pipeline to overlap copy and compute [QDP] Double-buffered pinned I/O pipeline and faster Parquet decode Dec 25, 2025
@400Ping 400Ping marked this pull request as ready for review December 25, 2025 22:51
rich7420 (Contributor) commented

I think maybe we could add some unit tests for this.
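
(For illustration, a minimal sketch of what such a test could assert, using only the try_acquire/available/Drop behavior described in the review below; the pool constructor is deliberately omitted because its exact signature is part of this PR and may take a device handle or return a Result.)

use std::sync::Arc;
// `PinnedBufferPool` comes from the new buffer_pool module added in this PR.

fn assert_handle_returns_on_drop(pool: &Arc<PinnedBufferPool>) {
    let before = pool.available();
    let handle = pool.try_acquire().expect("a free buffer should be available");
    assert_eq!(pool.available(), before - 1); // one buffer is checked out
    drop(handle); // RAII: the handle's Drop returns the buffer to the pool
    assert_eq!(pool.available(), before);
}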

ryankert01 (Contributor) commented

We have 2 improvements in this PR. Based on the benchmark results, I suspect that one of them may not be contributing to the speedup. What's your experience?

This reverts commit 3556b5a.
400Ping (Author) commented Dec 29, 2025

We have 2 improvements in this PR. Based on the benchmark results, I suspect that one of them may not be contributing to the speedup. What's your experience?

I think both contribute. The second one is the change @rich7420 and @guan404ming suggested: switching to a different decompression technique to improve performance. But I think most of the overall speedup comes from the first one.

400Ping (Author) commented Dec 29, 2025

Just tested: the second one doesn't improve performance much, so I'm going to remove it.

@400Ping 400Ping requested a review from rich7420 December 29, 2025 14:25
rich7420 (Contributor) commented

Please fix the pre-commit error.

400Ping (Author) commented Dec 31, 2025

Done, @rich7420 PTAL

400Ping (Author) commented Jan 1, 2026

cc @guan404ming @ryankert01

ryankert01 (Contributor) commented Jan 1, 2026

Please fix pre-commit. I tested locally and got a 2.8% speedup on the Arrow IPC case.
Will look into it further next week.

Copilot AI left a comment

Pull request overview

This PR introduces a double-buffered pinned host memory I/O pipeline to improve GPU data transfer performance. The key optimization is adding a reusable pool of pinned host buffers to eliminate repeated CUDA allocation/deallocation overhead in the streaming Parquet decode path.

  • Implements PinnedBufferPool with automatic RAII-based buffer management
  • Refactors PipelineContext to support multiple event slots for double-buffered synchronization
  • Renames PinnedBuffer to PinnedHostBuffer for clarity
  • Moves norm buffer allocation from per-pipeline to per-chunk
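
(For readers skimming the overview, a hedged sketch of the caller-side shape the first bullet implies: acquire a handle, use it, and let Drop return the buffer. The staging/upload calls are intentionally elided, since their exact accessors live in the diff.)

use std::sync::Arc;
// `PinnedBufferPool` / `try_acquire` come from the new buffer_pool module in this PR.

fn stage_one_chunk(pool: &Arc<PinnedBufferPool>, chunk: &[f64]) {
    if let Some(mut pinned) = pool.try_acquire() {
        // `pinned` derefs to PinnedHostBuffer; copy `chunk` into it and enqueue
        // the async H2D transfer here (accessor names omitted in this sketch).
        let _ = (&mut pinned, chunk);
    } // handle dropped here: the buffer returns to the pool automatically
}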

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Summary per file:

  • qdp/qdp-core/src/gpu/buffer_pool.rs: New pinned host buffer pool with acquire/release semantics and automatic return-to-pool on drop
  • qdp/qdp-core/src/gpu/pipeline.rs: Extended PipelineContext to support multiple event slots; integrated the pinned buffer pool; improved error handling with Result returns
  • qdp/qdp-core/src/gpu/memory.rs: Renamed PinnedBuffer to PinnedHostBuffer and added an immutable slice accessor
  • qdp/qdp-core/src/lib.rs: Integrated buffer pool types; moved norm buffer allocation to per-chunk scope; updated pipeline event handling
  • qdp/qdp-core/src/gpu/mod.rs: Exposed the new buffer_pool module and its public types
  • qdp/qdp-core/src/gpu/cuda_ffi.rs: Removed a redundant cfg attribute (already applied at module level)
  • qdp/qdp-kernels/tests/amplitude_encode.rs: Refactored the test loop to use an idiomatic iterator pattern instead of direct indexing


Comment on lines 103 to 111
let mut free = self.free.lock().unwrap();
loop {
    if let Some(buffer) = free.pop() {
        return PinnedBufferHandle {
            buffer: Some(buffer),
            pool: Arc::clone(self),
        };
    }
    free = self.available_cv.wait(free).unwrap();

Copilot AI Jan 1, 2026

The .unwrap() calls on mutex lock operations can cause panics if the mutex is poisoned. In a production system, poisoned mutex errors should be handled more gracefully, either by propagating the error or by documenting that panic behavior is intentional in these scenarios.

/// Returns `None` if the pool is currently empty; callers can choose to spin/wait
/// or fall back to synchronous paths.
pub fn try_acquire(self: &Arc<Self>) -> Option<PinnedBufferHandle> {
    let mut free = self.free.lock().unwrap();

Copilot AI Jan 1, 2026

The .unwrap() call on the mutex lock operation can cause a panic if the mutex is poisoned. Consider handling this error more gracefully or documenting the panic behavior.

/// Number of buffers currently available.
pub fn available(&self) -> usize {
    self.free.lock().unwrap().len()

Copilot AI Jan 1, 2026

The .unwrap() call on the mutex lock operation can cause a panic if the mutex is poisoned. Consider handling this error more gracefully or documenting the panic behavior.

Comment on lines +281 to +288
if chunk.len() > CHUNK_SIZE_ELEMENTS {
    return Err(MahoutError::InvalidInput(format!(
        "Chunk size {} exceeds pinned buffer capacity {}",
        chunk.len(),
        CHUNK_SIZE_ELEMENTS
    )));
}


Copilot AI Jan 1, 2026

This check is redundant because the iterator chunks() will never produce a chunk larger than CHUNK_SIZE_ELEMENTS. The check can be removed to simplify the code.

Suggested change
if chunk.len() > CHUNK_SIZE_ELEMENTS {
    return Err(MahoutError::InvalidInput(format!(
        "Chunk size {} exceeds pinned buffer capacity {}",
        chunk.len(),
        CHUNK_SIZE_ELEMENTS
    )));
}

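(For reference, the `chunks()` guarantee this suggestion relies on can be checked directly in a few lines:)

// slice::chunks(n) never yields a chunk longer than n; only the final chunk
// may be shorter, which is why the defensive length check is unreachable.
let data = [0.0_f64; 10];
assert!(data.chunks(4).all(|c| c.len() <= 4));
assert_eq!(data.chunks(4).last().map(|c| c.len()), Some(2)); // 10 = 4 + 4 + 2
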
Comment on lines +331 to +339
let mut norm_buffer = self
    .device
    .alloc_zeros::<f64>(samples_in_chunk)
    .map_err(|e| {
        MahoutError::MemoryAllocation(format!(
            "Failed to allocate norm buffer: {:?}",
            e
        ))
    })?;

Copilot AI Jan 1, 2026

Moving the norm buffer allocation inside the per-chunk loop (line 331-339) reintroduces per-chunk allocation overhead that was previously avoided. The old code comment explicitly stated: "Reuse a single norm buffer across chunks to avoid per-chunk allocations" and warned that "per-chunk allocation + drop can lead to use-after-free when the next chunk reuses the same device memory while the previous chunk is still running." While the use-after-free concern is mitigated by proper stream synchronization, the per-chunk allocation overhead remains. Consider pre-allocating a single norm buffer sized for the maximum expected samples_in_chunk to improve performance.

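(A hedged sketch of the pre-allocation this comment suggests, reusing the same alloc_zeros call as the snippet above; `max_samples_per_chunk` is a hypothetical bound the caller would need to compute before entering the chunk loop.)

// Before the per-chunk loop: allocate once, sized for the largest chunk.
let mut norm_buffer = self
    .device
    .alloc_zeros::<f64>(max_samples_per_chunk)
    .map_err(|e| {
        MahoutError::MemoryAllocation(format!("Failed to allocate norm buffer: {:?}", e))
    })?;
// Inside the loop, kernels would then write only the first `samples_in_chunk`
// elements of `norm_buffer` instead of allocating a fresh buffer per chunk.
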
impl Drop for PinnedBufferHandle {
    fn drop(&mut self) {
        if let Some(buf) = self.buffer.take() {
            let mut free = self.pool.free.lock().unwrap();

Copilot AI Jan 1, 2026

The .unwrap() call on the mutex lock operation can cause a panic if the mutex is poisoned. Consider handling this error more gracefully or documenting the panic behavior.

Suggested change
let mut free = self.pool.free.lock().unwrap();
let mut free = match self.pool.free.lock() {
    Ok(guard) => guard,
    Err(poisoned) => poisoned.into_inner(),
};

Comment on lines +37 to +49
    fn deref(&self) -> &Self::Target {
        self.buffer
            .as_ref()
            .expect("Buffer already returned to pool")
    }
}

#[cfg(target_os = "linux")]
impl std::ops::DerefMut for PinnedBufferHandle {
    fn deref_mut(&mut self) -> &mut Self::Target {
        self.buffer
            .as_mut()
            .expect("Buffer already returned to pool")

Copilot AI Jan 1, 2026

The panic message "Buffer already returned to pool" may be misleading. This panic occurs when attempting to use a PinnedBufferHandle after it has been dropped and its buffer returned to the pool. Consider a more descriptive message such as "Attempted to use PinnedBufferHandle after buffer was returned to pool (use-after-drop)" to better indicate the programmer error.

guan404ming (Member) commented Jan 1, 2026

I agree that the comment regarding .unwrap() is valid.
Do we want to handle this more gracefully, or is the current panic-on-poison behavior expected? If so, we can document it.

400Ping (Author) commented Jan 1, 2026

I agree that the comment regarding .unwrap() is valid. Do we want to handle this more gracefully, or is the current panic-on-poison behavior expected? If so, we can document it.

I think I will change the code to handle it more gracefully and add some comments to document this behavior.
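
(One way to do that, as a sketch using only std APIs: recover the guard from a poisoned mutex, on the grounds that the pool's free list cannot be left logically invalid by a panicking holder. The helper name is illustrative.)

use std::sync::{Mutex, MutexGuard};

/// Lock `m`, recovering the guard even if the mutex was poisoned.
/// A poisoned lock only means another thread panicked while holding it; the
/// Vec of pinned buffers inside remains structurally valid, so it is safe to
/// keep using it here.
fn lock_unpoisoned<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}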

guan404ming (Member) commented

Need to resolve conflicts, and overall looks good to me!
