Feature/cuda clusterfinder by kferjaoui · Pull Request #306 · slsdetectorgroup/aare

kferjaoui · 2026-04-23T15:27:51Z

CUDA Cluster Finder

Kernel

Shared memory tiling with generalized halo loading for arbitrary cluster sizes (3×3, 5×5, 9×9, …)
Compile-time stencil geometry enables full loop unrolling of the stencil reduction
Pedestal subtraction fused into the shared memory load stage
Non-photon pixels update the pedestal in-kernel; no atomics required on this path
Lock-free atomic global counter for cluster output across blocks
~87.5% occupancy on sm_89 (43 registers/thread, 0 spills)

Precision

Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias)
Pedestal accumulation stays double to preserve variance accuracy
FP32 throughput is ~64× higher than FP64 on the RTX 4090
For 3x3 clusters: The stride-18 shared memory access pattern avoids bank conflicts only with 32-bit values

Host Wrapper & Transfers

Per-stream pinned host staging buffers for truly async DMA (no implicit sync on pageable memory)
N independent StreamContext instances, each owning device buffers, a stream handle, and a pedestal array
find_clusters_batched() distributes frames round-robin across streams
Cluster payload construction deferred out of the kernel hot path
Rely on squared comparisons to avoid per-pixel sqrtf()

Python Integration

Built as a separate _aare_cuda.so to avoid nvcc/gcc ABI conflicts
Re-exported onto _aare so user imports remain from aare import ClusterFinderCUDA
Cluster and ClusterVector marked py::module_local() for pybind11 3 compatibility
Runtime factory selects CPU or GPU backend; raises RuntimeError on CPU-only builds

Validation

97.5% cluster match vs. CPU reference on ~90k frames of 400×400 Moench data (9×9 clustering)
~134× throughput improvement: 106 FPS CPU vs. 14,214 FPS GPU
Kernel time: 0.040 ms/frame | PCIe + overhead: 0.030 ms/frame

Implements a GPU version of the sequential ClusterFinder for single-frame cluster reconstrcution. Kernel (ClusterFinderCUDA.cuh): - Shared memory tiling with generalized halo loading for arbitrary cluster sizes (3x3, 5x5, ...) - Zero-initialization of shared memory to handle image boundary and partial edge-block cases. - Pedestal subtraction during shared memory loading. - Compile-time cluster geometry enabling full loop unrolling of the stencil reduction - Atomic global counter for lock-free cluster output across blocks. - RAII host wrapper; `ClusterFinderCUDA` struct.

- Non-photon pixels now update pedestal (push_fast equivalent) directly in the kernel, no atomics needed - Commented out quadrant significance test (c2): absent from sequential CPU code, was producing GPU-only clusters. - Added d_pd_sum to device allocations and host upload Build (sm_89): 46 registers, 0 spills, 100% occupancy. Verified on 256x256 Jungfrau data, 5000 frames, nSigma=5.0: CPU 8428 vs GPU 8471 clusters, 99.8% match 0.63 ms/frame CPU vs 0.04 ms/frame GPU (~16x)

- Wrap per-stream CUDA resources (device buffers, stream handle) in StreamContext struct; ClusterFinderCUDA owns a vector of n_streams contexts with independent pedestal arrays - Split ClusterFinderCUDA.cuh into clusterfinder_kernel.cuh (device kernel) and ClusterFinderCUDA.hpp (host RAII wrapper) - Add find_clusters_batched(): processes N frames round-robin across streams, returns per-frame cluster vectors. - Update ClusterFinderCUDA.test.cu - Update Makefile for new file layout.

- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for ClusterFinderCUDA - Build CUDA bindings as separate _aare_cuda.so to avoid segfaults from mixing nvcc and gcc compiled code in the same shared object - Re-export CUDA classes onto _aare in __init__.py so user code uses `from aare import ClusterFinderCUDA` regardless of which .so hosts the class - Factory in ClusterFinder.py selects backend; RuntimeError if GPU requested on CPU-only build - Update python/CMakeLists.txt: _aare_cuda module gated behind AARE_CUDA and AARE_PYTHON_BINDINGS - Add validation notebook: ~20x speedup vs sequential ClusterFinder

- After upgrading to pybind11 3, duplicate registration of cluster-related types across `_aare` and `_aare_cuda` started failing. - Mark the `Cluster` and `ClusterVector` bindings as `py::module_local()` so each extension owns its local registration. Note: cluster objects from CPU and CUDA bindings are now distinct Python types.

- Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias). - Pedestal accumulation stays double to preserve variance accuracy. Notes: - On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving stencil math to float improves performance. - Using float also avoids shared memory bank conflicts: stride-18 maps to distinct banks for 32-bit values, but caused conflicts with 64-bit.

- Use per-stream pinned host staging buffers for truly async CUDA transfers. - Avoid reserving full device capacity per result frame. - Reduce kernel work by delaying cluster payload construction. - Use squared comparisons and removing per-pixel sqrtf() ops.

lrlunin · 2026-05-07T11:44:55Z

Should not be included in git, I guess

Yes will be deleted (only for quick tests while developing)

Co-authored-by: Leonid Lunin <lunin.leonid@gmail.com>

lrlunin

The issue with failing conda build was resolved in #309

lrlunin · 2026-05-13T14:54:30Z

+      - name: Install conda-build in base
+        run: conda install -n base conda-build -y
+


Suggested change

- name: Install conda-build in base

run: conda install -n base conda-build -y

fixed in #309

lrlunin · 2026-05-13T14:55:03Z

+      - name: Install conda-build in base
+        run: conda install -n base conda-build -y
+


Suggested change

- name: Install conda-build in base

run: conda install -n base conda-build -y

fixed in #309

Rework the multi-stream pipeline to eliminate per-frame sync barriers and fix the D2H staging architecture. Sync reduction: - Replace one cudaStreamSynchronize per frame with one per stream per batch, cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams) - Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]] stored in a single class-level lazy-allocated pinned pool (h_output_pinned), replacing the per-stream separate cluster/count device buffers - Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated, enabling correct kernel timing across any batch size Pinned H2D without CPU-side copy: - Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping cudaHostRegister so callers can pin their existing batch buffer once; all find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams): Before: ~34 µs/frame -> After: ~28 µs/frame (−18 %)

- Device pedestal arrays (mean/sum/sum2) are now float instead of double: halves global-memory bandwidth for pedestal reads/writes and eliminates FP64 arithmetic in the kernel (3.3x kernel speedup, 15µs -> 4.6µs). - Replace the per-cluster push_back loop in the D2H drain with a single resize()+memcpy().

- Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches to be in-flight simultaneously: - submit_batch() enqueues H2D+kernel+D2H without blocking - collect() syncs via cudaEventSynchronize (not cudaStreamSynchronize) so a queued second batch runs uninterrupted. - Two ping-pong output slots (NUM_SLOTS=2) with per-slot pinned buffers and cudaEventDisableTiming sync events. - find_clusters_batched() keeps its direct implementation. * Measured: 0.026 -> 0.022 ms/frame (~18%).

Khalil Daniel Ferjaoui and others added 13 commits April 8, 2026 16:20

minor: use helper for frame packing

b9caf2d

Merge branch 'main' into feature/cuda_clusterfinder

7c91ce9

Format CUDA cluster finder files

a086cbb

Exclude notebooks from JSON check

fddef97

Apply cmake-format to build files

40f08fa

Add per-frame kernel timing via CUDA events

34e69a8

kferjaoui requested review from AliceMazzoleni99, erikfrojdh and lrlunin May 5, 2026 15:25

kferjaoui marked this pull request as ready for review May 5, 2026 15:26

kferjaoui added 2 commits May 6, 2026 11:30

Fix ClusterVector move semantics

41d5184

ci: install conda-build in base env before conda build step

82e5e60

kferjaoui marked this pull request as draft May 7, 2026 10:27

lrlunin reviewed May 7, 2026

View reviewed changes

Comment thread CMakeLists.txt

Eliminate duplicated code in CMakeLists.txt

5cfc888

Co-authored-by: Leonid Lunin <lunin.leonid@gmail.com>

lrlunin reviewed May 13, 2026

View reviewed changes

kferjaoui added 3 commits May 18, 2026 16:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/cuda clusterfinder#306

Feature/cuda clusterfinder#306
kferjaoui wants to merge 19 commits into
mainfrom
feature/cuda_clusterfinder

kferjaoui commented Apr 23, 2026 •

edited

Loading

Uh oh!

lrlunin May 7, 2026

Uh oh!

kferjaoui May 7, 2026

Uh oh!

Uh oh!

lrlunin left a comment

Uh oh!

lrlunin May 13, 2026

Uh oh!

lrlunin May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		- name: Install conda-build in base
		run: conda install -n base conda-build -y

Conversation

kferjaoui commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lrlunin May 7, 2026

Choose a reason for hiding this comment

Uh oh!

kferjaoui May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lrlunin left a comment

Choose a reason for hiding this comment

Uh oh!

lrlunin May 13, 2026

Choose a reason for hiding this comment

Uh oh!

lrlunin May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kferjaoui commented Apr 23, 2026 •

edited

Loading