Skip to content

Feature/cuda clusterfinder#306

Draft
kferjaoui wants to merge 19 commits into
mainfrom
feature/cuda_clusterfinder
Draft

Feature/cuda clusterfinder#306
kferjaoui wants to merge 19 commits into
mainfrom
feature/cuda_clusterfinder

Conversation

@kferjaoui

@kferjaoui kferjaoui commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

CUDA Cluster Finder

Kernel

  • Shared memory tiling with generalized halo loading for arbitrary cluster sizes (3×3, 5×5, 9×9, …)
  • Compile-time stencil geometry enables full loop unrolling of the stencil reduction
  • Pedestal subtraction fused into the shared memory load stage
  • Non-photon pixels update the pedestal in-kernel; no atomics required on this path
  • Lock-free atomic global counter for cluster output across blocks
  • ~87.5% occupancy on sm_89 (43 registers/thread, 0 spills)

Precision

  • Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias)
  • Pedestal accumulation stays double to preserve variance accuracy
  • FP32 throughput is ~64× higher than FP64 on the RTX 4090
  • For 3x3 clusters: The stride-18 shared memory access pattern avoids bank conflicts only with 32-bit values

Host Wrapper & Transfers

  • Per-stream pinned host staging buffers for truly async DMA (no implicit sync on pageable memory)
  • N independent StreamContext instances, each owning device buffers, a stream handle, and a pedestal array
  • find_clusters_batched() distributes frames round-robin across streams
  • Cluster payload construction deferred out of the kernel hot path
  • Rely on squared comparisons to avoid per-pixel sqrtf()

Python Integration

  • Built as a separate _aare_cuda.so to avoid nvcc/gcc ABI conflicts
  • Re-exported onto _aare so user imports remain from aare import ClusterFinderCUDA
  • Cluster and ClusterVector marked py::module_local() for pybind11 3 compatibility
  • Runtime factory selects CPU or GPU backend; raises RuntimeError on CPU-only builds

Validation

  • 97.5% cluster match vs. CPU reference on ~90k frames of 400×400 Moench data (9×9 clustering)
  • ~134× throughput improvement: 106 FPS CPU vs. 14,214 FPS GPU
  • Kernel time: 0.040 ms/frame | PCIe + overhead: 0.030 ms/frame

Khalil Daniel Ferjaoui and others added 13 commits April 8, 2026 16:20
Implements a GPU version of the sequential ClusterFinder for
single-frame cluster reconstrcution.

Kernel (ClusterFinderCUDA.cuh):
- Shared memory tiling with generalized halo loading for arbitrary
  cluster sizes (3x3, 5x5, ...)
- Zero-initialization of shared memory to handle image boundary
  and partial edge-block cases.
- Pedestal subtraction during shared memory loading.
- Compile-time cluster geometry enabling full loop unrolling
  of the stencil reduction
- Atomic global counter for lock-free cluster output across blocks.
- RAII host wrapper; `ClusterFinderCUDA` struct.
- Non-photon pixels now update pedestal (push_fast equivalent)
  directly in the kernel, no atomics needed
- Commented out quadrant significance test (c2): absent from
  sequential CPU code, was producing GPU-only clusters.
- Added d_pd_sum to device allocations and host upload

Build (sm_89): 46 registers, 0 spills, 100% occupancy.

Verified on 256x256 Jungfrau data, 5000 frames, nSigma=5.0:
  CPU 8428 vs GPU 8471 clusters, 99.8% match
  0.63 ms/frame CPU vs 0.04 ms/frame GPU (~16x)
- Wrap per-stream CUDA resources (device buffers, stream handle)
  in StreamContext struct; ClusterFinderCUDA owns a vector of
  n_streams contexts with independent pedestal arrays
- Split ClusterFinderCUDA.cuh into clusterfinder_kernel.cuh
  (device kernel) and ClusterFinderCUDA.hpp (host RAII wrapper)
- Add find_clusters_batched(): processes N frames round-robin
  across streams, returns per-frame cluster vectors.
- Update ClusterFinderCUDA.test.cu
- Update Makefile for new file layout.
- Add bind_ClusterFinderCUDA.hpp with pybind11 bindings for
  ClusterFinderCUDA
- Build CUDA bindings as separate _aare_cuda.so to avoid
  segfaults from mixing nvcc and gcc compiled code in the
  same shared object
- Re-export CUDA classes onto _aare in __init__.py so user
  code uses `from aare import ClusterFinderCUDA` regardless
  of which .so hosts the class
- Factory in ClusterFinder.py selects backend; RuntimeError
  if GPU requested on CPU-only build
- Update python/CMakeLists.txt: _aare_cuda module gated
  behind AARE_CUDA and AARE_PYTHON_BINDINGS
- Add validation notebook: ~20x speedup vs sequential ClusterFinder
- After upgrading to pybind11 3, duplicate registration of cluster-related
  types across `_aare` and `_aare_cuda` started failing.
- Mark the `Cluster` and `ClusterVector` bindings as `py::module_local()` so
each extension owns its local registration.

Note: cluster objects from CPU and CUDA bindings are now distinct Python types.
- Stencil arithmetic and shared memory use float (COMPUTE_TYPE alias).
- Pedestal accumulation stays double to preserve variance accuracy.

Notes:
- On RTX 4090, FP32 throughput is ~64× higher than FP64, so moving
  stencil math to float improves performance.
- Using float also avoids shared memory bank conflicts: stride-18 maps
  to distinct banks for 32-bit values, but caused conflicts with 64-bit.
- Use per-stream pinned host staging buffers for truly async CUDA transfers.
- Avoid reserving full device capacity per result frame.
- Reduce kernel work by delaying cluster payload construction.
- Use squared comparisons and removing per-pixel sqrtf() ops.
@kferjaoui kferjaoui marked this pull request as ready for review May 5, 2026 15:26
@kferjaoui kferjaoui marked this pull request as draft May 7, 2026 10:27
Comment thread src/Makefile

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not be included in git, I guess

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes will be deleted (only for quick tests while developing)

Comment thread CMakeLists.txt
Co-authored-by: Leonid Lunin <lunin.leonid@gmail.com>

@lrlunin lrlunin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with failing conda build was resolved in #309

Comment on lines +43 to +45
- name: Install conda-build in base
run: conda install -n base conda-build -y

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Install conda-build in base
run: conda install -n base conda-build -y

fixed in #309

Comment on lines +38 to +40
- name: Install conda-build in base
run: conda install -n base conda-build -y

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Install conda-build in base
run: conda install -n base conda-build -y

fixed in #309

kferjaoui added 3 commits May 18, 2026 16:30
Rework the multi-stream pipeline to eliminate per-frame sync barriers and
fix the D2H staging architecture.

Sync reduction:
- Replace one cudaStreamSynchronize per frame with one per stream per batch,
  cutting synchronisation calls from O(n_frames x n_streams) to O(n_streams)
- Introduce a unified per-frame D2H output layout [uint32_t count | clusters[max]]
  stored in a single class-level lazy-allocated pinned pool (h_output_pinned),
  replacing the per-stream separate cluster/count device buffers
- Move CUDA event pool from per-stream fixed-size to per-frame-slot lazy-allocated,
  enabling correct kernel timing across any batch size

Pinned H2D without CPU-side copy:
- Add register_input_buffer(ptr, bytes) / unregister_input_buffer() wrapping
  cudaHostRegister so callers can pin their existing batch buffer once; all
  find_clusters_batched() slices then transfer at DMA speed (~22 GB/s) instead
  of ~15 GB/s for pageable, with no extra memcpy or WC-memory penalty

Result (RTX 4090, 400x400 uint16, 3x3 clusters, batch=2000, 5 streams):
  Before: ~34 µs/frame  ->  After: ~28 µs/frame  (−18 %)
- Device pedestal arrays (mean/sum/sum2) are now float instead of
  double: halves global-memory bandwidth for pedestal reads/writes and
  eliminates FP64 arithmetic in the kernel (3.3x kernel speedup,
  15µs -> 4.6µs).

- Replace the per-cluster push_back loop in the D2H drain with a
  single resize()+memcpy().
- Eliminate the ~200–300 µs inter-batch idle gap by allowing two batches
to be in-flight simultaneously:
  - submit_batch() enqueues H2D+kernel+D2H without blocking
  - collect() syncs via cudaEventSynchronize (not
  cudaStreamSynchronize) so a queued second batch runs uninterrupted.

- Two ping-pong output slots (NUM_SLOTS=2) with per-slot pinned buffers
and cudaEventDisableTiming sync events.
- find_clusters_batched() keeps its direct implementation.

* Measured: 0.026 -> 0.022 ms/frame (~18%).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants