A high-performance, production-grade GPU acceleration project demonstrating advanced CUDA optimization techniques for Huffman Encoding. This project achieves massive throughput gains over traditional CPU-based frequency counting by leveraging the massive parallelism of NVIDIA GPUs.
- Advanced GPU Optimization: Includes multiple kernel implementations (Naive, Shared Memory, Warp-Level Aggregation).
- Hybrid Benchmarking: Real-time comparison between Single-threaded CPU, Multi-threaded OpenMP, and various CUDA kernels.
- Production-Grade Architecture: Clean, modular C++17 design with RAII memory management and interface-driven design.
- Measurable Performance: Built-in throughput (GB/s) metrics and latency analysis.
- Data Visualization: Integrated Python scripts for generating interactive performance dashboards.
- Scalability: Tested on datasets ranging from small text files to multi-GB binary blobs.
graph TD
A[src/main.cpp] --> B[BenchmarkHarness]
B --> C[IFrequencyCounter Interface]
C --> D[CpuFrequencyCounter]
C --> E[OpenMpFrequencyCounter]
C --> F[CudaFrequencyCounter]
F --> G[CUDA Kernels]
G --> G1[Naive Global Atomics]
G --> G2[Shared Memory Histograms]
G --> G3[Warp-Level Aggregation]
B --> H[CSV/JSON Reporter]
H --> I[Python Visualization]
The repository is organized as follows:
- src/: All C++ and CUDA source files.
- core/: Base Huffman tree node representations and code building.
- cpu/: Host serial and OpenMP multi-threaded counter implementations.
- cuda/: GPU kernel code and CPU fallback drivers.
- utils/: RAII helpers (
GpuBuffer,UnifiedBuffer,GpuStream) and log utilities. - benchmark/: Test configuration and latency recording harness.
- data/: Text datasets and inputs for benchmarking.
- scripts/: Support tools (including dashboard visualization).
- docs/: Development guides and optimization logs.
This project goes beyond simple "clean code." It implements several state-of-the-art GPU optimization strategies:
| Optimization Level | Technique | Rationale |
|---|---|---|
| L1: Naive | Global Atomics | Baseline implementation using atomicAdd on global memory. High contention. |
| L2: Shared | Per-Block Histograms | Reduces global memory traffic by aggregating counts in high-speed L1/Shared memory. |
| L3: Warp | Multi-Bank Histograms | Further reduces shared memory bank conflicts by using interleaved sub-histograms per warp. |
| L4: Async | CUDA Streams | Overlaps data transfer (H2D/D2H) with kernel execution for hidden latency. |
| L5: Managed | Unified Memory | Simplifies developer programmability using a single managed address space with async prefetching. |
On an NVIDIA RTX GPU, the Shared Memory implementation typically achieves 10x - 50x speedup over optimized CPU implementations for large datasets.
- CPU Serial: ~0.4 GB/s
- CPU OpenMP (16 threads): ~3.2 GB/s
- CUDA Shared Memory: ~45+ GB/s
- CUDA Toolkit: 13.2+ for VS 2026 support (CUDA 13.1 only supports VS 2019–2022)
- C++17 Compatible Compiler (MSVC 2019+ or GCC 9+)
- CMake 3.18+
- Python 3.8+ (for visualization)
- Visual Studio 2026 Build Tools with the MSVC toolset for GPU builds on Windows
Note on Windows & CUDA:
- The CUDA build path requires MSVC as the host compiler.
- CUDA 13.1 does not support VS 2026; you must upgrade to CUDA 13.2+ or use VS 2022/2019.
- If CMake detects an unsupported CUDA/compiler pair, it automatically falls back to a CPU-only implementation so the executable still builds and runs.
If you have VS 2026 installed and CUDA 13.1, download and install CUDA Toolkit 13.2+ or the latest version. Then reconfigure:
rmdir /s /q build-msvc
cmake -S . -B build-msvcOpen the VS Code terminal with the Developer Command Prompt for VS 2026 profile, then run:
cd /d "D:\Projects\GPU huffman"
cmake -S . -B build-msvc
cmake --build build-msvcIf you are using a regular PowerShell terminal, CMake may select MinGW instead of MSVC and the project will fall back to the CPU implementation.
Run the sample benchmark:
.\build-msvc\bin\gpu_huffman.exe .\data\big.txtOr run your own input file:
.\build-msvc\bin\gpu_huffman.exe D:\path\to\your\file.txtGenerate the visual report:
python .\scripts\visualize_results.py- The default
build-msvcfolder keeps the MSVC configuration separate from the earlier MinGW build tree. - If you want the GPU/CUDA path, always configure from the Developer Command Prompt, not a regular PowerShell or CMD session.
- Memory Contention: In frequency counting, multiple threads frequently hit the same buckets (e.g., 'e' or space). Shared memory atomics are essential to prevent global memory serialization.
- Kernel Launch Overhead: For inputs < 1MB, the CPU often wins due to the 5-10µs overhead of launching a CUDA kernel.
- PCIe Latency: Data transfer (H2D) is the primary bottleneck for small to medium files. Pinning memory and using async streams can mitigate this.
Distributed under the MIT License. See LICENSE for more information.
Author: Ganesh Kambli
