GPU-Accelerated Parallel Compression Engine (CUDA Huffman)

A high-performance, production-grade GPU acceleration project demonstrating advanced CUDA optimization techniques for Huffman Encoding. This project achieves massive throughput gains over traditional CPU-based frequency counting by leveraging the massive parallelism of NVIDIA GPUs.

🚀 Key Features

Advanced GPU Optimization: Includes multiple kernel implementations (Naive, Shared Memory, Warp-Level Aggregation).
Hybrid Benchmarking: Real-time comparison between Single-threaded CPU, Multi-threaded OpenMP, and various CUDA kernels.
Production-Grade Architecture: Clean, modular C++17 design with RAII memory management and interface-driven design.
Measurable Performance: Built-in throughput (GB/s) metrics and latency analysis.
Data Visualization: Integrated Python scripts for generating interactive performance dashboards.
Scalability: Tested on datasets ranging from small text files to multi-GB binary blobs.

🏗️ Project Architecture

graph TD
    A[src/main.cpp] --> B[BenchmarkHarness]
    B --> C[IFrequencyCounter Interface]
    C --> D[CpuFrequencyCounter]
    C --> E[OpenMpFrequencyCounter]
    C --> F[CudaFrequencyCounter]
    F --> G[CUDA Kernels]
    G --> G1[Naive Global Atomics]
    G --> G2[Shared Memory Histograms]
    G --> G3[Warp-Level Aggregation]
    B --> H[CSV/JSON Reporter]
    H --> I[Python Visualization]

📂 Directory Structure

The repository is organized as follows:

src/: All C++ and CUDA source files.
- core/: Base Huffman tree node representations and code building.
- cpu/: Host serial and OpenMP multi-threaded counter implementations.
- cuda/: GPU kernel code and CPU fallback drivers.
- utils/: RAII helpers (GpuBuffer, UnifiedBuffer, GpuStream) and log utilities.
- benchmark/: Test configuration and latency recording harness.
data/: Text datasets and inputs for benchmarking.
scripts/: Support tools (including dashboard visualization).
docs/: Development guides and optimization logs.

🛠️ Performance Engineering

This project goes beyond simple "clean code." It implements several state-of-the-art GPU optimization strategies:

Optimization Level	Technique	Rationale
L1: Naive	Global Atomics	Baseline implementation using `atomicAdd` on global memory. High contention.
L2: Shared	Per-Block Histograms	Reduces global memory traffic by aggregating counts in high-speed L1/Shared memory.
L3: Warp	Multi-Bank Histograms	Further reduces shared memory bank conflicts by using interleaved sub-histograms per warp.
L4: Async	CUDA Streams	Overlaps data transfer (H2D/D2H) with kernel execution for hidden latency.
L5: Managed	Unified Memory	Simplifies developer programmability using a single managed address space with async prefetching.

📊 Benchmarking Results

On an NVIDIA RTX GPU, the Shared Memory implementation typically achieves 10x - 50x speedup over optimized CPU implementations for large datasets.

Throughput (GB/s)

CPU Serial: ~0.4 GB/s
CPU OpenMP (16 threads): ~3.2 GB/s
CUDA Shared Memory: ~45+ GB/s

💻 Getting Started

Prerequisites

CUDA Toolkit: 13.2+ for VS 2026 support (CUDA 13.1 only supports VS 2019–2022)
C++17 Compatible Compiler (MSVC 2019+ or GCC 9+)
CMake 3.18+
Python 3.8+ (for visualization)
Visual Studio 2026 Build Tools with the MSVC toolset for GPU builds on Windows

Note on Windows & CUDA:

The CUDA build path requires MSVC as the host compiler.
CUDA 13.1 does not support VS 2026; you must upgrade to CUDA 13.2+ or use VS 2022/2019.
If CMake detects an unsupported CUDA/compiler pair, it automatically falls back to a CPU-only implementation so the executable still builds and runs.

Upgrading CUDA

If you have VS 2026 installed and CUDA 13.1, download and install CUDA Toolkit 13.2+ or the latest version. Then reconfigure:

rmdir /s /q build-msvc
cmake -S . -B build-msvc

Windows Quick Start

Open the VS Code terminal with the Developer Command Prompt for VS 2026 profile, then run:

cd /d "D:\Projects\GPU huffman"
cmake -S . -B build-msvc
cmake --build build-msvc

If you are using a regular PowerShell terminal, CMake may select MinGW instead of MSVC and the project will fall back to the CPU implementation.

Run the sample benchmark:

.\build-msvc\bin\gpu_huffman.exe .\data\big.txt

Or run your own input file:

.\build-msvc\bin\gpu_huffman.exe D:\path\to\your\file.txt

Generate the visual report:

python .\scripts\visualize_results.py

Notes

The default build-msvc folder keeps the MSVC configuration separate from the earlier MinGW build tree.
If you want the GPU/CUDA path, always configure from the Developer Command Prompt, not a regular PowerShell or CMD session.

📚 Lessons Learned & Bottlenecks

Memory Contention: In frequency counting, multiple threads frequently hit the same buckets (e.g., 'e' or space). Shared memory atomics are essential to prevent global memory serialization.
Kernel Launch Overhead: For inputs < 1MB, the CPU often wins due to the 5-10µs overhead of launching a CUDA kernel.
PCIe Latency: Data transfer (H2D) is the primary bottleneck for small to medium files. Pinning memory and using async streams can mitigate this.

📄 License

Distributed under the MIT License. See LICENSE for more information.

Author: Ganesh Kambli

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
build-msvc		build-msvc
data		data
docs		docs
scripts		scripts
src		src
.clangd		.clangd
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
benchmark_results.csv		benchmark_results.csv
performance_dashboard.png		performance_dashboard.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPU-Accelerated Parallel Compression Engine (CUDA Huffman)

🚀 Key Features

🏗️ Project Architecture

📂 Directory Structure

🛠️ Performance Engineering

📊 Benchmarking Results

Throughput (GB/s)

💻 Getting Started

Prerequisites

Upgrading CUDA

Windows Quick Start

Notes

📚 Lessons Learned & Bottlenecks

📄 License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GPU-Accelerated Parallel Compression Engine (CUDA Huffman)

🚀 Key Features

🏗️ Project Architecture

📂 Directory Structure

🛠️ Performance Engineering

📊 Benchmarking Results

Throughput (GB/s)

💻 Getting Started

Prerequisites

Upgrading CUDA

Windows Quick Start

Notes

📚 Lessons Learned & Bottlenecks

📄 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages