Skip to content

Ganesh-403/GPU-huffman

Repository files navigation

GPU-Accelerated Parallel Compression Engine (CUDA Huffman)

License: MIT CUDA C++17 OpenMP

A high-performance, production-grade GPU acceleration project demonstrating advanced CUDA optimization techniques for Huffman Encoding. This project achieves massive throughput gains over traditional CPU-based frequency counting by leveraging the massive parallelism of NVIDIA GPUs.

Performance Dashboard

🚀 Key Features

  • Advanced GPU Optimization: Includes multiple kernel implementations (Naive, Shared Memory, Warp-Level Aggregation).
  • Hybrid Benchmarking: Real-time comparison between Single-threaded CPU, Multi-threaded OpenMP, and various CUDA kernels.
  • Production-Grade Architecture: Clean, modular C++17 design with RAII memory management and interface-driven design.
  • Measurable Performance: Built-in throughput (GB/s) metrics and latency analysis.
  • Data Visualization: Integrated Python scripts for generating interactive performance dashboards.
  • Scalability: Tested on datasets ranging from small text files to multi-GB binary blobs.

🏗️ Project Architecture

graph TD
    A[src/main.cpp] --> B[BenchmarkHarness]
    B --> C[IFrequencyCounter Interface]
    C --> D[CpuFrequencyCounter]
    C --> E[OpenMpFrequencyCounter]
    C --> F[CudaFrequencyCounter]
    F --> G[CUDA Kernels]
    G --> G1[Naive Global Atomics]
    G --> G2[Shared Memory Histograms]
    G --> G3[Warp-Level Aggregation]
    B --> H[CSV/JSON Reporter]
    H --> I[Python Visualization]
Loading

📂 Directory Structure

The repository is organized as follows:

  • src/: All C++ and CUDA source files.
    • core/: Base Huffman tree node representations and code building.
    • cpu/: Host serial and OpenMP multi-threaded counter implementations.
    • cuda/: GPU kernel code and CPU fallback drivers.
    • utils/: RAII helpers (GpuBuffer, UnifiedBuffer, GpuStream) and log utilities.
    • benchmark/: Test configuration and latency recording harness.
  • data/: Text datasets and inputs for benchmarking.
  • scripts/: Support tools (including dashboard visualization).
  • docs/: Development guides and optimization logs.

🛠️ Performance Engineering

This project goes beyond simple "clean code." It implements several state-of-the-art GPU optimization strategies:

Optimization Level Technique Rationale
L1: Naive Global Atomics Baseline implementation using atomicAdd on global memory. High contention.
L2: Shared Per-Block Histograms Reduces global memory traffic by aggregating counts in high-speed L1/Shared memory.
L3: Warp Multi-Bank Histograms Further reduces shared memory bank conflicts by using interleaved sub-histograms per warp.
L4: Async CUDA Streams Overlaps data transfer (H2D/D2H) with kernel execution for hidden latency.
L5: Managed Unified Memory Simplifies developer programmability using a single managed address space with async prefetching.

📊 Benchmarking Results

On an NVIDIA RTX GPU, the Shared Memory implementation typically achieves 10x - 50x speedup over optimized CPU implementations for large datasets.

Throughput (GB/s)

  • CPU Serial: ~0.4 GB/s
  • CPU OpenMP (16 threads): ~3.2 GB/s
  • CUDA Shared Memory: ~45+ GB/s

💻 Getting Started

Prerequisites

  • CUDA Toolkit: 13.2+ for VS 2026 support (CUDA 13.1 only supports VS 2019–2022)
  • C++17 Compatible Compiler (MSVC 2019+ or GCC 9+)
  • CMake 3.18+
  • Python 3.8+ (for visualization)
  • Visual Studio 2026 Build Tools with the MSVC toolset for GPU builds on Windows

Note on Windows & CUDA:

  • The CUDA build path requires MSVC as the host compiler.
  • CUDA 13.1 does not support VS 2026; you must upgrade to CUDA 13.2+ or use VS 2022/2019.
  • If CMake detects an unsupported CUDA/compiler pair, it automatically falls back to a CPU-only implementation so the executable still builds and runs.

Upgrading CUDA

If you have VS 2026 installed and CUDA 13.1, download and install CUDA Toolkit 13.2+ or the latest version. Then reconfigure:

rmdir /s /q build-msvc
cmake -S . -B build-msvc

Windows Quick Start

Open the VS Code terminal with the Developer Command Prompt for VS 2026 profile, then run:

cd /d "D:\Projects\GPU huffman"
cmake -S . -B build-msvc
cmake --build build-msvc

If you are using a regular PowerShell terminal, CMake may select MinGW instead of MSVC and the project will fall back to the CPU implementation.

Run the sample benchmark:

.\build-msvc\bin\gpu_huffman.exe .\data\big.txt

Or run your own input file:

.\build-msvc\bin\gpu_huffman.exe D:\path\to\your\file.txt

Generate the visual report:

python .\scripts\visualize_results.py

Notes

  • The default build-msvc folder keeps the MSVC configuration separate from the earlier MinGW build tree.
  • If you want the GPU/CUDA path, always configure from the Developer Command Prompt, not a regular PowerShell or CMD session.

📚 Lessons Learned & Bottlenecks

  1. Memory Contention: In frequency counting, multiple threads frequently hit the same buckets (e.g., 'e' or space). Shared memory atomics are essential to prevent global memory serialization.
  2. Kernel Launch Overhead: For inputs < 1MB, the CPU often wins due to the 5-10µs overhead of launching a CUDA kernel.
  3. PCIe Latency: Data transfer (H2D) is the primary bottleneck for small to medium files. Pinning memory and using async streams can mitigate this.

📄 License

Distributed under the MIT License. See LICENSE for more information.


Author: Ganesh Kambli

About

CUDA-accelerated Huffman encoding project in C++ that compares GPU vs CPU character-frequency counting, builds Huffman codes, and benchmarks performance with clear timing metrics.

Topics

Resources

Stars

Watchers

Forks

Contributors