This document provides a deep dive into the architectural design of the GPU Compute Benchmark Tool, explaining key design decisions, patterns, and implementation details.
- Compute backends are completely isolated from visualization
- Each backend implements the same abstract interface
- Benchmarks are API-agnostic, backends provide implementations
- Runtime capability detection, not compile-time assumptions
- Same executable runs on different hardware
- Graceful degradation when features unavailable
- Identical algorithms across all backends
- Same workload sizes and memory patterns
- Separate timing for compute vs memory transfer
- GPU-side synchronization where possible
- Avoid readbacks during active benchmarking
- High-resolution timers, isolated from OS scheduling jitter
- Easy to add new benchmarks
- Easy to add new backends
- Modular component replacement
┌─────────────────────────────────────────────────────────────┐
│ LAYER 4: APPLICATION │
│ ├─ main.cpp : Entry point, initialization │
│ ├─ GUI.cpp : User interface and event handling │
│ └─ CLI.cpp : Command-line argument parsing │
└─────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: BENCHMARK ORCHESTRATION │
│ ├─ BenchmarkRunner : Coordinates execution │
│ ├─ Logger : Results collection and export │
│ └─ SystemInfo : Hardware capability queries │
└─────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2: ABSTRACTION (Core Framework) │
│ ├─ IComputeBackend : Abstract interface for all GPUs │
│ ├─ IBenchmark : Abstract benchmark definition │
│ ├─ Timer : High-resolution timing │
│ └─ DeviceDiscovery : Runtime GPU and API detection │
└─────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: IMPLEMENTATION (Backends) │
│ ├─ CUDABackend : NVIDIA CUDA implementation │
│ ├─ OpenCLBackend : OpenCL implementation │
│ ├─ DirectComputeBackend: DirectCompute implementation │
│ └─ Renderer : OpenGL visualization (separate) │
└─────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ LAYER 0: HARDWARE │
│ └─ GPU Driver → GPU Hardware │
└─────────────────────────────────────────────────────────────┘
Purpose: Defines the contract that all compute backends must implement.
Key Methods:
class IComputeBackend {
public:
    virtual ~IComputeBackend() = default; // Virtual destructor: backends are deleted through this interface
    virtual bool Initialize() = 0;
    virtual void Shutdown() = 0;
    virtual void* AllocateMemory(size_t size) = 0;
    virtual void FreeMemory(void* ptr) = 0;
    virtual void CopyHostToDevice(void* dst, const void* src, size_t size) = 0;
    virtual void CopyDeviceToHost(void* dst, const void* src, size_t size) = 0;
    virtual void ExecuteKernel(const std::string& kernelName,
                               const KernelParams& params) = 0;
    virtual void Synchronize() = 0;
    virtual std::string GetDeviceName() = 0;
    virtual size_t GetDeviceMemory() = 0;
};
Why This Design?
- Polymorphism allows treating all backends uniformly
- BenchmarkRunner doesn't need to know which backend it's using
- Easy to add new backends (just implement the interface)
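The uniform treatment via polymorphism can be sketched as follows. `IBackend`, `MockBackend`, and `InitializeAll` below are illustrative stand-ins (a trimmed version of `IComputeBackend`), not the tool's actual classes:

```cpp
#include <memory>
#include <string>
#include <vector>

// Trimmed-down interface mirroring IComputeBackend
struct IBackend {
    virtual ~IBackend() = default;
    virtual bool Initialize() = 0;
    virtual std::string GetDeviceName() = 0;
};

// Hypothetical stand-in; a real backend would talk to the GPU driver
struct MockBackend : IBackend {
    bool Initialize() override { return true; }
    std::string GetDeviceName() override { return "MockGPU"; }
};

// The runner sees only IBackend*, never the concrete type
std::vector<std::string> InitializeAll(std::vector<std::unique_ptr<IBackend>>& backends) {
    std::vector<std::string> names;
    for (auto& b : backends)
        if (b->Initialize())
            names.push_back(b->GetDeviceName());
    return names;
}
```

Adding a fourth backend (say, Vulkan compute) would touch only a new derived class, not the runner loop.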
Purpose: Orchestrates benchmark execution across multiple backends.
Workflow:
1. Discovery Phase
├─ Query available backends
├─ Initialize each backend
└─ Report capabilities
2. Execution Phase
For each benchmark:
For each backend:
├─ Allocate memory
├─ Copy data to device
├─ START_TIMER
├─ Execute kernel
├─ Synchronize
├─ STOP_TIMER
├─ Copy results back
└─ Verify correctness
3. Results Phase
├─ Aggregate timing data
├─ Calculate statistics
└─ Export results
Key Features:
- Automatic warmup runs (to let GPU clocks ramp up before measuring)
- Multiple iterations for statistical significance
- Result verification (ensures correctness)
- Timeout protection (prevents hangs)
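The Results Phase aggregation above can be sketched in plain C++. `Summarize` is an illustrative helper, not the tool's actual Logger API:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Stats { double mean, median, stddev; };

// Aggregate per-iteration timings: mean, median, and population std dev
Stats Summarize(std::vector<double> samples) {
    Stats s{0.0, 0.0, 0.0};
    const size_t n = samples.size();
    for (double v : samples) s.mean += v;
    s.mean /= n;
    std::sort(samples.begin(), samples.end());
    s.median = (n % 2) ? samples[n / 2]
                       : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    for (double v : samples) s.stddev += (v - s.mean) * (v - s.mean);
    s.stddev = std::sqrt(s.stddev / n);
    return s;
}
```

The median is more robust than the mean against the occasional slow iteration caused by clock ramp-up or background work.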
Challenge: Accurately measuring GPU operations
Problem: CPU timers don't account for GPU asynchrony
Solution: Two-level timing strategy
class Timer {
// CPU-side timing (for host overhead)
LARGE_INTEGER cpuStart, cpuEnd, frequency;
// GPU-side timing (API-specific)
void* gpuStartEvent;
void* gpuEndEvent;
public:
void StartCPU();
void StopCPU();
double GetCPUTimeMS();
void StartGPU();
void StopGPU();
double GetGPUTimeMS();
};
Backend-Specific GPU Timing:
CUDA:
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start);
// ... kernel execution ...
cudaEventRecord(end);
cudaEventSynchronize(end);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, end);
OpenCL:
cl_event event;
clEnqueueNDRangeKernel(..., &event);
clWaitForEvents(1, &event);
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
double milliseconds = (end - start) / 1e6; // profiling info is in nanoseconds
DirectCompute:
// Use D3D11 timestamp queries, plus a disjoint query to obtain the tick frequency
ID3D11Query *startQuery, *endQuery, *disjointQuery;
D3D11_QUERY_DESC queryDesc = { D3D11_QUERY_TIMESTAMP, 0 };
D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
device->CreateQuery(&queryDesc, &startQuery);
device->CreateQuery(&queryDesc, &endQuery);
device->CreateQuery(&disjointDesc, &disjointQuery);
context->Begin(disjointQuery);
context->End(startQuery);
// ... dispatch compute shader ...
context->End(endQuery);
context->End(disjointQuery);
// GetData returns S_FALSE until the GPU has produced results, so poll
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjoint;
while (context->GetData(disjointQuery, &disjoint, sizeof(disjoint), 0) == S_FALSE) {}
UINT64 startTime, endTime;
context->GetData(startQuery, &startTime, sizeof(UINT64), 0);
context->GetData(endQuery, &endTime, sizeof(UINT64), 0);
double milliseconds = (endTime - startTime) * 1000.0 / disjoint.Frequency;
Design Decision: Why separate from compute?
Reason: Mixing compute and rendering on same context can cause:
- Performance interference
- Driver state pollution
- Timing measurement corruption
Architecture:
┌─────────────────────┐
│ Benchmark Results │ (CPU-side storage)
│ ├─ Timing data │
│ ├─ Bandwidth stats │
│ └─ Error rates │
└──────────┬──────────┘
│
↓
┌──────────────┐
│ Renderer │
│ (OpenGL) │
└──────┬───────┘
│
↓ (Vertex data)
┌──────────────┐
│ GPU │
│ (Rendering │
│ Pipeline) │
└──────────────┘
Rendering Pipeline:
- Data Preparation: Format results as vertex buffers
- Vertex Processing: Position bar graphs/lines
- Fragment Shading: Color coding by performance
- Compositing: Final display with UI overlay
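The Data Preparation step can be sketched as a normalization pass. `BarHeights` is an illustrative helper, not the Renderer's actual API; it maps timings to bar heights in [0, 1] ready for a vertex buffer:

```cpp
#include <algorithm>
#include <vector>

// Convert per-backend timings (ms) to normalized bar heights in [0, 1]
std::vector<float> BarHeights(const std::vector<float>& timesMS) {
    float maxT = 0.0f;
    for (float t : timesMS) maxT = std::max(maxT, t);
    std::vector<float> heights;
    heights.reserve(timesMS.size());
    for (float t : timesMS)
        heights.push_back(maxT > 0.0f ? t / maxT : 0.0f);
    return heights;
}
```

The vertex shader then only needs a screen-space transform; no per-frame recomputation on the CPU.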
Shaders:
vertex.glsl: Transforms benchmark data to screen space
fragment.glsl: Colors based on performance thresholds
File: src/backends/cuda/CUDABackend.cpp
Initialization:
bool CUDABackend::Initialize() {
// 1. Check CUDA availability
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0) return false;
// 2. Query device properties
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// 3. Check compute capability (need 3.0+)
if (prop.major < 3) return false;
// 4. Set device
cudaSetDevice(0);
return true;
}
Memory Management:
void* CUDABackend::AllocateMemory(size_t size) {
void* ptr = nullptr;
cudaMalloc(&ptr, size);
return ptr; // Returns device pointer
}
Kernel Execution:
void CUDABackend::ExecuteKernel(const std::string& name, const KernelParams& params) {
// Calculate grid/block dimensions
dim3 block(256); // 256 threads per block
dim3 grid((params.numElements + block.x - 1) / block.x);
// Launch appropriate kernel
if (name == "vector_add") {
vectorAddKernel<<<grid, block>>>(params.input1, params.input2,
params.output, params.numElements);
}
// ... other kernels ...
}
File: src/backends/opencl/OpenCLBackend.cpp
Initialization (More Complex):
bool OpenCLBackend::Initialize() {
// 1. Get platform (NVIDIA, AMD, Intel, etc.)
cl_uint numPlatforms;
clGetPlatformIDs(0, nullptr, &numPlatforms);
if (numPlatforms == 0) return false;
std::vector<cl_platform_id> platforms(numPlatforms);
clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);
// 2. Get GPU device
cl_uint numDevices;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, nullptr, &numDevices);
if (numDevices == 0) return false;
std::vector<cl_device_id> devices(numDevices);
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices.data(), nullptr);
// 3. Create context
context = clCreateContext(nullptr, 1, &devices[0], nullptr, nullptr, nullptr);
// 4. Create command queue with profiling enabled
queue = clCreateCommandQueue(context, devices[0], CL_QUEUE_PROFILING_ENABLE, nullptr);
// 5. Load and compile kernels from source strings
LoadAndCompileKernels();
return true;
}
Runtime Kernel Compilation:
void OpenCLBackend::LoadAndCompileKernels() {
// Kernel source is embedded as string
const char* vectorAddSource = R"(
__kernel void vector_add(__global const float* a,
__global const float* b,
__global float* c,
int n) {
int i = get_global_id(0);
if (i < n) {
c[i] = a[i] + b[i];
}
}
)";
// Create program from source
cl_program program = clCreateProgramWithSource(context, 1, &vectorAddSource, nullptr, nullptr);
// Compile
clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
// Extract kernel
kernels["vector_add"] = clCreateKernel(program, "vector_add", nullptr);
}
File: src/backends/directcompute/DirectComputeBackend.cpp
Initialization:
bool DirectComputeBackend::Initialize() {
// 1. Create D3D11 device
D3D_FEATURE_LEVEL featureLevel;
HRESULT hr = D3D11CreateDevice(
nullptr, // Default adapter
D3D_DRIVER_TYPE_HARDWARE, // Hardware acceleration
nullptr,
0,
nullptr, 0,
D3D11_SDK_VERSION,
&device,
&featureLevel,
&context
);
if (FAILED(hr)) return false;
// 2. Check compute shader support (need 11.0+)
if (featureLevel < D3D_FEATURE_LEVEL_11_0) return false;
// 3. Load and compile HLSL shaders
LoadShaders();
return true;
}
HLSL Shader Compilation:
void DirectComputeBackend::LoadShaders() {
// Read shader from file
std::ifstream shaderFile("shaders/vector_add.hlsl");
std::string shaderCode((std::istreambuf_iterator<char>(shaderFile)),
std::istreambuf_iterator<char>());
// Compile HLSL to bytecode
ID3DBlob* shaderBlob = nullptr;
ID3DBlob* errorBlob = nullptr;
HRESULT hr = D3DCompile(
shaderCode.c_str(),
shaderCode.length(),
"vector_add.hlsl",
nullptr,
nullptr,
"CSMain", // Entry point
"cs_5_0", // Compute shader 5.0
0, 0,
&shaderBlob,
&errorBlob
);
if (FAILED(hr)) {
// Handle compilation error
return;
}
// Create compute shader
ID3D11ComputeShader* shader;
device->CreateComputeShader(
shaderBlob->GetBufferPointer(),
shaderBlob->GetBufferSize(),
nullptr,
&shader
);
computeShaders["vector_add"] = shader;
}
Buffer Creation:
void* DirectComputeBackend::AllocateMemory(size_t size) {
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = static_cast<UINT>(size); // D3D11 buffer sizes are 32-bit
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
desc.StructureByteStride = sizeof(float);
ID3D11Buffer* buffer;
device->CreateBuffer(&desc, nullptr, &buffer);
return buffer;
}
Each benchmark follows this structure:
class VectorAddBenchmark : public IBenchmark {
public:
void Setup(IComputeBackend* backend) override {
// 1. Allocate host memory
hostInputA = new float[size];
hostInputB = new float[size];
hostOutput = new float[size];
// 2. Initialize data
for (int i = 0; i < size; i++) {
hostInputA[i] = static_cast<float>(i);
hostInputB[i] = static_cast<float>(i * 2);
}
// 3. Allocate device memory
deviceInputA = backend->AllocateMemory(size * sizeof(float));
deviceInputB = backend->AllocateMemory(size * sizeof(float));
deviceOutput = backend->AllocateMemory(size * sizeof(float));
// 4. Copy data to device
backend->CopyHostToDevice(deviceInputA, hostInputA, size * sizeof(float));
backend->CopyHostToDevice(deviceInputB, hostInputB, size * sizeof(float));
}
BenchmarkResult Run(IComputeBackend* backend) override {
BenchmarkResult result;
// Warmup (stabilize GPU clocks)
for (int i = 0; i < 3; i++) {
backend->ExecuteKernel("vector_add", params);
backend->Synchronize();
}
// Actual benchmark
Timer timer;
timer.StartGPU();
for (int i = 0; i < iterations; i++) {
backend->ExecuteKernel("vector_add", params);
}
backend->Synchronize();
timer.StopGPU();
result.executionTimeMS = timer.GetGPUTimeMS() / iterations;
// Measure memory transfer
timer.StartCPU();
backend->CopyDeviceToHost(hostOutput, deviceOutput, size * sizeof(float));
timer.StopCPU();
result.transferTimeMS = timer.GetCPUTimeMS();
// Verify results
result.correct = VerifyResults();
return result;
}
bool VerifyResults() override {
for (int i = 0; i < size; i++) {
float expected = hostInputA[i] + hostInputB[i];
if (std::fabs(hostOutput[i] - expected) > 0.001f) {
return false;
}
}
return true;
}
};
File: src/core/DeviceDiscovery.cpp
Process:
struct SystemCapabilities {
bool cudaAvailable;
bool openclAvailable;
bool directComputeAvailable;
std::string gpuName;
size_t gpuMemoryMB;
std::string driverVersion;
};
SystemCapabilities DiscoverCapabilities() {
SystemCapabilities caps;
// 1. Try CUDA
caps.cudaAvailable = TestCUDAAvailability();
// 2. Try OpenCL
caps.openclAvailable = TestOpenCLAvailability();
// 3. Try DirectCompute
caps.directComputeAvailable = TestDirectComputeAvailability();
// 4. Query GPU info
caps.gpuName = GetGPUName();
caps.gpuMemoryMB = GetGPUMemoryMB();
caps.driverVersion = GetDriverVersion();
return caps;
}
Why Runtime Detection?
- Same .exe works on NVIDIA, AMD, and Intel GPUs
- Graceful degradation (if CUDA unavailable, use OpenCL)
- Professional error messages instead of crashes
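The fallback logic can be sketched as a priority chain. The CUDA → OpenCL → DirectCompute preference order and the `PickBackend` helper are illustrative assumptions, not the tool's confirmed policy:

```cpp
#include <string>

struct Capabilities {
    bool cudaAvailable;
    bool openclAvailable;
    bool directComputeAvailable;
};

// Graceful degradation: try the preferred API first, fall through otherwise
std::string PickBackend(const Capabilities& caps) {
    if (caps.cudaAvailable)          return "CUDA";
    if (caps.openclAvailable)        return "OpenCL";
    if (caps.directComputeAvailable) return "DirectCompute";
    return "none"; // report a clear error instead of crashing
}
```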
Problem: Uncoalesced memory access kills performance
Solution: Ensure stride-1 access patterns
CUDA Example:
// BAD: Strided access
__global__ void badKernel(float* data, int stride) {
int i = threadIdx.x * stride; // Non-coalesced!
data[i] = ...;
}
// GOOD: Coalesced access
__global__ void goodKernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Adjacent threads access adjacent memory
    data[i] = ...;
}
Problem: Multiple threads accessing the same bank causes serialization
Solution: Pad shared memory or use offset indexing
__shared__ float sharedData[256 + 16]; // Padding avoids bank conflicts
Goal: Keep GPU fully utilized
Factors:
- Registers per thread
- Shared memory per block
- Block size
Tool: CUDA Occupancy Calculator (included in CUDA Toolkit)
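The occupancy arithmetic behind that calculator can be sketched as a min over resource budgets. The per-SM limits below (64K registers, 48 KB shared memory, 16 resident blocks) are illustrative defaults, not the values for any specific GPU:

```cpp
#include <algorithm>

// Rough estimate: how many blocks fit on one SM given register and
// shared-memory budgets (limits are illustrative, not for a specific GPU)
int BlocksPerSM(int regsPerThread, int threadsPerBlock, int smemPerBlock,
                int regsPerSM = 65536, int smemPerSM = 49152, int maxBlocks = 16) {
    int byRegs = regsPerSM / (regsPerThread * threadsPerBlock);
    int bySmem = smemPerBlock > 0 ? smemPerSM / smemPerBlock : maxBlocks;
    return std::min({byRegs, bySmem, maxBlocks});
}
```

With 32 registers per thread and 256-thread blocks, registers cap residency at 8 blocks per SM; adding 12 KB of shared memory per block drops that to 4, which is why trimming either resource can raise occupancy.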
Different algorithms (CUDA/OpenCL/DirectCompute) for same task, selected at runtime.
Benchmark base class defines workflow; derived classes implement specifics.
Simplified interface hiding complex backend interactions.
IComputeBackend* CreateBackend(BackendType type) {
    switch (type) {
        case CUDA: return new CUDABackend();
        case OpenCL: return new OpenCLBackend();
        case DirectCompute: return new DirectComputeBackend();
    }
    return nullptr; // Unknown backend type
}
Automatic cleanup in destructors prevents leaks.
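The RAII idea can be sketched as a small owning wrapper. `DeviceBuffer` is illustrative, not a class from the tool, and `malloc`/`free` stand in for `backend->AllocateMemory`/`FreeMemory` so the sketch runs without a GPU:

```cpp
#include <cstddef>
#include <cstdlib>

// Owns one device allocation; freeing happens in the destructor,
// so early returns and exceptions cannot leak the buffer
class DeviceBuffer {
    void* ptr_;
public:
    explicit DeviceBuffer(std::size_t size)
        : ptr_(std::malloc(size)) {}      // stand-in for backend->AllocateMemory(size)
    ~DeviceBuffer() { std::free(ptr_); }  // stand-in for backend->FreeMemory(ptr_)
    DeviceBuffer(const DeviceBuffer&) = delete;            // no double-free
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    void* get() const { return ptr_; }
};
```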
1. Initialization Errors (Expected):
- Backend unavailable → Disable, report to user
- Example: "CUDA not available - NVIDIA GPU required"
2. Runtime Errors (Unexpected):
- Out of memory → Report, skip benchmark
- Kernel compilation failed → Log error, continue with other backends
3. Critical Errors (Fatal):
- Driver crash → Terminate gracefully with diagnostic info
enum class ErrorSeverity {
INFO, // Normal operation
WARNING, // Degraded functionality
ERROR, // Feature unavailable
CRITICAL // Application cannot continue
};
void LogError(ErrorSeverity severity, const std::string& message) {
std::cout << "[" << SeverityToString(severity) << "] " << message << std::endl;
if (severity == ErrorSeverity::CRITICAL) {
// Save diagnostic info
// Prompt user
// Exit gracefully
}
}
Benchmark Execution
↓
Raw Timing Data
↓
Statistical Analysis (mean, median, std dev)
↓
┌─────────────┬──────────────┐
↓ ↓ ↓
Display CSV Export Visualization
(Console) (File) (OpenGL)
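The CSV Export branch can be sketched as follows. The `Row` struct and column layout are illustrative, not the Logger's actual schema:

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Row { std::string benchmark, backend; double meanMS; };

// Serialize aggregated results as CSV (header row + one line per result)
std::string ToCSV(const std::vector<Row>& rows) {
    std::ostringstream out;
    out << "benchmark,backend,mean_ms\n";
    for (const auto& r : rows)
        out << r.benchmark << ',' << r.backend << ',' << r.meanMS << '\n';
    return out.str();
}
```

Writing to a string first keeps the formatter testable; the file write is then a one-line `ofstream` dump.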
Single-threaded design (for simplicity and timing accuracy)
- Main thread handles UI and orchestration
- GPU executes asynchronously (but we synchronize for timing)
- No CPU parallelism (would complicate benchmarking)
Future Enhancement: Multi-threaded backend execution (run all backends in parallel)
- GPU APIs are fundamentally similar: Memory allocation, kernel launch, synchronization
- Timing is subtle: Need GPU-side events, not CPU-side timers
- Warmup matters: First run is slower due to GPU frequency scaling
- Verification is essential: Easy to get incorrect results fast
- Abstraction has cost: But enables clean architecture
Next: Read individual backend READMEs for API-specific details.