A bounded, single-producer single-consumer lock-free queue implemented in C++17.
Lock-free queues enable high-throughput producer-consumer patterns without mutex contention. In GPU systems, they're used in data loading pipelines, multi-GPU coordination, and anywhere the CPU must keep pace with GPU compute without becoming a bottleneck.
```cpp
#include "spsc_queue.hpp"

SPSCQueue<int, 1024> queue;

// Producer thread
queue.push(42);

// Consumer thread
int val;
if (queue.pop(val)) {
    // use val
}
```

Without atomics, concurrent reads and writes to head_ and tail_ are data races, which is undefined behavior in C++. But atomics alone aren't enough. Without memory ordering constraints, the compiler and CPU are free to reorder operations, so a consumer might see an updated tail_ index before the data at that index is actually written. Acquire-release ordering ensures that when the consumer sees a new tail_, the corresponding buffer write is guaranteed to be visible.
```cpp
bool push(const T& item) {
    // The producer is the only writer to tail_.
    // Reading your own variable needs no synchronization.
    const size_t tail = tail_.load(std::memory_order_relaxed);
    const size_t next_tail = (tail + 1) % Capacity;

    // Pairs with the consumer's release store to head_.
    // Ensures the producer sees the consumer's progress.
    const size_t head = head_.load(std::memory_order_acquire);
    if (next_tail == head) return false;  // full

    buffer_[tail] = item;
    // Ensures the buffer write is visible to any consumer that reads this tail_.
    tail_.store(next_tail, std::memory_order_release);
    return true;
}

bool pop(T& item) {
    // The consumer is the only writer to head_.
    const size_t head = head_.load(std::memory_order_relaxed);

    // Pairs with the producer's release store to tail_.
    const size_t tail = tail_.load(std::memory_order_acquire);
    if (head == tail) return false;  // empty

    item = buffer_[head];
    // Ensures the consumer's read is complete before the producer can reuse this slot.
    head_.store((head + 1) % Capacity, std::memory_order_release);
    return true;
}
```

Sequential consistency (std::memory_order_seq_cst) provides a total order of operations across all threads. This is expensive and unnecessary here. SPSC only requires pairwise synchronization: the producer publishes to tail_, the consumer observes it; the consumer publishes to head_, the producer observes it. Acquire-release handles these two independent channels without the overhead of global ordering.
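Putting the two halves together, a minimal standalone sketch of the queue described above (the real header is spsc_queue.hpp; this hypothetical version exists only to make the push/pop pairing concrete):

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer. One slot is kept empty so that
// head_ == tail_ unambiguously means "empty" (full is next_tail == head_),
// which is why usable capacity is Capacity - 1.
template <typename T, size_t Capacity>
class SPSCQueue {
public:
    bool push(const T& item) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        const size_t next_tail = (tail + 1) % Capacity;
        if (next_tail == head_.load(std::memory_order_acquire)) return false;
        buffer_[tail] = item;
        tail_.store(next_tail, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        item = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```

Note that push fails (returns false) rather than blocking when the queue is full; callers that must not drop items typically retry in a loop.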
False sharing occurs when two variables on the same cache line are written by different cores. Each write invalidates the entire line on the other core, causing expensive cache coherency traffic even though the variables are logically independent.
In the basic SPSCQueue, head_ and tail_ sit adjacent in memory. The producer writes tail_, the consumer writes head_, and both bounce the cache line back and forth.
In SPSCQueueOpt, cache-line padding gives head_ and tail_ each their own line, so the producer core and the consumer core can write their respective indices without invalidating each other's cached copies.
```cpp
alignas(CACHE_LINE_SIZE) std::atomic<size_t> head_{0};
alignas(CACHE_LINE_SIZE) std::atomic<size_t> tail_{0};
```

Every push() reads head_ to check whether the queue is full, and every pop() reads tail_ to check whether it is empty. These are atomic loads from variables written by the other thread, so each one is a potential cache miss.
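The CACHE_LINE_SIZE constant is not shown in the excerpts above; one way it might be defined (a sketch, not necessarily what the repo does) is via C++17's std::hardware_destructive_interference_size, falling back to 64 bytes where the standard library doesn't provide it:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// 64 bytes is correct for x86-64 and most current ARM cores;
// the feature-test macro guards against standard libraries that
// don't ship the C++17 interference-size constants.
#ifdef __cpp_lib_hardware_interference_size
constexpr size_t CACHE_LINE_SIZE = std::hardware_destructive_interference_size;
#else
constexpr size_t CACHE_LINE_SIZE = 64;
#endif

// With alignas on each member, head_ and tail_ land on different
// cache lines, so writes to one never invalidate the other.
struct PaddedIndices {
    alignas(CACHE_LINE_SIZE) std::atomic<size_t> head_{0};
    alignas(CACHE_LINE_SIZE) std::atomic<size_t> tail_{0};
};

static_assert(sizeof(PaddedIndices) >= 2 * CACHE_LINE_SIZE,
              "each index must occupy its own cache line");
```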
The optimization is to cache the other thread's index locally. The cached value can only be behind reality (the other thread only moves forward), so it's always safe to use. We only reload the atomic when the cached value suggests a head and tail collision. This reduces atomic loads from one-per-operation to one-per-wrap-around in the common case.
```cpp
// Producer caches the consumer's head index.
if (next_tail == cached_head_) {
    cached_head_ = head_.load(std::memory_order_acquire);
    if (next_tail == cached_head_) return false;  // genuinely full
}
```

```cpp
// Modulo (slow: involves division):
next_tail = (tail + 1) % Capacity;

// Bitmask (fast: a single AND instruction):
next_tail = (tail + 1) & (Capacity - 1);
```

For power-of-2 Capacity, (Capacity - 1) is a bitmask of all 1s, so the AND keeps only the lower log2(Capacity) bits. This produces the same result as modulo without the costly division instruction. The static_assert in SPSCQueueOpt enforces the power-of-2 constraint at compile time.
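The power-of-2 check and the modulo/mask equivalence can both be verified at compile time. A sketch (the exact static_assert wording in SPSCQueueOpt may differ):

```cpp
#include <cstddef>

// A value is a usable power-of-2 capacity iff it is at least 2
// and has exactly one bit set: (n & (n - 1)) clears the lowest
// set bit, so the result is 0 only for powers of two.
constexpr bool is_pow2(size_t n) { return n >= 2 && (n & (n - 1)) == 0; }

// Hypothetical form of the queue's compile-time guard:
// static_assert(is_pow2(Capacity), "Capacity must be a power of 2");

constexpr size_t wrap_mod(size_t i, size_t cap)  { return i % cap; }
constexpr size_t wrap_mask(size_t i, size_t cap) { return i & (cap - 1); }

static_assert(is_pow2(1024) && !is_pow2(1000));
static_assert(wrap_mod(1023, 1024) == wrap_mask(1023, 1024));  // both 1023
static_assert(wrap_mod(1024, 1024) == wrap_mask(1024, 1024));  // both 0
```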
| Queue        | ops/sec     |
|--------------|-------------|
| SPSCQueue    | 29,399,749  |
| SPSCQueueOpt | 121,945,026 |
| MutexQueue   | 38,374,182  |
| Boost.SPSC   | 49,789,536  |
SPSCQueueOpt achieves 4.1x the throughput of the basic version and 2.4x that of Boost's implementation. The unoptimized SPSCQueue is actually slower than MutexQueue due to false sharing: the cache line bouncing costs more than the mutex.
| Queue        | op   | p50 | p99 | max (ns) |
|--------------|------|-----|-----|----------|
| SPSCQueue    | push | 41  | 42  | 1,792    |
|              | pop  | 41  | 42  | 208      |
| SPSCQueueOpt | push | 41  | 42  | 30,584   |
|              | pop  | 41  | 42  | 20,708   |
| MutexQueue   | push | 41  | 42  | 10,833   |
|              | pop  | 41  | 42  | 18,375   |
Single-threaded latency is nearly identical across implementations (~41ns). The differences appear in throughput under contention, not in isolated operation cost. The max values reflect OS scheduling noise, not the queue implementation.
```sh
mkdir build && cd build
cmake ..
make
./test_queues
./bench_queues
```

- C++ Reference: std::memory_order
- Preshing on Programming: Acquire and Release Semantics
- A Tour of C++ (Stroustrup), chapters on concurrency