A bounded, single-producer single-consumer lock-free queue implemented in C++17.
Lock-free queues enable high-throughput producer-consumer patterns without mutex contention. In GPU systems, they're used in data loading pipelines, multi-GPU coordination, and anywhere the CPU must keep pace with GPU compute without becoming a bottleneck.
```cpp
#include "spsc_queue.hpp"

SPSCQueue<int, 1024> queue;

// Producer thread
queue.push(42);

// Consumer thread
int val;
if (queue.pop(val)) {
    // use val
}
```

Without atomics, concurrent reads and writes to head_ and tail_ are data races, which is undefined behavior in C++. But atomics alone aren't enough. Without memory ordering constraints, the compiler and CPU are free to reorder operations, so a consumer might see an updated tail_ index before the data at that index is actually written. Acquire-release ordering ensures that when the consumer sees a new tail_, the corresponding buffer write is guaranteed to be visible.
```cpp
bool push(const T& item) {
    // The producer is the only writer to tail_.
    // Reading your own variable needs no synchronization.
    const size_t tail = tail_.load(std::memory_order_relaxed);
    const size_t next_tail = (tail + 1) % Capacity;

    // Pairs with the consumer's release store to head_.
    // Ensures the producer sees the consumer's progress.
    const size_t head = head_.load(std::memory_order_acquire);
    if (next_tail == head) return false;  // full

    buffer_[tail] = item;
    // Ensures the buffer write is visible to any consumer that reads this tail_.
    tail_.store(next_tail, std::memory_order_release);
    return true;
}

bool pop(T& item) {
    // The consumer is the only writer to head_.
    const size_t head = head_.load(std::memory_order_relaxed);

    // Pairs with the producer's release store to tail_.
    const size_t tail = tail_.load(std::memory_order_acquire);
    if (head == tail) return false;  // empty

    item = buffer_[head];
    // Ensures the consumer's read is complete before the producer can reuse this slot.
    head_.store((head + 1) % Capacity, std::memory_order_release);
    return true;
}
```

Sequential consistency (std::memory_order_seq_cst) provides a total order of operations across all threads. This is expensive and unnecessary here. SPSC only requires pairwise synchronization: the producer publishes to tail_, the consumer observes it; the consumer publishes to head_, the producer observes it. Acquire-release handles these two independent channels without the overhead of global ordering.
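Putting the two halves together, a minimal standalone sketch of the queue described above (the real header is spsc_queue.hpp; this hypothetical version exists only to make the push/pop pairing concrete):

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer. One slot is kept empty so that
// head_ == tail_ unambiguously means "empty" (full is next_tail == head_),
// which is why usable capacity is Capacity - 1.
template <typename T, size_t Capacity>
class SPSCQueue {
public:
    bool push(const T& item) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        const size_t next_tail = (tail + 1) % Capacity;
        if (next_tail == head_.load(std::memory_order_acquire)) return false;
        buffer_[tail] = item;
        tail_.store(next_tail, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        item = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```

Note that push fails (returns false) rather than blocking when the queue is full; callers that must not drop items typically retry in a loop.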
False sharing occurs when two variables on the same cache line are written by different cores. Each write invalidates the entire line on the other core, causing expensive cache coherency traffic even though the variables are logically independent.
In the basic SPSCQueue, head_ and tail_ sit adjacent in memory. The producer writes tail_, the consumer writes head_, and both bounce the cache line back and forth.
In SPSCQueueOpt, cache-line padding gives head_ and tail_ each their own line, so the producer core and the consumer core can write their respective indices without invalidating each other's cached copies.
```cpp
alignas(CACHE_LINE_SIZE) std::atomic<size_t> head_{0};
alignas(CACHE_LINE_SIZE) std::atomic<size_t> tail_{0};
```

Every push() reads head_ to check whether the queue is full, and every pop() reads tail_ to check whether it is empty. These are atomic loads from variables written by the other thread, so each one is a potential cache miss.
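The CACHE_LINE_SIZE constant is not shown in the excerpts above; one way it might be defined (a sketch, not necessarily what the repo does) is via C++17's std::hardware_destructive_interference_size, falling back to 64 bytes where the standard library doesn't provide it:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// 64 bytes is correct for x86-64 and most current ARM cores;
// the feature-test macro guards against standard libraries that
// don't ship the C++17 interference-size constants.
#ifdef __cpp_lib_hardware_interference_size
constexpr size_t CACHE_LINE_SIZE = std::hardware_destructive_interference_size;
#else
constexpr size_t CACHE_LINE_SIZE = 64;
#endif

// With alignas on each member, head_ and tail_ land on different
// cache lines, so writes to one never invalidate the other.
struct PaddedIndices {
    alignas(CACHE_LINE_SIZE) std::atomic<size_t> head_{0};
    alignas(CACHE_LINE_SIZE) std::atomic<size_t> tail_{0};
};

static_assert(sizeof(PaddedIndices) >= 2 * CACHE_LINE_SIZE,
              "each index must occupy its own cache line");
```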
The optimization is to cache the other thread's index locally. The cached value can only be behind reality (the other thread only moves forward), so it's always safe to use. We only reload the atomic when the cached value suggests a head and tail collision. This reduces atomic loads from one-per-operation to one-per-wrap-around in the common case.
```cpp
// Producer caches the consumer's head index.
if (next_tail == cached_head_) {
    cached_head_ = head_.load(std::memory_order_acquire);
    if (next_tail == cached_head_) return false;  // genuinely full
}
```

```cpp
// Modulo (slow: involves division):
next_tail = (tail + 1) % Capacity;

// Bitmask (fast: a single AND instruction):
next_tail = (tail + 1) & (Capacity - 1);
```

For power-of-2 Capacity, (Capacity - 1) is a bitmask of all 1s, so the AND keeps only the lower log2(Capacity) bits. This produces the same result as modulo without the costly division instruction. The static_assert in SPSCQueueOpt enforces the power-of-2 constraint at compile time.
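The power-of-2 check and the modulo/mask equivalence can both be verified at compile time. A sketch (the exact static_assert wording in SPSCQueueOpt may differ):

```cpp
#include <cstddef>

// A value is a usable power-of-2 capacity iff it is at least 2
// and has exactly one bit set: (n & (n - 1)) clears the lowest
// set bit, so the result is 0 only for powers of two.
constexpr bool is_pow2(size_t n) { return n >= 2 && (n & (n - 1)) == 0; }

// Hypothetical form of the queue's compile-time guard:
// static_assert(is_pow2(Capacity), "Capacity must be a power of 2");

constexpr size_t wrap_mod(size_t i, size_t cap)  { return i % cap; }
constexpr size_t wrap_mask(size_t i, size_t cap) { return i & (cap - 1); }

static_assert(is_pow2(1024) && !is_pow2(1000));
static_assert(wrap_mod(1023, 1024) == wrap_mask(1023, 1024));  // both 1023
static_assert(wrap_mod(1024, 1024) == wrap_mask(1024, 1024));  // both 0
```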
| Queue        | ops/sec     |
|--------------|-------------|
| SPSCQueue    | 29,399,749  |
| SPSCQueueOpt | 121,945,026 |
| MutexQueue   | 38,374,182  |
| Boost.SPSC   | 49,789,536  |
SPSCQueueOpt achieves 4.1x the throughput of the basic version and 2.4x that of Boost's implementation. The unoptimized SPSCQueue is actually slower than MutexQueue due to false sharing: the cache line bouncing costs more than the mutex.
| Queue        | op   | p50 | p99 | max (ns) |
|--------------|------|-----|-----|----------|
| SPSCQueue    | push | 41  | 42  | 1,792    |
|              | pop  | 41  | 42  | 208      |
| SPSCQueueOpt | push | 41  | 42  | 30,584   |
|              | pop  | 41  | 42  | 20,708   |
| MutexQueue   | push | 41  | 42  | 10,833   |
|              | pop  | 41  | 42  | 18,375   |
Single-threaded latency is nearly identical across implementations (~41ns). The differences appear in throughput under contention, not in isolated operation cost. The max values reflect OS scheduling noise, not the queue implementation.
```sh
mkdir build && cd build
cmake ..
make
./test_queues
./bench_queues
```

- C++ Reference: std::memory_order
- Preshing on Programming: Acquire and Release Semantics
- A Tour of C++ (Stroustrup), chapters on concurrency