Yet another SPSC blog! The goal isn’t just to explain a final solution, such as the lock-free, index-cached implementation some of you might already know. Instead, I want to walk through the iterative process: testing different approaches, analyzing the results, and determining how to squeeze out performance.
By walking through the analysis, I hope this process provides a template for tackling similar performance challenges in the future.
The optimization discussions are focused on the x86 architecture. All measurements were taken on an AMD Ryzen 7 6800HS using gcc 13.3.0. This CPU has a single CCX (Core Complex), and the benchmarks use cores 0 and 2 - two distinct physical cores, rather than hyperthreaded siblings.
Mutex-Based
Let’s start with a simple mutex implementation. It uses a ring buffer of fixed size; if the buffer is full, the producer’s push operation fails (returns false) until the consumer pops from the buffer.
| |
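As a rough sketch of what such a mutex-based ring buffer might look like (the member names `pushInd_`, `popInd_`, `cap_`, `buf_` follow the post's conventions, but the details are my reconstruction, not the author's exact code):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Minimal mutex-based SPSC sketch: one lock guards both indices.
template <typename T>
class MutexSpsc {
public:
    explicit MutexSpsc(std::size_t cap) : cap_(cap), buf_(cap) {}

    // Fails (returns false) when the buffer is full.
    bool push(const T& v) {
        std::lock_guard<std::mutex> lk(mtx_);
        if (pushInd_ - popInd_ == cap_) return false;  // full
        buf_[pushInd_ % cap_] = v;
        ++pushInd_;
        return true;
    }

    // Fails (returns false) when the buffer is empty.
    bool pop(T& v) {
        std::lock_guard<std::mutex> lk(mtx_);
        if (pushInd_ == popInd_) return false;  // empty
        v = buf_[popInd_ % cap_];
        ++popInd_;
        return true;
    }

private:
    std::mutex mtx_;
    std::uint64_t pushInd_ = 0;  // total elements ever pushed
    std::uint64_t popInd_ = 0;   // total elements ever popped
    std::size_t cap_;
    std::vector<T> buf_;
};
```

Note that the indices increase monotonically; the queue size is always `pushInd_ - popInd_`.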
Producer core: 0, Consumer core: 2
Queue size: 100,000
Iterations: 1,000,000 per benchmark
10 runs per benchmark for throughput statistics calculation
Queue Name Thru (ops/s) std-Dev
----------------------------------------
MutexSpsc 19,296,994 1,591,781
All benchmark results that follow correspond to the queue size 100,000 variant unless otherwise specified. The benchmark code includes other queue-size variants. Obviously these numbers will differ based on the size of the type T used. For all the benchmarks mentioned in this post, we are using T = int64_t.
But wait: we are just incrementing `pushInd_` and never taking a modulo. What if it overflows?
Why not do `pushInd_ = (pushInd_ + 1) % cap_` instead to avoid overflow concerns?
The issue with the modulo approach is that it creates ambiguity: when pushInd_ equals popInd_, we can't distinguish a full queue from an empty queue. One simple solution is to avoid this case altogether by blocking push when the size reaches cap_ - 1 instead.
However, overflow isn't really a concern here. The indices are uint64_t with a maximum value of 2^64 - 1. If we keep the capacity a power of 2, the overflow behavior is well defined and the logic works correctly despite overflow, because unsigned arithmetic wraps modulo 2^64, so the difference pushInd_ - popInd_ still yields the true queue size even after either index has wrapped.
Even if we don’t restrict the capacity to a power of 2, assuming we achieve a throughput of 1 billion push/pop operations per second, it would take approximately 585 years of continuous operation to overflow. Therefore, we can safely ignore the overflow possibility.
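To make the wraparound argument concrete, here is a small demonstration (my own, not from the post) that the size computation survives uint64_t overflow:

```cpp
#include <cassert>
#include <cstdint>

// Unsigned subtraction is defined modulo 2^64, so pushInd - popInd
// gives the correct queue size even after pushInd has wrapped past
// zero, as long as the true size fits in the type.
std::uint64_t queueSize(std::uint64_t pushInd, std::uint64_t popInd) {
    return pushInd - popInd;  // well-defined even across wraparound
}
```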
Code used for benchmarking
| |
Making It Lock-Free
If there’s no enforced ordering between two accesses to a single memory location from separate threads, one or both of those accesses is not atomic, and one or both is a write, then this is a data race and causes undefined behavior.
How do we make it lock free? In the previous mutex-based code, both producer and consumer threads access both pushInd_ and popInd_. To make this lock-free, we could make both indices atomic.
But do we really need to read both variables from each thread? Notice that what we actually need is the difference pushInd_ - popInd_ (the queue size) to check if the queue is full or empty. This suggests an alternative approach: instead of making both indices atomic, we could maintain an atomic size variable that both threads read and update.
- Option 1: Make push and pop indices atomic
- Option 2: Add size and only make it atomic
Let’s proceed with option 2 first. We’ll change the code to have an atomic size variable. Note that for std::atomic, the operator++ and operator-- are shorthand for fetch_add(1, std::memory_order_seq_cst) and fetch_sub(1, std::memory_order_seq_cst), respectively. So the below code defaults to using sequential consistency ordering.
| |
| |
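A sketch of what this "option 2" variant might look like (a reconstruction along the lines the post describes, not the exact code): the indices stay plain because each is written by only one thread, and only size_ is atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// SPSC with a single atomic size counter. size_++ and size_-- are
// shorthand for fetch_add/fetch_sub with seq_cst ordering, which on
// x86 compile to lock-prefixed instructions.
template <typename T>
class AtomicSizeSpsc {
public:
    explicit AtomicSizeSpsc(std::size_t cap) : cap_(cap), buf_(cap) {}

    bool push(const T& v) {               // producer thread only
        if (size_.load() == cap_) return false;   // full
        buf_[pushInd_ % cap_] = v;
        ++pushInd_;
        size_++;                          // lock-prefixed RMW on x86
        return true;
    }

    bool pop(T& v) {                      // consumer thread only
        if (size_.load() == 0) return false;      // empty
        v = buf_[popInd_ % cap_];
        ++popInd_;
        size_--;                          // lock-prefixed RMW on x86
        return true;
    }

private:
    std::size_t cap_;
    std::vector<T> buf_;
    std::uint64_t pushInd_ = 0;           // written by producer only
    std::uint64_t popInd_ = 0;            // written by consumer only
    std::atomic<std::uint64_t> size_{0};
};
```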
To verify thread safety, we can run a simple test with -fsanitize=thread.
Test code
| |
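Such a test might look like the following sketch (my reconstruction of a TSan-style stress test, not the post's benchmark code; the minimal mutex queue here just keeps the example self-contained, and any queue with the same try-style push/pop would slot in):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Minimal mutex-based queue so the test below is self-contained.
class MiniQueue {
public:
    explicit MiniQueue(std::size_t cap) : cap_(cap), buf_(cap) {}
    bool push(std::int64_t v) {
        std::lock_guard<std::mutex> lk(mtx_);
        if (pushInd_ - popInd_ == cap_) return false;
        buf_[pushInd_++ % cap_] = v;
        return true;
    }
    bool pop(std::int64_t& v) {
        std::lock_guard<std::mutex> lk(mtx_);
        if (pushInd_ == popInd_) return false;
        v = buf_[popInd_++ % cap_];
        return true;
    }
private:
    std::mutex mtx_;
    std::uint64_t pushInd_ = 0, popInd_ = 0;
    std::size_t cap_;
    std::vector<std::int64_t> buf_;
};

// Compile with: g++ -O2 -fsanitize=thread test.cpp -lpthread
// One producer, one consumer; values must come out in push order.
template <typename Queue>
bool stressTest(Queue& q, std::int64_t n) {
    std::atomic<bool> ok{true};
    std::thread producer([&] {
        for (std::int64_t i = 0; i < n; ++i)
            while (!q.push(i)) {}           // spin until there is space
    });
    std::thread consumer([&] {
        for (std::int64_t i = 0; i < n; ++i) {
            std::int64_t v;
            while (!q.pop(v)) {}            // spin until data arrives
            if (v != i) ok = false;         // must arrive in order
        }
    });
    producer.join();
    consumer.join();
    return ok.load();
}
```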
Ok, now how do we optimize further? Let’s try changing the memory orderings to acquire-release.
| |
| |
But there is no real improvement (the difference is just noise and within std-Dev) in throughput due to this change.
Why no throughput improvement
Looking at the machine code, both versions generate the same instructions. Changing the memory ordering has no effect: the machine code generated for size_-- (i.e., size_.fetch_sub(1, std::memory_order_seq_cst)) is lock sub, and the same lock-prefixed instruction is used for all fetch_add and fetch_sub operations irrespective of the memory ordering.
| |
Lock instructions are slow for two main reasons:
1. The core must send an invalidation broadcast across the CPU's internal interconnect (the communication bus between the cores' private caches and the L3 cache), instructing other cores to invalidate their cached copies. The core then waits for acknowledgments from the other cores and tells the cache controller: "I am performing an atomic operation; do not release this cache line until I am finished."
2. Modern x86 processors are out-of-order execution machines: they constantly reorder instructions and use a store buffer to delay writing data to memory so the CPU doesn't have to wait. The lock instruction acts as a full memory barrier, forcing the CPU to drain its store buffer and stalling the pipeline until the lock instruction completes. This significantly reduces the throughput of an otherwise out-of-order system.
What now
Okay, so the takeaway is that the lock instruction is slow; how can we avoid it? The reason the size_++ and size_-- operations require a lock instruction is that they are read-modify-write operations, so the core needs cache-line exclusivity for the whole read-modify-write. In contrast, a simple store like size_.store(1, std::memory_order_release) doesn't require the lock prefix.
So we resort to option 1 of making both pushInd_ and popInd_ indices atomic.
| |
| |
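A sketch of what this "option 1" variant might look like (a reconstruction in the spirit of the post, not its exact code): both indices are atomic but are touched only via plain loads and stores, so no lock-prefixed instruction is emitted, and the acquire/release pairs make the buffer write visible before the index that publishes it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Lock-free SPSC: each atomic index has a single writer.
template <typename T>
class LockfreeSpsc {
public:
    explicit LockfreeSpsc(std::size_t cap) : cap_(cap), buf_(cap) {}

    bool push(const T& v) {  // producer only
        const std::uint64_t push = pushInd_.load(std::memory_order_relaxed);
        if (push - popInd_.load(std::memory_order_acquire) == cap_)
            return false;                         // full
        buf_[push % cap_] = v;
        pushInd_.store(push + 1, std::memory_order_release);  // publish
        return true;
    }

    bool pop(T& v) {  // consumer only
        const std::uint64_t pop = popInd_.load(std::memory_order_relaxed);
        if (pushInd_.load(std::memory_order_acquire) == pop)
            return false;                         // empty
        v = buf_[pop % cap_];
        popInd_.store(pop + 1, std::memory_order_release);    // free slot
        return true;
    }

private:
    std::size_t cap_;
    std::vector<T> buf_;
    std::atomic<std::uint64_t> pushInd_{0};  // written by producer only
    std::atomic<std::uint64_t> popInd_{0};   // written by consumer only
};
```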
We are atomically loading and storing values.
We can also look at the machine code and confirm there is no lock instruction involved.
On x86, C++ atomic load/store operations with any of relaxed/acquire/release memory orderings convert to normal load/store machine instructions because the x86 memory model provides TSO (Total Store Ordering).
While we enforce specific memory orderings at code level using std::memory_order_xxx, the architecture itself provides default guarantees. For instance, the ARM architecture is generally weakly ordered, whereas x86 provides TSO.
Sequential Consistency
In simple terms, there is a single global ordering of operations (loads and stores) that is consistent with the behaviour observed by all processors. The behaviour is as if one instruction from one processor executes at a time.
Total Store Order (TSO)
Any two stores are seen in a consistent order by processors other than those performing the stores. Apart from this relaxation, memory remains coherent.
Atomic operations cost for x86
On x86 architectures, atomic load operations are always the same, whether tagged memory_order_relaxed or memory_order_seq_cst. CPUs that use the x86 or x86-64 architectures don't require any additional instructions for acquire-release ordering beyond those necessary for ensuring atomicity, and even sequentially-consistent ordering doesn't require any special treatment for load operations.
Note: The atomic load/store approach compared to the atomic read-update-write operations of ++/-- works correctly because in SPSC, only the producer writes to pushInd_ and only the consumer writes to popInd_, so each atomic variable has a single writer, avoiding write-write races.
What else can be optimized? The modulo operation % (costly compared to a simple and instruction) can be replaced by a mask operation if we restrict the capacity to a power of 2. This gives us:
| |
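The core of the change is a one-liner: for a power-of-two capacity, `ind % cap` equals `ind & (cap - 1)`, and the latter compiles to a single cheap `and` instruction instead of a division.

```cpp
#include <cassert>
#include <cstdint>

// For cap a power of two, the low bits of ind are exactly ind % cap,
// so masking with cap - 1 replaces the expensive div/mod.
std::uint64_t maskIndex(std::uint64_t ind, std::uint64_t cap) {
    return ind & (cap - 1);  // valid only when cap is a power of 2
}
```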
Cache Line Optimization
With the current implementation, pushInd_, popInd_, cap_, and buf_ all fall in the same cacheline. (The benchmark we use allocates the SPSC in 64-byte-aligned memory obtained from new.)
| |
What if we separate msk_ and buf_ into one cacheline (this cacheline stays in the shared state in both cores, since it is never modified) and put pushInd_ and popInd_ each in their own cacheline? The data members then take 3 cachelines in total. The pushInd_ and popInd_ cachelines bounce between the shared and exclusive states as the producer and consumer cores write to them.
Making the changes we get:
| |
We can instead get away with 2 cachelines by keeping a copy of msk_ and buf_ near each index. One cacheline contains [pushInd_, mskPush_, buf_] for the producer thread, and another cacheline contains [popInd_, mskPop_, bufPop_] for the consumer thread. This way, each thread primarily accesses its own cacheline.
We get:
| |
| |
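As a sketch of what the 2-cacheline member layout might look like (the member names follow the post; the alignas placement is my assumption about the layout, not the author's exact code):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Producer-hot members share one 64-byte line, consumer-hot members
// another, so each thread mostly touches its own cacheline.
struct TwoCachelineLayout {
    // Producer line: the index it writes plus its copies of mask/buffer.
    alignas(64) std::atomic<std::uint64_t> pushInd_{0};
    std::uint64_t mskPush_ = 0;
    std::int64_t* bufPush_ = nullptr;

    // Consumer line: its index plus its own copies of mask/buffer.
    alignas(64) std::atomic<std::uint64_t> popInd_{0};
    std::uint64_t mskPop_ = 0;
    std::int64_t* bufPop_ = nullptr;
};

// Two 64-byte lines in total on a typical 64-bit target.
static_assert(sizeof(TwoCachelineLayout) == 128,
              "expected exactly two cachelines");
```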
Now, for reasons I don't fully understand, the 3-cacheline approach gives considerably worse results than the 2-cacheline version, and there are considerably more L1-dcache-load-misses compared to the 2-cacheline approach.
The machine code generated for the 3-cacheline and 2-cacheline versions is almost identical, except for the displacement values of some mov instructions (e.g., the 42 in mov rax, [rdi+42]). I initially thought the throughput difference was due to the 128 byte rule, but that doesn't explain the considerably higher perf stat -e L1-dcache-load-misses count in the 3-cacheline approach. Share if you have any thoughts on what might be causing this.
The 128 byte rule
The 128 byte rule refers to x86-64 instruction encoding efficiency. The machine code for accessing a structure member is more compact if the member’s offset from the structure’s base is less than 128 bytes, because the offset can be expressed as an 8-bit signed displacement in the instruction encoding (e.g., mov rax, [rdi+42]).
If the offset is 128 bytes or more, the instruction must use a 32-bit displacement, making the instruction longer (e.g., mov rax, [rdi+200] requires a 4-byte displacement instead of a 1-byte one). This doesn't just affect instruction-cache efficiency: modern CPUs decode instructions in chunks, and longer instructions can reduce decode throughput.
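The boundary is easy to see with a contrived struct (my own illustration): any member whose offset reaches 128 falls outside the signed 8-bit displacement range, so every access to it is encoded with a 4-byte displacement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// `near` (offset 0) fits an 8-bit displacement; `far` (offset 128)
// needs a 32-bit displacement in the x86-64 instruction encoding.
struct DisplacementDemo {
    std::uint8_t near;
    std::uint8_t pad[127];  // pushes the next member to offset 128
    std::uint64_t far;      // accesses use the longer encoding
};
```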
Anyways, let’s move forward.
Index Caching Optimization
Even with the optimized 2-cacheline layout, running perf mem shows a lot of "core, same node any cache hit" memory accesses. The actual data buffer is written by one core and read by another, so cross-core access is unavoidable there. But the same is true for pushInd_ and popInd_: even though we've moved them to separate cachelines, both the consumer and producer threads still need to read both indices.
However, we don't need to read the other thread's index on every operation. For example, if the last time we checked pushInd_ during pop() its value was, say, 100 higher than popInd_, then we don't need to check it again for the next 100 successful pop operations. So we can cache the value of pushInd_ on the consumer side (let's name this variable pushCache) and only re-read the actual pushInd_ when pushCache says there are no more elements to pop. We can do a similar thing for push().
Making the code changes give:
| |
| |
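A sketch of the index-caching idea (names like pushCache follow the post; the code itself is my reconstruction): each thread keeps a stale copy of the other thread's index and re-reads the shared atomic only when the cached value says it must.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// SPSC with cached indices; capacity must be a power of 2.
template <typename T>
class CachingSpsc {
public:
    explicit CachingSpsc(std::size_t capPow2)
        : msk_(capPow2 - 1), buf_(capPow2) {}

    bool push(const T& v) {  // producer only
        const std::uint64_t push = pushInd_.load(std::memory_order_relaxed);
        if (push - popCache_ == msk_ + 1) {               // looks full?
            popCache_ = popInd_.load(std::memory_order_acquire);
            if (push - popCache_ == msk_ + 1) return false;  // really full
        }
        buf_[push & msk_] = v;
        pushInd_.store(push + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& v) {  // consumer only
        const std::uint64_t pop = popInd_.load(std::memory_order_relaxed);
        if (pop == pushCache_) {                          // looks empty?
            pushCache_ = pushInd_.load(std::memory_order_acquire);
            if (pop == pushCache_) return false;          // really empty
        }
        v = buf_[pop & msk_];
        popInd_.store(pop + 1, std::memory_order_release);
        return true;
    }

private:
    std::uint64_t msk_;
    std::vector<T> buf_;
    alignas(64) std::atomic<std::uint64_t> pushInd_{0};
    std::uint64_t popCache_ = 0;   // producer's stale view of popInd_
    alignas(64) std::atomic<std::uint64_t> popInd_{0};
    std::uint64_t pushCache_ = 0;  // consumer's stale view of pushInd_
};
```

Each cache variable lives on the same cacheline as its owner's index, so the slow cross-core load happens only when the cached value runs out.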
Now, this optimization almost always benefits the push() operation, as we will only load popInd_ roughly once per capacity successful push operations (assuming the consumer's average pop() throughput keeps up with the producer's average push throughput).
For the pop() operation, the benefit depends on producer-consumer timing. If the consumer has a higher peak throughput than the producer, then popInd_ will always trail closely behind pushInd_, and we'll frequently have to reload pushInd_, bypassing the cached value. In this case, caching pushInd_ as pushCache and checking it first adds unnecessary overhead.
However, in most practical use cases the consumer won't always be toe-to-toe with the producer (for example, the producer might push in bursts and then go silent for a while), so this optimization serves well. This is also the reason the std-dev of LockfreeAtomicPushPopPtrCachingSpsc is significantly higher in our benchmarks.
Time To Pause
Pause instruction
Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a spin loop. A processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of power while it waits for the resource it is spinning on to become available. Inserting a pause instruction in a spin-wait loop greatly reduces the processor’s power consumption.
I won't go into detail about the memory order violation that the PAUSE instruction avoids. The while(!q.push(someVal)) loop behaves like a spin-wait loop: we loop until the popInd_ update becomes visible to the producer thread, and similarly for the consumer thread. So adding a pause, as in while(!q.push(someVal)){ _mm_pause(); }, can be beneficial.
| |
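The wrappers are tiny (my sketch; _mm_pause is x86-specific, and the single-slot queue below is a hypothetical stand-in just to keep the example self-contained; any queue with try-style bool push/pop works):

```cpp
#include <cstdint>
#include <immintrin.h>  // _mm_pause (x86 only)

// Spin around the try-style operations with a PAUSE hint.
template <typename Queue, typename T>
void blockingPush(Queue& q, const T& v) {
    while (!q.push(v)) _mm_pause();  // hint: this is a spin-wait loop
}

template <typename Queue, typename T>
void blockingPop(Queue& q, T& v) {
    while (!q.pop(v)) _mm_pause();
}

// Trivial single-slot "queue" to exercise the helpers in isolation.
struct OneSlot {
    bool full = false;
    std::int64_t v = 0;
    bool push(std::int64_t x) {
        if (full) return false;
        v = x; full = true; return true;
    }
    bool pop(std::int64_t& out) {
        if (!full) return false;
        out = v; full = false; return true;
    }
};
```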
The benchmark code also measures latency: the time between a push succeeding and the corresponding pop being able to consume the data. As with throughput, these numbers depend on the size of the type T used; all benchmarks here use T = int64_t.
| |
API Design Note
The push() and pop() methods shown here are actually non-blocking try operations (more accurately, they should be named try_push() and try_pop()).
If your usage pattern involves retrying until the operation succeeds (e.g., while (!q.push(val)) { /* spin */ }), it’s more efficient to move the retry loop inside the push() function itself:
| |
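A sketch of what the blocking variant might look like (a reconstruction in the spirit of the post, not its exact code): the retry loop lives inside push() itself, so the producer's own index is loaded once and only popInd_ is re-polled while waiting for space.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <immintrin.h>  // _mm_pause (x86 only)
#include <vector>

// SPSC with a blocking push; capacity must be a power of 2.
template <typename T>
class BlockingSpsc {
public:
    explicit BlockingSpsc(std::size_t capPow2)
        : msk_(capPow2 - 1), buf_(capPow2) {}

    void push(const T& v) {  // spins until there is space
        const std::uint64_t push = pushInd_.load(std::memory_order_relaxed);
        // Only popInd_ is re-read in the loop; push stays in a register.
        while (push - popInd_.load(std::memory_order_acquire) == msk_ + 1)
            _mm_pause();
        buf_[push & msk_] = v;
        pushInd_.store(push + 1, std::memory_order_release);
    }

    bool tryPop(T& v) {  // consumer side kept as a try operation
        const std::uint64_t pop = popInd_.load(std::memory_order_relaxed);
        if (pushInd_.load(std::memory_order_acquire) == pop) return false;
        v = buf_[pop & msk_];
        popInd_.store(pop + 1, std::memory_order_release);
        return true;
    }

private:
    std::uint64_t msk_;
    std::vector<T> buf_;
    std::atomic<std::uint64_t> pushInd_{0};
    std::atomic<std::uint64_t> popInd_{0};
};
```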
This approach avoids repeatedly reloading pushInd_ on every retry, compared to the while(!q.try_push(val)){} approach. A similar change applies to pop(T&).
Closing Thoughts
Hopefully this post has been less about “here’s how an efficient SPSC looks” and more about the process: analyze, hypothesize, measure, repeat. The specific optimizations matter less than the approach, using perf, occasionally peeking at generated assembly, and understanding why something is slow before trying to fix it.