Fast sorting, branchless by design

Sorting is one of the most studied problems in computer science. Every language ships a built-in sort, and for most applications, picking the right one is a matter of performance. Quicksort, mergesort, pdqsort. They all produce the correct output. The main question is how fast they get there.

But there’s a class of applications where correctness and speed aren’t enough. When the data being sorted is sensitive, the sort itself can become a security vulnerability. Not because it produces the wrong result, but because an attacker who observes execution time can learn something about the data.

This is not theoretical. Post-quantum cryptosystems like Classic McEliece and NTRU Prime use sorting as a core subroutine, and if the sort leaks timing information, the entire cryptosystem is compromised.

There is, however, a family of algorithms that eliminates this problem entirely, and when implemented well, can even outperform traditional sorts.


Why sorting leaks information

Consider quicksort.

It picks a pivot, partitions the array around it, and recurses. Which pivot gets chosen, how the partitioning plays out, and how deep the recursion goes all depend on the values in the array. Different inputs produce different branch patterns, different memory access sequences, and different execution times.

An attacker who can measure execution time, even remotely over a network, can use those timing differences to deduce information about what’s being sorted. This is a timing side-channel attack.

The problem isn’t specific to quicksort. Any sort whose control flow depends on the data is vulnerable. Mergesort’s merge step branches on comparisons. Insertion sort’s inner loop length depends on how far each element needs to move. Even pdqsort, which combines several strategies, makes data-dependent decisions at every level.

What we need is a sort where the sequence of operations is completely fixed: determined only by the array length, never by the values. The same comparisons, the same memory accesses, the same number of instructions, regardless of whether the array contains [1, 2, 3] or [3, 1, 2].


Sorting networks

A sorting network does exactly this. It’s a fixed circuit of compare-and-swap operations wired together in a specific pattern. Each comparator takes two values and outputs them in order: the smaller one goes to one wire, the larger one to the other. The entire network is determined at construction time, before any data is seen.

The simplest sorting network sorts two elements. It’s a single comparator: take two values, output (min, max). Done. For three elements, you need three comparators. For four, you need five. The pattern grows, but the key property remains: the comparisons don’t depend on the data. The same wires get compared in the same order every time.
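As a sketch (in Python rather than the article's Zig, with hypothetical names), the three-comparator network for three elements can be written as a fixed, unconditional sequence of compare-exchanges:

```python
def cswap(a, i, j):
    # Compare-exchange: always executes, regardless of the data.
    # min/max instead of a branch keeps the operation data-oblivious.
    a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])

def sort3(a):
    # Fixed three-comparator network for three elements:
    # the same wires are compared in the same order every time.
    cswap(a, 0, 1)
    cswap(a, 1, 2)
    cswap(a, 0, 1)

a = [3, 1, 2]
sort3(a)
# a is now [1, 2, 3]
```

Whatever permutation goes in, exactly three comparisons execute, in exactly this order.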

This is what makes sorting networks attractive for cryptography. The comparison pattern is data-oblivious: an attacker watching the execution sees the exact same sequence of operations regardless of what’s being sorted.


Bubble sort as a sorting network

It turns out that bubble sort, often dismissed as the worst sorting algorithm, is already a sorting network.

Think about what bubble sort actually does. On each pass, it compares adjacent pairs from left to right: elements 0 and 1, then 1 and 2, then 2 and 3, and so on. If a pair is out of order, it swaps them. After n-1 passes over an array of n elements, the array is sorted.

The comparison pattern is always the same. It doesn’t matter what values are in the array: bubble sort always compares the same pairs in the same order. It’s a fixed schedule of compare-and-swap operations. That’s exactly the definition of a sorting network.

The problem is efficiency. Bubble sort needs O(n) passes, each with O(n) comparisons, for a total of O(n²) comparators. For tiny arrays that fit in L1/L2 caches, bubble sort can be surprisingly fine in practice. But for large arrays, it’s far too slow.


Odd-even transposition sort

A natural improvement is to parallelize. In odd-even transposition sort, instead of sweeping left-to-right, you alternate between two phases:

  • Phase A: compare and swap pairs (0,1), (2,3), (4,5), ...
  • Phase B: compare and swap pairs (1,2), (3,4), (5,6), ...

Within each phase, all the comparisons are independent: they operate on disjoint pairs of elements, so they can all happen simultaneously. And after n phases, the array is sorted.

On a single processor, this is no faster than bubble sort. But it demonstrates something important about sorting networks: they expose parallelism. All comparisons within a phase are independent, which is a property that data-dependent sorts like quicksort fundamentally cannot offer, because each comparison depends on the results of previous ones.

Still, odd-even transposition sort has O(n) depth (sequential phases) and O(n²) total comparators. We can do much better.
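The two alternating phases can be sketched in a few lines of Python (an illustrative version, not the article's Zig implementation):

```python
def odd_even_transposition_sort(a):
    n = len(a)
    for phase in range(n):
        # Even phases compare (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            # All pairs within a phase are disjoint, so these
            # compare-exchanges could run simultaneously.
            a[i], a[i + 1] = min(a[i], a[i + 1]), max(a[i], a[i + 1])

a = [5, 1, 4, 2, 3]
odd_even_transposition_sort(a)
# a is now [1, 2, 3, 4, 5]
```

The sequential loop hides the parallelism, but the inner loop's iterations touch disjoint index pairs by construction.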


Bitonic sequences

The key insight that leads to efficient sorting networks comes from a special kind of sequence: a bitonic sequence.

A bitonic sequence is one that first increases and then decreases like [1, 4, 7, 5, 2]. It can also be a cyclic rotation of such a sequence, so [5, 2, 1, 4, 7] is also bitonic (it falls, then rises, which is the same shape rotated).

Why do we care? Because a bitonic sequence of length n can be sorted by a fixed network with only O(log n) depth.

The trick is called a bitonic merger, and it works like this: compare each element in the first half with the corresponding element in the second half (elements at distance n/2). This single layer of comparators splits the bitonic sequence into two smaller bitonic sequences, one in each half. Recurse on each half, halving the distance each time. After log n layers, the sequence is sorted.

A bitonic merger is a small, efficient, fixed-structure circuit. It is the building block for sorting any array, not just bitonic ones.
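The merger's recursive structure can be sketched in Python (illustrative names; a constant-time version would replace the value-level branch with branchless min/max):

```python
def bitonic_merge(a, lo, n, ascending=True):
    # Sorts the bitonic sequence a[lo:lo+n] in place; n is a power of two.
    if n <= 1:
        return
    d = n // 2
    for i in range(lo, lo + d):
        # One layer: compare elements at distance d.
        if (a[i] > a[i + d]) == ascending:
            a[i], a[i + d] = a[i + d], a[i]
    # Each half is now itself bitonic; recurse with half the distance.
    bitonic_merge(a, lo, d, ascending)
    bitonic_merge(a, lo + d, d, ascending)

a = [1, 4, 7, 5, 2, 0, -1, -3]  # rises then falls: bitonic
bitonic_merge(a, 0, 8)
# a is now [-3, -1, 0, 1, 2, 4, 5, 7]
```

Each recursion level is one layer of comparators, and the distance halves each time, giving the log n depth.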


Batcher’s bitonic sort

In 1968, Ken Batcher showed how to use the bitonic merger to build a full sorting network. The construction is recursive:

  1. Split the array in half.
  2. Recursively sort the first half in ascending order and the second half in descending order.
  3. The result is a bitonic sequence. Merge it with the bitonic merger.

For clarity, assume n is a power of two. Practical implementations handle arbitrary lengths with padding or tailored networks.

Here’s a concrete example with 8 elements. Start with:

[6, 3, 7, 1, 5, 2, 8, 4]

After recursively sorting the first half ascending and the second half descending (which itself applies the same construction on smaller subarrays), we get:

[1, 3, 6, 7,  8, 5, 4, 2]
 ascending    descending

This is a bitonic sequence: it rises then falls. Now we apply the bitonic merger. The rule is simple: for each pair (i, i+d), keep the minimum at position i and the maximum at position i+d. Halve d after each layer:

    positions: [0, 1, 2, 3, 4, 5, 6, 7]

        input: [1, 3, 6, 7, 8, 5, 4, 2]

d = 4   pairs: (0,4), (1,5), (2,6), (3,7)
          out: [1, 3, 4, 2, 8, 5, 6, 7]

d = 2   pairs: (0,2), (1,3), (4,6), (5,7)
          out: [1, 2, 4, 3, 6, 5, 8, 7]

d = 1   pairs: (0,1), (2,3), (4,5), (6,7)
          out: [1, 2, 3, 4, 5, 6, 7, 8]

Three layers, four comparisons each, and the array is sorted. The comparison schedule was the same regardless of what values were in the array: only the array length determined which pairs got compared.
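The whole construction, recursion plus merger, fits in a short Python sketch (illustrative, not the article's Zig code; a constant-time version would use branchless min/max in place of the value-level branch):

```python
def bitonic_sort(a, lo, n, ascending=True):
    # Batcher's construction: sort the halves in opposite directions,
    # then merge the resulting bitonic sequence. n is a power of two.
    if n <= 1:
        return
    d = n // 2
    bitonic_sort(a, lo, d, True)
    bitonic_sort(a, lo + d, d, False)
    bitonic_merge(a, lo, n, ascending)

def bitonic_merge(a, lo, n, ascending):
    if n <= 1:
        return
    d = n // 2
    for i in range(lo, lo + d):
        if (a[i] > a[i + d]) == ascending:
            a[i], a[i + d] = a[i + d], a[i]
    bitonic_merge(a, lo, d, ascending)
    bitonic_merge(a, lo + d, d, ascending)

a = [6, 3, 7, 1, 5, 2, 8, 4]
bitonic_sort(a, 0, 8)
# a is now [1, 2, 3, 4, 5, 6, 7, 8]
```

Tracing this on the example above reproduces exactly the three merger layers shown.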

Each level of recursion adds O(log n) depth from the merger. There are O(log n) levels of recursion. So the total depth is O(log²n) and the total number of comparators is O(n log²n).

This is more work than quicksort’s O(n log n). But the comparison pattern is completely fixed, which is exactly the property we need for constant-time sorting. And O(n log²n) is close enough to O(n log n) to be practical. The extra log factor is a modest price for side-channel resistance. More importantly, layers are made of nothing but simple, disjoint operations that can run in parallel.

This structure is also a natural fit for GPUs. GPUs execute threads in lockstep groups, and when threads within a group take different branches, performance collapses. Data-dependent sorts like quicksort suffer from this. Sorting networks avoid it entirely: every thread executes the same compare-and-swap regardless of the data. Bitonic sort has been one of the standard GPU sorting algorithms for this reason, and is included in NVIDIA’s CUDA samples as a textbook example.

Note that there exists a theoretically optimal O(n log n) sorting network, the AKS network, but its constants are so enormous that it’s unusable in practice. As Knuth put it, Batcher’s method is better unless n exceeds the total memory capacity of all computers on earth.


djbsort

Daniel J. Bernstein (“djb”) released djbsort in 2018 and has maintained it since. The goal is a sort that is simultaneously fast, constant-time, and verifiable. The implementation combines SIMD vectorization for speed, data-oblivious execution for security, and formal verification tooling that operates on the compiled binary.

djbsort implements Batcher’s sorting network with the wire ordering rearranged to be SIMD-friendly.

Since all comparisons within a layer operate on disjoint pairs, the CPU can exploit this independence in two ways: SIMD instructions process multiple pairs per instruction, and out-of-order execution overlaps independent scalar operations. The sorting network’s structure exposes both forms of parallelism naturally.

The core primitive is the compare-and-swap, and djbsort provides two versions.

The fast path uses hardware min/max instructions. On modern CPUs, packed SIMD min and max operate on entire vectors of elements at once and are inherently branchless. One instruction compares 4, 8, or 16 pairs simultaneously with no data-dependent control flow.

The fallback path uses pure arithmetic. It subtracts the two values to get a difference, extracts the sign bit via a right shift, negates it to produce either an all-ones or an all-zeros mask, then XOR-swaps the values conditioned on that mask. As long as the compiler doesn't recognize the pattern and rewrite it, this yields no branches, no data-dependent memory accesses, and only operations expected to run in constant time on the target architecture. The same operations execute regardless of whether the pair was already in order.
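The arithmetic trick can be demonstrated in Python (illustrative only: Python's arbitrary-precision integers give no timing guarantees, so the bit widths are emulated with explicit masks; the function name is hypothetical):

```python
def ct_minmax(a, b):
    # Branchless min/max of two unsigned 32-bit values.
    # The difference is kept in 33-bit two's complement so the
    # subtraction can't overflow; bit 32 plays the role of the sign bit.
    diff = (b - a) & 0x1FFFFFFFF
    sign = diff >> 32                 # 1 iff b < a, i.e. the pair is out of order
    mask = (-sign) & 0xFFFFFFFF       # all-ones iff out of order, else all-zeros
    t = (a ^ b) & mask                # XOR swap, gated by the mask
    return a ^ t, b ^ t

ct_minmax(5, 3)
# returns (3, 5)
ct_minmax(3, 5)
# returns (3, 5)
```

The same subtract, shift, mask, and XOR operations execute for both calls; only the mask's value differs.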


A Zig implementation

I implemented djbsort in Zig. The library exposes two functions.

sort is the fast path for native numeric types: integers and floats of any width:

const djbsort = @import("djbsort");

var data = [_]i32{ 42, -7, 13, 0, -99 };
djbsort.sort(i32, .asc, &data);
// data is now { -99, -7, 0, 13, 42 }

sortWith handles arbitrary types, including structs. It follows the same interface as std.sort.pdq. You provide a comparison function and an optional context:

const Point = struct { x: i32, y: i32 };

var points = [_]Point{ .{ .x = 3, .y = 1 }, .{ .x = 1, .y = 2 } };
djbsort.sortWith(Point, &points, {}, struct {
    fn lessThan(_: void, a: Point, b: Point) bool {
        return a.x < b.x;
    }
}.lessThan);

Both functions use the same Batcher sorting network topology. They differ in how the compare-and-swap is implemented.

The native path has two levels of optimization. On modern CPUs with SIMD support, it processes elements in vector-wide chunks using packed min/max. For the leftover elements that don’t fill a full vector, it falls back to a scalar compare-and-swap.

The scalar path is gated by a comptime flag called has_ct_minmax. When side-channel mitigations are disabled (side_channels_mitigations == .none), it’s always true. In that specific case, we don’t need constant-time code, so @min/@max are fine on any architecture. But when mitigations are enabled (and by default, they are), it’s only true on x86/x86_64 and aarch64, where @min and @max have been verified to lower to branchless instructions (cmov, csel). On other architectures, the code falls back to the arithmetic sign-bit extraction and XOR swap:

const diff = if (order == .asc) b_int - a_int else a_int - b_int;
const sign_bit: u1 = @truncate(@as(UWInt, @bitCast(diff)) >> @intCast(bits));
var mask_word = @as(usize, 0) -% @as(usize, sign_bit);

The subtraction produces a negative result when the pair is out of order. The sign bit is extracted via a right shift, then spread to a full-width mask via 0 -% sign_bit (wrapping subtraction: 0 minus 1 wraps to all-ones, 0 minus 0 stays zero). The mask drives an XOR swap of the raw bytes.

The generic path (sortWith) behaves differently depending on std.options.side_channels_mitigations. When mitigations are disabled, it uses a plain conditional swap: a branch and std.mem.swap. Fast, but not necessarily constant-time. When mitigations are enabled, it uses ctCondSwap, which XOR-masks the raw byte representation in machine-word-sized chunks. Even with mitigations enabled, sortWith is only timing-safe if the caller-supplied comparison function is itself constant-time. A data-dependent lessThan would leak through the comparator, not the swap.

Both paths use the same optimization barrier to prevent the compiler from undoing the constant-time code:

mask_word = asm volatile (""
    : [mask] "=r" (-> usize),
    : [_] "0" (mask_word),
);

This is a no-op at the hardware level. It generates no instructions. But it tells the compiler that mask_word might have been modified, which prevents LLVM from reasoning about its value and converting the branchless XOR swap back into a conditional branch. This is more efficient than std.mem.doNotOptimizeAway(&mask), which includes a memory clobber.


Float ordering

IEEE 754 floats don’t have a total order. NaN is unordered. It’s not less than, equal to, or greater than anything, including itself. And -0.0 compares equal to +0.0, even though they have different bit representations.

For a sorting network, we need a total order. The sort function imposes one: -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN.

The solution is djbsort’s “useint” technique: don’t sort the floats at all. Instead, reinterpret their bits as signed integers, apply a transform that makes integer comparison match the desired float ordering, sort using the integer sorting network, and transform back.

The transform is a function called floatSortKey. Positive floats already sort correctly as integers: larger floats have larger bit values. Negative floats sort in reverse: -2.0 has a larger bit pattern than -1.0, but should come before it. The fix is to arithmetic-right-shift the sign bit to fill the entire word (all-ones for negative, all-zeros for positive), AND with maxInt to clear the sign bit, and XOR. This flips the magnitude bits of negative values, reversing their order while leaving positive values unchanged:

fn floatSortKey(comptime SInt: type, s: SInt) SInt {
    const mask = s >> (@bitSizeOf(SInt) - 1);
    return s ^ (mask & std.math.maxInt(SInt));
}

The transform is self-inverse: applying it twice gives back the original value. So the whole float sort is just a thin wrapper around the integer sort:

if (@typeInfo(T) == .float) {
    const SInt = std.meta.Int(.signed, @bitSizeOf(T));
    const int_items: [*]SInt = @ptrCast(items.ptr);
    const int_slice = int_items[0..n];
    applyFloatSortKey(SInt, int_slice);  // O(n) transform, vectorized
    sort(SInt, order, int_slice);        // reuse the integer sort
    applyFloatSortKey(SInt, int_slice);  // O(n) reverse (self-inverse)
    return;
}

The pre/post transform is itself vectorized, processing elements in SIMD-width chunks. The cost is two linear passes over the array, negligible next to the O(n log²n) sorting work. In benchmarks, float sorting runs within a few percent of integer sorting speed.
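The same key transform can be demonstrated in Python (an illustrative sketch with a hypothetical function name; djbsort transforms the array in place rather than using a sort key):

```python
import struct

def float_sort_key(x):
    # Reinterpret an IEEE 754 double's bits as a signed 64-bit integer,
    # then flip the magnitude bits of negative values so plain integer
    # comparison matches the desired float total order.
    s = struct.unpack("<q", struct.pack("<d", x))[0]
    mask = s >> 63                         # arithmetic shift: -1 if negative, else 0
    return s ^ (mask & 0x7FFFFFFFFFFFFFFF)

values = [1.5, -2.0, 0.0, -0.0, float("inf"), -1.0]
ordered = sorted(values, key=float_sort_key)
# ordered is [-2.0, -1.0, -0.0, 0.0, 1.5, inf]
```

Note that -0.0 lands strictly before +0.0, even though the two compare equal as floats: the keys are distinct integers.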


Sorting networks trade a modest increase in total work, O(n log²n) instead of O(n log n), for a property that mainstream sorts like quicksort and pdqsort cannot offer: a fixed, data-independent comparison schedule. djbsort makes this practical through SIMD vectorization and the instruction-level parallelism inherent in sorting networks.

On Apple Silicon, the SIMD path (sort) runs 2x to 5x faster than the standard library’s pdqsort across all types and sizes tested, from 16 to 1M elements. For i32 at 1M elements, djbsort is still about 3x faster.

On AMD Zen4, the speedups are even larger at small and medium sizes, up to 5.5x at 16K elements. At 1M elements the O(n log²n) work catches up, but djbsort still wins by about 1.3x.

Floats sort at the same speed as their integer counterparts on both architectures, within a few percent. The generic path (sortWith), which cannot use SIMD, beats pdq up to around 65K elements, then gradually loses as the extra log factor dominates.

For a data-oblivious sort, being faster than a mainstream data-dependent sort across most practical sizes is a welcome bonus. And for the cryptographic use cases that motivate djbsort, arrays tend to be well within that range.

The code is on GitHub.