An introduction to XET, Hugging Face's storage system (part 2)

Content-defined chunking

In part 1, we saw that the XET content-addressable storage protocol splits files into variable-sized chunks.

Why variable-sized? Because if boundaries are chosen based on content, inserting or deleting bytes typically affects only nearby chunks; content before and after can still be deduplicated.

v1: ABCDEXGH      IJKLMNOPQRSTUVWXYZ
v2: ABCDEXGH  42  IJKLMNOPQRSTUVWXYZ

In the example above, inserting 42 only introduces a new middle chunk; the two surrounding chunks remain identical and can be deduplicated.

If the final set of files is known up front, the optimal boundaries can be found by looking for the longest common slices across versions.

But a storage system that accepts updates never knows the final set of files up front, so we have to do something else.

The solution is to make chunk boundaries depend on the content itself.

Imagine splitting a file at every NUL byte. If we insert a couple of bytes somewhere, only the chunk containing the insertion changes; the other chunks keep their exact contents.

Before: "Hello\x00World\x00Test"


         "Hello"   "World"   "Test"
              ↑         ↑
           boundary  boundary
            (NUL)     (NUL)

After insert: "Hello\x00Big World\x00Test"

         "Hello"   "Big World"   "Test"
              ↑             ↑
           boundary       boundary
            (NUL)          (NUL)

Only the middle chunk changed. "Hello" and "Test" remain identical.
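Here's a tiny Python sketch of this idea (splitting on NUL is purely illustrative, and the byte strings are the ones from the diagrams above):

def nul_chunks(data: bytes) -> list[bytes]:
    # Split at every NUL byte; each piece becomes a chunk.
    return data.split(b"\x00")

before = nul_chunks(b"Hello\x00World\x00Test")
after = nul_chunks(b"Hello\x00Big World\x00Test")

# Chunks present in both versions can be deduplicated.
print(set(before) & set(after))  # {b'Hello', b'Test'}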

Using a fixed byte value such as NUL may not work well with every kind of input. Binary files might have NUL bytes everywhere, while text files might have none. So we may want to refine this idea a little bit.

Here’s a more universal approach: compute a rolling hash and split when some pattern is found in the hash output, for example when N specific bits are zero.

Rolling hash: processes bytes one at a time
              ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
    Input:    A B C D E F G H I J K L M N O P ...
    Hash:    [continuously updated 64-bit value]

When the hash matches a pattern, we have a chunk boundary.

Byte A: hash = 0x7A3B...  (no match)
Byte B: hash = 0x1C2F...  (no match)
Byte C: hash = 0x0000...  (top bits are zero → BOUNDARY)

This is content-defined: the same content will always produce the same boundaries, regardless of where it appears in a file.

There are many variants of this concept. The algorithm currently used by Hugging Face’s XET is called GearHash.

How GearHash works

GearHash is elegantly simple. It maintains a 64-bit non-cryptographic rolling hash that updates per byte:

h = 0  # 64-bit hash state

for byte in data:
    h = ((h << 1) + GEARHASH_TABLE[byte]) & 0xFFFFFFFFFFFFFFFF

That’s it. For each input byte:

  1. Shift the current hash left by 1 bit
  2. Add a pseudorandom 64-bit value from a lookup table (indexed by the byte value)
  3. Mask to 64 bits

The lookup table contains 256 entries of pseudorandom 64-bit values. Each byte value maps to a different table entry, creating pseudorandom-looking output.
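For a self-contained experiment, such a table can be generated from a seeded PRNG. Note that real implementations hard-code a fixed constant table; the seed below is an arbitrary choice of mine:

import random

# 256 pseudorandom 64-bit entries, one per possible byte value.
# Production implementations ship a fixed table; a seeded PRNG
# is used here only to keep the sketch self-contained.
_rng = random.Random(42)
GEARHASH_TABLE = [_rng.getrandbits(64) for _ in range(256)]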

To detect chunk boundaries, XET checks if the top 16 bits of the hash are all zero:

MASK = 0xFFFF000000000000  # Top 16 bits

if (h & MASK) == 0:
    # Chunk boundary found!

Since having 16 specific bits be zero happens roughly once every 2^16 ≈ 65,536 bytes, this naturally creates chunks around 64 KiB on average.

To avoid pathologically small or large chunks, XET enforces constraints:

  • Minimum: 8 KiB - boundary checks are skipped until this size is reached
  • Maximum: 128 KiB - a boundary is forced if no natural one is found
  • Target: ~64 KiB - what the parameters are tuned for

In pseudocode:

    if chunk_size < 8 KiB:
        continue                    # Too small, keep going
    elif chunk_size >= 128 KiB:
        FORCE BOUNDARY              # Too big, must cut
    elif (hash & MASK) == 0:
        NATURAL BOUNDARY            # Content-defined cut

The beauty of GearHash is that it’s both simple and fast. The rolling hash updates in O(1) time per byte, with just a shift, an add, and a table lookup. No complex arithmetic, no buffering, no look-ahead.
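Putting these pieces together, here's a minimal end-to-end sketch using the GEARHASH_TABLE defined earlier. The function name and structure are my own; the parameters follow the values above:

MIN_SIZE = 8 * 1024
MAX_SIZE = 128 * 1024
MASK = 0xFFFF000000000000

def gearhash_chunks(data: bytes) -> list[bytes]:
    chunks = []
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEARHASH_TABLE[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size < MIN_SIZE:
            continue                 # Too small, skip the boundary check
        if size >= MAX_SIZE or (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0                    # Reset state (implementations vary;
                                     # only the last 64 bytes matter anyway)
    if start < len(data):
        chunks.append(data[start:])  # Trailing partial chunk
    return chunks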

Why content-defined chunking is cool

The magic happens when files change:

Original:  [AAAAAAAAAAA][BBBBBBBBBBBB][CCCCCCCCCCCC]
              chunk 1      chunk 2        chunk 3

Insert "XYZ" in the middle of chunk 2:

Modified:  [AAAAAAAAAAA][BBBBBXYZBBBBB][CCCCCCCCCCCC]
              chunk 1       chunk 2        chunk 3

Chunks 1 and 3 are unchanged. Only chunk 2 needs to be re-uploaded.

With fixed-size chunking, inserting bytes would shift all subsequent chunk boundaries, destroying deduplication. Content-defined chunking solves this elegantly.
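You can check this behavior with the gearhash_chunks sketch from earlier (the sizes and insertion point here are arbitrary):

import hashlib
import os

original = os.urandom(1024 * 1024)
modified = original[:500_000] + b"XYZ" + original[500_000:]

digest = lambda c: hashlib.sha256(c).hexdigest()
a = {digest(c) for c in gearhash_chunks(original)}
b = {digest(c) for c in gearhash_chunks(modified)}

# Most chunk hashes are shared; only the chunk(s) around the
# insertion point need to be stored or uploaded again.
print(len(a & b), "of", len(b), "chunks deduplicated")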

Other chunking algorithms

GearHash is simple and effective. However, many other content-defined chunking algorithms exist, and this remains an active research area.

I found UltraCDC (paper, zig-ultracdc) to be faster and to consistently provide better space savings than GearHash for AI models.

The Zig implementation of XET includes the file_to_xorb tool that allows you to swap GearHash for UltraCDC if you want to run your own comparisons.

By the way, Gilles Chehade wrote a pretty nice overview of the systems he tested and implemented in Go for his Plakar backup application.

Chunk compression and interleaving

XET compresses chunks with LZ4 before storing them. LZ4 doesn’t provide the best compression ratio, but it’s very fast. At scale, that matters a lot.

Quick note for implementers: LZ4 comes in several variants: the standard/fast one, a slower high-compression variant (LZ4-HC), and LZ4F. LZ4F is a framing format wrapped around regular LZ4 blocks. XET uses LZ4F, not raw LZ4 blocks.
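In Python, for example, the lz4 package exposes both: lz4.block is the raw block format, while lz4.frame implements LZ4F (the package choice is mine; any LZ4F implementation behaves the same):

import lz4.frame

payload = b"\x00" * 65536          # stand-in for a 64 KiB chunk
compressed = lz4.frame.compress(payload)
assert lz4.frame.decompress(compressed) == payload
print(len(payload), "->", len(compressed))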

But before compression, XET can reorganize the bytes in a way that may provide better compression ratios.

AI models are giant collections of weights, and within a given tensor all weights use the same representation. When that representation is an integer type, there's not much to gain from reorganizing the bytes.

However, the story is a bit different with floating-point numbers. Floating-point numbers are encoded as the concatenation of a sign bit, the exponent, and the mantissa.

In AI models, sign and exponent tend to change slowly and repeat across weights. Many weights share the same exponent, and sign bits are often biased. The least significant mantissa bits, on the other hand, look much closer to random noise.

With the default representation, the compressor sees a sequence of bytes like (FP32 example, 4 bytes):

[sign+exp][mantissa][mantissa][mantissa]
[sign+exp][mantissa][mantissa][mantissa]

So every highly predictable byte is immediately followed by less predictable ones. That constantly resets the compressor’s context and makes it harder to spot long runs or strong patterns.

Now imagine you interleave the data (byte grouping) like this:

  • all byte 0 of all weights
  • then all byte 1 of all weights
  • then all byte 2 of all weights

Suddenly, each stream becomes cleaner, with longer runs of identical or very similar values, helping with compression.
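A byte-grouping transform is only a few lines in Python (the function names are mine; width=4 assumes FP32 values, and the input length is assumed to be a multiple of width):

def byte_group(data: bytes, width: int = 4) -> bytes:
    # Emit byte 0 of every value, then byte 1, then byte 2, ...
    return b"".join(data[i::width] for i in range(width))

def byte_ungroup(data: bytes, width: int = 4) -> bytes:
    # Inverse transform: re-interleave the byte planes.
    n = len(data) // width
    planes = [data[i * n:(i + 1) * n] for i in range(width)]
    return b"".join(bytes(p[j] for p in planes) for j in range(n))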

In XET, byte grouping is optional, but useful to enable for floating-point types. Given that chunks are guaranteed to be small, keeping them in memory to interleave or deinterleave them is not an issue.

Instead of interleaving the bytes, we can interleave the bits:

  • all bits 0 of all weights
  • then all bits 1 of all weights
  • then all bits 2 of all weights

Bit grouping increases the probability that similar weights produce long sequences of identical bits, further improving the compression ratio.
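For comparison, here's a bit-grouping sketch in the same spirit (again illustrative and mine, assuming 32-bit values and an input length that is a multiple of 4 bytes):

def bit_group(data: bytes, width_bits: int = 32) -> bytes:
    # Treat the input as width_bits-wide values and emit bit
    # plane 0 of every value, then bit plane 1, and so on.
    step = width_bits // 8
    values = [int.from_bytes(data[i:i + step], "little")
              for i in range(0, len(data), step)]
    out, acc, nbits = bytearray(), 0, 0
    for bit in range(width_bits):
        for v in values:
            acc = (acc << 1) | ((v >> bit) & 1)
            nbits += 1
            if nbits == 8:           # Flush a full byte of the plane
                out.append(acc)
                acc, nbits = 0, 0
    return bytes(out)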

Although not supported by Hugging Face, the Zig implementation of XET supports bit grouping (fbs compression, which is bit grouping + LZ4F) if this is something you want to play with. Bit grouping requires more CPU cycles than byte grouping, though.

That’s a wrap

This is it for part 2, but there’s more to explore about the XET format in a future episode!