The storage problem
Version control systems like Git are awesome. They let you clone, update, and share the full history of changes in a project.
How do they work? Obviously, storing a full copy of every file in every version would be a ginormous waste of storage space.
In practice, it’s pretty uncommon to change all the files in a project in a single step, so from one version to the next, there’s usually a ton of redundancy.
For that reason, Git stores objects, indexed by a hash of their content. Files with the same content, regardless of their name and which versions they appear in, will map to the same object.
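As a quick illustration of that content addressing, here is how Git derives a blob's object ID in its default SHA-1 format: it hashes a small header followed by the raw content, so identical content always yields the same ID.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Git's blob object ID: SHA-1 over a 'blob <size>' header, a NUL byte, and the content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Identical content maps to the same object, whatever the file is called and
# whichever revisions it appears in (same value `git hash-object` prints).
print(git_blob_id(b"hello\n"))
```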
Git stores full-file blobs, and it can later delta-compress related blobs to save space. That works beautifully for source code. It works much less well for large binary files, where even small changes can make the delta almost as large as the entire file.
The result is that preserving history for binaries often means storing and transferring a new large blob for each change.
Clearly not ideal. Git LFS solves that problem by replacing large files with tiny pointer files that record the LFS spec version, a SHA-256 object ID, and the file size. The actual content is stored in a separate LFS object store.
That way you download only the versions you actually check out, and the large objects live outside Git’s object database.
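For reference, a pointer file really is tiny. Here is a minimal sketch that builds one (the object ID below is a placeholder; a real one is the SHA-256 of the file's content):

```python
def lfs_pointer(oid_sha256: str, size: int) -> str:
    """Build the small text file Git tracks in place of the real content."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_sha256}\n"
        f"size {size}\n"
    )

print(lfs_pointer("a" * 64, 1_073_741_824), end="")
```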
What Git LFS doesn’t solve, though, is that any change to a file, however small, still requires storing an entire new copy.
Client-side, that’s not a big deal: files absent from the current revision are kept only as pointers, not as actual content. But every variant still has to be uploaded, and server-side a lot of storage space may be required even when many of those files share most of their content.
The Hugging Face Hub quickly ran into this problem.
Models, datasets, and other artifacts are stored in Git repositories. Large files are tracked via LFS pointer files, while the Hub’s storage backend uses XET (with Git LFS compatibility), so you get Git workflows plus chunk-level deduplication.
On the Hub, a lot of data is forked and versioned, and even a small edit can require uploading and storing a whole new large object when most of the bytes are identical. Clearly suboptimal.
Chunking, deduplication, and xorbs
To solve that problem, we can split files into chunks and deduplicate at the chunk level instead of the file level.
v1: ABCDEXGHIJKLMNOPQRSTUVWXYZ
v2: ABCDEXGH42IJKLMNOPQRSTUVWXYZ
v1 and v2 are different, so they have different hashes.
But by splitting them intelligently into chunks, we get two shared chunks that can be stored only once:
v1: ABCDEXGH IJKLMNOPQRSTUVWXYZ
v2: ABCDEXGH 42 IJKLMNOPQRSTUVWXYZ
Boom. Big storage savings server-side.
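Here is a toy sketch of that idea in Python, using the chunk boundaries shown above (the store and hashes are illustrative, not XET's actual formats):

```python
import hashlib

# Toy chunk store keyed by content hash: identical chunks, from any file or
# any version, are stored exactly once.
store: dict[str, bytes] = {}

def add_file(chunks: list[bytes]) -> list[str]:
    """Store each chunk once and return the chunk hashes that describe the file."""
    refs = []
    for chunk in chunks:
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # no-op if the chunk is already known
        refs.append(h)
    return refs

v1 = add_file([b"ABCDEXGH", b"IJKLMNOPQRSTUVWXYZ"])
v2 = add_file([b"ABCDEXGH", b"42", b"IJKLMNOPQRSTUVWXYZ"])
print(len(store))   # 3 unique chunks stored for the 5 chunk references above
```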
Downloading a chunked file then becomes a reconstruction problem. The server sends the client a plan: an ordered set of chunk ranges that, after reassembly, produce the final file.
As a bonus, these chunks can be downloaded in parallel, over multiple HTTP connections, in order to improve efficiency.
For practical reasons, the default XET suite targets ~64 KiB chunks, with a minimum of 8 KiB and a maximum of 128 KiB. That size produces good results for both compression and deduplication.
But issuing a separate HTTP request for every chunk would be slow. A 1 GiB file with 16,000 chunks would need 16,000 HTTP requests. It would also create 16,000 separate objects in S3-like cloud services (with per-object storage costs) and 16,000 CDN cache entries. Not very efficient.
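The arithmetic behind that figure:

```python
GiB = 1024 ** 3
avg_chunk = 64 * 1024        # ~64 KiB target chunk size
print(GiB // avg_chunk)      # 16384 chunks, hence that many requests, objects, and cache entries
```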
The solution is to group chunks into containers. Xorbs are large objects that aggregate many compressed chunks. They are capped at 64 MiB serialized size, and the current implementations cap them at 8,192 chunks, with a typical target around 1,024 chunks.
Within a xorb, chunks are stored in a single recorded order. In practice, uploaders usually append chunks in ingest order, which often preserves file order, but the protocol only relies on that recorded order.
Therefore, the reconstruction plan sent by the server is not just a list of chunks. It is a list of terms (xorb hash + chunk-index range), plus per-xorb download instructions that map those chunk ranges to exact byte ranges.
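To make the shapes concrete, here is a rough Python sketch of the pieces involved. The names and fields are illustrative, not the wire format defined by the spec:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    hash: bytes            # content hash of the (uncompressed) chunk
    data: bytes            # compressed chunk payload, stored inside a xorb

@dataclass
class Xorb:
    hash: bytes            # content hash identifying the xorb
    chunks: list[Chunk]    # up to a few thousand chunks, in their recorded order

@dataclass
class Term:
    xorb_hash: bytes       # which xorb to read from
    chunk_start: int       # first chunk index in that xorb (inclusive)
    chunk_end: int         # last chunk index (exclusive)

# A reconstruction plan is an ordered list of terms: concatenating the
# referenced chunk ranges, in order, reproduces the file byte for byte.
Plan = list[Term]
```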
The client can then download the right ranges from the different xorbs (using the HTTP Range header), parse chunk headers, decompress the chunks, and concatenate everything to get the file.
To check integrity, it can recompute the file hash (a Merkle-style hash over the chunk hashes) and compare it to the expected file hash. Chunks and xorbs are also content-addressed by hash, so each level can be verified.
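A rough sketch of that client-side loop, assuming each term has already been resolved into a download URL and byte range (as the per-xorb instructions provide). `decompress_chunks` is a hypothetical placeholder for the real chunk framing and decompression, and for simplicity the terms are fetched sequentially rather than in parallel:

```python
import hashlib
import requests

def decompress_chunks(payload: bytes) -> list[bytes]:
    """Hypothetical placeholder: parse the chunk framing and decompress each chunk.
    The real framing and compression are defined by the XET spec; this stub
    assumes the payload is a single uncompressed chunk."""
    return [payload]

def reconstruct(plan: list[dict], out_path: str) -> str:
    """Fetch each term's byte range, decode its chunks, and append them in order.
    Each term is assumed to look like {"url": ..., "byte_start": ..., "byte_end": ...}
    (end exclusive)."""
    chunk_hashes = []
    with open(out_path, "wb") as out:
        for term in plan:
            # HTTP Range request for exactly the chunks this term needs
            headers = {"Range": f"bytes={term['byte_start']}-{term['byte_end'] - 1}"}
            resp = requests.get(term["url"], headers=headers, timeout=60)
            resp.raise_for_status()
            for chunk in decompress_chunks(resp.content):
                chunk_hashes.append(hashlib.sha256(chunk).digest())
                out.write(chunk)
    # Derive a file-level hash from the chunk hashes and compare it to the
    # expected one (XET uses a Merkle-style construction; a flat hash over the
    # chunk hashes stands in for it here).
    return hashlib.sha256(b"".join(chunk_hashes)).hexdigest()
```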
Pretty efficient, and well optimized for CDNs and other HTTP caches.
XET is useful beyond the HF Hub
This is a 10,000-foot view of XET, Hugging Face’s storage system.
What’s pretty cool about it is that it can be used for many use cases beyond models and datasets. Anything that stores large objects that get cloned and modified can benefit from it. OCI image registries, for example, also tend to store a ton of very redundant data and would be a great fit for XET.
At its core, XET is a content-addressable storage protocol with chunk-level deduplication. In addition to being optimized for large files, it has some useful properties.
First, it uses content-defined chunking to create variable-sized blocks, and it content-addresses chunks, xorbs, and files by hash. That means deduplication keeps working when bytes are inserted or deleted, unlike with fixed-size blocks, layer-based deduplication, or file-based deduplication.
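As a toy illustration of why that works, here is a simplified content-defined chunker (far simpler and slower than a production rolling-hash chunker, and with no min/max sizes): boundaries are derived from a sliding window of the content itself, so an insertion only disturbs the chunks around the edit.

```python
import hashlib
import random

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0xFF) -> list[bytes]:
    """Cut a chunk whenever the hash of the last `window` bytes has its low bits at zero."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        digest = hashlib.sha256(data[i - window:i]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(20_000))
edited = original[:500] + b"PATCH" + original[500:]   # insert 5 bytes near the start

a, b = cdc_chunks(original), cdc_chunks(edited)
# Boundaries depend only on local content, so chunking resynchronizes right
# after the edit and almost every chunk is shared between the two versions.
print(len(set(a) & set(b)), "chunks shared out of", len(a))
```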
It’s also designed for efficient caching and transfers. More importantly, it has a clean design and a public specification, with multiple open-source implementations maintained by Hugging Face.
There would be a lot of value in adopting XET as a standard content-addressable storage system. Even if you are not Hugging Face and don’t serve models, you may still find applications for XET.
This is only the first part of my introduction to XET, giving you a high-level view of the problem it solves and how it works. But there’s much more to cover, including clever tricks to maximize bandwidth usage and design details to ensure integrity and avoid unauthorized access. XET introduction part 2 covers content-defined chunking and compression.
Further reading
In addition to Hugging Face’s official documentation, there’s a preliminary draft specification of the XET protocol, as well as a simple Python implementation that you may find handy for exploring XET.