Frank DENIS random thoughts.

Pagination and HTTP caching

Pagination is a simple concept, but from the developer’s side, it’s really tricky to get right.

Adding pagination to a fixed set of documents is a piece of cake. In such a situation, lookups have no valid reason not to be O(1) unless you’re blindly using SELECT…LIMIT SQL statements.

But paginating dynamic data, where documents are constantly added, removed and modified, is slightly more challenging. Alas, discussion boards, blog engines and social networks have to suck it up.

In this post, I will focus on client-side caching of pagination. Googling for “pagination http cache” doesn’t yield anything but articles about server-side caching.

This is a sad state of affairs. While browsing a blog or a forum, going back and forth between pages is an instinctive behavior. A lack of client-side caching significantly degrades the user’s experience, wastes bandwidth and burns server CPU cycles for no valid reason.

Don’t do this at home

WordPress, vBulletin, Ning and unfortunately a lot of other forum and blog engines and services are using the worst possible approach.

Document A is posted. Later on comes document B and a bunch of other documents: C, D, E, and so on. Let’s suppose document R is the freshest one and pages are designed to hold up to 5 documents.

Browsing http://example.com displays the newest 5 documents: R, Q, P, O, N. Good and totally expected.

Things start to go wrong when a user follows the “see older posts” link, leading to the next page.

R, Q, P, O and N are on http://example.com, which is an alias for http://example.com/page1. M, L, K, J and I are served as http://example.com/page2. H, G, F, E and D are served as http://example.com/page3, and so on.

So, as the page number grows, the content is getting older.

This is terrible.

Inserting a new document S means that all pages immediately become inconsistent.

http://example.com (that is, http://example.com/page1) should now hold S, R, Q, P and O. http://example.com/page2 should now hold N, M, L, K and J. http://example.com/page3 should now hold I, H, G, F and E.

And this is even more terrible because inserting new documents is an operation that obviously has a high chance of being constantly performed.
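To make the shift concrete, here is a minimal sketch of that newest-first numbering. The function name and the single-letter documents are only illustrative; no real engine is being quoted here.

```python
import string

PAGE_SIZE = 5

def newest_first_pages(documents, page_size=PAGE_SIZE):
    """Split documents (given oldest first) into pages numbered from the newest."""
    newest_first = list(reversed(documents))
    starts = range(0, len(newest_first), page_size)
    return {
        number: newest_first[start:start + page_size]
        for number, start in enumerate(starts, start=1)
    }

documents = list(string.ascii_uppercase[:18])      # documents A to R
before = newest_first_pages(documents)
after = newest_first_pages(documents + ["S"])      # publish document S

# Page numbers whose content differs once S has been published.
changed = [number for number in before if before[number] != after[number]]

print(before[2])
print(after[2])
print(changed)
```

With 18 documents and 5 per page, publishing a single document changes every one of the 4 pages, so nothing that was cached under a page number can be trusted.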

This kind of pagination is inefficient. Users can’t reliably bookmark a page, because if they open the bookmark later on, they might reach other documents.

It ruins SEO. And to top it off, it makes pages virtually impossible to cache.

WordPress, vBulletin, Ning and others are solving this by explicitly preventing browsers from caching pages.

So if a user reads page 1, then goes to page 2, then goes back to page 1, downloading the full page 1 will be necessary, even if nothing has changed at all.

Granted, there are “cacheability” addons for WordPress and vBulletin, but all they do is sometimes issue a “304 Not Modified” reply instead of providing URIs featuring a decent time to live.

How come rewriting every page whenever something is added to a text never happens in real life?

Well… the usual way to use a notebook is to start writing on page 1. Once page 1 is full, the pen holder keeps writing, but on page 2. Once page 2 is full, he keeps writing, this time on page 3. Thus, page 1 holds the oldest content and the last page of the book holds the content that was written last.

Sounds pretty straightforward, eh?

So why not do so on a web site?
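Here is the same toy data set as a sketch of the notebook approach, with page 1 holding the oldest documents. The names are made up for the illustration.

```python
PAGE_SIZE = 5

def oldest_first_pages(documents, page_size=PAGE_SIZE):
    """Split documents (given oldest first) into pages numbered from the oldest."""
    starts = range(0, len(documents), page_size)
    return {
        number: documents[start:start + page_size]
        for number, start in enumerate(starts, start=1)
    }

documents = [chr(code) for code in range(ord("A"), ord("R") + 1)]   # A to R
before = oldest_first_pages(documents)
after = oldest_first_pages(documents + ["S"])                       # publish S

# Only pages whose content differs (or that did not exist before) are listed.
changed = [number for number in after if before.get(number) != after[number]]

print(after[1])
print(after[4])
print(changed)
```

Only the last page is touched by the new document. Pages 1 to 3 are full and stay byte-for-byte identical, so they could be served with a long time to live.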

The first article id

Page numbers were designed for books. More specifically, they were designed in order to manually find content according to a table of contents.

But on a computer, we never have to flick through pages in order to find the content associated with a link. We just click a link, or touch the screen.

Hence, there’s no valid reason to use a page number in order to refer to a page.

Overblog are using URIs like:

http://example.com/40-index.html

40 doesn’t mean page 40. It actually means “the page whose first article is the 40th”. So /40-index.html holds documents 40, 41, 42, 43, 44 and the next 5-items page is /45-index.html.

An immediate benefit is that if a blog owner changes the number of articles per page, it doesn’t break anything. Previously indexed / bookmarked / cached URIs remain totally valid, although the number of articles per page might not be initially consistent.
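A rough sketch of how such URIs could be computed follows. This is not Overblog’s actual code; the helper names are hypothetical and only the URI shape comes from the example above.

```python
def page_uri(first_index):
    """URI of the page whose first article is number `first_index`."""
    return f"/{first_index}-index.html"

def page_contents(first_index, per_page):
    """Article numbers shown on the page starting at `first_index`."""
    return list(range(first_index, first_index + per_page))

def next_page_uri(first_index, per_page):
    """URI of the page that follows the one starting at `first_index`."""
    return page_uri(first_index + per_page)

print(page_uri(40), page_contents(40, 5), next_page_uri(40, 5))
# Switching to 10 articles per page keeps /40-index.html meaningful.
print(page_uri(40), page_contents(40, 10), next_page_uri(40, 10))
```

The page size only affects where a page ends and which link comes next, never what an existing URI starts with.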

However, unless explicitly configured the other way round, document 40 means the 40th starting from the most recent one. Publish a new document, and ex-document 40 now refers to another document, and /40-index.html is out of date.

Overblog pages use ETags, which properly change whenever the content of a page changes. Still, even if the content hasn’t changed, a useless round-trip with a 304 reply is required every time.

Things would have been much better if document identifiers were incremental, i.e. new documents get a higher id than older ones.

00f.net’s initial blog engine worked that way. Page URIs were made of timestamps:
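For reference, ETag revalidation boils down to something like the sketch below. It is a simplification under stated assumptions: the ETag is a hash of the body, and If-None-Match is compared as a single exact string, ignoring weak validators and lists of tags. The function name is made up.

```python
import hashlib

def respond(body: bytes, request_headers: dict):
    """Return (status, headers, body) for a page, honoring If-None-Match."""
    # Assumption: the validator is simply a quoted hash of the page content.
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    if request_headers.get("If-None-Match") == etag:
        # Nothing changed: no body is sent, but the round-trip still happened.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

page = b"<html>documents 40 to 44</html>"
first_status, first_headers, _ = respond(page, {})
second_status, _, second_body = respond(page, {"If-None-Match": first_headers["ETag"]})
print(first_status, second_status, len(second_body))
```

The 304 saves bandwidth, but the browser still has to ask the server and wait for the answer before it can display a page it already has.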

http://example.com/1298640310.html

The content of such a page is N documents whose timestamp is greater than or equal to 1298640310.

This scheme works pretty well, because pages can get easily cached and indexed. Pages that don’t get any update can keep the same URI forever.
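Here is a minimal sketch of such a lookup. It is not the original engine: the function name, the sorted in-memory list and the choice of Cache-Control values are all assumptions, and updates and deletions are ignored.

```python
from bisect import bisect_left

PAGE_SIZE = 5
ONE_YEAR = 365 * 24 * 3600

def page_for(timestamp, timestamps, page_size=PAGE_SIZE):
    """Return (documents, headers) for the page anchored at `timestamp`.

    `timestamps` is the sorted list (oldest first) of document timestamps.
    """
    start = bisect_left(timestamps, timestamp)       # first document >= timestamp
    docs = timestamps[start:start + page_size]
    if len(docs) == page_size:
        # A full page is not affected by newer documents: cache it for long.
        cache_control = f"public, max-age={ONE_YEAR}"
    else:
        # The freshest page may still grow: ask clients to revalidate.
        cache_control = "no-cache"
    return docs, {"Cache-Control": cache_control}

timestamps = [100, 110, 120, 130, 140, 150, 160]
old_page = page_for(100, timestamps)
fresh_page = page_for(150, timestamps)
timestamps.append(170)                               # a new document is published
print(old_page == page_for(100, timestamps))
print(fresh_page[0], page_for(150, timestamps)[0])
```

Publishing a new document leaves the full page untouched, which is what makes a long time to live safe for it.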

An article added to the freshest page automatically causes the URI to change, without invalidating the previous URI.

Timestamps were good enough for this blog, but any increasing and globally unique identifier would be a good fit.

The downside of non-monotonically increasing ids is that while it’s easy to jump to older content (“next page”), it can be tricky to get links to previous pages (newer content) without falling into duplicate content or HTTP redirects. But for infinite-scroll-style pagination, it works wonders.

Dealing with update and removal

Unfortunately, adding new documents is not the only change that is going to occur. Updates and deletions are also bound to happen, albeit less frequently.

How to deal with that?

Well, if a different version of the content is associated with the same page identifier, just bump the version up and include this very version number in the URI.

http://example.com/v2/5/

If a request to http://example.com/v1/5/ ever comes in, a permanent redirect to http://example.com/v2/5/ can be issued. In any case, document 5 (+ older ones) will be present on the page, and the browser cache will get invalidated.
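The routing logic could look like this sketch. The `/v<version>/<id>/` shape comes from the example above; the function name, the global `CURRENT_VERSION` and the one-year time to live are assumptions made for the illustration.

```python
import re

CURRENT_VERSION = 2                       # bumped on every update or deletion
PAGE_PATH = re.compile(r"^/v(\d+)/(\d+)/$")

def route(path, current_version=CURRENT_VERSION):
    """Return (status, headers) for a versioned page path."""
    match = PAGE_PATH.match(path)
    if match is None:
        return 404, {}
    version, first_id = int(match.group(1)), int(match.group(2))
    if version != current_version:
        # Stale version: send the client to the current URI for the same page.
        return 301, {"Location": f"/v{current_version}/{first_id}/"}
    # Current version: the content behind this URI will not change.
    return 200, {"Cache-Control": "public, max-age=31536000"}

print(route("/v1/5/"))
print(route("/v2/5/"))
```

Since a new version means a new URI, browsers never have to be told that their cached copy of the old one is stale.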

While finer versioning (per-page group) would be achievable, a global version number is way easier to deal with. And good enough, considering the fact that updates and deletions are way less frequent than newly added content.

If having a version number in the URI of the first page, i.e. the one with the most recent documents, is a concern, this page might be handled differently by falling back to Last-Modified + 304 replies.
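That fallback could be sketched as follows. The function name is hypothetical, and the modification date is assumed to be a UTC datetime truncated to whole seconds, since HTTP dates have no finer resolution.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond_front_page(last_modified: datetime, request_headers: dict):
    """Return (status, headers) for the front page, honoring If-Modified-Since."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, headers          # the client copy is still fresh
        except (TypeError, ValueError):
            pass                             # unusable date: send the full page
    return 200, headers

modified = datetime.fromtimestamp(1298640310, tz=timezone.utc)
status, headers = respond_front_page(modified, {})
revisit, _ = respond_front_page(modified, {"If-Modified-Since": headers["Last-Modified"]})
print(status, revisit)
```

This costs a round-trip on every visit, but only for that single page.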