The session mechanism is one of the oldest components in PHP. It was born in 1997 and remains pretty much the same in PHP 5.3, minus a severe security flaw that has since been fixed.
Still, it remains widely used, possibly overused and misused as today's web sites aren't exactly facing the same challenges as sites made 15 years ago.
A PHP session is like a super-global array. The content of this array is automatically serialized and saved as a temporary file or in a database. A key, known as a "session identifier", is sent to the client. Further requests can include this key (usually as a cookie) in order to have PHP load the file or query the database, and repopulate the super-global array.
This is a very bare-bones mechanism.
Here's the typical flow of execution of a PHP script using sessions.
- PHP serializes $_SESSION and stores it in the session file
- The output buffer is flushed
The content of $_SESSION can be written before the end of the script if session_commit() is explicitly called.
More or less randomly (with a probability set by session.gc_probability and session.gc_divisor), an additional function is called: the garbage collector, which is responsible for removing expired session files. Instead of hurting arbitrary requests, some Linux distributions delegate the cleanup to a cron-driven shell script.
To lock or not to lock?
When session_start() is called, the default session handler opens or creates a session file and immediately locks it for exclusive use (see ext/session/mod_files.c).
This file gets unlocked when session_commit() is called, either explicitly, or implicitly at the end of the script.
This locking has a significant impact: parallel requests on PHP scripts using sessions will not be processed in parallel, but sequentially.
If a document is being served to a user, another session-enabled request from the same user will block. session_start() will block until the previous document has been fully served.
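When a script only needs to read a few values, one way to shorten the time the lock is held is to copy them out of $_SESSION and call session_commit() right away. A minimal sketch, written for PHP 7 or later; the helper name readSessionSnapshot() is hypothetical:

```php
<?php
// Hypothetical helper: hold the session lock only while reading.
function readSessionSnapshot(): array
{
    // With the default "files" handler, this opens the session file
    // and blocks until the exclusive lock is obtained.
    session_start();

    // Copy the data; fall back to an empty array if the session did not start.
    $snapshot = $_SESSION ?? [];

    // Write the session back and release the lock now, not at script end.
    session_commit();

    return $snapshot;
}

$session = readSessionSnapshot();
// ... slow rendering can happen here without blocking other requests ...
```

Anything assigned to $_SESSION after session_commit() is not saved, so this only suits scripts that are done writing.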
Why locking is good
Handling things sequentially and not in parallel is an obvious way to avoid race conditions.
A lot of PHP applications don't pay attention to atomicity and integrity. Database operations are made without any transactions. Referential integrity is not enforced. Denormalization makes it worse.
Consider the following pseudo-code, inspired by a popular discussion board engine:
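A minimal sketch of the kind of code in question, assuming a hypothetical schema with a posts table and a topics table that carries a denormalized reply_count. None of these names come from an actual engine:

```php
<?php
// Hypothetical schema: posts(id, topic_id, body) and topics(id, reply_count).
function addReply(PDO $db, int $topicId, string $body): void
{
    // Step 1: insert the new post.
    $stmt = $db->prepare('INSERT INTO posts (topic_id, body) VALUES (?, ?)');
    $stmt->execute([$topicId, $body]);

    // Between these two statements, another request can see a topic whose
    // reply_count does not match its actual number of posts.

    // Step 2: update the denormalized counter.
    $stmt = $db->prepare('UPDATE topics SET reply_count = reply_count + 1 WHERE id = ?');
    $stmt->execute([$topicId]);
}
```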
Without any transactions, there's an obvious and possibly dangerous race condition. A user could load another document in another window while the database is still in a half-assed state.
Session locking indirectly protects against this.
Another scenario to consider is this one:
- The main page for an ecommerce web site starts to load and gets partially rendered by the browser. The script reads $_SESSION in order to display the current number of items in the basket,
- Before the request is complete, the user clicks "add to shopping cart",
- This triggers an AJAX request that adds the new item to the $_SESSION-powered basket,
- The script for the main page completes.
Locking ensures that the script called through AJAX will block until the script for the main page completes.
Sure, this prevents race conditions. But from a user's point of view, it sucks. It means that nothing will immediately happen after the user clicks "add to shopping cart". The request will stall until session_start() unblocks. From a sysadmin's point of view, it also sucks. Having full processes doing nothing but wait for a lock sucks. From a developer's point of view, it also sucks. Serialization was acceptable when applications were designed for a single computer running a single core. Now, we're living in a massively parallel world and every time you use a global lock, God asks Cuteoverload to remove pictures of painfully cute kittens.
What would happen without locking?
- The script for the main page loads the session file, stores the content into $_SESSION,
- The script called through AJAX loads the session file, stores the content into $_SESSION,
- The script called through AJAX adds an item to $_SESSION
- The script called through AJAX saves $_SESSION as the session file,
- The script for the main page saves $_SESSION as the session file.
A script is unaware of what another script is doing at the same time. So the script for the main page will overwrite the session file with the content it previously read, no matter what the other script did in between. The item won't be added to the shopping cart.
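The sequence above can be replayed in a single script. This is only a simulation: a string variable stands in for the session file, and two arrays stand in for the $_SESSION of each request. The item names are made up.

```php
<?php
// The "session file" is simulated by a string holding serialized data.
$sessionFile = serialize(['cart' => ['book']]);

// Both scripts load the file before either one writes it back.
$mainPage = unserialize($sessionFile);
$ajax     = unserialize($sessionFile);

// The AJAX script adds an item and saves its copy.
$ajax['cart'][] = 'lamp';
$sessionFile = serialize($ajax);

// The main page only reads the cart count, yet it saves its stale copy last.
$itemCount = count($mainPage['cart']);
$sessionFile = serialize($mainPage);

// The lamp is gone: the final file holds only the original item.
$final = unserialize($sessionFile);
var_dump($final['cart']);
```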
But... is the lack of locking the real issue? Or is locking just a kludgy band-aid against bad code, a la magic_quotes?
Let's recap what's happening in the script for the main page:
- The session file is loaded
- $_SESSION is built, by unserializing the content of the session file
- A single value (the number of items in the cart) is read from $_SESSION
- The content of the file is overwritten with the content of $_SESSION.
Can't you spot anything wrong here, way more shocking than anything that has to do with locking?
First, $_SESSION is a single big bucket. Its whole content is read from and written to the session file every time. This is the worst possible ORM. Totally unrelated content is going to get packed in the same basket. A script whose purpose is to add a product to the basket can overwrite a captcha key, just because both use the PHP session and therefore share the same storage space, which is completely reread and rewritten by every script, every time.
Another shocking aspect is the fact that our main script was supposed to only read the session. Yet, at the end of the script, PHP will overwrite the session file. And guess what with? The content it just read, unmodified. The file should be touched, not overwritten. Best of all, the main script, which doesn't change anything in the session, blocks every other script waiting for the session.
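For a script that only reads the session, PHP 7.0 and later accept a read_and_close option to session_start(), which loads the data and closes the session immediately, so the file is neither kept locked nor rewritten when the script ends. A minimal sketch, assuming the basket lives under a hypothetical 'cart' key:

```php
<?php
// Hypothetical helper for a read-only page such as the main page above.
function cartItemCount(): int
{
    // Load $_SESSION, then close the session at once (PHP >= 7.0).
    session_start(['read_and_close' => true]);

    // 'cart' is an assumed key; a missing cart counts as empty.
    return count($_SESSION['cart'] ?? []);
}
```

The trade-off is that nothing is written at all, so a session that is only ever read this way may eventually be collected as expired.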
Is locking the solution? No.
Why no lock is better
The HTTP protocol has been designed to be stateless. And it works wonderfully this way.
Serving totally different content according to unknown data stored in server-side sessions easily breaks the REST principles. If you don't care about the REST principles, consider this: relying on sessions is likely to ruin the cacheability of what you are serving. Serving the same content when asking for the same thing is the key to cacheability. And possibly what the HTTP protocol has been designed for.
Ditching locks means that requests can be handled in parallel, not sequentially.
A web site that cares a little about performance uses progressive loading. Fragments are rendered independently (and possibly reassembled as a single page through Varnish or Nginx for search engines, but we're digressing).
If every fragment has to wait for the previous one to complete, the benefits of progressive loading can vanish. Parallelization, hence being lockless, is important.
No lock means no idle process doing nothing but waiting for a lock to be released. PHP is not event-driven. Blocked scripts can't be a good thing.
Going lockless also means that, in order to provide sessions that can be shared by multiple PHP servers, a bunch of options are available. Virtually any key/value data store can fit the bill: Redis, Membase, Kyoto Tycoon, you name it.
How to safely move from lockful sessions to lockless sessions?
Here's a simple rule: use a session as a cache for immutable data.
If your server-side scripts constantly need to know what the user name is, it might be acceptable to keep this value in the session.
Regarding the above example: use dedicated storage space in order to store the shopping cart. Just create a shopping cart entry for the current user in Redis, MongoDB, Riak, PostgreSQL, whatever. And push items there. You don't need to store the shopping cart in the session. You just need a key to retrieve the shopping cart. And this key may be kept in the session.
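A minimal sketch of this split, with an in-memory array standing in for Redis or any other store. The CartStore class and its method names are hypothetical; only the cart id would live in the session:

```php
<?php
// Hypothetical store; a real one would wrap Redis, PostgreSQL, etc.
final class CartStore
{
    /** @var array<string, string[]> cart id => list of item ids */
    private $carts = [];

    // Create an empty cart and return its id, the only value the session needs.
    public function create(): string
    {
        $id = bin2hex(random_bytes(8));
        $this->carts[$id] = [];
        return $id;
    }

    // Append one item; unlike rewriting $_SESSION, nothing else is touched.
    public function addItem(string $cartId, string $itemId): void
    {
        $this->carts[$cartId][] = $itemId;
    }

    // Return the items of a cart, or an empty list for an unknown id.
    public function items(string $cartId): array
    {
        return $this->carts[$cartId] ?? [];
    }
}

$store  = new CartStore();
$cartId = $store->create();        // would be kept as $_SESSION['cart_id']
$store->addItem($cartId, 'book');
$store->addItem($cartId, 'lamp');
```

With a real backend, addItem() would map to an atomic append (a list push in Redis, for instance), so two concurrent requests cannot overwrite each other's items.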
Respecting the REST principles makes this obvious. The URI for a shopping cart should be something like /cart/shopping-cart-id, not /cart for everybody, leading to a document that solely depends on the session data.
One URI is one resource. Having a single global array holding mutable data from independent resources is total nonsense. If data have to be shared by different resources, they should be immutable. This is what a session is for.
Ditching PHP sessions
Are PHP sessions actually required after all?
If you have to store secret data that have to be hidden from clients, maybe. Although you need to double check that a session is really where you want to store them, i.e. that these data will be constantly needed by a lot of scripts.
In almost every case, sessions are not needed at all.
How does a PHP session work? The client sends a cookie, a PHP script reads this cookie and fetches the session data, using the cookie value as a key.
Why not just keep the session data in the cookie?
This is a very common practice in the Ruby world, introduced a long time ago with Rails and Rack::Session::Cookie.
And this is fairly secure as long as the cookie also contains an HMAC for the data.
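A minimal sketch of the idea, assuming HMAC-SHA256 and a "payload.mac" cookie layout. The function names and the layout are assumptions, not what Rack or any particular library does:

```php
<?php
// Hypothetical helpers; $secret must never leave the server.
function signCookie(array $data, string $secret): string
{
    $payload = base64_encode(json_encode($data));
    $mac     = hash_hmac('sha256', $payload, $secret);
    return $payload . '.' . $mac;
}

// Returns the decoded data, or null if the cookie is malformed or was tampered with.
function verifyCookie(string $cookie, string $secret): ?array
{
    $parts = explode('.', $cookie, 2);
    if (count($parts) !== 2) {
        return null;
    }
    [$payload, $mac] = $parts;

    // hash_equals() compares in constant time to avoid timing leaks.
    if (!hash_equals(hash_hmac('sha256', $payload, $secret), $mac)) {
        return null;
    }

    $data = json_decode(base64_decode($payload), true);
    return is_array($data) ? $data : null;
}
```

Note that the data is signed, not encrypted: the client cannot alter it unnoticed, but can still read it.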
What are the benefits?
- No need to store any sessions,
- No need to wait for two queries in every script,
- One less point of failure,
- Share-nothing: any client can connect to any PHP server, at any time,
- Having sessions lasting 5 minutes or 5 years doesn't make any difference, server-side.
What are the drawbacks?
- If the secret key is leaked, you're screwed.
- Larger requests.
Cachecookie is a simple way to use cookie-based sessions.
If you want to give it a shot, just download Cachecookie.
(I'm not convinced that it is worth pushing to Github, but if you really want to fork it, ping me).
Cachecookie lets you store multiple values in a single cookie, signed with an HMAC.
Need to store values? Say the user name and his cart id?
The third argument is the life time. Yes. Cachecookie lets you store multiple values, with individual expiration dates, in a single cookie. Ain't it great?
Let's retrieve the user name:
If no entry is found, all you get is NULL.
Of course, the signature of the cookie is verified before any operation.
Want to delete a value?
The method returns TRUE if the key existed, FALSE if you attempted to delete something nonexistent.
Want to delete everything (in fact, the single cookie)?
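The semantics described above (per-value lifetimes, NULL for a missing entry, TRUE or FALSE on deletion, dropping everything at once) can be sketched with a hypothetical class. This is not Cachecookie's actual API: the class and method names are assumptions, and signing and the cookie I/O itself are left out.

```php
<?php
// Hypothetical class, NOT Cachecookie's API: values with individual expiry dates.
final class ExpiringBag
{
    /** @var array<string, array> key => [value, expiry timestamp] */
    private $entries = [];

    // Store a value for $lifetime seconds; $now is injectable for testing.
    public function set(string $key, $value, int $lifetime, ?int $now = null): void
    {
        $this->entries[$key] = [$value, ($now ?? time()) + $lifetime];
    }

    // Returns the value, or null if the key is missing or has expired.
    public function get(string $key, ?int $now = null)
    {
        if (!isset($this->entries[$key])) {
            return null;
        }
        [$value, $expiresAt] = $this->entries[$key];
        return ($now ?? time()) < $expiresAt ? $value : null;
    }

    // Returns true if the key existed, false otherwise.
    public function delete(string $key): bool
    {
        $existed = isset($this->entries[$key]);
        unset($this->entries[$key]);
        return $existed;
    }

    // Drops every value at once.
    public function clear(): void
    {
        $this->entries = [];
    }

    // The string that would be signed and sent as the single cookie.
    public function export(): string
    {
        return json_encode($this->entries);
    }
}

$bag = new ExpiringBag();
$bag->set('user', 'alice', 3600);    // kept for one hour
$bag->set('cart_id', 'c42', 86400);  // kept for one day
```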