Frank DENIS random thoughts.

An overview of modern SQL-free databases

SQL-free databases are quickly moving forward those days. They aren’t “embedded databases” nor “key-value databases” anymore. Modern SQL-free databases are now featureful, while remaining fast, light and reliable. They really can replace SQL databases (even column databases) for a lot of tasks.

Here’s a short overview of some state-of-the-art SQL-free databases.

Tokyo Products

Tokyo Products, designed by Mikio Hirabayashi from Mixi, is a set of three pieces of software :

  • Tokyo Cabinet: a powerful database library, with multiple engines
  • Toyko Tyrant: a network layer for Tokyo Cabinet, with server-side scripting
  • Tokyo Dystopia: a full-text search system built over Tokyo Cabinet.

A fourth piece of the puzzle, although not written by Mikio, might be Lightcloud. See below for details.

Installing Tokyo Products is straightforward. The only tweak I had to do on an Ubuntu system was to add -I/usr/include/lua5.1 to CPPFLAGS in order to let Tokyo Tyrant compile with LUA support.

Supported APIs are: C, Ruby (native and FFI), Perl, Java, LUA, PHP, Clojure, Haskell, Python and CLI. Supported protocols are: native (binary and compact), HTTP (RESTful) and Memcache.

Data can be stored with transparent compression (Deflate, Bzip, TCBS).

On-disk operations are atomic and Tokyo Cabinet databases should never got corrupted. Fragmentation is minimal, but every storage engine supports vacuuming.

Transactions are supported (commit/rollback), with serializable isolation. Concurrency is achieved through a rwlock, not through MVCC.

Tokyo Cabinet

The Tokyo Cabinet library supports 4 major storage engines, with slightly different features. It keeps in-memory caches in order to speed up lookups in hashes and in trees. Of course, it is thread-safe and using it as part of a threaded app is the best way to take advantage of its cache. The C API also comes with a set of useful functions in order to play with in-memory lists and hashes, plus some goodies like URL-parsing.

Databases of fixed-length records

This is the fastest database type. Keys must be numbers, and values have a maximum length.

The fixed-length engine supports the following operations:

  • Store and replace
  • Store or keep the previous value
  • Concatenate (atomically)
  • Delete
  • Get a record
  • Get a range, with or without limits. Ex: get every record whoose key is between 1400 and 9842, but return at most 2500 records

Hash databases

The hash engine supports the following operations:

  • Store and replace
  • Store or keep the previous value
  • Asynchronously store (pending updates are queued and flushed to disk in a blast)
  • Concatenate (atomically)
  • Delete
  • Get all records matching a key : the hash engine supports multiple records for a single key
  • Iterate over all records
  • Get all keys starting with a prefix (handy, but slow with the hash engine)
  • Transactions (begin/commit/abort)
  • Get the number of records

B+Tree databases

The tree can be ordered according to any comparison function. Of course, multiple records can match a single key and walking through records matching garantees that the original order is preserved. New nodes can be inserted anywhere.

The following operations are available:

  • Store and replace
  • Store or keep the previous value
  • Bulk store for a key
  • Concatenate (atomically)
  • Get all records for a key
  • Get a range: with or without limits, borders are inclusive or not, from the beginning or to the end, or from a specific record. Ex: get 100 keys between “bar” excluded and “foo” included.
  • Get all keys matching a prefix (fast)
  • Transactions (with serializable isolation)
  • Get the number of records
  • Cursors: move cursor to the beginning, to the end or jump to a key, move to previous and next keys, insert a new node at cursor, read the current record

Table databases

The table engine handles multiple columns per record, and in addition to the primary key, indexes can be added to any (1 or more) column, just like SQL databases. Each index can be either numerical or lexical.

Supported operations:

  • Store and replace
  • Store or keep the previous value
  • Concatenate (atomically)
  • Delete
  • Get a row
  • Iterate over all records
  • Get all keys matching a prefix (slow)
  • Transactions
  • Get the number of records
  • Index optimization
  • UUID generation (die auto increments, die)
  • Complex queries with multiple conditions on any number of columns. Conditions includes: string equals, string includes, string begins with, string ends with, string includes a list of tokens, string includes at least one token of a list, string matches a regex, number equals, number is less than, greater than, less or equals than, greater or equals that, number is in a range, and number is in a list. All those conditions can be negated. Boolean operations are supported in order to mix conditions.
  • Results of queries can be limited (max results, number of initially skipped records)
  • Results of queries can be ordered, numerically or lexically, ascending or descending
  • Records matching a query can be deleted
  • Count the number of matching records for a query

This must cover 99% of what people use SQL databases for.

Tokyo Tyrant

Tokyo Tyrant is a network layer, providing access to all engines and features of Tokyo Cabinet. It exposes the following protocols:

  • Its own native, binary, compact protocol
  • HTTP (RESTful)
  • Memcache

Tokyo Tyrant also implements master-slaves and master-master (but asynchronous) replication, through traditional update logs. Continuous and on-demande replication are possible.

A great feature of Tokyo Tyrant is its support for server-side scripting. LUA can be embedded, so that external LUA scripts can be loaded and their functions can be called from Tokyo Tyrant clients. Tokyo Tyrant exposes builtins so that those functions can access the databases.

Functions can also be periodically called. An exemple is a trivial script that looks for expiration dates in records, and that removes them accordingly. Want memcache on steroids? You get it.

Tokyo Tyrant also recently introduced a nifty feature : Map-reduce. Provide your map and reduce functions (still through LUA) and it will do the rest.

The current Map-reduce implementation in Tokyo Tyrant can’t distribute the workload accross multiple hosts. But Mikio wrote that it was a work in progress and there is great hope to see that feature in the very near future.

In the meantime, while Tokyo Tyrant provides replication, it doesn’t provide anything to automatically distribute records across multiple hosts. Lightcloud tries to solves this.

Tokyo Dystopia

Tokyo Dystopia adds a full-text indexation to Tokyo Cabinet.

It supports phrase matching, prefix matching, suffix matching, substring matching, token matching and token prefix/suffix matching.

Lightcloud

Distributing records across multiple hosts might sound trivial. Actually, doing it in a reliable way is a nifty challenge. As long as the number of hosts is constant, it is trivial. But offering the ability to add/remove hosts with as little implications as possible is yet another story. You especially have to track out your data as the hash ring changes.

Lightcloud solves that part in an efficient way.

It ships with two components:

  • A script that doesn’t do much than starting Tokyo Tyrant processes. It’s just a helper, as an /etc/init.d/* script is. I don’t see the point of having written it in Python, because it mainly spawns shell commands (“ps aux fgrep ‘ttserver’ grep -v …”, “os.popen(‘rm -f …’)”, yiiirrkkk).
  • A client module that implements everything. Client modules are only available for Python and Ruby.

Keep in mind that Lightcloud is not a proxy. All the work is made at the client app level.

Lightcloud provides only 3 verbs: GET, SET and DELETE.

No other feature of Tokyo products is available through Lightcloud. Other verbs involving a single key would be trivial to add, but queries, ranges, etc. are out of question. Of course, implementing this would make Lightcloud a way much complex module. Yet, Lightcloud is not actually something that transparently transforms a Tokyo database into a cloud.

Redis

Redis is a newcomer, but it brings two very interesting features: sets and lists, with atomic operations.

It has built-in asynchronous master-slave replication, but it can’t distribute data across multiple nodes and I don’t know of any external tool that might do the job yet.

In order to be as fast as possible, it tries to work as much as possible in memory, and asynchronously commit changes to disks, with optional LZF compression. It also has a built-in expiration mechanism and it supports basic authentication.

Redis was trivial to compile and install from source code.

Supported APIs are: Ruby, Python, PHP, Erlang, TCL, Perl, LUA, Java and it also has a CLI.

Simple binary values can be applied the following operations:

  • Store and replace
  • Store or keep the previous value
  • Add and substract a number
  • Delete
  • Get a random key
  • Rename a key
  • Set the time-to-live of a record

But stored objects can also be lists. And the following atomic operations applies:

  • Append an element to the tail of the list
  • Append an element to the head of the list
  • Get the length of the list
  • Get a part of the list
  • Delete a part of the list
  • Pop (get+delete, in a single op) an element
  • Update an element of the list

Pretty slick, hey?

Redis also supports “sets”. Records can be assigned to sets. And then, it is possible to extract members of sets, and you can even easily create a new record whose content is that extraction. SQL zealots can see that as views and materialized views. The following operations apply to a set:

  • Add a member to the set
  • Delete a member from the set
  • Get the intersection between sets for a list of keys
  • Get the intersection between sets for a list of keys and store the result as a list
  • Get the number of members of the set
  • Check whether a key is a member of the set

Moreover, sets and lists can be sorted (lexically and numberically, ascending or descending, with or without limits).

Another nice feature of Redis is that is supports namespaces. Keys can easily be moved from one namespace to another. And even better, the whole content of a namespace can be flushed in a single and fast operation. A very convenient way to flush a lot of related keys.

Redis is an exciting project.

Flare

Flare is a key-value database. It doesn’t have any more complex data structure like Redis. Neither has it advanced query features like Tokyo products. Flare is actually closer to Memcache, which it shares the protocol with (only the FLUSH_ALL command is not supported yet). And Flare has persistent storage, through the Tokyo Cabinet library. Data can also get replicated to MySQL slaves.

It was very easy to install, although you don’t have to forget to install the Boost libraries before compiling it.

The strength of Flare is that it wonderfully and seamlessly supports data replication, distribution and high availability. Here’s a lazy snippet of the Flare main web page:

  • persistent storage (you can use flare as persistent memcached)
  • data replication (synchronous or asynchronous)
  • data partitioning (automatically partitioned according to number of master servers (clients do not have to care about it))
  • dynamic reconstruction, and partitioning (you can dynamically (I mean, without any service interruption) add slave servers and partition master servers)
  • node monitoring and failover (if any server is down, the server is automatically isolated from active servers and another slave server is promoted to master server)
  • request proxy (you can always get same result regardless of servers you connect to. so you can think flare servers as one big key-value storage)

The Flare server can masquerade as a Memcache server. A client would blindly see a single server while there’s actually a cluster of Flare servers.

If you are familiar with key-value databases, Flare is a very easy way to scale your databases. If you need to add redundancy and persistent storage to memcache server, Flare might also be a good pick.

CouchDB

There is a lot of documentation and presentations about CouchDB, so I won’t summarize its features here. Moreover it’s very easy to play with, especially through the built-in Sofa interface.

In order to add clustering abilities to CouchDB, I’d strongly suggest to have a look at CouchDB-Lounge. While CouchDB itself is widely covered, CouchDB-Lounge isn’t, although it’s an essential piece of the puzzle.

Although CouchDB is also an SQL-free database, it’s quite different than previous software. It can scale infinitely, it uses MVCC, creating views using Javascript map-reduce functions is wonderful, but inserting documents (not in a batch) is way slower than with other SQL-free databases. CouchDB is clearly not designed for writing or updating tiny individual documents.

Update: MongoDB =======

I never heard about it before, but MongoDB also looks like an option. It’s a collection-oriented SQL-free database, with support for Java, C++, Python, Ruby, PHP, Erlang and Factor APIs.

I’ll might write more about it once I’ll have actually played with it.

Thanks for the head up, FG.

Update 2:

Here are links to these projects: