Here’s the big picture of the data structures that some NoSQL data stores offer, shown in a JSON-like representation.
Before picking a data store, try to model your data within these constraints and see which one fits best.
SQL
{
  "table 1": {
    "key 1": { "property 1": "string", "property 2": "numerical value" },
    "key 2": { "property 1": "string", "property 2": "numerical value" },
    ...
  },
  "table 2": {
    "key 3": { "property 3": "date" },
    "key 4": { "property 3": "date" },
    ...
  },
  ...
}
- The set of properties is fixed for every record of a table.
- Any key and/or property can be indexed.
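To make the fixed-schema constraint concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for any SQL database. The table, column and index names (`table_1`, `record_key`, `idx_property_2`) and the helper `try_unknown_property` are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The set of properties is declared once, for every record of the table.
conn.execute(
    "CREATE TABLE table_1 (record_key TEXT PRIMARY KEY, property_1 TEXT, property_2 REAL)"
)
# Any property can be indexed, not only the key.
conn.execute("CREATE INDEX idx_property_2 ON table_1 (property_2)")

conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [("key 1", "a", 1.5), ("key 2", "b", 2.5)],
)


def try_unknown_property(connection):
    """Return False if the database rejects a property that is not in the schema."""
    try:
        connection.execute(
            "INSERT INTO table_1 (record_key, property_9) VALUES ('key 3', 'x')"
        )
    except sqlite3.OperationalError:
        return False
    return True


# A query on a non-key property, which can use the index created above.
matching = conn.execute(
    "SELECT record_key FROM table_1 WHERE property_2 > 2"
).fetchall()
```

Adding `property_9` to a single record is rejected here; the whole table would have to be altered first, which is exactly the rigidity the other stores below relax.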
Cassandra
Columns:
{
  "column family 1": {
    "key 1": { "property 1": "value", "property 2": "value" },
    "key 2": { "property 1": "value", "property 4": "value", "property 5": "value" }
  },
  ...
}
and supercolumns:
{
  "column family 2": {
    "super key 1": {
      "key 1": { "property 1": "value", "property 2": "value" },
      "key 2": { "property 1": "value", "property 4": "value", "property 5": "value" },
      ...
    },
    ...
    "super key 2": {
      "key 1": { "property 4": "value", "property 5": "value" },
      "key 2": { "property 1": "value", "property 6": "value", "property 7": "value" },
      ...
    },
    ...
  },
  ...
}
- Column families are constrained by a schema. You need to predefine their names, their types (simple columns or supercolumns), and how the keys and properties will be sorted. Changing the schema requires a server restart in versions before 0.7.
- A database (“keyspace”) can mix simple columns and supercolumns.
- Range queries are supported.
- Support for expiration times in versions 0.7 and later.
- Distributed, fault-tolerant.
- Never locks, even for writes. Efficiently uses multiple cores.
- Runs on a lot of operating systems (wherever a JVM exists).
- No REST client interface. Uses Thrift serialization.
- Can run Hadoop map/reduce jobs.
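Because properties are kept sorted inside a row, a range of them can be read in one go. The following is only an in-memory sketch of that idea, not Cassandra's Thrift API: the `ColumnFamily` class and its `insert` and `get_slice` methods are hypothetical names chosen for this example.

```python
from bisect import bisect_left, bisect_right


class ColumnFamily:
    """Toy model: each key maps to a set of properties read in sorted order."""

    def __init__(self):
        self.rows = {}  # key -> {property name: value}

    def insert(self, key, name, value):
        # Rows do not need to share the same set of properties.
        self.rows.setdefault(key, {})[name] = value

    def get_slice(self, key, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        row = self.rows.get(key, {})
        names = sorted(row)
        low = bisect_left(names, start)
        high = bisect_right(names, finish)
        return [(name, row[name]) for name in names[low:high]]


family = ColumnFamily()
family.insert("key 2", "property 5", "v5")
family.insert("key 2", "property 1", "v1")
family.insert("key 2", "property 4", "v4")

# Range query over the property names of a single row.
selected = family.get_slice("key 2", "property 2", "property 5")
```

The sketch sorts on every read for brevity; a real store keeps the data sorted on disk according to the order declared in the schema.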
MongoDB
{ "namespace 1": any json object, "namespace 2": any json object, ... }
To elaborate, here’s an example:
{
  "namespace 1": [
    {
      "_id": "key 1",
      "property 1": "value",
      "property 2": {
        "property 3": "value",
        "property 4": [ "value", "value", "value" ]
      },
      ...
    },
    {
      "_id": "key 2",
      "property5": {
        "property3": "value",
        "property7": { "question": "6x9", "answer": 42, "list": [ 3, 5 ] }
      },
      ...
    },
    ...
  ]
}
- No schema: the model is totally flexible. Any property can be added to any object at any time.
- Any property can be indexed, at any depth (for example, “property3” could be indexed).
- A wide range of operations can be performed.
- Collections can be capped to a fixed number of bytes or elements.
- Will be distributed and fault-tolerant through a proxy, but this feature is still at an early stage.
- Write operations lock the whole database. Efficiently uses multiple cores.
- Written in C++ but currently runs on a limited number of operating systems.
- Uses its own protocol, but basic REST proxies do exist (in Python and NodeJS).
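Indexing a property "at any depth" can be pictured as follows. This is a plain-Python sketch of the idea, not MongoDB's driver API: `get_path` and `build_index` are hypothetical helpers, and the dotted path only mirrors the dot notation MongoDB uses for nested fields. It assumes the indexed values are hashable (strings or numbers).

```python
from collections import defaultdict


def get_path(document, path):
    """Follow a dotted path such as 'property5.property3' into nested dicts."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the property is simply absent from this document
        current = current[part]
    return current


def build_index(collection, path):
    """Map each value found at `path` to the _id of the documents holding it."""
    index = defaultdict(list)
    for document in collection:
        value = get_path(document, path)
        if value is not None:
            index[value].append(document["_id"])
    return dict(index)


collection = [
    {"_id": "key 1", "property 1": "value", "property 2": {"property 3": "value"}},
    {"_id": "key 2", "property5": {"property3": "value", "property7": {"answer": 42}}},
]

nested_index = build_index(collection, "property5.property3")
deep_index = build_index(collection, "property5.property7.answer")
```

Documents that lack the property are just skipped, which is what makes a schemaless collection workable with indexes.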
Riak
{
  "bucket 1": {
    "key 1": document + content-type,
    "key 2": document + content-type,
    "link to another object 1": URI of other bucket/key,
    "link to another object 2": URI of other bucket/key,
  },
  "bucket 2": {
    "key 3": document + content-type,
    "key 4": document + content-type,
    "key 5": document + content-type
    ...
  },
  ...
}
- Keys are indexed.
- Range queries are not supported.
- Retrieving multiple keys in a single query isn’t supported (must use map/reduce).
- Never locks, even for writes. Efficiently uses multiple cores.
- Distributed and fault tolerant.
- Records can be traversed through links.
- Runs on any platform Erlang runs on.
- REST client interface.
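Traversing records through links can be sketched with plain dictionaries. This is an illustrative in-memory model only, not Riak's REST interface: the layout of the `buckets` dictionary, the `(bucket, key, tag)` tuples and the `walk_links` helper are assumptions made for this example.

```python
# Each stored object carries its document, a content type, and outgoing links.
buckets = {
    "bucket 1": {
        "key 1": {
            "document": "alice",
            "content_type": "text/plain",
            "links": [("bucket 2", "key 3", "friend"), ("bucket 2", "key 4", "boss")],
        },
    },
    "bucket 2": {
        "key 3": {"document": "bob", "content_type": "text/plain", "links": []},
        "key 4": {"document": "carol", "content_type": "text/plain", "links": []},
    },
}


def walk_links(store, bucket, key, tag=None):
    """Return the documents linked from store[bucket][key], optionally filtered by tag."""
    source = store[bucket][key]
    results = []
    for target_bucket, target_key, link_tag in source["links"]:
        if tag is None or link_tag == tag:
            results.append(store[target_bucket][target_key]["document"])
    return results


friends = walk_links(buckets, "bucket 1", "key 1", tag="friend")
everyone = walk_links(buckets, "bucket 1", "key 1")
```

Since there are no range queries and no multi-key fetches, links like these are one of the few ways to get from one record to related ones without a map/reduce job.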
Pincaster
{
  "layer 1": {
    "key 1": {
      "property 1": "value",
      "property 2": "value",
      "property 3": "value",
      ...
    },
    "key 2": {
      "property 1": "value",
      "property 4": "value",
      "geographic location": "latitude, longitude"
    },
    ...
  },
  ...
}
- Keys and geographic locations are indexed.
- Keys are lexically ordered and prefix matching queries are supported.
- Efficient storage of a lot of small entries.
- Support for expiration times on stored keys.
- High throughput with a single host.
- Writes can block reads, but efficiently uses multiple cores.
- Written in portable C, works on most operating systems.
- No distribution or fault tolerance.
- REST client interface.
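Lexically ordered keys are what make prefix matching cheap: all keys sharing a prefix sit next to each other. Here is a small sketch of that property on a sorted Python list. It is not Pincaster's API; `keys_with_prefix` and the sample keys are hypothetical.

```python
from bisect import bisect_left


def keys_with_prefix(sorted_keys, prefix):
    """Return every key starting with `prefix` from a lexically sorted list."""
    # Binary search jumps straight to the first candidate.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are ordered, so no later key can match
        matches.append(key)
    return matches


layer_keys = sorted(["venue:9", "user:2", "abc", "user:1", "usher"])
users = keys_with_prefix(layer_keys, "user:")
```

Choosing key names with meaningful prefixes (for example `user:` or `venue:`) is therefore part of the data modeling for this kind of store.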
Redis
{
  database number: {
    "key 1": "value",
    "key 2": [ "value", "value", "value" ],
    "key 3": [
      { "value": "value", "score": score },
      { "value": "value", "score": score },
      ...
    ],
    "key 4": { "property 1": "value", "property 2": "value", "property 3": "value", ... },
    ...
  }
}
- A database can mix different types of keys.
- Keys and scores are indexed.
- Efficient storage of a lot of small entries.
- High throughput with a single host.
- Support for expiration times on stored keys.
- Collections can be lists or hold distinct elements (sets).
- Rich set of atomic operations.
- No distribution (except through some client libraries) and no fault tolerance.
- Only takes advantage of a single CPU core (but you can run multiple instances on the same host).
- Written in portable C. Runs on most operating systems.
- Uses its own protocol.
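The "key 3" shape above, a collection of distinct values ordered by score, is the least familiar of the four. The sketch below models it in plain Python to show what "scores are indexed" buys you. It is not the Redis protocol or a client library; the `SortedSet` class and its methods are hypothetical names.

```python
from bisect import bisect_left, insort


class SortedSet:
    """Toy model of distinct members kept ordered by a numeric score."""

    def __init__(self):
        self._scores = {}   # member -> score, keeps members distinct
        self._ordered = []  # sorted list of (score, member) pairs

    def add(self, member, score):
        # Re-adding a member replaces its previous score.
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        insort(self._ordered, (score, member))

    def range_by_score(self, low, high):
        """Return members with low <= score <= high, ordered by score."""
        # (low,) sorts before any (low, member), so this finds the first match.
        start = bisect_left(self._ordered, (low,))
        members = []
        for score, member in self._ordered[start:]:
            if score > high:
                break
            members.append(member)
        return members


ranking = SortedSet()
ranking.add("a", 1)
ranking.add("b", 3)
ranking.add("c", 2)
ranking.add("a", 5)  # "a" moves from score 1 to score 5
top = ranking.range_by_score(2, 5)
```

Leaderboards, time-ordered feeds and priority queues all map naturally onto this structure.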
KumoFS, Flare, Elliptics Network
{ "key 1": "value", "key 2": "value", ... }
- No range queries.
- Keys are indexed.
- Efficient storage of a lot of small entries.
- KumoFS supports the CAS operation.
- High throughput with a single host.
- Uses Tokyo Cabinet for on-disk storage. Writes can block reads.
- Distributed and fault tolerant.
- Written in portable C and C++, works on most operating systems.
- Memcache (KumoFS, Flare) or REST (Elliptics Network) client interface.
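With a data model this simple, the CAS (compare-and-swap) operation is the main tool for safe concurrent updates: a write only succeeds if the value has not changed since it was read. Below is a minimal in-memory sketch of the idea. The `Store` class is hypothetical; the method names `gets` and `cas` are borrowed from the memcache protocol commands of the same name, but this is not a client for any of these servers.

```python
class Store:
    """Toy key/value store where each value carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def gets(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        return self._data.get(key, (0, None))

    def cas(self, key, expected_version, new_value):
        """Write new_value only if the key is still at expected_version."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # someone else wrote in the meantime
        self._data[key] = (version + 1, new_value)
        return True


store = Store()
version, _ = store.gets("key 1")
first_write = store.cas("key 1", version, "value")   # succeeds
stale_write = store.cas("key 1", version, "other")   # fails, version moved on
```

A client whose `cas` fails typically reads the value again and retries, which gives optimistic concurrency without any lock on the server.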