Here’s the big picture of the data structures that several NoSQL data stores offer, shown in a JSON-like representation.
Before picking a data store, try to model your data within each set of constraints and see which one fits best.
SQL
{
  "table 1": {
    "key 1": {
      "property 1": "string",
      "property 2": "numerical value"
    },
    "key 2": {
      "property 1": "string",
      "property 2": "numerical value"
    }, ...
  },
  "table 2": {
    "key 3": {
      "property 3": "date"
    },
    "key 4": {
      "property 3": "date"
    }, ...
  }, ...
}
- The set of properties is fixed for every record of a table.
- Any key and/or property can be indexed.
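To make the two constraints above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and index names (table_1, record_key, property_1, property_2, idx_property_2) are made up for illustration, and SQLite only stands in for whatever SQL database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every record of table_1 has the same, fixed set of properties (columns).
conn.execute(
    "CREATE TABLE table_1 ("
    "record_key TEXT PRIMARY KEY, property_1 TEXT, property_2 REAL)"
)
# Any property can be indexed, not only the key.
conn.execute("CREATE INDEX idx_property_2 ON table_1 (property_2)")

conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [("key 1", "string", 1.5), ("key 2", "string", 42.0)],
)

rows = conn.execute(
    "SELECT record_key FROM table_1 WHERE property_2 > ? ORDER BY property_2",
    (10,),
).fetchall()


def insert_unknown_property():
    """Try to store a property the table does not define."""
    try:
        conn.execute(
            "INSERT INTO table_1 (record_key, property_9) VALUES ('key 3', 'x')"
        )
    except sqlite3.OperationalError:
        return False  # the fixed schema rejects unknown properties
    return True
```

The last function is the interesting part: a record cannot carry a property that the table does not already declare.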
Cassandra
Columns:
{
  "column family 1": {
    "key 1": {
      "property 1": "value",
      "property 2": "value"
    },
    "key 2": {
      "property 1": "value",
      "property 4": "value",
      "property 5": "value"
    }
  }, ...
}
and supercolumns:
{
  "column family 2": {
    "super key 1": {
      "key 1": {
        "property 1": "value",
        "property 2": "value"
      },
      "key 2": {
        "property 1": "value",
        "property 4": "value",
        "property 5": "value"
      }, ...
    }, ...
    "super key 2": {
      "key 1": {
        "property 4": "value",
        "property 5": "value"
      },
      "key 2": {
        "property 1": "value",
        "property 6": "value",
        "property 7": "value"
      }, ...
    }, ...
  }, ...
}
- Column families are constrained by a schema. You need to predefine their names, their types (simple columns or supercolumns), and how the keys and properties will be sorted. Changing the schema requires a server restart in versions before 0.7.
- A database (“keyspace”) can mix simple columns and supercolumns.
- Range queries are supported.
- Expiration times are supported in versions >= 0.7.
- Distributed, fault-tolerant.
- Never locks, even for writes. Efficiently uses multiple cores.
- Runs on a lot of operating systems (wherever a JVM exists).
- No REST client interface. Uses Thrift serialization.
- Can run Hadoop map/reduce jobs.
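Because properties are kept sorted inside a row, a range query can be a simple slice. The following is only an in-memory sketch of that idea in plain Python. It does not use Cassandra or its Thrift interface, and column_slice is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right


def column_slice(row, start, finish):
    """Return the (name, value) pairs of one row with start <= name <= finish."""
    names = sorted(row)  # stands in for the sort order declared in the schema
    low = bisect_left(names, start)
    high = bisect_right(names, finish)
    return [(name, row[name]) for name in names[low:high]]


# Rows of the same column family do not need to share the same properties.
column_family_1 = {
    "key 1": {"property 1": "a", "property 2": "b"},
    "key 2": {"property 1": "c", "property 4": "d", "property 5": "e"},
}

sliced = column_slice(column_family_1["key 2"], "property 2", "property 5")
```

Note that "property 2" does not exist in the second row, yet the slice still works, since only the ordering of the names matters.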
MongoDB
{
  "namespace 1": any json object,
  "namespace 2": any json object,
  ...
}
To elaborate, here’s an example:
{
  "namespace 1": [
    {
      "_id": "key 1",
      "property 1": "value",
      "property 2": {
        "property 3": "value",
        "property 4": [ "value", "value", "value" ]
      }, ...
    },
    {
      "_id": "key 2",
      "property5": {
        "property3": "value",
        "property7": { "question": "6x9", "answer": 42, "list": [ 3, 5 ] }
      }, ...
    }, ...
  ]
}
- No schema, the model is totally flexible. Any property can be added to any object any time.
- Any property can be indexed, at any depth (for example, “property3” could be indexed).
- A wide range of operations can be performed.
- Collections can be capped to a fixed number of bytes or elements.
- Distribution and fault tolerance are planned through a proxy, which is still at an early stage.
- Write operations lock the whole database. Efficiently uses multiple cores.
- Written in C++ but currently runs on a limited number of operating systems.
- Uses its own protocol, but basic REST proxies do exist (in Python and NodeJS).
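To picture what indexing a property "at any depth" means, here is a small sketch in plain Python that walks a dotted path into nested objects and builds a lookup table from it. It is only a model of the idea: get_path and build_index are hypothetical helpers, not MongoDB functions, and the sketch assumes the indexed values are hashable (strings, numbers).

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'property5.property3' into nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the property is simply absent from this object
        current = current[part]
    return current


def build_index(collection, dotted_path):
    """Map each value found at dotted_path to the _id of the objects holding it."""
    index = {}
    for document in collection:
        value = get_path(document, dotted_path)
        if value is not None:
            index.setdefault(value, []).append(document["_id"])
    return index


# No schema: the two objects share no property besides _id.
namespace_1 = [
    {"_id": "key 1", "property 1": "value", "property 2": {"property 3": "x"}},
    {"_id": "key 2", "property5": {"property3": "y", "property7": {"answer": 42}}},
]

index = build_index(namespace_1, "property5.property3")
```

Objects that lack the property are just skipped, which is what makes a flexible model and an index coexist.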
Riak
{
  "bucket 1": {
    "key 1": document + content-type,
    "key 2": document + content-type,
    "link to another object 1": URI of other bucket/key,
    "link to another object 2": URI of other bucket/key
  },
  "bucket 2": {
    "key 3": document + content-type,
    "key 4": document + content-type,
    "key 5": document + content-type,
    ...
  }, ...
}
- Keys are indexed.
- Range queries are not supported.
- Retrieving multiple keys in a single query isn’t supported (must use map/reduce).
- Never locks, even for writes. Efficiently uses multiple cores.
- Distributed and fault tolerant.
- Records can be traversed through links.
- Runs on any platform Erlang runs on.
- REST client interface.
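Traversing records through links can be pictured as repeatedly hopping from a set of records to the records they point to. This is a plain Python model of that idea, not Riak's REST interface. The store layout and the walk_links helper are assumptions made for the example.

```python
# Each record is addressed by (bucket, key) and carries a document,
# a content type, and links to other (bucket, key) pairs.
store = {
    ("bucket 1", "key 1"): {
        "content_type": "text/plain",
        "document": "first",
        "links": [("bucket 2", "key 3")],
    },
    ("bucket 2", "key 3"): {
        "content_type": "text/plain",
        "document": "second",
        "links": [("bucket 2", "key 4")],
    },
    ("bucket 2", "key 4"): {
        "content_type": "application/json",
        "document": "{}",
        "links": [],
    },
}


def walk_links(records, start, steps):
    """Follow links from start for the given number of hops.

    Returns the (bucket, key) pairs reached by the last hop.
    """
    frontier = [start]
    for _ in range(steps):
        frontier = [
            target for source in frontier for target in records[source]["links"]
        ]
    return frontier


reached = walk_links(store, ("bucket 1", "key 1"), 2)
```

Each hop only needs key lookups, which fits a store where keys are the only thing indexed.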
Pincaster
{
  "layer 1": {
    "key 1": {
      "property 1": "value",
      "property 2": "value",
      "property 3": "value", ...
    },
    "key 2": {
      "property 1": "value",
      "property 4": "value",
      "geographic location": "latitude, longitude"
    }, ...
  }, ...
}
- Keys and geographic locations are indexed.
- Keys are lexically ordered and prefix matching queries are supported.
- Efficient storage of a lot of small entries.
- Support for expiration times on stored keys.
- High throughput with a single host.
- Writes can block reads, but efficiently uses multiple cores.
- Written in portable C, works on most operating systems.
- No distribution or fault tolerance.
- REST client interface.
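Lexically ordered keys are what make prefix matching cheap: jump to the first key that could match, then read until the prefix stops matching. Here is a rough sketch in plain Python. It is not Pincaster's API, and the key names and the keys_with_prefix helper are invented for the example.

```python
from bisect import bisect_left


def keys_with_prefix(sorted_keys, prefix):
    """Return every key starting with prefix from a lexically sorted list."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are ordered, so nothing further can match
        matches.append(key)
    return matches


layer_keys = sorted(["cafe:1", "cafe:2", "bar:1", "cinema:9"])
cafes = keys_with_prefix(layer_keys, "cafe:")
```

This is also a reason to design keys with meaningful prefixes up front, since the prefix is the only "query" you get on them.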
Redis
{
  database number: {
    "key 1": "value",
    "key 2": [ "value", "value", "value" ],
    "key 3": [
      { "value": "value", "score": score },
      { "value": "value", "score": score },
      ...
    ],
    "key 4": {
      "property 1": "value",
      "property 2": "value",
      "property 3": "value", ...
    }, ...
  }
}
- A database can mix different types of keys.
- Keys and scores are indexed.
- Efficient storage of a lot of small entries.
- High throughput with a single host.
- Support for expiration times on stored keys.
- Collections can be lists or hold distinct elements (sets).
- Rich set of atomic operations.
- No distribution (except through some client libraries) and no fault tolerance.
- Only takes advantage of a single CPU core (but you can run multiple instances on the same host).
- Written in portable C. Runs on most operating systems.
- Uses its own protocol.
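The "key 3" shape above, distinct values each carrying an indexed score, can be modeled in a few lines. This is a toy in-memory sketch in plain Python, not a Redis client. The SortedSet class and its method names are made up for the example.

```python
from bisect import bisect_left, bisect_right


class SortedSet:
    """Distinct members, each with a score, kept ordered by score."""

    def __init__(self):
        self._scores = {}   # member -> score
        self._ordered = []  # sorted list of (score, member)

    def add(self, member, score):
        # Re-adding an existing member replaces its score.
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        position = bisect_left(self._ordered, (score, member))
        self._ordered.insert(position, (score, member))

    def range_by_score(self, low, high):
        """Return members with low <= score <= high, in score order."""
        scores = [score for score, _ in self._ordered]
        start = bisect_left(scores, low)
        stop = bisect_right(scores, high)
        return [member for _, member in self._ordered[start:stop]]


ranking = SortedSet()
ranking.add("a", 3)
ranking.add("b", 1)
ranking.add("c", 2)
ranking.add("a", 0.5)  # "a" stays unique, only its score changes
in_range = ranking.range_by_score(1, 2)
```

If your data is mostly "top N" or "everything between two timestamps", this shape is worth modeling first.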
KumoFS, Flare, Elliptics Network
{
  "key 1": "value",
  "key 2": "value",
  ...
}
- No range queries.
- Keys are indexed.
- Efficient storage of a lot of small entries.
- KumoFS supports the compare-and-swap (CAS) operation.
- High throughput with a single host.
- Uses Tokyo Cabinet for on-disk storage. Writes can block reads.
- Distributed and fault tolerant.
- Written in portable C and C++, works on most operating systems.
- Memcache (KumoFS, Flare) or REST (Elliptics Network) client interface.
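With a flat key/value model, a compare-and-swap operation is often the only tool for safe concurrent updates: a write succeeds only if the value has not changed since you read it. Below is a minimal single-process sketch of that idea in plain Python. It does not talk to KumoFS or use the memcache protocol, and VersionedStore with its gets and cas methods is a hypothetical model.

```python
class VersionedStore:
    """Flat key/value store where each key carries a version counter."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def gets(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        return self._data.get(key, (0, None))

    def cas(self, key, expected_version, new_value):
        """Store new_value only if the key is still at expected_version."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # someone else wrote in between, so retry from gets()
        self._data[key] = (version + 1, new_value)
        return True


kv = VersionedStore()
version, _ = kv.gets("key 1")
first_write = kv.cas("key 1", version, "value")
stale_write = kv.cas("key 1", version, "other value")  # version is now outdated
```

The second write fails because it still carries the version read before the first write, which is exactly the lost-update case the operation is meant to catch.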