Choosing a NoSQL data store according to your data set.
Here's an overview of the data structures that several NoSQL data stores offer, shown in a JSON-like representation.
Before picking a data store, try to model your data within these constraints and see which one fits best.
SQL
{
"table 1": {
"key 1": {
"property 1": "string",
"property 2": "numerical value"
},
"key 2": {
"property 1": "string",
"property 2": "numerical value"
}, ...
},
"table 2": {
"key 3": {
"property 3": "date"
},
"key 4": {
"property 3": "date"
}, ...
}, ...
}
- The set of properties is fixed for every record of a table.
- Any key and/or property can be indexed.
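To make the two points above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample values are made up for illustration. It shows that an index can be put on any property, and that a record cannot carry a property the table does not declare.

```python
import sqlite3

# In-memory database; table, column names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id TEXT PRIMARY KEY, property1 TEXT, property2 REAL)"
)
# Any property can be indexed, not only the key.
conn.execute("CREATE INDEX idx_property2 ON table1 (property2)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("key 1", "a", 1.5), ("key 2", "b", 2.5)],
)

# Every record exposes the same fixed set of properties.
rows = conn.execute(
    "SELECT id, property1, property2 FROM table1 WHERE property2 > ? ORDER BY id",
    (2.0,),
).fetchall()

# A property that is not part of the table definition is rejected.
try:
    conn.execute("INSERT INTO table1 (id, property9) VALUES ('key 3', 'x')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True
```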
Cassandra
Columns:
{
"column family 1": {
"key 1": {
"property 1": "value",
"property 2": "value"
},
"key 2": {
"property 1": "value",
"property 4": "value",
"property 5": "value"
}
}, ...
}
and supercolumns:
{
"column family 2": {
"super key 1": {
"key 1": {
"property 1": "value",
"property 2": "value"
},
"key 2": {
"property 1": "value",
"property 4": "value",
"property 5": "value"
}, ...
}, ...
"super key 2": {
"key 1": {
"property 4": "value",
"property 5": "value"
},
"key 2": {
"property 1": "value",
"property 6": "value",
"property 7": "value"
}, ...
}, ...
}, ...
}
- Column families are constrained by a schema. You need to predefine their names, their types (simple columns or supercolumns), and how the keys and properties will be sorted. Changing the schema requires a server restart in versions before 0.7.
- A database ("keyspace") can mix simple columns and supercolumns.
- Range queries are supported.
- Expiration times are supported in versions 0.7 and later.
- Distributed, fault-tolerant.
- Never locks, even for writes. Efficiently uses multiple cores.
- Runs on a lot of operating systems (wherever a JVM exists).
- No REST client interface. Uses Thrift serialization.
- Can run Hadoop map/reduce jobs.
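Because properties are kept in a predefined sort order, a range query over them amounts to slicing a sorted sequence. The sketch below models a single record as a plain Python dict and assumes names are compared as plain strings. It is not a Cassandra client, and `column_slice` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right

# Plain-Python stand-in for one record of a column family; not a Cassandra client.
row = {
    "property 5": "value",
    "property 1": "value",
    "property 4": "value",
}

def column_slice(record, start, end):
    """Return (name, value) pairs whose name lies in [start, end], in sorted order."""
    names = sorted(record)              # assumed comparator: plain string order
    lo = bisect_left(names, start)      # first name >= start
    hi = bisect_right(names, end)       # one past the last name <= end
    return [(name, record[name]) for name in names[lo:hi]]

selected = column_slice(row, "property 1", "property 4")
```

Records in the same column family may hold different properties, so a slice simply returns whatever falls inside the range for that record.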
MongoDB
{
"namespace 1": any json object,
"namespace 2": any json object,
...
}
To elaborate, here's an example:
{
"namespace 1": [
{
"_id": "key 1",
"property 1": "value",
"property 2": {
"property 3": "value",
"property 4": [ "value", "value", "value" ]
}, ...
},
{
"_id": "key 2",
"property5": {
"property3": "value",
"property7": { "question": "6x9", "answer": 42, "list": [ 3, 5 ] }
}, ...
}, ...
]
}
- No schema: the model is totally flexible. Any property can be added to any object at any time.
- Any property can be indexed, at any depth (for example, "property3" could be indexed).
- A wide range of operations can be performed.
- Collections can be capped to a fixed number of bytes or elements.
- Distribution and fault tolerance are planned through a proxy, but that proxy is still at an early stage.
- Write operations lock the whole database. Efficiently uses multiple cores.
- Written in C++ but currently runs on a limited number of operating systems.
- Uses its own protocol, but basic REST proxies do exist (in Python and NodeJS).
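To picture what indexing a property "at any depth" means, here is a rough sketch in plain Python. It does not use a MongoDB driver. The documents, the dotted path syntax and the helpers `get_path` and `build_index` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical documents with differing shapes, as the schema-less model allows.
collection = [
    {"_id": "key 1", "property 2": {"property 3": "red"}},
    {"_id": "key 2", "property5": {"property3": "blue"}},
    {"_id": "key 3", "property 2": {"property 3": "red"}},
]

def get_path(doc, path):
    """Follow a dotted path into nested dicts; return None if a step is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def build_index(docs, path):
    """Map each value found at `path` to the _ids of the documents holding it."""
    index = defaultdict(list)
    for doc in docs:
        value = get_path(doc, path)
        if value is not None:
            index[value].append(doc["_id"])
    return dict(index)

index = build_index(collection, "property 2.property 3")
```

Documents that lack the nested property are simply left out of the index, which is what lets objects of different shapes share one namespace.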
Riak
{
"bucket 1": {
"key 1": document + content-type,
"key 2": document + content-type,
"link to another object 1": URI of other bucket/key,
"link to another object 2": URI of other bucket/key,
},
"bucket 2": {
"key 3": document + content-type,
"key 4": document + content-type,
"key 5": document + content-type
...
}, ...
}
- Keys are indexed.
- Range queries are not supported.
- Retrieving multiple keys in a single query isn't supported (must use map/reduce).
- Never locks, even for writes. Efficiently uses multiple cores.
- Distributed and fault tolerant.
- Records can be traversed through links.
- Runs on any platform Erlang runs on.
- REST client interface.
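Traversing records through links can be pictured as following bucket/key pairs from one record to the next. The sketch below is plain Python, not the Riak client API. It assumes each record carries its own list of links, and the `walk` helper and the sample documents are hypothetical.

```python
# Plain-Python stand-in for buckets; not the Riak client API.
buckets = {
    "bucket 1": {
        "key 1": {"document": "alice", "links": [("bucket 2", "key 3")]},
        "key 2": {"document": "bob", "links": []},
    },
    "bucket 2": {
        "key 3": {"document": "carol", "links": [("bucket 1", "key 2")]},
    },
}

def walk(store, bucket, key, depth):
    """Collect documents reachable from (bucket, key) in at most `depth` link hops."""
    seen = []
    frontier = [(bucket, key)]
    for _ in range(depth + 1):
        next_frontier = []
        for b, k in frontier:
            if (b, k) in seen:
                continue                 # skip records already visited
            seen.append((b, k))
            next_frontier.extend(store[b][k]["links"])
        frontier = next_frontier
    return [store[b][k]["document"] for b, k in seen]

reachable = walk(buckets, "bucket 1", "key 1", 2)
```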
Pincaster
{
"layer 1": {
"key 1": {
"property 1": "value",
"property 2": "value",
"property 3": "value", ...
},
"key 2": {
"property 1": "value",
"property 4": "value",
"geographic location": "latitude, longitude"
}, ...
}, ...
}
- Keys and geographic locations are indexed.
- Keys are lexically ordered and prefix matching queries are supported.
- Efficient storage of a lot of small entries.
- Support for expiration times on stored keys.
- High throughput with a single host.
- Writes can block reads, but efficiently uses multiple cores.
- Written in portable C, works on most operating systems.
- No distribution or fault tolerance.
- REST client interface.
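Prefix matching is cheap when keys are lexically ordered: a binary search finds the first candidate, and every match follows it contiguously. Here is a minimal sketch of that idea in plain Python. The key names and the `keys_with_prefix` helper are made up, and this is not how Pincaster is queried over REST.

```python
from bisect import bisect_left

# Keys kept in lexical order; the names are hypothetical.
sorted_keys = sorted(["user:alice", "user:bob", "venue:cafe", "venue:park", "zone:1"])

def keys_with_prefix(keys, prefix):
    """Return all keys starting with `prefix` from a lexically sorted list."""
    start = bisect_left(keys, prefix)   # first key >= prefix
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break                       # matches are contiguous, so stop here
        matches.append(key)
    return matches

venues = keys_with_prefix(sorted_keys, "venue:")
```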
Redis
{
database number: {
"key 1": "value",
"key 2": [ "value", "value", "value" ],
"key 3": [
{ "value": "value", "score": score },
{ "value": "value", "score": score },
...
],
"key 4": {
"property 1": "value",
"property 2": "value",
"property 3": "value", ...
}, ...
}
}
- A database can mix different types of keys.
- Keys and scores are indexed.
- Efficient storage of a lot of small entries.
- High throughput with a single host.
- Support for expiration times on stored keys.
- Collections can be lists or hold distinct elements (sets).
- Rich set of atomic operations.
- No built-in distribution (although some client libraries provide it) and no fault tolerance.
- Only takes advantage of a single CPU core (but you can run multiple instances on the same host).
- Written in portable C. Runs on most operating systems.
- Uses its own protocol.
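Indexing on scores means a collection like "key 3" above can be queried by score range and comes back ordered. The following plain-Python sketch illustrates that behaviour only. It is not a Redis client, and the sample values and the `range_by_score` helper are hypothetical.

```python
# Plain-Python stand-in for a scored collection; not a Redis client.
scored = [
    {"value": "carol", "score": 30},
    {"value": "alice", "score": 10},
    {"value": "bob", "score": 20},
]

def range_by_score(members, low, high):
    """Return values whose score lies in [low, high], ordered by ascending score."""
    hits = [m for m in members if low <= m["score"] <= high]
    return [m["value"] for m in sorted(hits, key=lambda m: m["score"])]

top = range_by_score(scored, 15, 30)
```

A real store keeps the members sorted as they are inserted instead of sorting on every query; the sketch sorts on demand only to stay short.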
Elliptics Network
KumoFS
Flare
{
"key 1": "value",
"key 2": "value",
...
}
- No range queries.
- Keys are indexed.
- Efficient storage of a lot of small entries.
- KumoFS supports the CAS operation.
- High throughput with a single host.
- Uses Tokyo Cabinet for on-disk storage. Writes can block reads.
- Distributed and fault tolerant.
- Written in portable C and C++, works on most operating systems.
- Memcache (KumoFS, Flare) or REST (Elliptics Network) client interface.
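For readers unfamiliar with CAS (compare-and-swap), the idea is that a write only succeeds if the value has not changed since it was last read. Below is a minimal in-memory sketch of that contract. The `CasStore` class and its version counter are assumptions for illustration and do not reflect KumoFS internals or the memcache wire protocol.

```python
import threading

class CasStore:
    """In-memory key/value store with a version-checked write (hypothetical)."""

    def __init__(self):
        self._data = {}                  # key -> (value, version)
        self._lock = threading.Lock()

    def get(self, key):
        """Return (value, version); a missing key yields (None, 0)."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def cas(self, key, new_value, expected_version):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if version != expected_version:
                return False             # someone else wrote in between
            self._data[key] = (new_value, version + 1)
            return True

store = CasStore()
first = store.cas("key 1", "a", 0)       # key absent, so version 0 matches
_, seen_version = store.get("key 1")
second = store.cas("key 1", "b", seen_version)   # version is still current
stale = store.cas("key 1", "c", seen_version)    # version has moved on
```

A caller whose `cas` fails would typically read the key again and retry with the fresh version.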
Comments
-
One clarification: you can run MongoDB on most operating systems, including Linux, OS X, Windows, Solaris, etc.
http://www.mongodb.org/display/DOCS/Downloads
-
dmerr: MongoDB is not an option on OpenBSD, NetBSD and DragonflyBSD.
It doesn’t work on big-endian architectures either, no matter what the operating system is.
MongoDB is currently the least portable data store on the list.
-
http://www.mongodb.org/display/DOCS/Building+for+FreeBSD
yes, you are right, there are some missing (endian in particular)
-
So if you would create a new product – like a Facebook clone – which would you end up going with, and why?
Thanks for taking the time to break these down – it’s exciting seeing the nosql movement unfolding so quickly.
Thanks again Jedi! Brandon