Frank DENIS random thoughts.

Choosing a NoSQL data store according to your data set.

Here’s the big picture of the data structures that some NoSQL data stores offer, shown in a JSON-like representation.

Before picking a data store, try to model your data within these constraints and see which one fits best.

SQL

{
	"table 1": {
		"key 1": {
			"property 1": "string",
			"property 2": "numerical value"
		},
		"key 2": {
			"property 1": "string",
			"property 2": "numerical value"
		}, ...
	},
	"table 2": {
		"key 3": {
			"property 3": "date"
		},
		"key 4": {
			"property 3": "date"
		}, ...
	}, ...
}
  • The set of properties is fixed for every record of a table.
  • Any key and/or property can be indexed.
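To make the fixed-schema constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `table1`, the column names and the helper `insert_unknown_property` are placeholders that mirror the structure above; they are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every record of the table shares the same, predeclared set of properties.
conn.execute(
    "CREATE TABLE table1 (id TEXT PRIMARY KEY, property1 TEXT, property2 REAL)"
)
# Any property can be indexed, not only the key.
conn.execute("CREATE INDEX idx_property2 ON table1 (property2)")

conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("key 1", "string", 1.5), ("key 2", "string", 42.0)],
)

def insert_unknown_property():
    """Try to store a property the schema does not declare."""
    try:
        conn.execute("INSERT INTO table1 (id, property9) VALUES ('key 3', 'x')")
        return True
    except sqlite3.OperationalError:
        # The schema rejects properties that were not declared up front.
        return False

rows = conn.execute(
    "SELECT id FROM table1 WHERE property2 > 10 ORDER BY id"
).fetchall()
```

If your records do not all share the same properties, this is where a relational table starts to feel tight.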

Cassandra

Columns:

{
	"column family 1": {
		"key 1": {
			"property 1": "value",
			"property 2": "value"
		},
		"key 2": {
			"property 1": "value",
			"property 4": "value",
			"property 5": "value"
		}
	}, ...
}

and supercolumns:

{
	"column family 2": {
		"super key 1": {
			"key 1": {
				"property 1": "value",
				"property 2": "value"				
			},
			"key 2": {
				"property 1": "value",
				"property 4": "value",
				"property 5": "value"			
			}, ...
		}, ...
		"super key 2": {
			"key 1": {
				"property 4": "value",
				"property 5": "value"				
			},
			"key 2": {
				"property 1": "value",
				"property 6": "value",
				"property 7": "value"			
			}, ...
		}, ...
	}, ...
}
  • Column families are constrained by a schema. You need to predefine their names, their types (simple columns or supercolumns), and how the keys and properties will be sorted. Changing the schema requires a server restart in versions before 0.7.
  • A database (“keyspace”) can mix simple columns and supercolumns.
  • Range queries are supported.
  • Expiration times are supported in versions >= 0.7.
  • Distributed, fault-tolerant.
  • Never locks, even for writes. Efficiently uses multiple cores.
  • Runs on a lot of operating systems (wherever a JVM exists).
  • No REST client interface. Uses Thrift serialization.
  • Can run Hadoop map/reduce jobs.
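The column model can be sketched with plain Python dictionaries. This is only an in-memory stand-in, not the Thrift API: the `column_family` variable and the `column_slice` helper are hypothetical names. It shows that rows in the same column family may hold different properties, and that sorted properties make range queries possible.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for one column family: each row key maps
# to its own set of columns, and rows need not share the same columns.
column_family = {
    "key 1": {"property 1": "a", "property 2": "b"},
    "key 2": {"property 1": "c", "property 4": "d", "property 5": "e"},
}

def column_slice(row, start, end):
    """Return the columns of a row whose names fall in [start, end], in order."""
    names = sorted(row)  # a real store keeps these sorted; here we sort on demand
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, row[name]) for name in names[lo:hi]]

result = column_slice(column_family["key 2"], "property 2", "property 4")
```

The sort order you declare in the schema decides which ranges are cheap to read, so it is worth choosing with your queries in mind.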

MongoDB

{
	"namespace 1": any json object,
	"namespace 2": any json object,
	...
}

To elaborate, here’s an example:

{
	"namespace 1": [
		{
			"_id": "key 1",
			"property 1": "value",
			"property 2": {
				"property 3": "value",
				"property 4": [ "value", "value", "value" ]
			}, ...
		},
		{
			"_id": "key 2",
			"property5": {
				"property3": "value",
				"property7": { "question": "6x9", "answer": 42, "list": [ 3, 5 ] }
			}, ...		
		}, ...
	]
}
  • No schema: the model is totally flexible. Any property can be added to any object at any time.
  • Any property can be indexed, at any depth (for example, “property3” could be indexed).
  • A wide range of operations can be performed.
  • Collections can be capped to a fixed number of bytes or elements.
  • Distribution and fault tolerance are planned through a proxy, but that proxy is still at an early stage.
  • Write operations lock the whole database. Efficiently uses multiple cores.
  • Written in C++ but currently runs on a limited number of operating systems.
  • Uses its own protocol, but basic REST proxies do exist (in Python and NodeJS).
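Indexing a property "at any depth" can be pictured as following a dotted path into each document. The sketch below is plain Python, not MongoDB's driver or query language; `get_path` and `build_index` are hypothetical helpers, and the sample documents are trimmed versions of the example above.

```python
from collections import defaultdict

documents = [
    {"_id": "key 1", "property 1": "value",
     "property 2": {"property 3": "x", "property 4": ["a", "b"]}},
    {"_id": "key 2",
     "property5": {"property3": "y", "property7": {"answer": 42}}},
]

def get_path(doc, path):
    """Follow a dotted path into nested dicts; return None if a step is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def build_index(docs, path):
    """Map each value found at `path` to the _ids of the documents holding it."""
    index = defaultdict(list)
    for doc in docs:
        value = get_path(doc, path)
        if value is not None:
            index[value].append(doc["_id"])
    return index

index = build_index(documents, "property5.property7.answer")
```

Documents that lack the path are simply absent from the index, which is what lets objects in the same namespace have completely different shapes.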

Riak

{
	"bucket 1": {
		"key 1": document + content-type,
		"key 2": document + content-type,
		"link to another object 1": URI of other bucket/key,
		"link to another object 2": URI of other bucket/key,		
	},
	"bucket 2": {
		"key 3": document + content-type,
		"key 4": document + content-type,
		"key 5": document + content-type
		...
	}, ...
}
  • Keys are indexed.
  • Range queries are not supported.
  • Retrieving multiple keys in a single query isn’t supported (must use map/reduce).
  • Never locks, even for writes. Efficiently uses multiple cores.
  • Distributed and fault tolerant.
  • Records can be traversed through links.
  • Runs on any platform Erlang runs on.
  • REST client interface.
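Traversing records through links can be sketched as a walk over (bucket, key) pairs. This is an in-memory illustration only, not Riak's REST interface: the `buckets` layout, the field names and the `walk_links` helper are all assumptions.

```python
# Hypothetical in-memory stand-in: bucket -> key -> document, content type, links.
buckets = {
    "bucket 1": {
        "key 1": {"content_type": "text/plain", "data": "hello",
                  "links": [("bucket 2", "key 3")]},
    },
    "bucket 2": {
        "key 3": {"content_type": "application/json", "data": "{}",
                  "links": [("bucket 2", "key 4")]},
        "key 4": {"content_type": "text/plain", "data": "end", "links": []},
    },
}

def walk_links(bucket, key, max_depth):
    """Follow links breadth-first from (bucket, key), up to max_depth hops."""
    seen = [(bucket, key)]
    frontier = [(bucket, key)]
    for _ in range(max_depth):
        next_frontier = []
        for b, k in frontier:
            for target in buckets[b][k]["links"]:
                if target not in seen:
                    seen.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen[1:]  # everything reached, excluding the starting object
```

Since range queries and multi-key fetches are not available, links are the main way to express relationships between objects in this model.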

Pincaster

{
	"layer 1": {
		"key 1": {
			"property 1": "value",
			"property 2": "value",
			"property 3": "value", ...
		},
		"key 2": {
			"property 1": "value",
			"property 4": "value",
			"geographic location": "latitude, longitude"
		}, ...
	}, ...
}
  • Keys and geographic locations are indexed.
  • Keys are lexically ordered and prefix matching queries are supported.
  • Efficient storage of a lot of small entries.
  • Support for expiration times on stored keys.
  • High throughput with a single host.
  • Writes can block reads, but efficiently uses multiple cores.
  • Written in portable C, works on most operating systems.
  • No distribution or fault tolerance.
  • REST client interface.
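Prefix matching is cheap when keys are kept in lexical order: a binary search finds the first candidate, and the scan stops at the first key that no longer matches. The sketch below is plain Python rather than Pincaster's REST interface; the sample keys and the `prefix_match` helper are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical keys, kept in lexical order as the store would keep them.
keys = sorted(["user:alice", "user:bob", "venue:cafe", "venue:cinema", "zone:1"])

def prefix_match(sorted_keys, prefix):
    """Return every key starting with `prefix`, using the lexical ordering."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # ordered keys: nothing further can match
        matches.append(key)
    return matches

venues = prefix_match(keys, "venue:")
```

This is why key naming matters with such a store: putting the most selective part first turns a full scan into a short range read.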

Redis

{
	database number: {
		"key 1": "value",
		"key 2": [ "value", "value", "value" ],
		"key 3": [
			{ "value": "value", "score": score },
			{ "value": "value", "score": score },
			...			
		],
		"key 4": {
			"property 1": "value",
			"property 2": "value",
			"property 3": "value", ...
		}, ...
	}
}
  • A database can mix different types of keys.
  • Keys and scores are indexed.
  • Efficient storage of a lot of small entries.
  • High throughput with a single host.
  • Support for expiration times on stored keys.
  • Collections can be lists or hold distinct elements (sets).
  • Rich set of atomic operations.
  • No distribution (except through some client libraries) and no fault tolerance.
  • Only takes advantage of a single CPU core (but you can run multiple instances on the same host).
  • Written in portable C. Runs on most operating systems.
  • Uses its own protocol.
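The "key 3" shape above, distinct values that each carry a score, can be sketched as a toy class. It is a rough in-memory model of the idea, not Redis's implementation or command set; `ScoredSet` and its method names are hypothetical.

```python
from bisect import insort

class ScoredSet:
    """Toy scored set: distinct members, each with a score, ordered by score."""

    def __init__(self):
        self._scores = {}   # member -> current score
        self._ordered = []  # sorted list of (score, member) pairs

    def add(self, member, score):
        """Insert a member, or move it if it already exists with another score."""
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        insort(self._ordered, (score, member))

    def range_by_score(self, low, high):
        """Return members whose score lies in [low, high], lowest score first."""
        return [member for score, member in self._ordered if low <= score <= high]

ranking = ScoredSet()
ranking.add("a", 1)
ranking.add("b", 5)
ranking.add("c", 3)
ranking.add("a", 4)  # re-adding a member updates its score, no duplicate
```

Leaderboards and time-ordered feeds map naturally onto this structure, with the score acting as the indexed property.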

Elliptics Network

KumoFS

Flare

{
	"key 1": "value",
	"key 2": "value",
	...
}
  • No range queries.
  • Keys are indexed.
  • Efficient storage of a lot of small entries.
  • KumoFS supports the CAS operation.
  • High throughput with a single host.
  • Uses Tokyo Cabinet for on-disk storage. Writes can block reads.
  • Distributed and fault tolerant.
  • Written in portable C and C++, works on most operating systems.
  • Memcache (KumoFS, Flare) or REST (Elliptics Network) client interface.
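A CAS (compare-and-swap) operation writes a value only if the key has not changed since it was read. The sketch below illustrates the idea with a hypothetical `KeyValueStore` class in plain Python. It compares values directly for simplicity, whereas the memcached protocol compares a per-item token returned by a previous read.

```python
import threading

class KeyValueStore:
    """Toy flat key/value store with a compare-and-swap operation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def compare_and_swap(self, key, expected, new):
        """Store `new` only if the current value still equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else changed the value first
            self._data[key] = new
            return True

store = KeyValueStore()
store.set("counter", 1)
first = store.compare_and_swap("counter", 1, 2)   # value unchanged: succeeds
second = store.compare_and_swap("counter", 1, 3)  # stale expectation: refused
```

With only flat keys and values available, CAS is what lets several clients update the same key without silently overwriting each other.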