Frank DENIS random thoughts.

Choosing a NoSQL data store according to your data set.

Here's the big picture of the data structures that some NoSQL data stores give you, in a JSON-like representation.

Before picking a data store, try to model your data within these constraints and see which one is the best fit.

SQL

{
    "table 1": {
        "key 1": {
            "property 1": "string",
            "property 2": "numerical value"
        },
        "key 2": {
            "property 1": "string",
            "property 2": "numerical value"
        }, ...
    },
    "table 2": {
        "key 3": {
            "property 3": "date"
        },
        "key 4": {
            "property 3": "date"
        }, ...
    }, ...
}
  • The set of properties is fixed for every record of a table.
  • Any key and/or property can be indexed.
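
To make these two constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample rows are hypothetical, chosen only to mirror the JSON above; it is an illustration, not a recommendation of a specific SQL engine.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_1 ("
    "record_key TEXT PRIMARY KEY, property_1 TEXT, property_2 REAL)"
)
# Any property can be indexed, not only the key.
conn.execute("CREATE INDEX idx_property_2 ON table_1 (property_2)")

conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [("key 1", "string", 1.5), ("key 2", "string", 42.0)],
)


def fetch_above(threshold):
    """Return the keys whose property_2 exceeds the threshold, ordered by key."""
    rows = conn.execute(
        "SELECT record_key FROM table_1 WHERE property_2 > ? ORDER BY record_key",
        (threshold,),
    )
    return [row[0] for row in rows]


def insert_with_unknown_property():
    """Try to store a property the table does not define.

    Returns True if the database rejects it, which is what a fixed
    set of properties implies.
    """
    try:
        conn.execute(
            "INSERT INTO table_1 (record_key, property_9) VALUES ('key 3', 'x')"
        )
    except sqlite3.OperationalError:
        return True
    return False
```

The second function is the interesting one: a record cannot carry a property that the table does not declare, which is exactly what the stores below relax in different ways.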

Cassandra

Columns:

{
    "column family 1": {
        "key 1": {
            "property 1": "value",
            "property 2": "value"
        },
        "key 2": {
            "property 1": "value",
            "property 4": "value",
            "property 5": "value"
        }
    }, ...
}

and supercolumns:

{
    "column family 2": {
        "super key 1": {
            "key 1": {
                "property 1": "value",
                "property 2": "value"               
            },
            "key 2": {
                "property 1": "value",
                "property 4": "value",
                "property 5": "value"           
            }, ...
        }, ...
        "super key 2": {
            "key 1": {
                "property 4": "value",
                "property 5": "value"               
            },
            "key 2": {
                "property 1": "value",
                "property 6": "value",
                "property 7": "value"           
            }, ...
        }, ...
    }, ...
}
  • Column families are constrained by a schema. You need to predefine their names, their types (simple columns or supercolumns), and how the keys and properties will be sorted. In versions before 0.7, changing the schema requires a server restart.
  • A database ("keyspace") can mix simple columns and supercolumns.
  • Range queries are supported.
  • Expiration times are supported in versions 0.7 and later.
  • Distributed, fault-tolerant.
  • Never locks, even for writes. Efficiently uses multiple cores.
  • Runs on a lot of operating systems (wherever a JVM exists).
  • No REST client interface. Uses Thrift serialization.
  • Can run Hadoop map/reduce jobs.
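
The combination of sorted properties and range queries can be pictured with a toy in-memory model. This is only a sketch in plain Python, not the Thrift API: the function name `column_slice` and the sample values are made up, and plain string comparison stands in for whatever sort order the schema defines.

```python
from bisect import bisect_left, bisect_right

# Toy model of one column family: row key -> {column name: value}.
column_family = {
    "key 1": {"property 1": "a", "property 2": "b"},
    "key 2": {"property 1": "c", "property 4": "d", "property 5": "e"},
}


def column_slice(row_key, start, end):
    """Return (name, value) pairs whose column name lies in [start, end].

    Assumption: columns are ordered by plain string comparison.
    """
    row = column_family.get(row_key, {})
    names = sorted(row)  # columns are kept sorted by name within a row
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, row[name]) for name in names[lo:hi]]
```

Because the names are sorted, a range is a contiguous slice, which is why it pays to think about the sort order when designing the column names.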

MongoDB

{
    "namespace 1": any json object,
    "namespace 2": any json object,
    ...
}

To elaborate, here's an example:

{
    "namespace 1": [
        {
            "_id": "key 1",
            "property 1": "value",
            "property 2": {
                "property 3": "value",
                "property 4": [ "value", "value", "value" ]
            }, ...
        },
        {
            "_id": "key 2",
            "property5": {
                "property3": "value",
                "property7": { "question": "6x9", "answer": 42, "list": [ 3, 5 ] }
            }, ...      
        }, ...
    ]
}
  • No schema: the model is totally flexible. Any property can be added to any object at any time.
  • Any property can be indexed, at any depth (for example, "property3" could be indexed).
  • A wide range of operations can be performed.
  • Collections can be capped to a fixed number of bytes or elements.
  • Distribution and fault tolerance are planned through a proxy, but it is still at an early stage.
  • Write operations lock the whole database. Efficiently uses multiple cores.
  • Written in C++ but currently runs on a limited number of operating systems.
  • Uses its own protocol, but basic REST proxies do exist (in Python and Node.js).
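
Indexing a property "at any depth" is easier to picture with a small model. The sketch below is plain Python over dictionaries, not a MongoDB client. The helpers `get_path` and `build_index` are hypothetical names, and the dotted path is only a convenient way to name a nested property here.

```python
from collections import defaultdict

# Two documents of one namespace, shaped like the example above.
documents = [
    {
        "_id": "key 1",
        "property 1": "value",
        "property 2": {"property 3": "x", "property 4": ["a", "b"]},
    },
    {
        "_id": "key 2",
        "property5": {"property3": "y", "property7": {"answer": 42}},
    },
]


def get_path(doc, path):
    """Follow a dotted path through nested dicts; None if a step is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def build_index(docs, path):
    """Map each value found at the path to the _id values that hold it.

    Assumption: the indexed values are hashable (strings, numbers).
    """
    index = defaultdict(list)
    for doc in docs:
        value = get_path(doc, path)
        if value is not None:
            index[value].append(doc["_id"])
    return dict(index)
```

Documents that lack the property are simply absent from the index, which is what makes a schema-free model workable: objects do not need to look alike to live in the same namespace.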

Riak

{
    "bucket 1": {
        "key 1": document + content-type,
        "key 2": document + content-type,
        "link to another object 1": URI of other bucket/key,
        "link to another object 2": URI of other bucket/key
    },
    "bucket 2": {
        "key 3": document + content-type,
        "key 4": document + content-type,
        "key 5": document + content-type,
        ...
    }, ...
}
  • Keys are indexed.
  • Range queries are not supported.
  • Retrieving multiple keys in a single query isn't supported (must use map/reduce).
  • Never locks, even for writes. Efficiently uses multiple cores.
  • Distributed and fault tolerant.
  • Records can be traversed through links.
  • Runs on any platform Erlang runs on.
  • REST client interface.
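
Link traversal is the feature that compensates for the lack of range and multi-key queries, so here is a rough model of it. This is a sketch over plain dictionaries, not the Riak REST interface: the `links` field, the `walk_links` function and the sample documents are all assumptions made for the illustration.

```python
# Toy model: bucket -> key -> {"content_type", "document", "links"}.
# Each link is a (bucket, key) pair pointing at another object.
buckets = {
    "bucket 1": {
        "key 1": {
            "content_type": "text/plain",
            "document": "hello",
            "links": [("bucket 2", "key 3")],
        },
    },
    "bucket 2": {
        "key 3": {
            "content_type": "application/json",
            "document": "{}",
            "links": [("bucket 2", "key 4")],
        },
        "key 4": {"content_type": "text/plain", "document": "end", "links": []},
    },
}


def walk_links(bucket, key, max_depth):
    """Return the (bucket, key) pairs reachable within max_depth link hops."""
    seen = [(bucket, key)]
    frontier = [(bucket, key)]
    for _ in range(max_depth):
        next_frontier = []
        for b, k in frontier:
            for target in buckets[b][k]["links"]:
                if target not in seen:
                    seen.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen[1:]  # drop the starting object itself
```

If your data set is a graph of documents that you mostly reach from a known starting key, this model fits well. If you need "all keys between A and B", it does not.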

Pincaster

{
    "layer 1": {
        "key 1": {
            "property 1": "value",
            "property 2": "value",
            "property 3": "value", ...
        },
        "key 2": {
            "property 1": "value",
            "property 4": "value",
            "geographic location": "latitude, longitude"
        }, ...
    }, ...
}
  • Keys and geographic locations are indexed.
  • Keys are lexically ordered and prefix matching queries are supported.
  • Efficient storage of a lot of small entries.
  • Support for expiration times on stored keys.
  • High throughput with a single host.
  • Writes can block reads, but efficiently uses multiple cores.
  • Written in portable C, works on most operating systems.
  • No distribution or fault tolerance.
  • REST client interface.
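
Prefix matching falls out naturally from lexically ordered keys. The sketch below shows the idea in plain Python with a sorted list and a binary search. It is not the Pincaster API, and the key names and the function `keys_with_prefix` are invented for the example.

```python
from bisect import bisect_left

# Keys of one layer, kept in lexical order (sample keys are made up).
keys = sorted(["user:alice", "user:bob", "shop:1", "shop:2", "user:carol"])


def keys_with_prefix(sorted_keys, prefix):
    """Return every key starting with prefix, relying on the lexical order."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # matching keys are contiguous, so we can stop here
        result.append(key)
    return result
```

The practical consequence is that key naming becomes part of the data model: put the part you want to search by at the beginning of the key.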

Redis

{
    database number: {
        "key 1": "value",
        "key 2": [ "value", "value", "value" ],
        "key 3": [
            { "value": "value", "score": score },
            { "value": "value", "score": score },
            ...         
        ],
        "key 4": {
            "property 1": "value",
            "property 2": "value",
            "property 3": "value", ...
        }, ...
    }
}
  • A database can mix different types of keys.
  • Keys and scores are indexed.
  • Efficient storage of a lot of small entries.
  • High throughput with a single host.
  • Support for expiration times on stored keys.
  • Collections can be lists or hold distinct elements (sets).
  • Rich set of atomic operations.
  • No built-in distribution (only through some client libraries) and no fault tolerance.
  • Only takes advantage of a single CPU core (but you can run multiple instances on the same host).
  • Written in portable C. Runs on most operating systems.
  • Uses its own protocol.
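
The "key 3" shape above, a collection of distinct values ordered by score, is the least familiar one, so here is a small model of it. This is a plain Python sketch and not a Redis client: the class `ScoredSet` and its method names are hypothetical, and the range lookup is written for clarity rather than speed.

```python
from bisect import bisect_left, bisect_right, insort


class ScoredSet:
    """Distinct members, each with a score, kept ordered by score."""

    def __init__(self):
        self._scores = {}   # member -> score
        self._ordered = []  # sorted list of (score, member)

    def add(self, member, score):
        """Insert a member, or move it if it already has a score."""
        if member in self._scores:
            old = (self._scores[member], member)
            self._ordered.pop(bisect_left(self._ordered, old))
        self._scores[member] = score
        insort(self._ordered, (score, member))

    def range_by_score(self, low, high):
        """Return members whose score lies in [low, high], lowest first."""
        scores = [score for score, _ in self._ordered]
        lo = bisect_left(scores, low)
        hi = bisect_right(scores, high)
        return [member for _, member in self._ordered[lo:hi]]
```

Leaderboards, time-ordered feeds (with a timestamp as the score) and priority queues all map onto this structure without any extra indexing work.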

Elliptics Network

KumoFS

Flare

{
    "key 1": "value",
    "key 2": "value",
    ...
}
  • No range queries.
  • Keys are indexed.
  • Efficient storage of a lot of small entries.
  • KumoFS supports the CAS operation.
  • High throughput with a single host.
  • Uses Tokyo Cabinet for on-disk storage. Writes can block reads.
  • Distributed and fault tolerant.
  • Written in portable C and C++, works on most operating systems.
  • Memcache (KumoFS, Flare) or REST (Elliptics Network) client interface.
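
With a flat key/value model, the CAS (compare-and-swap) operation is the main tool for updating a value safely. The following single-process sketch shows the semantics only. It is not KumoFS client code, and the class `VersionedStore`, its method names and the version counter are assumptions made for the illustration.

```python
class VersionedStore:
    """Flat key -> value store where every write bumps a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        """Return (value, version); an absent key reads as (None, 0)."""
        return self._data.get(key, (None, 0))

    def cas(self, key, new_value, expected_version):
        """Write only if nobody changed the key since it was read.

        Returns True when the write was applied, False otherwise.
        """
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            return False  # someone else wrote in between; caller should retry
        self._data[key] = (new_value, current_version + 1)
        return True
```

A typical use is a read-modify-write loop: read the value and its version, compute the new value, attempt the swap, and start over if it fails.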