Frank DENIS random thoughts.

Using Pincaster for autocompletion

Pincaster comes in handy for geo data or as a persistent memcache that speaks HTTP. But it also has other uses.

An overlooked feature is that keys are always ordered, hence prefix-matching is fast. This makes it a perfect candidate for autocompletion.

The Skyrock Spot iPhone app supports phonetical instant search on account and gang names. These names are weird beasts, usually made of a mix of decorative characters, redundant characters and digits/isolated characters that require to be pronounced as-is.

Names are normalized with Francotone. This library is dedicated to normalizing text-messaging-like French.

Then, names are just stored in Pincaster as keys composed of the normalized (phonetic) name, followed by a separator and by the original name.

For example, user "Franck142" would be stored as a key whose name is "frankunkatrde|Franck142".

In addition to hashes, Pincaster supports a "void" type. A key mapped to a void type doesn't store any data besides the key itself and requires very little memory. It might be a good pick if all you need is autocompletion. But adding a couple of key/value pairs is also totally worth it if you need to go beyond autocompletion and display basic search results.

Minus PHP's terrible signal/noise ratio, indexing a new user is as simple as:

static function update_user($user_id, $user_name) {
    assert(!empty($user_id));
    assert(!empty($user_name));
    $ft = self::_name_to_ft($user_name);
    Connections::do_with_db_instance('pincaster_masters_search',
      NULL, function($db) use ($user_name, $user_id, $ft) {
        assert(strchr($ft, '|') === FALSE);
        $layer = $db->layer('search_users');
        $layer->set($ft . '|' . mb_strtolower($user_name),
                    array('user_id' => $user_id));
    });
}

_name_to_ft() just uses Francotone in order to retrieve a phonetical translation of the user name.

Gangs are indexed in a similar way, but include more key/value pairs with scores, ranks, members and basically everything we need to display as a search result.

Finding candidates for a query only requires two prefix searches: one with the normalized form as a prefix (frankunkatrde| + wildcard), and another one with the query as a prefix (frankunkatrde + wildcard). This covers any user name that would sound like "frankunkatrde" and any user name that begins with something that would sound like "frankunkatrde".

Here's the current code, shared by anything indexed this way:

static protected function _search($query, $limit, $connection_name, $layer_name) {
    $query = mb_strtolower($query);
    $query = str_replace('|', ' ', $query);
    $query = trim(str_replace('*', ' ', $query));
    if (empty($query)) {
        return array();
    }
    $res0 = array();
    $res1 = array();
    $ft = self::_name_to_ft($query);
    Connections::do_with_db_instance($connection_name, NULL,
      function($db) use ($query, $ft, $limit, $layer_name,
                         &$res0, &$res1) {
        assert(strchr($ft, '|') === FALSE);
        $layer = $db->layer($layer_name);
        $prefix0 = $ft . '|' . $query . '*';
        $res0 = $layer->searchPrefix($prefix0, $limit, TRUE, TRUE, FALSE);
        $prefix1 = $ft . '*';
        $res1 = $layer->searchPrefix($prefix1, $limit, TRUE, TRUE, FALSE);
    });
    if (empty($res0->matches)) {
        $fres0 = array();
    } else {
        $fres0 = $res0->matches;
    }
    if (empty($res1->matches)) {
        $fres1 = array();
    } else {
        $fres1 = $res1->matches;
    }
    return array_merge($fres0, $fres1);
}

User names are then sorted by their Levenshtein distance.

This ain't no rocket science. Neither does it get any close to a real search engine like Groonga or Lucene. And any data store with ordered indexes could have been used. But for autocompletion or for displaying a basic list of matches, it's simple, blazing fast and it has low memory requirements. No need to build a trie, prefix-matching does the trick. Best of all, since Pincaster speaks HTTP/JSON, thus queries can be made straight from Javascript with no in-between applicative layer, and results can easily get cached by reverse-proxies and web browsers.