Frank DENIS random thoughts.

A complete search engine in 200 lines

Sau Sheong Chang, a Yahoo! developer, wrote an internet search engine in 200 lines.

Of course it isn’t designed to index billions of documents, but it’s amazingly featureful.

Just like any decent search engine, starting from seeds, it crawls and tokenizes web pages, handles robots.txt, takes care of page freshness for intelligent re-crawling, and filters data types. It has natural language processing (using the Porter Stemmer algorithm as an alternative to lemmatisation) and 3 ranking algorithms (implemented in 14 lines, grand total). And of course the 200 lines include the user interface (query box + display of ranked results, with page titles and scores). Yes, the end result is a fully working search engine, not just pieces of sample source code.
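To give a feel for how little code the core of such an engine needs, here is a rough sketch of the tokenize, index and rank steps. This is not Sau Sheong Chang's code: the language (Python), the function names and the scoring rule (plain word frequency, with every query word required) are my own assumptions for illustration, and stemming, crawling and robots.txt handling are left out.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> Counter of {url: occurrences}.

    `pages` is a hypothetical {url: text} dict standing in for crawled pages.
    """
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query):
    """Return (url, score) pairs, best first.

    A page must contain every query word; its score is the total number
    of times the query words occur in it (an assumed, very naive ranking).
    """
    words = tokenize(query)
    if not words:
        return []
    # .get() avoids creating empty entries in the defaultdict on lookups.
    postings = [index.get(word, Counter()) for word in words]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    scores = {url: sum(p[url] for p in postings) for url in candidates}
    # Highest score first; ties are broken by URL to keep results stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `build_index` on a small dict of page texts and then `search(index, "some words")` returns the matching pages with their scores. A real engine would stem the tokens before indexing and combine several ranking signals instead of a single frequency count.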

An interesting read, and a great introduction to how search engines work.