Frank DENIS random thoughts.

Why are lightweight markup languages so heavy?

Forums, blogs and other community-based web sites need ways to let their users add some fun to the text. Smileys, bold, italic, paragraphs, alignments, colors, links and images are the basic stuff any user wants to have.

As the final rendering is usually HTML, using HTML in order to post new messages would obviously be the most efficient way.

But HTML is considered insecure, complicated and not error-proof.

This is why alternative languages were invented. BBCode, Textile and Markdown are the most common ones.

So, a user enters text written in one of those alternative markup languages, and a complex parser transforms it into HTML. Since the user might edit the text later, the original (non-HTML) text must be stored.

And since rendering Markdown (for instance) to HTML is a complex and CPU-intensive operation, applications usually save both the HTML version and the Markdown version. This is a big waste of database resources, but this is required in order to save a lot of CPU cycles.

In fact, this is required only because current implementations of HTML renderers for BBCode, Textile and Markdown are slow as hell.

All the implementations I could find were in Ruby, PHP, Perl and Python. Please give me pointers if you know of any C or C++ implementation. While I love these languages, I really think that rendering markup languages should be as fast as possible, i.e. written in optimized C, C++ or even assembler. This is a critical part of any community web site and a main performance bottleneck.

Trying to avoid the side effects of a slow implementation with caches and duplicated data is nothing but brain damage, and it only shifts the performance bottleneck to the database. Maintaining these caches and duplicated data also adds another layer of complexity.

Why not just forget that and work on optimized rendering engines? No need to duplicate data, just keep one version and render the HTML (or the original) version as needed.

Here are three proposals:

  • Write a fast rendering engine. Sure, playing with strings in C is not fun, but one week of work to get a fast engine is probably worth a try, considering the money it will save by making real-time rendering a reality.
  • or: write the rendering engine in JavaScript. JavaScript implementations are fast enough for this, and you can always deliver an alternative (SEO-optimized) version for search engines.
  • or: combine both. Have the C engine create the basic HTML structure, and have the JavaScript engine handle stuff like links and embedded videos.
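
To give an idea of what the first proposal could look like, here is a minimal sketch of a single-pass renderer in C. It is not one of the engines mentioned in this post: the function name render_bbcode, the tag table and the tiny BBCode subset ([b] and [i] only) are assumptions made for illustration. It escapes HTML special characters as it goes, and it does not check that tags are balanced, which a real engine would have to do.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tag table: a tiny BBCode subset chosen for illustration. */
struct tag_map {
    const char *bbcode;
    const char *html;
};

static const struct tag_map TAGS[] = {
    { "[b]",  "<strong>"  },
    { "[/b]", "</strong>" },
    { "[i]",  "<em>"      },
    { "[/i]", "</em>"     },
};

/* Append s to out at *pos; return 0 on success, -1 if it would not fit.
 * Requires *pos < cap on entry and preserves that on success. */
static int append(char *out, size_t cap, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    if (n >= cap - *pos)    /* keep one byte for the terminating NUL */
        return -1;
    memcpy(out + *pos, s, n);
    *pos += n;
    return 0;
}

/* Render src into out (capacity: cap bytes) in a single pass.
 * Returns the number of bytes written, not counting the NUL,
 * or (size_t)-1 if out is too small. */
size_t render_bbcode(const char *src, char *out, size_t cap)
{
    size_t pos = 0;

    if (cap == 0)
        return (size_t)-1;

    while (*src != '\0') {
        const char *piece = NULL;
        char literal[2] = { *src, '\0' };
        size_t consumed = 1;

        /* Known BBCode tag at this position? Emit its HTML counterpart. */
        if (*src == '[') {
            for (size_t i = 0; i < sizeof TAGS / sizeof TAGS[0]; i++) {
                size_t len = strlen(TAGS[i].bbcode);
                if (strncmp(src, TAGS[i].bbcode, len) == 0) {
                    piece = TAGS[i].html;
                    consumed = len;
                    break;
                }
            }
        }

        /* Otherwise copy one character, escaping HTML specials. */
        if (piece == NULL) {
            switch (*src) {
            case '&': piece = "&amp;";  break;
            case '<': piece = "&lt;";   break;
            case '>': piece = "&gt;";   break;
            case '"': piece = "&quot;"; break;
            default:  piece = literal;  break;
            }
        }

        if (append(out, cap, &pos, piece) != 0)
            return (size_t)-1;
        src += consumed;
    }

    out[pos] = '\0';
    return pos;
}
```

Calling render_bbcode("[b]Hi[/b] <you>", buf, sizeof buf) fills buf with "<strong>Hi</strong> &lt;you&gt;". Unknown tags such as [u] are copied through untouched. There is no allocation and no backtracking, just one walk over the input, which is the kind of loop that makes rendering on every request plausible.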

I’m currently working on all three approaches and so far, the results are very exciting.