Frank DENIS random thoughts.

Translating HTML templates

Localisation of applications is a breeze nowadays. Windows has a nice built-in engine for this, MacOS resources are even better, and Unix has Qt Linguist and the gettext() tools.

The idea is always the same: either separate strings from code, or extract the strings into a separate file so that they can be translated without tweaking the code. As a long-time English-to-French translator for various projects, I can say that while the process is a bit rough, and while it sometimes requires language tricks to keep some messages as generic as the originals, this approach is good enough to get a decent translation of any application.

But things are different with web sites.

How do you translate static web sites? Well, give the web site to the translator, let them edit it with Dreamweaver or anything that can edit HTML without totally messing up the layout, and you’re done. Even if the HTML code is not the same as the initial one, as long as the layout is the same, it doesn’t matter that much.

How do you translate dynamic web sites? This is where the real mess begins.

Dynamic web sites are based upon templates: HTML tags (unfortunately… time to discover HAML, you won’t look back), odd tags (variable substitutions or language-specific calls) and some text lost in that mess of tags.

Having dynamic text is easy to do with any language and framework. The tools have been around for a long time. If you are using Rails, I highly recommend the Globalize plugin, by the way.

But you have to add tags (yet more tags, bringing yet more obfuscation to the template) to mark the parts that have to be translated, so that tools like gettext can then extract these strings.

Marking the parts that have to be translated means adding things like _(…) or {t}…{/t} or .t everywhere.

NEVER DO THAT MANUALLY

This is a robotic task; doing it manually would be plain stupid and a big waste of time. Define rules and write a simple script that will add these extra tags. A coworker recently tried to manually add these tags to Smarty templates. It was only 3 templates, but he probably spent a long time doing it anyway. The end result was:

  • incorrect
  • inconsistent
  • full of parse errors: our production web site was broken (no one could create a new account or edit an existing one) until the changes were reverted.
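
As an illustration of the "define rules and write a simple script" advice, here is a minimal sketch in Python. It assumes Smarty-style {t}…{/t} markers and one rule: any run of text and inline tags that contains at least one letter gets wrapped as a whole. The INLINE_TAGS set and the mark_translatable name are assumptions made for this example, not part of any existing tool, and the regex-based tag matching is deliberately naive.

```python
import re

# Assumption: tags treated as inline, so they stay inside the translatable run.
INLINE_TAGS = {"a", "strong", "em", "b", "i", "span"}

# Matches any HTML start or end tag and captures its name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def _wrap(run: str) -> str:
    """Wrap a run in {t}...{/t} if it holds real text, keeping outer whitespace."""
    if "{t}" in run:
        return run  # already marked: keep the script safe to run twice
    text_only = TAG_RE.sub("", run)
    if not re.search(r"[^\W\d_]", text_only):
        return run  # no letter at all: nothing to translate
    lead = run[: len(run) - len(run.lstrip())]
    trail = run[len(run.rstrip()):]
    return f"{lead}{{t}}{run.strip()}{{/t}}{trail}"


def mark_translatable(template: str) -> str:
    """Wrap each run of inline content that contains text in {t}...{/t}."""
    out = []
    run_start = 0
    for m in TAG_RE.finditer(template):
        if m.group(1).lower() in INLINE_TAGS:
            continue  # inline tag: it belongs to the current run
        out.append(_wrap(template[run_start:m.start()]))
        out.append(m.group(0))  # block-level tag is copied unchanged
        run_start = m.end()
    out.append(_wrap(template[run_start:]))
    return "".join(out)


if __name__ == "__main__":
    print(mark_translatable(
        '<p><a href="/x">Click here</a> to <strong>delete</strong> it.</p>'
    ))
```

A real script would also need rules for template variables, comments, scripts and attributes; the point is only that the rules live in one place and are applied the same way to every template.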

But what should be included in each string that will go to translators?

Only text? Well, only text is not an option, because HTML lets you break sentences with markup:

<a href="...">Click here</a> to <strong>permanently delete </strong> your account.

Breaking it into three parts would be necessary to get only raw text to translate, without any markup. But three separate parts, “click here to”, “permanently delete” and “your account”, are not something that can be translated. Some languages, for instance, have to put the verb at the end of the sentence.

So the only way is to have the full thing (text + markup) in a single part to be translated.

Symbols and punctuation also shouldn’t be forgotten in translations.

For instance, French requires a space before “:”, while English requires none. French uses “*” for side notes, while English uses a dagger (and in HTML templates, those are links).

Links can also change according to the language of the user. Different domains or extra parameters in the URI can be necessary to walk through the localized versions of the same web site. You already have a way to dynamically replace data according to the language, so why not use it to get the right links for each language as well? Getting the localized links can also be handled in the template, but what you will get is redundant work and nothing but more markup, more mess in the template. In a basic HTML template, count how much text (real text) you have and how many characters of markup you have. Unless you switched to HAML, too many, way too many. No need to add yet more crap.
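
To make this concrete, here is a small sketch in Python where the whole sentence, with its markup, its link and its punctuation, is a single translation unit. The CATALOG dictionary, the URLs and the translate function are hypothetical and only stand in for whatever gettext-like store a project really uses.

```python
# Hypothetical in-memory catalog; a real project would load this from
# gettext .po/.mo files or a similar store.
CATALOG = {
    "fr": {
        # The whole sentence is one unit: the translator can reorder words,
        # keep the emphasis, and point the link at the localized page.
        '<a href="/account/delete">Click here</a> to '
        '<strong>permanently delete</strong> your account.':
            '<a href="/fr/compte/supprimer">Cliquez ici</a> pour '
            '<strong>supprimer définitivement</strong> votre compte.',
        # Punctuation is part of the unit too: French wants a space before ":".
        "Warning:": "Attention :",
    },
}


def translate(msgid: str, lang: str) -> str:
    """Return the translation of msgid for lang, or msgid itself if missing."""
    return CATALOG.get(lang, {}).get(msgid, msgid)


if __name__ == "__main__":
    print(translate("Warning:", "fr"))
    print(translate("Warning:", "de"))  # no German catalog: falls back to English
```

The template itself then needs no per-language logic for links or punctuation: it only asks for the translation of the unit.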

Emphasis like <strong> should not be left out of the text to be translated, even when the whole sentence is emphasized. In Japanese, for instance, you won’t use the same words to describe something important. Leaving the <strong> tag is important if you want a correct translation, even of a whole sentence.

So, the only way to get a correctly translated HTML template seems to be to mark the following as translatable parts:

  • groups of inline content (including: raw text, <a> tags and <strong> tags)
  • attributes like ALT and TITLE (yes, texts in tags… HTML is a mess…)
  • the VALUE attribute of the INPUT tag, which is tricky and probably the only thing that should be processed manually. The VALUE can be just something displayed to the user, or something meaningful for the scripts that will process the form. But if you use a decent framework, it’s probably not an issue.
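
The rules in this list can be sketched as a small extractor built on Python's standard html.parser module. The set of inline tags, the UnitExtractor class and the extract_units function are assumptions made for this example. The VALUE attribute is deliberately left alone, since it needs a manual decision, and attributes on inline tags simply travel with the markup of their unit.

```python
from html.parser import HTMLParser

# Assumption: tags kept inside a translation unit, and attributes holding text.
INLINE_TAGS = {"a", "strong", "em", "b", "i", "span"}
TEXT_ATTRS = ("alt", "title")


class UnitExtractor(HTMLParser):
    """Collect translatable units: inline runs with text, plus ALT/TITLE values."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.units = []          # extracted translatable strings, in order
        self._run = []           # raw pieces of the current inline run
        self._has_text = False   # True once the run holds non-blank text

    def _flush(self):
        unit = "".join(self._run).strip()
        if unit and self._has_text:
            self.units.append(unit)
        self._run, self._has_text = [], False

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            # Keep the tag exactly as written, attributes included.
            self._run.append(self.get_starttag_text())
            return
        self._flush()  # a block-level tag ends the current run
        for name, value in attrs:
            if name in TEXT_ATTRS and value:
                self.units.append(value)

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._run.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._has_text = True
        self._run.append(data)

    def handle_entityref(self, name):
        self._run.append(f"&{name};")  # keep entities such as &nbsp; as written

    def handle_charref(self, name):
        self._run.append(f"&#{name};")


def extract_units(html: str) -> list:
    parser = UnitExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit a trailing run not closed by a block-level tag
    return parser.units


if __name__ == "__main__":
    page = ('<p><a href="/x">Click here</a> to <strong>delete</strong> it.</p>'
            '<img src="logo.png" alt="Logo">')
    for unit in extract_units(page):
        print(unit)
```

Each extracted unit is what would go to the translators as a single string, markup included.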