Frank DENIS random thoughts.

Excel as an exchange format is a nightmare for developpers

In my daily job, everytime someone wants to send a list of things to a developper, he always send an Excel file.

A list of names? An Excel file. A list of questions and answers? An Excel file. A list of translations for gettext? An Excel file. Mathematical expressions? An Excel file, embedding Word objects with its mathematical extension.

Of course, users are efficient with tools they work all the time with. And with Excel, they can easily create a clean layout for their data. When Excel is used in order to create documents to be read by other people, the tool is wonderful. To be fair I love Excel, this is the Microsoft thing I love, the thing that made me install Windows 3.11 a while ago. It’s not flawless, but it works pretty well, it’s easy to use, it’s powerful and anyone can create standard, but clean and nice-looking document with it.

But Excel is not a tool designed to format data for developpers. Developpers need normalized and easy to parse data, not spreadsheet documents.

Sometimes, exporting the XLS file to CSV is enough. It’s enough for very simple lists, like email lists, though you still have to check for typos.

Sometimes it’s way more work that people could imagine.

A while back, I had to make a simple online quiz system. A WYSIWYG web-based administration area was made. But most people preferred to use Excel so that they could cut/paste parts of Word, Visio, Powerpoint and Excel documents. Everyone used a different layout. When I received the first Excel documents, I asked people to at least use a simple layout: first column with the question, next columns with the answers, last column with pictures, etc. Some did their best to follow that convention, but there was still no consistency. For instance a lot of people filled end of lines with spaces in order to open a new line instead of just pressing “Return”. Diagrams were a mix of background pictures with text over them (still using spaces to get the characters at the right positions, reading the document without the right fonts gave unreadable results). There were typos and inconsistencies everywhere. People who worked on those documents did their best, and if the printed documents were perfectly useable. However, importing that into a MySQL database was a nightmare. Exporting these documents to CSV would have been pointless because of the pictures. It’s why I tried to export them as MS-HTML documents. Parsing these documents was very difficult, especially since every user used a different layout for his document. The MS-HTML documents themselves were totally bogus, the text that was over background pictures was misplaced and the space-based paragraphs didn’t work in a web browser. I gave up. There was no way to write a script that would reliably understand these documents. I printed the Excel documents, and I manually typed everything from scratch in the web-based administration area. It took me 3 weeks, including nights and week-ends. And people didn’t understand why it took so long. They thought that it was because the server was running Linux, and that shitty operating system was unable to read Excel files. A few months later, the project leader asked me for an export of everything from that quiz system, as Excel files. That was easy. But a few weeks later, he sent me back the Excel files “with some minor corrections and some new questions. could you please merge them ASAP?”. Ouch! An Excel sheet is designed to be human-readable, it’s not a raw database dump. I had to write a “diff”-like application to compare Excel sheets, in order to find similar rows and changes. Then the database (lots of tables with foreign keys, nothing to do with the single-table layout of an Excel sheet) had to be updated, although there was no useable identifier in the sheets. Yet another nightmare. This is probably the most complex application I have ever written. Just because the changes have been made through Excel sheets instead of using the web interface.

Excel is designed to create printable documents. It might be not obvious for non-developpers, but please, please, please understand that Excel is not a tool to edit application data. In databases, data is not stored as a 1x1 table with columns containing Word-like text. Computers don’t “read” documents, computers don’t understand a layout designed to be printed.

For simple lists, developpers like text. Raw text, made with text editors like Wordpad, Nodepad++, BBEdit or Context. Instead of rows, just use lines. It might be a bit less handy than Excel, but it might save a lot of time to developpers who have to use the data.

A few days ago, I was asked to merge translations of country names in a web form. An english-speaking user would see “Germany”, while a french-speaking user would see “Allemagne” in the list. Easy, very easy, especially since every country-name was already handled by a gettext interface. The relevant entries in the .po files just had to be translated. A colleague made a web-based interface in order to edit the .po files, although it that interface is not very handy for mass translations like a country list.

It’s why I was sent a .rar file with Excel sheets. In one column, there was a cut/paste of the original list as seen in internet explorer (probably, since there were HTML attributes and unrelated entries), in another column there were translations, and sometimes a comment in another column. Something very simple to parse. But it actually took me hours to do it.

Why?

Out of 4 Excel files, 3 were saved as Office 2007 XML documents.

Gnumeric was unable to open those files. Google Spreadsheets was unable to open those files. Openoffice (even version 2.1) was unable to open those files. Back home, I tried with Excel on my Mac. Yes, I bought the Microsoft Office suite for OSX. It was able to open the files. Wow. But the CSV exports were… very odd. End of lines were single \r, not \n. A small Perl script changed that. But there was still something wrong with the charset. Non-ASCII characters weren’t properly rendered. It was not Latin-15. It was not UTF-8, iconv refused it. The file command wasn’t able to discover what it was. Finally, I used Excel to save these documents as Excel-97 files, then I downloaded and installed OpenOffice, the Excel-97 files were loaded by OpenOffice and they could be exported as CSV-files that could be easily parsed and exported as .po files by yet another small Perl script. A big waste of time just because the original files were Excel documents. The content of these documents was text. No text formatting, no meaningful cells decorations, nothing but standard text, that was even written in a way that could be easily parsed. But it took hours just because it hasn’t been saved as text.

So please, if you send data to developpers, don’t use Excel. Either send raw text, or use the web-based interface (or if there’s none, ask them to write one, with a framework like Ruby On Rails it’s damn fast to do, and probably faster than it would be to postprocess Excel files). Thank you.