Being too platform-independent

Yes, the title says platform-independent. Let me explain: A couple of weeks ago I fixed a bug that was surprisingly difficult to track down, because it was in a part of the code that I considered to be completely finished and safe and tucked away for the past 5 years.

The buggy code was in the part of the Rubble parser that detects line endings. First a little background: Rubble syntax is stream-oriented (aka "free format"), in contrast to line-oriented syntaxes like Machine-Code Assembler, FORTRAN, BASIC, and Python. There is a single exception: Rubble has something called line end comments, which means that anything between an unquoted "#" character and the next line ending is syntactically equivalent to whitespace. This has proven to be a convenient feature for source code comments, and such line end comments have a long tradition in otherwise stream-oriented languages, from LISP to Prolog to C++ to Perl, Java, and C#.

My earliest Rubble parsers were just proof-of-concept code that assumed a Unix environment, so when a "#" character was encountered, everything up to the next newline character (Java "\n") was skipped, since this is the end-of-line convention in Unix. Actually the character introducing a line end comment was "%" instead of "#" in the early days of Rubble, for attempted compatibility with Edinburgh Prolog syntax, but that is unimportant now.

What happened next was that I decided sometime in 2004 to make this particular part of the code platform-independent. So instead of looking for the next "\n" character, the code looked for the next occurrence of the string returned by System.getProperty("line.separator"). This was the proper way to be platform-independent about line endings, or at least that's what I thought at the time.

Now fast-forward to November 2009. A customer installed Rubble on a server running Windows. You might ask: why run Windows on a server, or on any machine at all? Ever? Well, since they are a customer I try to avoid posing too many questions like that. They have their reasons. Anyway, the failure mode was that some Rubble code components never executed at all; they just failed silently. And mysteriously, several Rubble demos on the same server worked perfectly.

If you have read this far you have probably figured it out by now. Yes, the Rubble code that failed had a line end comment at the very beginning of the file, and that file used the Unix end-of-line convention. On the Windows server the "line.separator" property is "\r\n" instead of just "\n", so no line ending was found and the entire file was treated as a comment.
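The failure can be reproduced with a minimal sketch. This is a hypothetical reconstruction, not the actual Rubble parser code: the class name, the method name, and the sample file contents are made up, and the separator is passed in as a parameter (standing in for the value of System.getProperty("line.separator")) so that both platforms can be simulated on one machine.

```java
// Hypothetical reconstruction; names and sample input are illustrative only.
final class SeparatorCommentSkipper {

    /**
     * Returns the index just past the line end comment that starts at hashPos,
     * searching only for the given platform separator string.
     */
    static int skipComment(String source, int hashPos, String separator) {
        int end = source.indexOf(separator, hashPos);
        // No separator found: the comment is taken to run to the end of the input.
        return end < 0 ? source.length() : end + separator.length();
    }

    public static void main(String[] args) {
        // A file written with Unix line endings, starting with a comment.
        String unixFile = "# header comment\nrun(demo).\n";

        // Unix separator: the comment ends right after the first "\n".
        int unixEnd = skipComment(unixFile, 0, "\n");
        System.out.println(unixFile.substring(unixEnd).trim()); // run(demo).

        // Windows separator on the same file: "\r\n" never occurs,
        // so the whole file is swallowed as one long comment.
        int windowsEnd = skipComment(unixFile, 0, "\r\n");
        System.out.println(windowsEnd == unixFile.length());    // true
    }
}
```

A file that does not begin with a comment still loses everything after its first "#", which would explain why some components appeared to work while others did nothing.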

So using the platform-independent way of looking for the "line.separator" property is obviously the wrong thing to do in a server environment, where data comes from multiple sources that don't necessarily comply with the server's local platform conventions.

The solution, it turns out, has been present in Java SE for several years now: just use the traditional ^ and $ patterns in regular expressions (in MULTILINE mode, where they match at every line boundary and not only at the start and end of the input). According to the docs for java.util.regex.Pattern, the following strings are recognized as line terminators:
  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').
This handles the entire spectrum of Unicode line terminators except for '\u000B' (vertical tab) and '\u000C' (aka ASCII form feed). I don't know why Java doesn't see form feed as a line terminator, but I am willing to compromise here and stick with the Java pattern because it's clearly good enough for all practical purposes.
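As a rough illustration of the regular expression approach, here is a minimal sketch that finds the end of a line end comment with the "$" pattern. The class and method names are hypothetical and this is not the actual Rubble fix; it only assumes that the parser already knows the position of the unquoted "#".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual Rubble parser code.
final class RegexCommentSkipper {

    // In MULTILINE mode, "$" matches just before any line terminator
    // recognized by java.util.regex.Pattern, and also at the end of input.
    private static final Pattern END_OF_LINE =
            Pattern.compile("$", Pattern.MULTILINE);

    /**
     * Returns the index where the comment starting at hashPos ends,
     * i.e. the position of the next line terminator or the end of input.
     * The terminator itself is left in place and counts as whitespace.
     */
    static int commentEnd(String source, int hashPos) {
        Matcher m = END_OF_LINE.matcher(source);
        return m.find(hashPos) ? m.start() : source.length();
    }

    public static void main(String[] args) {
        String[] files = {
            "# comment\ncode",      // Unix
            "# comment\r\ncode",    // Windows
            "# comment\rcode",      // classic Mac OS
            "# comment\u2028code"   // Unicode line separator
        };
        for (String file : files) {
            // Prints 9 for every file: the comment ends where the terminator begins.
            System.out.println(commentEnd(file, 0));
        }
    }
}
```

The result no longer depends on the platform the parser runs on, only on the contents of the file being parsed.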

Things could have taken an entirely different turn back in 2004 if I had been brave enough to use regular expressions in the parser. Then everything would have kept working magically, without intervention from me. But in those early days I was still using Java 1.3, which didn't have a regexp library (java.util.regex only arrived in Java 1.4). At one point I even had a Rubble version that ran on J2ME MIDP 1.0, which didn't have List and Map abstractions, or even floating-point math operations. So "getting it right from the start" was theoretically possible in this case, but only in retrospect.