My XML Whitespace Ignorance


Hey you! Come here… lean a little closer. I’ll let you in on a little secret. I’ve been fighting a stoopid XML conversion bug for the past 78 days, 18 hours, 23 minutes, and 11 seconds… approximately. It was a little inconsistency between what I thought was ignorable whitespace and what my Java-based XML text streamer thought was significant. The problem is rooted in my eagerness to make my code as platform agnostic as possible, and it carries over into my understanding (or lack thereof) of XML conversion.

Hi, I’m Cliff, and unless you write computer programs for a living or you’ve been here before, you won’t have a clue what I’m venting about. Hold tight and don’t bounce yet, brotha’, because if you ever trip over a misplaced angle bracket on your way out of the grocery store or find yourself surrounded by entity-reference-discussing geeks at your high school reunion, you’ll be armed with the appropriate information to fight back.

Here’s the deal. I wrote a phat XML converter (that’s phat pronounced as fat, meaning dope, hype, cool, nifty, or really interesting. Not to be confused with FAT converters prevalent on NTFS enabled operating systems.) that reads our company’s proprietary DBMS schema definition language and produces JAXP SAX events[1]. The SAX events are where all of the fun began. You see, when dealing with our proprietary DBMS schema lingo, whitespace becomes very important. So important that it rises to the front of your consciousness, overshadowing other important issues such as the need to consume food as the stomach empties, the requirement to vacate the building after hours and retrieve your offspring at misc. locations, and the necessity to respond promptly to important interoffice emails. Whitespace overrides other low level important decisions such as formatting and refactoring. “Is whitespace being catered to?” was the only thought I had as I worked on the converter. So I stuck little calls to a utility method called handleNewLine all through my code because I wanted to give whitespace the full respect it deserves. I also referred to a public final static constant called SEP as the value to return.

That’s where it got really cool. SEP was initialized to the value returned by System.getProperty("line.separator"), which means that when it ran on *Nix, SEP would be equal to a new line character ('\n' to be exact) while Windows VMs would happily plug in a carriage return before the new line character ("\r\n"). Life would be good and I wouldn’t have to touch a thing because I catered to whitespace accordingly, had I not? I never got a chance to run my toy on a Windows machine to find out if any of that logic was worth the effort. That is, not until recently. My coworker picked up my toy and started playing with it after I made such a big deal in the office. (I was running around for months slapping my colleagues upside their heads, telling everyone how elite I was because my toy could totally get rid of the file format that plagued us for years. I then started drawing pictures of what the world would look like after my toy had been put to use.) He runs Windows on his developer workstation because he isn’t as patient as I am with the command line and all the extra stuff you have to do to get a Linux desktop to be as functional as a Windows desktop. When he went to build my toy from sources (it’s in an M2 project) the unit tests started complaining about XML diffs. It seems there were a bunch of entity references, “&#13;”, showing up in my converted output. How could that be? I was extra careful about whitespace. I ran multiple versions of the XML unit tests with IgnoreWhitespace set to true, then false, then XmlUnit.SOMETIMES, then XmlUnit.I_DONT_CARE_JUST_MAKE_IT_WORK. I tried to use the ignorableWhitespace event in my SAX generator, then I tried not to use it. I tried sending the text as ASCII, then as UTF-8, then EBCDIC. Nothing worked.
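For the curious, here is a minimal sketch of what that constant boils down to. SEP and the "line.separator" property come straight from my converter; the class name SeparatorDemo and the codes helper are made up for illustration only.

```java
class SeparatorDemo {
    // Same idea as the constant in the converter: whatever the platform says a line break is.
    public static final String SEP = System.getProperty("line.separator");

    // Hypothetical helper: renders each character of s as its numeric code, space separated.
    static String codes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append((int) s.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // This line varies with the machine you run it on.
        System.out.println("Platform separator: " + codes(SEP));
        // These two are fixed: LF is 10, CR is 13.
        System.out.println("Unix style:    " + codes("\n"));
        System.out.println("Windows style: " + codes("\r\n"));
    }
}
```

The first line of output depends on the operating system, which is exactly why the same converter could behave differently on my box and on a Windows workstation.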

After banging my head repeatedly it dawned on me. I was seeing entity references for the carriage returns but not the line feeds. The line feeds were inserted literally while the returns were being escaped. Then I asked myself, why am I being so anal about whitespace platform specifics in an XML document running on a JVM? I really didn’t have a decent answer other than I had been self-conditioned to worship the whitespace throughout my career. Then a vision of my mother popped in my head, “Just because your friends race their cars at 90 mph around tight curves doesn’t mean you should do the same!” All the time I had been surrounded by applications that hailed whitespace and went through extensive means to treat it like royalty. Here I sit atop a multiplatform programming language, working in a markup grammar that is agnostic to language and coded character set, and the only thing I can worry about is whitespace! What was I doing???!!! I then pulled out every reference to platform specific line separators and re-ran my unit tests on WinXP under VMWare. Green bars stretched across my screen like the laser beam from the Green Lantern. Life was good thereafter, or so I thought… Tune in later to find me battling with the DB date data type from Hell. Later people…
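Why would the returns get escaped while the line feeds sail through? An XML parser normalizes a literal "\r\n" (or a lone "\r") to a plain "\n" when it reads a document, so the only way a serializer can keep a carriage return alive is to write it as a character reference. The sketch below imitates that with a hand-rolled escaper. The escapeText method and the CarriageReturnDemo class are hypothetical, not the actual serializer my converter used, and real serializers handle more cases than this.

```java
class CarriageReturnDemo {

    // Hypothetical text escaper: a rough imitation of how a serializer might treat text content.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '\r':
                    // A literal CR would be normalized away on the next parse,
                    // so it is written as a character reference instead.
                    out.append("&#13;");
                    break;
                default:
                    // Line feeds (and everything else) pass through untouched.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // What the converter emitted on *Nix: nothing to escape.
        System.out.println(escapeText("first" + "\n" + "second"));
        // What it emitted on Windows: the CR turns into an entity reference.
        System.out.println(escapeText("first" + "\r\n" + "second"));
    }
}
```

With the Windows-style separator the first line of text ends in an entity reference for the carriage return, which is the kind of difference an XML diff will happily complain about.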

1. JAXP SAX events are method calls to a standard Java interface intended to describe an XML document.
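To make that footnote a little more concrete, here is a rough sketch of a newline going out as a SAX event with a plain line feed instead of the platform separator. ContentHandler and DefaultHandler are the standard org.xml.sax types and handleNewLine is the name of my utility method, but the body shown here, the NEWLINE constant, and the collect helper are assumptions for illustration, not the converter's real code.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NewLineDemo {
    // Assumed replacement for SEP: always a bare line feed, whatever the platform.
    static final String NEWLINE = "\n";

    // Sketch of handleNewLine: the newline is reported as a characters() event.
    static void handleNewLine(ContentHandler handler) throws SAXException {
        handler.characters(NEWLINE.toCharArray(), 0, NEWLINE.length());
    }

    // Hypothetical helper: records the text a handler receives so it can be inspected.
    static String collect() throws SAXException {
        StringBuilder seen = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        handleNewLine(handler);
        return seen.toString();
    }

    public static void main(String[] args) throws SAXException {
        String text = collect();
        // The handler sees a single line feed and no carriage return on any platform.
        System.out.println("Only a line feed: " + text.equals("\n"));
        System.out.println("Contains a carriage return: " + text.contains("\r"));
    }
}
```

If the final file really does need platform line endings, that conversion can happen when the bytes are written out, well away from the SAX events.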
