XML, XML… wherefore art thou?

You’ve got this killer app you’re writing (killer because it’s hyped or killer because it’s killing you to get it right, whatever the reason) and it’s going to do some XML parsing. You’re new to the XML APIs and terminology. Maybe you’ve heard some of the acronyms… SAX, DOM, DTD, etc. You’ve been around the block a couple of times, but it’s not something you do regularly. You’re confident enough to be dangerous and dangerous enough to be confident. It’s the perfect combination, one that describes many Java developers and sends them scrambling frantically through Javadocs and forum posts to find out why they’re getting an EmptyStackException or an error stating that an entity cannot be resolved. The only thing missing is a connection and credentials for committing to a production server and you’re all set to deploy that killer bug… ahem, killer app! Today’s topic is an often overlooked, subtle feature in Java’s XML APIs that can ruin a marriage, cause a citywide blackout, pardon a convicted felon, and destroy hundreds of manatees, sliding them from the endangered list to the extinct list.

It begins with the above scenario. Bob Beerclaw, Java developer, needs his software to talk to a data crunching engine developed by Merigo And Son’s company. The data cruncher is poorly documented and only communicates in a complicated XML dialect. So Bob must send and receive XML over the wire, incorporating a myriad of Java/XML entanglements that can only be validated by setting up and launching Bob’s SAP web front end, then pointing and clicking around to manipulate various components in different Windows programs so that every piece can see and talk to every other piece. Sound familiar? (If it doesn’t then you’re lying to yourself. Ok, maybe it’s not an SAP front end but an instance of WebSphere or Orion Server, and maybe you do run on Linux, but you still point and click around to launch and configure different pieces connected to an Oracle backend and set the right configs so that CLOBs and BLOBs in the database are stored correctly. Maybe it’s not you tied up in all of this but I’m sure you know someone who is… the gal across from you?) Our fictitious scenario leads our hero Bob to grab the first thing that snaps into place within his cumbersome world. He grabs onto the parse(InputStream) method in the SAX API and holds on for dear life because a stream is the only way McWhirter and Son’s data cruncher can give him the XML. Bob churns out code quickly because he’s a pro and he’s been here before. Just give him some Javadocs and he can find his way to Miami. He lands a solution that lets him dig into McDowel and Assoc’s XML feed and pull out the handful of relevant data, and he feels happy. Next our hero needs to grab data off of his file system, but because he’s clever and encapsulates everywhere, there is only one entry point into his XML black box: the one he built to handle the InputStream coming from McGonnegal And Son’s data ripper. Put your bookmark here because we’re coming back to this situation.

So now Bob completes the other end of the solution with time to spare. He peruses the Java 5 Javadocs some more to figure out what else a savvy guy can do with XML but is interrupted by his superiors with another completely irrelevant project. Over time Bob’s solution grows, as does any app written for production. Eventually Bob’s file system XML grows complicated and needs to be broken up. Bob has been reading up on DTD syntax and decides to use external entity references to componentize the XML, so that one document pulls in snippets stored in separate files. Furthermore, he’s toying with a different set of XML parsing APIs and decides to use them to test-parse his newly fragmented design. All is well until he plugs the XML snippets into production. Entities are not resolving. What went wrong?

A number of evils are at play. For starters, Java’s XML APIs are not the most user friendly from the outset. (I mean they’re nice and all, but they require you to think differently than what you’re used to. The traditional serial, line-number-driven approach developers typically ride to an answer will steer you wrong in XML land. You also have to be extra anal about little details.) The bookmark we left earlier was at a crucial decision point. Had Bob taken a different approach to feeding the parser, like using the parse(File) method, then things would have worked differently. You see, he chose the parse(InputStream) method because he was originally dealing with streams, and since he’s not about copy/paste he wrote his objects in such a way that they could be reused, applying the same parse(InputStream) method to the file system. He just passed a FileInputStream into his component and then had time to go out for beer. That left Java’s DocumentBuilder with no clue as to where on the file system the XML originated, so when he later decided to use entity references for snippets of extracted XML, and did so with relative paths, the parser had no base location to resolve them against. Incidentally, there is an overloaded parse(InputStream, String) method whose second argument is a system ID, which gives the parser a hint about how to resolve external entity references, but you wouldn’t know that from the outset because, as I said earlier, it’s not the most friendly API. You only learn these things after killing three months and several million hair follicles on a head-scratching escapade chasing your tail wondering why a relative path which obviously exists on your hard drive cannot be found by a dumb parser because it’s just dumb and Bill Gates would’ve done a better job at writing Java because who makes this stuff so hard anyway?
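To make Bob’s predicament concrete, here is a minimal sketch of the two calls side by side. Everything specific in it is made up for illustration: the class name EntityBaseDemo, the file names order.xml and item-fragment.xml, and their contents are all hypothetical, and it uses java.nio.file (Java 7 and later) purely for brevity. The only parts taken from the story are DocumentBuilder’s parse(InputStream) and parse(InputStream, String) methods.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class EntityBaseDemo {

    // Writes a main document plus one fragment into a fresh temp directory.
    // File names and contents are hypothetical stand-ins for Bob's XML.
    static Path writeSampleFiles() throws IOException {
        Path dir = Files.createTempDirectory("bob-xml");
        Files.write(dir.resolve("item-fragment.xml"),
                "<item>widget</item>".getBytes(StandardCharsets.UTF_8));
        String main = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE order [\n"
                + "  <!ENTITY item SYSTEM \"item-fragment.xml\">\n" // relative path
                + "]>\n"
                + "<order>&item;</order>\n";
        Path mainFile = dir.resolve("order.xml");
        Files.write(mainFile, main.getBytes(StandardCharsets.UTF_8));
        return mainFile;
    }

    static DocumentBuilder newBuilder() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Bob's original approach: the parser only sees bytes, not where they came from.
    static Document parseStreamOnly(Path xml)
            throws IOException, SAXException, ParserConfigurationException {
        try (InputStream in = Files.newInputStream(xml)) {
            return newBuilder().parse(in);
        }
    }

    // Same stream, plus a system ID the parser can use as the base for relative URIs.
    static Document parseStreamWithSystemId(Path xml)
            throws IOException, SAXException, ParserConfigurationException {
        try (InputStream in = Files.newInputStream(xml)) {
            return newBuilder().parse(in, xml.toUri().toString());
        }
    }

    public static void main(String[] args) throws Exception {
        Path xml = writeSampleFiles();
        try {
            parseStreamOnly(xml);
            System.out.println("stream only: parsed (the fragment happened to be reachable)");
        } catch (IOException | SAXException e) {
            System.out.println("stream only: failed with " + e);
        }
        Document doc = parseStreamWithSystemId(xml);
        System.out.println("with system ID: " + doc.getDocumentElement().getTextContent());
    }
}
```

Without a system ID, the parsers I have tried fall back to resolving item-fragment.xml against the process’s working directory, which is rarely where the fragment lives, though that fallback is an implementation detail and yours may differ. If you have an actual File in hand, parse(File) spares you the trouble because the parser can work out the base location from the file itself.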

It’s mysteries like this and the EmptyStackException that you see from time to time in XSL unit testing that separate the pros from the wannabes. It’s too bad too because XML is one of those things that works really well when used respectfully but definitely takes extra effort to master. I’m Cliff, and you’re wasting time when you should be logging your time from last week… thanx for reading up to this point! Holla back y’all…
