code name / Broken HTML parsing in Java

Given: HTML code that is neither valid nor well-formed.
Goal: turn it into well-formed XHTML, in Java.

We considered JTidy, but its source looked too messy and hacky. I looked for a parser that could correct XML, not in the sense of schema validation, but one that would fix any mistakes and hand back well-formed XML plus a list of errors.
I could find only JTidy. There also used to be OpenXML, which has since been merged into Xalan/Xerces, though it wasn't clear whether the parser itself was merged in. After studying Xerces with the “continue-after-fatal-error” feature turned on, my conclusion was: no, the OpenXML parser wasn't merged. Xerces repairs HTML without understanding its semantics. For example, it turns <br> text <br> text into <br> text <br> text </br> </br> text. That is well-formed XML, but not HTML.
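To make the Xerces experiment concrete, here is a minimal sketch of how such a test could be set up. It assumes the Xerces-derived parser bundled with the JDK, which should recognize the feature URI http://apache.org/xml/features/continue-after-fatal-error. The class name, the sample input and the error cap are my own illustrative choices, not something from the original experiment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: feeds broken markup to a Xerces-based SAX parser
// that is told to keep going after fatal (well-formedness) errors.
final class LenientXercesDemo {
    static final String CONTINUE_AFTER_FATAL =
            "http://apache.org/xml/features/continue-after-fatal-error";
    // Safety cap (an assumption of this sketch): stop if the parser keeps reporting errors.
    static final int MAX_ERRORS = 20;

    // Records the SAX events as text and collects every reported error message.
    static final class Recorder extends DefaultHandler {
        final StringBuilder events = new StringBuilder();
        final List<String> errors = new ArrayList<>();

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.append('<').append(qName).append('>');
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.append("</").append(qName).append('>');
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.append(ch, start, length);
        }
        @Override public void error(SAXParseException e) {
            errors.add(e.getMessage());
        }
        @Override public void fatalError(SAXParseException e) throws SAXException {
            errors.add(e.getMessage());
            if (errors.size() >= MAX_ERRORS) {
                throw e; // give up instead of risking an endless error loop
            }
        }
    }

    static Recorder parse(String markup) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(CONTINUE_AFTER_FATAL, true);
        SAXParser parser = factory.newSAXParser();
        Recorder recorder = new Recorder();
        try {
            parser.parse(new InputSource(new StringReader(markup)), recorder);
        } catch (Exception e) {
            // The parser may still abort; whatever was recorded so far is kept.
        }
        return recorder;
    }

    public static void main(String[] args) throws Exception {
        Recorder r = parse("<p>one<br>two<br>three</p>");
        System.out.println("events: " + r.events);
        r.errors.forEach(msg -> System.out.println("error:  " + msg));
    }
}
```

The recorded events show how the parser pairs up tags purely by nesting, with no knowledge that <br> is an empty element in HTML.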

A kind soul pointed us to a solution: TagSoup.
IMHO, its sources look clean, the author looks competent, the API is simple, the product is actively maintained (the mailing list is quite active, or at least was in 2005), and the test set it mentions is impressive (reportedly, 8% of those broken test pages are still not parsed correctly, but even then the output is well-formed).
Command-line testing produced well-formed XML in every case I tried, except for nasty <script> tag tricks like <script ..> document.write("</script>") </script>. The latest (again, 2005) mailing list messages already address this problem, and the author was working on it.
However, it lacks detailed documentation (it didn't even have javadoc comments) and, as the presentation document says, unlike JTidy it does not convert markup to CSS. This may leave the output non-conformant to the XHTML DTD. Conformance to a specific XHTML DTD is not guaranteed in any case, but the output is still well-formed.
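As for the simple API: TagSoup's parser class, org.ccil.cowan.tagsoup.Parser, implements the standard SAX XMLReader interface, so it can be plugged into the usual JAXP machinery. Below is a rough sketch of turning SAX events back into XML text with an identity transform. The class and method names are made up for illustration. To keep the example runnable without the TagSoup jar, main() uses the JDK's strict parser as a stand-in; swapping in TagSoup is the one-line change noted in the comment, assuming the jar is on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: serializes whatever a SAX reader emits as XML text.
final class SaxToXml {

    // Runs the reader over the markup and writes its events out through an identity transform.
    static String serialize(XMLReader reader, String markup) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(
                new SAXSource(reader, new InputSource(new StringReader(markup))),
                new StreamResult(out));
        return out.toString();
    }

    // Stand-in reader: the JDK's strict XML parser, which only accepts well-formed input.
    static XMLReader strictReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // With TagSoup on the classpath (an assumption of this sketch), use instead:
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        // and pass in broken HTML rather than the well-formed sample below.
        XMLReader reader = strictReader();
        System.out.println(serialize(reader, "<p>text<br/>more</p>"));
    }
}
```

Because the serializer only ever sees SAX events, its output is well-formed by construction. Whether it also makes sense as HTML depends entirely on the reader that produced the events.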
Overall, I would use TagSoup if I had to validate. It’s nice to have an alternative to JTidy.

Victor Sergienko
