Category Archives: Main

On-topic posts

Coding convention Geek-code

20-Aug-07

While discussing Java vs. C# coding conventions, I got an idea:
a Geek Code for coding conventions.
Like:
-----BEGIN GEEK CODE-CODE BLOCK-----
GP:java,c,cpp,haskell
Off:2S P:N Name:Camel Flex:2
------END GEEK CODE-CODE BLOCK------

which means:

  • GP:java,c,cpp,haskell - geek of programming languages (listed);
  • Off:2S - prefer 2-space offsets;
  • P:N - place parentheses on a new line (as opposed to S - same line);
  • Name:Camel - prefer camelCase;
  • Flex:2 - flexibility on a scale of 0, 1, 2: I’m flexible on this and can easily accept others’ conventions.

This will help others see what you prefer and, hopefully, avoid starting a stupid holy war.
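
Such a block is also easy to read by machine. Here is a minimal sketch of a parser, assuming the format is exactly what the example shows: delimiter lines starting with dashes, and whitespace-separated key:value tokens in between. The class name GeekCodeBlock and its parse method are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reads a "geek code" block into key -> value pairs.
final class GeekCodeBlock {
    static Map<String, String> parse(String block) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : block.split("\\R")) {
            line = line.trim();
            // Skip blank lines and the BEGIN/END delimiter lines.
            if (line.isEmpty() || line.startsWith("-----")) {
                continue;
            }
            // Each token looks like Key:Value; tokens without a colon are ignored.
            for (String token : line.split("\\s+")) {
                int colon = token.indexOf(':');
                if (colon > 0) {
                    fields.put(token.substring(0, colon), token.substring(colon + 1));
                }
            }
        }
        return fields;
    }
}
```

With the block above as input, parse returns the five keys GP, Off, P, Name and Flex, each mapped to its raw string value; interpreting values such as 2S is left to the caller.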


Broken HTML parsing in Java

16-Aug-07

Given: HTML code, neither valid nor well-formed.
Task: turn it into well-formed XHTML, in Java.

We considered JTidy, but its source looked too messy and hacky. I looked for a parser that could correct XML, not in the sense of schema validation, but one that fixes any mistakes and emits well-formed XML plus an error list.
JTidy was the only one I could find; there also used to be OpenXML, which has since been merged into Xalan/Xerces, though it wasn’t clear whether the parser itself was merged in. After studying Xerces with the “continue-after-fatal-error” feature turned on, my conclusion was: no, the OpenXML parser wasn’t merged. Xerces repairs HTML without understanding its semantics; for example, it turns <br> text <br> text into <br> text <br> text </br> </br> text. That is well-formed XML, but not HTML.

A kind soul pointed us to a solution: TagSoup.
IMHO, its sources look clean, the author looks competent, the API is simple, the product is actively maintained (the mailing list is quite active, or at least was in 2005), and the test set mentioned is impressive (reportedly 8% of those broken test pages are still not parsed correctly, but even then the output is well-formed).
Command-line testing produced well-formed XML in every case I tried, except for nasty <script> tag tricks like <script ..> document.write("</script>") </script>; the latest mailing list messages (again from 2005) already address this problem, and the author was working on it.
On the downside, it lacks detailed documentation (it didn’t even have javadoc comments) and, as the presentation document says, unlike JTidy it does not convert markup to CSS. The output may therefore still fail to conform to the XHTML DTD; conformance to a specific XHTML DTD is not guaranteed, but the output is still well-formed.
In general, I would use TagSoup if I had to validate. It’s nice to have a JTidy alternative.
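
For reference, a rough sketch of how the conversion could be wired up. TagSoup’s parser class, org.ccil.cowan.tagsoup.Parser, implements the standard SAX XMLReader interface, so its events can be piped through an identity Transformer from the JDK to get serialized XML. The helper below takes any XMLReader; the class name SaxToXml is made up for illustration, and the code is a sketch under these assumptions rather than a recommended setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: serializes whatever SAX events the reader emits as XML text.
final class SaxToXml {
    static String serialize(XMLReader reader, String markup) throws TransformerException {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```

With the TagSoup jar on the classpath, the call would look like SaxToXml.serialize(new org.ccil.cowan.tagsoup.Parser(), brokenHtml). With a strict reader such as the JDK default, the same call simply fails on malformed input, which is exactly the gap TagSoup fills.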