Wednesday, July 25, 2012

HTML Agility Pack

Problems with loading and parsing an HTML file as an XML document

Recently I had a situation where I had to parse entire live HTML pages. I implemented it by loading the HTML file as an XmlDocument and parsed it without any problem. It worked perfectly for months until one day I received the error "XML parsing error at line 46 position 260". I went through my code but couldn't find anything wrong with it. After digging around and beating around the bush for a while, I figured out that the HTML page contained a bare "&", and XmlDocument could not load the document because "&" has a whole different meaning in the XML world: it starts an entity reference. Unfortunately, no conforming XML parser will accept this, since a bare "&" is simply not well-formed XML. I tried various options like XmlResolver, XmlReaderSettings, validation options, etc. to solve the problem, but couldn't get any of them to do what I wanted.
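To make the failure concrete, here is a minimal sketch of the kind of input that trips XmlDocument. The markup and the class name are made up for illustration; they are not the page or the parser from the story above.

```csharp
using System;
using System.Xml;

class AmpersandDemo
{
    static void Main()
    {
        // Hypothetical fragment: fine for a browser, but the bare "&"
        // (in the URL and in the link text) makes it malformed XML.
        string html = "<html><body><a href=\"page?a=1&b=2\">Tom & Jerry</a></body></html>";

        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(html);
            Console.WriteLine("Loaded without errors.");
        }
        catch (XmlException ex)
        {
            // XmlException reports where the parser gave up.
            Console.WriteLine("XML parsing error at line {0} position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

Escaping every "&" as "&amp;" would make this fragment load, but that is not an option when the pages come from someone else's live site.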


Problem: DTDs make this particularly serious, because a DTD usually includes general entity declarations (the named references that start with "&") which the XML file relies on. So, if a parser chooses not to load the DTD and the XML makes use of a general entity reference, the parser will fail to do the job it was intended to do. The only solution I could see was to create a transparent caching entity resolver, which would put the downloaded DTD files into an archive on the library search path, so that the archive would be created dynamically and bundled automatically with any software distributed. I guess even in the Java world there isn't an impressive EntityResolver that could do the job (the Java gurus shouldn't get pissed off!)
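The difference between entities that XML predefines and entities that only a DTD declares can be sketched as follows. This is an illustration with made-up input, assuming no DTD is supplied to the parser; "&nbsp;" is used as a typical entity that the (X)HTML DTDs declare.

```csharp
using System;
using System.Xml;

class EntityDemo
{
    static void Main()
    {
        // "&amp;" is one of the five entities predefined by XML itself,
        // so this loads without any DTD.
        var ok = new XmlDocument();
        ok.LoadXml("<p>Tom &amp; Jerry</p>");
        Console.WriteLine(ok.DocumentElement.InnerText);

        // "&nbsp;" is declared only in a DTD. With no DTD loaded, the
        // reference is undeclared and the document is rejected.
        var bad = new XmlDocument();
        try
        {
            bad.LoadXml("<p>Tom&nbsp;Jerry</p>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```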

So, I decided to load the HTML just as HTML and parse it. Google introduced me to the HTML Agility Pack. This library exposes an object model closely modelled on the System.Xml namespace in .NET. Loading and parsing the HTML page was very similar to parsing with XmlDocument, and I altered my parser in no time.
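As a rough sketch of what loading and parsing with the HTML Agility Pack can look like, assuming the HtmlAgilityPack NuGet package is referenced. The markup, the XPath query, and the class name are illustrative assumptions, not the original parser.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class LinkLister
{
    static void Main()
    {
        // Illustrative markup with bare "&" characters that an XML parser would reject.
        string html = "<html><body><a href=\"/cartoons?a=1&b=2\">Tom & Jerry</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // doc.Load(path) would read the markup from a file instead

        // SelectNodes takes an XPath expression, much like XmlNode.SelectNodes,
        // but it can return null rather than an empty collection when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // The second argument is the default used when the attribute is missing.
            string href = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine("{0} -> {1}", link.InnerText, href);
        }
    }
}
```

The null check is the main habit to carry over from XmlDocument code, where an unmatched query yields an empty node list instead.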



The parsing looks very clean and does the job exactly the way I wanted.

The code snippet above shows how I loaded the HTML page and parsed it. As you can see, it is as simple as parsing an XML document. With the HTML Agility Pack you can also query the document using LINQ. As of now I am not a huge enthusiast of LINQ in .NET, so I stuck to the traditional parsing approach.
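For anyone who does prefer LINQ, a minimal sketch of the same kind of query might look like this, again assuming the HtmlAgilityPack NuGet package and using made-up markup and names.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class LinqLinkLister
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li><a href=\"/one\">One</a></li><li><a>No href</a></li></ul>");

        // Descendants("a") enumerates every <a> element below the document node;
        // the rest is ordinary LINQ to Objects.
        var hrefs = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrEmpty(href));

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```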


