Hi all.
This is to announce a pre-alpha release of my Hypertext Parsing Suite.
It can "clean" HTML (or another markup language, depending on the user-defined rule table) so that the output conforms to the rules in your table, build a DOM tree for use in other applications, and a lot more.
It is in the same class of tools as Tidy (tidy.sourceforge.net), but has very different goals. For one, it does NOT try to produce standards-compliant HTML; it cleans and parses the input into something useful for applications such as information retrieval and extraction. It reads its rules from a user-supplied table, and the default table resembles HTML very closely.
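To give a feel for what "cleaning against a rule table" means, here is a minimal, self-contained sketch. It does NOT use the suite's actual API or table format: RuleTable, tagName and cleanMarkup are hypothetical names made up for this illustration, and the "table" is reduced to a plain set of allowed tag names. Tags not in the table are dropped and the text between tags is kept.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical rule table (not the suite's real format):
// the set of lowercase tag names allowed in the output.
using RuleTable = std::unordered_set<std::string>;

// Extract the lowercase tag name from the text between '<' and '>'.
static std::string tagName(const std::string& tag) {
    std::size_t i = 0;
    if (i < tag.size() && tag[i] == '/') ++i;  // skip the slash of a closing tag
    std::string name;
    while (i < tag.size() && std::isalnum(static_cast<unsigned char>(tag[i]))) {
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(tag[i])));
        ++i;
    }
    return name;
}

// Copy the input, dropping every tag whose name is not in the rule table.
// Text between tags is always kept.
std::string cleanMarkup(const std::string& input, const RuleTable& rules) {
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (input[pos] != '<') {
            out += input[pos++];
            continue;
        }
        std::size_t end = input.find('>', pos);
        if (end == std::string::npos) {
            // Unterminated tag: keep the remainder as plain text.
            out.append(input, pos, std::string::npos);
            break;
        }
        std::string inner = input.substr(pos + 1, end - pos - 1);
        if (rules.count(tagName(inner)) > 0) {
            out.append(input, pos, end - pos + 1);  // allowed tag: copy verbatim
        }
        pos = end + 1;
    }
    return out;
}
```

With a table containing only "p", the input "<p>Hi <blink>there</blink></p>" comes out as "<p>Hi there</p>". A real rule table would presumably carry more per-tag information (nesting, attributes and so on); this sketch only shows the table-driven idea.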
The suite includes a sample.cpp file that shows how to use the library, which you can also link into your own applications. Typical uses include extracting particular parts of your markup files, such as links or tables.
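As an illustration of the link-extraction use case, here is a small standalone sketch. It is not taken from sample.cpp and does not call the library; extractLinks is a hypothetical helper that scans the raw text for <a> tags and collects their href values. It assumes no '>' appears inside an attribute value and that the first "href=" in the tag is the real attribute, which a proper parser would not need to assume.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: return the href values of all <a> tags, in document order.
std::vector<std::string> extractLinks(const std::string& html) {
    std::vector<std::string> links;
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        std::size_t end = html.find('>', pos);
        if (end == std::string::npos) break;  // unterminated tag: stop scanning
        std::string tag = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        // Lowercase copy, used only for case-insensitive matching.
        std::string lower = tag;
        for (char& c : lower) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }

        // Only opening <a ...> tags: 'a' followed by whitespace.
        if (lower.size() < 2 || lower[0] != 'a' ||
            !std::isspace(static_cast<unsigned char>(lower[1]))) {
            continue;
        }

        std::size_t h = lower.find("href=");
        if (h == std::string::npos) continue;
        std::size_t v = h + 5;  // index of the first character of the value
        if (v >= tag.size()) continue;

        char quote = tag[v];
        if (quote == '"' || quote == '\'') {
            // Quoted value: take everything up to the matching quote.
            std::size_t close = tag.find(quote, v + 1);
            if (close == std::string::npos) continue;
            links.push_back(tag.substr(v + 1, close - v - 1));
        } else {
            // Unquoted value: take everything up to the next whitespace.
            std::size_t stop = v;
            while (stop < tag.size() &&
                   !std::isspace(static_cast<unsigned char>(tag[stop]))) {
                ++stop;
            }
            links.push_back(tag.substr(v, stop - v));
        }
    }
    return links;
}
```

Working on a cleaned DOM tree instead of raw text is what makes this kind of extraction robust on messy real-world input; the string scan above is only meant to show the shape of the task.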
It is written in C++ and does not *yet* use the auto* tools. Simple make commands build and install it (into your $HOME).
We have run this code on large amounts of data (> 1 GB) and found it to be stable and free of memory leaks. There are other related applications written around this code, which I'm planning to release soon.
Please download and test.
URL: http://www.it.iitb.ac.in/~jaju/hypar/
Online docs generated with doxygen can be found on the site.