Hi all.
This is to announce a pre-alpha release of my Hypertext Parsing Suite.
It can "clean" HTML ( or other markup languages, depending on the user-defined rule table ) to get an output which matches the rules in your table, create a DOM tree for use in various applications and a lot of other stuff.
It is in the same class as Tidy ( tidy.sourceforge.net ), but has very different goals. For one, it does NOT try to produce standards-compliant HTML; it cleans and parses to create something useful for applications ( information retrieval and extraction, for example ). It reads its rules from a user-supplied table, and the default table resembles HTML very closely.
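To give a rough feel for what a rule table means, here is a minimal sketch using only the standard library. The table format and tag names below are invented for illustration; this is NOT HyPar's actual rule-table format or API.

// Sketch of a user-defined rule table: which tags to keep in the output.
// Invented for illustration only -- not HyPar's table format.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Hypothetical rule table: tag name -> keep it in the cleaned output?
    std::map<std::string, bool> rules;
    rules["p"]    = true;
    rules["a"]    = true;
    rules["font"] = false;   // e.g. presentational tags get dropped

    // A pre-tokenized stream of tag names ( real input would be raw markup ).
    std::vector<std::string> tags;
    tags.push_back("p");
    tags.push_back("font");
    tags.push_back("a");

    for (std::vector<std::string>::const_iterator it = tags.begin();
         it != tags.end(); ++it) {
        std::map<std::string, bool>::const_iterator r = rules.find(*it);
        if (r != rules.end() && r->second)
            std::cout << "keep <" << *it << ">\n";
        else
            std::cout << "drop <" << *it << ">\n";
    }
    return 0;
}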
The suite includes a sample.cpp file showing how to use the library, and the library can be linked into your own applications too, for example to extract particular parts of your markup files, such as links or tables.
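As an illustration of that kind of application, a link-extraction pass over a DOM-like tree could look something like the sketch below. The Node structure and field names are made up for this example and are not taken from HyPar's API.

// Hypothetical illustration: walk a DOM-like tree and collect the href
// attributes of <a> nodes. Node layout and names are made up for this
// sketch, not taken from HyPar.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string name;                            // tag name, e.g. "a"
    std::map<std::string, std::string> attrs;    // attribute -> value
    std::vector<Node*> children;
};

void collect_links(const Node* n, std::vector<std::string>& out) {
    if (n->name == "a") {
        std::map<std::string, std::string>::const_iterator it =
            n->attrs.find("href");
        if (it != n->attrs.end())
            out.push_back(it->second);
    }
    for (size_t i = 0; i < n->children.size(); ++i)
        collect_links(n->children[i], out);
}

int main() {
    // Build a tiny tree by hand: <html><a href="http://example.org">...</a></html>
    Node link;  link.name = "a";     link.attrs["href"] = "http://example.org";
    Node root;  root.name = "html";  root.children.push_back(&link);

    std::vector<std::string> links;
    collect_links(&root, links);
    for (size_t i = 0; i < links.size(); ++i)
        std::cout << links[i] << '\n';
    return 0;
}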
It is written in C++, and does not *yet* use the auto* tools. Simple make commands build and install it ( in your $HOME ).
We have run this code on huge amounts of data ( > 1 GB ), and found it to be stable and without memory leaks. There are other related applications written around this code which I'm planning to release soon.
Please download and test.
URL: http://www.it.iitb.ac.in/~jaju/hypar/
Online docs generated with Doxygen can be found on the site.
On Thu, 5 Sep 2002, Ravindra Jaju wrote:
> It is written in C++, and does not *yet* use the auto* tools. Simple make commands build and install it ( in your $HOME ).
If you want someone to do the autotools stuff for you, you know where to find me.
On Thu, Sep 05, 2002 at 12:49:36PM +0530, Philip S Tellis wrote:
> On Thu, 5 Sep 2002, Ravindra Jaju wrote:
> > It is written in C++, and does not *yet* use the auto* tools. Simple make commands build and install it ( in your $HOME ).
> If you want someone to do the autotools stuff for you, you know where to find me.
Thanks! :)
Though it's not currently at the top of my priority list, I'll keep it in mind.
Right now, the list looks like this:
1] Some API stabilization/changes and bug-hunting ( a case where the memory required shoots up disproportionately, as reported by 'top' )
2] Making it compile with GCC [2|3].x ( it does, but a few unresolved issues remain )
3] Auto-tools ....
The released code has been tested only with GCC 2.96.x ( the RH version ).
Thanks again.
On Thu, 5 Sep 2002, Ravindra Jaju wrote:
> 1] Some API stabilization/changes and bug-hunting ( a case where the memory required shoots up disproportionately, as reported by 'top' )
use gprof.
> 2] Making it compile with GCC [2|3].x ( it does, but a few unresolved issues remain )
3.0 has issues. Try 3.0.1.
On Thu, Sep 05, 2002 at 12:39:26PM +0530, Ravindra Jaju wrote:
> Hi all.
> This is to announce a pre-alpha release of my Hypertext Parsing Suite.
Addendum:
Lest I give the impression that this was all done alone ....
Kunal Punera - He wrote the entire DOM tree creation part.
Soumen Chakrabarti - Guide and chief trouble-shooter. Most importantly, whatever good structure the code has ( including API/attribute names ) is due to him, apart from a *lot* more.