Saturday, April 29, 2006

Retrieving the content of a blog site: finding the right crawler.

A Web crawler is a program that scans parts of the web in search of information. The crawler can be used to retrieve accessible information from a site, build a navigation index, or search for a particular element.
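To make the idea concrete, here is a minimal sketch of a crawler in Python: fetch a page, pull out its links, and follow the ones that stay on the same site, up to a page limit. It is only an illustration of the technique, not one of the crawlers discussed below; the start URL and page limit are arbitrary.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=50):
        """Breadth-first crawl restricted to the start URL's host."""
        host = urlparse(start_url).netloc
        queue, seen = [start_url], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that fail to download
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:
                    queue.append(absolute)
        return seen

    if __name__ == "__main__":
        for page in crawl("http://slashdot.org/"):
            print(page)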

Our current goal is the full retrieval of a blog site's content, Slashdot's in particular. We would like to store the content of Slashdot on an offline drive, parse it, and analyze the inherent structure of its links. Then we will try to visualize this structure in a dynamic manner using a Radial Tree.
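Once the pages sit on the offline drive, extracting the link structure amounts to walking the saved files and recording which links each page contains. A rough sketch, assuming the mirror is a directory of .html files (the directory name and the simple href regex are illustrative only):

    import os
    import re

    HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

    def build_link_graph(mirror_dir):
        """Map each saved HTML file in the offline copy to the links it contains."""
        graph = {}
        for root, _dirs, files in os.walk(mirror_dir):
            for name in files:
                if not name.endswith((".html", ".htm")):
                    continue
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    graph[os.path.relpath(path, mirror_dir)] = HREF_RE.findall(f.read())
        return graph

    if __name__ == "__main__":
        # "slashdot-mirror" is a placeholder for wherever the offline copy lives.
        for page, links in build_link_graph("slashdot-mirror").items():
            print(page, "->", len(links), "links")

An adjacency list like this is the kind of structure the Radial Tree visualization could later be fed.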

The crawlers we are currently considering are WIRE and GNU Wget. There is also the BAILANDO Project at Berkeley, where some of the tools developed in the WebTANGO sub-project could be of great use. WebTANGO has an online crawling tool that is currently not working, but its site announces its return in a few days....
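For the GNU Wget option, a whole-site retrieval is essentially a single recursive download; the sketch below just drives it from Python, with a delay between requests to stay polite (the output directory is a placeholder, and the flags shown are the standard mirroring options rather than a tuned configuration).

    import subprocess

    # Rough sketch of driving GNU Wget for a full-site mirror.
    # --mirror turns on recursive retrieval with timestamping,
    # --convert-links rewrites links so the offline copy can be browsed
    # locally, and --wait adds a delay between requests.
    subprocess.run(
        [
            "wget",
            "--mirror",
            "--convert-links",
            "--wait=1",
            "--directory-prefix=slashdot-mirror",  # placeholder output directory
            "http://slashdot.org/",
        ],
        check=True,
    )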

We first need to differentiate between the blog's informative content, such as posts and comments, and other material such as links, advertisements, and admin functions. Parsing RSS might be of great help, since RSS syndicates the content in a more structured manner than HTML. RSS feeds are usually devoted to informative content only, but that remains to be tested.
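A quick way to test that would be to pull Slashdot's feed and look at what each entry actually contains. A sketch using the feedparser library (one possible choice, not a decision; the feed URL is assumed):

    import feedparser  # third-party: the Universal Feed Parser

    # Feed URL is assumed; Slashdot advertises its RSS feed on the front page.
    feed = feedparser.parse("http://rss.slashdot.org/Slashdot/slashdot")

    print("Feed title:", feed.feed.get("title", "unknown"))
    for entry in feed.entries:
        # Each entry should carry only the informative part of a post:
        # title, permalink, and a summary, without the surrounding page chrome.
        print(entry.get("title"))
        print(" ", entry.get("link"))
        print(" ", entry.get("summary", "")[:120], "...")

If the entries turn out to carry only short summaries rather than full posts and comments, RSS alone will not be enough and we will still need to crawl the HTML pages.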

The first step we are taking is to identify the source of informative content to be crawled; here RSS is a primary candidate. Then we need to identify the method for crawling, in other words the crawler to be used for content retrieval.
