DAML Crawler

DAML Crawler is a program that collects DAML statements by traversing WWW references and links.

Quick Links

results
current status
query the results

Architecture

The initially envisioned architecture of the DAML Crawler is shown below.
[Architecture diagram]

"Content root" starting locations are submitted and stored in a database. The Crawler runs periodically, maintains a queue of pages to visit, and stores the source and collected DAML content in files. URI/file mapping information and statistics are also stored in the database. Results are published via a WWW interface.

Teknowledge has built a DAML Semantic Search Service to query the Crawler results.

Current Implementation

Content is currently gathered and collected by site, where a site is defined as a protocol://host:port triple. Content roots identify the sites; only sites identified by content roots are currently processed. A Java thread is created for each site, and the thread processes a queue of pages seeded from the content roots. To limit the load on the site being crawled, in accordance with prevailing WWW practice, the thread sleeps for 30 seconds between page retrievals.
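The per-site behavior above can be sketched as follows. The function names are hypothetical, and the `process` callback stands in for the real page handling; only the protocol://host:port grouping and the 30-second politeness delay come from the description above.

```python
import time
from urllib.parse import urlsplit

CRAWL_DELAY = 30  # seconds between retrievals, per prevailing WWW practice

def site_key(uri):
    """Reduce a URI to its protocol://host:port triple, the unit by
    which content is gathered (port defaults by scheme when omitted)."""
    parts = urlsplit(uri)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"{parts.scheme}://{parts.hostname}:{port}"

def crawl_site(pages, process, delay=CRAWL_DELAY):
    """Drain one site's page queue, sleeping between retrievals to
    limit the load on the site being crawled. In the actual Crawler,
    one Java thread runs this loop per site."""
    for i, uri in enumerate(pages):
        if i:
            time.sleep(delay)
        process(uri)
```

Grouping by the full triple means that, for example, a server answering on both port 80 and port 8080 is treated as two separate sites, each with its own thread and its own delay budget.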

Results

Results are reported by site. The site summary includes counts of each result type, as well as the total number of pages processed.
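A site summary of this shape could be tallied as below. The specific outcome categories in the example are hypothetical placeholders, since the original page does not enumerate them here; only the per-category counts and the total page count follow from the description above.

```python
from collections import Counter

def site_summary(outcomes):
    """Tally one site's per-page outcomes (category names are
    hypothetical) plus the total number of pages processed."""
    counts = Counter(outcomes)
    counts["total pages processed"] = len(outcomes)
    return counts
```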

You can see the current status of the running Crawler here.

Possible Future Directions

When seeded with DAML content located on a large site, the Crawler may access large numbers of pages that don't contain DAML content. The use of heuristics to focus on DAML pages may be desirable. In particular, we are looking at the weighting algorithms employed by Expose.

Please provide suggestions and other feedback to crawler@daml.org.

Availability

An instance of DAML Crawler processing public Internet content is hosted on www.daml.org. Results are available here. The open source DAML Crawler will soon be available for download to run on private intranets.

Related Work

Folks at Karlsruhe concurrently developed the RDF Crawler, which also handles DAML.

Authors

Mike Dean and Kelly Barber
$Id: index.xml,v 1.14 2002/04/11 14:27:51 kmbarber Exp $