Introduction to Information Extraction Technology
IJCAI-99 Tutorial
August 2, 1999, Stockholm, Sweden
by
Douglas E. Appelt
David Israel
Artificial Intelligence Center
SRI International
This web page contains pointers to resources and sites of interest for those building information extraction systems, and for those who want to understand current research and the state of the art.
We have made an effort to include what information we could find on the World Wide Web, but we are making no claims that this collection is complete. We know of some interesting projects for which we were unable to find web sites. Absence from this list is not to be construed as a negative evaluation of the system or the research behind it.
Notes from the tutorial
Probably the most important collections of papers for the field of information extraction are the proceedings of the Message Understanding Conferences (MUC). These conferences were actually DARPA-sponsored evaluations in which participants tested their systems in blind evaluations on a common set of data. The conferences were MUC-3 (1991), MUC-4 (1992), MUC-5 (1993), MUC-6 (1995), and MUC-7 (proceedings to be published). The proceedings are available from Morgan Kaufmann Publishers, Inc.
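The MUC evaluations scored systems by comparing extracted template slots against an answer key using precision, recall, and an F-measure. The following is an illustrative sketch of that scoring idea, not the official MUC scorer; the slot names and data are invented for the example.

```python
# Illustrative sketch (not the official MUC scorer): precision, recall,
# and F-measure over extracted (slot, value) pairs compared to an answer key.

def score(extracted, gold):
    """Return (precision, recall, F) for sets of (slot, value) pairs."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f

# Invented example: the system found one of three gold slots plus one error.
gold = {("PERSON_IN", "John Smith"), ("POST", "president"), ("ORG", "Acme Corp.")}
extracted = {("PERSON_IN", "John Smith"), ("POST", "chairman")}
p, r, f = score(extracted, gold)
```

Here precision is 1/2 (one of two extracted slots is correct) and recall is 1/3 (one of three gold slots was found).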
Papers relating to aspects of information extraction
The group at the University of Massachusetts at Amherst led by Wendy
Lehnert developed very effective information extraction systems for the
MUC-4 and MUC-5 evaluations. Recent research has focused on
automatically learning extraction rules from annotated corpora.
Dekang Lin of the University of Manitoba developed two very different
information extraction systems: one for the microelectronics domain of
MUC-5, and one for the management succession domain of MUC-6. They are
described here.
This paper by Roche and Schabes describes a deterministic
finite-state part of speech tagger that would, in our opinion, be particularly
suitable for application in information extraction systems.
This is an HTML version of a paper that describes in considerable detail
how the SRI FASTUS system works.
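FASTUS is built from cascades of finite-state pattern matchers over phrases. The sketch below illustrates that general idea with a single regular-expression pattern; the pattern, slot names, and sentence are invented for illustration and are far simpler than an actual FASTUS cascade.

```python
import re

# A toy finite-state surface pattern in the spirit of FASTUS-style systems:
# match a management-succession-like sentence and fill an event template.
# The pattern and slot names are invented for this illustration.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
    r" was named (?P<post>[a-z ]+)"
    r" of (?P<org>[A-Z][A-Za-z.& ]+)"
)

def extract(sentence):
    """Return a filled template dict, or None if the pattern does not match."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {"person": m.group("person"),
            "post": m.group("post"),
            "org": m.group("org")}

result = extract("Yesterday John Smith was named vice president of Acme Corp.")
```

Real systems of this kind run several such finite-state stages in sequence (tokenization, phrase recognition, pattern matching, template merging) rather than a single regular expression.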
This is a collection of papers produced by the NYU PROTEUS project over its many years of operation.
Resources and tools for building information extraction systems
The Consortium for Lexical Research was operated by the Computing Research Laboratory (CRL) at New Mexico State University until December 1, 1995, when it ceased operation for lack of funding. The Consortium maintained a collection of lexical resources that were available to members. Now that the Consortium no longer exists, CRL has made its files available free of charge to all interested parties. Although the resources are no longer maintained and updated, and hence can become out of date, they remain a very valuable source of information for building information extraction systems. Among other resources, the Consortium offers
The Linguistic Data Consortium (LDC) acts as a repository for a variety
of forms of linguistic data, including speech data, annotated Treebank
and tagged corpora from the Wall Street Journal, the Brown Corpus, the
ACL Data Collection Initiative, and other sources, as well as
comprehensive lexicons such as COMLEX (English) and CELEX (English,
German, and Dutch). Resources are available to institutional members,
and to others on a pay-per-item basis. See the web site for more
information.
A longstanding research project at Princeton University led by George
Miller has developed WordNet, a system that categorizes a very large
number of English words according to their senses, allowing searches
for synonyms, hyponyms, and hypernyms. It is available free to
interested parties from the web page, and runs on various computing
platforms.
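The kinds of lookups such a lexical database supports can be sketched with a toy sense inventory. This is not the Princeton database's actual data or interface, only an invented miniature showing synonyms via shared synsets and hypernyms via is-a links between synsets.

```python
# Toy sense inventory (invented data, not the real lexical database):
# words grouped into synsets, with is-a links between synsets.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile"},
    "vehicle.n.01": {"vehicle"},
    "artifact.n.01": {"artifact", "artefact"},
}
HYPERNYM = {"car.n.01": "vehicle.n.01", "vehicle.n.01": "artifact.n.01"}

def synonyms(word):
    """All other words that share a synset with the given word."""
    return {w for members in SYNSETS.values() if word in members
            for w in members} - {word}

def hypernym_chain(synset_id):
    """Follow is-a links upward from a synset to the root."""
    chain = []
    while synset_id in HYPERNYM:
        synset_id = HYPERNYM[synset_id]
        chain.append(synset_id)
    return chain
```

For information extraction, such a resource is useful for generalizing patterns: a rule written against a hypernym (say, "vehicle") can match any of its hyponyms.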
Eric Brill has researched part-of-speech tagging for several years, and
has developed a tagger that formed the basis of the Roche and Schabes
tagger described in the paper cited in the previous section. His tagger
can be downloaded from his web page.
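Brill's approach is transformation-based: assign each word its most frequent tag, then apply learned contextual rewrite rules. The sketch below shows that two-stage idea with a hand-written lexicon and a single rule; the lexicon, rule, and tag names are invented for illustration, whereas a real tagger learns its rules from a tagged corpus.

```python
# A minimal sketch of transformation-based (Brill-style) tagging.
# Invented toy lexicon: each word's most frequent tag.
LEXICON = {"the": "DET", "to": "TO", "race": "NN", "dog": "NN"}

# Invented contextual rule: change NN to VB when the previous tag is TO
# (so "to race" gets a verb reading while "the race" stays a noun).
RULES = [("NN", "VB", lambda prev: prev == "TO")]

def tag(words):
    """Initial-state tagging from the lexicon, then contextual rewrites."""
    tags = [LEXICON.get(w, "NN") for w in words]
    for i in range(1, len(tags)):
        for old, new, condition in RULES:
            if tags[i] == old and condition(tags[i - 1]):
                tags[i] = new
    return tags

print(tag(["to", "race"]))   # prints ['TO', 'VB']
print(tag(["the", "race"]))  # prints ['DET', 'NN']
```

A trained tagger of this kind applies an ordered list of such rules learned by greedily picking, at each step, the rule that most reduces tagging errors on the training corpus.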
Davy Temperley and Dan Sleator of Carnegie Mellon University
have developed a dependency-grammar formalism and associated parser
that is quite comprehensive and very robust. They have measured its
performance on Wall Street Journal texts as comparable to the best parsers.
It is fast and robust enough to be useful for information extraction
system development, and it is available for free from their web page.
This page is maintained by
Douglas E. Appelt (appelt@ai.sri.com)
Updated April 28, 1999