Introduction to Information Extraction Technology
IJCAI-99 Tutorial
August 2, 1999, Stockholm, Sweden
by
Douglas E. Appelt
David Israel
Artificial Intelligence Center
SRI International
This web page contains pointers to resources and sites of interest for those building information extraction systems, and for those who want to understand current research and the state of the art.
We have made an effort to include what information we could find on the World Wide Web, but we are making no claims that this collection is complete. We know of some interesting projects for which we were unable to find web sites. Absence from this list is not to be construed as a negative evaluation of the system or the research behind it.
Notes from the tutorial
Probably the most important collections of papers for the field of information extraction are the proceedings of the Message Understanding Conferences (MUC). These conferences were actually DARPA-sponsored evaluations in which participants tested their systems in blind evaluations on a common set of data. The conferences were MUC-3 (1991), MUC-4 (1992), MUC-5 (1993), MUC-6 (1995), and MUC-7 (proceedings to be published). The proceedings are available from Morgan Kaufmann Publishers, Inc.
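The MUC evaluations scored systems by comparing extracted template slots against an answer key using precision, recall, and an F-measure. The following is an illustrative sketch of that scoring idea, not the official MUC scorer; the slot names and data are invented for the example.

```python
# Illustrative sketch (not the official MUC scorer): precision, recall,
# and F-measure over extracted (slot, value) pairs compared to an answer key.

def score(extracted, gold):
    """Return (precision, recall, F) for sets of (slot, value) pairs."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f

# Invented example: the system found one of three gold slots plus one error.
gold = {("PERSON_IN", "John Smith"), ("POST", "president"), ("ORG", "Acme Corp.")}
extracted = {("PERSON_IN", "John Smith"), ("POST", "chairman")}
p, r, f = score(extracted, gold)
```

Here precision is 1/2 (one of two extracted slots is correct) and recall is 1/3 (one of three gold slots was found).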
Papers relating to aspects of information extraction
The group at the University of Massachusetts at Amherst led by Wendy
Lehnert developed very effective information extraction systems for the
MUC-4 and MUC-5 evaluations. Recent research has focused on
automatically learning extraction rules from annotated corpora.
Dekang Lin of the University of Manitoba developed two very different
information extraction systems: one for the microelectronics domain of
MUC-5, and one for the management succession domain of MUC-6. They are
described here.
This paper by Roche and Schabes describes a deterministic
finite-state part of speech tagger that would, in our opinion, be particularly
suitable for application in information extraction systems.
This is an HTML version of a paper that describes in considerable detail
how the SRI FASTUS system works.
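FASTUS is built from cascades of finite-state pattern matchers over phrases. The sketch below illustrates that general idea with a single regular-expression pattern; the pattern, slot names, and sentence are invented for illustration and are far simpler than an actual FASTUS cascade.

```python
import re

# A toy finite-state surface pattern in the spirit of FASTUS-style systems:
# match a management-succession-like sentence and fill an event template.
# The pattern and slot names are invented for this illustration.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
    r" was named (?P<post>[a-z ]+)"
    r" of (?P<org>[A-Z][A-Za-z.& ]+)"
)

def extract(sentence):
    """Return a filled template dict, or None if the pattern does not match."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {"person": m.group("person"),
            "post": m.group("post"),
            "org": m.group("org")}

result = extract("Yesterday John Smith was named vice president of Acme Corp.")
```

Real systems of this kind run several such finite-state stages in sequence (tokenization, phrase recognition, pattern matching, template merging) rather than a single regular expression.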
This is a collection of papers produced by the NYU PROTEUS project over its many years of operation.
Resources and tools for building information extraction systems
The Consortium for Lexical Research was operated by the Computing Research Laboratory (CRL) at New Mexico State University until December 1, 1995, when it ceased operation for lack of funding. The Consortium maintained a collection of lexical resources that were available to members. Now that the Consortium no longer exists, CRL has made its files available free of charge to all interested parties. Although the resources are no longer maintained and updated, and hence can become out of date, they remain a very valuable source of information for building information extraction systems. Among other resources, the Consortium offers
The Linguistic Data Consortium (LDC) acts as a repository for a variety
of forms of linguistic data, including speech data, annotated Treebank
and tagged corpora from the Wall Street Journal, the Brown Corpus, the
ACL Data Collection Initiative, and other sources, as well as
comprehensive lexicons such as COMLEX (English) and CELEX (English,
German, and Dutch). Resources are available to institutional members,
and to others on a pay-per-item basis. See the web site for more
information.
A longstanding research project at Princeton University led by George
Miller has developed WordNet, a system that categorizes a very large
number of English words according to their senses, allowing searches
for synonyms, hyponyms, and hypernyms. It is available free to
interested parties from the web page, and runs on various computing
platforms.
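The kinds of lookups such a lexical database supports can be sketched with a toy sense inventory. This is not the Princeton database's actual data or interface, only an invented miniature showing synonyms via shared synsets and hypernyms via is-a links between synsets.

```python
# Toy sense inventory (invented data, not the real lexical database):
# words grouped into synsets, with is-a links between synsets.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile"},
    "vehicle.n.01": {"vehicle"},
    "artifact.n.01": {"artifact", "artefact"},
}
HYPERNYM = {"car.n.01": "vehicle.n.01", "vehicle.n.01": "artifact.n.01"}

def synonyms(word):
    """All other words that share a synset with the given word."""
    return {w for members in SYNSETS.values() if word in members
            for w in members} - {word}

def hypernym_chain(synset_id):
    """Follow is-a links upward from a synset to the root."""
    chain = []
    while synset_id in HYPERNYM:
        synset_id = HYPERNYM[synset_id]
        chain.append(synset_id)
    return chain
```

For information extraction, such a resource is useful for generalizing patterns: a rule written against a hypernym (say, "vehicle") can match any of its hyponyms.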
Eric Brill has researched part-of-speech tagging for several years, and
has developed a tagger that formed the basis of the Roche and Schabes
tagger described in the paper cited in the previous section. His tagger
can be downloaded from his web page.
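Brill's approach is transformation-based: assign each word its most frequent tag, then apply learned contextual rewrite rules. The sketch below shows that two-stage idea with a hand-written lexicon and a single rule; the lexicon, rule, and tag names are invented for illustration, whereas a real tagger learns its rules from a tagged corpus.

```python
# A minimal sketch of transformation-based (Brill-style) tagging.
# Invented toy lexicon: each word's most frequent tag.
LEXICON = {"the": "DET", "to": "TO", "race": "NN", "dog": "NN"}

# Invented contextual rule: change NN to VB when the previous tag is TO
# (so "to race" gets a verb reading while "the race" stays a noun).
RULES = [("NN", "VB", lambda prev: prev == "TO")]

def tag(words):
    """Initial-state tagging from the lexicon, then contextual rewrites."""
    tags = [LEXICON.get(w, "NN") for w in words]
    for i in range(1, len(tags)):
        for old, new, condition in RULES:
            if tags[i] == old and condition(tags[i - 1]):
                tags[i] = new
    return tags

print(tag(["to", "race"]))   # prints ['TO', 'VB']
print(tag(["the", "race"]))  # prints ['DET', 'NN']
```

A trained tagger of this kind applies an ordered list of such rules learned by greedily picking, at each step, the rule that most reduces tagging errors on the training corpus.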
Davy Temperley and Dan Sleator of Carnegie Mellon University
have developed a dependency-grammar formalism and associated parser
that is quite comprehensive and very robust. They have measured its
performance on Wall Street Journal texts as comparable to the best parsers.
It is fast and robust enough to be useful for information extraction
system development, and it is available for free from their web page.
This page is maintained by
Douglas E. Appelt (appelt@ai.sri.com)
Updated April 28, 1999