ACH/ALLC 1993 Conference Report

Peter Flynn Computer Centre, University College, Cork, Ireland

This report is a summary of the joint conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, held at Georgetown University, Washington DC, 16–19 June 1993. It contains a précis of the text published in the preprints, supplemented by the author's notes, but omissions occur for a few sessions where (a) no paper was available; (b) the panel discussion was held viva voce; or (c) a fuller report is available from the speaker. In dealing with topics sometimes outside my own field, I will naturally have made mistakes, and I ask the authors' pardon if I have misrepresented them.

A hypertext version of this report is available on the Internet at curia.ucc.ie/info/achallc/georgetown.html and can be accessed through the World Wide Web using Lynx, Mosaic or similar browsers.


Tuesday, June 15

Preconference Activities


Wednesday, June 16

Opening Session

Welcomes: Mr. John J. DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus; Rev. Robert B. Lawton, S.J., Dean, Georgetown College; Susan K. Martin, University Librarian; Nancy Ide, President, Association for Computers and the Humanities; Susan Hockey, President, Association for Literary and Linguistic Computing

Keynote Speaker: Clifford Lynch, Director of Library Automation, Office of the President, University of California

The opening ceremony was held in the splendour of Gaston Hall, Georgetown University. Dr Michael Neuman, organiser of the conference, warmly welcomed attendees and then introduced each of the speakers.

In his welcoming speech, Mr John J DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus, noted that it was impressive that the Georgetown conference followed Oxford and preceded the Sorbonne. Georgetown had a campus in Italy, and there seemed to be similarities between Georgetown and Florence in literature, history, philosophy and art. Working in computing in the humanities must be a little like working in Florence in the 16th century. We stand on the verge of several possibilities, and the very idea of `text' is central to shaping them. We will be facing serious questions in the years ahead, and it was appropriate that the purpose of the conference was to improve learning.

The Rev Robert B Lawton SJ, Dean of Georgetown College, spoke of the computer as sophisticated technology for retrieving and processing information which represents a profound movement in human evolution, in that we can now extend our very powers of thinking. He hoped that the conference would lead to profitable conversations which would enrich the time spent at Georgetown.

Nancy Ide welcomed the audience on behalf of the Association for Computers and the Humanities. She concluded by pointing out that the theme suggests we are at an important moment, and set the scene for the conference.

On behalf of the Association for Literary and Linguistic Computing, Chairman Susan Hockey complimented the Georgetown University organisers for their effort and effectiveness in organising the conference. The programme committee had put together a programme that shows where literary and linguistic computing and humanities computing will make contributions well into the next decade. There is great potential for working together with librarians to pursue electronic research, so that a significant contribution could be made to the electronic library of the future. The conference is a valuable opportunity to bring together the concerns of librarians and scholars and the skills of computer scientists to develop programs for the creation and manipulation of electronic texts.

Susan K. Martin, University Librarian, also noted the growing interest in electronic texts in the research library community. Librarians will take on new and exciting roles as the true potential of electronic texts becomes more fully understood.

In the opening keynote address, Clifford Lynch, Director of Library Automation, Office of the President, University of California, surveyed the current and future scenes for electronic information delivery and access. In a presentation which ranged over many topics with great clarity and vision, Dr Lynch stressed that the future lay in electronic information and in getting definitional handles on its components. He spoke about the technology and computing methods underlying the 12,000 constituent networks through which ideas can be exchanged and information accessed. Information now exists that is independent of particular computer technology, usable through open standards and network servers that can migrate from one generation of technology to the next, thereby preventing the preservation disasters of the past: information can now cross many generations of technology.

Inspiration and new analysis tools continue to allow researchers to build on the work of others. Databases and knowledge management have allowed libraries to become a central part of the scientific enterprise, used and shared internationally. Networks act as facilitators of collaboration and allow the inclusion of geographically remote researchers.

To establish how to handle electronic texts, there is a need for coordination between libraries, centres for electronic texts and initiatives such as the TEI. He also pointed out that intellectual property and copyright in textual material are turning out to be an incredible nightmare.

He proposed overcoming these problems by developing a superstructure for the use of textual resources, arguing that networked information is the right perspective from which to think about them, and that its future offers an interesting voyage, fundamental to the issues of scholarship.

Track 1, 11:00: Vocabulary Studies. Chair: Christian Delcourt (Université de Liège)

Douglas A Kibbee (University of Illinois) The History of Disciplinary Vocabulary: A Computer-Based Approach to Concepts of `Usage' in 17th-Century Works on Language.

By linking over 50 texts dealing with the French language, from Estienne's treatises to the Dictionnaire de l'Académie, in which the key issues in the debate over usage are mentioned (dialects, archaisms, neologisms, foreign borrowings, spelling, pronunciation, sociolinguistic variation, etc), it is possible to reconstruct what constituted the metalanguage of grammatical discussion in 17th-century France. He argues that the full-text database techniques of corpus linguistics can be brought to bear on the neglected analysis of the history of discourse (in particular the definition of `usage') and that the importance of these disciplines in another age cannot be subject to current theoretical fashions.

Terry Butler, Donald Bruce (University of Alberta) Towards the Discourse of the Commune: Computer-Aided Analysis of Jules Vallès' Trilogy Jacques Vingtras

This study concentrates on the representational aspects of the discursive status of the Paris Commune of 1871, using a computer-aided analysis of the titular trilogy. The authors' hypothesis (that Proudhon's formal proposition of `anarchism' is realized in the narrative, metaphor and lexical items of Vallès, Rimbaud and Reclus) is being tested in two stages, firstly on unmarked texts, and secondly using PAT and TACT on TEI-marked versions of the texts, both to verify the existence of the empirical regularities and to ascertain heuristically any undiscovered patterns or relationships.

Track 2, 11:00: Statistical Analysis of Corpora. Chair: Nancy Ide (Vassar College)

Hans van Halteren (University of Nijmegen) The Usefulness of Function and Attribute Information in Syntactic Annotation

The author distinguishes two types of corpus exploitation: micro, where a specific phenomenon (for example, a linguistic element) is studied in detail, and macro, where groups of phenomena are studied on a corpus-wide basis (for example, to derive a probabilistic parser). Given the substantial effort needed to create syntactic annotation of corpus material, he examines the level of detail in annotation required for examples of each of the two types of exploitation identified. Micro-exploitation is usually more successful if it involves only categorisation, rather than exploring functional relationships; the macro-exploitation of parser generation is more problematic, since the use of the data is much more varied.

R Harald Baayen (Max-Planck Institute for Psycholinguistics) Quantitative Aspects of Lexical Conceptual Structure

In the analysis of lexical conceptual structure, distributional data may help in solving problems of linguistic underdetermination. With morphological productivity as an example, the author uses the Dutch Eindhoven corpus to show that the frequency distributions of inchoative and separative readings of the Dutch prefix `ont-' are statistically non-distinct. He argues that a linguistic analysis is called for in which deverbal and denominal reversatives are assigned identical lexical conceptual structures.

Elizabeth S Adams (Hood College) Let the Trigrams Fall Where They May: Trigram Type and Tokens in the Brown Corpus

Analysis of the distribution of trigram tokens in the Brown corpus shows that one-sixth of the trigrams occurred only once, 30 percent between 2 and 10 times, and a quarter between 11 and 100 times, with only 24 trigrams occurring over 10,000 times. A comparison of the increase in the number of types as documents were added (once in order of occurrence and once in random order) indicates that the incremental addition of documents to the corpus will not push the number of types over about 11,000. The computational effectiveness of trigram-based retrieval is emphasized, given its advantages in terms of usability.
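
As a rough illustration of the counting involved (not the author's code, and assuming character trigrams and a particular tokenisation), the type/token distinction might be tallied as follows:

    from collections import Counter

    def trigram_distribution(text: str) -> Counter:
        """Count every overlapping character trigram (token) in the text."""
        counts = Counter()
        for i in range(len(text) - 2):
            counts[text[i:i + 3]] += 1
        return counts

    # Band the counts the way the paper reports them.
    counts = trigram_distribution("the quick brown fox jumps over the lazy dog " * 100)
    types = len(counts)                        # distinct trigrams seen
    tokens = sum(counts.values())              # total trigram occurrences
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"{types} types, {tokens} tokens, {singletons} occurring once")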

Track 3, 11:00: The Academical Village: Electronic Texts and the University of Virginia (Panel)

John Price-Wilkin (University of Virginia), Chair
Jefferson's term for the integrated learning environment he envisaged has been taken as the metaphor for a vigorous pursuit of the development of computing resources for the Humanities at the University of Virginia. Providing SGML-tagged texts on VT100 and X-Windows platforms using PAT and Lector, the Institute for Advanced Technology in the Humanities is designed to place nationally-recognised scholars in an environment where they can experiment freely with computer-aided research projects.
Kendon Stubbs (University of Virginia)
The electronic text initiative got under way in 1991 with the goals of providing facilities to ordinary faculty, graduates and students (rather than specialists); making the texts available remotely, rather than only on Library premises; and focusing on SGML-tagged texts. It has proven its worth in making Humanities computing a high-impact, high-visibility and low-cost initiative; a catalyst for innovation elsewhere in the Library; and a part of the infrastructure and model for future development.
David Seaman (University of Virginia)
The Electronic Text Center is open for most of the Library's regular hours, providing a walk-in service. Apart from this service, it provides an introduction to the technology and information about Humanities computing in a non-threatening way. The early signs are that faculty and researchers take readily to SGML-conformant texts and online tools.
David Gants (University of Virginia)
Two examples of computer applications in the Humanities come from teaching experience and from continuing research:
  • Using a multimedia package, it was possible to create an electronic version of several scenes from The Merchant of Venice, employing verbal, visual and textual annotation, for example a transcription of the 1600 quarto and 1623 folio versions, a 1701 adaptation, an audio track of Warren Mitchell delivering `Hath not a Jew eyes?', a 17th-century pictorial representation of Jewish life in England, and an account of English anti-semitism. On a computer connected to the network, further resources such as WAIS can also be used.
  • Using SGML, it has been possible to make a reconstruction of all the variant forms of the 1616 folio Workes of Ben Jonson, with corrections, re-impressions and resettings, in a database which will form a major part of the bibliography for a doctoral dissertation.
Edward Ayers (University of Virginia)
One project follows two communities, one Northern and one Southern, through the era of the [American] Civil War. It can be conceived as thousands of intertwined biographies, tracing the twists and turns in people's lives as they confronted the central experience of the American nation. The technology allows the inclusion of scanned images from microfilm of newspapers, the manuscript census, maps and other images, views of the battlescape, and political, economic and military news of the time.

Track 1, 2:00: Interrogating the Text: Hypertext in English Literature (Panel)

Harold Short (King's College, London), Chair
This session emphasises the pædagogical theory behind courseware design, an examination of the elaborate claims which have been made concerning the revolutionary impact of hypertext in education, and its facility for democratising education and allowing student-centered learning.
Patrick W. Conner, Rudolph P. Almasy (West Virginia University) Corpus Exegesis in the Literature Classroom: The Sonnet Workstation
The Sonnet Workstation is a HyperCard implementation of a literary corpus which allows students to read, compare and write about any corpus of short texts. It incorporates hypertext links to annotations and other texts, and offers a search routine allowing students to carry out searches of several megabytes of 16th-century English sonnets with a glossary and online thesaurus.
Mike Best (University of Victoria) Of Hype and Hypertext: In Search of Structure
Two practical examples of programs developed by the author illustrate the use of hypertext and explore some of the theoretical questions. DynaMark is a classroom program for commenting and reviewing by instructors and students, allowing the attachment of comments to text and enabling the text to be studied as if under a microscope. By contrast, Shakespeare's Life and Times lets the student expand from the text out into the world, using HyperCard to provide the links between classroom and library, with access to hypermedia resources such as music of the period and text spoken in the relevant dialects.
Stuart Lee (Oxford University) Hypermedia in the Trenches: First World War Poetry in Hypercard - Observations on Evaluation, Design, and Copyright
Much successful software is based on periods culturally removed from the time of today's students (such as Beowulf or Shakespeare). Lee and Sutherland's HyperCard version of Isaac Rosenberg's Break of Day in the Trenches provides branches out into three main areas: Rosenberg's own life; analogues; and World War I. Rather than being a definitive teaching tool for WWI poetry, it is more of a prerequisite study for a tutorial or seminar on the poet. In an evaluation with a group of A-level literature students, 96% of them enjoyed it, but worryingly they felt they no longer needed to research the material as they had `seen everything'.

Track 2, 2:00: Discourse and Text Analysis. Chair: Estelle Irizarry (Georgetown University)

Greg Lessard, Michael Levison (Queen's University) Computational Models of Riddling Strategies

Previous research has demonstrated that there is a formalisable, learnable set of mechanisms which can generate, in principle, an unlimited set of `Tom Swifties' (a form of wordplay such as `I hate Chemistry, said Tom acidly'). This is now extended to analyse structures such as riddles (`Why did the dog go out into the sun? To be a hot dog'). Such riddles share an essential trait with Tom Swifties: they are learned and learnable linguistic strategies.

The VINCI natural language generation environment offers a context-free phrase-structure grammar, a syntactic tree transformation mechanism, a lexicon and lexical pointer mechanism, and a lexical transformation mechanism, providing a modelling environment suitable for such analyses.

The model analysed allows the selection of the specific different semantic traits which pose the problem, and generates a question containing them. Three different kinds of question are exemplified, and the paper examines in more detail the linguistic constraints on riddles, in particular the tension between lexicalisation of the correct answer versus productivity.

Walter Daelemans, Antal van den Bosch (Tilburg University), Steven Gilles, Gert Durieux (University of Antwerp) Learning Linguistic Mappings: An Instance-Based Learning Approach

One of the most vexing problems in Natural Language Processing is the linguistic knowledge acquisition bottleneck. For each new task, basic linguistic data structures have to be handcrafted almost from scratch. This paper suggests the application of Machine Learning algorithms to derive automatically the knowledge necessary to achieve particular linguistic mappings.

Instance-Based Learning is a framework and methodology for incremental supervised learning, whose distinguishing feature is the fact that no explicit abstractions are constructed on the basis of the training examples during the training phase: a selection of the training items themselves is used to classify new inputs.

A training example consists of a pattern (a set of attribute/value pairs) and a category. The algorithm takes as input an unseen test pattern drawn from the input space and associates a category with it. The paper compares the results of this approach quantitatively to alternative similarity-based approaches, and qualitatively to handcrafted rule-based alternatives.
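
A minimal sketch of the instance-based idea, in which stored training items classify new patterns directly with no intervening abstraction; the overlap metric and the toy linguistic mapping below are assumptions for illustration, not the authors' system:

    from collections import Counter

    def classify(test_pattern: dict, training_set: list[tuple[dict, str]], k: int = 1) -> str:
        """Instance-based classification: score each stored instance by the
        number of attribute/value pairs it shares with the test pattern and
        let the k closest instances vote on the category."""
        scored = sorted(
            training_set,
            key=lambda item: sum(test_pattern.get(a) == v for a, v in item[0].items()),
            reverse=True,
        )
        votes = Counter(category for _, category in scored[:k])
        return votes.most_common(1)[0][0]

    # Toy example of a linguistic mapping: choosing a plural suffix.
    training = [
        ({"final_sound": "s"}, "-es"),
        ({"final_sound": "t"}, "-s"),
        ({"final_sound": "sh"}, "-es"),
    ]
    print(classify({"final_sound": "s"}, training))   # -> "-es"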

Michael J Almeida, Eugenie P Almeida (University of Northern Iowa) NewsAnalyzer - An Automated Assistant for the Analysis of Newspaper Discourse

The NewsAnalyzer program is intended to assist researchers from a variety of fields in the study of newspaper discourse. It works by breaking up newspaper articles into individual statements which are then categorised along several syntactic, semantic and pragmatic dimensions. The segmented and categorised text can then serve as data for further analysis along lines of interest to the researcher, for example statistical, content or ideological analysis.

The classification scheme distinguishes between factual and non-factual statements: factual ones are further classified as stative or eventive, and non-factual ones have several further subcategories. This is done through a combination of shallow syntactic analyses, the use of semantic features on verbs and auxiliaries, and the identification of special function words.

The program is implemented in Scheme and runs on Apple Macintosh computers. A hand-coded version has been used in a linguistics-based content analysis of two weeks of newspaper coverage in which all front-page stories were coded and analysed for writing style, and has also been used to study the ways in which newswriters reported predictions about the Presidential and Vice-Presidential candidates [in the 1992 US Presidential election].

Track 3, 2:00: Networked Information Systems. Chair: Eric Dahlin (University of California, Santa Barbara)

Malcolm B Brown (Dartmouth College) Navigating the Waters: Building an Academic Information System

The Dartmouth College Information System (DCIS) is organised as a distributed computing resource using the layered OSI model for its network interactions. It includes features such as:

The client/server model used introduces a modularity that was not previously available, and the use of the network frees the user from geographical constraints and delays. The information thus made available is felt to have potential equal to that of word processing for facilitating the basic work of Humanities scholarship.

Charles Henry (Vassar College) The Coalition for Networked Information (CNI), the Global Library, and the Humanities

The CNI could be the key in supporting the Humanities, as it builds upon its original programs of scholarship enhancement through the creation of a free and accessible global library of electronic holdings via the National Research and Education Network (NREN).

Through its working group `The Transformation of Scholarly Communication', the CNI intends to identify and help promulgate projects and programs in the Humanities that have significant implications for changing scholarship methodology and teaching when made available on the NREN.

Christian-Emil Øre (University of Oslo) The Norwegian Information System for the Humanities

This six-year project started in March 1991 to convert the paper-based archives of the collection departments in Norwegian universities to computer-based form, creating the `Norwegian Universities' Database for Language and Culture'. An estimated 750 man-years of work is required: currently 120 people, recruited from the skilled unemployed, are engaged on a 50%-50% work and education programme.

The information core is held in SYBASE and accessed using PAT and SIFT on UNIX platforms. Using client-server technology, the database resides on a machine in Oslo and can be accessed through local clients (Omnis7 or HyperCard) or by remote X-terminals.

Data loading has started with four subprojects: coins (a collection of approximately 200,000 items at the University of Oslo), archaeology (reports in SGML on all archaeological sites in Norway), Old Norse (from 1537 CE, approximately 30,000 printed pages) and Modern Norwegian (the creation of a lexical database for the language). All text is stored in TEI-conformant form.

Track 1, 4:00: The Computerization of the Manuscript Tradition of Chrétien de Troyes' Le Chevalier de la Charrette (Panel)

Joel Goldfield (Plymouth State College), Chair and Reporter
The best work in the future of literary computing will be dramatically facilitated by the availability of databases prepared by those scholars who have a masterful knowledge of their discipline, who have availed themselves of detailed, appropriate encoding schemes, and who have envisioned the widest scope of uses of their databases at all levels of scholarship.
Karl D Uitti (Princeton University) Old French Manuscripts, the Modern Book, and the Image
`Text' is often equated with the `final', printed work of an author, but this is frequently an arbitrary construct: before printing, scribes considered themselves a part of the literary process, whereas our editions contain what we believe the mediæval author `wrote'. By replicating in database form an important Old French MS tradition, we wish to augment the resources open to scholars by making available an authentically mediæval and dynamic example of pre-printing technology.
Gina L Greco (Portland State University) The Electronic Diplomatic Transcription of Chrétien de Troyes's "Le Chevalier de la Charrette (Lancelot)": Its Forms and Uses
The Princeton project differs from the Ollier-Lusignan Chrétien database in that it includes all eight manuscripts. The participants believe it is important not to resolve scribal abbreviations but to preserve this information intact: the MS text will be transcribed exactly, with word divisions, punctuation and capitalization. The hope is that the electronic `editing' will be continuous, but with control centred in Princeton.
Toby Paff (Princeton University) The `Charrette' Database: Technical Issues and Experimental Resolutions
Treating the `Charrette' materials as a database has several advantages over other approaches. While it provides fast access to words, structures, lines and sections for analysis, it also offers a rich array of resources for dealing with orthographic, morphological, grammatical and interpretative problems. The Foulet-Uitti edition is available in a SPIRES database and is augmented by lexicographical and part-of-speech indexes. The Postgres implementation allows the matching of dictionary searches with images of the manuscripts themselves.

Track 2, 4:00: Computer-Assisted Learning Systems. Chair: Randy Jones (Brigham Young University)

Eve Wilson (University of Kent at Canterbury) Language of Learner and Computer: Modes of Interaction

The difficulties of providing a Computer-Assisted Language Learning (CALL) system are exacerbated by the fact that the teacher is often not a computer specialist. Such packages need to accommodate both the needs and aptitudes of the learner and the goals of the teaching programme. Good interface design is essential, so that it is easy for teachers to add material as well as being easy for students to use. The proposed system uses SGML to define the formal structure, and a sample DTD for a student exercise is included.

Floyd D Barrows, James B. Obielodan (Michigan State University) An Experimental Computer-Assisted Instructional Unit on Ancient Hebrew History and Society

The program content covers the story of the Hebrew people from their settlement in the Jordan valley to the fall of Jerusalem in 70 CE. It is implemented in ToolBook 1.5 for the IBM PC but can be ported to HyperCard for the Macintosh. The courseware is designed to provide interactive lessons on the forces that influenced the development of the Hebrew people, and students can select from 12 lesson units, which include problem-solving questions. It can be used by enrolled students (giving a grade) or by guest users (interest only) and provides pre- and post-test elements to judge performance.

Track 3, 4:00: Information Resources for Religious Studies. Chair: Marianne Gaunt (Rutgers University)

Michael Strangelove (University of Ottawa) The State and Potential of Networked Resources for Religious Studies: An Overview of Documented Resources and the Process of Creating a Discipline-Specific Networked Archive of Bibliographic Information and Research/Pedagogical Material

An increasing number and variety of networked archives related to religious studies have appeared in the last few years, based on LISTSERV and ftp. The author provides an overview of the experience of creating and cataloguing network-based resources in religious studies, and relates the process and online research strategy of writing a comprehensive bibliography and guide to networked resources in religious studies, The Electric Mystic's Guide to the Internet. The issues of size and growth rate, security, copyright and verification, skills and tools required, and the funding strategies needed are also discussed.

Andrew D. Scrimgeour (Regis University) Cocitation Study of Religious Journals

Cocitation Analysis was developed in 1973 and is used here to study how humanities scholars perceive the similarities of 29 core religion journals. The resultant map is a graphic picture of the field of religious studies as organized by its journals. The map depicts the spatial relationships between each speciality area and also between the individual journals within each area. These maps are useful for providing an objective technique for tracing the development of a discipline over time and are of potential benefit in teaching basic courses in religious studies.

Evening Activities

5:45: ALLC Annual General Meeting
[text needed from Susan Hockey]
8:00: Report of the Text Encoding Initiative
[text needed from Lou Burnard/Michael Sperberg-McQueen]

Thursday, June 17

Track 1, 9:00: Hypertext Applications. Chair: Roy Flannagan (Ohio University)

John Lavagnino (Brandeis University) Hypertext and Textual Editing

The recent innovation of hypertext has been taken as providing a clear and convincing solution to the problems facing textual scholars. The author examines what hypertext can and cannot do for editions compared with print.

Space is the predominant factor: it is unlimited in hypertext and has obvious attractions for handling multiple-version and apparatus criticus problems. The other main factor is ordering: hypertext can provide multiple views of a corpus that are unobtainable in print. The mechanisms used are the link, the connection between one point in hypertext and another, and the path, a series of pre-specified links.
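
A minimal sketch of those two mechanisms as data structures (the names and fields are illustrative, not taken from the paper):

    from dataclasses import dataclass

    @dataclass
    class Link:
        """A connection between one point in the hypertext and another."""
        source: str       # anchor id in the originating document
        target: str       # anchor id in the destination document
        label: str = ""   # e.g. "variant reading", "apparatus note"

    @dataclass
    class Path:
        """A pre-specified series of links, giving one ordered view of the corpus."""
        name: str
        links: list[Link]

        def traverse(self):
            for link in self.links:
                yield link.source, link.target

    # One edition can carry several paths over the same set of links,
    # e.g. one ordered by first-edition sequence and another by revision date.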

Despite its obvious attractions, editors have been too limited in what they want from hypertext, seeking only solutions to the problems of traditional publishing rather than taking advantage of the new medium's possibilities.

Risto Miilumaki (University of Turku) The Prerelease Materials for Finnegans Wake: A Hypermedia Approach to Joyce's Work in Progress

The electronic presentation of a complicated work such as Finnegans Wake and its manuscripts can facilitate an international cooperative effort at a critical edition. The present approach covers the drafts, typescripts and proofs of chapters vii and viii of part I of the work, done in Asymetrix's ToolBook OpenScript for MS-Windows/MCI. The implementation allows synoptic browsing of the pages, using a `handwriting' font for manuscripts, a typewriter font for drafts and Times for proofs, with access to graphic images of the manuscript pages themselves.

Catherine Scott (University of North London) Hypertext as a Route into Computer Literacy

The large numbers of Humanities students required to undertake courses in computer literacy often do so on a campus-wide basis, regardless of their discipline. The paper proposes the use of training in hypertext systems as a vehicle for instructing them in file creation, formatting, making links and incorporating graphics, so that they learn to use the computer's screen as a vehicle for presenting arguments, displaying interrelated information and providing readers with choices. The skills they learn are transferable to other applications, and the students who have gone through the UNL course have received it very enthusiastically.

Track 2, 9:00: Parsing and Morphological Analysis. Chair: Paul Fortier (University of Manitoba)

Hsin-Hsi Chen, Ting-Chuan Chung (National Taiwan University) Proper Treatments of Ellipsis Problems in an English-Chinese Machine Translation System

Conjunctions, comparatives and other complex sentences usually omit some constituents. These elliptical materials interfere with the parsing and the transfer in machine translation systems. This paper formulates Ellipsis Rules based on X-scheme. The differences between English and Chinese constructions are properly treated by a set of transfer rules.

Ellipsis is the omission of an element whose implied presence may be inferred from other components of the sentence, as in I like football and Kevin @ tennis (where the `@' stands for an omitted `likes'). The approach to parsing ellipsis is to divide the grammar rule base into normal (N) rules and ellipsis (E) rules. Recognition of one phrase (I like football) can then trigger a differential analysis of the remainder of the sentence, the relationship being governed by the E-rules.

Chen identifies four elliptical constructions in English.

Because the specific features of elliptical construction in English are described by the uniform E-rules, the grammar rules for other phrases need not be changed. A set of lexical and structural transfer rules has been constructed to capture the differences between English and Chinese elliptical constructions, implemented on an English-Chinese machine translation system using Quintus Prolog and C.
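
A toy sketch of the N-rule/E-rule division; the rule content and the handling of the `@' gap notation are invented for illustration and are not the authors' implementation. Recognising the full first conjunct supplies the verb that the E-rule copies into the elliptical one:

    def resolve_gapping(sentence: str) -> str:
        """Toy ellipsis resolver for patterns like
        'I like football and Kevin @ tennis', where '@' marks the gap.
        An 'N-rule' parse of the first conjunct yields its verb, which the
        'E-rule' then copies into the second conjunct."""
        first, _, second = sentence.partition(" and ")
        words = first.split()
        if len(words) < 3 or "@" not in second:
            return sentence              # no ellipsis to resolve
        verb = words[1]                  # crude: subject verb object in conjunct 1
        return f"{first} and {second.replace('@', verb)}"

    print(resolve_gapping("I like football and Kevin @ tennis"))
    # -> "I like football and Kevin like tennis" (agreement is left to later transfer rules)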

Jorge Hankamer (University of California, Santa Cruz) keCitexts: Text-based Analysis of Morphology and Syntax in an Agglutinating Language

Text corpora are used for many purposes in the study of language and literature: frequency tables derived from corpora have become indispensable in experimental psycholinguistics. The keCi analyser has been developed to automatically lemmatise a library of texts in Turkish, a language with an agglutinating morphology.

The system matches the root at the left edge of the input string and follows a morphotactic network to uncover the remaining morphological structure of the string. The developing corpus has been converted from disparate formats into a common scheme in ASCII and so far 5,000 lines (sentences, about 70Kb) have been `cleaned' by keCi.

Juha Heikkilä, Atro Voutilainen (University of Helsinki) ENGCG: An Efficient and Accurate Parser for English Texts

The ENGCG parser constitutes a reliable linguistic interface for a wide variety of potential applications in the humanities and related fields, ranging from parsing proper via corpus annotation to information retrieval.

It performs morphological analysis with part-of-speech disambiguation and assigns dependency-oriented surface-syntactic functions to input wordforms, using the Constraint Grammar techniques developed by Karlsson (1990), in which constraints express the structures that exclude inappropriate alternatives in parsing.

The system exists in a LISP development version and a C++ production version running on a Sun Sparcstation 2.
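
A minimal sketch of the Constraint Grammar style of disambiguation described above; the readings and the single constraint are invented examples, not ENGCG's actual rules. Each wordform starts with all the readings its morphology allows, and constraints discard readings that are inappropriate in context:

    # Each wordform starts with every reading its morphology allows.
    readings = {
        "the":   [{"pos": "DET"}],
        "round": [{"pos": "N"}, {"pos": "V"}, {"pos": "ADJ"}, {"pos": "PREP"}],
        "table": [{"pos": "N"}, {"pos": "V"}],
    }

    def apply_constraints(sentence):
        """Remove readings that the (toy) constraints rule out in context."""
        analysis = [list(readings[w]) for w in sentence]
        for i, word_readings in enumerate(analysis):
            prev = analysis[i - 1] if i > 0 else []
            # Toy constraint: immediately after an unambiguous determiner,
            # discard verb and preposition readings (but keep at least one reading).
            if len(prev) == 1 and prev[0]["pos"] == "DET":
                kept = [r for r in word_readings if r["pos"] not in ("V", "PREP")]
                if kept:
                    analysis[i] = kept
        return analysis

    print(apply_constraints(["the", "round", "table"]))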

Track 3, 9:00: Documenting Electronic Texts (Panel)

Annelies Hoogcarspel (Center for Electronic Texts in the Humanities), Chair. TEI Header, Text Documentation, and Bibliographic Control of Electronic Texts
While it is estimated that there are thousands of electronic texts all over the world, there is generally no standardized bibliographic control. If electronic texts were cataloged according to accepted standards, duplication could be avoided, and their use encouraged more effectively.
The Rutgers Inventory of Machine-Readable Texts in the Humanities is now maintained by CETH, the Center for Electronic Texts in the Humanities, and uses the standard AACR2 [Anglo-American Cataloging Rules (2nd Ed, 1988)]. The RLINMARC program is used to hold the data in MDF (computer files) format.
The file header described by the proposals (P2) of the Text Encoding Initiative (TEI) incorporates all the information needed to follow the rules of AACR2 as well as other information now often lacking. In particular, proper cataloging can indicate the degree of availability of a text, so where there is uncertainty about copyright questions, an entry could still indicate to the serious scholar whether a copy of the text is available or not.
Richard Giordano (Manchester University)
Lou Burnard (Oxford University)

Track 1, 11:00: Statistical Analysis of Texts. Chair: Joel Goldfield (Plymouth State College)

Thomas B Horton (Florida Atlantic University) Finding Verbal Correspondences Between Texts

In the early 1960s, Joseph Raben and his colleagues developed a program that compared two texts and found pairs of sentences in which one text contained verbal echoes of the other. Despite the flexibility of modern concordance systems, Raben's program appears not to have survived in modern form. This study examines the problem, Raben's solution and possible new approaches.

Using the accepted premise that Shelley was heavily influenced by Milton, Raben developed an algorithm to analyse canonically-converted sentences from the work of both authors. Although the algorithm has not been retested in 30 years, it has now been reimplemented using modern tools, and its effectiveness examined. Work is also proceeding on comparing this approach to a passage-by-passage approach using the ``word cluster'' technique.

David Holmes (The University of the West of England), Michael L. Hilton (University of South Carolina) Cumulative Sum Charts for Authorship Attribution: An Appraisal

Cumulative sum (CuSum) charts are primarily used in industrial quality control, but have found application in authorship attribution studies, and one particular technique (QSUM, Morton & Michaelson) has been the centre of forensic controversy in the UK in some allegedly forged-confession cases.

The QSUM test uses the assumption that people have unique sets of communicative habits, and implements CuSum charts to present graphically the serial data in which the mean value of the plotted variable is subject to small but important changes. But as there is as yet no statistically valid way of comparing two CuSum charts, any decision regarding their significance will necessarily be either subjective or arbitrary.

Experiments with weighted CuSums indicate that they perform marginally better than the QSUM test, but are not consistently reliable: it may be that authors do not follow habit as rigidly as would be needed for CuSum techniques to determine authorship correctly.
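
In outline, a CuSum chart plots the running sum of deviations of some habit variable (say, sentence length) from its mean. The sketch below computes such a chart as a generic illustration only; it is not Morton and Michaelson's QSUM procedure:

    def cusum_chart(values):
        """Cumulative sum of deviations from the mean:
        S_i = sum over j <= i of (x_j - mean).  A genuinely homogeneous text
        should wander around zero; systematic drift suggests a change."""
        mean = sum(values) / len(values)
        chart, running = [], 0.0
        for x in values:
            running += x - mean
            chart.append(running)
        return chart

    # Sentence lengths (in words) for a sample of text from a disputed document.
    sentence_lengths = [12, 9, 15, 11, 30, 28, 27, 25, 10, 13]
    print([round(s, 1) for s in cusum_chart(sentence_lengths)])
    # Deciding whether this chart matches one for, say, short-word usage is
    # where the subjectivity criticised in the paper comes in.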

Lisa Lena Opas (University of Joensuu) Analysing Stylistic Features in Translation: A Computer-Aided Approach

Computational linguistics can facilitate the examination of how successful a translation is in replicating important stylistic features of the original text. An example is the Finnish translation of Samuel Beckett's How It Is, which describes the writing process itself and the effort put into it.

A feature of this novel is its use of repetition, and research is cited on the effect of ``shifts'' in style occasioned by the non-coincidence of stylistic conventions such as repetition which differ between two languages. TACT was used to analyse consistency in the use of specific words and phrases which were used in the translation, and it was noted that shifts have indeed occurred.

Track 2, 11:00: Phonetic Analysis. Chair: Joe Rudman (Carnegie Mellon University)

Wen-Chiu Tu (University of Illinois) Sound Correspondences in Dialect Subgrouping

In the classification of the sound properties of cognate words, there is a lexicon-based, equal-weighted technique for quantification which can be used without constructing phonological rules. In a study of Rukai (a Formosan language), a modification of Cheng's quantitative methods was used to store 867 words and variants. These were subjected to statistical refinement and then successfully subgrouped by a process using a data matrix of difference and sameness, a measure of the degree of similarity, and cluster analysis.
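
A minimal sketch of the general procedure described (a difference/sameness matrix, a similarity measure, then cluster analysis); the sound data and the single-linkage choice are illustrative assumptions, not Tu's actual method:

    def similarity(a, b):
        """Proportion of cognate items on which two dialects agree."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def single_linkage(dialects, threshold=0.6):
        """Greedy single-linkage clustering over the similarity matrix."""
        clusters = [[name] for name in dialects]
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if any(similarity(dialects[a], dialects[b]) >= threshold
                           for a in clusters[i] for b in clusters[j]):
                        clusters[i] += clusters[j]
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
        return clusters

    # Toy data: the sound variant each dialect uses for a handful of cognate words.
    dialects = {
        "Budai":  ["b", "k", "ts", "l", "r"],
        "Labuan": ["b", "k", "ts", "l", "l"],
        "Maga":   ["v", "g", "s",  "r", "r"],
        "Tona":   ["v", "g", "s",  "r", "l"],
    }
    print(single_linkage(dialects))
    # -> [['Budai', 'Labuan'], ['Maga', 'Tona']] with the toy data above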

Ellen Johnson, William A Kretzschmar, Jr (University of Georgia) Using Linguistic Atlas Databases for Phonetic Analysis

The Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) [USA] has been used to analyse the well-known phonological feature of the loss of post-vocalic /r/. Using a specially-designed screen and laser-printer font in the upper half of the PC character set, database searches can be carried out on complex phonetic strings.

Two methods were examined: assigning each pronunciation a score, and treating each pronunciation as binary (ie with or without retroflexion). The techniques used are discussed, and the system for encoding phonetic symbols and diacritical marks is described.

Track 3, 11:00: Preserving the Human Electronic Record: Responsibilities, Problems, Solutions (Panel)

Peter Graham (Rutgers University), Chair
Whoever takes on the responsibility of preserving the electronic human record will find two problems: preservation of the media and preservation of the integrity of the data stored on them. Shifts of culture and training are needed among librarians and archivists to enable the continuation of their past print responsibilities into the electronic media.
Gordon B. Neavill (University of Alabama)
There are parallels between the oral tradition and the manuscript period, and the electronic environment. The malleability of electronic text contrasts sharply with the fixity of print, and the problem is re-emerging of how information should survive through time, and how we authenticate the intellectual content.
W Scott Stornetta (Bellcore)
In addition to authentication there is the problem of fixing a document at a point in time. A technique has been developed at Bellcore for timestamping a document digitally which satisfies both requirements, and work is under way to see if this is an appropriate tool for preserving the integrity of scholarly information.
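
A much-simplified sketch of the general idea of digital timestamping by hash-linking; this illustrates the principle only and is not Bellcore's actual protocol. Each certificate binds a document's hash to a time and to the hash of the previous certificate, so a record cannot later be backdated or altered without breaking the chain:

    import hashlib, json, time

    def _hash(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def timestamp(document: bytes, chain: list) -> dict:
        """Append a certificate binding the document's hash to the current
        time and to the hash of the previous certificate in the chain."""
        prev = _hash(json.dumps(chain[-1], sort_keys=True).encode()) if chain else ""
        cert = {"doc_hash": _hash(document), "time": time.time(), "prev": prev}
        chain.append(cert)
        return cert

    def verify(document: bytes, cert: dict) -> bool:
        """Check that the certificate really covers this document."""
        return cert["doc_hash"] == _hash(document)

    chain = []
    cert = timestamp(b"draft of chapter one", chain)
    print(verify(b"draft of chapter one", cert))   # True
    print(verify(b"silently revised text", cert))  # False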

Track 1, 2:00: The Wittgenstein Archives at the University of Bergen (Panel)

Claus Huitfeldt (University of Bergen), Chair

Claus Huitfeldt, Ole Letnes (University of Bergen) Encoding Wittgenstein
A specialized encoding system is being developed to facilitate preparation, presentation and analysis of the texts being collected for the computerized version of Ludwig Wittgenstein's writings. The objective is to make both a strictly diplomatic, and a normalized and simplified reading-version of every manuscript in his 20,000-page unpublished Nachlass, with its constant revisions, rearrangements and overlap, using a modified version of MECS (Multi-Element Code System). The target is for scanned raster images, MECS transcriptions, a TEI version, the diplomatic and normalised/simplified versions in wordprocessor formats, a free-text retrieval system, and a filter/browser/converter.
Claus Huitfeldt (University of Bergen) Manuscript Encoding: Alphatexts and Betatexts
Within the markup scheme being used are some important distinctions to handle the complexity of the material:
  • Alpha-exclusion codes to mark elements inside words which need to be disregarded in graphword selection;
  • Beta-exclusion codes to mark words which cannot be integrated in a coherent reading;
  • Language codes to mark the different languages used;
  • Substitution codes to mark mutually exclusive readings;
  • Reiterative codes to mark substitution codes pertaining to a literal transcription of a portion of the manuscript.
Alois Pichler (University of Bergen) What Is Transcription, Really?
The encoding of a text involves a number of different activities transferring the multidimensional aspects of a handwritten text into the unidimensional medium of a computer file. The author identifies nine specific kinds of coding and examines their use, and presents an alternative model to the hierarchic one used in MECS-WIT. The requirement of ``well-formedness'' should be regarded as only one rule among the other, equally valid, rules.

Track 2, 2:00: Data Collection and Collections. Chair: Antonio Zampolli (Istituto di Linguistica Computazionale)

Shoichiro Hara, Hisashi Yasunaga (National Institute of Japanese Literature) On the Full-Text Database of Japanese Classical Literature

The spread of computers which can handle Japanese language processing has led the National Institute of Japanese Literature (NIJL) to begin its recension full-text database, The Outline of Japanese Classical Literature (100 vols, 600 works), comprising databases containing the Texts, Bibliographies, Utilities (revision notes) and Notes (headnotes, footnotes, sidenotes, etc).

Another project is under way to construct a full-text database of current papers in the natural sciences, using DQL and SGML. A visual approach has been chosen to overcome the inherent complexity of DQL's SQL parentage: an elemental query is written in a box attached to a node or leaf, and complex queries can be constructed by gathering these with the mouse.

The possibility of applying TEI standards to Japanese classical literature is being studied.

Ian Lancashire (University of Toronto) A Textbase of Early Modern English Dictionaries, 1499-1659

The scale of the task of adequately documenting Early Modern English may explain why a dictionary project was not funded several decades ago, when research showed that such a project could turn out larger than the OED. However, a text database of Renaissance bilingual and English-only dictionaries would be feasible as a way of making available information that would appear in an EMED.

Using such an electronic knowledge base, a virtual Renaissance English dictionary could be constructed, using SGML tagging and inverting the structure of the bilingual texts: early results (ICAME, Nijmegen, 1992) indicate that there are some phrasal forms and new senses not found in the OED.

Dionysis Goutsos, Ourania Hatzidaki, Philip King (University of Birmingham) Towards a Corpus of Spoken Modern Greek

The analysis of the perennial and much-disputed problem of diglossia in contemporary Greek, and the problems of teaching Modern Greek both as a first and as a foreign language, would be much facilitated by a corpus database of modern spoken Greek. Such a project has been proposed since 1986, and many of the technical problems identified then have since found at least partial solutions.

There have been many fragmented projects around the world on modern written Greek, and a survey is now under way to determine the nature and size of the extant corpus.

Track 3, 2:00: Networked Electronic Resources: New Opportunities for Humanities Scholars (Panel)

Christine Mullings (University of Bath), Chair. HUMBUL: A Successful Experiment
The Humanities Bulletin Board (HUMBUL) was established experimentally in 1986 to meet the growing need for up-to-date information about the use of computer-based techniques in the Arts and Humanities in the UK and elsewhere. There are currently over 4,000 users, and the number is growing at around 60 per month. A survey of usage revealed that primary use was made of the Diary and Conferences section, Situations Vacant, and the list of other bulletin boards on JANET (the UK academic network). A recent addition is HUMGRAD, a mailing-list service for postgraduate students, who often feel isolated from other areas of their work, and the possibility of adding facilities like item expiry is being considered.
Richard Gartner (Bodleian Library) Moves Towards the Electronic Bodleian: Introducing Digital Imaging into the Bodleian Library, Oxford
Digital imaging presents new opportunities for conserving material and disseminating it to readers more efficiently than hitherto possible. The Bodleian Library, which has an ongoing