Course: Applications of Text Processing

Dr. Michael Oakes

Overview

Natural Language Processing. While conventional information retrieval is restricted to the "bag of words" model, natural language is much richer than that. We will introduce the concept of linguistic levels above lexis (individual words), such as parts of speech, syntax and parsers, semantics, and pragmatics. In the light of this, we will discuss why the bag of words model still works as well as it does, and the ways in which techniques from information retrieval can be used in linguistic processing.
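To illustrate what the bag of words model discards, here is a minimal Python sketch. The function name bag_of_words and the example sentences are illustrative assumptions, not taken from the tutorial materials. Two sentences with opposite meanings receive the same representation, because word order is ignored.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case.

    Tokenisation here is a plain whitespace split (an assumption made
    to keep the sketch short; real systems tokenise more carefully).
    """
    return Counter(text.lower().split())

first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# Both sentences contain exactly the same words with the same counts,
# so the model cannot tell who bit whom.
print(first == second)
```

Telling the two sentences apart requires information from the higher linguistic levels, such as syntax.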
Clustering and Classification. Clustering comprises a family of algorithms which automatically assign entities (such as documents) to categories, which may be newly discovered in the process. Classification is the process of assigning entities to their correct, predefined categories. We will both look at the theory of different clustering algorithms and get some hands-on experience with the statistical programming language. Texts can be clustered by topic, genre or writing style.
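As a rough sketch of how documents might be clustered by topic, the Python code below groups texts by single-link agglomerative clustering over cosine similarity of word counts. This is one simple member of the clustering family, chosen for illustration. It is not necessarily an algorithm covered in the tutorial, and Python is used here only for the example. The function names, the threshold of 0.3 and the sample documents are all assumptions.

```python
import math
from collections import Counter

def vectorise(text):
    """Represent a text as a bag of words (word -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def single_link_clusters(texts, threshold=0.3):
    """Repeatedly merge the two most similar clusters.

    Cluster similarity is that of the closest pair of members
    (single link). Merging stops when no pair of clusters is more
    similar than the (hypothetical) threshold.
    """
    vectors = [vectorise(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    while True:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        if pair is None:
            return clusters
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

docs = [
    "stock markets fell sharply",
    "markets and stock prices fell",
    "the team won the match",
    "the match was won by the home team",
]
print(single_link_clusters(docs))  # [[0, 1], [2, 3]]
```

The categories (here, a finance group and a sport group) are not given in advance but emerge from the similarities between the documents, which is what distinguishes clustering from classification.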

Objectives

The aim is to cover two important applications of text processing other than core Information Retrieval, so that participants can select which language processing techniques might be useful for their own work in the area.

Structure

The half-day tutorial will be structured as follows:

Levels of language and ambiguity in language (lecture)
Stemming rules and part of speech taggers (pen and paper practical)
Semantics and discourse level phenomena (lecture)
Break
Text classification: feature selection and learning methods (lecture)
Clustering (pen and paper practical)
Evaluation of text classifiers

Instructor

My Ph.D. was in Information Retrieval (search engine technology). My previous research assistant posts were in automatic sentence alignment of English, French and Spanish telecommunications texts, automatic summarisation of journal articles about agriculture, and automatic classification of news feeds about the pharmaceuticals industry. While at Sunderland I have supervised seven Ph.D. students who have now completed theses in Information Retrieval, most recently Naveed Anwar, who worked on the data mining of audiology patient records, and Nandita Tripathi, who worked on the automatic classification of news articles and web services.
My own research has been in corpus linguistics, e.g. discovering differences between the types of English used throughout the world. I recently wrote an article on disputed authorship, plagiarism software and spam filters for the Oxford Handbook of Computational Linguistics, and edited a book "Quantitative Methods in Translation Studies" with Meng Ji at the University of Tokyo. I recently completed the EU-funded VITALAS project on a multi-media search engine. I am a committee member for the Information Retrieval Specialist Group of the British Computer Society, and a reviewer for the European Conference on Information Retrieval (ECIR).
This year and last year I have taught courses at the University on search engine technology, forensic linguistics, medical statistics and decision support systems. I have also given lectures on medical statistics externally to Information Analysts at the Teesside NHS Trust in Middlesbrough, and to trainee psychiatrists at Roseberry Park Hospital in Middlesbrough.

