
Large scale distributed syntactic, semantic and lexical language models

This research proposes to build large-scale distributed language models (LSDLMs) using a rigorous approach that simultaneously accounts for lexical information at the word level, syntactic structure at the sentence level, and semantic content at the document level, in order to substantially improve the performance of large-scale machine translation (MT) and automatic speech recognition (ASR) systems for both high- and low-density languages.

It has been a long-standing challenge in statistical language modeling to develop a unified framework that integrates various language model (LM) components into a more sophisticated model that is tractable, scalable, and performs well empirically. The proposed research will be conducted under the directed Markov random field (MRF) paradigm, sequentially embedding more advanced semantic topic components and hierarchical Pitman-Yor processes to form complex distributions over natural language. It starts with the simple composite LMs developed in the pilot project and iteratively adds more complicated components, eventually producing a family of composite LMs with increasing expressive power. By exploiting the particular structure of each composite LM, the seemingly complex statistical representations will be decomposed into simpler ones, so that the estimation and inference algorithms for the simpler composite LMs become internal building blocks for estimating the complex ones, finally solving the estimation problem for extremely complex, high-dimensional distributions. In the process, a long-standing open problem, smoothing the fractional counts that arise from latent variables in Kneser-Ney's sense in a principled manner, may be solved. We demonstrate how to integrate this family of complex LSDLMs into the one-pass decoders of state-of-the-art phrase-based and parsing-based MT systems, and into the lattice rescoring decoder of ASR systems.
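To make the Kneser-Ney smoothing mentioned above concrete, the following is a minimal sketch of an interpolated Kneser-Ney bigram estimator over ordinary integer counts (function and variable names are illustrative, not from the project). Note that this standard formulation is exactly what does not extend directly to the fractional counts produced by latent variables, which is the open problem the paragraph refers to.

```python
from collections import defaultdict

def train_kn_bigram(sentences, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from tokenized
    sentences and return a function prob(w, v) = P(w | v)."""
    bigrams = defaultdict(int)        # c(v, w)
    context_total = defaultdict(int)  # c(v)
    followers = defaultdict(set)      # distinct words seen after v
    histories = defaultdict(set)      # distinct contexts seen before w
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for v, w in zip(tokens, tokens[1:]):
            bigrams[(v, w)] += 1
            context_total[v] += 1
            followers[v].add(w)
            histories[w].add(v)
    bigram_types = len(bigrams)       # total number of distinct bigrams

    def prob(w, v):
        # Continuation probability: how many distinct contexts w follows,
        # normalized by the total number of distinct bigram types.
        cont = len(histories[w]) / bigram_types
        if context_total[v] == 0:
            return cont               # unseen context: back off entirely
        c = bigrams[(v, w)]
        # Mass freed by discounting is redistributed via lambda(v).
        lam = discount * len(followers[v]) / context_total[v]
        return max(c - discount, 0) / context_total[v] + lam * cont

    return prob
```

Because the discounted mass is redistributed through the continuation distribution, the probabilities over the vocabulary sum to one for any observed context; for example, `train_kn_bigram([["the","cat","sat"], ["the","dog","sat"]])` yields P(cat | the) = P(dog | the) = 0.25 with the default discount of 0.75.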

Developing complex LMs for MT and ASR systems trained on corpora of up to web-scale data is unprecedented work and an ideal fit for the NSF's strategic long-term vision of a Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21). The directed MRF paradigm for integrating various kinds of known LMs is potentially transformative: it provides a novel way to break the limitations of existing LM combination methods and to deploy algorithmically interesting methodologies that scale to web-scale data sets. It is therefore highly likely to make a significant impact on the performance of MT and ASR systems for both high- and low-density languages. If successful, the research results can be used in real MT and ASR systems such as Google Translate and Google voice search. The techniques developed in this project will not only lead to effective, robust, and intelligent language technology applications but may also be extended and applied to problems in computational biology and computer vision. The project will provide an excellent environment for interdisciplinary education in information technology, bridging language and speech processing, machine learning, and data-intensive science and engineering to benefit students at all levels. The proposed research will provide not only important research topics for graduate students but also a wide range of senior project topics for undergraduates, training students to address the computational challenges of the big-data era.

Keywords: lexical information; syntactic structure; semantic content; directed Markov random field; distributed algorithms and cloud computing; machine translation and speech recognition.

PI: Shaojun Wang

Funding: This research is funded by the National Science Foundation under awards IIS-0812483 and IIS-1218863 as well as Google research awards.

Publications:

  • M. Tan, W. Zhou, L. Zheng and S. Wang, ``A scalable distributed syntactic, semantic and lexical language model,'' Computational Linguistics, Vol. 38, No. 3, pp. 631-671, 2012. [pdf]
  • M. Tan, W. Zhou, L. Zheng and S. Wang, ``A large scale distributed syntactic, semantic and lexical language model for machine translation,'' The 49th Annual Meeting of the Association for Computational Linguistics and Human Language Technologies (ACL/HLT), pp. 201-210, 2011. [pdf]

Talks:

  • A scalable distributed syntactic, semantic and lexical language model, Presented at Google, Microsoft and CLSP at JHU, 2010-2011. [pdf]

© 2012 Kno.e.sis | 377 Joshi Research Center, 3640 Colonel Glenn Highway, Dayton, OH 45435 | (937) 775-5217
