MapReduce Algorithms in Information Retrieval


A Tutorial at Search Solutions 2012, by Dell Zhang.


Aims and Learning Objectives

Working with very big datasets (in terabytes/petabytes or even more) that are beyond the capacity of a single PC is often not a luxury, but a necessity, for the search industry today.

MapReduce is a programming model that makes it easier to write programs that are transparently distributed over a cluster of commodity servers. It was originally developed by Google and builds on well-known principles in parallel and distributed processing dating back several decades. It has since enjoyed widespread adoption via an open-source implementation in Java --- Hadoop --- which has become the de facto standard and the dominant platform for back-end cloud computing, with prominent users such as Yahoo, Facebook, Twitter, eBay, and Amazon.
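To make the programming model concrete, here is a minimal single-process sketch in Python of the classic word-count example. It is not Hadoop code and is not taken from the tutorial materials: the function names (map_word_count, reduce_word_count, run_mapreduce) are hypothetical, and the in-memory grouping step merely stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_word_count(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every whitespace-separated token in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: sum all partial counts that share the same key."""
    return word, sum(counts)


def run_mapreduce(documents: dict[str, str]) -> dict[str, int]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_word_count(doc_id, text):
            groups[key].append(value)  # "shuffle": collect values per key
    return dict(reduce_word_count(k, v) for k, v in groups.items())


if __name__ == "__main__":
    docs = {"d1": "to be or not to be", "d2": "to search is to find"}
    print(run_mapreduce(docs))
```

The programmer supplies only the map and reduce functions; in a real deployment the framework, not the loop in run_mapreduce, takes care of partitioning the input, moving intermediate pairs between machines, and recovering from failures.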

This tutorial aims to introduce MapReduce --- currently the most accessible and practical means for tackling "Web-scale" problems --- to IR practitioners who are not familiar with this technology and its great potential in IR.

The learning objectives are to

Description of Topics

Scope and Relevance

This tutorial will cover fundamental MapReduce concepts as well as concrete MapReduce algorithms, particularly those used in IR. The emphasis will be on scalability issues and the design trade-offs associated with large-scale text data processing using MapReduce.
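As an illustration of the kind of IR algorithm that fits this model, the following Python sketch builds a simple inverted index in the map/reduce style. It is an assumption about what such an algorithm might look like, not the tutorial's own material: the names map_postings, reduce_postings, and build_inverted_index are hypothetical, tokenisation is plain whitespace splitting, and the grouping runs in memory rather than on a cluster.

```python
from collections import defaultdict


def map_postings(doc_id, text):
    """Map: emit (term, doc_id) once per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce: turn all document ids seen for a term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_inverted_index(documents):
    """In-memory stand-in for the framework: map, group by term, reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, posting in map_postings(doc_id, text):
            groups[term].append(posting)
    return dict(reduce_postings(t, ids) for t, ids in groups.items())


if __name__ == "__main__":
    docs = {"d1": "big data search", "d2": "search engines"}
    print(build_inverted_index(docs))
```

One design trade-off is already visible here: sorting each posting list inside the reducer is simple, but it requires a term's whole list to fit in one reducer's memory, which may not hold for very frequent terms in a large collection.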

Format

Half Day (3 Hours)

Target Audience

This is an introductory tutorial targeted at IR practitioners who are interested in large-scale text data processing (in the cloud). The prerequisites are programming ability and some basic knowledge of IR (e.g., the inverted index); no background in parallel or distributed computing will be assumed.

Instructor Bio

Dr. Dell Zhang is a Senior Lecturer in Computer Science at Birkbeck, University of London, a Senior Member of ACM, a Senior Member of IEEE, and a Fellow of RSS. Before he moved to the UK, he was a Research Fellow at the Singapore-MIT Alliance. His research is on the theme of improving information retrieval and organisation through machine learning and data mining. He has a number of publications in these areas, and serves regularly as a reviewer, editorial board member, or programme committee member for relevant international journals and conferences. He has received a couple of best paper awards, and has co-organised workshops at conferences such as CIKM and RecSys. He teaches postgraduate courses on information retrieval and on cloud computing.


Stay tuned for more details.
