Big news
Release 5.0 has several source and binary incompatibilities, and introduces quasi-succinct indices. Benchmarks on the performance of quasi-succinct indices can be found here; for instance, this table shows the number of seconds to answer 1000 multi-term queries on a document collection of 130 million web pages:
MG4J | MG4J* | Lucene | |
---|---|---|---|
Terms | 70.9 | 132.1 | 130.6 |
And | 27.5 | 36.7 | 108.8 |
Phrase | 78.2 | — | 127.2 |
Proximity | 106.5 | — | 347.6 |
Both engines were set to just enumerate the results without scoring. The column labelled MG4J* gives the timings of an artificially modified version in which counts for each retrieved document have been read (MG4J now stores document pointers and counts in separate files, but Lucene interleaves them, so it has to read counts compulsorily). Proximity queries are conjunctive queries that must be satisfied within a window of 16 words. The row labelled “Terms” gives the timings for enumerating the posting lists of all terms appearing in the queries.
For what matters index size, the following table compares the new quasi-succinct indices, the previous high-performance indices using default γ/δ compression and Lucene indices on a number of different collections:
MG4J (new) | MG4J (γ/δ) | Lucene |
---|---|---|
TREC GOV2 (text, 25 M documents) | ||
36.9 GB | 40.3 GB | 42.1 GB |
TREC GOV2 (title, 25 M documents) | ||
264 MB | 308 MB | 396 MB |
Web .uk (text, 130 M documents) | ||
108 GB | 117 GB | 126 GB |
Web .uk (title, 130 M documents) | ||
1.38 GB | 1.59 GB | 2.15 GB |
Mímir token index (1 M documents) | ||
0.96 GB | 1.01 GB | 1.34 GB |
Tweets (13 M documents) | ||
302 MB | 341 MB | 423 MB |
Call for collaboration
The new quasi-succinct indices are very fast, but suggestions on low-level Java optimizations are welcome. In particular, the C++ version would benefit from a review by people acquainted with optimization for superscalar processors.
Introduction
MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. MG4J is a highly customisable, high-performance, full-fledged search engine providing state-of-the-art features (such as BM25/BM25F scoring) and new research algorithms.
The main points of MG4J are:
- Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
- Efficiency.MG4J can index without effort the TREC GOV2 collection (document factories are provided to this purpose) and scales to hundreds of millions of documents. Its new quasi-succinct indices provide unprecedented performance.
- Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
- Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
- Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
- Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It's up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
- Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine accessing directly your data. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
- Distributed processing. Indices can be built for a collection split in several parts, and combined later. Combination of indices allows non-contiguous indices and even the same document can be split across different collections (e.g., when indexing anchor text).
- Multithreading. Indices can be queried and scored concurrently.
- Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.
The starting point for understanding MG4J is a look at the tutorial, which explains how to index a sample collection and query the newly constructed index from the command line or using a browser. Then, the Javadoc class documentation can provide more insights.
MG4J is free software distributed under the GNU Lesser General Public License. If you find MG4J useful, we kindly ask you to quote the following reference:
@INPROCEEDINGS{BoVTREC2005, title = "{M}{G}4{J} at {T}{R}{E}{C} 2005", author="Paolo Boldi and Sebastiano Vigna", year = 2005, booktitle = "The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings", editor = "Ellen M. Voorhees and Lori P. Buckland", publisher = "NIST", series = "Special Publications", number = "SP 500-266", note = "\texttt{\small mg4j.di.unimi.it/}", }
Installation
You can grab MG4J from Maven Central. Otherwise, you just have to install the .jar file coming with the distribution and the dependencies, which are gathered for your convenience in a tarball.
Citations
Here you can find (in no particular order) research papers that have been written using MG4J. The list is not exhaustive, and we will be happy to include works that are missing.
- Kerstin Denecke, Peter Dolog, and Pavel Smrz. Making use of social media data in public health. In Proceedings of the 21st international conference companion on World Wide Web, pages 243−246, New York, NY, USA, 2012. ACM.
- Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, and Rohit Saraf. Compressed data structures for annotated web search. In Alain Mille, Fabien L. Gandon, Jacques Misselis, Michael Rabinovich, and Steffen Staab, editors, Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pages 121−130. ACM, 2012.
- Pavel Smrz and Lubomir Otrusina. Finding indicators of epidemiological events by analysing messages from twitter and other social networks. In Proceedings of the second international workshop on Web science and information exchange in the medical web, pages 7−10. ACM, 2011.
- Soumen Chakrabarti, Devshree Sane, and Ganesh Ramakrishnan. Web-scale entity-relation search architecture. In Proceedings of the 20th International Conference Companion on World-Wide Web, pages 21−22, New York, NY, USA, 2011. ACM.
- Hamish Cunningham, Valentin Tablan, Ian Roberts, Mark A. Greenwood, and Niraj Aswani. Information extraction and semantic annotation for multi-paradigm information management. In Mihai Lupu, Katja Mayer, John Tait, Anthony J. Trippe, and W. Bruce Croft, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series, pages 307−327. Springer Berlin Heidelberg, 2011.
- Frank Hopfgartner and Joemon Jose. Semantic user profiling techniques for personalised multimedia recommendation. Multimedia Systems, 16:255−274, 2010.
- Jeffrey Pound, Peter Mika, and Hugo Zaragoza. Ad-hoc object retrieval in the web of data. In Proceedings of the 19th international conference on World wide web, pages 771−780. ACM, 2010.
- Erik Graf, Ingo Frommholz, Mounia Lalmas, and Keith van Rijsbergen. Knowledge modeling in prior art search. In Advances in Multidisciplinary Retrieval, volume 6107 of Lecture Notes in Computer Science, pages 31−46. Springer, 2010.
- Eneko Agirre, Olatz Ansa, Xabier Arregi, Maddalen Lopez de Lacalle, Arantxa Otegi, Xabier Saralegi, and Hugo Zaragoza. Using semantic relatedness and word sense disambiguation for (CL)IR. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009. Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 166−173. Springer, 2010.
- Xabier Arregi Maddalen Lopez de Lacalle Arantxa Otegi Xabier Saralegi Hugo Zaragoza Eneko Agirre, Olatz Ansa. Elhuyar-IXA: Semantic relatedness and cross-lingual passage retrieval. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009. Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 273−280. Springer, 2010.
- Nuno Cardoso, Patrícia Sousa, and Mário J. Silva. Experiments with geographic evidence extracted from documents. In Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, volume 5706 of Lecture Notes in Computer Science, pages 885−893. Springer, 2009.
- Fabien Campagne. Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols. BMC bioinformatics, 9(1):132, 2008.
- Nuno Cardoso, Patrícia Sousa, and Mário J. Silva. The University of Lisbon at GeoCLEF 2008. In Working Notes for CLEF 2008, pages 17−19, 2008.
- Diana Santos, Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and Yvonne Skalban. GikiP at GeoCLEF 2008: Joining GIR and QA forces for querying Wikipedia. In Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706 of Lecture Notes in Computer Science, pages 894−905. Springer, 2009.
- Nuno Cardoso, David Cruz, Marcirio Chaves, and Mário Silva. Using geographic signatures as query and document scopes in geographic IR. In Advances in Multilingual and Multimodal Information Retrieval, volume 5152 of Lecture Notes in Computer Science, pages 802−810. Springer, 2008.
- Nuno Cardoso, David Cruz, Marcirio Chaves, and Mário J. Silva. The University of Lisbon at GeoCLEF 2007. In Working Notes for CLEF 2007, 2007.
- Kevin C. Dorff, Matthew J. Wood, and Fabien Campagne. Twease at TREC 2006: Breaking and fixing BM25 scoring with query expansion, a biologically inspired double mutant recovery experiment. In Proceedings of the Text Retrieval Conference, 2006.
- Lei Shi and Fabien Campagne. Building a protein name dictionary from full text: a machine learning term extraction approach. BMC bioinformatics, 6(1):88, 2005.
- Peter Mika. Distributed indexing for semantic search. In Proceedings of the 3rd International Semantic Search Workshop, pages 1−4. ACM, 2010.
- Nuno Cardoso, Mário J. Silva, and Diana Santos. Handling implicit geographic evidence for geographic IR. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pages 1383−1384. ACM, 2008.
- Roi Blanco and Alvaro Barreiro. Document identifier reassignment through dimensionality reduction. In Proc. of the 27th European Conference on Information Retrieval Research ECIR2005, number 3408 in Lecture Notes in Computer Science, pages 375−387, 2005.
- Roi Blanco and Alvaro Barreiro. Tsp and cluster-based solutions to the reassignment of document identifiers. Information Retrieval, 9(4):499−517, 2006.
- Roi Blanco and Alvaro Barreiro. A software architecture for effective document identifier reassignment. In Roberto Moreno Díaz, Franz Pichler, and Alexis Quesada Arencibia, editors, Computer Aided Systems Theory - EUROCAST 2005, 10th International Conference on Computer Aided Systems Theory, Las Palmas de Gran Canaria, Spain, February 7-11, 2005, Revised Selected Papers, volume 3643 of Lecture Notes in Computer Science, pages 254−262. Springer, 2005.
- Minsuk Lee, Weiqing Wang, and Hong Yu. Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinformatics, 7(140), 2006.