WARNING: Release 3.0 has minor binary incompatibilities with previous
releases, mainly due to the move from the interface
it.unimi.dsi.util.LongBigList
to the now standard
it.unimi.dsi.fasutil.longs.LongBigList
.
It is part of a parallel release
of fastutil, the DSI Utilities,
Sux4J, MG4J, WebGraph, etc. that were
all modified to fit the new interface, and that prepare the way for our
"big" versions, that is, supporting >231 entries in arrays (simulated),
elements in lists, terms, documents, nodes, etc. Please read our (short)
"Moving Java to Big Data" document for details.
Introduction
Sux is an umbrella nickname for the results of my fiddling with the implementation of basic succinct data strucures.
The resulting code is rather sparse. The main highlights are:
- a novel, broadword-based implementation of rank/select queries for up to 264 bits that is highly competitive with known 32-bit implementations on 64-bit architectures (additional space required is 25% for ranking and 12.5%-37.5% for selection);
- several Java structures using the Elias–Fano representation of monotone sequences for storing pointers, variable-length bit arrays, etc.
- Java code implementing minimal perfect hashing using around 2.68 bits per element (also using some broadword ideas);
- a few Java implementations of monotone minimal perfect hashing.
Sux is free software distributed under the GNU Lesser General Public License.
Why C++?
C++ sucks. No, that's not the reason. The problem is that if you want to compare experimentally your rank/select implementations you need a language that puts you in control—every single computational aspect, including cache misses, perturbs the result.
The C++ code in Sux just uses C++ for namespace handling. No object-oriented or otherwise bizarre feature of the language is used.
Why Java?
Writing in Java code that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take into consideration the following points:
- Improvements in JVMs makes low-level code written in Java faster and faster; often, the performance penalty w.r.t. an equivalent C/C++ application is relatively small.
- Succinct techniques can be mixed in several different ways, and an object-oriented language makes it very easy to play with different implementations of the same interface.
- An object-oriented language make it possible to write interfaces such as BitVector, which provide a wide range of accesses and methods that can be made efficient in suitable classes (e.g., LongArrayBitVector).
- When you're convinced your data structure works, you can easily rewrite it in C/C++.
Installation
You can grab Sux4J from Maven Central. Otherwise, you have to install the .jar file coming with the distribution and the dependencies, which are gathered for your convenience in a tarball.