Recent Preprints

Paolo Boldi and Sebastiano Vigna. Four degrees of separation, really. Arxiv preprint arxiv:1205.5509, 2012.

Abstract

We recently measured the average distance of users in the Facebook graph, spurring comments in the scientific community as well as in the general press (“Four Degrees of Separation”). A number of interesting criticisms have been made about the meaningfulness, methods and consequences of the experiment we performed. In this paper we want to discuss some methodological aspects that we deem important to underline in the form of answers to the questions we have read in newspapers, magazines, blogs, or heard from colleagues. We indulge in some reflections on the actual meaning of “average distance” and make a number of side observations showing that, yes, 3.74 “degrees of separation” are really few.

PDF version.
Sebastiano Vigna. Spectral ranking, 2009.

Abstract

This note tries to attempt a sketch of the history of spectral ranking—a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than fifty years ago, almost exactly in the same terms, and has been studied in psychology and social sciences. I will try to describe it in precise and modern mathematical terms, highlighting along the way the contributions given by previous scholars.

Disclaimer

This is a work in progress with no claim of completeness. I have tried to collect evidence of spectral techniques in ranking from a number of sources, providing a unified mathematical framework that should make it possible to understand in a precise way the relationship between contributions. Reports of inaccuracies and missing references are more than welcome.

PDF version.
Sebastiano Vigna. Broadword implementation of parenthesis queries. Arxiv preprint arxiv:1301.5468, 2013.

Abstract

We continue the line of research started in “Broadword Implementation of Rank/Select Queries” proposing broadword (a.k.a. SWAR—“SIMD Within A Register”) algorithms for finding matching closed parentheses and the k-th far closed parenthesis. Our algorithms work in time O(log w) on a word of w bits, and contain no branch and no test instruction. On 64-bit (and wider) architectures, these algorithms make it possible to avoid costly tabulations, while providing a very significant speedup with respect to for-loop implementations.
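
As an illustration of the broadword (SWAR) style mentioned in the abstract, here is a minimal sketch of the classical branch-free population count on a 64-bit word. It is not the parenthesis-matching algorithm of the paper; the name `popcount64` and the explicit 64-bit mask (needed because Python integers are unbounded) are assumptions made for this example.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so we emulate a 64-bit register


def popcount64(x):
    """Count the ones in a 64-bit word without loops, branches or tests."""
    x &= MASK64
    # Each 2-bit field now holds the number of ones it contained.
    x = x - ((x >> 1) & 0x5555555555555555)
    # Sum adjacent 2-bit fields into 4-bit fields.
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    # Sum adjacent 4-bit fields into bytes.
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiplication accumulates all byte counts into the top byte.
    return ((x * 0x0101010101010101) & MASK64) >> 56
```

All fields are processed in parallel inside one register, which is the same idea that makes O(log w) branch-free algorithms possible on a word of w bits.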

PDF version; a complete C++ and Java™ implementation is available at the Sux project home page.

Journals

Paolo Boldi, Marco Rosa, and Sebastiano Vigna. Robustness of social and web graphs to node removal. Social Network Analysis and Mining, 2013.

Abstract

Given a social network, which of its nodes have a stronger impact in determining its structure? More precisely: which node-removal order has the greatest impact on the network structure? We approach this well-known problem for the first time in a setting that combines both web graphs and social networks. Our experiments are performed on datasets that are orders of magnitude larger than those appearing in the previous literature: this is possible thanks to some recently developed algorithms and software tools that approximate accurately the number of reachable pairs and the distribution of distances in large graphs. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results; at the same time, they reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.

Public datasets are available at the LAW site; Java™ implementations are available as free software at the WebGraph home page.
Paolo Boldi and Sebastiano Vigna. E = I + T: The internal extent formula for compacted tries. Inform. Process. Lett., 111:310−313, 2011.

Abstract

It is well known that in a binary tree the external path length minus the internal path length is exactly 2n − 2, where n is the number of external nodes. We show that a generalization of the formula holds for compacted tries, replacing the role of paths with the notion of extent, and the value 2n − 2 with the trie measure, an estimation of the number of bits that are necessary to describe the trie.
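
The classical identity quoted at the start of the abstract can be checked numerically. The sketch below covers only the binary-tree case (every internal node has exactly two children), not the generalization to compacted tries; the tree encoding and the function names are assumptions made for this example.

```python
import random


def random_full_binary_tree(n_leaves, rng):
    """Build a random tree: None is an external node, (left, right) an internal one."""
    if n_leaves == 1:
        return None
    k = rng.randint(1, n_leaves - 1)  # number of external nodes in the left subtree
    return (random_full_binary_tree(k, rng),
            random_full_binary_tree(n_leaves - k, rng))


def path_lengths(tree, depth=0):
    """Return (external path length, internal path length, number of external nodes)."""
    if tree is None:
        return depth, 0, 1
    left_e, left_i, left_n = path_lengths(tree[0], depth + 1)
    right_e, right_i, right_n = path_lengths(tree[1], depth + 1)
    return left_e + right_e, left_i + right_i + depth, left_n + right_n


if __name__ == "__main__":
    rng = random.Random(42)
    e, i, n = path_lengths(random_full_binary_tree(20, rng))
    print(e - i, 2 * n - 2)  # the two numbers should coincide
```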

PDF version.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Theory and practice of monotone minimal perfect hashing. ACM Journal of Experimental Algorithmics, 16(3):132−144, 2011.

Abstract

Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, order-preserving minimal perfect hash functions have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, lists of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in our paper “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”, and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practice, and provide a balance between access speed, ease of construction, and space usage.

PDF version. The algorithms described in the paper are implemented in Sux4J.
Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Viscous democracy for social networks. Commun. ACM, 54:129−137, June 2011.
Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Query reformulation mining: models, patterns, and applications. Information Retrieval, 14:257−289, 2011.

Abstract

Understanding query reformulation patterns is a key task towards next generation web search engines. If we can do that, then we can build systems able to understand and possibly predict user intent, providing the needed assistance at the right time, and thus helping users locate information more effectively and improving their web-search experience. As a step in this direction, we build a very accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We then apply the model to automatically label two very large query logs sampled from different geographic areas, and containing a total of approximately 17 million query reformulations. We study the resulting reformulation patterns, matching some results from previous studies performed on smaller manually annotated datasets, and discovering new interesting reformulation patterns, including connections between reformulation types and topical categories. We annotate two large query-flow graphs with reformulation type information, and run several graph-characterization experiments on these graphs, extracting new insights about the relationships between the different query reformulation types. Finally we study query recommendations based on short random walks on the query-flow graphs. Our experiments show that these methods can match in precision, and often improve, recommendations based on query-click graphs, without the need of users' clicks. Our experiments also show that it is important to consider transition-type labels on edges for having recommendations of good quality.

Online version.
Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Permuting web and social graphs. Internet Math., 6(3):257−283, 2010.

Abstract

Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database, for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework, and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.

PDF version; datasets and Java™ implementations are available as free software at the LAW site.
Paolo Boldi, Massimo Santini, and Sebastiano Vigna. PageRank: Functional dependencies. ACM Trans. Inf. Sys., 27(4):1−23, 2009.

Abstract

PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. In this paper, we give a mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and by proving that the k-th iteration of the Power Method gives exactly the value obtained by truncating the PageRank power series at degree k, we show how to obtain an approximation of the derivatives. Finally, we view PageRank as a linear operator acting on the preference vector and show a tight connection between iterated computation and derivation.
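
The relation between the Power Method and the truncated power series can be tried out on a toy graph. This is a minimal sketch under stated assumptions: P is a row-stochastic matrix without dangling nodes, the iteration starts from the preference vector v, and the series is written as v + Σ_{j≥1} α^j (vP^j − vP^{j−1}), which is one algebraically equivalent way of expanding (1−α) Σ_j α^j vP^j. The function names are made up for this example.

```python
def vec_mat(x, P):
    """Row vector times matrix."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def power_method(P, v, alpha, k):
    """k steps of x <- alpha * x P + (1 - alpha) * v, starting from x = v."""
    x = list(v)
    for _ in range(k):
        xp = vec_mat(x, P)
        x = [alpha * a + (1 - alpha) * b for a, b in zip(xp, v)]
    return x


def truncated_series(P, v, alpha, k):
    """v + sum over j = 1..k of alpha^j (v P^j - v P^(j-1))."""
    total = list(v)
    prev = list(v)  # holds v P^(j-1)
    for j in range(1, k + 1):
        cur = vec_mat(prev, P)
        total = [t + alpha ** j * (c - p) for t, c, p in zip(total, cur, prev)]
        prev = cur
    return total


if __name__ == "__main__":
    # Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
    P = [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    v = [1 / 3] * 3
    print(power_method(P, v, 0.85, 5))
    print(truncated_series(P, v, 0.85, 5))
```

Up to floating-point rounding, the two printed vectors should agree for every k.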

PDF version.
Paolo Boldi, Flavio Chierichetti, and Sebastiano Vigna. Pictures from Mongolia. Extracting the top elements from a partially ordered set. Theory Comput. Systems, 44(2):269−288, 2009.

Abstract

You are back from that very long, marvelous journey. You have a thousand pictures, but your friends and relatives will stand just a few dozens. Choosing is a painful process, in particular when you cannot decide between the silent vastity of that desert and the idyllic picture of that tranquil, majestic lake. We are going to help.

PDF version.
Paolo Boldi, Violetta Lonati, Massimo Santini, and Sebastiano Vigna. Graph fibrations, graph isomorphism, and PageRank. RAIRO Inform. Théor., 40:227−253, 2006.

Abstract

PageRank is a ranking method that assigns scores to web pages using the limit distribution of a random walk on the web graph. A fibration of graphs is a morphism that is a local isomorphism of in-neighbourhoods, much in the same way a covering projection is a local isomorphism of neighbourhoods. We show that a deep connection relates fibrations and Markov chains with restart, a particular kind of Markov chains that include the PageRank one as a special case. This fact provides constraints on the values that PageRank can assume. Using our results, we show that a recently defined class of graphs that admit a polynomial-time isomorphism algorithm based on the computation of PageRank is really a subclass of fibration-prime graphs, which possess simple, entirely discrete polynomial-time isomorphism algorithms based on classical techniques for graph isomorphism. We discuss efficiency issues in the implementation of such algorithms for the particular case of web graphs, in which O(n) space occupancy (where n is the number of nodes) may be acceptable, but O(m) is not (where m is the number of arcs).

PDF version.
Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Paradoxical effects in PageRank incremental computations. Internet Math., 2(3):387−404, 2005.

Abstract

Deciding which kind of visit accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early on in the crawl. Indeed, this visit order is much better than depth first, which is in turn even worse than a random visit; nevertheless, breadth-first can be superseded using an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.

This paper discusses a related, and previously overlooked, measure of effectivity for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall's τ.
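
For readers unfamiliar with Kendall's τ, the following is a minimal quadratic-time sketch of the τ-a variant, in which tied pairs count as neither concordant nor discordant. It is not the efficient implementation distributed at the LAW site, and the function name is an assumption made for this example.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Naive O(n^2) Kendall's tau-a between two score vectors of equal length."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two vectors of the same length, at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1  # the pair is ordered the same way in both vectors
        elif s < 0:
            discordant += 1  # the pair is ordered in opposite ways
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

In the setting of the paper, `a` would hold the ranks computed on the visited subgraph and `b` the ranks of the same nodes in the complete graph.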

We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.

PDF version; PostScript version; Java™ classes for computing Kendall's τ efficiently are available as free software at the LAW site.
Paolo Boldi and Sebastiano Vigna. Codes for the World-Wide Web. Internet Math., 2(4):405−427, 2005.

Abstract

We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.

PDF version (technical report); PostScript version (technical report); to download software and data sets, please look at the WebGraph home page.
Paolo Boldi and Sebastiano Vigna. Mutable strings in Java: Design, implementation and lightweight text-search algorithms. Sci. Comput. Programming, 54(1):3−23, 2005.

Abstract

The Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference-based and mutable, content-based). Analogously, available text-search methods on string classes are implemented either as trivial, brute-force double loops, or as very sophisticated and resource-consuming regular-expression search methods. Motivated by our experience in data-intensive text applications, we propose a new string class, MutableString, which tries to get the right balance between extremes in both cases. Mutable strings can be in one of two states, compact and loose, in which they behave more like String and StringBuffer, respectively. Moreover, they support a wide range of sophisticated text-search algorithms with a very low resource usage and setup time, using a new, very simple randomised data structure (a generalisation of Bloom filters) that stores an approximation from above of a lattice-valued function. Computing the function value requires a constant number of steps, and the error probability can be balanced with space usage. As a result, we obtain practical implementations of Boyer-Moore type algorithms that can be used with very large alphabets, such as Unicode collation elements. The techniques we develop are very general and amenable to a wide range of applications.
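
One plausible reading of "a generalisation of Bloom filters that stores an approximation from above" is sketched below for the simplest lattice, the non-negative integers ordered by ≤: storing takes the maximum (join) in each hashed bucket, querying takes the minimum (meet) over the same buckets. This is a sketch under those assumptions, not the data structure as implemented in MutableString; the class name, parameters, and hashing scheme are made up for the example.

```python
import hashlib


class UpperApproximator:
    """Stores a function key -> non-negative int; queries never underestimate."""

    def __init__(self, buckets=64, hashes=3):
        self.buckets = buckets
        self.hashes = hashes
        self.table = [0] * buckets  # 0 plays the role of the lattice bottom

    def _slots(self, key):
        # Derive `hashes` bucket indices from SHA-256 (an arbitrary choice).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.buckets

    def store(self, key, value):
        for s in self._slots(key):
            self.table[s] = max(self.table[s], value)  # join

    def query(self, key):
        return min(self.table[s] for s in self._slots(key))  # meet
```

A query costs a constant number of table accesses, and enlarging the table lowers the chance that collisions inflate the returned value, which mirrors the trade-off between error probability and space described in the abstract.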

PDF version; PostScript version; to download MutableString and TextPattern please look at the MG4J project home page.
Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711−726, 2004.

Abstract

We report our experience in implementing UbiCrawler, a scalable distributed web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more generally the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them.
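
The idea of an assignment function based on consistent hashing can be sketched as follows. This is a generic textbook-style ring, not UbiCrawler's actual assignment function; the class name, the use of SHA-256, and the number of replicas are assumptions made for this example.

```python
import bisect
import hashlib


def _point(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas  # ring positions per agent
        self._points = []         # sorted ring positions
        self._owner = {}          # position -> agent (collisions assumed negligible)

    def add(self, agent):
        for r in range(self.replicas):
            p = _point(f"{agent}#{r}")
            self._owner[p] = agent
            bisect.insort(self._points, p)

    def remove(self, agent):
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def assign(self, host):
        """Return the agent owning the first ring position after the host's hash."""
        if not self._points:
            raise LookupError("no agents on the ring")
        i = bisect.bisect_right(self._points, _point(host)) % len(self._points)
        return self._owner[self._points[i]]
```

When an agent is removed, only the hosts that were assigned to it move to other agents; all other assignments stay put, which is what makes this kind of function attractive for graceful degradation in the presence of faults.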

PDF version; PostScript version; a Java™ class for computing consistent hashing is available as free software at the LAW site.
Paolo Boldi and Sebastiano Vigna. Lower bounds for sense of direction in regular graphs. Distr. Comput., 16(4):279−286, 2003.

Abstract

A graph G with n vertices and maximum degree Δ_G cannot be given weak sense of direction using less than Δ_G colours. It is known that n colours are always sufficient, and it was conjectured that just Δ_G+1 are really needed, that is, one more colour is sufficient. Nonetheless, it has just been shown that for sufficiently large n there are graphs requiring ω(n/log n) more colours than Δ_G. In this paper, using recent results in asymptotic graph enumeration, we show not only that (somehow surprisingly) the same bound holds for regular graphs, but also that it can be improved to Ω(n log log n/log n). We also show that Ω(d_G (log log d_G)^(1/2)) colours are necessary, where d_G is the degree of G.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Lower bounds for weak sense of direction. J. Discrete Algorithms, 1:119−128, 2003.

Abstract

A graph with n vertices and maximum degree Δ cannot be given weak sense of direction using less than Δ colours. It is known that n colours are always sufficient, but it has been conjectured that just Δ+1 are really needed. On the contrary, we show that for sufficiently large n there are graphs requiring Δ+ω(n/log n) colours. We also give simple examples of small graphs requiring Δ+2 colours, which have been verified mechanically.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Universal dynamic synchronous self-stabilization. Distr. Comput., 15(3):137−153, 2002.

Abstract

We prove the existence of a "universal" synchronous self-stabilizing protocol, that is, a protocol that allows a distributed system to stabilize to a desired nonreactive behaviour (as long as a protocol stabilizing to that behaviour exists). Previous proposals required drastic increases in asymmetry and knowledge to work, whereas our protocol does not use any additional knowledge, and does not require more symmetry-breaking conditions than available; thus, it is also stabilizing with respect to dynamic changes in the topology. We prove an optimal quiescence time n+D for a synchronous network of n processors and diameter D; the protocol can be made finite state with a negligible loss in quiescence time. Moreover, an optimal D+1 protocol is given for the case of unique identifiers. As a consequence, we provide an effective proof technique that allows to show whether self-stabilization to a certain behaviour is possible under a wide range of models.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Complexity of deciding sense of direction. SIAM J. Comput., 29(3):779−789, 2000.

Abstract

In this paper we prove that deciding whether a distributed system (represented as a coloured digraph with n nodes) has weak sense of direction is in AC¹ (using n⁶ processors). Moreover, we show that deciding sense of direction is in P. Our algorithms can also be used to decide in AC¹ whether a coloured graph is a Cayley colour graph.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Fibrations of graphs. Discrete Math., 243:21−66, 2002.

Abstract

A fibration of graphs is a morphism that is a local isomorphism of in-neighbourhoods, much in the same way a covering projection is a local isomorphism of neighbourhoods. This paper develops systematically the theory of graph fibrations, emphasizing in particular those results that recently found application in the theory of distributed systems.

PDF version; PostScript version. See also the Graph-fibrations home page.
Paolo Boldi and Sebastiano Vigna. δ-uniform BSS machines. J. Complexity, 14(2):234−256, 1998.

Abstract

A δ-uniform BSS machine is a standard BSS machine which does not rely on exact equality tests. We prove that, for any real closed archimedean field R, a set is δ-uniformly semi-decidable iff it is open and semi-decidable by a BSS machine which is locally time bounded; we also prove that the local time bound is nontrivial. This entails a number of results about BSS machines, in particular the existence of decidable sets whose interior (closure) is not even semi-decidable without adding constants. Finally, we show that the sets semi-decidable by Turing machines are the sets semi-decidable by δ-uniform machines with coefficients in Q or T, the field of Turing computable numbers.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Equality is a jump. Theoretical Computer Science, 219(1−2):49−64, 1999.

Abstract

We define a notion of degree of unsolvability for subsets of Rⁿ (where R is a real closed Archimedean field) and prove that, in contrast to Type 2 computability, the presence of exact equality in the BSS model forces exactly one jump of the unsolvability degree of decidable sets.

PDF version; PostScript version.
Sebastiano Vigna. On the relations between distributive computability and the BSS model. Theoretical Computer Science, 162:5−21, 1996.

Abstract

This paper presents an equivalence result between computability in the BSS model and in a suitable distributive category. It is proved that the class of functions R^m→Rⁿ (with n,m finite and R a commutative, ordered ring) computable in the BSS model, and the functions distributively computable over a natural distributive graph based on the operations of R coincide. Using this result, a new structural characterization, based on iteration, of the same functions is given.
Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Measuring with jugs. Theoretical Computer Science, 282(2):259−270, 2002.

Abstract

We study the jug problem in its most general form: given a set of jugs of fixed capacities, find out which quantities are measurable, and provide upper and lower bounds on the number of steps necessary for measurements.
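
For small instances the question can be explored by brute force. The sketch below assumes the classical puzzle rules (fill a jug, empty a jug, pour one jug into another until the source is empty or the target is full) and calls a quantity measurable when it appears in a single jug; the formal model of the paper may differ. The function names are made up for this example.

```python
from collections import deque


def moves(state, capacities):
    """Yield every state reachable from `state` in one step."""
    n = len(state)
    for i in range(n):
        if state[i] < capacities[i]:
            yield state[:i] + (capacities[i],) + state[i + 1:]  # fill jug i
        if state[i] > 0:
            yield state[:i] + (0,) + state[i + 1:]              # empty jug i
            for j in range(n):
                if j != i and state[j] < capacities[j]:
                    amount = min(state[i], capacities[j] - state[j])
                    nxt = list(state)
                    nxt[i] -= amount
                    nxt[j] += amount
                    yield tuple(nxt)                            # pour i into j


def measurable_quantities(capacities):
    """Map each quantity seen in some jug to the minimum number of steps to obtain it."""
    capacities = tuple(capacities)
    start = tuple(0 for _ in capacities)
    steps = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, d = queue.popleft()
        for q in state:
            steps.setdefault(q, d)  # breadth-first order makes the first d minimal
        for nxt in moves(state, capacities):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return steps
```

For example, `measurable_quantities((3, 5))` explores the familiar two-jug puzzle; the state space grows quickly with the number and size of the jugs, which is why general bounds are of interest.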

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. The Turing closure of an Archimedean field. Theoretical Computer Science, 231:143−156, 2000.

Abstract

A BSS machine is δ-uniform if it does not use exact tests; such machines are equivalent (modulo parameters) to Type 2 Turing machines. We define a notion of closure related to Turing machines for archimedean fields, and show that such fields admit nontrivial δ-uniformly decidable sets iff they are not Turing closed. Then, the partially ordered set of Turing closed fields is proved isomorphic to the ideal completion of unsolvability degrees.

PDF version; PostScript version.

Warning: This paper has been plagiarised.
Paolo Boldi and Sebastiano Vigna. Minimal sense of direction and decision problems for Cayley graphs. Inform. Process. Lett., 64(6):299−303, 1997.

Abstract

Sense of direction is a property of the labelling of (possibly anonymous) networks which allows one to assign coherently local identifiers to other processors on the basis of the route followed by incoming messages. A graph has minimal sense of direction whenever it has sense of direction and the number of colours equals its maximum outdegree. We prove that an outregular digraph with minimal weak sense of direction is a Cayley colour graph (in the general sense, i.e., we do not require connectedness). Since Cayley colour graphs are known to possess minimal transitive sense of direction, we obtain a characterization of outregular graphs with minimal (weak, transitive) sense of direction. As a consequence, deciding whether a coloured graph is a Cayley colour graph reduces to deciding whether it has weak sense of direction, which can be done in AC¹.

PDF version; PostScript version.
Bruno Codenotti, Ivan Gerace, and Sebastiano Vigna. Hardness results and spectral techniques for combinatorial problems on circulant graphs. Linear Algebra Appl., 285(1−3):123−142, 1998.

Abstract

We show that computing (and even approximating) maximum clique and minimum graph coloring for circulant graphs is essentially as hard as in the general case. In contrast, we show that, under additional constraints, e.g., prime order and/or sparseness, graph isomorphism and minimum graph coloring become easier in the circulant case, and we take advantage of spectral techniques for their efficient computation.

PDF version; PostScript version.
Paolo Boldi and Sebastiano Vigna. Coverings that preserve sense of direction. Inform. Process. Lett., 75:175−180, 2000.

Abstract

Sense of direction is a property of labelled networks (i.e., arc-coloured graphs) that allows one to assign coherently local identifiers to other processors on the basis of the route followed by incoming messages. We prove that (weak) sense of direction is preserved by the construction of regular coverings (i.e., coverings induced by voltage assignments in a group) whose voltage assignment depends only on colours. Moreover, this construction preserves minimality.

PDF version; PostScript version.
Stefano Kasangian and Sebastiano Vigna. The topos of labelled trees: A categorical semantics for SCCS. Fund. Inform., 32:27−45, 1997.
Nicoletta Sabadini, Sebastiano Vigna, and Robert F.C. Walters. A note on recursive functions. Math. Struct. Comp. Sci., 6:127−139, 1996.

Conference Proceedings

Sebastiano Vigna. Quasi-succinct indices. In Stefano Leonardi, Alessandro Panconesi, Paolo Ferragina, and Aristides Gionis, editors, Proceedings of the 6th ACM International Conference on Web Search and Data Mining, WSDM'13, pages 83−92. ACM, 2013.

Abstract

Compressed inverted indices in use today are based on the idea of gap compression: document pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using fewer bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
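
The gap-compression baseline described at the start of the abstract can be sketched in a few lines. The example uses Elias γ codes as the instantaneous code and represents the bit stream as a Python string for readability; it illustrates the traditional approach, not the quasi-succinct representation proposed in the paper. Function names and the convention of storing the first pointer as a gap from −1 are assumptions made for this example.

```python
def gamma_encode(x):
    """Elias gamma code of x >= 1: floor(log2 x) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def encode_postings(doc_ids):
    """Encode a strictly increasing list of non-negative document pointers as gaps."""
    bits = []
    prev = -1  # so that the first gap is doc_ids[0] + 1 >= 1
    for d in doc_ids:
        if d <= prev:
            raise ValueError("document pointers must be strictly increasing")
        bits.append(gamma_encode(d - prev))
        prev = d
    return "".join(bits)


def decode_postings(bits):
    """Inverse of encode_postings; note that decoding is inherently sequential."""
    out, prev, i = [], -1, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        prev += int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        out.append(prev)
    return out
```

Small gaps get short codewords, but reaching the k-th pointer requires decoding all the previous gaps (or auxiliary skip structures), which is one motivation for looking at representations of monotone sequences with direct access.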

PDF version. Quasi-succinct indices are now the default indices in MG4J. You can download the additional code used to perform the benchmarks described in the paper.
Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna. Four degrees of separation. In ACM Web Science 2012: Conference Proceedings, pages 45−54. ACM Press, 2012. Best paper award.

Abstract

Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links. (The exact wording of the story is slightly ambiguous: “He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual […]”. It is not completely clear whether the selected individual is part of the five, so this could actually allude to distance five or six in the language of graph theory, but the “six degrees of separation” phrase stuck after John Guare's 1990 eponymous play. Following Milgram's definition and Guare's interpretation, we will assume that “degrees of separation” is the same as “distance minus one”, where “distance” is the usual path length—the number of arcs in the path.) Stanley Milgram in his famous experiment challenged people to route postcards to a fixed recipient by passing them only through direct acquaintances. The average number of intermediaries on the path of the postcards lay between 4.4 and 5.7, depending on the sample of people chosen.

We report the results of the first world-scale social-network graph-distance computations, using the entire Facebook network of active users (≈721 million users, ≈69 billion friendship links). The average distance we observe is 4.74, corresponding to 3.74 intermediaries or “degrees of separation”, showing that the world is even smaller than we expected, and prompting the title of this paper. More generally, we study the distance distribution of Facebook and of some interesting geographic subgraphs, looking also at their evolution over time.
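
The relation between average distance and "degrees of separation" (distance minus one, so 4.74 − 1 = 3.74) can be made concrete on a toy graph. The sketch below computes the exact average distance by breadth-first search over all reachable ordered pairs of distinct nodes; the measurements in the paper rely instead on probabilistic algorithms, since exact visits are infeasible at that scale. The function name and the adjacency-list format are assumptions made for this example.

```python
from collections import deque


def average_distance(adj):
    """Average shortest-path length over ordered pairs (x, y), x != y, y reachable from x."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    if pairs == 0:
        raise ValueError("no pair of distinct nodes is connected")
    return total / pairs


if __name__ == "__main__":
    # A path of four people: 0 - 1 - 2 - 3.
    friends = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    d = average_distance(friends)
    print("average distance:", d, "degrees of separation:", d - 1)
```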

The networks we are able to explore are almost two orders of magnitude larger than those analysed in the previous literature. We report detailed statistical metadata showing that our measurements (which rely on probabilistic algorithms) are very accurate.

PDF version; Java™ implementations are available as free software at the WebGraph home page; we distribute the HyperANF runs, current degree distributions and property files of the graphs discussed in the paper. The graphs are also described on the LAW dataset page.
Roi Blanco, Peter Mika, and Sebastiano Vigna. Effective and efficient entity search in RDF data. In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web — ISWC 2011. 10th International Semantic Web Conference, Proceedings, Part I, volume 7031 of Lecture Notes in Computer Science, pages 83−97. Springer, 2011.

Abstract

Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e., access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. Robustness of social networks: Comparative results based on distance distributions. In Social Informatics, Third International Conference, SocInfo 2011, volume 6894 of Lecture Notes in Computer Science, pages 8−21. Springer, 2011.

Abstract

Given a social network, which of its nodes have a stronger impact in determining its structure? More formally: which node-removal order has the greatest impact on the network structure? We approach this well-known problem for the first time in a setting that combines both web graphs and social networks, using datasets that are orders of magnitude larger than those appearing in the previous literature, thanks to some recently developed algorithms and software tools that make it possible to approximate accurately the number of reachable pairs and the distribution of distances in a graph. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results, and at the same time reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.

PDF version; public datasets are available at the LAW site; Java™ implementations are available as free software at the WebGraph home page.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 625−634. ACM, 2011.

Abstract

The neighbourhood function N_G(t) of a graph G gives, for each t, the number of pairs of nodes <x, y> such that y is reachable from x in fewer than t hops. The neighbourhood function provides a wealth of information about the graph (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm (approximate neighbourhood function) has been proposed with the purpose of approximating N_G(t) on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters and combines them efficiently through broadword programming; our implementation uses decomposition to exploit multi-core parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation. Then, we turn to the study of the distribution of the distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clear-cut characterisation of proper social networks vs. web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new, informative statistic that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new non-local structural index for complex networks whose computation is highly scalable.

PDF version; Java™ implementations are available as free software at the WebGraph home page; public datasets are available at the LAW site.
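
For readers who want a concrete picture of the quantities named in the HyperANF abstract, here is a minimal Python sketch that computes the neighbourhood function and the spid exactly, by running a breadth-first search from every node of a tiny graph. It is not HyperANF and not the WebGraph code: it uses no HyperLogLog counters and does not scale. The function names and the adjacency-list input are hypothetical. Two conventions are assumptions: N_G(t) counts pairs at distance strictly less than t, as the abstract is worded, and the spid is taken as the variance-to-mean ratio of the distances between distinct reachable nodes.

```python
from collections import deque


def distance_counts(adj):
    """Return counts[d] = number of ordered pairs (x, y) with distance d.

    adj maps each node to the list of its successors (hypothetical format).
    Pairs (x, x) are counted at distance 0; unreachable pairs are ignored.
    """
    counts = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain BFS from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            counts[d] = counts.get(d, 0) + 1
    return counts


def neighbourhood_function(counts, t):
    """Number of pairs (x, y) with y reachable from x in fewer than t hops."""
    return sum(c for d, c in counts.items() if d < t)


def spid(counts):
    """Variance-to-mean ratio of the distances between distinct reachable nodes.

    Excluding distance-0 pairs is an assumption of this sketch.
    """
    pairs = {d: c for d, c in counts.items() if d > 0}
    total = sum(pairs.values())
    mean = sum(d * c for d, c in pairs.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in pairs.items()) / total
    return variance / mean


# Undirected path 0 - 1 - 2 - 3, stored as a symmetric directed graph.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path_counts = distance_counts(path)
```

On this four-node path the distance counts are 4, 6, 4 and 2 pairs at distances 0 to 3, so `neighbourhood_function(path_counts, 2)` is 10 and `spid(path_counts)` is 1/3. The all-pairs search takes time proportional to nodes times arcs, which is the cost the abstract describes as very expensive and the reason for approximating instead.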
Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 587−596. ACM, 2011.

Abstract

We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have been used for a long time to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses decomposition to perform aggressively on multi-core architecture, making it possible
