Papers

The accepted papers are organized as follows:

  • Day 1 - December 16th, 2011
    • Contributed Talks
    • Spotlights
  • Day 2 - December 17th, 2011
    • Contributed Talks
    • Spotlights

Day 1 - December 16th, 2011

Day 1 Contributed Talks

  • Frédéric Bastien, Arnaud Bergeron, Pascal Vincent, Yoshua Bengio and Andreas Klöckner.
    A Common GPU n-Dimensional Array for Python and C
    Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array (tensor) named GpuNdArray that works with both CUDA and OpenCL. It will be usable from Python, C and possibly other languages.
    PDF, Slides

  • Mihai Budiu, Jamie Shotton, Derek Murray and Mark Finocchio.
    Parallelizing the Training of the Kinect Body Parts Labeling Algorithm
    We present the parallelized implementation of decision forest training as used in Kinect to train the body parts classification system. We describe the practical details of dealing with large training sets and deep trees, and describe how to parallelize over multiple dimensions of the problem.
    PDF, Slides

  • Tammo Krueger, Danny Panknin and Mikio Braun.
    Fast Cross-Validation via Sequential Analysis
    With the increasing size of today's data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.
    PDF, Appendix, Slides
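
    The following toy sketch illustrates the general idea of eliminating weak parameter configurations on linearly growing data subsets. It is not the authors' procedure: their method uses non-parametric testing and sequential analysis, whereas this sketch simply drops the weaker half of the candidates at each step, and the candidate grid, subset schedule, and scikit-learn estimator are illustrative choices.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        def fast_cv(X, y, candidates, steps=5, keep_frac=0.5, seed=0):
            rng = np.random.default_rng(seed)
            order = rng.permutation(len(y))
            alive = list(candidates)                   # configurations still in the race
            for step in range(1, steps + 1):
                idx = order[: len(y) * step // steps]  # linearly increasing data subset
                scores = [np.mean(cross_val_score(SVC(**c), X[idx], y[idx], cv=3))
                          for c in alive]
                if step < steps and len(alive) > 1:    # eliminate the weaker half early
                    cutoff = np.quantile(scores, 1.0 - keep_frac)
                    alive = [c for c, s in zip(alive, scores) if s >= cutoff]
            return alive[int(np.argmax(scores))]       # best surviving configuration

        # best = fast_cv(X, y, [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}])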

  • Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael Jordan.
    Bootstrapping Big Data
    The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.
    PDF, Long Version, Slides
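
    A minimal sketch of the BLB recipe for a simple estimator (the mean), assuming the quality measure of interest is the bootstrap standard error; the constants gamma, s, and r and the choice of estimator are illustrative, not prescriptions from the paper.

        import numpy as np

        def blb_stderr(x, gamma=0.7, s=10, r=50, seed=0):
            rng = np.random.default_rng(seed)
            n = len(x)
            b = int(n ** gamma)                       # each "little bootstrap" sees only b points
            quality = []
            for _ in range(s):
                subset = rng.choice(x, size=b, replace=False)
                estimates = []
                for _ in range(r):
                    # A size-n resample is represented by multinomial counts over the
                    # b subset points, so no size-n array is ever materialized.
                    w = rng.multinomial(n, np.full(b, 1.0 / b))
                    estimates.append(w @ subset / n)          # weighted mean
                quality.append(np.std(estimates))             # per-subset quality assessment
            return np.mean(quality)                           # averaged over the s subsets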

Day 1 Spotlights

  • James Bergstra, Frederic Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron and Yoshua Bengio.
    Theano: Deep Learning on GPUs with Python
    In this paper, we present Theano, a framework in the Python programming language for defining, optimizing and evaluating expressions involving high-level operations on tensors. Theano offers most of NumPy's functionality, but adds automatic symbolic differentiation, GPU support, and faster expression evaluation. Theano is a general mathematical tool, but it was developed with the goal of facilitating research in deep learning. The Deep Learning Tutorials introduce recent advances in deep learning, and showcase how Theano makes such algorithms compact, elegant, and fast.
    PDF
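
    A small example of the style of use the abstract describes, assuming Theano is installed: a symbolic logistic-regression loss is differentiated automatically and compiled into a callable update step (which Theano can execute on a GPU when so configured). The model and step size are illustrative.

        import numpy as np
        import theano
        import theano.tensor as T

        X = T.dmatrix('X')                     # symbolic inputs
        y = T.dvector('y')
        w = theano.shared(np.zeros(3), name='w')

        p = T.nnet.sigmoid(T.dot(X, w))
        loss = -T.mean(y * T.log(p) + (1 - y) * T.log(1 - p))
        grad = T.grad(loss, w)                 # automatic symbolic differentiation

        train = theano.function(inputs=[X, y], outputs=loss,
                                updates=[(w, w - 0.1 * grad)])  # one gradient step per call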

  • Ronan Collobert, Koray Kavukcuoglu and Clement Farabet.
    Torch7: A Matlab-like Environment for Machine Learning
    Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface.
    PDF, Long Version, Spotlight

  • Nico Piatkowski.
    Parallel Algorithms for GPU accelerated Probabilistic Inference
    Real-world data is likely to contain an inherent structure. Those structures may be represented with graphs which encode independence assumptions within the data. Performing inference in those models is nearly intractable on mobile devices or casual workstations. This work introduces and compares two approaches for accelerating the inference in graphical models by using GPUs as parallel processing units. It is shown empirically that, in order to achieve a scalable parallel algorithm, one has to distribute the workload equally among all processing units of a GPU. We accomplished this by introducing Thread-Cooperative message computations.
    PDF

  • Amit Goyal, Piyush Rai and Hal Daumé III.
    Multiple Hash Functions for Learning
    In this paper, we explore the idea of feature hashing in learning problems. We first evaluate some hashing strategies on the basis of their efficacy on classification problems. We then explore the following trade-off: given a fixed budget (say $K$) for the hashed feature vector, should one use a single hash function that gives a hashed vector of size $K$, or use multiple hash functions to come up with smaller representations (say $3$ hash functions, each giving a representation of size $K/3$)? In particular, for the latter setting, how should the different hashed representations be combined? We propose online learning algorithms for this setting using multiple Perceptrons (one for each hashed representation), and explore a number of Perceptron update and prediction schemes. Experimental results demonstrate that our update schemes give better classification accuracies than the case when a single hashed feature vector is used to train the model.
    PDF, Spotlight
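
    The sketch below makes the budget trade-off concrete: one hashed vector of size K versus three independently hashed vectors of size K/3 each. The md5-based hash and the token counts are illustrative; the paper's Perceptron update and combination schemes are only hinted at in the closing comment.

        import numpy as np
        from hashlib import md5

        def hashed_vector(tokens, size, seed=0):
            v = np.zeros(size)
            for t in tokens:
                h = int(md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                v[h % size] += 1.0             # feature hashing: index = hash(token) mod size
            return v

        tokens = "the quick brown fox jumps over the lazy dog".split()
        single = hashed_vector(tokens, size=12)                              # one K-dim vector
        multi = [hashed_vector(tokens, size=4, seed=s) for s in (1, 2, 3)]   # three K/3-dim vectors
        # One could then train one Perceptron per representation in `multi` and
        # combine their predictions, e.g. by voting or by averaging the margins.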

  • Ping Li, Anshumali Shrivastava, Joshua Moore and Christian Konig.
    b-Bit Minwise Hashing for Large-Scale Learning
    Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of $b$-bit minwise hashing provides a substantial improvement by storing only the lowest $b$ bits of each hashed value. In this paper, we demonstrate that $b$-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare $b$-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that $b$-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
    PDF, Spotlight
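
    A rough sketch of how a set (e.g. a document's shingles) could be turned into features for a linear model: take k min-hashes, keep only the lowest b bits of each, and expand every b-bit value into a one-hot block. The hash construction and constants are illustrative, not the paper's implementation.

        import numpy as np

        def bbit_minhash_features(items, k=64, b=2, seed=0):
            rng = np.random.default_rng(seed)
            prime = (1 << 31) - 1
            a = [int(v) for v in rng.integers(1, prime, size=k)]
            c = [int(v) for v in rng.integers(0, prime, size=k)]
            feats = np.zeros(k * (1 << b))
            for j in range(k):
                # hash h_j(x) = (a*x + c) mod p; the minimum over the set is the j-th minhash
                minh = min((a[j] * (hash(x) & 0x7FFFFFFF) + c[j]) % prime for x in items)
                low = minh & ((1 << b) - 1)        # keep only the lowest b bits
                feats[j * (1 << b) + low] = 1.0    # one-hot expansion of the b-bit value
            return feats                           # ready for a linear SVM or logistic regression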

  • Zhenwen Dai, Jacquelyn A. Shelton, Jörg Bornschein, Abdul Saboor Sheikh and Jörg Lücke.
    Combining approximate inference methods for efficient learning on large computer clusters
    We extend the previously developed framework of parallel EM learning (Bornschein, Dai, and Luecke, 2010) to different models with corresponding parallelization techniques. To further tackle problems of computational complexity and to utilize the capability of the parallel computing hardware (CPU/GPU clusters), we developed a set of techniques which can be tailored to specific large-scale learning problems. We design a dynamic data repartition technique for "Gaussian sparse coding", use specialized GPU kernels for translation invariance learning, and show how sampling can be used to further scale the learning on very high-dimensional data. We propose these as examples of a parallelization toolbox which can be creatively combined and exploited in model-task driven ways.
    PDF, Spotlight

  • Zhuhua Cai, Zografoula Vagena, Christopher Jermaine and Peter J. Haas.
    Very Large Scale Bayesian Inference Using MCDB
    This extended abstract describes how the Monte Carlo database system (MCDB) can be used to easily implement Bayesian inference via Markov chain Monte Carlo (MCMC) over very large data sets. To implement an MCMC simulation in MCDB, a programmer specifies dependencies among variables and how they parameterize one another using the SQL language. A user can write MCDB SQL without worrying about how the code will be parallelized across many machines; MCDB automatically takes care of the optimization and parallelization.
    PDF

  • Harris Lin, Neeraj Koul and Vasant Honavar.
    Learning Updatable Classifiers from Remote Data
    The increasing availability of large data offers an exciting opportunity to use such data to build predictive models using machine learning algorithms. However, most approaches to learning assume direct access to data, and cannot efficiently cope with frequent updates to the data. In this paper we show that learning using statistical queries provides a powerful paradigm to address these challenges. We summarize our work and present INDUS, an open source implementation of learning algorithms based on the proposed statistical query paradigm.
    PDF, Spotlight

  • Simon Wilson and Jiwon Yoon.
    Large-scale Bayesian ICA with application to separation of the cosmic microwave background
    A functional approximation to implement Bayesian source separation analysis is introduced and applied to separation of the Cosmic Microwave Background (CMB) using WMAP data. The approximation allows for tractable full-sky map reconstructions at the scale of both WMAP and Planck data and models the spatial smoothness of sources through a Gaussian Markov random field prior. It is orders of magnitude faster than the usual MCMC approaches. The performance and limitations of the approximation are also discussed.
    PDF

  • Haimonti Dutta.
    A Randomized Gossip-based Algorithm for Classification on Peer-to-Peer Networks
    Peer-to-Peer (P2P) networks are distributed systems in which nodes of equal roles and capabilities exchange information and services directly with one another. In recent years, they have become a popular way to share large amounts of data. Such architectures, however, complicate the process of knowledge discovery and data mining since algorithms must deal with distributed (and often dynamic) sources of data and computing. In this paper, we present a distributed algorithm for learning linear classifiers in P2P networks. The problem is posed as a linear program such that each peer has its own constraints, but needs to solve a global objective function. A randomized gossip-based approximate algorithm is presented which reduces communication cost in the network significantly while ensuring convergence of the algorithm.
    PDF
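
    As a toy illustration of the communication primitive behind such methods, the sketch below runs randomized gossip averaging on an arbitrary network: at each step two peers sharing a randomly chosen edge average their local values, and all values converge to the global mean without any coordinator. This illustrates the gossip pattern only, not the paper's linear-programming algorithm; the network and schedule are made up.

        import numpy as np

        def gossip_average(values, edges, iters=10000, seed=0):
            rng = np.random.default_rng(seed)
            x = np.asarray(values, dtype=float)
            for _ in range(iters):
                i, j = edges[rng.integers(len(edges))]   # pick a random link in the P2P network
                x[i] = x[j] = 0.5 * (x[i] + x[j])        # the two peers exchange and average
            return x

        # gossip_average([1.0, 2.0, 6.0], edges=[(0, 1), (1, 2), (0, 2)])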

  • Evgeny Sitnikov, Achim Rettinger and Ole Mengshoel.
    Distributed Unsupervised Semantic Parsing of Large-Scale Document Corpora
    Large-scale text corpora constitute a great opportunity and challenge for natural language processing including machine learning. One state-of-the-art approach, the Unsupervised Semantic Parsing (USP) technique, clusters synonymic relations on a sentence level and has been shown to answer meaningful questions by exploiting these semantic clusters. In this paper, we propose Distributed USP (DUSP), which improves USP's ability to handle large text corpora by distributing several of USP's key algorithmic steps over a cluster of commodity computers. In experiments with DUSP, we processed a corpus that was over 13 times larger than the largest corpus we were able to handle using USP. In addition, DUSP's processing speed was 284 documents per minute, versus 69.2 for USP.
    PDF, Spotlight

  • Marina Sokol, Konstantin Avrachenkov, Arnaud Legout and Paulo Goncalves.
    Graph Based Classification of Content and Users in BitTorrent
    P2P downloads still represent a large portion of today's Internet traffic. More than 100 million users operate BitTorrent and generate more than 30% of the total Internet traffic. According to the Wikipedia article about BitTorrent, the traffic generated by BitTorrent is greater than the traffic generated by Netflix and Hulu combined. Recently, a significant research effort has been devoted to developing tools for automatic classification of Internet traffic by application. The purpose of the present work is to provide a framework for subclassification of P2P traffic generated by the BitTorrent protocol. The general intuition is that users with similar interests download similar content. This intuition can be rigorously formalized with the help of a graph-based semi-supervised learning approach. In particular, we have chosen to work with a PageRank-based semi-supervised learning method, which scales well with very large volumes of data.
    PDF
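
    A generic sketch of PageRank-style semi-supervised classification on a graph, assuming an adjacency matrix with no isolated nodes and one-hot seed labels; the exact normalization and variant used in the paper may differ from this textbook formulation.

        import numpy as np

        def pagerank_ssl(A, Y, alpha=0.85, iters=100):
            # A: n x n adjacency matrix (no zero-degree nodes), Y: n x c seed labels (zero rows = unlabeled)
            P = A / A.sum(axis=1, keepdims=True)      # row-stochastic random-walk matrix
            F = Y.astype(float).copy()
            for _ in range(iters):
                F = alpha * P @ F + (1 - alpha) * Y   # diffuse labels while staying anchored to the seeds
            return F.argmax(axis=1)                   # predicted class per node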

Day 2 - December 17th, 2011

Day 2 Contributed Talks

  • Matthew Hoffman and Andrew Gelman.
    The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
    Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlations that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than popular methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size $\epsilon$ and a desired number of steps $L$. In particular, if $L$ is too small then the algorithm exhibits undesirable random walk behavior, while if $L$ is too large the algorithm wastes computation. We present the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps $L$. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. NUTS is able to achieve similar performance to a well tuned standard HMC method, without requiring user intervention or costly tuning runs. NUTS can thus be used in applications such as BUGS-style automatic inference engines that require efficient "turnkey" sampling algorithms.
    PDF, Long Version, Code, Slides
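
    For context, the sketch below is plain HMC with the two hand-tuned parameters that NUTS removes (the step size epsilon and the number of leapfrog steps L) left explicit; it is standard HMC, not the NUTS recursion described in the paper, and the target is supplied by the caller.

        import numpy as np

        def hmc_step(theta, logp, logp_grad, epsilon=0.1, L=20, seed=None):
            rng = np.random.default_rng(seed)
            r = rng.normal(size=theta.shape)               # resample momentum
            theta_new, r_new = theta.copy(), r.copy()
            for _ in range(L):                             # leapfrog integration: L steps of size epsilon
                r_new += 0.5 * epsilon * logp_grad(theta_new)
                theta_new += epsilon * r_new
                r_new += 0.5 * epsilon * logp_grad(theta_new)
            # Metropolis correction on the joint (position, momentum) energy
            log_accept = (logp(theta_new) - 0.5 * r_new @ r_new) - (logp(theta) - 0.5 * r @ r)
            return theta_new if np.log(rng.uniform()) < log_accept else theta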

  • John Duchi, Martin Wainwright and Peter Bartlett.
    Randomized Smoothing for (Parallel) Stochastic Optimization
    By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to statistical machine learning problems, providing experimental results demonstrating the effectiveness of our algorithms.
    PDF, Long Version, Slides
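
    The core gradient estimate can be sketched in a few lines: average (sub)gradients of the non-smooth objective at randomly perturbed points, which behaves like a gradient of a smoothed surrogate. The perturbation scale, sample count, and the surrounding accelerated method are omitted or chosen for illustration only.

        import numpy as np

        def smoothed_grad(subgrad, x, u_scale=0.1, m=20, seed=0):
            rng = np.random.default_rng(seed)
            g = np.zeros_like(x)
            for _ in range(m):                                  # m i.i.d. perturbations (parallelizable)
                g += subgrad(x + u_scale * rng.normal(size=x.shape))
            return g / m                                        # averaged (sub)gradient estimate

        # Example with the non-smooth L1 objective f(x) = ||x||_1, whose subgradient is sign(x):
        # smoothed_grad(np.sign, np.array([0.3, -0.2]))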

  • Neal Parikh and Stephen Boyd.
    Block Splitting for Large-Scale Distributed Learning
    Machine learning and statistics with very large datasets is now a topic of widespread interest, both in academia and industry. Many such tasks can be posed as convex optimization problems, so algorithms for distributed convex optimization serve as a powerful, general-purpose mechanism for training a wide class of models on datasets too large to process on a single machine. In previous work, it has been shown how to solve such problems in such a way that each machine only looks at either a subset of training examples or a subset of features. In this paper, we extend these algorithms by showing how to split problems by both examples and features simultaneously, which is necessary to deal with datasets that are very large in both dimensions. We present some experiments with these algorithms run on Amazon's Elastic Compute Cloud.
    PDF, Long Version, Slides

  • Rainer Gemulla, Peter J. Haas, Yannis Sismanis, Christina Teflioudi and Faraz Makari.
    Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
    We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel "stratified" variant of SGD, we obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations and has good scalability properties.
    PDF, Slides
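
    A compact sketch of the "stratified" scheduling idea: rows and columns are cut into d blocks, and within one stratum the d selected blocks share no rows or columns, so their SGD updates touch disjoint parts of the factor matrices W and H and could run on different machines. The loop below runs sequentially for clarity; the block schedule, step size, and regularization are illustrative choices, not the paper's.

        import numpy as np

        def dsgd_epoch(ratings, W, H, d=4, lr=0.01, lam=0.05):
            # ratings: list of (i, j, v) triples; W: n_rows x k factors, H: n_cols x k factors
            row_block = lambda i: i * d // W.shape[0]
            col_block = lambda j: j * d // H.shape[0]
            for shift in range(d):                                    # d strata per epoch
                stratum = {(b, (b + shift) % d) for b in range(d)}    # d mutually disjoint blocks
                for i, j, v in ratings:
                    if (row_block(i), col_block(j)) in stratum:
                        err = v - W[i] @ H[j]
                        W[i], H[j] = (W[i] + lr * (err * H[j] - lam * W[i]),
                                      H[j] + lr * (err * W[i] - lam * H[j]))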

Day 2 Spotlights

  • Sameer Singh and Andrew McCallum.
    Towards Asynchronous Distributed MCMC Inference for Large Graphical Models
    Many machine learning and information extraction tasks require extraction of latent structure from large datasets. These often require representation of a large number of long-range dependencies. To enable inference for these incredibly massive and dense graphical models, we need to exploit distributed computation. Existing work either relies on a shared-memory multi-core setting or expects synchronization and locking to sample correctly. Instead we propose a truly asynchronous architecture for MCMC sampling consisting of independent parallel workers with no communication. We study a few different conflict resolution techniques and observe their effect on convergence of MCMC chains, evaluated on synthetic models.
    PDF, Long Version, Spotlight

  • Dmitry Pechyony, Libin Shen and Rosie Jones.
    Solving Large Scale Linear SVM with Distributed Block Minimization
    Over recent years we have seen the appearance of huge datasets that do not fit into memory and do not even fit on the hard disk of a single computer. Moreover, even when processed on a cluster of machines, data are usually stored in a distributed way. The transfer of significant subsets of such datasets from one node to another is very slow. We present a new algorithm for training linear Support Vector Machines over such large datasets. Our algorithm assumes that the dataset is partitioned over several nodes on a cluster and performs a distributed block minimization along with the subsequent line search. The communication complexity of our algorithm is independent of the number of training examples. With our Map-Reduce/Hadoop implementation of this algorithm the accurate training of an SVM over datasets of tens of millions of examples takes less than 11 minutes.
    PDF, Appendix, Spotlight

  • Byung Kang, Woosang Lim and Kyomin Jung.
    Scalable Kernel k-Means via Centroid Approximation
    Although kernel k-means is central for clustering complex data such as images and texts by implicit feature space embedding, its practicality is limited by the quadratic computational complexity. In this paper, we present a novel technique based on scalable centroid approximation that accelerates kernel k-means down to a sub-quadratic complexity. We prove near-optimality of our algorithm, and experimentally confirm its efficiency.
    PDF, Spotlight

  • Avneesh Saluja and Branislav Kveton.
    Semi-Supervised Learning with Cover Trees
    Semi-supervised learning (SSL) has emerged in recent years as an important tool to tackle large amounts of data. Generally in large-data scenarios, one finds that the ratio of unlabeled to labeled data is very high, and that annotating and labeling data for training a learning algorithm is often resource-demanding and expensive. This issue makes semi-supervised techniques a natural way to explore, understand, and learn from these large datasets. A subset of semi-supervised learning approaches that has been quite successful relies on constructing a similarity or a data adjacency graph between all the points in the training set, labeled and unlabeled, and using the structure of the graph as well as the labeled points to predict labels over the unlabeled data. These methods are collectively referred to as graph-based semi-supervised learning. Graph-based learning is Ω(n^2) in the number of datapoints in the training set, because it takes Θ(n^2) time to build the similarity graph over all datapoints. This approach is impractical for large n. In this work, we investigate another way of summarizing the data. Instead of focusing on k representative vertices, we represent data as a tree at different levels of granularity. We show that our approach is scalable to large datasets, and compares favorably to representative vertex selection methods for scaling up graph-based SSL. Since our method results in a naturally sparse representation of the similarity matrix, we also compare it to an ad-hoc sparsification technique, i.e., thresholding the edge weights, where we achieve improved results.
    PDF

  • Martin Zinkevich.
    A Theoretical Analysis of a Warm Start Technique
    There has been a lot of interest in warm-start approaches to gradient descent techniques, but little analysis. In this paper, we formally analyze a method of warm-starting batch gradient descent using small batch sizes. We argue that this approach is fundamentally different than mini-batch, in that it requires only sequential passes over the data, improving performance on datasets stored on a disk drive. This is an important aspect of scaling batch methods to larger datasets.
    PDF
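
    A minimal sketch of the warm-start pattern being analyzed, assuming a least-squares objective: run batch gradient descent on a small prefix of the data, then reuse that solution as the starting point for larger prefixes, so the data is only ever read in sequential passes. The prefix schedule and step size are illustrative.

        import numpy as np

        def grad(w, X, y):                          # least-squares gradient
            return X.T @ (X @ w - y) / len(y)

        def warm_start_gd(X, y, sizes=(1000, 10000, None), steps=100, lr=0.1):
            w = np.zeros(X.shape[1])
            for n in sizes:                         # growing prefixes of the dataset
                Xs, ys = (X, y) if n is None else (X[:n], y[:n])
                for _ in range(steps):
                    w -= lr * grad(w, Xs, ys)       # batch gradient descent, warm-started
            return w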

  • Sangkyun Lee and Christian Bockermann.
    Scalable Stochastic Gradient Descent with Improved Confidence
    Stochastic gradient descent methods have been quite successful for solving large-scale and online learning problems. We provide a simple parallel framework to obtain solutions of high confidence, where the confidence can be easily controlled by the number of processes, independently of the length of learning processes. Our framework is implemented as scalable open-source software which can be configured for a single multicore machine or for a cluster of computers, where the training outcomes from independent parallel processes are combined to produce the final output.
    PDF, Spotlight

  • Kevin Duh, Jun Suzuki and Masaaki Nagata.
    Distributed Learning-to-Rank on Streaming Data using Alternating Direction Method of Multipliers
    We show that the Alternating Direction Method of Multipliers is an effective method for large-scale learning-to-rank on multi-cores and clusters, especially in scenarios requiring joint distributed and streaming architectures.
    PDF, Spotlight

  • Markus Weimer, Tyson Condie and Raghu Ramakrishnan.
    Machine learning in ScalOps, a higher order cloud computing language
    ScalOps is a new internal domain-specific language (DSL) for Big Data analytics that targets machine learning and graph-based algorithms. It unifies the so-far distinct DAG processing as found in, e.g., Pig and the iterative computation needs of machine learning in a single language and runtime. It exposes a declarative language that is reminiscent of Pig with iterative extensions: the scaloop block captures iteration and packages it in the execution plan so that it can be optimized for caching opportunities and handed off to the runtime. The Hyracks runtime directly supports these iterations as recursive queries, thereby avoiding the pitfalls of an outer driver loop. We highlight the expressiveness of ScalOps by presenting two example implementations: Batch Gradient Descent, a trivially parallel algorithm, and Pregel, a computational framework of its own. The resulting code is nearly a 1:1 translation of the target mathematical description.
    PDF, Spotlight

  • Ryan Curtin, James Cline, Neil Slagle, Matthew Amidon and Alexander Gray.
    MLPACK: A Scalable C++ Machine Learning Library
    MLPACK is a new, state-of-the-art, scalable C++ machine learning library, which will be released in early December 2011. Its aim is to make large-scale machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users. MLPACK provides many cutting-edge algorithms, including (but not limited to) tree-based k-nearest-neighbors, fast hierarchical clustering, MVU (maximum variance unfolding) with LRSDP, RADICAL (ICA), and HMMs. Each of these algorithms is highly flexible and configurable, and all of them can be configured to use sparse matrices for datasets. Benchmarks are shown to prove that MLPACK scales more effectively than competing toolkits. Future plans for MLPACK include support for parallel algorithms, support for disk-based datasets, as well as the implementation of a large collection of cutting-edge algorithms. More information on MLPACK is available at the project's website, which is located at www.mlpack.org.
    PDF, Spotlight

  • Andrea Tacchetti, Pavan Mallapragada, Matteo Santoro and Lorenzo Rosasco.
    GURLS: a Toolbox for Large Scale Multiclass Learning
    In this paper we discuss a computational solution to the problem of large scale multi-category learning. This problem is ubiquitous in real world applications, and yet, very few fully automatic software tools are currently available. A widely adopted solution to multiclass problems is to use a base binary classification algorithm to build multiple independent one-versus-all classifiers. For most binary classification algorithms, this approach becomes impractical as the number of classes gets larger. In this context, we study a regularized least squares (RLS) approach and develop a corresponding toolbox, that we call GURLS - Grand Unified Regularized Least Squares. Two main properties make this approach appealing. First, RLS allows for automatic parameter selection and training of multiclass classifiers using arbitrary binary decomposition (e.g. one-vs-rest, error correction codes) with the same order of complexity as a single binary classifier. Second, the RLS framework allows one to leverage highly optimized libraries for numerical linear algebra and matrix operations, and to easily implement various training strategies, from recursive updates to stochastic gradient methods, which become crucial when dealing with massive high-dimensional data.
    PDF, Spotlight
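
    A bare-bones sketch of the RLS idea for multiclass problems: with one-vs-all +/-1 encoding, all classes share the same d x d linear system, so the cost is essentially that of a single binary RLS problem. The regularization constant is illustrative, and GURLS itself adds automatic parameter selection and optimized linear-algebra backends that this sketch omits.

        import numpy as np

        def rls_multiclass(X, y, lam=1e-2):
            n, d = X.shape
            classes = np.unique(y)
            Y = (y[:, None] == classes[None, :]).astype(float) * 2 - 1    # one-vs-all +/-1 coding
            W = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)   # one solve covers all classes
            return W, classes

        def rls_predict(W, classes, X):
            return classes[np.argmax(X @ W, axis=1)]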

  • Heather Miller, Philipp Haller and Martin Odersky.
    Tools and Frameworks for Big Learning in Scala: Leveraging the Language for High Productivity and Performance
    Implementing machine learning algorithms for large data, such as the Web graph and social networks, is challenging. Even though much research has focused on making sequential algorithms more scalable, their running times continue to be prohibitively long. Meanwhile, parallelization remains a formidable challenge for this class of problems, despite frameworks like MapReduce which hide much of the associated complexity. We present three ongoing efforts within our team, previously presented at venues in other fields, which aim to make it easier for machine learning researchers and practitioners alike to quickly implement and experiment with their algorithms in a parallel or distributed setting. Furthermore, we hope to highlight some of the language features unique to the Scala programming language in the treatment of our frameworks, in an effort to show how these features can be used to produce efficient and correct parallel systems more easily than ever before.
    PDF, Spotlight

  • Alekh Agarwal, Olivier Chapelle, Miroslav Dudik and John Langford.
    A Reliable Effective Terascale Linear Learning System
    We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features (the number of features here refers to the number of non-zero entries in the data matrix), billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. One of the core techniques used is a new communication infrastructure, often referred to as AllReduce, implemented for compatibility with MapReduce clusters. The communication infrastructure appears broadly reusable for many other tasks.
    PDF, Spotlight