

ICML 2010 - Abstracts

Invited Application Track

Paper ID: 901

Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine

Thore Graepel (Microsoft Research); Joaquin Quiñonero Candela (Microsoft Research); Thomas Borchert (Microsoft Research); Ralf Herbrich (Microsoft Research)

We describe a new Bayesian click-through rate (CTR) prediction algorithm used for Sponsored Search in Microsoft’s Bing search engine. The algorithm is based on a probit regression model that maps discrete or real-valued input features to probabilities. It maintains Gaussian beliefs over the weights of the model and performs Gaussian online updates derived from approximate message passing. Scalability of the algorithm is ensured through a principled weight pruning procedure and an approximate parallel implementation. We discuss the challenges arising from evaluating and tuning the predictor as part of the complex system of sponsored search, where the predictions made by the algorithm determine the composition of future training samples. Finally, we show experimental results from the production system and compare them to a calibrated Naïve Bayes algorithm.
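
As a concrete illustration of the kind of update the abstract describes, here is a minimal sketch of online Bayesian probit regression with per-feature Gaussian weight beliefs. It is not the production algorithm: the correction functions v and w follow the standard truncated-Gaussian message-passing derivation, while the binary-feature encoding, prior variance, and noise level beta are illustrative assumptions.

    import numpy as np
    from scipy.stats import norm

    def v(t):                       # truncated-Gaussian correction functions
        return norm.pdf(t) / norm.cdf(t)

    def w(t):
        vt = v(t)
        return vt * (vt + t)

    class OnlineProbit:
        """Per-feature Gaussian beliefs N(mu, s2); 'active' lists the indices
        of the binary features that fire for an impression (an assumption)."""
        def __init__(self, n_features, beta=1.0, prior_var=1.0):
            self.mu = np.zeros(n_features)
            self.s2 = np.full(n_features, prior_var)
            self.beta = beta

        def predict(self, active):
            s = np.sqrt(self.beta**2 + self.s2[active].sum())
            return norm.cdf(self.mu[active].sum() / s)   # predicted CTR

        def update(self, active, y):                     # y in {-1, +1}
            s = np.sqrt(self.beta**2 + self.s2[active].sum())
            t = y * self.mu[active].sum() / s
            self.mu[active] += y * (self.s2[active] / s) * v(t)
            self.s2[active] *= 1.0 - (self.s2[active] / s**2) * w(t)

Pruning weights whose posterior stays close to the prior, as the abstract mentions, would operate on the (mu, s2) pairs above.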

[Full Paper] [Discussion]




Paper ID: 902

Detecting Large-Scale System Problems by Mining Console Logs

Wei Xu (UC Berkeley); Ling Huang (Intel Labs Berkeley); Armando Fox (UC Berkeley); David Patterson (UC Berkeley); Michael I. Jordan (UC Berkeley)

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We use a combination of program analysis and information retrieval techniques to transform free-text console logs into numerical features, which capture sequences of events in the system. We then analyze these features using machine learning to detect operational problems. We also show how to distill the results of our analysis into an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. In addition, we extend our methods to online problem detection, where the sequences of events are continuously generated as data streams.
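
A minimal sketch of the anomaly-detection step this pipeline implies: represent each session or time window as a vector of message-type counts (the paper derives the message types via program analysis; here the count matrix is toy random data), then flag windows whose residual after projecting onto the dominant PCA subspace is large. The subspace dimension and the 3-sigma threshold are illustrative assumptions.

    import numpy as np

    # Toy stand-in: 500 sessions x 20 message types (counts built upstream).
    X = np.random.poisson(lam=3.0, size=(500, 20)).astype(float)

    # PCA-based detection: the top components span "normal" behaviour;
    # sessions with a large residual outside that subspace are suspicious.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 3                                    # assumed "normal" subspace size
    P = Vt[:k].T @ Vt[:k]                    # projector onto normal subspace
    residual = np.linalg.norm(Xc - Xc @ P, axis=1)
    threshold = residual.mean() + 3 * residual.std()
    anomalies = np.where(residual > threshold)[0]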

[Full Paper] [Discussion]




Paper ID: 903

The Role of Machine Learning in Business Optimization

Chid Apte (IBM T. J. Watson Research Center)

In a trend that reflects the increasing demand for intelligent applications driven by business data, IBM today is building out a significant number of applications that leverage machine learning technologies to optimize business process decisions. This talk highlights this trend and describes the many different ways in which leading-edge machine learning concepts are being utilized in business applications developed by IBM for its internal use and for clients.

[Full Paper] [Discussion]




Paper ID: 904

Music Plus One and Machine Learning

Christopher Raphael (Indiana University)

A system for musical accompaniment is presented in which a computer-driven orchestra follows and learns from a soloist in a concerto-like setting. The system is decomposed into three modules: the first computes a real-time score match using a hidden Markov model; the second generates the output audio by phase-vocoding a preexisting audio recording; the third provides a link between these two, by predicting future timing evolution using a Kalman filter-like model. Several examples are presented showing the system in action in diverse musical settings. Connections with machine learning are highlighted, showing current weaknesses and new possible directions.
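
The third module's timing prediction can be pictured with a small Kalman-filter sketch: track a local tempo and the next onset time, predict forward by each note's score duration, and correct when the soloist's onset is detected. The state layout, noise levels, and initial tempo below are illustrative assumptions, not the paper's exact model.

    import numpy as np

    def predict_onsets(score_durations, observed_onsets,
                       q_tempo=1e-4, r_obs=1e-3):
        """State x = [next onset time (s), local tempo (s per score beat)]."""
        t, tempo = observed_onsets[0], 0.5       # hypothetical initial guesses
        P = np.eye(2) * 0.1
        preds = []
        for d, obs in zip(score_durations, observed_onsets[1:]):
            F = np.array([[1.0, d], [0.0, 1.0]])         # t' = t + d * tempo
            x = F @ np.array([t, tempo])                  # predict step
            P = F @ P @ F.T + np.diag([0.0, q_tempo])     # tempo random walk
            preds.append(x[0])                            # predicted onset
            H = np.array([[1.0, 0.0]])                    # we observe onsets
            K = P @ H.T / (H @ P @ H.T + r_obs)           # Kalman gain
            x = x + (K * (obs - x[0])).ravel()            # correct step
            P = (np.eye(2) - K @ H) @ P
            t, tempo = x
        return preds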

[Full Paper] [Discussion]




Paper ID: 905

Climbing the Tower of Babel: Unsupervised Multilingual Learning

Benjamin Snyder (MIT); Regina Barzilay (MIT)

For centuries, scholars have explored the deep links among human languages. In this paper, we present a class of probabilistic models that use these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for the computational decipherment of lost languages.

[Full Paper] [Discussion]




Paper ID: 906

FAB-MAP: Appearance-Based Place Recognition and Mapping using a Learned Visual Vocabulary Model

Mark Cummins (University of Oxford); Paul Newman (University of Oxford)

We present an overview of FAB-MAP, an algorithm for place recognition and mapping developed for infrastructure-free mobile robot navigation in large environments. The system allows a robot to identify when it is revisiting a previously seen location, on the basis of imagery captured by the robot's camera. We outline a complete probabilistic framework for the task, which is applicable even in visually repetitive environments where many locations may appear identical. Our work introduces a number of technical innovations - notably we demonstrate that place recognition performance can be improved by learning an approximation to the joint distribution over visual elements. We also investigate several principled approaches to making the system robust in visually repetitive environments, and define an efficient bail-out strategy for multi-hypothesis testing to improve system speed. Our model has been shown to substantially outperform standard tf-idf ranking on our task of interest. We demonstrate the system performing reliable online appearance mapping and loop closure detection over a 1,000km trajectory, with mean filter update times of 14 ms.

[Full Paper] [Discussion]




Paper ID: 907

Discriminative Latent Variable Models for Object Detection

Pedro Felzenszwalb (University of Chicago); Ross Girshick (University of Chicago); David McAllester (Toyota Technological Institute at Chicago); Deva Ramanan (UC Irvine)

In this talk, I will discuss recent work by colleagues and myself on discriminative latent-variable models for object detection. Object recognition is one of the fundamental challenges of computer vision. We specifically consider the task of localizing and detecting instances of a generic object category, such as people or cars, in cluttered real-world images. Recent benchmark competitions such as the PASCAL Visual Object Challenge suggest our method is the state-of-the-art system for such tasks. This success, combined with publicly available code that runs orders of magnitude faster than comparable approaches, has turned our system into a standard baseline for contemporary research on object recognition.

[Full Paper] [Discussion]




Main Track

Paper ID: 16

Large Graph Construction for Scalable Semi-Supervised Learning

Wei Liu (Columbia University); Junfeng He; Shih-Fu Chang

In this paper, we address the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud. Critically, these anchor points enable nonparametric regression that predicts the label of each data point as a locally weighted average of the labels on anchor points. Because conventional graph construction is inefficient at large scale, we propose to construct a tractable large graph by coupling anchor-based label prediction and adjacency matrix design. In contrast to the Nyström approximation of adjacency matrices, which results in indefinite graph Laplacians and in turn leads to potential non-convex optimization over graphs, the proposed graph construction approach, based on a unique idea called AnchorGraph, provides nonnegative adjacency matrices that guarantee positive semidefinite graph Laplacians. Our approach scales linearly with the data size and in practice usually produces a large sparse graph. Experiments on large datasets demonstrate the significant accuracy improvement and scalability of the proposed approach.
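
A sketch of the AnchorGraph construction under stated assumptions: the factorization W = Z diag(Z^T 1)^{-1} Z^T is the construction the abstract refers to, but the paper optimizes local reconstruction weights for Z, whereas this sketch substitutes simple kernel weights over each point's s nearest anchors.

    import numpy as np

    def anchor_graph(X, anchors, s=3):
        """Nonnegative adjacency W = Z diag(Z^T 1)^{-1} Z^T from anchor weights Z."""
        n, m = X.shape[0], anchors.shape[0]
        d = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # n x m sq. dists
        Z = np.zeros((n, m))
        for i in range(n):
            idx = np.argsort(d[i])[:s]                      # s nearest anchors
            w = np.exp(-d[i, idx] / (d[i, idx].mean() + 1e-12))
            Z[i, idx] = w / w.sum()                         # rows sum to one
        lam_inv = 1.0 / Z.sum(axis=0).clip(1e-12)           # diag(Z^T 1)^{-1}
        W = Z @ (lam_inv[:, None] * Z.T)                    # nonnegative, so the
        return Z, W                                         # Laplacian is PSD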

[Full Paper] [Discussion]




Paper ID: 23

Boosting Classifiers with Tightened L0-Relaxation Penalties

Noam Goldberg (Technion); Jonathan Eckstein

We propose a novel boosting algorithm which improves on current algorithms for weighted voting classification by striking a better balance between classification accuracy and sparsity of the weight vector. In order to justify our optimization formulations, we first consider a novel integer linear program as a model for sparse classifier selection, generalizing the minimum disagreement halfspace problem, whose complexity has been investigated in computational learning theory. Specifically, our mixed integer problem is that of finding a separating hyperplane with minimum empirical error subject to an L0-norm penalty. We find that common soft margin linear programming formulations for robust classification are equivalent to a continuous relaxation of our model. Since the initial continuous relaxation is weak, we suggest a tighter relaxation, using novel cutting planes, to better approximate the integer solution. To solve this relaxation, we propose a new boosting algorithm based on linear programming with dynamic generation of variables and constraints. We demonstrate the classification performance of our proposed algorithm with experimental results, and justify our selection of parameters using a minimum description length, compression interpretation of learning.

[Full Paper] [Discussion]




Paper ID: 26

Variable Selection in Model-Based Clustering: To Do or To Facilitate

Leonard Poon (HKUST); Nevin Zhang (Hong Kong University of Science and Technology); Tao Chen; Yi Wang

Variable selection for cluster analysis is a difficult problem. The difficulty originates not only from the lack of class information but also the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such a case the effort to find one subset of attributes that presumably gives the ``best'' clustering may be misguided. It makes more sense to facilitate variable selection by domain experts, that is, to systematically identify various facets of a data set (each being based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of the Gaussian mixture model, show its ability to cluster data along multiple facets, and demonstrate it is often more reasonable to facilitate variable selection than to perform it.

[Full Paper] [Discussion]




Paper ID: 28

Modeling Interaction via the Principle of Maximum Causal Entropy

Brian Ziebart (Carnegie Mellon University); Drew Bagnell (CMU); Anind Dey (Carnegie Mellon University)

The principle of maximum entropy provides a powerful framework for statistical models of joint, conditional, and marginal distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy--an approach based on causally conditioned probabilities that can appropriately model the availability and influence of sequentially revealed side information. Using this principle, we derive models for sequential data with revealed information, interaction, and feedback, and demonstrate their applicability for statistically framing inverse optimal control and decision prediction tasks.

[Full Paper] [Discussion]




Paper ID: 35

Multi-Task Learning of Gaussian Graphical Models

Jean Honorio (Stony Brook University); Dimitris Samaras (Stony Brook University)

We present multi-task structure learning for Gaussian graphical models. We discuss uniqueness and boundedness of the optimal solution of the maximization problem. A block coordinate descent method leads to a provably convergent algorithm that generates a sequence of positive definite solutions. Thus, we reduce the original problem into a sequence of strictly convex $\ell_\infty$ regularized quadratic minimization subproblems. We further show that this subproblem leads to the continuous quadratic knapsack problem, for which very efficient methods exist. Finally, we show promising results in a dataset that captures brain function of cocaine addicted and control subjects under conditions of monetary reward.

[Full Paper] [Discussion]




Paper ID: 45

Spherical Topic Models

Joseph Reisinger (UT Austin); Austin Waters (UT Austin); Bryan Silverthorn (UT Austin); Raymond Mooney (University Of Texas At Austin)

We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary L2-normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocation (LDA), but models documents as points on a high-dimensional spherical manifold, allowing a natural likelihood parameterization in terms of cosine distance. Furthermore, unlike previous models, SAM topics are capable of assigning negative weight to terms and can model word absence as well as presence. Performance is evaluated both subjectively, as a topic model judged by human raters, and empirically, across several disparate classification tasks from natural language processing and computer vision.

[Full Paper] [Discussion]




Paper ID: 52

Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes

Marek Petrik (University of Massachusetts); Gavin Taylor (Duke); Ron Parr (Duke); Shlomo Zilberstein (University of Massachusetts Amherst)

Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause the existing algorithms to overfit because of the limited number of samples. We address this shortcoming using L_1 regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation of the approach shows that the proposed method performs well on simple MDPs and standard benchmark problems.

[Full Paper] [Discussion]




Paper ID: 76

Multi-agent Learning Experiments on Repeated Matrix Games

Bruno Bouzy (Université Paris Descartes); Marc Metivier (Université Paris Descartes)

This paper experimentally evaluates multi-agent learning algorithms playing repeated matrix games to maximize their cumulative return. Previous work found that Q-learning surpassed Nash-based multi-agent learning algorithms. Based on all-against-all repeated matrix game tournaments, this paper updates the state of the art of multi-agent learning experiments. In a first stage, it shows that M-Qubed, S, and bandit-based algorithms such as UCB are the best algorithms on general-sum games, with Exp3 being the best on cooperative games and zero-sum games. In a second stage, our experiments show that two features - forgetting the far past, and using recent history with states - improve the learning algorithms. Finally, the best algorithms are two new algorithms, Q-learning and UCB enhanced with the two features, and M-Qubed.
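
To make the two features concrete, here is a minimal sketch of a Q-learner on a repeated matrix game in which the state is the pair of previous actions ("recent history with states") and a constant step size discounts old experience ("forgetting the far past"). The payoff matrix, random opponent, and hyperparameters are illustrative assumptions, not the tournament setup.

    import numpy as np

    # Iterated prisoner's dilemma payoffs for the row player (hypothetical).
    R = np.array([[3, 0], [5, 1]])      # actions: 0 = cooperate, 1 = defect

    def q_learner(T=5000, alpha=0.1, eps=0.1, gamma=0.95,
                  rng=np.random.default_rng(0)):
        """State = (my previous action, opponent's previous action); the
        constant step size alpha implements forgetting of the far past."""
        Q = np.zeros((2, 2, 2))         # Q[my_prev, opp_prev, action]
        s, total = (0, 0), 0.0
        for _ in range(T):
            a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
            b = rng.integers(2)         # stand-in opponent: uniform random
            r = R[a, b]
            s2 = (a, b)
            Q[s + (a,)] += alpha * (r + gamma * Q[s2].max() - Q[s + (a,)])
            s, total = s2, total + r
        return Q, total / T             # policy and average per-round return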

[Full Paper] [Discussion]




Paper ID: 77

Probabilistic Backward and Forward Reasoning in Stochastic Relational Worlds

Tobias Lang (TU Berlin); Marc Toussaint (TU Berlin)

Inference in graphical models has emerged as a promising technique for planning. A recent approach to decision-theoretic planning in relational domains uses forward inference in dynamic Bayesian networks compiled from learned probabilistic relational rules. Inspired by work in non-relational domains with small state spaces, we derive a backpropagation method for such nets in relational domains starting from a goal state mixture distribution. We combine this with forward reasoning in a bidirectional two-filter approach. We perform experiments in a complex 3D simulated desktop environment with an articulated manipulator and realistic physics. Empirical results show that bidirectional probabilistic reasoning can lead to more efficient and accurate planning in comparison to pure forward reasoning.

[Full Paper] [Discussion]




Paper ID: 78

Causal filter selection in microarray data

Gianluca Bontempi (Université Libre de Bruxelles); Patrick Meyer (ULB)

The importance of bringing causality into play when designing feature selection methods is more and more acknowledged in the machine learning community. This paper proposes a filter approach based on information theory which aims to prioritise direct causal relationships in feature selection problems where the ratio between the number of features and the number of samples is high. This approach is based on the notion of interaction, which is shown to be informative about the relevance of an input subset as well as its causal relationship with the target. The resulting filter, called mIMR (min-Interaction Max-Relevance), is compared with state-of-the-art approaches. Classification results on 25 real microarray datasets show that the incorporation of causal aspects in the feature assessment is beneficial both for the resulting accuracy and stability. A toy example of causal discovery shows the effectiveness of the filter for identifying direct causal relationships.
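
For flavour, here is a greedy information-theoretic filter in the same family, as a sketch only: it scores candidates by estimated relevance minus a pairwise-redundancy proxy (mRMR-style). mIMR's distinguishing ingredient, the interaction term involving feature pairs and the target, is not computed here.

    import numpy as np
    from sklearn.feature_selection import (mutual_info_classif,
                                           mutual_info_regression)

    def greedy_mi_filter(X, y, k=10):
        """Greedy forward selection: maximize MI with the label minus the
        average MI with already-selected features (a redundancy proxy)."""
        relevance = mutual_info_classif(X, y, random_state=0)
        selected = [int(relevance.argmax())]
        while len(selected) < k:
            best, best_score = None, -np.inf
            for j in range(X.shape[1]):
                if j in selected:
                    continue
                redundancy = np.mean(mutual_info_regression(
                    X[:, selected], X[:, j], random_state=0))
                score = relevance[j] - redundancy
                if score > best_score:
                    best, best_score = j, score
            selected.append(best)
        return selected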

[Full Paper] [Discussion]




Paper ID: 87

A Conditional Random Field for Multi-Instance Learning

Thomas Deselaers (ETH Zurich); Vittorio Ferrari (ETH Zurich)

We present MI-CRF, a conditional random field (CRF) model for multiple instance learning (MIL). MI-CRF models bags as nodes in a CRF with instances as their states. It combines discriminative unary instance classifiers and pairwise dissimilarity measures. We show that both forces improve the classification performance. Unlike other approaches, MI-CRF considers all bags jointly during training as well as during testing. This makes it possible to classify test bags in an imputation setup. The parameters of MI-CRF are learned using constraint generation. Furthermore, we show that MI-CRF can incorporate previous MIL algorithms to improve on their results. MI-CRF obtains competitive results on five standard MIL datasets.

[Full Paper] [Discussion]




Paper ID: 99

Supervised Aggregation of Classifiers using Artificial Prediction Markets

Nathan Lay (Florida State University); Adrian Barbu (Florida State University)

Prediction markets are used in real life to predict outcomes of interest such as presidential elections. In this work we introduce a mathematical theory for Artificial Prediction Markets for supervised classifier aggregation and probability estimation. We introduce the artificial prediction market as a novel way to aggregate classifiers. We derive the market equations to enforce total budget conservation, show the market price uniqueness and give efficient algorithms for computing it. We show how to train the market participants by updating their budgets using training examples. We introduce classifier specialization as a new differentiating characteristic between classifiers. Finally, we present experiments using random decision rules as specialized classifiers and show that the prediction market consistently outperforms Random Forest on real and synthetic data of varying degrees of difficulty.
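
A sketch of the constant-betting special case under stated assumptions: each participant bets a fixed fraction of its budget on its predicted class distribution, the market price is the budget-weighted average of the participants' probabilities, and winners are paid in proportion to one over the price of the realized class, which conserves the total budget. The interface prob_fn and the learning rate eta are hypothetical.

    import numpy as np

    def market_aggregate(probs, budgets):
        """Price per class = budget-weighted average of participants' probs."""
        return budgets @ probs / budgets.sum()

    def train_market(prob_fn, data, labels, n_participants, eta=0.1):
        """prob_fn(x) -> (n_participants, n_classes) array of class probs."""
        budgets = np.ones(n_participants)
        for x, y in zip(data, labels):
            probs = prob_fn(x)
            price = market_aggregate(probs, budgets)
            bets = eta * budgets                       # constant betting
            # Payout probs[:, y] / price[y] per unit bet keeps sum(budgets)
            # constant, mirroring the total budget conservation in the paper.
            budgets = budgets - bets + bets * probs[:, y] / price[y]
        return budgets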

[Full Paper] [Discussion]




Paper ID: 100

3D Convolutional Neural Networks for Human Action Recognition

Shuiwang Ji (Arizona State University); Wei Xu; Ming Yang; Kai Yu (NEC Research Princeton)

We consider the fully automated recognition of actions in uncontrolled environments. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.
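
The core operation is easy to state in code. Below is a plain 3D convolution (cross-correlation, as in CNNs) over a stack of frames; the frame count and kernel shape are illustrative, and a full model would stack several such layers with learned kernels.

    import numpy as np

    def conv3d_valid(volume, kernel):
        """'Valid' 3D convolution of a stacked-frames volume (T, H, W)
        with a spatio-temporal kernel (t, h, w)."""
        T, H, W = volume.shape
        t, h, w = kernel.shape
        out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                for k in range(out.shape[2]):
                    out[i, j, k] = (volume[i:i+t, j:j+h, k:k+w] * kernel).sum()
        return out

    # Seven stacked 60x40 frames with a 3x7x7 kernel -> a 5x54x34 feature map,
    # so each response pools motion information across three adjacent frames.
    frames = np.random.rand(7, 60, 40)
    feature_map = np.tanh(conv3d_valid(frames, np.random.randn(3, 7, 7)))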

[Full Paper] [Discussion]




Paper ID: 107

Asymptotic Analysis of Generative Semi-Supervised Learning

Joshua Dillon (Georgia Institute of Technology); Krishnakumar Balasubramanian (Georgia Institute of Technology); Guy Lebanon (Georgia Tech)

Semi-supervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP.

[Full Paper] [Discussion]




Paper ID: 115

Restricted Boltzmann Machines are Hard to Approximately Evaluate or Simulate

Phil Long (Google); Rocco Servedio (Columbia)

Restricted Boltzmann Machines (RBMs) are a type of probability model over the Boolean cube {-1,1}^n that have recently received much attention. We establish the intractability of two basic computational tasks involving RBMs, even if only a coarse approximation to the correct output is required. We first show that, assuming P != NP, for any fixed positive constant K (which may be arbitrarily large) there is no polynomial-time algorithm for the following problem: given an n-bit input string x and the parameters of an RBM M, output an estimate of the probability assigned to x by M that is accurate to within a multiplicative factor of e^{Kn}. This hardness result holds even if the parameters of M are constrained to be at most ψ(n) for any function ψ(n) = ω(n), and if the number of hidden nodes of M is at most n. We then show that, assuming RP != NP, there is no polynomial-time randomized algorithm for the following problem: given the parameters of an RBM M, generate a random example from a probability distribution whose total variation distance from the distribution defined by M is at most 1/12.

[Full Paper] [Discussion]




Paper ID: 117

Learning from Noisy Side Information by Generalized Maximum Entropy Model

Tianbao Yang (Michigan State University); Rong Jin (Michigan State University); Anil Jain (Michigan State University)

We consider the problem of learning from noisy side information in the form of pairwise constraints. Although many algorithms have been developed to learn from side information, most of them assume perfect pairwise constraints. Since pairwise constraints are often extracted from data sources such as paper citations, they tend to be noisy and inaccurate. In this paper, we introduce a generalization of the maximum entropy model and propose a framework for learning from noisy side information based on it. Our theoretical analysis shows that, under certain assumptions, the classification model trained from the noisy side information can be very close to the one trained from perfect side information. Extensive empirical studies verify the effectiveness of the proposed framework.

[Full Paper] [Discussion]




Paper ID: 119

Finding Planted Partitions in Nearly Linear Time using Arrested Spectral Clustering

Nader Bshouty (Technion); Phil Long (Google)

We describe an algorithm for clustering using a similarity graph. The algorithm (a) runs in O(n log^3 n + m log n) time on graphs with n vertices and m edges, and (b) with high probability, finds all "large enough" clusters in a random graph generated according to the planted partition model. We provide lower bounds that imply that our "large enough" constraint cannot be improved much, even using a computationally unbounded algorithm. We describe some experiments running the algorithm and a few related algorithms on random graphs with partitions generated using a Chinese Restaurant Process, and some results of applying the algorithm to cluster DBLP titles.
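
To ground the setting, here is a sketch of the planted partition model together with a plain spectral clustering baseline; the paper's contribution is an "arrested" spectral method with near-linear running time and recovery guarantees on top of this basic idea, which the sketch does not reproduce.

    import numpy as np

    def planted_partition(sizes, p=0.7, q=0.2, rng=np.random.default_rng(0)):
        """Random graph: edge probability p inside blocks, q across (p > q)."""
        labels = np.repeat(np.arange(len(sizes)), sizes)
        prob = np.where(labels[:, None] == labels[None, :], p, q)
        A = np.triu((rng.random(prob.shape) < prob).astype(float), 1)
        return A + A.T, labels

    def spectral_clusters(A, k, iters=20, rng=np.random.default_rng(1)):
        """Embed vertices with the top-k eigenvectors of A, then run k-means."""
        vals, vecs = np.linalg.eigh(A)
        E = vecs[:, np.argsort(vals)[-k:]]
        C = E[rng.choice(len(E), k, replace=False)]
        for _ in range(iters):
            z = ((E[:, None] - C[None]) ** 2).sum(-1).argmin(1)
            C = np.array([E[z == j].mean(0) if (z == j).any() else C[j]
                          for j in range(k)])
        return z

    A, truth = planted_partition([60, 60, 80])
    found = spectral_clusters(A, 3)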

[Full Paper] [Discussion]




Paper ID: 123

The Elastic Embedding Algorithm for Dimensionality Reduction

Miguel Carreira-Perpinan (UC Merced)

We propose a new dimensionality reduction method, the elastic embedding (EE), that optimises an intuitive, nonlinear objective function of the low-dimensional coordinates of the data. The method reveals a fundamental relation between a spectral method, Laplacian eigenmaps, and a nonlinear method, stochastic neighbour embedding, and shows that EE can be seen as learning both the coordinates and the affinities between data points. We give a homotopy method to train EE, characterise the critical value of the homotopy parameter, and study the method's behaviour. For a fixed homotopy parameter, we give a globally convergent iterative algorithm that is very effective and requires no user parameters. Finally, we give an extension to out-of-sample points. On standard datasets, EE obtains results as good as or better than those of SNE, but more efficiently and robustly.
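
The EE objective combines an attractive graph term and a repulsive term, E(X) = sum_nm W+_nm ||x_n - x_m||^2 + lambda * sum_nm W-_nm exp(-||x_n - x_m||^2). Below is one plain gradient step on this objective as a sketch; the paper instead trains with a homotopy over lambda and a globally convergent fixed-point iteration.

    import numpy as np

    def ee_grad_step(X, Wp, Wn, lam, step=0.1):
        """One gradient step on the elastic embedding objective; Wp and Wn are
        symmetric nonnegative attraction/repulsion weight matrices."""
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        Weff = Wp - lam * Wn * np.exp(-D2)        # attraction minus repulsion
        L = np.diag(Weff.sum(1)) - Weff           # Laplacian of effective weights
        grad = 4.0 * L @ X                        # gradient of E w.r.t. X
        return X - step * grad

Gradually increasing lam while re-optimising the coordinates mimics the homotopy path the abstract describes.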

[Full Paper] [Discussion]




Paper ID: 125

Two-Stage Learning Kernel Algorithms

Corinna Cortes (Google); Mehryar Mohri (Courant Institute, NYU); Afshin Rostamizadeh (Courant Institute, NYU)

This paper examines two-stage techniques for learning kernels based on a notion of alignment. It presents a number of novel theoretical, algorithmic, and empirical results for alignment-based techniques. Our results build on previous work by Cristianini et al. (2001), but we adopt a different definition of kernel alignment and significantly extend that work in several directions: we give a novel and simple concentration bound for alignment between kernel matrices; show the existence of good predictors for kernels with high alignment, both for classification and for regression; give algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP; and report the results of extensive experiments with this alignment-based method in classification and regression tasks, which show an improvement both over the uniform combination of kernels and over other state-of-the-art learning kernel methods.
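
The centered alignment the paper adopts is short to write down. Here is a sketch of stage one of a two-stage method: compute each base kernel's alignment with the target kernel yy^T and use those alignments as combination weights. This is the simplest variant; the paper's alignf method instead solves a QP for the maximum-alignment weights.

    import numpy as np

    def center(K):
        """Center a kernel matrix: Kc = H K H with H = I - (1/n) * ones."""
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def alignment(K1, K2):
        """Centered alignment <K1c, K2c>_F / (||K1c||_F ||K2c||_F)."""
        K1c, K2c = center(K1), center(K2)
        return (K1c * K2c).sum() / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

    def align_weights(kernels, y):
        """Weight each base kernel by its alignment with the target yy^T."""
        Ky = np.outer(y, y)
        return np.array([alignment(K, Ky) for K in kernels])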

[Full Paper] [Discussion]




Paper ID: 132

Robust Graph Mode Seeking by Graph Shift

Hairong Liu (NUS); Shuicheng Yan (National University of Singapore)

In this paper, we study how to robustly compute the modes of a graph, namely the dense subgraphs, which characterize the underlying compact patterns and are thus useful for many applications. We first define the modes based on a graph density function, then propose the graph shift algorithm, which starts from each vertex and iteratively shifts towards the nearest mode of the graph along a certain trajectory. Both theoretical analysis and experiments show that the graph shift algorithm is very efficient and robust, especially when there exist large amounts of noise and outliers.

[Full Paper] [Discussion]




Paper ID: 137

Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi Supervised Learning

Matan Gavish (Stanford University); Boaz Nadler (Weizmann Institute Of Science); Ronald Coifman (Yale University)

Harmonic analysis, and in particular the relation between function smoothness and approximate sparsity of its wavelet coefficients, has played a key role in signal processing and statistical inference for low-dimensional data. In contrast, harmonic analysis has thus far had little impact on modern problems involving high-dimensional data, or data encoded as graphs or networks. The main contribution of this paper is the development of a harmonic analysis approach, including both learning algorithms and supporting theory, applicable to these more general settings. Given data (be it high dimensional, a graph, or a network) that is represented by one or more hierarchical trees, we first construct multiscale wavelet-like orthonormal bases on it. Second, we prove that, in analogy to the Euclidean case, function smoothness with respect to a specific metric induced by the tree is equivalent to an exponential rate of coefficient decay, that is, to approximate sparsity. These results readily translate to simple practical algorithms for various learning tasks. We present an application to transductive semi-supervised learning.

[Full Paper] [Discussion]




Paper ID: 149

Deep Supervised T-Distributed Embedding

Renqiang Min (DCS, University of Toronto); Laurens van der Maaten (Tilburg University); Zineng Yuan (University of Toronto); Anthony Bonner (University of Toronto); Zhaolei Zhang (University of Toronto)

Deep learning has been successfully applied to learn non-linear feature mappings and to perform dimensionality reduction. In this paper, we present supervised embedding techniques that use a deep neural network to collapse classes. The network is pre-trained using a stack of Restricted Boltzmann Machines (RBMs), and finetuned using approaches that try to collapse classes. The finetuning is inspired by ideas from Neighborhood Components Analysis (NCA), but it uses a Student t-distribution to model the probabilities of pairwise data points belonging to the same class in the embedding. We investigate two types of objective functions: deep t-distributed MCML (dt-MCML) and deep t-distributed NCA (dt-NCA). Our experiments on two handwritten digit datasets reveal the strong performance of dt-MCML in supervised parametric data visualization, whereas dt-NCA outperforms alternative techniques when embeddings with more than two or three dimensions are constructed, e.g., to obtain good classification performance. Overall, our results demonstrate the advantage of using a deep architecture and a heavy-tailed t-distribution for measuring pairwise similarities in supervised embedding.

[Full Paper] [Discussion]




Paper ID: 168

A Nonparametric Information Theoretic Clustering Algorithm

Lev Faivishevsky (Bar Ilan University); Jacob Goldberger (Bar Ilan University)

In this paper we propose a novel clustering algorithm based on maximizing the mutual information between data points and clusters. Unlike previous methods, we neither assume the data are given in terms of distributions nor impose any parametric model on the within-cluster distribution. Instead, we utilize a non-parametric estimation of the average cluster entropies and search for a clustering that maximizes the estimated mutual information between data points and clusters. The improved performance of the proposed algorithm is demonstrated on several standard datasets.
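
A sketch of the underlying objective under stated assumptions: the cluster entropies are estimated with the smooth MeanNN estimator (mean log pairwise distance, up to an additive constant), and a clustering is sought that maximizes the estimated mutual information. The greedy single-point reassignment loop is a simple stand-in for the paper's optimization.

    import numpy as np

    def meannn_entropy(Xc):
        """MeanNN estimate, up to an additive constant:
        H ~ d/(m(m-1)) * sum over i != j of log ||x_i - x_j||."""
        m, d = Xc.shape
        D = np.linalg.norm(Xc[:, None] - Xc[None], axis=-1)
        iu = np.triu_indices(m, 1)
        return d / (m * (m - 1)) * 2.0 * np.log(D[iu] + 1e-12).sum()

    def mi_objective(X, z, k):
        """Only the conditional term of I(X;C) = H(X) - sum_c p_c H(X|c)
        depends on the labels z, so we return its negative."""
        total = 0.0
        for c in range(k):
            Xc = X[z == c]
            if len(Xc) < 2:
                return -np.inf
            total -= (len(Xc) / len(X)) * meannn_entropy(Xc)
        return total

    def cluster(X, k=3, sweeps=5, rng=np.random.default_rng(0)):
        z = rng.integers(k, size=len(X))
        for _ in range(sweeps):            # greedy single-point reassignment
            for i in range(len(X)):
                scores = [mi_objective(
                    X, np.concatenate([z[:i], [c], z[i+1:]]), k)
                    for c in range(k)]
                z[i] = int(np.argmax(scores))
        return z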

[Full Paper] [Discussion]




Paper ID: 170

Gaussian Process Change Point Models

Yunus Saatci (University of Cambridge); Ryan Turner; Carl Rasmussen

We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations within a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with its own advantages. We present methods to reduce the computational burden of these models and demonstrate them on several real-world data sets.

[Full Paper] [Discussion]




Paper ID: 175

Dynamical Products of Experts for Modeling Financial Time Series

Yutian Chen (Univ. of California, Irvine); Max Welling (University of California, Irvine)

Predicting the "Value at Risk" of a portfolio of stocks is of great significance in quantitative finance. We introduce a new class of models, "dynamical products of experts", that treats the latent process over volatilities as an inverse Gamma process. We show that our multivariate volatility models significantly outperform all related GARCH and stochastic volatility models in popular use in the quantitative finance community.

[Full Paper] [Discussion]




Paper ID: 176

The Margin Perceptron with Unlearning

Constantinos Panagiotakopoulos (Aristotle University of Thessaloniki); Petroula Tsampouka (Aristotle University of Thessaloniki)

We introduce into the classical perceptron algorithm with margin a mechanism of unlearning which, in the course of the regular updates, allows for a reduction of possible contributions to the weight vector from very well classified patterns. The resulting incremental classification algorithm, called Margin Perceptron with Unlearning (MPU), provably converges in a finite number of updates to any desired approximation, chosen before running, of either the maximal-margin or the optimal 1-norm soft-margin solution. Moreover, an experimental comparative evaluation involving representative linear Support Vector Machines reveals that the MPU algorithm is very competitive.
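
A schematic reading of the abstract, not the paper's actual update rule: alongside the usual margin-violation update, patterns classified with excessive margin have part of their contribution removed from the weight vector. The two thresholds and the unlearning rate below are illustrative assumptions.

    import numpy as np

    def mpu_sketch(X, y, margin=1.0, upper=2.0, eta=0.1, epochs=100):
        """Perceptron with margin plus an unlearning step (sketch only)."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            changed = False
            for xi, yi in zip(X, y):
                m = yi * (w @ xi)
                if m <= margin:          # regular margin update (learn)
                    w += yi * xi
                    changed = True
                elif m > upper:          # damp over-confident patterns (unlearn)
                    w -= eta * yi * xi
                    changed = True
            if not changed:
                break
        return w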

[Full Paper] [Discussion]




Paper ID: 178

Sequential Projection Learning for Hashing with Compact Codes

Jun Wang (Columbia University); Sanjiv Kumar (Google Research, NY); Shih-Fu Chang

Hashing based Approximate Nearest Neighbor (ANN) search has attracted much attention due to its fast query time and drastically reduced storage. However, most of the hashing methods either use random projections or extract principal directions from the data to derive hash functions. The resulting embedding suffers from poor discrimination when compact codes are used. In this paper, we propose a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially. The proposed method easily adapts to both unsupervised and semi-supervised scenarios and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points.

[Full Paper] [Discussion]




Paper ID: 179

Generalization Bounds for Learning Kernels

Corinna Cortes (Google); Mehryar Mohri (Courant Institute, NYU); Afshin Rostamizadeh (Courant Institute, NYU)

This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L_1 regularization admits only a \sqrt{\log p} dependency on the number of kernels, which is *tight* and considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a non-negative combination of p base kernels with an L_2 regularization whose dependency on p is also *tight* and only in p^{1/4}. We present similar results for L_q regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds.

[Full Paper] [Discussion]




Paper ID: 180

Modeling Transfer Learning in Human Categorization with the Hierarchical Dirichlet Process

Kevin Canini (University of California); Mikhail Shashkov (University of California, Berkeley); Tom Griffiths (Berkeley)

Transfer learning can be described as the distillation of abstract knowledge from one learning domain or task and the reuse of that knowledge in a related domain or task. In categorization settings, transfer learning is the modification by past experience of prior expectations about what types of categories are likely to exist in the world. While transfer learning is an important and active research topic in machine learning, there have been few studies of transfer learning in human categorization. We propose an explanation for transfer learning effects in human categorization, implementing a model from the statistical machine learning literature -- the hierarchical Dirichlet process (HDP) -- to make empirical evaluations of its ability to explain these effects. We present two laboratory experiments which measure the degree to which people engage in transfer learning in a controlled setting, and we compare our model to their performance. We find that the HDP provides a good explanation for transfer learning exhibited by human learners.

[Full Paper] [Discussion]




Paper ID: 187

Convergence of Least Squares Temporal Difference Methods Under General Conditions

Huizhen Yu (Univ. of Helsinki)

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD($\lambda$). We establish for the discounted cost criterion that the off-policy LSTD($\lambda$) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm, and based on them, we suggest a modification in its practical implementation. Our analysis uses theories of both finite space Markov chains and Markov chains on topological spaces.
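
For reference, a minimal on-policy LSTD(lambda) sketch from a single trajectory; the paper's off-policy setting would additionally weight the traces by importance ratios, which is omitted here, and the small ridge term is an illustrative assumption for invertibility.

    import numpy as np

    def lstd_lambda(trajectory, phi, gamma=0.99, lam=0.7):
        """Solve A theta = b with A = sum_t z_t (phi_t - gamma*phi_{t+1})^T,
        b = sum_t z_t r_t, and eligibility trace z_t = gamma*lam*z_{t-1} + phi_t.
        'trajectory' holds (state, reward, next_state) triples; 'phi' maps a
        state to its feature vector."""
        d = phi(trajectory[0][0]).size
        A, b, z = np.zeros((d, d)), np.zeros(d), np.zeros(d)
        for s, r, s2 in trajectory:
            f, f2 = phi(s), phi(s2)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f2)
            b += z * r
        return np.linalg.solve(A + 1e-8 * np.eye(d), b)   # value weights theta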

[Full Paper] [Discussion]
