
How to combine multiple notions of relevance in search?

by Michael Nielsen on February 4, 2012

In earlier posts I’ve described two different ways we can assess how relevant a given webpage is to a search query: (1) the cosine similarity measure; and (2) the PageRank, which is a query-independent measure of the importance of a page. While it’s good that we have multiple insights into what makes a webpage relevant, it also gives rise to a problem: how should we combine these two measures to determine the ranking of a particular webpage for a particular search query?

In fact, in practice the problem gets much more complex than this, because there are more than two useful notions of search relevance. For instance, we’d surely also wish to incorporate a relevance measure that quantifies how likely or unlikely a given page is to be spam. By doing that we could ensure that pages which are likely to be spam receive a much lower ranking. We might also wish to incorporate a measure which ranks a page based on how close together the words in the search query are on that page. Once you start to think about it, we humans combine a very large number of factors when assessing the usefulness of a webpage. This is reflected in the fact that, according to Google, their search engine combines not just two or three measures of relevance but more than 200 measures of relevance. How should we best combine all these multiple measures in order to determine how relevant a page is to a given query?

You might think that the right approach would be to think hard about the meaning of measures like cosine similarity and PageRank, and then on the basis of that understanding, to figure out optimal ways of combining those measures. This approach is certainly worth pursuing, but it suffers from a problem: it doesn’t scale very well. Even if you come up with a good way of combining cosine similarity and PageRank, how would you combine 200 different measures? It’s not so obvious. And if you decide to trial the addition of a 201st measure of relevance, how exactly should you incorporate it into your algorithm, and how should you check to see whether or not it improves search results?

In this post, I’ll describe an approach to combining multiple measures of relevance that doesn’t require us to consider the details of the individual measures. Instead, the procedure I describe lets the machine automatically learn how to combine different measures. It does this with the help of a set of training data, where humans have ranked some set of webpages according to their relevance to some set of training queries. The idea is to figure out the best way of combining the measures of relevance in order to reproduce the results of the training data. Whatever method of combination is found is then applied more broadly, to all queries, and all webpages. The big advantage of this machine learning approach is that it lets us easily combine many different notions of search relevance. But it also has some drawbacks, as we’ll see.

The post is based principally on Chapter 15 of the book about information retrieval by Manning, Raghavan, and Schütze. The book is also available for free on the web.

Originally, I intended this post to be a mix of theory and working code to illustrate how the theory works in practice. This is the style I’ve used for many earlier posts, and is the style I intend to use whenever possible. However, I ran into a problem when I attempted to do that for the current post. The problem was that if I wanted to construct interesting examples (and ask interesting questions about those examples), I needed to add a lot of extra context in order for things to make sense. It would have tripled (or more) the length of an already long post. I ultimately decided the extra overhead wasn’t worth the extra insight. Instead, the post focuses on the theory. However, I have included a few pointers to libraries which make it easy to construct your own working code, if you’re so inclined. At some point I expect I’ll come back to this question, in a context where it makes much more sense to include working code.

As usual, I’ll finish the introduction with the caveat that I’m not an expert on any of this. I’m learning as I go, and there may be mistakes or misunderstandings in the post. Still, I hope the post is useful. At some point, I’ll stop adding this caveat to my posts, but I’m still a long way from being expert enough to do that! Also as per usual, the post will contain some simple exercises for the reader, some slightly harder problems, and also some problems for the author, which are things I’d like to understand better.

General approach

In this section, I’ll work through some simple hypothetical examples, using them to build up a set of heuristics about how to rank webpages. These heuristics will ultimately suggest a complete algorithm for learning how to rank webpages from a set of training data.

One small point about nomenclature: I’m going to switch from talking about “webpages” (or “pages”), and start referring instead to “documents”. In part this is because the document terminology is more standard. But it’s also because the techniques apply more broadly than the web.

To keep things simple and concrete in our hypothetical examples, we’ll assume that for any given query and document we have just two different measures of relevance, let’s say the cosine similarity and the PageRank (or some other similar measure, if we’re working with documents not from the web). We’ll call these features. And so for any given query $q$ and document $d$ we have a feature vector:

$\vec{\psi}(d, q) = \left( \text{cosine similarity of } d \text{ and } q, \ \text{PageRank of } d \right).$

Actually, the PageRank of a document doesn’t depend on the query $q$, but in general we’ll allow the features in the feature vector to depend on the query. Also, in general there will be more than two features, and so the feature vector might have quite a few components. But it turns out that the generalization beyond the two-dimensional case is straightforward, so we’ll stick with two dimensions for now.
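To make the idea of a feature vector concrete, here’s a minimal sketch in Python of how $\vec{\psi}(d, q)$ might be computed. The cosine similarity here is a bare-bones term-count version, and the PageRank values are simply looked up from a precomputed table; both are stand-ins for whatever implementations you actually use, and the document, query and numbers in the usage example are made up for illustration.

    import math
    from collections import Counter

    def cosine_similarity(document_text, query_text):
        """Cosine similarity between simple term-count vectors (illustrative only)."""
        doc_counts = Counter(document_text.lower().split())
        query_counts = Counter(query_text.lower().split())
        dot = sum(doc_counts[t] * query_counts[t] for t in query_counts)
        doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
        query_norm = math.sqrt(sum(c * c for c in query_counts.values()))
        if doc_norm == 0 or query_norm == 0:
            return 0.0
        return dot / (doc_norm * query_norm)

    def feature_vector(document, query, pagerank):
        """The two-component feature vector psi(d, q) = (cosine similarity, PageRank)."""
        return [cosine_similarity(document["text"], query), pagerank[document["id"]]]

    # Hypothetical usage, with made-up PageRank values:
    pagerank = {"doc1": 0.004, "doc2": 0.020}
    doc = {"id": "doc1", "text": "an introduction to the PageRank algorithm"}
    print(feature_vector(doc, "pagerank algorithm", pagerank))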

Our broad goal can now be restated in the language of feature vectors. What we want is an algorithm which combines the different components of the feature vector $\vec{\psi}(d, q)$ in order to rank the document $d$ for any particular query $q$, and which does a pretty good job of that ranking.

To get some insight into how we should solve this problem, let’s suppose we have an extremely simple set of training data. We’ll suppose a human operator has ranked three documents $d_1, d_2, d_3$ for just a single query $q$. Of course, real training data will need to involve many more documents and queries, but we can learn a lot by starting with this. We have three feature vectors for these three cases, $\vec{\psi}(d_1, q), \vec{\psi}(d_2, q)$ and $\vec{\psi}(d_3, q)$. I’ve illustrated these in the following diagram – to avoid clutter, rather than write $\vec{\psi}(d_j, q)$ repeatedly in the diagram, I’ve labelled each vector simply by its document number, $j$, as well as (in parentheses) by its order of relevance as ranked by a human operator:

[Figure: the three feature vectors plotted against the two feature axes (cosine similarity and PageRank), each labelled by its document number and, in parentheses, its human-assigned rank.]

There are a few reasonable observations:

  • If a feature vector $\vec{\psi}(d_j, q)$ is up and to the right of $\vec{\psi}(d_k, q)$ then it’s better along both axes (PageRank and cosine similarity). It seems reasonable to conclude that $d_j$ is a better result than $d_k$.
  • Conversely, if $\vec{\psi}(d_j, q)$ is down and to the left of $\vec{\psi}(d_k, q)$, then $d_j$ should always be ranked below $d_k$.
  • The hard cases are when $\vec{\psi}(d_j, q)$ is up and to the left of $\vec{\psi}(d_k, q)$, or down and to the right, in which case it’s not entirely clear which document should be ranked higher.

Note, by the way, that I don’t want to claim that these observations are “provable” in any way: they’re just reasonable observations, at least for the particular features (cosine similarity and PageRank) that we’re using. The idea here is simply to figure out some reasonable heuristics which we will eventually combine to suggest an algorithm for learning how to rank webpages.

With the observations above as motivation we’ll adopt the heuristic that it is the vector of feature differences $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ that determines whether $d_j$ should be ranked above $d_k$, or vice versa. In other words, we’re going to require that the relative ranking of $d_j$ and $d_k$ is a function of the components of $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$, and not of either individual feature vector alone. So the problem now becomes: how should we use $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ to rank $d_j$ and $d_k$? A clue is provided by looking at the vectors of feature differences for our training data:

[Figure: the feature difference vectors for the training data, each marked with a filled-in oval or a star.]

I haven’t labelled the vectors explicitly with $d_j$ and $d_k$, but it’s pretty easy to figure out which is which, if you look carefully. Instead, I’ve labelled each feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ with a filled-in oval when $d_j$ is ranked better than $d_k$, and with a star when $d_k$ is ranked better than $d_j$. Examining the vectors carefully, we see that the vectors with an oval all “point the same way” as one another, while the vectors with a star also all point the same way as one another, but in the opposite direction. We can make this intuition more precise by saying that the two sets of vectors are separated into two half-spaces by a line:

[Figure: the same feature difference vectors, with a line through the origin separating the oval-labelled vectors from the star-labelled vectors.]

So one way we could determine whether a feature difference vector is labelled by an oval or a star is simply by determining which half-space it is in, i.e., by determining which side of the line it’s on. Or to reformulate it in our original terms: we can tell whether a webpage $d_j$ ranks higher than a webpage $d_k$ simply by determining which half-space the feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ is in.

In higher-dimensional feature spaces this idea generalizes to separating the feature difference vectors into two half-spaces which are on opposite sides of a separating hyperplane. A convenient way of specifying this separating hyperplane, in any number of dimensions, is to introduce a normal vector $\vec{w}$ to the hyperplane. I’ve illustrated such a normal vector below for two dimensions, but you should imagine that we’re working in higher dimensions:

[Figure: a separating line with a normal vector $\vec{w}$.]

The condition for a feature difference vector to be in (say) the upper half-space is that $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)) > 0$. To be in the lower half-space the condition is that $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)) < 0$. Summing everything up, we can check whether the document $d_j$ should be ranked above or below $d_k$ simply by computing the sign of $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q))$.
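In code, this check is just a dot product and a sign test. Here’s a minimal sketch, assuming we somehow already have a suitable normal vector $\vec{w}$ (finding one is the subject of the rest of the post); the weight vector and feature values are made up for illustration.

    def ranks_above(w, psi_j, psi_k):
        """True if document j should rank above document k, i.e. if the
        feature difference vector lies in the upper half-space."""
        diff = [a - b for a, b in zip(psi_j, psi_k)]
        return sum(wi * di for wi, di in zip(w, diff)) > 0

    # Hypothetical weight vector and feature vectors (cosine similarity, PageRank):
    w = [1.0, 2.5]
    psi_j = [0.7, 0.010]
    psi_k = [0.4, 0.012]
    print(ranks_above(w, psi_j, psi_k))   # True: j's cosine-similarity advantage wins out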

The above observations suggest an algorithm for using the feature difference vectors to determine search relevance. It's a pretty obvious generalization of what I've just described, but at the risk of repeating myself I'll write it all out explicitly.

To start, we assume that a set of training data has been provided by human operators. Those operators have been given a set of training queries $q_1, q_2, \ldots$ and training documents $d_1, d_2, \ldots$. For each training query they've ranked each training document in order of relevance to that query. They might decide, for example, that for the query $q_1$ one particular document should be the top-ranked document, another the second-ranked document, and so on. This training data provides us with a whole lot of feature difference vectors $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$. These vectors can be divided up into two sets. The first set, which we'll call training data set $A$, contains those feature difference vectors for which $d_j$ has been ranked more highly than $d_k$. The second set, which we'll call training data set $B$, contains those feature difference vectors for which $d_k$ has been ranked more highly than $d_j$. We then find a separating hyperplane that separates these two data sets, i.e., with the upper half-space containing data set $A$ (where $d_j$ is ranked more highly than $d_k$ for query $q$), and a lower half-space containing data set $B$.

Suppose now that $q$ is any query – not necessarily a training query. And suppose $d_j$ and $d_k$ are any two documents, not just training documents. We rank $d_j$ above $d_k$ for query $q$ if the feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ lies in the upper half-space. And we rank $d_j$ below $d_k$ if it lies in the lower half-space. This completes the description of the algorithm.
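Here’s a sketch of the whole procedure in Python. It uses scikit-learn’s LinearSVC purely as a stand-in for the “find a separating hyperplane” step (support vector machines are discussed in the next section), and all of the training data is made up. The two essential points are the construction of the labelled feature difference vectors from the human rankings, and the fact that once we have $\vec{w}$ we can rank any documents for any query by sorting on $\vec{w} \cdot \vec{\psi}(d, q)$, which gives the same ordering as the pairwise half-space test (that’s essentially the first problem below).

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_training_data(ranked_feature_vectors):
        """Build labelled feature difference vectors from human rankings.

        ranked_feature_vectors: for each training query, the feature vectors of
        the training documents, listed from most to least relevant as judged
        by a human operator."""
        X, y = [], []
        for vectors in ranked_feature_vectors:
            for j in range(len(vectors)):
                for k in range(j + 1, len(vectors)):
                    diff = np.array(vectors[j]) - np.array(vectors[k])
                    X.append(diff)    # d_j ranked above d_k: data set A
                    y.append(+1)
                    X.append(-diff)   # the reversed pair: data set B
                    y.append(-1)
        return np.array(X), np.array(y)

    # Made-up training data: one query, three documents, features (cos sim, PageRank),
    # listed in the order the human ranked them.
    training = [
        [[0.9, 0.010], [0.5, 0.015], [0.3, 0.002]],
    ]
    X, y = pairwise_training_data(training)

    # Find a separating hyperplane through the origin, as in this section.
    model = LinearSVC(C=10.0, fit_intercept=False)
    model.fit(X, y)
    w = model.coef_[0]

    # Rank arbitrary documents for an arbitrary query by sorting on w . psi(d, q).
    candidates = {"docA": [0.6, 0.004], "docB": [0.2, 0.030]}
    ranking = sorted(candidates, key=lambda d: np.dot(w, candidates[d]), reverse=True)
    print(ranking)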

There are many problems with this basic algorithm. One problem becomes evident by going back to our primitive training data and adding an extra document:

[Figure: the training data from before, with a fourth document's feature vector added.]

Inspecting the feature difference vectors reveals a problem:

[Figure: the feature difference vectors in data set $A$ for the enlarged training data.]

I’ve simplified the picture by showing only the feature difference vectors $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ in data set $A$, i.e., those for which $d_j$ ranks more highly than $d_k$; data set $B$ contains the negations of those vectors. It should be clear that no line can divide these vectors (together with their negations) into the two required half-spaces. It’s not geometrically possible. The way we’ll deal with this is by modifying the algorithm so that it finds a hyperplane which approximately divides the two sets of feature difference vectors into half-spaces. I’ll explain how to do this approximate division later in the post.

Before getting to the approximate division, though, we’ll warm up by figuring out much more explicitly how to do the division into two half-spaces when it is possible. It’s all very well for me to glibly say that we should “figure out which half-space the vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ is in”, but how can we actually do this in practice? In the next section I’ll introduce support vector machines, which are a way of doing this kind of division explicitly. Once we’ve understood the basics of support vector machines, we’ll come back to the question of how to divide two sets of vectors into approximate half-spaces.

Problems

  • Suppose the basic algorithm that I’ve described works, i.e., a division into half-spaces is possible. Suppose that for a particular query $q$ the document $d_i$ is ranked above $d_j$ and $d_j$ is ranked above $d_k$. Prove that the algorithm will rank $d_i$ above $d_k$.

Support vector machines

Support vector machines are a technique for partitioning two sets of vectors (data set $A$ and data set $B$) into half-spaces separated by a separating hyperplane. The technique is guaranteed to work whenever such a partitioning is possible, and what’s more, the partitioning is optimal, in a sense I’ll make precise. In this section I’ll describe briefly how support vector machines work.

As an aside, you’ll note that the notion of search (and related topics) wasn’t mentioned anywhere in the last paragraph. That’s because support vector machines aren’t about search. Instead, they’re a general technique for dividing sets of vectors into half-spaces, a technique which can be applied to many different problems in machine learning and artificial intelligence, not just search. So support vector machines are a useful technique to understand, even if you’re not especially interested in search. End of aside.

A priori, if someone just gives you two sets of vectors, $A$ and $B$, it’s not always going to be clear whether a partitioning into two half-spaces is possible. But let’s assume that such a partitioning is possible, and we’ll return later to what happens when it’s not. What we’d like to do is to find an explicit way of specifying a separating hyperplane:

[Figure: two sets of vectors separated by a hyperplane.]

Note that in the last section our separating hyperplanes passed through the origin. But for support vector machines we’ll also allow hyperplanes which don’t pass through the origin. One way to explicitly specify such a hyperplane is as the set of vectors $\vec{x}$ satisfying the equation:

$\vec{w} \cdot \vec{x} + b = 0,$

where $\vec{w}$ is a vector normal to the separating hyperplane, and $b$ is a constant. Together, $\vec{w}$ and $b$ specify the hyperplane. The goal of the support vector machine is to take data sets $A$ and $B$, and use them to construct values for $\vec{w}$ and $b$ which specify a separating hyperplane. What’s more, we’ll try to do this in such a way that we maximize the size of the “wedge” between the two data sets,

[Figure: the “wedge” between the two data sets, bounded by two supporting hyperplanes parallel to the separating hyperplane.]

where we require that the two edges of the wedge (called support or supporting hyperplanes) be parallel to the separating hyperplane itself. In other words, the goal is to choose the separating hyperplane (i.e., to choose $\vec{w}$ and $b$) in such a way that these two supporting hyperplanes are as far apart from one another as possible.

Let me mention, by the way, that I’m not mad keen on the notations we’ve been using, such as $\vec{w}$ and $b$. Unfortunately, it’s not going to get any better – we have some $\xi$’s and $C$’s in our near future, and I’m sorry to say that the origin of those notations won’t be much more transparent than the origins of $\vec{w}$ and $b$. I’m using what seems to be standard notation in the support vector machine literature, but it’s not very good notation, in my opinion: it has little or no mnemonic value, and is very non-uniform to boot. While I’m complaining, I’ll mention one more thing: the vector $\vec{w}$ is sometimes known as the weight vector, again, for reasons which seem obscure. End of rant.

Let’s get back to the problem of choosing $\vec{w}$ and $b$ so that the two supporting hyperplanes are as far apart from one another as possible. To do that, we need to figure out a way of expressing the distance between the two supporting hyperplanes in terms of $\vec{w}$ and $b$. Without any loss of generality, we’ll suppose that the separating hyperplane is halfway between the two supporting hyperplanes:

[Figure: the separating hyperplane halfway between the two supporting hyperplanes.]

Let’s fix our attention on one of the two supporting hyperplanes, say, the one that is “higher up” in the picture above. We’ll suppose, also without any loss of generality, that this is the hyperplane supporting data set $A$. By rescaling $\vec{w}$ and $b$ if necessary, we can ensure that this supporting hyperplane has the equation

$\vec{w} \cdot \vec{x} + b = 1,$

and so the constraint that this hyperplane support data set $A$ may be expressed as $\vec{w} \cdot \vec{x} + b \geq 1$ for all the vectors $\vec{x}$ in data set $A$. This is an important constraint: it’s the constraint on data set $A$.

Exercises

  • If you’re not comfortable with hyperplanes you may not immediately see why the kind of rescaling of $\vec{w}$ and $b$ which I did in the last paragraph is always possible. If this is the case, then write out an explicit proof.

We can now determine the distance between the supporting and separating hyperplanes by introducing a vector $\vec{v}$ which is: (a) normal to both hyperplanes, and (b) connects the two hyperplanes. A bit more precisely, $\vec{v} = \vec{x} - \vec{x}\,'$, where $\vec{x}$ is in the supporting hyperplane and $\vec{x}\,'$ is the projection of $\vec{x}$ onto the separating hyperplane:

[Figure: the vector $\vec{v}$ running from the separating hyperplane to the supporting hyperplane for data set $A$, normal to both.]

Our goal is to find the length of $\vec{v}$. Taking the difference of the equations $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x}\,' + b = 0$, we have $\vec{w} \cdot \vec{v} = 1$. Since $\vec{v}$ and $\vec{w}$ are both normal to the separating hyperplane, and thus parallel, it follows that $w v = 1$, where I am omitting the vector arrow in order to denote length. It follows that the distance from the separating to the supporting hyperplane is $v = 1/w$.
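Written out as a single chain of equations (my restatement of the step just described):

$\vec{w} \cdot \vec{x} + b = 1, \quad \vec{w} \cdot \vec{x}\,' + b = 0 \;\Rightarrow\; \vec{w} \cdot \vec{v} = \vec{w} \cdot (\vec{x} - \vec{x}\,') = 1 \;\Rightarrow\; w v = 1 \;\Rightarrow\; v = \frac{1}{w}.$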

We chose the separating hyperplane to be halfway between the two supporting hyperplanes, and so the total size of the wedge is $2/w$. As a result, maximizing the size of the wedge is equivalent to maximizing $1/w$, i.e., to minimizing the length $w$. This is subject to the constraint on data set $A$, namely that $\vec{w} \cdot \vec{x} + b \geq 1$ for all the vectors $\vec{x}$ in data set $A$. There’s also a similar constraint for data set $B$,

$\vec{w} \cdot \vec{x} + b \leq -1$

for all the vectors $\vec{x}$ in data set $B$. The supporting hyperplane for data set $B$ satisfies this condition with equality.
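Putting the pieces together: since maximizing the wedge size $2/w$ is the same as minimizing $w$ (and it’s conventional to minimize $\frac{1}{2} w^2$ instead, which has the same solution), the support vector machine’s job can be summarized as the optimization problem below. This summary is my restatement of what we’ve just derived, written in the standard form:

$\text{minimize } \frac{1}{2} \|\vec{w}\|^2 \text{ subject to } \vec{w} \cdot \vec{x} + b \geq 1 \text{ for all } \vec{x} \text{ in data set } A, \text{ and } \vec{w} \cdot \vec{x} + b \leq -1 \text{ for all } \vec{x} \text{ in data set } B.$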

Exercises

  • Prove that $\vec{w} \cdot \vec{x} + b = -1$ is the equation for the supporting hyperplane for data set $B$.

There is a reasonably nice way of combining the two sets of constraint inequalities for data sets $A$ and $B$. That’s to label each vector $\vec{x}$ in data set $A$ with the value $y = +1$, and each vector in data set $B$ with the value $y = -1$; with this labelling, both sets of constraints take the single form $y (\vec{w} \cdot \vec{x} + b) \geq 1$.
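To make the labelling concrete, here’s a sketch of how the resulting problem might be handed to an off-the-shelf solver. It uses scikit-learn’s SVC with a linear kernel and a large value of $C$ to approximate the hard-margin problem described above; the data is made up, and scikit-learn is only one of several libraries that will do this job.

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, linearly separable data: data set A labelled y = +1, data set B labelled y = -1.
    A = np.array([[2.0, 2.0], [3.0, 3.5], [2.5, 3.0]])
    B = np.array([[0.0, 0.5], [1.0, 0.0], [0.5, -0.5]])
    X = np.vstack([A, B])
    y = np.array([+1] * len(A) + [-1] * len(B))

    # A very large C approximates the hard-margin problem: minimize ||w||^2 / 2
    # subject to y (w . x + b) >= 1 for every training vector.
    model = SVC(kernel="linear", C=1e6)
    model.fit(X, y)

    w = model.coef_[0]
    b = model.intercept_[0]
    print("w =", w, ", b =", b)
    print("wedge size 2/w =", 2 / np.linalg.norm(w))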
