
How to combine multiple notions of relevance in search?

by Michael Nielsen on February 4, 2012

In earlier posts I’ve described two different ways we can assess how relevant a given webpage is to a search query: (1) the cosine similarity measure; and (2) the PageRank, which is a query-independent measure of the importance of a page. While it’s good that we have multiple insights into what makes a webpage relevant, it also gives rise to a problem: how should we combine these two measures to determine the ranking of a particular webpage for a particular search query?

In fact, in practice the problem gets much more complex than this, because there are more than two useful notions of search relevance. For instance, we’d surely also wish to incorporate a relevance measure that quantifies how likely or unlikely a given page is to be spam. By doing that we could ensure that pages which are likely to be spam receive a much lower ranking. We might also wish to incorporate a measure which ranks a page based on how close together the words in the search query are on that page. Once you start to think about it, we humans combine a very large number of factors when assessing the usefulness of a webpage. This is reflected in the fact that, according to Google, their search engine combines not just two or three measures of relevance but more than 200 measures of relevance. How should we best combine all these multiple measures in order to determine how relevant a page is to a given query?

You might think that the right approach would be to think hard about the meaning of measures like cosine similarity and PageRank, and then on the basis of that understanding, to figure out optimal ways of combining those measures. This approach is certainly worth pursuing, but it suffers from a problem: it doesn’t scale very well. Even if you come up with a good way of combining cosine similarity and PageRank, how would you combine 200 different measures? It’s not so obvious. And if you decide to trial the addition of a 201st measure of relevance, how exactly should you incorporate it into your algorithm, and how should you check to see whether or not it improves search results?

In this post, I’ll describe an approach to combining multiple measures of relevance that doesn’t require us to consider the details of the individual measures. Instead, the procedure I describe lets the machine automatically learn how to combine different measures. It does this with the help of a set of training data, where humans have ranked some set of webpages according to their relevance to some set of training queries. The idea is to figure out the best way of combining the measures of relevance in order to reproduce the results of the training data. Whatever method of combination is found is then applied more broadly, to all queries, and all webpages. The big advantage of this machine learning approach is that it lets us easily combine many different notions of search relevance. But it also has some drawbacks, as we’ll see.

The post is based principally on Chapter 15 of the book about information retrieval by Manning, Raghavan, and Schütze. The book is also available for free on the web.

Originally, I intended this post to be a mix of theory and working code to illustrate how the theory works in practice. This is the style I’ve used for many earlier posts, and is the style I intend to use whenever possible. However, I ran into a problem when I attempted to do that for the current post. The problem was that if I wanted to construct interesting examples (and ask interesting questions about those examples), I needed to add a lot of extra context in order for things to make sense. It would have tripled (or more) the length of an already long post. I ultimately decided the extra overhead wasn’t worth the extra insight. Instead, the post focuses on the theory. However, I have included a few pointers to libraries which make it easy to construct your own working code, if you’re so inclined. At some point I expect I’ll come back to this question, in a context where it makes much more sense to include working code.

As usual, I’ll finish the introduction with the caveat that I’m not an expert on any of this. I’m learning as I go, and there may be mistakes or misunderstandings in the post. Still, I hope the post is useful. At some point, I’ll stop adding this caveat to my posts, but I’m still a long way from being expert enough to do that! Also as per usual, the post will contain some simple exercises for the reader, some slightly harder problems, and also some problems for the author, which are things I’d like to understand better.

General approach

In this section, I’ll work through some simple hypothetical examples, using them to build up a set of heuristics about how to rank webpages. These heuristics will ultimately suggest a complete algorithm for learning how to rank webpages from a set of training data.

One small point about nomenclature: I’m going to switch from talking about “webpages” (or “pages”), and start referring instead to “documents”. In part this is because the document terminology is more standard. But it’s also because the techniques apply more broadly than the web.

To keep things simple and concrete in our hypothetical examples, we’ll assume that for any given query and document we have just two different measures of relevance, let’s say the cosine similarity and the PageRank (or some other similar measure, if we’re working with documents not from the web). We’ll call these features. And so for any given query $q$ and document $d$ we have a feature vector:

$\vec{\psi}(d, q) = \left( \text{cosine similarity of } d \text{ and } q, \ \text{PageRank of } d \right).$

Actually, the PageRank of a document doesn’t depend on the query $q$, but in general we’ll allow the features in the feature vector to depend on the query. Also, in general there will be more than two features, and so the feature vector might have quite a few components. But it turns out that the generalization beyond the two-dimensional case is straightforward, so we’ll stick with two dimensions for now.
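To make the idea of a feature vector concrete, here’s a minimal sketch in Python of how $\vec{\psi}(d, q)$ might be computed. The cosine similarity here is a bare-bones term-count version, and the PageRank values are simply looked up from a precomputed table; both are stand-ins for whatever implementations you actually use, and the document, query and numbers in the usage example are made up for illustration.

    import math
    from collections import Counter

    def cosine_similarity(document_text, query_text):
        """Cosine similarity between simple term-count vectors (illustrative only)."""
        doc_counts = Counter(document_text.lower().split())
        query_counts = Counter(query_text.lower().split())
        dot = sum(doc_counts[t] * query_counts[t] for t in query_counts)
        doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
        query_norm = math.sqrt(sum(c * c for c in query_counts.values()))
        if doc_norm == 0 or query_norm == 0:
            return 0.0
        return dot / (doc_norm * query_norm)

    def feature_vector(document, query, pagerank):
        """The two-component feature vector psi(d, q) = (cosine similarity, PageRank)."""
        return [cosine_similarity(document["text"], query), pagerank[document["id"]]]

    # Hypothetical usage, with made-up PageRank values:
    pagerank = {"doc1": 0.004, "doc2": 0.020}
    doc = {"id": "doc1", "text": "an introduction to the PageRank algorithm"}
    print(feature_vector(doc, "pagerank algorithm", pagerank))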

Our broad goal can now be restated in the language of feature vectors. What we want is an algorithm which combines the different components of the feature vector $\vec{\psi}(d, q)$ in order to rank the document $d$ for any particular query $q$, and which does a pretty good job of that ranking.

To get some insight into how we should solve this problem, let’s suppose we have an extremely simple set of training data. We’ll suppose a human operator has ranked three documents $d_1, d_2, d_3$ for just a single query $q$. Of course, real training data will need to involve many more documents and queries, but we can learn a lot by starting with this. We have three feature vectors for these three cases, $\vec{\psi}(d_1, q), \vec{\psi}(d_2, q)$ and $\vec{\psi}(d_3, q)$. I’ve illustrated these in the following diagram – to avoid clutter, rather than write $\vec{\psi}(d_j, q)$ repeatedly in the diagram, I’ve labelled each vector simply by its document number, $j$, as well as (in parentheses) by its order of relevance as ranked by a human operator:

[Figure: the three feature vectors plotted against the two feature axes (cosine similarity and PageRank), each labelled by its document number and, in parentheses, its human-assigned rank.]

There are a few reasonable observations:

  • If a feature vector $\vec{\psi}(d_j, q)$ is up and to the right of $\vec{\psi}(d_k, q)$ then it’s better along both axes (PageRank and cosine similarity). It seems reasonable to conclude that $d_j$ is a better result than $d_k$.
  • Conversely, if $\vec{\psi}(d_j, q)$ is down and to the left of $\vec{\psi}(d_k, q)$, then $d_j$ should always be ranked below $d_k$.
  • The hard cases are when $\vec{\psi}(d_j, q)$ is up and to the left of $\vec{\psi}(d_k, q)$, or down and to the right, in which case it’s not entirely clear which document should be ranked higher.

Note, by the way, that I don’t want to claim that these observations are “provable” in any way: they’re just reasonable observations, at least for the particular features (cosine similarity and PageRank) that we’re using. The idea here is simply to figure out some reasonable heuristics which we will eventually combine to suggest an algorithm for learning how to rank webpages.

With the observations above as motivation we’ll adopt the heuristic that it is the vector of feature differences $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ that determines whether $d_j$ should be ranked above $d_k$, or vice versa. In other words, we’re going to require that the relative ranking of $d_j$ and $d_k$ is a function of the components of $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$, and not of either individual feature vector alone. So the problem now becomes: how should we use $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ to rank $d_j$ and $d_k$? A clue is provided by looking at the vectors of feature differences for our training data:

[Figure: the feature difference vectors for the training data, each marked with a filled-in oval or a star.]

I haven’t labelled the vectors explicitly with $d_j$ and $d_k$, but it’s pretty easy to figure out which is which, if you look carefully. Instead, I’ve labelled each feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ with a filled-in oval when $d_j$ is ranked better than $d_k$, and with a star when $d_k$ is ranked better than $d_j$. Examining the vectors carefully, we see that the vectors with an oval all “point the same way” as one another, while the vectors with a star also all point the same way as one another, but in the opposite direction. We can make this intuition more precise by saying that the two sets of vectors are separated into two half-spaces by a line:

[Figure: the same feature difference vectors, with a line through the origin separating the oval-labelled vectors from the star-labelled vectors.]

So one way we could determine whether a feature difference vector is labelled by an oval or a star is simply by determining which half-space it is in, i.e., by determining which side of the line it’s on. Or to reformulate it in our original terms: we can tell whether a webpage $d_j$ ranks higher than a webpage $d_k$ simply by determining which half-space the feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ is in.

In higher-dimensional feature spaces this idea generalizes to separating the feature difference vectors into two half-spaces which are on opposite sides of a separating hyperplane. A convenient way of specifying this separating hyperplane, in any number of dimensions, is to introduce a normal vector $\vec{w}$ to the hyperplane. I’ve illustrated such a normal vector below for two dimensions, but you should imagine that we’re working in higher dimensions:

[Figure: a separating line with a normal vector $\vec{w}$.]

The condition for a feature difference vector to be in (say) the upper half-space is that $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)) > 0$. To be in the lower half-space the condition is that $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)) < 0$. Summing everything up, we can check whether the document $d_j$ should be ranked above or below $d_k$ simply by computing the sign of $\vec{w} \cdot (\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q))$.
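In code, this check is just a dot product and a sign test. Here’s a minimal sketch, assuming we somehow already have a suitable normal vector $\vec{w}$ (finding one is the subject of the rest of the post); the weight vector and feature values are made up for illustration.

    def ranks_above(w, psi_j, psi_k):
        """True if document j should rank above document k, i.e. if the
        feature difference vector lies in the upper half-space."""
        diff = [a - b for a, b in zip(psi_j, psi_k)]
        return sum(wi * di for wi, di in zip(w, diff)) > 0

    # Hypothetical weight vector and feature vectors (cosine similarity, PageRank):
    w = [1.0, 2.5]
    psi_j = [0.7, 0.010]
    psi_k = [0.4, 0.012]
    print(ranks_above(w, psi_j, psi_k))   # True: j's cosine-similarity advantage wins out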

The above observations suggest an algorithm for using the feature difference vectors to determine search relevance. It's a pretty obvious generalization of what I've just described, but at the risk of repeating myself I'll write it all out explicitly.

To start, we assume that a set of training data has been provided by human operators. Those operators have been given a set of training queries $q_1, q_2, \ldots$ and training documents $d_1, d_2, \ldots$. For each training query they've ranked each training document in order of relevance to that query. They might decide, for example, that for the query $q_1$ one particular document should be the top-ranked document, another the second-ranked document, and so on. This training data provides us with a whole lot of feature difference vectors $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$. These vectors can be divided up into two sets. The first set, which we'll call training data set $A$, contains those feature difference vectors for which $d_j$ has been ranked more highly than $d_k$. The second set, which we'll call training data set $B$, contains those feature difference vectors for which $d_k$ has been ranked more highly than $d_j$. We then find a separating hyperplane that separates these two data sets, i.e., with the upper half-space containing data set $A$ (where $d_j$ is ranked more highly than $d_k$ for query $q$), and a lower half-space containing data set $B$.

Suppose now that $q$ is any query – not necessarily a training query. And suppose $d_j$ and $d_k$ are any two documents, not just training documents. We rank $d_j$ above $d_k$ for query $q$ if the feature difference vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ lies in the upper half-space. And we rank $d_j$ below $d_k$ if it lies in the lower half-space. This completes the description of the algorithm.
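Here’s a sketch of the whole procedure in Python. It uses scikit-learn’s LinearSVC purely as a stand-in for the “find a separating hyperplane” step (support vector machines are discussed in the next section), and all of the training data is made up. The two essential points are the construction of the labelled feature difference vectors from the human rankings, and the fact that once we have $\vec{w}$ we can rank any documents for any query by sorting on $\vec{w} \cdot \vec{\psi}(d, q)$, which gives the same ordering as the pairwise half-space test (that’s essentially the first problem below).

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_training_data(ranked_feature_vectors):
        """Build labelled feature difference vectors from human rankings.

        ranked_feature_vectors: for each training query, the feature vectors of
        the training documents, listed from most to least relevant as judged
        by a human operator."""
        X, y = [], []
        for vectors in ranked_feature_vectors:
            for j in range(len(vectors)):
                for k in range(j + 1, len(vectors)):
                    diff = np.array(vectors[j]) - np.array(vectors[k])
                    X.append(diff)    # d_j ranked above d_k: data set A
                    y.append(+1)
                    X.append(-diff)   # the reversed pair: data set B
                    y.append(-1)
        return np.array(X), np.array(y)

    # Made-up training data: one query, three documents, features (cos sim, PageRank),
    # listed in the order the human ranked them.
    training = [
        [[0.9, 0.010], [0.5, 0.015], [0.3, 0.002]],
    ]
    X, y = pairwise_training_data(training)

    # Find a separating hyperplane through the origin, as in this section.
    model = LinearSVC(C=10.0, fit_intercept=False)
    model.fit(X, y)
    w = model.coef_[0]

    # Rank arbitrary documents for an arbitrary query by sorting on w . psi(d, q).
    candidates = {"docA": [0.6, 0.004], "docB": [0.2, 0.030]}
    ranking = sorted(candidates, key=lambda d: np.dot(w, candidates[d]), reverse=True)
    print(ranking)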

There are many problems with this basic algorithm. One problem becomes evident by going back to our primitive training data and adding an extra document:

[Figure: the training data from before, with a fourth document's feature vector added.]

Inspecting the feature difference vectors reveals a problem:

[Figure: the feature difference vectors in data set $A$ for the enlarged training data.]

I’ve simplified the picture by showing only the feature difference vectors $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ in data set $A$, i.e., those for which $d_j$ ranks more highly than $d_k$; data set $B$ contains the negations of those vectors. It should be clear that no line can divide these vectors (together with their negations) into the two required half-spaces. It’s not geometrically possible. The way we’ll deal with this is by modifying the algorithm so that it finds a hyperplane which approximately divides the two sets of feature difference vectors into half-spaces. I’ll explain how to do this approximate division later in the post.

Before getting to the approximate division, though, we’ll warm up by figuring out much more explicitly how to do the division into two half-spaces when it is possible. It’s all very well for me to glibly say that we should “figure out which half-space the vector $\vec{\psi}(d_j, q) - \vec{\psi}(d_k, q)$ is in”, but how can we actually do this in practice? In the next section I’ll introduce support vector machines, which are a way of doing this kind of division explicitly. Once we’ve understood the basics of support vector machines, we’ll come back to the question of how to divide two sets of vectors into approximate half-spaces.

Problems

  • Suppose the basic algorithm that I’ve described works, i.e., a division into half-spaces is possible. Suppose that for a particular query $q$ the document $d_i$ is ranked above $d_j$ and $d_j$ is ranked above $d_k$. Prove that the algorithm will rank $d_i$ above $d_k$.

Support vector machines

Support vector machines are a technique for partitioning two sets of vectors (data set $A$ and data set $B$) into half-spaces separated by a separating hyperplane. The technique is guaranteed to work whenever such a partitioning is possible, and what’s more, the partitioning is optimal, in a sense I’ll make precise. In this section I’ll describe briefly how support vector machines work.

As an aside, you’ll note that the notion of search (and related topics) wasn’t mentioned anywhere in the last paragraph. That’s because support vector machines aren’t about search. Instead, they’re a general technique for dividing sets of vectors into half-spaces, a technique which can be applied to many different problems in machine learning and artificial intelligence, not just search. So support vector machines are a useful technique to understand, even if you’re not especially interested in search. End of aside.

A priori, if someone just gives you two sets of vectors, $A$ and $B$, it’s not always going to be clear whether a partitioning into two half-spaces is possible. But let’s assume that such a partitioning is possible, and we’ll return later to what happens when it’s not. What we’d like to do is to find an explicit way of specifying a separating hyperplane:

[Figure: two sets of vectors separated by a hyperplane.]

Note that in the last section our separating hyperplanes passed through the origin. But for support vector machines we’ll also allow hyperplanes which don’t pass through the origin. One way to explicitly specify such a hyperplane is as the set of vectors $\vec{x}$ satisfying the equation:

$\vec{w} \cdot \vec{x} + b = 0,$

where $\vec{w}$ is a vector normal to the separating hyperplane, and $b$ is a constant. Together, $\vec{w}$ and $b$ specify the hyperplane. The goal of the support vector machine is to take data sets $A$ and $B$, and use them to construct values for $\vec{w}$ and $b$ which specify a separating hyperplane. What’s more, we’ll try to do this in such a way that we maximize the size of the “wedge” between the two data sets,

[Figure: the “wedge” between the two data sets, bounded by two supporting hyperplanes parallel to the separating hyperplane.]

where we require that the two edges of the wedge (called support or supporting hyperplanes) be parallel to the separating hyperplane itself. In other words, the goal is to choose the separating hyperplane (i.e., to choose $\vec{w}$ and $b$) in such a way that these two supporting hyperplanes are as far apart from one another as possible.

Let me mention, by the way, that I’m not mad keen on the notations we’ve been using, such as $\vec{w}$ and $b$. Unfortunately, it’s not going to get any better – we have some $\xi$’s and $C$’s in our near future, and I’m sorry to say that the origin of those notations won’t be much more transparent than the origins of $\vec{w}$ and $b$. I’m using what seems to be standard notation in the support vector machine literature, but it’s not very good notation, in my opinion: it has little or no mnemonic value, and is very non-uniform to boot. While I’m complaining, I’ll mention one more thing: the vector $\vec{w}$ is sometimes known as the weight vector, again, for reasons which seem obscure. End of rant.

Let’s get back to the problem of choosing $\vec{w}$ and $b$ so that the two supporting hyperplanes are as far apart from one another as possible. To do that, we need to figure out a way of expressing the distance between the two supporting hyperplanes in terms of $\vec{w}$ and $b$. Without any loss of generality, we’ll suppose that the separating hyperplane is halfway between the two supporting hyperplanes:

[Figure: the separating hyperplane halfway between the two supporting hyperplanes.]

Let’s fix our attention on one of the two supporting hyperplanes, say, the one that is “higher up” in the picture above. We’ll suppose, also without any loss of generality, that this is the hyperplane supporting data set $A$. By rescaling $\vec{w}$ and $b$ if necessary, we can ensure that this supporting hyperplane has the equation

$\vec{w} \cdot \vec{x} + b = 1,$

and so the constraint that this hyperplane support data set $A$ may be expressed as $\vec{w} \cdot \vec{x} + b \geq 1$ for all the vectors $\vec{x}$ in data set $A$. This is an important constraint: it’s the constraint on data set $A$.

Exercises

  • If you’re not comfortable with hyperplanes you may not immediately see why the kind of rescaling of $\vec{w}$ and $b$ which I did in the last paragraph is always possible. If this is the case, then write out an explicit proof.

We can now determine the distance between the supporting and separating hyperplanes by introducing a vector $\vec{v}$ which is: (a) normal to both hyperplanes, and (b) connects the two hyperplanes. A bit more precisely, $\vec{v} = \vec{x} - \vec{x}\,'$, where $\vec{x}$ is in the supporting hyperplane and $\vec{x}\,'$ is the projection of $\vec{x}$ onto the separating hyperplane:

[Figure: the vector $\vec{v}$ running from the separating hyperplane to the supporting hyperplane for data set $A$, normal to both.]

Our goal is to find the length of $\vec{v}$. Taking the difference of the equations $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x}\,' + b = 0$, we have $\vec{w} \cdot \vec{v} = 1$. Since $\vec{v}$ and $\vec{w}$ are both normal to the separating hyperplane, and thus parallel, it follows that $w v = 1$, where I am omitting the vector arrow in order to denote length. It follows that the distance from the separating to the supporting hyperplane is $v = 1/w$.
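Written out as a single chain of equations (my restatement of the step just described):

$\vec{w} \cdot \vec{x} + b = 1, \quad \vec{w} \cdot \vec{x}\,' + b = 0 \;\Rightarrow\; \vec{w} \cdot \vec{v} = \vec{w} \cdot (\vec{x} - \vec{x}\,') = 1 \;\Rightarrow\; w v = 1 \;\Rightarrow\; v = \frac{1}{w}.$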

We chose the separating hyperplane to be halfway between the two supporting hyperplanes, and so the total size of the wedge is $2/w$. As a result, maximizing the size of the wedge is equivalent to maximizing $1/w$, i.e., to minimizing the length $w$. This is subject to the constraint on data set $A$, namely that $\vec{w} \cdot \vec{x} + b \geq 1$ for all the vectors $\vec{x}$ in data set $A$. There’s also a similar constraint for data set $B$,

$\vec{w} \cdot \vec{x} + b \leq -1$

for all the vectors $\vec{x}$ in data set $B$. The supporting hyperplane for data set $B$ satisfies this condition with equality.
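Putting the pieces together: since maximizing the wedge size $2/w$ is the same as minimizing $w$ (and it’s conventional to minimize $\frac{1}{2} w^2$ instead, which has the same solution), the support vector machine’s job can be summarized as the optimization problem below. This summary is my restatement of what we’ve just derived, written in the standard form:

$\text{minimize } \frac{1}{2} \|\vec{w}\|^2 \text{ subject to } \vec{w} \cdot \vec{x} + b \geq 1 \text{ for all } \vec{x} \text{ in data set } A, \text{ and } \vec{w} \cdot \vec{x} + b \leq -1 \text{ for all } \vec{x} \text{ in data set } B.$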

Exercises

  • Prove that $\vec{w} \cdot \vec{x} + b = -1$ is the equation for the supporting hyperplane for data set $B$.

There is a reasonably nice way of combining the two sets of constraint inequalities for data sets $A$ and $B$. That’s to label each vector $\vec{x}$ in data set $A$ with the value $y = +1$, and each vector in data set $B$ with the value $y = -1$; with this labelling, both sets of constraints take the single form $y (\vec{w} \cdot \vec{x} + b) \geq 1$.
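To make the labelling concrete, here’s a sketch of how the resulting problem might be handed to an off-the-shelf solver. It uses scikit-learn’s SVC with a linear kernel and a large value of $C$ to approximate the hard-margin problem described above; the data is made up, and scikit-learn is only one of several libraries that will do this job.

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, linearly separable data: data set A labelled y = +1, data set B labelled y = -1.
    A = np.array([[2.0, 2.0], [3.0, 3.5], [2.5, 3.0]])
    B = np.array([[0.0, 0.5], [1.0, 0.0], [0.5, -0.5]])
    X = np.vstack([A, B])
    y = np.array([+1] * len(A) + [-1] * len(B))

    # A very large C approximates the hard-margin problem: minimize ||w||^2 / 2
    # subject to y (w . x + b) >= 1 for every training vector.
    model = SVC(kernel="linear", C=1e6)
    model.fit(X, y)

    w = model.coef_[0]
    b = model.intercept_[0]
    print("w =", w, ", b =", b)
    print("wedge size 2/w =", 2 / np.linalg.norm(w))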
