Addendum on search and support vector machines
In the last post I described how to use support vector machines (SVMs) to combine multiple notions of search relevance. After posting I realized I could greatly simplify my discussion of an important subject in the final section of that post. What I can simplify is this: once you’ve found the SVM parameters, how should you use them to rank a long list of documents for a particular query, in practice?
Here’s the setup. We suppose we’ve already taken a bunch of training data, as in the last post, and used that training data to find a SVM with soft margin. That SVM can be used to rank any query and document. In particular, the SVM is specified by a vector , and we rank a document above a document for query if:
where is the feature difference vector. (In the last post a parameter also appeared in this condition, but as I noted in one of the exercises in that post, for the particular problem of search , which is why it doesn’t appear here.)
In the final section of the last post I did lots of perambulations with the condition [*] to figure out how to construct a linear order ranking documents, given a particular query . We can simplify life quite a bit by recalling that where is the feature vector for query and document . The condition above can then be rewritten as:
This condition suggests a much easier way of ranking documents: simply treat as a score of how good document is when the query is , so that we rank above whenever the score for is better than . We then just construct a linear ordering of the documents, based on the score. Or, if we prefer, we can find just the top 10 (or 20, or whatever) documents by iterating linearly over the documents, and keeping a running tally of the best 10 (or 20) found to date.
This is quite a bit simpler than the discussion at the end of my last post. It only works, though, because our classifer (the SVM) is a linear function of the feature difference vectors. It’s not difficult to construct other types of classifer for which it really wouldn’t be obvious how to construct this kind of score. For those classifiers you could still fall back on the analysis I did in the last post, so it wasn’t all wasted effort.
Incidentally, one thing I like about the equation [**] is that it makes it a lot clearer why is called the weight vector. It means that the score for page on query is just a linear combination of the different feature values, with weights given by the weight vector. That makes the notation and nomenclature “weight vector” much clearer, as well as making it a lot clearer what the vector means!