**Working notes ahead!** This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the style of those posts. Rather, it’s my personal working notes – me, thinking out loud – as I try to understand a particular question in some depth. You can think of it as a variation on open notebook science. As such, the notes go into much more detail than previous posts, as I try to wrap my head around all the details of the problem at hand. In writing this blog I expect to alternate between these two formats (working notes and broader reviews). When it’s my personal working notes, I’ll make that clear in the first paragraph or so, so people who just want carefully distilled highlights can skip the more specialized posts.

With that out of the way, let’s get on with the main problem. In the last post I described a way of measuring how relevant a document is to a given search query. That notion of relevance was based on the *cosine similarity* (essentially, the angle) between the *document vector* and the *query vector*. Those vectors, in turn, were defined based on a notion of the *importance* of a term $t$ in a document $d$. In the post I used what I called the *standard importance function* (I’ll remind you how that’s defined in a moment). But although the standard importance function is widely used, that doesn’t mean it’s without problems, and in this post I describe a problem caused by the standard importance function.

The strategy in the post is to first describe a specific example that illustrates the problem with the standard importance function. The example is somewhat artificial, but, once we understand it, we’ll see that it’s part of a broader problem that is very real, and not at all artificial.

To describe the example, let’s suppose our search query contains two terms, “hobbit baggins”. We can represent this by a query vector $\vec q$ whose components include a term for “hobbit” and a term for “baggins”. Recall from the last post that, if we’re using the standard importance function, then those components are given by the inverse document frequency for each term, $\log(N/\mathrm{df}_t)$. Here, logarithms are taken to base two, $N$ is the total number of documents in the search corpus, $t$ is the term (“hobbit” and “baggins” being the relevant terms in this example), and $\mathrm{df}_t$ is the *document frequency* for term $t$, i.e., the number of documents in the corpus in which $t$ occurs.

Now, both “hobbit” and “baggins” are likely to be pretty unusual terms in the corpus, and so the inverse document frequency will be pretty large. Suppose, for example, that approximately one in a thousand documents contains “hobbit”, and similarly for “baggins”. Then the inverse document frequency will be $\log(1000) \approx 10$. So to a good approximation the query vector will be:

$$\vec q = [10 \;\; 10 \;\; 0 \;\; 0 \;\; \ldots]^T,$$

where I’ve written the query vector so the first component is the component for “hobbit”, the second component is the component for “baggins”, and the other components are for other terms – all those components are $0$ in this case, since those other terms don’t appear in the query. I’ve also written the query vector as a transposed row vector, because row vectors display better on the blog.
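To make the numbers concrete, here is a minimal Python sketch of the inverse document frequency calculation. The corpus size of one million documents is my own assumption, chosen only so that each term appears in one document in a thousand; the function and variable names are likewise just illustrative.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log base two of N / df_t."""
    return math.log2(num_docs / doc_freq)

# Hypothetical corpus (an assumption for illustration): one million documents,
# with each query term appearing in one document in a thousand.
N = 1_000_000
doc_freq = {"hobbit": 1_000, "baggins": 1_000}

# Query vector restricted to the two query terms.
# Every other component is 0, since no other term appears in the query.
query = {term: idf(N, df_t) for term, df_t in doc_freq.items()}

for term, weight in query.items():
    print(term, round(weight, 2))
```

Both components come out just under ten, which is why rounding them to ten is a reasonable approximation.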

Now suppose we have a document $d$ that contains ten occurrences of “hobbit”, and ten occurrences of “baggins”. Using the standard importance function, its document vector will have the form

$$\vec d = [10 \cdot 10 \;\; 10 \cdot 10 \;\; \vec d_r^T]^T = [100 \;\; 100 \;\; \vec d_r^T]^T.$$

Here, I’ve used the fact that the component of $\vec d$ in the “direction” of term $t$ is given by the standard importance function:

$$\mathrm{imp}(t,d) = \mathrm{tf}_{t,d} \cdot \log(N/\mathrm{df}_t),$$

where $\mathrm{tf}_{t,d}$ is the term frequency of $t$ in document $d$. I’ve also used $\vec d_r$ to denote the document vector for the remainder of the document, i.e., for all the terms not involving “hobbit” or “baggins”.
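As a quick check on those components, here is a small Python sketch of the standard importance function, term frequency times inverse document frequency. The corpus size and document frequency below are assumptions picked to match the "one in a thousand" example, not values from a real corpus.

```python
import math

def importance(term_freq: int, num_docs: int, doc_freq: int) -> float:
    """Standard importance function: tf_{t,d} * log2(N / df_t)."""
    return term_freq * math.log2(num_docs / doc_freq)

# Hypothetical corpus statistics (assumptions for illustration only).
N = 1_000_000
DF = 1_000  # one document in a thousand contains the term

# Ten occurrences each of "hobbit" and "baggins" in the document.
hobbit_component = importance(10, N, DF)
baggins_component = importance(10, N, DF)

print(round(hobbit_component, 1), round(baggins_component, 1))
```

Each component is a little under 100, so treating both as 100 is the same approximation as rounding the inverse document frequency to ten.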

Suppose, now, that we change the document, so that for some reason we replace all occurrences of “baggins” by “hobbit”, so there are now 20 occurrences of “hobbit”, and none of “baggins”. Intuitively, it seems to me that the resulting modified document is much less relevant to the query “hobbit baggins”. The reason is that although it contains twice as many occurrences of “hobbit”, those extra occurrences don’t compensate for the complete absence of the term “baggins”. Put another way, I think a small decrease in the frequency of “baggins” should only be compensated for by a much larger increase in the frequency of “hobbit”. As we’ll see, though, the cosine similarity doesn’t reflect this intuition very well.

To compute the cosine similarity, note that the document vector for the modified document $d'$ is:

$$\vec d' = [200 \;\; 0 \;\; \vec d_r^T]^T.$$

Recall also that the cosine similarity for an arbitrary document $d$ is defined to be $\cos\theta = \vec q \cdot \vec d / (q\,d)$, where I use the convention that $q = \|\vec q\|$ and $d = \|\vec d\|$. In principle this notation could cause confusion, since we now use $q$ (or $d$) to refer to the query (or document), as well as the length of the corresponding query (or document) vector. In practice it should be clear from context which is meant.
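Here is a minimal Python sketch of that cosine similarity, applied to the original and the modified document. The two leading components use the rounded values from the example (ten for each query term, 100 per ten occurrences). The three-component remainder vector is purely hypothetical: it stands in for whatever else the document contains.

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

# Hypothetical remainder of the document (an assumption for illustration).
remainder = [50.0, 50.0, 50.0]

# Components are ordered: "hobbit", "baggins", then the remaining terms.
q = [10.0, 10.0] + [0.0] * len(remainder)
d = [100.0, 100.0] + remainder          # ten "hobbit", ten "baggins"
d_modified = [200.0, 0.0] + remainder   # twenty "hobbit", no "baggins"

cos_original = cosine_similarity(q, d)
cos_modified = cosine_similarity(q, d_modified)
print(cos_original, cos_modified)
```

With this particular remainder the modified document scores lower, but how much lower depends entirely on how long the remainder vector is.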

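We get a hint of the problem with the standard importance function when we observe that $\vec q \cdot \vec d = \vec q \cdot \vec d' = 2000$. I.e., changing all occurrences of “baggins” to “hobbit” has no effect at all on the inner product, simply moving weight in the inner product from one term to the other. And so the only difference in the cosine similarity is due to the change in length between $\vec d$ and $\vec d'$. Writing $\theta$ and $\theta'$ for the angles that $\vec d$ and $\vec d'$ make with $\vec q$, one way of expressing this is to observe that:

$$\frac{\cos\theta'}{\cos\theta} = \frac{d}{d'}.$$

Note furthermore that

$$d^2 = 20000 + d_r^2 \quad \text{and} \quad d'^2 = 40000 + d_r^2,$$

and so we have:

$$\frac{\cos\theta'}{\cos\theta} = \sqrt{\frac{20000 + d_r^2}{40000 + d_r^2}}.$$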
The good news is that this gives us the right ordering – the document $d$ is more parallel to $q$ than is $d'$, since $\cos\theta' < \cos\theta$, and so occurrences (from ) will require there to be no fewer than occurrences of “baggins” for the score to remain as high (assuming the rest of the document is complex, i.e., ). This matches our intuition much better. Unfortunately, it’s by no means clear that it preserves whatever other properties of cosine similarity made it a good measure of relevance in the first place. That’s a question that most likely would require empirical user testing to resolve.
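To get a feel for how strongly the unmodified standard importance function penalizes the swap, here is a rough Python sketch. It assumes the rounded numbers from the example: the original document has squared length $20000 + d_r^2$, the modified document has squared length $40000 + d_r^2$, and the inner product with the query is unchanged, so the ratio of the two cosine similarities is just the ratio of the lengths. The sample remainder lengths are arbitrary choices of mine.

```python
import math

def cosine_ratio(remainder_length: float) -> float:
    """Ratio cos(theta') / cos(theta) for the hobbit/baggins example.

    Assumes the rounded example values: squared lengths 20000 + d_r^2 for
    the original document and 40000 + d_r^2 for the modified one.
    """
    r2 = remainder_length ** 2
    return math.sqrt((20000 + r2) / (40000 + r2))

# Arbitrary sample lengths for the rest of the document (illustrative only).
for d_r in (0, 100, 1000):
    print(d_r, cosine_ratio(d_r))
```

The ratio is always below one, so the ordering is right, but it creeps toward one as the remainder of the document gets longer. At least in this toy calculation, that suggests that for a long document, dropping “baggins” entirely costs very little.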

To wrap up, my takeaways from this are that: (1) the standard importance function makes it too easy to trade off query terms against one another in a document; (2) it’s easy to make *ad hoc* modifications to the standard importance function that remove this problem; but (3) it’s not clear which of those modifications would preserve the other qualities we want. The obvious thing to do is some empirical testing to see how satisfied users are with different measures. It’d also be good to think more deeply about the foundation of the problem: how *should* we define the relevance of a document to a given user query?

*Interested in more? Please follow me on Twitter. My new book about open science, Reinventing Discovery, will be published in October, and can be pre-ordered at Amazon.*

The first time you use the importance function, right after “Using the standard importance function, its document vector will have the form”, you have the square root of tf_{t,d}. Is this a typo?

Oops – yes, thanks, now fixed.

Does the use of Latent Semantic Analysis (LSA) overcome some of the problems described here?

What do you have in mind? (LSA seems to be used in lots of different ways.) Are you thinking of LSA’s use in Latent Semantic Indexing? That seems like an interesting question, to which I unfortunately don’t know the answer – I’d need to look at the details. As I recall, the term-document matrix used in LSA uses a similar notion of importance to what I’ve been using, based on inverse document frequency. Of course, that doesn’t mean it suffers the same kind of problems.

I mean LSA as in Latent Semantic Indexing. LSI uses the same notion of importance you mention to build the matrix, and then computes a rank-k approximation of that matrix. According to “Introduction to Information Retrieval”, by Manning et al., section 18.4:

“When forced to squeeze the terms/documents down to a k-dimensional space, the SVD should bring together terms with similar cooccurrences. This intuition suggests, then, that not only should retrieval quality not suffer too much from the dimension reduction, but in fact may improve.”

“Most surprisingly, a value of k in the low hundreds can actually increase precision on some query benchmarks. This appears to suggest that for a suitable value of k, LSI addresses some of the challenges of synonymy.”

By the way, there is a nice Python implementation of TF-IDF, LSI and others. It’s called Gensim (http://nlp.fi.muni.cz/projekty/gensim/index.html)

Interesting quote (and thanks for the project pointer!) – I’m not that far into Manning et al, yet.

My (hazy) recollection is that LSI retrieval deals quite well with things like the synonym problem – retrieving documents about automobiles when the query says “car”. But I’m not sure that it addresses the query term tradeoff problem that I’ve described in this post.