Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the style of those posts. Rather, it’s my personal working notes – me, thinking out loud – as I try to understand a particular question in some depth. You can think of it as a variation on open notebook science. As such, the notes go into much more detail than previous posts, as I try to wrap my head around all the details of the problem at hand. In writing this blog I expect to alternate between these two formats (working notes and broader reviews). When it’s my personal working notes, I’ll make that clear in the first paragraph or so, so people who just want carefully distilled highlights can skip the more specialized posts.

With that out of the way, let’s get on with the main problem. In the last post I described a way of measuring how relevant a document is to a given search query. That notion of relevance was based on the cosine similarity (essentially, the angle) between the document vector and the query vector. Those vectors, in turn, were defined based on a notion of the importance $\mbox{imp}(t,d)$ of a term $t$ in a document $d$. In the post I used what I called the standard importance function (I’ll remind you how that’s defined in a moment). But although the standard importance function is widely used, that doesn’t mean it’s without problems, and in this post I describe a problem caused by the standard importance function.

The strategy in the post is to first describe a specific example that illustrates the problem with the standard importance function. The example is somewhat artificial, but, once we understand it, we’ll see that it’s part of a broader problem that is very real, and not at all artificial.

To describe the example, let’s suppose our search query contains two terms, “hobbit baggins”. We can represent this by a query vector $\vec q$ whose components include one for the term “hobbit” and one for the term “baggins”. Recall from the last post that, if we’re using the standard importance function, then those components are given by the inverse document frequency for each term, $\log(N/{\rm df}_t)$. Here, logarithms are taken to base two, $N$ is the total number of documents in the search corpus, $t$ is the term (“hobbit” and “baggins” being the relevant terms in this example), and ${\rm df}_t$ is the document frequency for term $t$, i.e., the number of documents in the corpus in which $t$ occurs.

Now, both “hobbit” and “baggins” are likely to be pretty unusual terms in the corpus, and so the inverse document frequency will be pretty large. Suppose, for example, that approximately one in a thousand documents contains “hobbit”, and similarly for “baggins”. Then the inverse document frequency will be $\log(1000) \approx 10$. So to a good approximation the query vector will be: $\vec q = [10 \,\, 10 \,\, 0 \,\, 0 \,\, \ldots]^T,$

where I’ve written the query vector so the first component is the component for “hobbit”, the second component is the component for “baggins”, and the other components are for other terms – all those components are $0$ in this case, since those other terms don’t appear in the query. I’ve also written the query vector as a transposed row vector, because row vectors display better on the blog.
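To make this concrete, here’s a minimal Python sketch of the inverse document frequency computation. The corpus is a toy one invented purely for illustration (eight hypothetical documents, with “hobbit” occurring in exactly one of them):

```python
import math

def idf(term, docs):
    """Inverse document frequency of a term, log base 2."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df)

# Toy corpus of 8 documents (entirely hypothetical), in which
# "hobbit" appears in exactly one document, so idf = log2(8/1) = 3.
docs = [{"hobbit", "baggins", "shire"}] + [{"elf", "forest"}] * 7
print(idf("hobbit", docs))  # 3.0
```

In the example in the text, one document in roughly a thousand contains each query term, so the corresponding components come out near $\log_2(1000) \approx 10$.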

Now suppose we have a document that contains ten occurrences of “hobbit”, and ten occurrences of “baggins”. Using the standard importance function, its document vector will have the form $\vec d_1 = [100 \,\, 100 \,\, \vec d_r^T]^T.$

Here, I’ve used the fact that the component of $\vec d_1$ in the “direction” of term $t$ is given by the standard importance function: $\mbox{imp}(t,d) \equiv \mbox{tf}_{t,d} \log(N/\mbox{df}_t),$

where $\mbox{tf}_{t,d}$ is the term frequency of $t$ in document $d$. I’ve also used $\vec d_r$ to denote the document vector for the remainder of the document, i.e., for all the terms not involving “hobbit” or “baggins”.
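As a quick sketch of the standard importance function (again with a toy, hypothetical corpus, not any real collection):

```python
import math

def imp(term, doc, docs):
    """Standard importance: tf_{t,d} * log2(N / df_t).
    Here doc is a list of terms, and docs is the whole (toy) corpus."""
    tf = doc.count(term)                    # term frequency in this document
    df = sum(1 for d in docs if term in d)  # document frequency in the corpus
    return tf * math.log2(len(docs) / df)

# Hypothetical corpus: 8 documents, with "hobbit" occurring 10 times in
# one of them and nowhere else, so imp = 10 * log2(8/1) = 30.
docs = [["hobbit"] * 10 + ["ring"]] + [["elf"]] * 7
print(imp("hobbit", docs[0], docs))  # 30.0
```

With the numbers in the text (term frequency $10$, inverse document frequency $\approx 10$), each of the two leading components of $\vec d_1$ comes out near $100$.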

Suppose, now, that we change the document, so that for some reason we replace all occurrences of “baggins” by “hobbit”, so there are now $20$ occurrences of “hobbit”, and none of “baggins”. Intuitively, it seems to me that the resulting modified document is much less relevant to the query “baggins hobbit”. The reason is that although it contains twice as many occurrences of “hobbit”, those extra occurrences don’t compensate for the complete absence of the term “baggins”. Put another way, I think a small decrease in the frequency of “baggins” should only be compensated for by a much larger increase in the frequency of “hobbit”. As we’ll see, though, the cosine similarity doesn’t reflect this intuition very well.

To compute the cosine similarity, note that the document vector for the modified document is: $\vec d_2 = [200 \,\, 0 \,\, \vec d_r^T]^T.$

Recall also that the cosine similarity for an arbitrary document $d$ is defined to be $\cos \theta \equiv (\vec q \cdot \vec d) / q d$, where I use the convention that $q \equiv \|\vec q\|, d \equiv \|\vec d\|$. In principle this notation could cause confusion, since we now use $q$ (or $d$) to refer to the query (or document), as well as the length of the corresponding query (or document) vector. In practice it should be clear from context which is meant.
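For concreteness, here’s a short Python sketch of the cosine similarity, treating the query and document vectors as plain lists of numbers:

```python
import math

def cosine_similarity(q, d):
    """cos(theta) = (q . d) / (|q| |d|), with q and d as plain lists."""
    dot = sum(qi * di for qi, di in zip(q, d))
    q_norm = math.sqrt(sum(x * x for x in q))
    d_norm = math.sqrt(sum(x * x for x in d))
    return dot / (q_norm * d_norm)

# The query vector and the first document vector from the example,
# truncated to the two query-term components:
print(cosine_similarity([10, 10], [100, 100]))  # ~1.0: exactly parallel
```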

We get a hint of the problem with the standard importance function when we observe that $\vec q \cdot \vec d_1 = \vec q \cdot \vec d_2 = 2000$. I.e., changing all occurrences of “baggins” to “hobbit” has no effect at all on the inner product, simply moving weight in the inner product from one term to the other. And so the only difference in the cosine similarity is due to the change in length between $d_1$ and $d_2$. One way of expressing this is to observe that: $\frac{\cos \theta_1}{\cos \theta_2} = \frac{\vec q \cdot \vec d_1}{q d_1} \frac{q d_2}{\vec q \cdot \vec d_2} = \frac{d_2}{d_1}.$

Note furthermore that $d_1 = \sqrt{20000+d_r^2}; \,\,\,\, d_2 = \sqrt{40000+d_r^2},$

and so we have: $\frac{\cos \theta_1}{\cos \theta_2} = \sqrt{\frac{40000+d_r^2}{20000+d_r^2}} > 1.$

The good news is that this gives us the right ordering – the document $d_1$ is more parallel to $q$ than is $d_2$, since $\cos \theta_1 > \cos \theta_2$, and so $\theta_1 < \theta_2$. That’s what we hoped for, and so at least the ordering of relevance will be correct.

The bad news is that, for complex documents, the difference in cosine similarity is tiny. The reason is that for a long, complex document containing many unusual terms (“Gandalf”, “Galadriel”, “Saruman”), the value of $d_r^2$ may be enormous – let us say, conservatively, billions. In this approximation, we see that $\theta_1 \approx \theta_2$. By itself this isn’t a problem. But it means that the ordering of $\theta_1$ and $\theta_2$ is enormously sensitive to changes in $\vec d_r$. If, when we modified the first document, we changed not just “baggins” to “hobbit”, but also other terms, then that might cause the value of $d_r^2$ to decrease. If it decreased by more than $20000$ – easy to imagine, with just a few minor changes – then it would reverse the ordering of the two documents. And that doesn’t seem like such a good thing.

The example I have given may be artificial, but the broader phenomenon is not. More broadly, if we use the standard importance function when computing cosine similarity, then it becomes much too easy to trade off query terms in a document. For a two-term query, $q = t_1 t_2$, an increase in the number of occurrences of $t_1$ can be traded off almost exactly by a corresponding decrease in the number of occurrences of $t_2$. So, for example, doubling the number of occurrences of $t_1$ allows us to (approximately) completely wipe out the occurrences of $t_2$, while making only a tiny difference in cosine similarity.

In broad terms it’s pretty clear how to address this problem – change the scoring function so that we can’t trade off query terms in this way. It’s also easy to see how we might go about changing the scoring function to do that.
In fact, it’s a bit too easy – there are so many ways we could go about doing it that it’s not at all clear which will give the best results. Here’s one trivial way: redefine the importance function to be: $\mbox{imp}(t,d) \equiv \sqrt{\mbox{tf}_{t,d}} \log(N/\mbox{df}_t).$ That is, modify the importance function so it has the square root of the term frequency in it, rather than the term frequency. If we do this for the “hobbit”-“baggins” example, then increasing the frequency of “hobbit” to $20$ occurrences (from $10$) will require there to be no fewer than $4$ occurrences of “baggins” for the score to remain as high (assuming the rest of the document is complex, i.e., $d_r^2 \gg 40000$). This matches our intuition much better. Unfortunately, it’s by no means clear that it preserves whatever other properties of cosine similarity made it a good measure of relevance in the first place. That’s a question that most likely would require empirical user testing to resolve.
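The trade-off, and the effect of the square-root modification, can be checked numerically. The following Python sketch uses the illustrative numbers from above (idf $\approx 10$ for both query terms, and an assumed remainder $d_r^2 = 10^9$ for a long document); all values are hypothetical:

```python
import math

# Illustrative numbers from the example: idf ~ 10 for both query terms,
# and an assumed (hypothetical) remainder d_r^2 = 10^9 for a long,
# complex document.
IDF = 10.0
R = 1e9  # |d_r|^2, the squared norm of the rest of the document

def cos_sim(tf_hobbit, tf_baggins, tf_weight):
    """Cosine similarity between the query [IDF, IDF] and a document,
    where tf_weight maps a term frequency to its weight (tf for the
    standard importance function, sqrt(tf) for the modified one)."""
    d = [tf_weight(tf_hobbit) * IDF, tf_weight(tf_baggins) * IDF]
    q = [IDF, IDF]
    dot = sum(qi * di for qi, di in zip(q, d))
    d_norm = math.sqrt(sum(x * x for x in d) + R)
    q_norm = math.sqrt(sum(x * x for x in q))
    return dot / (q_norm * d_norm)

standard = lambda tf: tf
sqrt_imp = math.sqrt

# Standard importance: replacing every "baggins" by "hobbit" barely
# changes the score.
print(cos_sim(10, 10, standard) / cos_sim(20, 0, standard))  # ~1.00001

# Square-root importance: the same change costs a factor of sqrt(2).
print(cos_sim(10, 10, sqrt_imp) / cos_sim(20, 0, sqrt_imp))  # ~1.41421

# With the square-root variant, doubling "hobbit" to 20 occurrences
# needs at least 4 occurrences of "baggins" to match the original score.
needed = next(b for b in range(0, 11)
              if math.sqrt(20) + math.sqrt(b) >= 2 * math.sqrt(10))
print(needed)  # 4
```

So under the standard importance function the two documents are nearly indistinguishable, while the square-root variant separates them by a factor of roughly $\sqrt{2}$ – exactly the behaviour argued for above.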

To wrap up, my takeaways from this are that: (1) the standard importance function makes it too easy to trade off query terms against one another in a document; (2) it’s easy to make ad hoc modifications to the standard importance function that remove this problem; but (3) it’s not clear which of those modifications would preserve the other qualities we want. The obvious thing to do is some empirical testing to see how satisfied users are with different measures. It’d also be good to think more deeply about the foundation of the problem: how should we define the relevance of a document to a given user query?

Interested in more? Please follow me on Twitter. My new book about open science, Reinventing Discovery, will be published in October, and can be pre-ordered at Amazon.


1. The first time you use the importance function, right after “Using the standard importance function, its document vector will have the form”, you have the square root of tf_{t,d}. Is this a typo?

• Oops – yes, thanks, now fixed.

2. Does the use of Latent Semantic Analysis (LSA) overcome some of the problems described here?

• What do you have in mind? (LSA seems to be used in lots of different ways.) Are you thinking of LSA’s use in Latent Semantic Indexing? That seems like an interesting question, to which I unfortunately don’t know the answer – I’d need to look at the details. As I recall, the term-document matrix used in LSA uses a similar notion of importance to what I’ve been using, based on inverse document frequency. Of course, that doesn’t mean it suffers the same kind of problems.

• I mean LSA as in Latent Semantic Indexing. LSI uses the same notion of importance you mention to build the matrix, and then computes a rank-$k$ approximation of that matrix. According to “Introduction to Information Retrieval”, by Manning et al., section 18.4:

“When forced to squeeze the terms/documents down to a k-dimensional space, the SVD should bring together terms with similar cooccurrences. This intuition suggests, then, that not only should retrieval quality not suffer too much from the dimension reduction, but in fact may improve.”

“Most surprisingly, a value of k in the low hundreds can actually increase precision on some query benchmarks. This appears to suggest that for a suitable value of k, LSI addresses some of the challenges of synonymy.”

By the way, there is a nice Python implementation of TF-IDF, LSI and others. It’s called Gensim (http://nlp.fi.muni.cz/projekty/gensim/index.html)
