When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, most likely including Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.

I’ve based my treatment on chapters 6 and 7 of a very good book about information retrieval, by Manning, Raghavan, and Schuetze. The book is also available for free on the web here.

It’s worth mentioning that although this post is a mixture of mathematics and algorithms, search is not a purely mathematical or algorithmic problem. That is, search is not like solving a quadratic equation or sorting a list: there is no right answer, no well-defined notion of what it means to “solve” the search problem. To put it another way, as pointed out by Peter Norvig in an interview in Peter Seibel’s book Coders at Work, it makes no sense to think about “proving” that Google is correct, as one might perhaps do for a sorting algorithm. And yet we would no doubt all agree that some search algorithms are better, and others are worse.

### The vector space model explained

Suppose $d$ is a document in the corpus of documents we’d like to search. As a trivial example of a corpus, later in this post I’ll use a corpus containing just four documents, which I’ve chosen to be product descriptions for the books Lord of the Rings, The Hobbit, The Silmarillion and Rainbows End (all taken from Amazon). A real corpus would be much larger. In any case, what we’re going to do is to associate with $d$ a document vector, for which we’ll use the notation $\vec d$. You can think of the document vector $\vec d$ as a geometric way of representing the document $d$.

To explain how $\vec d$ is defined, we need first to define the vector space $\vec d$ lives in. To do that we need to go through a few preparatory steps. Starting from our corpus, we’ll extract all the words in that corpus, and we’ll call them the terms that make up the corpus dictionary. To each term $t$ in the dictionary we will associate a different unit vector $\vec e_t$ in the vector space. So, for instance, if the word “Einstein” appears in the corpus, then there is a unit vector $\vec e_{\rm Einstein}$. We’ll further stipulate that the unit vectors for different terms are orthogonal to one another. And so the vector space we’re working in – our geometric arena – is, by definition, the vector space spanned by all the unit vectors $\vec e_t$ corresponding to the different terms in the dictionary.

Somewhat oddly, so far as I know this vector space doesn’t have a name. I think it should: it is, after all, the geometric arena in which the vector space model for documents works. I will call it document space.

With all the above in mind, we can now define the document vector: $\vec d \equiv \sum_{t} \mbox{imp}(t,d) \vec e_t.$

Here, $\mbox{imp}(t,d)$ is some function that measures the importance of term $t$ in document $d$ (more on this function below). The key point to take away for now is that the document vector $\vec d$ is just the vector whose component in the direction $\vec e_t$ of term $t$ is a measure of the importance of term $t$ in the document $d$. That’s a pretty simple and natural geometric notion. It means that if some term is very important in a document, say “Einstein”, then $\vec d$ will have a large component in the $\vec e_{\rm Einstein}$ direction. But if some term isn’t very important – say “turnips” – then the component in the $\vec e_{\rm turnips}$ direction will be small.

How should we define the importance function $\mbox{imp}(t,d)$? There is, of course, no unique way of defining such a function, and below we’ll discuss several different concrete approaches to defining $\mbox{imp}(t,d)$. But as a simple example, one way you could define importance is to set $\mbox{imp}(t,d)$ equal to the frequency of the term $t$ in the document $d$. Later we’ll come up with an even better measure, but this illustrates the flavour of how $\mbox{imp}(t,d)$ is defined.

A caveat about the document vector $\vec d$ is that it typically doesn’t preserve all the information about the document. For instance, if we use frequency as a measure of importance, then the document vector for the document “She sells sea shells by the sea shore” will be the same as the document vector for the document “Sea shells by the sea shore she sells”, because the term frequencies are the same in both documents. So the document vector is only an imperfect representation of the original document. Still, provided you choose a reasonable function for importance, the document vector preserves quite a lot of information about the original document.
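As a concrete sketch of this construction, here is a tiny `doc_vector` helper (a name of my own choosing, not from any library), using raw term frequency as the importance function and a Python dict as a sparse vector. It confirms that the two permuted documents above really do get identical document vectors:

```python
from collections import Counter

def doc_vector(document):
    """Return a sparse document vector: a dict mapping each term to its
    importance, here taken to be the raw term frequency."""
    terms = document.lower().split()
    return dict(Counter(terms))

d1 = doc_vector("She sells sea shells by the sea shore")
d2 = doc_vector("Sea shells by the sea shore she sells")
# Word order is lost: the two permuted documents have the same vector,
# with component 2 in the "sea" direction and 1 elsewhere.
assert d1 == d2
```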

### Finding documents relevant to a query

Having defined document vectors, we can now say how to find the document most relevant to a given search query, $q$. We do it by defining a query vector, $\vec q$, exactly as we defined the document vector. That is, we regard the query text $q$ as a document, and construct the corresponding vector in document space: $\vec q \equiv \sum_{t} \mbox{imp}(t,q) \vec e_t$

Now, how should we measure how relevant a document $d$ is to a particular query $q$? One way is to simply consider the angle $\theta$ between the query vector $\vec q$ and the document vector $\vec d$, as defined by the equation: $\cos \theta = \frac{\vec q \cdot \vec d}{\| \vec q \| \| \vec d \|}.$

This is called the cosine similarity between the document and the query. Our notion of relevance, then, is that the smaller the angle $\theta$ is, the more relevant the document $d$ is to the query $q$.

What makes the cosine similarity so beautiful is that it lets us use our geometric intuition to think about the problem of ranking documents. Want the top $K$ ranked documents for the query? Just compute the angle $\theta$ for every document in the corpus, and return the $K$ documents whose angles are smallest. Want to find documents that are similar to a document $d$ (“find documents like this one”)? Just look for the other documents whose document vectors have the smallest angle to $\vec d$.
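Here’s one way both operations might be sketched in Python. The function names `cosine_similarity` and `top_k`, and the dict-as-sparse-vector representation, are my own illustrative choices:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors, represented as
    dicts mapping terms to weights."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents (from the dict doc_vecs of
    id -> vector) whose vectors make the smallest angle with the
    query vector, i.e. whose cosine similarity is largest."""
    ranked = sorted(doc_vecs,
                    key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]
```

The same `top_k` call handles “find documents like this one”: just pass a document vector in place of the query vector.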

Of course, using the angle is a very ad hoc choice. The justification is not theoretical, it is empirical: search engines which use cosine similarity on the whole leave their users pretty satisfied. If you can find another measure which leaves users more satisfied, then by all means, feel free to use that measure instead.

### Problems

• Is there a simple mathematical model of user behaviour and satisfaction in which it’s possible to prove that cosine similarity is optimal for satisfaction, for some particular choice of the importance function? Alternately, can you find another measure which is optimal? Note: As with many good research problems, this problem is quite ill posed. Making it simultaneously well posed and soluble is part of the problem.

### The importance function

We’re now in a position to come back and think more carefully about how to define the importance function. Earlier, I mentioned that one way of defining $\mbox{imp}(t,d)$ is as the frequency of term $t$ in the document $d$. This is a pretty good measure, but has the problem that it over-emphasizes common words, like “the”, “and”, “but”, and so on. If I’m interested in comparing a document and a query, it’s not so useful to know that both contain the word “the”, even if the document contains “the” twenty times. On the other hand, if both document and query contain the word “hobbit”, that’s a much more pertinent fact, even if the document only contains “hobbit” once. The reason is that in most search corpora the term “hobbit” is a relatively rare term.

There’s an improved notion of importance that takes this difference between common and uncommon words in the corpus into account. To define this improved notion of importance, first let $N$ be the total number of documents in the corpus. Let ${\rm df}_t$ be the number of documents the term $t$ appears in, known as the document frequency for term $t$. And finally, let ${\rm tf}_{t,d}$ be the number of times $t$ appears in document $d$, known as the term frequency. One common choice for the importance function is (logarithm to base $2$): $\mbox{imp}(t,d) \equiv {\rm tf}_{t,d} \times \log(N/{\rm df}_t)$

While this isn’t the only possible choice for the importance function, it is the one we’ll focus on the most, and so we’ll call it the standard importance function.

How should we think about the standard importance function? Perhaps the strangest looking part of the function is the $\log(N/{\rm df}_t)$ term, often called the inverse document frequency for $t$. In fact, the inverse document frequency isn’t so strange. Suppose that we’re trying to identify a document in the corpus. Initially, we know nothing about the identity of the document. Then we learn that the document contains the term $t$. The inverse document frequency $\log(N/{\rm df}_t)$ is just the number of bits of information we acquire when we learn this information. If $t$ is a very common term – say “the” or “and” – then this number will be tiny, because the document frequency ${\rm df}_t$ is very close to the total number of documents, $N$. But if $t$ is an uncommon term in the corpus – as the term “hobbit” is likely to be (unless the corpus is on Tolkien!) – then the document frequency ${\rm df}_t$ will be small compared to $N$, and we’ll get quite a lot of information when we learn that the document contains $t$.

With this interpretation of $\log(N/{\rm df}_t)$, we can now think of $\mbox{imp}(t,d)$ as simply being the frequency of $t$ in $d$, but rescaled according to a measure $\log(N/{\rm df}_t)$ of the importance of $t$ in the overall corpus. Of course, as with cosine similarity, this is a somewhat ad hoc choice; in no sense have we mathematically derived this measure from a mathematical model of user behaviour. Yet in practice it works pretty well as a measure of importance. Intuitively, with this importance function, two document vectors $\vec d_1$ and $\vec d_2$ will be nearly parallel when they make use of a very similar vocabulary, and, in particular, when they use a similar vocabulary of unusual words, and the unusual words occur with similar (relative) frequencies in the two documents.
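The standard importance function is easy to sketch in code. The following mirrors the postings data structure used in the toy program below (a dict mapping each term to a dict from document ids to term frequencies); the helper names `idf` and `imp_std` are my own:

```python
import math

def idf(term, postings, N):
    """Inverse document frequency of term: the number of bits of
    information gained on learning that a document contains term."""
    df = len(postings.get(term, {}))
    if df == 0:
        return 0.0
    return math.log(float(N) / df, 2)

def imp_std(term, doc_id, postings, N):
    """The standard importance function: the term frequency in the
    document, rescaled by the inverse document frequency."""
    tf = postings.get(term, {}).get(doc_id, 0)
    return tf * idf(term, postings, N)
```

For example, in a 4-document corpus a term appearing in every document carries $\log(4/4) = 0$ bits, so its importance is zero no matter how often it occurs, while a term appearing in just one document carries $\log(4/1) = 2$ bits.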

### Exercises

• Suppose the document $d_2$ is just the document $d_1$ concatenated with itself $3$ times. Show that $d_1$ and $d_2$ are parallel (i.e., $\theta = 0$) when the standard importance function is used.

### A toy Python implementation of the vector space model

Let’s now take a look at a toy Python (v2.6) implementation of a search engine that uses the vector space model. The code is straightforward – just a few dozen lines of non-boilerplate. Perhaps the most interesting bits are the data structures used, which are mostly set up in the header of the program, and in initialize_terms_and_postings, initialize_document_frequencies and initialize_lengths. The computation of cosine similarity is done in similarity. Here’s the source code (see also the source at GitHub):

"""vsm.py implements a toy search engine to illustrate the vector
space model for documents.

It asks you to enter a search query, and then returns all documents
matching the query, in decreasing order of cosine similarity,
according to the vector space model."""

from collections import defaultdict
import math
import sys

# We use a corpus of four documents.  Each document has an id, and
# these are the keys in the following dict.  The values are the
# corresponding filenames.
document_filenames = {0 : "documents/lotr.txt",
                      1 : "documents/silmarillion.txt",
                      2 : "documents/rainbows_end.txt",
                      3 : "documents/the_hobbit.txt"}

# The size of the corpus
N = len(document_filenames)

# dictionary: a set to contain all terms (i.e., words) in the document
# corpus.
dictionary = set()

# postings: a defaultdict whose keys are terms, and whose
# corresponding values are the so-called "postings list" for that
# term, i.e., the list of documents the term appears in.
#
# The way we implement the postings list is actually not as a Python
# list.  Rather, it's as a dict whose keys are the document ids of
# documents that the term appears in, with corresponding values equal
# to the frequency with which the term occurs in the document.
#
# As a result, postings[term] is the postings list for term, and
# postings[term][id] is the frequency with which term appears in
# document id.
postings = defaultdict(dict)

# document_frequency: a defaultdict whose keys are terms, with
# corresponding values equal to the number of documents which contain
# the key, i.e., the document frequency.
document_frequency = defaultdict(int)

# length: a defaultdict whose keys are document ids, with values equal
# to the Euclidean length of the corresponding document vector.
length = defaultdict(float)

# The list of characters (mostly, punctuation) we want to strip out of
# terms in the document.
characters = " .,!#$%^&*();:\n\t\\\"?!{}[]<>"

def main():
    initialize_terms_and_postings()
    initialize_document_frequencies()
    initialize_lengths()
    while True:
        do_search()

def initialize_terms_and_postings():
    """Reads in each document in document_filenames, splits it into a
    list of terms (i.e., tokenizes it), adds new terms to the global
    dictionary, and adds the document to the posting list for each
    term, with value equal to the frequency of the term in the
    document."""
    global dictionary, postings
    for id in document_filenames:
        f = open(document_filenames[id],'r')
        document = f.read()
        f.close()
        terms = tokenize(document)
        unique_terms = set(terms)
        dictionary = dictionary.union(unique_terms)
        for term in unique_terms:
            postings[term][id] = terms.count(term) # the value is the
                                                   # frequency of the
                                                   # term in the
                                                   # document

def tokenize(document):
    """Returns a list whose elements are the separate terms in
    document.  Something of a hack, but for the simple documents we're
    using, it's okay.  Note that we case-fold when we tokenize, i.e.,
    we lowercase everything."""
    terms = document.lower().split()
    return [term.strip(characters) for term in terms]

def initialize_document_frequencies():
    """For each term in the dictionary, count the number of documents
    it appears in, and store the value in document_frequency[term]."""
    global document_frequency
    for term in dictionary:
        document_frequency[term] = len(postings[term])

def initialize_lengths():
    """Computes the length for each document."""
    global length
    for id in document_filenames:
        l = 0
        for term in dictionary:
            l += imp(term,id)**2
        length[id] = math.sqrt(l)

def imp(term,id):
    """Returns the importance of term in document id.  If the term
    isn't in the document, then return 0."""
    if id in postings[term]:
        return postings[term][id]*inverse_document_frequency(term)
    else:
        return 0.0

def inverse_document_frequency(term):
    """Returns the inverse document frequency of term.  Note that if
    term isn't in the dictionary then it returns 0, by convention.
    Note also the float conversion, needed to avoid Python 2's
    integer division."""
    if term in dictionary:
        return math.log(float(N)/document_frequency[term],2)
    else:
        return 0.0

def do_search():
    """Asks the user what they would like to search for, and returns a
    list of relevant documents, in decreasing order of cosine
    similarity."""
    query = tokenize(raw_input("Search query >> "))
    if query == []:
        sys.exit()
    # find document ids containing all query terms.  Works by
    # intersecting the posting lists for all query terms.
    relevant_document_ids = intersection(
        [set(postings[term].keys()) for term in query])
    if not relevant_document_ids:
        print "No documents matched all query terms."
    else:
        scores = sorted([(id,similarity(query,id))
                         for id in relevant_document_ids],
                        key=lambda x: x[1],
                        reverse=True)
        print "Score: filename"
        for (id,score) in scores:
            print str(score)+": "+document_filenames[id]

def intersection(sets):
    """Returns the intersection of all sets in the list sets. Requires
    that the list sets contains at least one element, otherwise it
    raises an error."""
    return reduce(set.intersection, [s for s in sets])

def similarity(query,id):
    """Returns the cosine similarity between query and document id.
    Note that we don't bother dividing by the length of the query
    vector, since this doesn't make any difference to the ordering of
    search results."""
    similarity = 0.0
    for term in query:
        if term in dictionary:
            similarity += inverse_document_frequency(term)*imp(term,id)
    similarity = similarity / length[id]
    return similarity

if __name__ == "__main__":
    main()


Try it out and see how it works. Here’s some typical output:

Search query >> lord of the rings
Score: filename
0.28036158811: documents/lotr.txt
0.0723574605292: documents/the_hobbit.txt


Not exactly unexpected, but gratifying: it correctly identifies The Lord of the Rings and The Hobbit as relevant, in the correct order, and omits The Silmarillion and Rainbows End. About the only surprise is the omission of The Silmarillion, which could plausibly have been in the list of results, but would presumably have ranked last. All in all, it’s about what we’d expect.

### Improvements and variations

There are many improvements and variations possible for the toy program I presented above. I won’t describe them in detail here, but will briefly mention a few.

Performance: The toy algorithm I’ve described doesn’t work very well for large numbers of documents. The problem is that it computes the cosine similarity for every single document in the corpus. If you have ten billion documents in your document corpus, then the last thing you want to be doing is computing ten billion inner products. Fortunately, you don’t have to. It’s possible to take a lot of shortcuts that dramatically speed up the process. I won’t describe these in detail here, but here’s a simple idea that should start to stimulate more ideas: for each term $t$ in the dictionary, it’s straightforward to precompute the top 1,000 (say) documents for that term. Then for a given multi-term query it’s pretty likely that the top search results will come from one of the pre-computed lists of top documents for the terms in the query.
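As a rough sketch of that precomputation idea (these per-term lists are often called champion lists; the function names and the tf-idf weighting inside are my own illustrative assumptions):

```python
import math

def build_champion_lists(postings, N, r=1000):
    """postings maps each term to a dict {doc_id: term frequency}.
    For each term, precompute the r documents in which the term is
    most important (by tf-idf): the term's champion list."""
    champions = {}
    for term, freqs in postings.items():
        idf = math.log(float(N) / len(freqs), 2)
        ranked = sorted(freqs, key=lambda d: freqs[d] * idf, reverse=True)
        champions[term] = ranked[:r]
    return champions

def candidate_documents(query_terms, champions):
    """At query time, restrict scoring to the union of the champion
    lists of the query terms, instead of scanning the whole corpus."""
    candidates = set()
    for term in query_terms:
        candidates.update(champions.get(term, []))
    return candidates
```

The cost is that a genuinely relevant document can be missed if it falls outside every query term’s champion list, which is why production systems combine tricks like this with other pruning heuristics.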

Find what the user wants, not necessarily what they say they want: A user searching for “automobiles” probably also wants documents which talk about “cars”, one searching for “elefants” probably wants documents about “elephants”, and one looking for “moviemaking” might well be interested in documents about “moviemakers”. The system I described can and should be extended to deal with synonyms, with misspellings, and to ensure that we look for different forms of a word. I hope to deal with these topics in future posts.

Combine with information about a document’s importance: The model I’ve described treats all documents on an equal footing. In practice, we need to combine a notion of relevance like cosine similarity with some notion of the intrinsic importance of the document. A webpage from the New York Times should appear higher in search results than a webpage from a spammer, even if the spam page appears more relevant according to cosine similarity. So we need to extend the approach to incorporate some notion of importance, such as PageRank.

The ultimate search engine: These are just a few of the many issues that must be dealt with by a search engine. Ultimately, the goal is to build a search engine that tells us what we need to know. That’s a very open-ended goal, and we can break it down in many different ways; I’ve just presented one analytical frame for thinking about it. In future posts I’ll return to investigate some of the issues raised above in more depth, and also to explore other ways of thinking about the problem of search.

Interested in more? Please subscribe to this blog, or follow me on Twitter. My new book about open science, Reinventing Discovery will be published in October, and can be pre-ordered at Amazon now.
