{"id":12,"date":"2011-07-07T15:11:43","date_gmt":"2011-07-07T19:11:43","guid":{"rendered":"https:\/\/michaelnielsen.org\/ddi\/?p=12"},"modified":"2011-07-07T16:13:29","modified_gmt":"2011-07-07T20:13:29","slug":"documents-as-geometric-objects-how-to-rank-documents-for-full-text-search","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/ddi\/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search\/","title":{"rendered":"Documents as geometric objects: how to rank documents for full-text search"},"content":{"rendered":"<p>When we type a query into a search engine &#8211; say &#8220;Einstein on relativity&#8221; &#8211; how does the search engine decide which documents to return?  When the document is on the web, part of the answer to that question is provided by the <a href=\"https:\/\/michaelnielsen.org\/blog\/lectures-on-the-google-technology-stack-1-introduction-to-pagerank\/\">PageRank   algorithm<\/a>, which analyses the link structure of the web to determine the importance of different webpages.  But what should we do when the documents aren&#8217;t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?<\/p>\n<p>In this post I explain the basic ideas of how to rank different documents according to their relevance.  The ideas used are very beautiful.  They are based on the fearsome-sounding <a href=\"http:\/\/en.wikipedia.org\/wiki\/Vector_space_model\">vector space   model<\/a> for documents.  Although it sounds fearsome, the vector space model is actually very simple.  The key idea is to transform search from a linguistic problem into a <em>geometric problem<\/em>.  Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space.  In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how <em>parallel<\/em> the query vector and the document vector are.  The more parallel the vectors, the more relevant the document is.<\/p>\n<p>This geometric way of treating documents turns out to be very powerful.  It&#8217;s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as <a href=\"http:\/\/lucene.apache.org\/java\/docs\/index.html\">Lucene<\/a>.  The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents.  What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.<\/p>\n<p>I&#8217;ve based my treatment on chapters 6 and 7 of a very good <a href=\"http:\/\/www.amazon.com\/Introduction-Information-Retrieval-Christopher-Manning\/dp\/0521865719\">book<\/a> about information retrieval, by Manning, Raghavan, and Schuetze.  The book is also available for free on the web <a href=\"http:\/\/nlp.stanford.edu\/IR-book\/information-retrieval-book.html\">here<\/a>.<\/p>\n<p>It&#8217;s worth mentioning that although this post is a mixture of mathematics and algorithms, search is not a purely mathematical or algorithmic problem.  That is, search is not like solving a quadratic equation or sorting a list: there is no right answer, no well-defined notion of what it means to &#8220;solve&#8221; the search problem.  To put it another way, as pointed out by Peter Norvig in an interview in Peter Seibel&#8217;s book <a href=\"http:\/\/www.codersatwork.com\/\">Coders at Work<\/a>, it makes no sense to think about &#8220;proving&#8221; that Google is correct, as one might perhaps do for a sorting algorithm.  And yet we would no doubt all agree that some search algorithms are better, and others are worse.<\/p>\n<h3>The vector space model explained<\/h3>\n<p>Suppose <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> is a document in the corpus of documents we&#8217;d like to search.  As a trivial example of a corpus, later in this post I&#8217;ll use a corpus containing just four documents, which I&#8217;ve chosen to be product descriptions for the books <em>Lord of the Rings<\/em>, <em>The   Hobbit<\/em>, <em>The Silmarillion<\/em> and <em>Rainbows End<\/em> (all taken from Amazon).  A real corpus would be much larger.  In any case, what we&#8217;re going to do is to associate with <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> a <em>document vector<\/em>, for which we&#8217;ll use the notation <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/>.  You can think of the document vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> as a geometric way of representing the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>.<\/p>\n<p>To explain how <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> is defined, we need first to define the vector space <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> lives in.  To do that we need to go through a few preparatory steps.  Starting from our corpus, we&#8217;ll extract all the words in that corpus, and we&#8217;ll call them the <em>terms<\/em> that make up the corpus <em>dictionary<\/em>.  To each term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the dictionary we will associate a different unit vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_t' title='\\vec e_t' class='latex' \/> in the vector space.  So, for instance, if the word &#8220;Einstein&#8221; appears in the corpus, then there is a unit vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_%7B%5Crm+Einstein%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_{\\rm Einstein}' title='\\vec e_{\\rm Einstein}' class='latex' \/>.  We&#8217;ll further stiuplate that the unit vectors for different terms are orthogonal to one another.  And so the vector space we&#8217;re working in &#8211; our geometric arena &#8211; is, by definition, the vector space spanned by all the unit vectors <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_t' title='\\vec e_t' class='latex' \/> corresponding to the different terms in the dictionary.<\/p>\n<p>Somewhat oddly, so far as I know this vector space doesn&#8217;t have a name.  I think it should: it is, after all, the geometric arena in which the vector space model for documents works.  I will call it <em>document space<\/em>.<\/p>\n<p>With all the above in mind, we can now define the document vector:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d+%5Cequiv+%5Csum_%7Bt%7D+%5Cmbox%7Bimp%7D%28t%2Cd%29+%5Cvec+e_t.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d \\equiv \\sum_{t} \\mbox{imp}(t,d) \\vec e_t.' title='\\vec d \\equiv \\sum_{t} \\mbox{imp}(t,d) \\vec e_t.' class='latex' \/>\n<p>Here, <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> is some function that measures the importance of term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> (more on this function below).  The key point to take away for now is that the document vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> is just the vector whose component in the direction <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_t' title='\\vec e_t' class='latex' \/> of term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> is a measure of the importance of term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>. That&#8217;s a pretty simple and natural geometric notion.  It means that if some term is very important in a document, say &#8220;Einstein&#8221;, then <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> will have a large component in the <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_%7B%5Crm+Einstein%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_{\\rm Einstein}' title='\\vec e_{\\rm Einstein}' class='latex' \/> direction.  But if some term isn&#8217;t very important &#8211; say &#8220;turnips&#8221; &#8211; then the component in the <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+e_%7B%5Crm+turnips%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec e_{\\rm turnips}' title='\\vec e_{\\rm turnips}' class='latex' \/> direction will be small.<\/p>\n<p>How should we define the importance function <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/>?  There is, of course, no unique way of defining such a function, and below we&#8217;ll discuss several different concrete approaches to defining <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/>.  But as a simple example, one way you could define importance is to set <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> equal to the frequency of the term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>.  Later we&#8217;ll come up with an even better measure, but this illustrates the flavour of how <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> is defined.<\/p>\n<p>A caveat about the document vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/> is that typically doesn&#8217;t preserve all the information about the document.  For instance, if we use frequency as a measure of importance, then the document vector for the document &#8220;She sells sea shells by the sea shore&#8221; will be the same as the document vector for the document &#8220;Sea shells by the sea shore she sells&#8221;, because the term frequencies are the same in both documents.  So the document vector is only an imperfect representation of the original document.  Still, provided you choose a reasonable function for importance, the document vector preserves quite a lot of information about the original document.<\/p>\n<h3>Finding documents relevant to a query<\/h3>\n<p>Having defined document vectors, we can now say how to find the document most relevant to a given search query, <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/>.  We do it by defining a <em>query vector<\/em>, <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q' title='\\vec q' class='latex' \/>, exactly as we defined the document vector. That is, we regard the query text <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/> as a document, and construct the corresponding vector in document space:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q+%5Cequiv+%5Csum_%7Bt%7D+%5Cmbox%7Bimp%7D%28t%2Cq%29+%5Cvec+e_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q \\equiv \\sum_{t} \\mbox{imp}(t,q) \\vec e_t' title='\\vec q \\equiv \\sum_{t} \\mbox{imp}(t,q) \\vec e_t' class='latex' \/>\n<p>Now, how should me measure how relevant a document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> is to a particular query <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/>?  One way is to simply consider the angle <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\theta' title='\\theta' class='latex' \/> between the query vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q' title='\\vec q' class='latex' \/> and the document vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/>, as defined by the equation:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ccos+%5Ctheta+%3D+%5Cfrac%7B%5Cvec+q+%5Ccdot+%5Cvec+d%7D%7B%5C%7C+%5Cvec+q+%5C%7C+%5C%7C+%5Cvec+d+%5C%7C%7D.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\cos \\theta = \\frac{\\vec q \\cdot \\vec d}{\\| \\vec q \\| \\| \\vec d \\|}.' title='\\cos \\theta = \\frac{\\vec q \\cdot \\vec d}{\\| \\vec q \\| \\| \\vec d \\|}.' class='latex' \/>\n<p>This is called the <em>cosine similarity<\/em> between the document and the query.  Our notion of relevance, then, is that the smaller the angle <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\theta' title='\\theta' class='latex' \/> is, the more relevant the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> is to the query <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/>.<\/p>\n<p>What makes the cosine similarity so beautiful is that it lets us use our geometric intuition to think about the problem of ranking documents.  Want the top <img src='https:\/\/s0.wp.com\/latex.php?latex=K&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K' title='K' class='latex' \/> ranked documents for the query?  Just compute the angle <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\theta' title='\\theta' class='latex' \/> for every document in the corpus, and return the <img src='https:\/\/s0.wp.com\/latex.php?latex=K&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K' title='K' class='latex' \/> documents whose angles are smallest.  Want to find documents that are similar to a document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> (&#8220;find documents like this one&#8221;)?  Just look for the other documents whose document vectors have the smallest angle to <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d' title='\\vec d' class='latex' \/>.<\/p>\n<p>Of course, using the angle is a very <em>ad hoc<\/em> choice.  The justification is not theoretical, it is empirical: search engines which use cosine similarity on the whole leave their users pretty satisfied.  If you can find another measure which leaves users more satisfied, then by all means, feel free to use that measure instead.<\/p>\n<h3>Problems<\/h3>\n<ul>\n<li> Is there a simple mathematical model of user behaviour and   satisfication in which it&#8217;s possible to prove that cosine similarity   is optimal for satisfaction, for some particular choice of the   importance function?  Alternately, can you find another measure   which is optimal?  <strong>Note:<\/strong> As with many good research   problems, this problem is quite ill posed.  Making it simultaneously   well posed and soluble is part of the problem. <\/ul>\n<h3>The importance function<\/h3>\n<p>We&#8217;re now in position to come back and think more carefully about how to define the importance function.  Earlier, I mentioned that one way of definining <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> is as the frequency of term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>.  This is a pretty good measure, but has the problem that it over-emphasizes common words, like &#8220;the&#8221;, &#8220;and&#8221;, &#8220;but&#8221;, and so on.  If I&#8217;m interested in comparing a document and a query, it&#8217;s not so useful to know that both contain the word &#8220;the&#8221;, even if the document contains &#8220;the&#8221; twenty times.  On the other hand, if both document and query contain the word &#8220;hobbit&#8221;, that&#8217;s a much more pertinent fact, even if the document only contains &#8220;hobbit&#8221; once. The reason is that in most search corpi the term &#8220;hobbit&#8221; is a relatively rare term.<\/p>\n<p>There&#8217;s an improved notion of importance that takes this difference between common and uncommon words in the corpus into account.  To define this improved notion of importance, first let <img src='https:\/\/s0.wp.com\/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' \/> be the total number of documents in the corpus.  Let <img src='https:\/\/s0.wp.com\/latex.php?latex=%7B%5Crm+df%7D_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\\rm df}_t' title='{\\rm df}_t' class='latex' \/> be the number of documents the term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> appears in, known as the <em>document   frequency<\/em> for term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/>.  And finally, let <img src='https:\/\/s0.wp.com\/latex.php?latex=%7B%5Crm+tf%7D_%7Bt%2Cd%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\\rm tf}_{t,d}' title='{\\rm tf}_{t,d}' class='latex' \/> be the numbe of times <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> appears in document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>, known as the <em>term   frequency<\/em>.  One common choice for the importance function is (logarithm to base <img src='https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2' title='2' class='latex' \/>):<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29+%5Cequiv+%7B%5Crm+tf%7D_%7Bt%2Cd%7D+%2A+%5Clog%28N%2F%7B%5Crm+df%7D_t%29+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d) \\equiv {\\rm tf}_{t,d} * \\log(N\/{\\rm df}_t) ' title='\\mbox{imp}(t,d) \\equiv {\\rm tf}_{t,d} * \\log(N\/{\\rm df}_t) ' class='latex' \/>\n<p>While this isn&#8217;t the only possible choice for the importance function, it is the one we&#8217;ll focus on the most, and so we&#8217;ll call it the <em>standard<\/em> importance function.<\/p>\n<p>How should we think about the standard importance function?  Perhaps the strangest looking part of the function is the <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28N%2F%7B%5Crm+df%7D_t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(N\/{\\rm df}_t)' title='\\log(N\/{\\rm df}_t)' class='latex' \/> term, often called the <em>inverse document frequency<\/em> for <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/>.  In fact, the inverse document frequency isn&#8217;t so strange.  Suppose that we&#8217;re trying to identify a document in the corpus.  Initially, we know nothing about the identity of the document.  Then we learn that the document contains the term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/>.  The inverse document frequency <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28N%2F%7B%5Crm+df%7D_t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(N\/{\\rm df}_t)' title='\\log(N\/{\\rm df}_t)' class='latex' \/> is just the number of bits of information we acquire when we learn this information.  If <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> is a very common term &#8211; say &#8220;the&#8221; or &#8220;and&#8221; &#8211; then this number will be tiny, because the document frequency <img src='https:\/\/s0.wp.com\/latex.php?latex=%7B%5Crm+df%7D_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\\rm df}_t' title='{\\rm df}_t' class='latex' \/> is very close to the total number of documents, <img src='https:\/\/s0.wp.com\/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' \/>.  But if <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> is an uncommon term in the corpus &#8211; as the term &#8220;hobbit&#8221; is likely to be (unless the corpus is on Tolkien!) &#8211; then the document frequency <img src='https:\/\/s0.wp.com\/latex.php?latex=%7B%5Crm+df%7D_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\\rm df}_t' title='{\\rm df}_t' class='latex' \/> will be small compared to <img src='https:\/\/s0.wp.com\/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' \/>, and we&#8217;ll get quite a lot of information when we learn that the document contains <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/>.<\/p>\n<p>With this interpretation of <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28N%2F%7B%5Crm+df%7D_t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(N\/{\\rm df}_t)' title='\\log(N\/{\\rm df}_t)' class='latex' \/>, we can now think of <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> as simply being the frequency of <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>, but rescaled according to a measure <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28N%2F%7B%5Crm+df%7D_t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(N\/{\\rm df}_t)' title='\\log(N\/{\\rm df}_t)' class='latex' \/> of the importance of <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the overall corpus.  Of course, as with cosine similarity, this is a somewhat <em>ad hoc<\/em> choice; in no sense have we mathematically derived this measure from a mathematical model of user behaviour.  Yet in practice it works pretty well as a measure of importance.  Intuitively, with this importance function, two document vectors <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_1' title='\\vec d_1' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_2' title='\\vec d_2' class='latex' \/> will be nearly parallel when they make use of a very similar vocabulary, and, in particular, when they use a similar vocabulary of <em>unusual<\/em> words, and the unusual words occur with similar (relative) frequencies in the two documents.<\/p>\n<h3>Exercises<\/h3>\n<ul>\n<li> Suppose the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_2' title='d_2' class='latex' \/> is just the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_1' title='d_1' class='latex' \/>   concatenated with itself <img src='https:\/\/s0.wp.com\/latex.php?latex=3&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='3' title='3' class='latex' \/> times.  Show that <img src='https:\/\/s0.wp.com\/latex.php?latex=d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_1' title='d_1' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=d_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_2' title='d_2' class='latex' \/> are   parallel (i.e., <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\theta = 0' title='\\theta = 0' class='latex' \/>) when the standard importance function   is used. <\/ul>\n<h3>A toy Python implementation of the vector space model<\/h3>\n<p>Let&#8217;s now take a look at a toy Python (v2.6) implementation of a search engine that uses the vector space model.  The code is straightforward &#8211; just a few dozen lines of non-boilerplate. Perhaps the most interesting bits are the data structures used, which are mostly set up in the header of the program, and in <tt>initalize_terms_and_postings<\/tt>, <tt>initialize_document_frequencies<\/tt> and <tt>initialize_lengths<\/tt>. The computation of cosine similarity is done in <tt>similarity<\/tt>.  Here&#8217;s the source code (see also the <a href=\"https:\/\/github.com\/mnielsen\/VSM\">source at GitHub<\/a>):<\/p>\n<pre>\r\n\"\"\"vsm.py implements a toy search engine to illustrate the vector\r\nspace model for documents.\r\n\r\nIt asks you to enter a search query, and then returns all documents\r\nmatching the query, in decreasing order of cosine similarity,\r\naccording to the vector space model.\"\"\"\r\n\r\nfrom collections import defaultdict\r\nimport math\r\nimport sys\r\n\r\n# We use a corpus of four documents.  Each document has an id, and\r\n# these are the keys in the following dict.  The values are the\r\n# corresponding filenames.\r\ndocument_filenames = {0 : \"documents\/lotr.txt\",\r\n                      1 : \"documents\/silmarillion.txt\",\r\n                      2 : \"documents\/rainbows_end.txt\",\r\n                      3 : \"documents\/the_hobbit.txt\"}\r\n\r\n# The size of the corpus\r\nN = len(document_filenames)\r\n\r\n# dictionary: a set to contain all terms (i.e., words) in the document\r\n# corpus.\r\ndictionary = set()\r\n\r\n# postings: a defaultdict whose keys are terms, and whose\r\n# corresponding values are the so-called \"postings list\" for that\r\n# term, i.e., the list of documents the term appears in.\r\n#\r\n# The way we implement the postings list is actually not as a Python\r\n# list.  Rather, it's as a dict whose keys are the document ids of\r\n# documents that the term appears in, with corresponding values equal\r\n# to the frequency with which the term occurs in the document.\r\n#\r\n# As a result, postings[term] is the postings list for term, and\r\n# postings[term][id] is the frequency with which term appears in\r\n# document id.\r\npostings = defaultdict(dict)\r\n\r\n# document_frequency: a defaultdict whose keys are terms, with\r\n# corresponding values equal to the number of documents which contain\r\n# the key, i.e., the document frequency.\r\ndocument_frequency = defaultdict(int)\r\n\r\n# length: a defaultdict whose keys are document ids, with values equal\r\n# to the Euclidean length of the corresponding document vector.\r\nlength = defaultdict(float)\r\n\r\n# The list of characters (mostly, punctuation) we want to strip out of\r\n# terms in the document.\r\ncharacters = \" .,!#$%^&*();:\\n\\t\\\\\\\"?!{}[]<>\"\r\n\r\ndef main():\r\n    initialize_terms_and_postings()\r\n    initialize_document_frequencies()\r\n    initialize_lengths()\r\n    while True:\r\n        do_search()\r\n\r\ndef initialize_terms_and_postings():\r\n    \"\"\"Reads in each document in document_filenames, splits it into a\r\n    list of terms (i.e., tokenizes it), adds new terms to the global\r\n    dictionary, and adds the document to the posting list for each\r\n    term, with value equal to the frequency of the term in the\r\n    document.\"\"\"\r\n    global dictionary, postings\r\n    for id in document_filenames:\r\n        f = open(document_filenames[id],'r')\r\n        document = f.read()\r\n        f.close()\r\n        terms = tokenize(document)\r\n        unique_terms = set(terms)\r\n        dictionary = dictionary.union(unique_terms)\r\n        for term in unique_terms:\r\n            postings[term][id] = terms.count(term) # the value is the\r\n                                                   # frequency of the\r\n                                                   # term in the\r\n                                                   # document\r\n\r\ndef tokenize(document):\r\n    \"\"\"Returns a list whose elements are the separate terms in\r\n    document.  Something of a hack, but for the simple documents we're\r\n    using, it's okay.  Note that we case-fold when we tokenize, i.e.,\r\n    we lowercase everything.\"\"\"\r\n    terms = document.lower().split()\r\n    return [term.strip(characters) for term in terms]\r\n\r\ndef initialize_document_frequencies():\r\n    \"\"\"For each term in the dictionary, count the number of documents\r\n    it appears in, and store the value in document_frequncy[term].\"\"\"\r\n    global document_frequency\r\n    for term in dictionary:\r\n        document_frequency[term] = len(postings[term])\r\n\r\ndef initialize_lengths():\r\n    \"\"\"Computes the length for each document.\"\"\"\r\n    global length\r\n    for id in document_filenames:\r\n        l = 0\r\n        for term in dictionary:\r\n            l += imp(term,id)**2\r\n        length[id] = math.sqrt(l)\r\n\r\ndef imp(term,id):\r\n    \"\"\"Returns the importance of term in document id.  If the term\r\n    isn't in the document, then return 0.\"\"\"\r\n    if id in postings[term]:\r\n        return postings[term][id]*inverse_document_frequency(term)\r\n    else:\r\n        return 0.0\r\n\r\ndef inverse_document_frequency(term):\r\n    \"\"\"Returns the inverse document frequency of term.  Note that if\r\n    term isn't in the dictionary then it returns 0, by convention.\"\"\"\r\n    if term in dictionary:\r\n        return math.log(N\/document_frequency[term],2)\r\n    else:\r\n        return 0.0\r\n\r\ndef do_search():\r\n    \"\"\"Asks the user what they would like to search for, and returns a\r\n    list of relevant documents, in decreasing order of cosine\r\n    similarity.\"\"\"\r\n    query = tokenize(raw_input(\"Search query >> \"))\r\n    if query == []:\r\n        sys.exit()\r\n    # find document ids containing all query terms.  Works by\r\n    # intersecting the posting lists for all query terms.\r\n    relevant_document_ids = intersection(\r\n            [set(postings[term].keys()) for term in query])\r\n    if not relevant_document_ids:\r\n        print \"No documents matched all query terms.\"\r\n    else:\r\n        scores = sorted([(id,similarity(query,id))\r\n                         for id in relevant_document_ids],\r\n                        key=lambda x: x[1],\r\n                        reverse=True)\r\n        print \"Score: filename\"\r\n        for (id,score) in scores:\r\n            print str(score)+\": \"+document_filenames[id]\r\n\r\ndef intersection(sets):\r\n    \"\"\"Returns the intersection of all sets in the list sets. Requires\r\n    that the list sets contains at least one element, otherwise it\r\n    raises an error.\"\"\"\r\n    return reduce(set.intersection, [s for s in sets])\r\n\r\ndef similarity(query,id):\r\n    \"\"\"Returns the cosine similarity between query and document id.\r\n    Note that we don't bother dividing by the length of the query\r\n    vector, since this doesn't make any difference to the ordering of\r\n    search results.\"\"\"\r\n    similarity = 0.0\r\n    for term in query:\r\n        if term in dictionary:\r\n            similarity += inverse_document_frequency(term)*imp(term,id)\r\n    similarity = similarity \/ length[id]\r\n    return similarity\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n<\/pre>\n<p>Try it out and see how it works.  Here&#8217;s some typical output:<\/p>\n<pre>\r\nSearch query >> lord of the rings\r\nScore: filename\r\n0.28036158811: documents\/lotr.txt\r\n0.0723574605292: documents\/the_hobbit.txt\r\n<\/pre>\n<p> Not exactly unexpected, but gratifying: it correctly identifies <em>The Lord of the Rings<\/em> and <em>The Hobbit<\/em> as relevant, in the correct order, and omits <em>The Silmarillion<\/em> and <em>Rainbows   End<\/em>.  About the only surprise is the omission of <em>The   Silmarillion<\/em>, which could plausibly have been in the list of results, but would presumably have ranked last.  All in all, it&#8217;s about what we&#8217;d expect.<\/p>\n<h3>Improvements and variations<\/h3>\n<p>There are many improvements and variations possible for the toy program I presented above.  I won&#8217;t describe them in detail here, but will briefly mention a few.<\/p>\n<p><strong>Performance:<\/strong> The toy algorithm I&#8217;ve described doesn&#8217;t work very well for large numbers of documents.  The problem is that it computes the cosine similarity for every single document in the corpus.  If you have ten billion documents in your document corpus, then the last thing you want to be doing is computing ten billion inner products.  Fortunately, you don&#8217;t have to.  It&#8217;s possible to take a lot of shortcuts that dramatically speed up the process.  I won&#8217;t describe these in detail here, but here&#8217;s a simple idea that should start to stimulate more ideas: for each term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in the dictionary, it&#8217;s straightforward to precompute the top 1,000 (say) documents for that term.  Then for a given multi-term query it&#8217;s pretty likely that the top search results will come from one of the pre-computed lists of top documents for the terms in the query.<\/p>\n<p><strong>Find what the user wants, not necessarily what they say they   want:<\/strong> A user searching for &#8220;automobiles&#8221; probably also wants documents which talk about &#8220;cars&#8221;, one searching for &#8220;elefants&#8221; probably wants documents about &#8220;elephants&#8221;, and one looking for &#8220;moviemaking&#8221; might well be interested in documents about &#8220;moviemakers&#8221;.  The system I described can and should be extended to deal with synonyms, with misspellings, and to ensure that we look for different forms of a word. I hope to deal with these topics in future posts.<\/p>\n<p><strong>Combine with information about a document&#8217;s importance:<\/strong> The model I&#8217;ve described treats all documents on an equal footing.  In practice, we need to combine a notion of relevance like cosine similarity with some notion of the intrinsic importance of the document.  A webpage from the <em>New York Times<\/em> shoud appear higher in search results than a webpage from a spammer, even if the spam page appears more relevant according to cosine similarity.  So we need to extend the approach to incorporate some notion of relevance, such as <a href=\"https:\/\/michaelnielsen.org\/blog\/lectures-on-the-google-technology-stack-1-introduction-to-pagerank\/\">PageRank<\/a>.<\/p>\n<p><strong>The ultimate search engine:<\/strong> These are just a few of the many issues that must be dealt with by a search engine.  Ultimately, the goal is to build a search engine that tells us <em>what we need to   know<\/em>.  That&#8217;s a very open-ended goal, and we can break it down in many different ways; I&#8217;ve just presented one analytical frame for thinking about it.  In future posts I&#8217;ll return to investiage some of the issues raised above in more depth, and also to explore other ways of thinking about the problem of search.<\/p>\n<p><em>Interested in more?  Please   <a href=\"https:\/\/michaelnielsen.org\/ddi\/feed\/\">subscribe to this     blog<\/a>, or <a href=\"http:\/\/twitter.com\/#!\/michael_nielsen\">follow me     on Twitter<\/a>.  My new book about open science,   <a href=\"http:\/\/www.amazon.com\/Reinventing-Discovery-New-Networked-Science\/dp\/product-description\/0691148902\">Reinventing     Discovery<\/a> will be published in October, and can be   <a href=\"http:\/\/www.amazon.com\/Reinventing-Discovery-New-Networked-Science\/dp\/product-description\/0691148902\">pre-ordered     at Amazon now<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When we type a query into a search engine &#8211; say &#8220;Einstein on relativity&#8221; &#8211; how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/ddi\/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search\/\">Continue reading <span class=\"screen-reader-text\">Documents as geometric objects: how to rank documents for full-text search<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-12","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/12","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/comments?post=12"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/12\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/media?parent=12"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/categories?post=12"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/tags?post=12"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}