{"id":14,"date":"2011-07-11T14:24:29","date_gmt":"2011-07-11T18:24:29","guid":{"rendered":"https:\/\/michaelnielsen.org\/ddi\/?p=14"},"modified":"2011-07-11T17:39:26","modified_gmt":"2011-07-11T21:39:26","slug":"a-problem-with-the-standard-importance-function-trading-off-query-terms-against-one-another","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/ddi\/a-problem-with-the-standard-importance-function-trading-off-query-terms-against-one-another\/","title":{"rendered":"A problem with the standard importance function?  Trading off query terms against one another"},"content":{"rendered":"<p><strong>Working notes ahead!<\/strong>  This post is different to my last two posts.  Those posts were broad reviews of topics of general interest (at least if you&#8217;re interested in data-driven intelligence) &#8211; the <a href=\"https:\/\/michaelnielsen.org\/ddi\/pregel\/\">Pregel<\/a> graph framework, and the <a href=\"https:\/\/michaelnielsen.org\/ddi\/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search\/\">vector   space model<\/a> of documents.  This post is not a review or distillation of a topic in the style of those posts.  Rather, it&#8217;s my personal working notes &#8211; me, thinking out loud &#8211; as I try to understand a particular question in some depth.  You can think of it as a variation on <a href=\"http:\/\/drexel-coas-elearning.blogspot.com\/2006\/09\/open-notebook-science.html\">open   notebook science<\/a>.  As such, the notes go into much more detail than previous posts, as I try to wrap my head around all the details of the problem at hand.  In writing this blog I expect to alternate between these two formats (working notes and broader reviews).  When it&#8217;s my personal working notes, I&#8217;ll make that clear in the first paragraph or so, so people who just want carefully distilled highlights can skip the more specialized posts.<\/p>\n<p>With that out of the way, let&#8217;s get on with the main problem.  In the <a href=\"https:\/\/michaelnielsen.org\/ddi\/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search\/\">last   post<\/a> I described a way of measuring how relevant a document is to a given search query.  That notion of relevance was based on the <em>cosine similarity<\/em> (essentially, the angle) between the <em>document vector<\/em> and the <em>query vector<\/em>.  Those vectors, in turn, were defined based on a notion of the <em>importance<\/em> <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Bimp%7D%28t%2Cd%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{imp}(t,d)' title='\\mbox{imp}(t,d)' class='latex' \/> of a term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in a document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>.  In the post I used what I called the <em>standard importance function<\/em> (I&#8217;ll remind you how that&#8217;s defined in a moment).  But although the standard importance function is widely used, that doesn&#8217;t mean it&#8217;s without problems, and in this post I describe a problem caused by the standard importance function.<\/p>\n<p>The strategy in the post is to first describe a specific example that illustrates the problem with the standard importance function.  The example is somewhat artificial, but, once we understand it, we&#8217;ll see that it&#8217;s part of a broader problem that is very real, and not at all artificial.<\/p>\n<p>To describe the example, let&#8217;s suppose our search query contains two terms, &#8220;hobbit baggins&#8221;.  We can represent this by a query vector <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q' title='\\vec q' class='latex' \/> whose components include a term for &#8220;hobbit&#8221; and a term for &#8220;baggins&#8221;.  Recall from the <a href=\"https:\/\/michaelnielsen.org\/ddi\/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search\/\">last   post<\/a> that, if we&#8217;re using the standard importance function, then those components are given by the inverse document frequency for each term, <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28N%2F%7B%5Crm+df%7D_t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(N\/{\\rm df}_t)' title='\\log(N\/{\\rm df}_t)' class='latex' \/>.  Here, logarithms are taken to base two, <img src='https:\/\/s0.wp.com\/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' \/> is the total number of documents in the search corpus, <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> is the term (&#8220;hobbit&#8221; and &#8220;baggins&#8221; being the relevant terms in this example), and <img src='https:\/\/s0.wp.com\/latex.php?latex=%7B%5Crm+df%7D_t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\\rm df}_t' title='{\\rm df}_t' class='latex' \/> is the <em>document frequency<\/em> for term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/>, i.e., the number of documents in the corpus in which <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> occurs.<\/p>\n<p>Now, both &#8220;hobbit&#8221; and &#8220;baggins&#8221; are likely to be pretty unusual terms in the corpus, and so the inverse document frequency will be pretty large.  Suppose, for example, that approximately one in a thousand documents contains &#8220;hobbit&#8221;, and similarly for &#8220;baggins&#8221;. Then the inverse document frequency will be <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%281000%29+%5Capprox+10&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(1000) \\approx 10' title='\\log(1000) \\approx 10' class='latex' \/>. So to a good approximation the query vector will be:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q+%3D+%5B10+10+0+0+%5Cldots%5D%5ET%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q = [10 10 0 0 \\ldots]^T,' title='\\vec q = [10 10 0 0 \\ldots]^T,' class='latex' \/>\n<p>where I&#8217;ve written the query vector so the first component is the component for &#8220;hobbit&#8221;, the second component is the component for &#8220;baggins&#8221;, and the other components are for other terms &#8211; all those components are <img src='https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' \/> in this case, since those other terms don&#8217;t appear in the query.  I&#8217;ve also written the query vector as a transposed row vector, because row vectors display better on the blog.<\/p>\n<p>Now suppose we have a document that contains ten occurrences of &#8220;hobbit&#8221;, and ten occurrences of &#8220;baggins&#8221;.  Using the standard importance function, its document vector will have the form<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_1+%3D+%5B100+100+%5Cvec+d_r%5ET%5D%5ET.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_1 = [100 100 \\vec d_r^T]^T.' title='\\vec d_1 = [100 100 \\vec d_r^T]^T.' class='latex' \/>\n<p>Here, I&#8217;ve used the fact that the component of <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_1' title='\\vec d_1' class='latex' \/> in the &#8220;direction&#8221; of term <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> is given by the standard importance function:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+%5Cmbox%7Bimp%7D%28t%2Cd%29+%5Cequiv+%5Cmbox%7Btf%7D_%7Bt%2Cd%7D+%5Clog%28N%2F%5Cmbox%7Bdf%7D_t%29%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' \\mbox{imp}(t,d) \\equiv \\mbox{tf}_{t,d} \\log(N\/\\mbox{df}_t),' title=' \\mbox{imp}(t,d) \\equiv \\mbox{tf}_{t,d} \\log(N\/\\mbox{df}_t),' class='latex' \/>\n<p>where <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cmbox%7Btf%7D_%7Bt%2Cd%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\mbox{tf}_{t,d}' title='\\mbox{tf}_{t,d}' class='latex' \/> is the term frequency of <img src='https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' \/> in document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>. I&#8217;ve also used <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_r' title='\\vec d_r' class='latex' \/> to denote the document vector for the remainder of the document, i.e., for all the terms not involving &#8220;hobbit&#8221; or &#8220;baggins&#8221;.  <\/p>\n<p>Suppose, now, that we change the document, so that for some reason we replace all occurrences of &#8220;baggins&#8221; by &#8220;hobbit&#8221;, so there are now <img src='https:\/\/s0.wp.com\/latex.php?latex=20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='20' title='20' class='latex' \/> occurrences of &#8220;hobbit&#8221;, and none of &#8220;baggins&#8221;.  Intuitively, it seems to me that the resulting modified document is much less relevant to the query &#8220;baggins hobbit&#8221;.  The reason is that although it contains twice as many occurrences of &#8220;hobbit&#8221;, those extra occurrences don&#8217;t compensate for the complete absence of the term &#8220;baggins&#8221;.  Put another way, I think a small decrease in the frequency of &#8220;baggins&#8221; should only be compensated for by a much larger increase in the frequency of &#8220;hobbit&#8221;.  As we&#8217;ll see, though, the cosine similarity doesn&#8217;t reflect this intuition very well.<\/p>\n<p>To compute the cosine similarity, note that the document vector for the modified document is:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+d_2+%3D+%5B200+0+%5Cvec+d_r%5ET%5D%5ET.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec d_2 = [200 0 \\vec d_r^T]^T.' title='\\vec d_2 = [200 0 \\vec d_r^T]^T.' class='latex' \/>\n<p>Recall also that the cosine similarity for an arbitrary document <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/> is defined to be <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ccos+%5Ctheta+%5Cequiv+%28%5Cvec+q+%5Ccdot+%5Cvec+d%29+%2F+q+d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\cos \\theta \\equiv (\\vec q \\cdot \\vec d) \/ q d' title='\\cos \\theta \\equiv (\\vec q \\cdot \\vec d) \/ q d' class='latex' \/>, where I use the convention that <img src='https:\/\/s0.wp.com\/latex.php?latex=q+%5Cequiv+%5C%7C%5Cvec+q%5C%7C%2C+d+%5Cequiv+%5C%7C%5Cvec+d%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q \\equiv \\|\\vec q\\|, d \\equiv \\|\\vec d\\|' title='q \\equiv \\|\\vec q\\|, d \\equiv \\|\\vec d\\|' class='latex' \/>.  In principle this notation could cause confusion, since we now use <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/> (or <img src='https:\/\/s0.wp.com\/latex.php?latex=d&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d' title='d' class='latex' \/>) to refer to the query (or document), as well as the length of the corresponding query (or document) vector.  In practice it should be clear from context which is meant.<\/p>\n<p>We get a hint of the problem with the standard importance function when we observe that <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cvec+q+%5Ccdot+%5Cvec+d_1+%3D+%5Cvec+q+%5Ccdot+%5Cvec+d_2+%3D+2000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\vec q \\cdot \\vec d_1 = \\vec q \\cdot \\vec d_2 = 2000' title='\\vec q \\cdot \\vec d_1 = \\vec q \\cdot \\vec d_2 = 2000' class='latex' \/>.  I.e., changing all occurrences of &#8220;hobbit&#8221; to &#8220;baggins&#8221; has no effect at all on the inner product, simply moving weight in the inner product from one term to the other.  And so the only difference in the cosine similarity is due to the change in length between <img src='https:\/\/s0.wp.com\/latex.php?latex=d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_1' title='d_1' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=d_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_2' title='d_2' class='latex' \/>.  One way of expressing this is to observe that:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B%5Ccos+%5Ctheta_1%7D%7B%5Ccos+%5Ctheta_2%7D+%3D++%5Cfrac%7B%5Cvec+q+%5Ccdot+%5Cvec+d_1%7D%7Bq+d_1%7D+%5Cfrac%7Bq+d_2%7D%7B%5Cvec+q+%5Ccdot+%5Cvec+d_2%7D+%3D+%5Cfrac%7Bd_2%7D%7Bd_1%7D.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\frac{\\cos \\theta_1}{\\cos \\theta_2} =  \\frac{\\vec q \\cdot \\vec d_1}{q d_1} \\frac{q d_2}{\\vec q \\cdot \\vec d_2} = \\frac{d_2}{d_1}.' title='\\frac{\\cos \\theta_1}{\\cos \\theta_2} =  \\frac{\\vec q \\cdot \\vec d_1}{q d_1} \\frac{q d_2}{\\vec q \\cdot \\vec d_2} = \\frac{d_2}{d_1}.' class='latex' \/>\n<p> Note furthermore that<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=d_1+%3D+%5Csqrt%7B20000%2Bd_r%5E2%7D%3B+%5C%2C%5C%2C%5C%2C%5C%2C+d_2+%3D+%5Csqrt%7B40000%2Bd_r%5E2%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_1 = \\sqrt{20000+d_r^2}; \\,\\,\\,\\, d_2 = \\sqrt{40000+d_r^2}, ' title='d_1 = \\sqrt{20000+d_r^2}; \\,\\,\\,\\, d_2 = \\sqrt{40000+d_r^2}, ' class='latex' \/>\n<p>and so we have:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B%5Ccos+%5Ctheta_1%7D%7B%5Ccos+%5Ctheta_2%7D+%3D+%5Csqrt%7B%5Cfrac%7B40000%2Bd_r%5E2%7D%7B20000%2Bd_r%5E2%7D%7D+%3E+1.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\frac{\\cos \\theta_1}{\\cos \\theta_2} = \\sqrt{\\frac{40000+d_r^2}{20000+d_r^2}} &gt; 1.' title='\\frac{\\cos \\theta_1}{\\cos \\theta_2} = \\sqrt{\\frac{40000+d_r^2}{20000+d_r^2}} &gt; 1.' class='latex' \/>\n<p> The good news is that this gives us the right ordering &#8211; the document <img src='https:\/\/s0.wp.com\/latex.php?latex=d_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_1' title='d_1' class='latex' \/> is more parallel to <img src='https:\/\/s0.wp.com\/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' \/> than is <img src='https:\/\/s0.wp.com\/latex.php?latex=d_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_2' title='d_2' class='latex' \/>, since <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ccos+%5Ctheta_1+%3E+%5Ccos+%5Ctheta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\cos \\theta_1 &gt; \\cos \\theta_2' title='\\cos \\theta_1 &gt; \\cos \\theta_2' class='latex' \/>, and so <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_1+%3C+%5Ctheta_2%26%2391%3B%2Flatex%26%2393%3B.++That%27s+what+we+hoped+for%2C+and+so+at+least+the+ordering+of+relevance+will+be+correct.++++The+bad+news+is+that%2C+for+complex+documents%2C+the+difference+in+cosine+similarity+is+tiny.++The+reason+is+that+for+a+long%2C+complex+document+containing+many+unusual+terms+%28%22Gandalf%22%2C+%22Galadriel%22%2C+%22Saruman%22%29%2C+the+value+of+%26%2391%3Blatex%26%2393%3Bd_r%5E2%26%2391%3B%2Flatex%26%2393%3B+may+be+enormous+-+let+us+say%2C+conservatively%2C+billions.++In+this+approximation%2C+we+see+that+%26%2391%3Blatex%26%2393%3B%5Ctheta_1+%5Capprox+%5Ctheta_2%26%2391%3B%2Flatex%26%2393%3B.++By+itself+this+isn%27t+a+problem.++But+it+means+that+the+ordering+of+%26%2391%3Blatex%26%2393%3B%5Ctheta_1%26%2391%3B%2Flatex%26%2393%3B+and+%26%2391%3Blatex%26%2393%3B%5Ctheta_2%26%2391%3B%2Flatex%26%2393%3B+is+enormously+sensitive+to+changes+in+%26%2391%3Blatex%26%2393%3B%5Cvec+d_r%26%2391%3B%2Flatex%26%2393%3B.++If%2C+when+we+modified+the+first+document%2C+we+changed+not+just+%22hobbit%22+to+%22baggins%22%2C+but+also+other+terms%2C+then+that+might+cause+the+value+of+%26%2391%3Blatex%26%2393%3Bd_r%5E2%26%2391%3B%2Flatex%26%2393%3B+to+decrease.+If+it+decreased+by+more+than+%26%2391%3Blatex%26%2393%3B20000%26%2391%3B%2Flatex%26%2393%3B+-+easy+to+imagine%2C+with+just+a+few+minor+changes+-+then+it+would+reverse+the+ordering+of+the+two+terms.++And+that+doesn%27t+seem+like+such+a+good+thing.++++The+example+I+have+given+may+be+artificial%2C+but+the+broader+phenomenon+is+not.++More+broadly%2C+if+we+use+the+standard+importance+function+when+computing+cosine+similarity%2C+then+it+becomes+much+too+easy+to+trade+off+query+terms+in+a+document.++For+two-term+queries%2C+%26%2391%3Blatex%26%2393%3Bq+%3D+t_1+t_2%26%2391%3B%2Flatex%26%2393%3B%2C+an+increasing+number+of+occurrences+of+%26%2391%3Blatex%26%2393%3Bt_1%26%2391%3B%2Flatex%26%2393%3B+can+be+traded+off+almost+exactly+by+a+corresponding+decrease+in+the+number+of+occurrences+of+%26%2391%3Blatex%26%2393%3Bt_2%26%2391%3B%2Flatex%26%2393%3B.++So%2C+for+example%2C+doubling+the+number+of+occurrences+of+%26%2391%3Blatex%26%2393%3Bt_1%26%2391%3B%2Flatex%26%2393%3B+allows+us+to+%28approximately%29+completely+wipe+out+the+occurrences+of+%26%2391%3Blatex%26%2393%3Bt_2%26%2391%3B%2Flatex%26%2393%3B%2C+while+making+only+a+tiny+difference+in+cosine+similarity.++++In+broad+terms+it%27s+pretty+clear+how+to+address+this+problem+-+change+the+scoring+function+so+that+we+can%27t+trade+off+query+terms+in+this+way.++It%27s+also+easy+to+see+how+we+might+go+about+changing+the+scoring+function+to+do+that.++In+fact%2C+it%27s+a+bit+too+easy+-+there+are+so+many+ways+we+could+go+about+doing+it+that+it%27s+not+at+all+clear+which+will+give+the+best+results.++Here%27s+one+trivial+way%3A+redefine+the+importance+function+to+be%3A++%26%2391%3Blatex%26%2393%3B%5Cmbox%7Bimp%7D%28t%2Cd%29+%5Cequiv+%5Csqrt%7B%5Cmbox%7Btf%7D_%7Bt%2Cd%7D%7D+%5Clog%28N%2F%5Cmbox%7Bdf%7D_t%29.%26%2391%3B%2Flatex%26%2393%3B+++That+is%2C+modify+the+importance+function+so+it+has+the+%3Cem%3Esquare+++root%3C%2Fem%3E+of+the+term+frequency+in+it+rather+than+the+term+frequency.+If+we+do+this+for+the+%60%60hobbit%27%27-%60%60baggins%27%27+example%2C+then+increasing+the+frequency+of+%60%60hobbit%27%27+to+%5Blatex%5D20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\theta_1 &lt; \\theta_2&#091;\/latex&#093;.  That&#039;s what we hoped for, and so at least the ordering of relevance will be correct.    The bad news is that, for complex documents, the difference in cosine similarity is tiny.  The reason is that for a long, complex document containing many unusual terms (&quot;Gandalf&quot;, &quot;Galadriel&quot;, &quot;Saruman&quot;), the value of &#091;latex&#093;d_r^2&#091;\/latex&#093; may be enormous - let us say, conservatively, billions.  In this approximation, we see that &#091;latex&#093;\\theta_1 \\approx \\theta_2&#091;\/latex&#093;.  By itself this isn&#039;t a problem.  But it means that the ordering of &#091;latex&#093;\\theta_1&#091;\/latex&#093; and &#091;latex&#093;\\theta_2&#091;\/latex&#093; is enormously sensitive to changes in &#091;latex&#093;\\vec d_r&#091;\/latex&#093;.  If, when we modified the first document, we changed not just &quot;hobbit&quot; to &quot;baggins&quot;, but also other terms, then that might cause the value of &#091;latex&#093;d_r^2&#091;\/latex&#093; to decrease. If it decreased by more than &#091;latex&#093;20000&#091;\/latex&#093; - easy to imagine, with just a few minor changes - then it would reverse the ordering of the two terms.  And that doesn&#039;t seem like such a good thing.    The example I have given may be artificial, but the broader phenomenon is not.  More broadly, if we use the standard importance function when computing cosine similarity, then it becomes much too easy to trade off query terms in a document.  For two-term queries, &#091;latex&#093;q = t_1 t_2&#091;\/latex&#093;, an increasing number of occurrences of &#091;latex&#093;t_1&#091;\/latex&#093; can be traded off almost exactly by a corresponding decrease in the number of occurrences of &#091;latex&#093;t_2&#091;\/latex&#093;.  So, for example, doubling the number of occurrences of &#091;latex&#093;t_1&#091;\/latex&#093; allows us to (approximately) completely wipe out the occurrences of &#091;latex&#093;t_2&#091;\/latex&#093;, while making only a tiny difference in cosine similarity.    In broad terms it&#039;s pretty clear how to address this problem - change the scoring function so that we can&#039;t trade off query terms in this way.  It&#039;s also easy to see how we might go about changing the scoring function to do that.  In fact, it&#039;s a bit too easy - there are so many ways we could go about doing it that it&#039;s not at all clear which will give the best results.  Here&#039;s one trivial way: redefine the importance function to be:  &#091;latex&#093;\\mbox{imp}(t,d) \\equiv \\sqrt{\\mbox{tf}_{t,d}} \\log(N\/\\mbox{df}_t).&#091;\/latex&#093;   That is, modify the importance function so it has the &lt;em&gt;square   root&lt;\/em&gt; of the term frequency in it rather than the term frequency. If we do this for the ``hobbit&#039;&#039;-``baggins&#039;&#039; example, then increasing the frequency of ``hobbit&#039;&#039; to [latex]20' title='\\theta_1 &lt; \\theta_2&#091;\/latex&#093;.  That&#039;s what we hoped for, and so at least the ordering of relevance will be correct.    The bad news is that, for complex documents, the difference in cosine similarity is tiny.  The reason is that for a long, complex document containing many unusual terms (&quot;Gandalf&quot;, &quot;Galadriel&quot;, &quot;Saruman&quot;), the value of &#091;latex&#093;d_r^2&#091;\/latex&#093; may be enormous - let us say, conservatively, billions.  In this approximation, we see that &#091;latex&#093;\\theta_1 \\approx \\theta_2&#091;\/latex&#093;.  By itself this isn&#039;t a problem.  But it means that the ordering of &#091;latex&#093;\\theta_1&#091;\/latex&#093; and &#091;latex&#093;\\theta_2&#091;\/latex&#093; is enormously sensitive to changes in &#091;latex&#093;\\vec d_r&#091;\/latex&#093;.  If, when we modified the first document, we changed not just &quot;hobbit&quot; to &quot;baggins&quot;, but also other terms, then that might cause the value of &#091;latex&#093;d_r^2&#091;\/latex&#093; to decrease. If it decreased by more than &#091;latex&#093;20000&#091;\/latex&#093; - easy to imagine, with just a few minor changes - then it would reverse the ordering of the two terms.  And that doesn&#039;t seem like such a good thing.    The example I have given may be artificial, but the broader phenomenon is not.  More broadly, if we use the standard importance function when computing cosine similarity, then it becomes much too easy to trade off query terms in a document.  For two-term queries, &#091;latex&#093;q = t_1 t_2&#091;\/latex&#093;, an increasing number of occurrences of &#091;latex&#093;t_1&#091;\/latex&#093; can be traded off almost exactly by a corresponding decrease in the number of occurrences of &#091;latex&#093;t_2&#091;\/latex&#093;.  So, for example, doubling the number of occurrences of &#091;latex&#093;t_1&#091;\/latex&#093; allows us to (approximately) completely wipe out the occurrences of &#091;latex&#093;t_2&#091;\/latex&#093;, while making only a tiny difference in cosine similarity.    In broad terms it&#039;s pretty clear how to address this problem - change the scoring function so that we can&#039;t trade off query terms in this way.  It&#039;s also easy to see how we might go about changing the scoring function to do that.  In fact, it&#039;s a bit too easy - there are so many ways we could go about doing it that it&#039;s not at all clear which will give the best results.  Here&#039;s one trivial way: redefine the importance function to be:  &#091;latex&#093;\\mbox{imp}(t,d) \\equiv \\sqrt{\\mbox{tf}_{t,d}} \\log(N\/\\mbox{df}_t).&#091;\/latex&#093;   That is, modify the importance function so it has the &lt;em&gt;square   root&lt;\/em&gt; of the term frequency in it rather than the term frequency. If we do this for the ``hobbit&#039;&#039;-``baggins&#039;&#039; example, then increasing the frequency of ``hobbit&#039;&#039; to [latex]20' class='latex' \/> occurrences (from <img src='https:\/\/s0.wp.com\/latex.php?latex=10&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='10' title='10' class='latex' \/>) will require there to be no fewer than <img src='https:\/\/s0.wp.com\/latex.php?latex=4&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='4' title='4' class='latex' \/> occurrences of &#8220;baggins&#8221; for the score to remain as high (assuming the rest of the document is complex, i.e., <img src='https:\/\/s0.wp.com\/latex.php?latex=d_r+%5Cgg+40000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='d_r \\gg 40000' title='d_r \\gg 40000' class='latex' \/>).  This matches our intuition much better.  Unfortunately, it&#8217;s by no means clear that it preserves whatever other properties of cosine similarity made it a good measure of relevance in the first place.  That&#8217;s a question that most likely would require empirical user testing to resolve.<\/p>\n<p>To wrap up, my takeaways from this are that: (1) the standard importance function makes it too easy to trade off query terms against one another in a document; (2) it&#8217;s easy to make <em>ad hoc<\/em> modifications to the standard importance function that remove this problem; but (3) it&#8217;s not clear which of those modifications would preserve the other qualities we want.  The obvious thing to do is some empirical testing to see how satisfied users are with different measures.  It&#8217;d also be good to think more deeply about the foundation of the problem: how <em>should<\/em> we define the relevance of a document to a given user query?<\/p>\n<p>  <em>Interested in more?  Please <a href=\"htp:\/\/www.michaelnielsen.org\/ddi\/feed\/>subscribe to this blog<\/a>, or <a href=\"http:\/\/twitter.com\/\\#!\/michael_nielsen\">follow me on Twitter<\/a>.  My new book about open science, <a href=\"http:\/\/www.amazon.com\/Reinventing-Discovery-New-Networked-Science\/dp\/product-description\/0691148902\">Reinventing Discovery<\/a> will be published in October, and can be <a href=\"http:\/\/www.amazon.com\/Reinventing-Discovery-New-Networked-Science\/dp\/product-description\/0691148902\">pre-ordered at Amazon<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you&#8217;re interested in data-driven intelligence) &#8211; the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/ddi\/a-problem-with-the-standard-importance-function-trading-off-query-terms-against-one-another\/\">Continue reading <span class=\"screen-reader-text\">A problem with the standard importance function?  Trading off query terms against one another<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-14","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/14","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/comments?post=14"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/14\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/media?parent=14"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/categories?post=14"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/tags?post=14"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}