{"id":511,"date":"2008-12-09T15:34:14","date_gmt":"2008-12-09T19:34:14","guid":{"rendered":"http:\/\/michaelnielsen.org\/blog\/?p=511"},"modified":"2008-12-09T15:55:14","modified_gmt":"2008-12-09T19:55:14","slug":"lectures-on-the-google-technology-stack-2-building-our-pagerank-intuition","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/blog\/lectures-on-the-google-technology-stack-2-building-our-pagerank-intuition\/","title":{"rendered":"Lectures on the Google Technology Stack 2: Building our PageRank Intuition"},"content":{"rendered":"<p>This is one in a series of posts about the Google Technology Stack &#8211; PageRank,  MapReduce, and so on &#8211; based on a lecture series I&#8217;m giving in 2008 and 2009.   If you&#8217;d like to know more about the series, please see the <a href=\"http:\/\/michaelnielsen.org\/blog\/?page_id=503\">course syllabus<\/a>.  There is a FriendFeed room for the series <a href=\"http:\/\/friendfeed.com\/rooms\/lecture-course-on-the-google-techno\">here<\/a>. <\/p>\n<p>  In today&#8217;s lecture, we&#8217;ll build our intuition about PageRank.  We&#8217;ll figure out the minimal and maximal possible values for PageRank; study how PageRank works for a web with a random link structure, seeing a sort of &#8220;central limit theorem&#8221; for PageRank emerge; and find out how spammers can spam PageRank, and how to defeat them. <\/p>\n<h2>Building our PageRank Intuition<\/h2>\n<p>In the last lecture we defined PageRank, but we haven&#8217;t yet developed a deep understanding. In this lecture we&#8217;ll build our intuition, working through many basic examples of PageRank in action, and answering many fundamental questions.<\/p>\n<h3>The minimal and maximal values for PageRank<\/h3>\n<p>Recall from last lecture that the PageRank vector [tex]q[\/tex] is an [tex]N[\/tex]-dimensional probability distribution, with the probability [tex]q_j[\/tex] measuring the importance of webpage [tex]j[\/tex].  Because [tex]q[\/tex] is a probability distribution, the total of all the probabilities must add up to one, and so the average PageRank is [tex]1\/N[\/tex].<\/p>\n<p>It&#8217;s interesting that we can get this precise value for the average PageRank, but what we&#8217;d really like is more detailed information about the way PageRank behaves for typical pages.  Let&#8217;s start by studying the <em>minimal<\/em> and <em>maximal<\/em> possible values for PageRank.<\/p>\n<p>Intuitively, we expect that the minimal value for PageRank occurs when a page isn&#8217;t linked to by any other page.  This intuition can be made more rigorous as follows.  Recall our definition of PageRank using a websurfer randomly surfing through webpages.  At any given step of their random walk, the crazy websurfer has a probability at least [tex]t P_j[\/tex] of teleporting to vertex [tex]j[\/tex], and thus we must have<\/p>\n<p>[tex]   q_j \\geq t P_j [\/tex]<\/p>\n<p>for all pages [tex]j[\/tex].  Furthermore, it&#8217;s possible for the PageRank to be equal to this value if, for example, no other pages link to page [tex]j[\/tex]. Thus [tex]tP_j[\/tex] is the minimal value for the PageRank of page [tex]j[\/tex].  A more algebraic way of putting this argument is to recall from the last lecture that [tex]tP = (I-sG)q[\/tex], and thus [tex]t P_j = q_j &#8211; s(Gq)_j \\leq q_j[\/tex], since [tex](Gq)_j \\geq 0[\/tex].  For the original PageRank algorithm, [tex]P_j = 1\/N[\/tex], and so the minimal PageRank is [tex]t\/N[\/tex] for all webpages, a factor [tex]t[\/tex] smaller than the average PageRank.<\/p>\n<h3>Exercises<\/h3>\n<ul>\n<li> We will say that webpage [tex]k[\/tex] <em>eventually leads<\/em> to webpage   [tex]j[\/tex] if there is a link path leading from [tex]k[\/tex] to [tex]j[\/tex].  Show that [tex]q_j   = tP_j[\/tex] if and only if [tex]P_k = 0[\/tex] for all pages [tex]k[\/tex] that eventually   lead to page [tex]j[\/tex].  Obviously, this can&#8217;t happen for the original   PageRank algorithm, where [tex]P_j = 1\/N[\/tex], unless there are no pages at   all linking to page [tex]j[\/tex].  Generically, therefore, most pages have   PageRank greater than this minimum. <\/ul>\n<p>We can use the value for the minimal PageRank to deduce the maximal possible PageRank.  Because the sum of all PageRanks is [tex]1[\/tex], the maximal possible PageRank for page [tex]j[\/tex] occurs if all other PageRanks have their minimal value, [tex]tP_k[\/tex].  In this case [tex]q_j = 1-\\sum_{k\\neq   j} t P_k = 1-t(1-P_j) = s+tP_j[\/tex], and so we have<\/p>\n<p>[tex]   q_j \\leq s + t P_j. [\/tex]<\/p>\n<p>An example where maximal PageRank occurs is a web with a &#8220;star&#8221; topology of links, with the central webpage, which we&#8217;ll label [tex]1[\/tex], having only a single outgoing link, to itself, and every other webpage [tex]2,\\ldots,N[\/tex] having a single link going to page [tex]1[\/tex].  Pages [tex]2,\\ldots,N[\/tex] have no incoming links, and thus have minimal PageRank [tex]tP_j[\/tex], and so page [tex]1[\/tex] has maximal PageRank, [tex]s+tP_1[\/tex].<\/p>\n<h3>Exercises<\/h3>\n<ul>\n<li> In the previous example show explicitly that [tex]q =   (s+tP_1,tP_2,\\ldots,tP_N)[\/tex] satisfies the equation [tex]Mq = q[\/tex].\n<li> In the star topology example, we required that webpage [tex]1[\/tex] link   to itself.  The reason was so that page [tex]1[\/tex] was not a dangling page,   and thus treated by PageRank as effectively linked to every page.   What happens to the PageRank of page [tex]1[\/tex] if we make it dangling? <\/ul>\n<h3>Typical values of PageRank<\/h3>\n<p>It&#8217;s all very well to know the maximal and minimal values of PageRank, but in practice those situations are rather unusual.  What can we say about typical values of PageRank?  To answer this question, let&#8217;s build a model of the web, and study its properties.  We&#8217;ll assume a very simple model of a web with just 5,000 pages, with each page randomly linking to 10 others.  The following code plots a histogram of the PageRanks, for the case [tex]s = 0.85[\/tex].<\/p>\n<pre> \r\n# Generates a 5,000 page web, with each page randomly \r\n# linked to 10 others, computes the PageRank vector, and \r\n# plots a histogram of PageRanks.\r\n# \r\n# Runs under python 2.5 with the numpy and matplotlib \r\n# libraries.  If these are not already installed, you'll \r\n# need to install them.\r\n\r\nfrom numpy import * \r\nimport random \r\nimport matplotlib.pyplot as plt\r\n\r\n# Returns an n-dimensional column vector with m random \r\n# entries set to 1.  There must be a more elegant way to \r\n# do this! \r\ndef randomcolumn(n,m):\r\n  values = random.sample(range(0,n-1),m)\r\n  rc = mat(zeros((n,1)))\r\n  for value in values:\r\n    rc[value,0] = 1\r\n  return rc\r\n\r\n# Returns the G matrix for an n-webpage web where every \r\n# page is linked to m other pages. \r\ndef randomG(n,m):   \r\n  G = mat(zeros((n,n)))\r\n  for j in range(0,n-1):\r\n    G[:,j] = randomcolumn(n,m)\r\n  return (1.0\/m)*G\r\n\r\n# Returns the PageRank vector for a web with parameter s \r\n# and whose link structure is described by the matrix G. \r\ndef pagerank(G,s):\r\n  n = G.shape[0]\r\n  id = mat(eye(n))\r\n  teleport = mat(ones((n,1)))\/n\r\n  return (1-s)*((id-s*G).I)*teleport\r\n\r\nG = randomG(5000,20)\r\npr = pagerank(G,0.85)\r\nprint std(pr)\r\nplt.hist(pr,50) \r\nplt.show() \r\n<\/pre>\n<h3>Problems<\/h3>\n<ul>\n<li> The randomcolumn function in the code above isn&#8217;t very elegant.   Can you find a better way of implementing it? <\/ul>\n<p>The result of running the code is the following histogram of PageRank values &#8211; the horizontal axis is PageRank, while the vertical axis is the frequency with which it occurs:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/michaelnielsen.org\/blog\/wp-content\/uploads\/2008\/12\/random_pagerank_m10_s85.jpg\" width=\"400px\"><\/p>\n<p>We see from this graph that PageRank has a mean of [tex]1\/5000[\/tex], as we expect, and an empirical standard deviation of [tex]0.000055[\/tex].  Roughly speaking, it looks Gaussian, although, of course, PageRank is bounded above and below, and so can&#8217;t literally be Gaussian.  <\/p>\n<p>Suppose now that we change the number of links per page to [tex]100[\/tex], but keep everything else the same.  Then the resulting histogram is:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/michaelnielsen.org\/blog\/wp-content\/uploads\/2008\/12\/random_pagerank_m100_s85.jpg\" width=\"400px\"><\/p>\n<p>This also looks Gaussian, but has a smaller standard deviation of [tex]0.000017[\/tex].  Notice that while we approach the minimal possible value of PageRank ([tex]t\/N = 0.00003[\/tex]) in the first graph, in neither graph do we come anywhere near the maximal possible value of PageRank, which is just a tad over [tex]s = 0.85[\/tex].<\/p>\n<p>Looking at these results, and running more numerical experiments in a similar vein, we see that as the number of outgoing links is increased, the standard deviation in PageRank seems to decrease.  A bit of playing around suggests that the standard deviation decreases as one over the square root of the number of outgoing links, at least when the number of outgoing links is small compared with the total number of pages.  More numerical experimentation suggests that the standard deviation is also proportional to [tex]s\/N[\/tex].  As a result, I conjecture:<\/p>\n<p><strong>Conjecture:<\/strong>  Consider a random web with [tex]n[\/tex] pages and [tex]m[\/tex] outgoing links per   page.  Then for [tex]1 \\ll m \\ll n[\/tex] the PageRank distribution approaches   a Gaussian with mean [tex]1\/n[\/tex] and standard deviation [tex]s\/(n\\sqrt{m})[\/tex].<\/p>\n<p>For the case [tex]n = 5000, m = 10, s = 0.85[\/tex] this suggests a standard deviation of [tex]0.000054[\/tex], while for the case [tex]n = 5000, m = 100, s = 0.85[\/tex] it suggests a standard deviation of [tex]0.000017[\/tex].  Both these results are in close accord with the earlier empirical findings.  The further numerical investigations I&#8217;ve done (not a lot, it must be said) also confirm it.<\/p>\n<p>Incidentally, you might wonder why I chose to use a web containing [tex]5,000[\/tex] pages.  The way I&#8217;ve computed PageRank in the program above, we compute the matrix inverse of a [tex]5,000[\/tex] by [tex]5,000[\/tex] matrix. Assuming we&#8217;re using [tex]64[\/tex]-bit floating point arithmetic, that means just storing the matrix requires [tex]200[\/tex] megabytes.  Not surprisingly, it turns out that my small laptop can&#8217;t cope with [tex]10,000[\/tex] webpages, so I stuck with [tex]5,000[\/tex].  Still, it&#8217;s notable that at that size the program still ran quite quickly.  In the next lecture we&#8217;ll see how to cope with much larger sets of webpages.<\/p>\n<h3>Exercises<\/h3>\n<ul>\n<li> Modify the program above to run with [tex]m=1[\/tex].  You&#8217;ll find that   the PageRank distribution does not look Gaussian.  What conjecture   do you think describes the PageRank distribution in this case?   What&#8217;s a good reason we might <em>not<\/em> expect Gaussian behaviour   for small values of [tex]m[\/tex]?\n<li> Do you trust the Python code to produce the correct outcomes?   One of the problems in using numeric libraries as black boxes is   that it can be hard to distinguish between real results and   numerical artifacts.  How would you determine which is the case for   the code above?  (For the record, I do believe the results, but have   plans to do more checks in later lectures!)\n<\/ul>\n<h3>Problems<\/h3>\n<ul>\n<li> Can you prove or disprove the above conjecture?  If it&#8217;s wrong,   is there another similar type of result you can prove for PageRank,   a sort of &#8220;central limit theorem&#8221; for PageRank?  Certainly, the   empirical results strongly suggest that some type of result in this   vein is true!  I expect that a good strategy for attacking this   problem is to first think about the problem from first principles,   and, if you&#8217;re getting stuck, to consult some of the literature   about random walks on random graphs.\n<li> What happens if instead of linking to a fixed number of   webpages, the random link structure is governed by a probability   distribution?  You might like to do some computational experiments,   and perhaps formulate (and, ideally, prove) a conjecture for this   case.\n<li> Following up on the last problem, I believe that a much better   model of the web than the one we&#8217;ve used is to assume a power-law   distribution of incoming (not outgoing) links to every page. Can you   formulate (and, ideally, prove) a type of central limit theorem for   PageRank in this case?  I expect to see a much broader distribution   of PageRanks than the Gaussian we saw in the case of a fixed number   of outgoing links.\n<\/ul>\n<h3>A simple technique to spam PageRank<\/h3>\n<p>One of the difficulties faced by search engines is spammers who try to artificially inflate the prominence of webpages by pumping up their PageRank.  To defeat this problem, we need to understand how such inflation can be done, and find ways of avoiding it.  Here&#8217;s one simple technique for getting a high PageRank.<\/p>\n<p>Suppose the web contains [tex]N[\/tex] pages, and we build a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Link_farm\">link farm<\/a> containing [tex]M[\/tex] additional pages, in the star topology described earlier.  That is, we number our link farm pages [tex]1,\\ldots,M[\/tex], make page [tex]1[\/tex] link only to itself, and all the other pages link only to page [tex]1[\/tex].  Then if we use the original PageRank, with uniform teleportation probabilities [tex]1\/(N+M)[\/tex], a similar argument to earlier shows that the PageRank for page [tex]1[\/tex] is<\/p>\n<p>[tex]   (s + t\/M) \\frac{M}{M+N}. [\/tex]<\/p>\n<p>As we increase the size, [tex]M[\/tex], of the link farm it eventually becomes comparable to the size of the rest of the web, [tex]N[\/tex], and this results in an enormous PageRank for page [tex]1[\/tex] of the link farm.  This is not just a theoretical scenario.  In the past, <a href=\"http:\/\/googlesystem.blogspot.com\/2006\/06\/google-index-flooded-with-spam.html\">spammers   have built link farms containing billions of pages<\/a>, using a technique similar to that described above to inflate their PageRank. They can then sell links from page 1 to other pages on the web, essentially renting out some of their PageRank.<\/p>\n<p>How could we defeat this strategy?  If we assume that we can detect the link farm being built, then a straightforward way of dealing with it is to modify the teleportation vector so that all pages in the link farm are assigned teleportation probabilities of zero, eliminating all the spurious PageRank accumulated due to the large number of pages in the link farm.  Unfortunately, this still leaves the problem of detecting the link farm to begin with.  That&#8217;s not so easy, and I won&#8217;t discuss it in this lecture, although I hope to come back to it in a later lecture.<\/p>\n<h3>Exercises<\/h3>\n<ul>\n<li> Prove that the PageRank for page [tex]1[\/tex] of the link farm is, as   claimed above, [tex](s+t\/M) M\/(M+N)[\/tex].\n<li> The link topology described for the link farm is not very   realistic.  To get all the pages into the Google index the link farm   would need to be crawled by Google&#8217;s webcrawler.  But the crawler   would never discover most of the pages, since they aren&#8217;t linked to   by any other pages.  Can you think of a more easily discoverable   link farm topology that still gains at least one of the pages a   large PageRank?\n<\/ul>\n<p> That concludes the second lecture.  Next lecture, we&#8217;ll look at how to build a more serious implementation of PageRank, one that can be applied to a web containing millions of pages, not just thousands of pages. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is one in a series of posts about the Google Technology Stack &#8211; PageRank, MapReduce, and so on &#8211; based on a lecture series I&#8217;m giving in 2008 and 2009. If you&#8217;d like to know more about the series, please see the course syllabus. There is a FriendFeed room for the series here. In&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/blog\/lectures-on-the-google-technology-stack-2-building-our-pagerank-intuition\/\">Continue reading <span class=\"screen-reader-text\">Lectures on the Google Technology Stack 2: Building our PageRank Intuition<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[57],"tags":[],"class_list":["post-511","post","type-post","status-publish","format-standard","hentry","category-gts","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts\/511","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/comments?post=511"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/posts\/511\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/media?parent=511"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/categories?post=511"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/blog\/wp-json\/wp\/v2\/tags?post=511"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}