Using your laptop to compute PageRank for millions of webpages

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web as a whole, simply because the web contains so many webpages. While just a few lines of code can be used to implement PageRank on collections of a few thousand webpages, it’s trickier to compute PageRank for larger sets of pages.

The underlying problem is that the most direct way to compute the PageRank of [tex]n[/tex] webpages involves inverting an [tex]n \times n[/tex] matrix. Even when [tex]n[/tex] is just a few thousand, this means inverting a matrix containing millions or tens of millions of floating point numbers. This is possible on a typical personal computer, but it’s hard to go much further.

In this post, I describe how to compute PageRank for collections containing millions of webpages. My little laptop easily coped with two million pages, using about 650 megabytes of RAM and a few hours of computation – at a guess it would max out at about 4 to 5 million pages. More powerful machines could cope with upwards of 10 million pages. And that’s with a program written in Python, an interpreted language not noted for its efficiency! Of course, this number still falls a long way short of the size of the entire web, but it’s a lot better than the naive approach. In a future post I’ll describe a more scalable parallel implementation that can be used to compute PageRank for hundreds of millions of pages.

Let’s quickly run through the basic facts we need to know about PageRank. The idea is that we number webpages [tex]1,\ldots,n[/tex]. For webpage number [tex]j[/tex] there is an associated PageRank [tex]q_j[/tex] which measures the importance of page [tex]j[/tex]. The PageRank is a number between [tex]0[/tex] and [tex]1[/tex]; the bigger the PageRank, the more important the page. Furthermore, the vector [tex]q = (q_1,\ldots,q_n)[/tex] of all the PageRanks is a probability distribution, that is, the PageRanks sum to one.

How is the PageRank vector [tex]q[/tex] computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector [tex]q[/tex] can be defined by the equation (explanation below):

[tex] q \leftarrow M^j P. [/tex]

What this equation represents is a starting distribution [tex]P[/tex] for the crazy websurfer, and then [tex]j[/tex] steps of “surfing”, with each action of [tex]M[/tex] representing how the distribution changes in a single step. [tex]P[/tex] is an [tex]n[/tex]-dimensional vector, and [tex]M[/tex] is an [tex]n \times n[/tex] matrix whose entries reflect the link structure of the web in a way I’ll describe below. The PageRank [tex]q[/tex] is defined in the limit of large [tex]j[/tex] – in our examples, convergence occurs for [tex]j[/tex] in the range [tex]20[/tex] to [tex]30[/tex]. You might wonder how [tex]P[/tex] is chosen, but part of the magic of PageRank is that it doesn’t matter how [tex]P[/tex] is chosen, provided it’s a probability distribution. The intuition here is that the starting distribution for the websurfer doesn’t matter to their long-run behaviour. We’ll therefore choose the uniform probability distribution, [tex]P = (1/n,1/n,\ldots)[/tex].

How is the matrix [tex]M[/tex] defined? It can be broken up into three pieces

[tex] M = s A + s D + t E, [/tex]

which correspond to, respectively, a contribution [tex]sA[/tex] representing the crazy websurfer randomly picking links to follow, a contribution [tex]sD[/tex] due to the fact that the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), and so something else needs to be done in that case, and finally a contribution [tex]tE[/tex] representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set [tex]s = 0.85[/tex] and [tex]t = 1-s = 0.15[/tex] as the respective probabilities for randomly selecting a link and teleporting. See here for discussion of the reasons for this choice.

The matrix [tex]A[/tex] describes the link structure of the web. In particular, suppose we define [tex]\#(j)[/tex] to be the number of links outbound from page [tex]j[/tex]. Then [tex]A_{kj}[/tex] is defined to be [tex]0[/tex] if page [tex]j[/tex] does not link to [tex]k[/tex], and [tex]1/\#(j)[/tex] if page [tex]j[/tex] does link to page [tex]k[/tex]. Stated another way, the entries of the [tex]j[/tex]th column of [tex]A[/tex] are zero, except at locations corresponding to outgoing links, where they are [tex]1/\#(j)[/tex]. The intuition is that [tex]A[/tex] describes the action of a websurfer at page [tex]j[/tex] randomly choosing an outgoing link.

The matrix [tex]D[/tex] is included to deal with the problem of dangling pages, i.e., pages with no outgoing links, where it is ambigous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if [tex]j[/tex] is a dangling page, then the [tex]j[/tex]th column of [tex]D[/tex] should have all its entries [tex]1/n[/tex], otherwise all the entries should be zero. A compact way of writing this is

[tex] D = e d^T/ n, [/tex]

where [tex]d[/tex] is the vector of dangling pages, i.e., the [tex]j[/tex]th entry of [tex]d[/tex] is [tex]1[/tex] if page [tex]j[/tex] is dangling, and otherwise is zero. [tex]e[/tex] is the vector whose entries are all [tex]1[/tex]s.

The final piece of [tex]M[/tex] is the matrix [tex]E[/tex], describing the bored websurfer teleporting somewhere else at random. This matrix has entries [tex]1/n[/tex] everywhere, representing a uniform probability of going to another webpage.

Exercises

This exercise is for people who’ve read my earlier post introducing PageRank. In that post you’ll see that I’ve broken up [tex]M[/tex] in a slightly different way. You should verify that the breakup in the current post and the breakup in the old post result in the same matrix [tex]M[/tex].

Now, as we argued above, you quickly run into problems in computing the PageRank vector if you naively try to store all [tex]n \times n = n^2[/tex] elements in the matrix [tex]M[/tex] or [tex]M^j[/tex]. Fortunately, it’s possible to avoid doing this. The reason is that the matrices [tex]A, D[/tex] and [tex]E[/tex] all have a special structure. [tex]A[/tex] is a very sparse matrix, since most webpages only link to a tiny fraction of the web as a whole, and so most entries are zero. [tex]D[/tex] is not ordinarily sparse, but it does have a very simple structure, determined by the vector (not matrix!) of dangling pages, [tex]d[/tex], and this makes it easy to do vector multiplications involving [tex]D[/tex]. Finally, the matrix [tex]E[/tex] is even more simple, and it’s straightforward to compute the action of [tex]E[/tex] on a vector.

The idea used in our algorithm is therefore to compute [tex]M^jP[/tex] by repeatedly multiplying the vector [tex]P[/tex] by [tex]M[/tex], and with each multiplication using the special form [tex]M = sA+sD+tE[/tex] to avoid holding anything more than [tex]n[/tex]-dimensional vectors in memory. This is the critical step that lets us compute PageRank for a web containing millions of webpages.

The crux of the algorithm, therefore, is to determine [tex](Mv)_j[/tex] for each index [tex]j[/tex], where [tex]v[/tex] is an arbitrary vector. We do this by computing [tex](Av)_j, (Dv)_j[/tex] and [tex](Ev)_j[/tex] separately, and then summing the results, with the appropriate prefactors, [tex]s[/tex] or [tex]t[/tex].

To compute [tex](Av)_j[/tex], just observe that [tex](Av)_j = \sum_k A_{jk} v_k[/tex]. This can be computed by summing over all the pages [tex]k[/tex] which have links to page [tex]j[/tex], with [tex]A_{jk} = 1/\#(k)[/tex] in each case. We see that the total number of operations involved in computing all the entries in [tex]Av[/tex] is [tex]\Theta(l)[/tex], where [tex]l[/tex] is the total number of links in the web graph, a number that scales far more slowly than [tex]n^2[/tex]. Furthermore, the memory consumed is [tex]\Theta(n)[/tex].

I’m being imprecise in analysing the number of operations and the memory consumption, not worrying about small constant factors. If we were really trying to optimize our algorithm, we’d be careful to keep track of exactly what resources are consumed, rather than ignoring those details. However, as we’ll see below, the advantages of the approach we’re now considering over the approach based on matrix inversion are enormous, and so ignoring a few small factors is okay for the purposes of comparison.

To compute [tex]Dv[/tex], observe that [tex]Dv = e (d \cdot v) / n[/tex], and so it suffices to compute the inner product [tex]d \cdot v[/tex], which in the worst case requires [tex]\Theta(n)[/tex] operations, and [tex]\Theta(n)[/tex] memory, to store [tex]d[/tex].

Finally, to compute [tex](Ev)_j[/tex] is easy, since it is just [tex](e \cdot v)/n[/tex], which requires virtually no additional computation or memory.

It follows that we can compute [tex]Mv[/tex] using a total number of operations [tex]\Theta(n+l)[/tex], and total memory [tex]\Theta(n)[/tex], and thus approximate the PageRank vector

[tex] q \approx M^j P [/tex]

with a total number of operations [tex]\Theta(j (n+l))[/tex], and with memory consumption [tex]\Theta(n)[/tex].

These are imposingly large numbers. Suppose we have [tex]n = 10^6[/tex] webpages, [tex]l = 10^7[/tex] links, and use [tex]j = 10^2[/tex] iterations. Then the total amount of memory involved is on the order of [tex]10^6[/tex] floating point numbers, and the total number of operations is on the order of [tex]10^9[/tex].

On the other hand, if we’d used the method that required us to store the full matrix [tex]M[/tex], we’d have needed to store on the order of [tex]10^{12}[/tex] floating pointing numbers. Furthermore using Gaussian elimination the matrix inversion requires on the order of [tex]10^{18}[/tex] operations, give or take a few (but only a few!) orders of magnitude. More efficient methods for matrix inverstion are known, but the numbers remain daunting.

The code

Let’s think about how we’d code this algorithm up in Python. If you’re not already familiar with Python, then a great way to learn is “Dive into Python”, an excellent free online introductory book about Python by Mark Pilgrim. You can read the first five chapters in just a few hours, and that will provide you with enough basic understanding of Python to follow most code, and to do a lot of coding yourself. Certainly, it’s more than enough for all the examples in this post series.

The most important part of our implementation of PageRank is to choose our representation of the web in a compact way, and so that the information we need to compute [tex]M^jP[/tex] can be retrieved quickly. This is achieved by the following definition for a class web, whose instances represent possible link structures for the web:

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

Suppose g is an object of type web, created using, e.g., g = web(3), to create a web of 3 pages. g.size = 3 for this web. g.in_links is a dictionary whose keys are numbers for the webpages, 0,1,2, and with corresponding values being lists of the inbound webpages. These lists are initially set to contain no webpages:

g.in_links = {0 : [], 1 : [], 2 : []}

Note that in Python lists are indexed [tex]0,1,2,\ldots[/tex], and this makes it convenient to number pages [tex]0,\ldots,n-1[/tex], rather than [tex]1,\ldots,n[/tex], as we have been doing up to now.

The advantages of storing the in-links as a dictionary of lists are two-fold. First, by using this representation, we store no more information than we absolutely need. E.g., we are not storing information about the absence of links, as we would be if we stored the entire adjacency matrix. Second, Python dictionaries are a convenient way to store the information, because they don’t require any ordering information, and so are in many ways faster to manipulate than lists.

You might be wondering why there is no attribute g.out_links in our class definition. The glib answer is that we don’t need it – after all, it can, in principle, be reconstructed from g.in_links. A better answer is that in real applications it would quite sensible to also have g.out_links, but storing that extra dictionary would consume a lot of extra memory, and we avoid it since we can compute PageRank without it. An even better answer would be to make use of a data structure that gives us the speed and convenience provided by both g.in_links and g.out_links, but only requires links to be stored once. This can certainly be done, but requires more work, and so I’ve avoided it.

Let’s finish our description of the attributes of g. g.number_out_links is a dictionary such that g.number_out_links[j] is the number of out-links from page j. Initially these values are all set to zero, but they will be incremented when links are added. We need to store the number of outgoing links in order to compute the probabilities [tex]1/\#(k)[/tex] used in the PageRank algorithm, as described earlier. While this could in principle be computed from the dictionary g.in_links, in practice that would be extremely slow.

Finally, g.dangling_pages is a dictionary whose key set is the set of dangling webpages. The values of the keys are placeholders which are never used, and so we use True as a placeholder value. Initially, all webpages are dangling, but we remove keys when links are added.

With these data structures in place, the rest of the program is a pretty straightforward implementation of the algorithm described earlier. Here’s the code:

# Generates a random web link structure, and finds the
# corresponding PageRank vector.  The number of inbound
# links for each page is controlled by a power law
# distribution.
#
# This code should work for up to a few million pages on a
# modest machine.
#
# Written and tested using Python 2.5.

import numpy
import random

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

def paretosample(n,power=2.0):
  '''Returns a sample from a truncated Pareto distribution
  with probability mass function p(l) proportional to
  1/l^power.  The distribution is truncated at l = n.'''
  m = n+1
  while m > n: m = numpy.random.zipf(power)
  return m

def random_web(n=1000,power=2.0):
  '''Returns a web object with n pages, and where each
  page k is linked to by L_k random other pages.  The L_k
  are independent and identically distributed random
  variables with a shifted and truncated Pareto
  probability mass function p(l) proportional to
  1/(l+1)^power.'''
  g = web(n)
  for k in xrange(n):
    lk = paretosample(n+1,power)-1
    values = random.sample(xrange(n),lk)
    g.in_links[k] = values
    for j in values: 
      if g.number_out_links[j] == 0: g.dangling_pages.pop(j)
      g.number_out_links[j] += 1
  return g

def step(g,p,s=0.85):
  '''Performs a single step in the PageRank computation,
  with web g and parameter s.  Applies the corresponding M
  matrix to the vector p, and returns the resulting
  vector.'''
  n = g.size
  v = numpy.matrix(numpy.zeros((n,1)))
  inner_product = sum([p[j] for j in g.dangling_pages.keys()])
  for j in xrange(n):
    v[j] = s*sum([p[k]/g.number_out_links[k] 
    for k in g.in_links[j]])+s*inner_product/n+(1-s)/n
  # We rescale the return vector, so it remains a
  # probability distribution even with floating point
  # roundoff.
  return v/numpy.sum(v)  

def pagerank(g,s=0.85,tolerance=0.00001):
  '''Returns the PageRank vector for the web g and
  parameter s, where the criterion for convergence is that
  we stop when M^(j+1)P-M^jP has length less than
  tolerance, in l1 norm.'''
  n = g.size
  p = numpy.matrix(numpy.ones((n,1)))/n
  iteration = 1
  change = 2
  while change > tolerance:
    print "Iteration: %s" % iteration
    new_p = step(g,p,s)
    change = numpy.sum(numpy.abs(p-new_p))
    print "Change in l1 norm: %s" % change
    p = new_p
    iteration += 1
  return p

print '''Now computing the PageRank vector for a random
web containing 10000 pages, with the number of inbound
links to each page controlled by a Pareto power law
distribution with parameter 2.0, and with the criterion
for convergence being a change of less than 0.0001 in l1
norm over a matrix multiplication step.'''

g = random_web(10000,2.0) # works up to several million
                          # pages.
pr = pagerank(g,0.85,0.0001)

As written, the code computes PageRank for a 10,000 page web, and I recommend you start with that number of pages, to check that the code runs. Once that’s done, you can change the parameter in the second last line to change the number of pages. If you do that, you may wish to change the tolerance 0.0001 in the final line, decreasing it to ensure a tolerance more appropriate to the number of pages, say [tex]10^{-7}[/tex].

What determines the tolerance that should be used? I have used the [tex]l_1[/tex] norm to quantify the rate of convergence in the code, but that is a rather arbitrary measure. In particular, an alternate measure might be to look at the magnitude of the differences between individual elements, computing [tex]\max_j |(Mv-v)_j|[/tex], and requiring that it be substantially smaller than the minimal possible value for PageRank, [tex]t/n[/tex]. This is less conservative, and arguably a more relevant measure; the code is easily changed.

Exercises

The code above is complex enough that bugs are likely. As a test, use an alternate method to compute PageRank for small web sizes, and compare your results against the output from the code above. Do you believe the code is correct?
Modify the program described in this lecture so that it works with a non-uniform teleportation vector, as described in an earlier post.
Python is a great language, but it’s not designed with speed in mind. Write a program in C to compute the PageRank for a random web. Can you compute the PageRank for ten million pages? What’s the largest number you’re able to compute on a single machine?

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. You can subscribe to my blog here.

23 comments

Pingback: Geeks3D Daily News Round-up (2008.12.24) And Merry Xmas! | The Geeks Of 3D
Brad Cater says:

December 24, 2008 at 10:31 am

Very cool! Just remember to end your print statements with quotes and the value to be printed in the “while change > tolerance” loop. I’m going to experiment a bit using Shed Skin (http://code.google.com/p/shedskin/) to speed it up a bit.
Michael Nielsen says:

December 24, 2008 at 12:29 pm

Brad – Thanks! Turns out there is a bug in the script I use to convert my source file (which is LaTeX) to WordPress, and that messes up the Python. I’ll fix it. I’ll be interested to hear how your experiment with Shed Skin goes.
Pingback: WWGD: Understanding Google’s Technology Stack
John Sidles says:

December 24, 2008 at 7:08 pm

Oh boy! Now we’re getting to a portion of this lecture series that I was *definitely* looking forward to, namely, O(n) matrix-vector multiplication algorithms.

For reasons that (AFAICT) nobody clearly understands, in recent decades O(n) matrix-vector multiplication algorithms have appeared ubiquitously in modern physics, chemistry, engineering, and information theory. Which is good, because with the help of these algorithms (as Michael observes), systems having n~10^6-10^8 degrees of freedom become tractable to analyze and/or simulate.

A key issue (whose answer is not obvious to me from the lecture) is whether M is Hermitian (apparently not) and/or can be related to a Hermitian matrix via an O(n) pre-conditioning algorithm (which is the bit that I’m not clear on at present).

When this is true, then M (typically) can be regarded as a metric on a (typically) tensor network space, and the act of matrix-vector multiplication can be viewed (typically) as the fast computation of the so-called musical isomorphism between one-forms and vectors.

It turns out that this geometric point of view is congenial in many large-scale simulation and information-processing algorithms, but (despite my modest efforts) it has never yet inspired in me any geometric understanding of Google’s PageRank algorithms … and yet I feel that it should.

Anyway, that’s how one reader is processing these fine lectures … a reader who would be pathetically grateful for rays of geometric illumination.

Just to mention, a fine textbook relating to these issues is Anne Greenbaum’s Iterative methods for solving linear systems.
————-
@book{Author = {Greenbaum, Anne}, Publisher = {Society for Industrial and Applied Mathematics (SIAM)}, Series = {Frontiers in Applied Mathematics}, Title = {Iterative methods for solving linear systems}, Volume = {17}, Year = {1997}}
Pingback: Coderies | taggle.org
Pingback: What is Authority? | The Marketing Technology Blog, Indianapolis
Pingback: What is Authority? | Fidicaro.net
Henry Haselgrove says:

January 25, 2009 at 5:06 am

For those of you who like Matlab, Matlab’s inbuilt sparse matrix type appears to be a fairly efficient way to compute pagerank. For example, this Matlab function that I wrote computes a 2 million page pagerank in about 45 seconds: http://users.on.net/~henry/pagerank1.m

Apparently the “scipy” package provides sparse matrix types in python. I’ll give it a go sometime to see whether that is just as fast.
Michael Nielsen says:

January 25, 2009 at 3:10 pm

Thanks for that, Henry! I’ve reshared your comment to the FriendFeed room. It definitely suggests I should use sparse matrices if possible in my distributed implementation.
Daniel Smilkov says:

September 30, 2009 at 2:55 pm

You know,

instead of using column matrices, you can use n-dim arrays:
v = numpy.matrix(numpy.zeros((n,1)))
becomes
v = numpy.zeros(n,dtype=numpy.float32)

and
p = numpy.matrix(numpy.ones((n,1)))/n
becomes
p = numpy.ones(n,dtype=numpy.float32)/n

and putting the initialization of the v vector out of step function, initializing it only ones in pagerank function,

you can get a lot of speed-up.
Michael Nielsen says:

September 30, 2009 at 5:06 pm

Daniel – Thanks for the note. When preparing the notes I considered doing something similar, but decided the presentation would be a little easier to understand without it. It’s been a while, but as I recall, my line of thought was simply that I wanted to avoid numpy, probably because the notes were originally prepared for people not at all familiar with numpy. The purpose of the notes is, of course, pedagogical – production code would be vastly different.
Pingback: Michael Nielsen Â» Using your laptop to compute PageRank for millions of webpages : Popular Links : eConsultant
Pingback: links for 2010-05-27 « Blarney Fellow
Pingback: Data Scientists and Artists are the new upper class – and will be in the future « The Apps Lifestyle
Pingback: Quora
Pingback: MapReduce: What are some good class projects for Machine Learning using MapReduce? - Quora
Jorge Catumba Ruiz says:

February 25, 2012 at 8:30 pm

Hi, I want to use your code for implement the PageRank method in Sage with other methods similar to it. Have I your permission? If yes, I will put your name as author.

Thanks in advance!
Michael Nielsen says:

February 27, 2012 at 3:42 pm

Hi Jorge – Two things:

(1) My code was for educational purposes, and is not suited to production use. E.g., there are edge cases not dealt with, the data structures are deliberately chosen for clarity in exposition, not for utility in execution, and so on. I would not release the code for production use; frankly, it would need to be rewritten from scratch.

(2) The patent for PageRank is held by Google, and I’m not sure whether Sage could incorporate it in any case.

So I’m afraid I’d prefer not.
Jorge Catumba Ruiz says:

February 28, 2012 at 9:26 pm

Ok, thanks.
Pingback: Data Week: Becoming a data scientist - O'Reilly Radar
Pingback: Introducing NodeZoo.com | Richard Rodger
Pingback: Class projects for Hadoop | Digital thoughts

Comments are closed.