# Using your laptop to compute PageRank for millions of webpages

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web as a whole, simply because the web contains so many webpages. While just a few lines of code can be used to implement PageRank on collections of a few thousand webpages, it’s trickier to compute PageRank for larger sets of pages.

The underlying problem is that the most direct way to compute the PageRank of webpages involves inverting an matrix. Even when is just a few thousand, this means inverting a matrix containing millions or tens of millions of floating point numbers. This is possible on a typical personal computer, but it’s hard to go much further.

In this post, I describe how to compute PageRank for collections containing millions of webpages. My little laptop easily coped with two million pages, using about 650 megabytes of RAM and a few hours of computation – at a guess it would max out at about 4 to 5 million pages. More powerful machines could cope with upwards of 10 million pages. And that’s with a program written in Python, an interpreted language not noted for its efficiency! Of course, this number still falls a long way short of the size of the entire web, but it’s a lot better than the naive approach. In a future post I’ll describe a more scalable parallel implementation that can be used to compute PageRank for hundreds of millions of pages.

Let’s quickly run through the basic facts we need to know about PageRank. The idea is that we number webpages . For webpage number there is an associated PageRank which measures the importance of page . The PageRank is a number between and ; the bigger the PageRank, the more important the page. Furthermore, the vector of all the PageRanks is a probability distribution, that is, the PageRanks sum to one.

How is the PageRank vector computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector can be defined by the equation (explanation below):

What this equation represents is a starting distribution for the crazy websurfer, and then steps of “surfing”, with each action of representing how the distribution changes in a single step. is an -dimensional vector, and is an matrix whose entries reflect the link structure of the web in a way I’ll describe below. The PageRank is defined in the limit of large – in our examples, convergence occurs for in the range to . You might wonder how is chosen, but part of the magic of PageRank is that it doesn’t matter how is chosen, provided it’s a probability distribution. The intuition here is that the starting distribution for the websurfer doesn’t matter to their long-run behaviour. We’ll therefore choose the uniform probability distribution, .

How is the matrix defined? It can be broken up into three pieces

which correspond to, respectively, a contribution representing the crazy websurfer randomly picking links to follow, a contribution due to the fact that the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), and so something else needs to be done in that case, and finally a contribution representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set and as the respective probabilities for randomly selecting a link and teleporting. See here for discussion of the reasons for this choice.

The matrix describes the link structure of the web. In particular, suppose we define to be the number of links outbound from page . Then is defined to be if page does not link to , and if page does link to page . Stated another way, the entries of the th column of are zero, except at locations corresponding to outgoing links, where they are . The intuition is that describes the action of a websurfer at page randomly choosing an outgoing link.

The matrix is included to deal with the problem of dangling pages, i.e., pages with no outgoing links, where it is ambigous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if is a dangling page, then the th column of should have all its entries , otherwise all the entries should be zero. A compact way of writing this is

where is the vector of dangling pages, i.e., the th entry of is if page is dangling, and otherwise is zero. is the vector whose entries are all s.

The final piece of is the matrix , describing the bored websurfer teleporting somewhere else at random. This matrix has entries everywhere, representing a uniform probability of going to another webpage.

### Exercises

- This exercise is for people who’ve read my earlier post introducing PageRank. In that post you’ll see that I’ve broken up in a slightly different way. You should verify that the breakup in the current post and the breakup in the old post result in the same matrix .

Now, as we argued above, you quickly run into problems in computing the PageRank vector if you naively try to store all elements in the matrix or . Fortunately, it’s possible to avoid doing this. The reason is that the matrices and all have a special structure. is a very sparse matrix, since most webpages only link to a tiny fraction of the web as a whole, and so most entries are zero. is not ordinarily sparse, but it does have a very simple structure, determined by the vector (not matrix!) of dangling pages, , and this makes it easy to do vector multiplications involving . Finally, the matrix is even more simple, and it’s straightforward to compute the action of on a vector.

The idea used in our algorithm is therefore to compute by repeatedly multiplying the vector by , and with each multiplication using the special form to avoid holding anything more than -dimensional vectors in memory. This is the critical step that lets us compute PageRank for a web containing millions of webpages.

The crux of the algorithm, therefore, is to determine for each index , where is an arbitrary vector. We do this by computing and separately, and then summing the results, with the appropriate prefactors, or .

To compute , just observe that . This can be computed by summing over all the pages which have links to page , with in each case. We see that the total number of operations involved in computing all the entries in is , where is the total number of links in the web graph, a number that scales far more slowly than . Furthermore, the memory consumed is .

I’m being imprecise in analysing the number of operations and the memory consumption, not worrying about small constant factors. If we were really trying to optimize our algorithm, we’d be careful to keep track of exactly what resources are consumed, rather than ignoring those details. However, as we’ll see below, the advantages of the approach we’re now considering over the approach based on matrix inversion are enormous, and so ignoring a few small factors is okay for the purposes of comparison.

To compute , observe that , and so it suffices to compute the inner product , which in the worst case requires operations, and memory, to store .

Finally, to compute is easy, since it is just , which requires virtually no additional computation or memory.

It follows that we can compute using a total number of operations , and total memory , and thus approximate the PageRank vector

with a total number of operations , and with memory consumption .

These are imposingly large numbers. Suppose we have webpages, links, and use iterations. Then the total amount of memory involved is on the order of floating point numbers, and the total number of operations is on the order of .

On the other hand, if we’d used the method that required us to store the full matrix , we’d have needed to store on the order of floating pointing numbers. Furthermore using Gaussian elimination the matrix inversion requires on the order of operations, give or take a few (but only a few!) orders of magnitude. More efficient methods for matrix inverstion are known, but the numbers remain daunting.

### The code

Let’s think about how we’d code this algorithm up in Python. If you’re not already familiar with Python, then a great way to learn is “Dive into Python”, an excellent free online introductory book about Python by Mark Pilgrim. You can read the first five chapters in just a few hours, and that will provide you with enough basic understanding of Python to follow most code, and to do a lot of coding yourself. Certainly, it’s more than enough for all the examples in this post series.

The most important part of our implementation of PageRank is to choose our representation of the web in a compact way, and so that the information we need to compute can be retrieved quickly. This is achieved by the following definition for a class `web`, whose instances represent possible link structures for the web:

class web: def __init__(self,n): self.size = n self.in_links = {} self.number_out_links = {} self.dangling_pages = {} for j in xrange(n): self.in_links[j] = [] self.number_out_links[j] = 0 self.dangling_pages[j] = True

Suppose `g` is an object of type `web`, created using, e.g., `g = web(3)`, to create a web of 3 pages. `g.size = 3` for this web. `g.in_links` is a dictionary whose keys are numbers for the webpages, `0,1,2`, and with corresponding values being lists of the inbound webpages. These lists are initially set to contain no webpages:

g.in_links = {0 : [], 1 : [], 2 : []}

Note that in Python lists are indexed , and this makes it convenient to number pages , rather than , as we have been doing up to now.

The advantages of storing the in-links as a dictionary of lists are two-fold. First, by using this representation, we store no more information than we absolutely need. E.g., we are not storing information about the *absence* of links, as we would be if we stored the entire adjacency matrix. Second, Python dictionaries are a convenient way to store the information, because they don’t require any ordering information, and so are in many ways faster to manipulate than lists.

You might be wondering why there is no attribute `g.out_links` in our class definition. The glib answer is that we don’t need it – after all, it can, in principle, be reconstructed from `g.in_links`. A better answer is that in real applications it would quite sensible to also have `g.out_links`, but storing that extra dictionary would consume a lot of extra memory, and we avoid it since we can compute PageRank without it. An even better answer would be to make use of a data structure that gives us the speed and convenience provided by both `g.in_links` and `g.out_links`, but only requires links to be stored once. This can certainly be done, but requires more work, and so I’ve avoided it.

Let’s finish our description of the attributes of `g`. `g.number_out_links` is a dictionary such that `g.number_out_links[j]` is the number of out-links from page `j`. Initially these values are all set to zero, but they will be incremented when links are added. We need to store the number of outgoing links in order to compute the probabilities used in the PageRank algorithm, as described earlier. While this could in principle be computed from the dictionary `g.in_links`, in practice that would be extremely slow.

Finally, `g.dangling_pages` is a dictionary whose key set is the set of dangling webpages. The values of the keys are placeholders which are never used, and so we use `True` as a placeholder value. Initially, all webpages are dangling, but we remove keys when links are added.

With these data structures in place, the rest of the program is a pretty straightforward implementation of the algorithm described earlier. Here’s the code:

# Generates a random web link structure, and finds the # corresponding PageRank vector. The number of inbound # links for each page is controlled by a power law # distribution. # # This code should work for up to a few million pages on a # modest machine. # # Written and tested using Python 2.5. import numpy import random class web: def __init__(self,n): self.size = n self.in_links = {} self.number_out_links = {} self.dangling_pages = {} for j in xrange(n): self.in_links[j] = [] self.number_out_links[j] = 0 self.dangling_pages[j] = True def paretosample(n,power=2.0): '''Returns a sample from a truncated Pareto distribution with probability mass function p(l) proportional to 1/l^power. The distribution is truncated at l = n.''' m = n+1 while m > n: m = numpy.random.zipf(power) return m def random_web(n=1000,power=2.0): '''Returns a web object with n pages, and where each page k is linked to by L_k random other pages. The L_k are independent and identically distributed random variables with a shifted and truncated Pareto probability mass function p(l) proportional to 1/(l+1)^power.''' g = web(n) for k in xrange(n): lk = paretosample(n+1,power)-1 values = random.sample(xrange(n),lk) g.in_links[k] = values for j in values: if g.number_out_links[j] == 0: g.dangling_pages.pop(j) g.number_out_links[j] += 1 return g def step(g,p,s=0.85): '''Performs a single step in the PageRank computation, with web g and parameter s. Applies the corresponding M matrix to the vector p, and returns the resulting vector.''' n = g.size v = numpy.matrix(numpy.zeros((n,1))) inner_product = sum([p[j] for j in g.dangling_pages.keys()]) for j in xrange(n): v[j] = s*sum([p[k]/g.number_out_links[k] for k in g.in_links[j]])+s*inner_product/n+(1-s)/n # We rescale the return vector, so it remains a # probability distribution even with floating point # roundoff. return v/numpy.sum(v) def pagerank(g,s=0.85,tolerance=0.00001): '''Returns the PageRank vector for the web g and parameter s, where the criterion for convergence is that we stop when M^(j+1)P-M^jP has length less than tolerance, in l1 norm.''' n = g.size p = numpy.matrix(numpy.ones((n,1)))/n iteration = 1 change = 2 while change > tolerance: print "Iteration: %s" % iteration new_p = step(g,p,s) change = numpy.sum(numpy.abs(p-new_p)) print "Change in l1 norm: %s" % change p = new_p iteration += 1 return p print '''Now computing the PageRank vector for a random web containing 10000 pages, with the number of inbound links to each page controlled by a Pareto power law distribution with parameter 2.0, and with the criterion for convergence being a change of less than 0.0001 in l1 norm over a matrix multiplication step.''' g = random_web(10000,2.0) # works up to several million # pages. pr = pagerank(g,0.85,0.0001)

As written, the code computes PageRank for a 10,000 page web, and I recommend you start with that number of pages, to check that the code runs. Once that’s done, you can change the parameter in the second last line to change the number of pages. If you do that, you may wish to change the tolerance 0.0001 in the final line, decreasing it to ensure a tolerance more appropriate to the number of pages, say .

What determines the tolerance that should be used? I have used the norm to quantify the rate of convergence in the code, but that is a rather arbitrary measure. In particular, an alternate measure might be to look at the magnitude of the differences between individual elements, computing , and requiring that it be substantially smaller than the minimal possible value for PageRank, . This is less conservative, and arguably a more relevant measure; the code is easily changed.

### Exercises

- The code above is complex enough that bugs are likely. As a test, use an alternate method to compute PageRank for small web sizes, and compare your results against the output from the code above. Do you believe the code is correct?
- Modify the program described in this lecture so that it works with a non-uniform teleportation vector, as described in an earlier post.
- Python is a great language, but it’s not designed with speed in mind. Write a program in C to compute the PageRank for a random web. Can you compute the PageRank for ten million pages? What’s the largest number you’re able to compute on a single machine?

**About this post:** This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. You can subscribe to my blog here.

### Trackbacks and Pingbacks

- Geeks3D Daily News Round-up (2008.12.24) And Merry Xmas! | The Geeks Of 3D
- WWGD: Understanding Google’s Technology Stack
- Coderies | taggle.org
- What is Authority? | The Marketing Technology Blog, Indianapolis
- What is Authority? | Fidicaro.net
- Michael Nielsen ยป Using your laptop to compute PageRank for millions of webpages : Popular Links : eConsultant
- links for 2010-05-27 « Blarney Fellow
- Data Scientists and Artists are the new upper class – and will be in the future « The Apps Lifestyle
- Quora
- MapReduce: What are some good class projects for Machine Learning using MapReduce? - Quora
- Data Week: Becoming a data scientist - O'Reilly Radar
- Introducing NodeZoo.com | Richard Rodger
- Class projects for Hadoop | Digital thoughts

Comments are closed.

Very cool! Just remember to end your print statements with quotes and the value to be printed in the “while change > tolerance” loop. I’m going to experiment a bit using Shed Skin (http://code.google.com/p/shedskin/) to speed it up a bit.

Brad – Thanks! Turns out there is a bug in the script I use to convert my source file (which is LaTeX) to WordPress, and that messes up the Python. I’ll fix it. I’ll be interested to hear how your experiment with Shed Skin goes.

Oh boy! Now we’re getting to a portion of this lecture series that I was *definitely* looking forward to, namely,

O(n) matrix-vector multiplication algorithms.For reasons that (AFAICT) nobody clearly understands, in recent decades

O(n) matrix-vector multiplication algorithms have appeared ubiquitously in modern physics, chemistry, engineering, and information theory. Which is good, because with the help of these algorithms (as Michael observes), systems havingn~10^6-10^8 degrees of freedom become tractable to analyze and/or simulate.A key issue (whose answer is not obvious to me from the lecture) is whether

Mis Hermitian (apparently not) and/or can be related to a Hermitian matrix via anO(n) pre-conditioning algorithm (which is the bit that I’m not clear on at present).When this is true, then

M(typically) can be regarded as a metric on a (typically) tensor network space, and the act of matrix-vector multiplication can be viewed (typically) as the fast computation of the so-calledmusical isomorphismbetween one-forms and vectors.It turns out that this geometric point of view is congenial in many large-scale simulation and information-processing algorithms, but (despite my modest efforts) it has never yet inspired in me any geometric understanding of Google’s PageRank algorithms … and yet I feel that it should.

Anyway, that’s how one reader is processing these fine lectures … a reader who would be pathetically grateful for rays of geometric illumination.

Just to mention, a fine textbook relating to these issues is Anne Greenbaum’s

Iterative methods for solving linear systems.————-

@book{Author = {Greenbaum, Anne}, Publisher = {Society for Industrial and Applied Mathematics (SIAM)}, Series = {Frontiers in Applied Mathematics}, Title = {Iterative methods for solving linear systems}, Volume = {17}, Year = {1997}}

For those of you who like Matlab, Matlab’s inbuilt sparse matrix type appears to be a fairly efficient way to compute pagerank. For example, this Matlab function that I wrote computes a 2 million page pagerank in about 45 seconds: http://users.on.net/~henry/pagerank1.m

Apparently the “scipy” package provides sparse matrix types in python. I’ll give it a go sometime to see whether that is just as fast.

Thanks for that, Henry! I’ve reshared your comment to the FriendFeed room. It definitely suggests I should use sparse matrices if possible in my distributed implementation.

You know,

instead of using column matrices, you can use n-dim arrays:

v = numpy.matrix(numpy.zeros((n,1)))

becomes

v = numpy.zeros(n,dtype=numpy.float32)

and

p = numpy.matrix(numpy.ones((n,1)))/n

becomes

p = numpy.ones(n,dtype=numpy.float32)/n

and putting the initialization of the v vector out of step function, initializing it only ones in pagerank function,

you can get a lot of speed-up.

Daniel – Thanks for the note. When preparing the notes I considered doing something similar, but decided the presentation would be a little easier to understand without it. It’s been a while, but as I recall, my line of thought was simply that I wanted to avoid numpy, probably because the notes were originally prepared for people not at all familiar with numpy. The purpose of the notes is, of course, pedagogical – production code would be vastly different.

Hi, I want to use your code for implement the PageRank method in Sage with other methods similar to it. Have I your permission? If yes, I will put your name as author.

Thanks in advance!

Hi Jorge – Two things:

(1) My code was for educational purposes, and is not suited to production use. E.g., there are edge cases not dealt with, the data structures are deliberately chosen for clarity in exposition, not for utility in execution, and so on. I would not release the code for production use; frankly, it would need to be rewritten from scratch.

(2) The patent for PageRank is held by Google, and I’m not sure whether Sage could incorporate it in any case.

So I’m afraid I’d prefer not.

Ok, thanks.