The PageRank distribution for the web

The PageRank for a webpage is a probability between 0 and 1. The general idea is that PageRank quantifies the importance of the page: the bigger the probability the more important the page. I was curious about what the PageRank distribution is across the web as a whole, and so I built a very simple model to investigate the question.

Here’s the three assumptions that went into the model.

1. I use just 200 webpages. The qualitative results don’t seem to vary much as the number of webpages increases further, and the results get both harder to interpret and harder to generate, so 200 seems like a good size.

2. I assume that the number [tex]L[/tex] of inbound links to any given webpage is governed by a Pareto probability distribution [tex]p(L) \propto 1/L^a[/tex], with [tex]a = 2[/tex]. This assumption is based on a paper by Adamic and Huberman. Note that the data in that paper is a decade old, and more recent data should really be used. (An independently interesting question is how that exponent is changing over time, and why it changes.)

3. I assume that the number of inbound links to each webpage is an independent random variable.

With these assumptions, the histogram of PageRanks for a typical random web looks like this:

Aggregating over multiple runs gives:

There are a few notable things about these histograms.

First, most pages have PageRank near the minimal possible value of [tex]0.15 * 1/200 = 0.00075[/tex] (see here for an explanation of why that’s the minimal possible value.

Second, the distribution of PageRanks drops off very fast. Because there are 200 webpages, and the PageRanks must sum to one (being probabilities), the “average” PageRank must be [tex]0.005[/tex]. You can see from the histograms that the distribution has already dropped off quite a bit by the time you get to this PageRank: most pages have a PageRank quite a bit below the average. It’s a few very high PageRank pages that restore the average.

Third, the page with the highest PageRank had a PageRank approximately 20 times higher than average PageRank.

I haven’t done the analysis, but it looks pretty likely that the distribution of PageRanks for this model is itself approximated by a power law distribution. Curious.

This post is part of an ongoing series about the Google Technology Stack, covering technologies such as PageRank, MapReduce, the Google File System, and Bigtable. Posts appear on this blog once a week; there is an associated FriendFeed room for discussion.

8 comments

Artur Adib says:

December 13, 2008 at 7:19 pm

I know this is trivial, but how about using a log-log scale for your plots? We can hardly see the data in this linear scale. Perhaps this will also reveal your power-law.

You might also enjoy taking a look at what folks from my alma mater did on the subject:

“Using PageRank to Characterize Web Structure”

http://www.internetmathematics.org/volumes/3/1/Pandurangan.pdf

I’m really enjoying your lectures though. Keep up the good work.

-Artur
http://arturadib.blogspot.com
oz says:

December 14, 2008 at 8:18 pm

Isn’t the page-rank of a page simply the fraction
of time a random-walker spends on this page?

Assuming this, if the network was undirected, then the page-rank is exactly proportional to the degree of the node. Thus if the degree is taken from a power-law distribution, it immediately implies that the page-rank follows a power-law with the same exponent.
For a directed graph i guess there are deviations from this, but my intuition is that if edges are drawn randomly it wouldn’t change much thus it’s not surprising that you still get a power-law distribution. It’s interesting and easy to check if the exponent of the page-rank dist. is the same as for the in-degree dist.
justkeeper says:

December 14, 2008 at 8:34 pm

You definitely should have a look at this article,http://arxivblog.com/?p=558 , which is very interesting.
Michael Nielsen says:

December 14, 2008 at 9:18 pm

Oz – PageRank isn’t generally proportional to the indegree. Getting a measure that isn’t directly related to indegree was part of Page and Brin’s motivation, because indegree is so easy to game by spammers. With that said, some people have noticed a reasonable correlation between the two measures in many instances.
Michael Nielsen says:

December 14, 2008 at 9:24 pm

Artur – I’m brand new to Python, and spent about 20 minutes twiddling with libraries trying to get log-log before I posted. I’ll get it later, once I’ve read through all the documentation for the numeric library. Thanks for the link to the paper, which I hadn’t seen.
Artur Adib says:

December 14, 2008 at 11:02 pm

Do you use a Mac? If so, check out this nice little plotting program (which I actually use for research):

http://plot.micw.eu/

You seem to be stepping into “Barabasi grounds”, which is pretty exciting. Have you read his book “Linked”? It’s a gentle introduction to how power law behavior emerges in graph theory. I enjoyed it.

-A.
Michael Nielsen says:

December 14, 2008 at 11:10 pm

No, I don’t use a mac. In any case, I’m not in any hurry. I intend to work my way through the documentation for the Python scientific libraries, and I’ll come to it.

I’ve read a bunch of Barabasi’s papers, and by other people (Watts, Newman etc) working on power laws and graphs etc. It’s certainly interesting stuff! I haven’t yet read “Linked”, although it’s been sitting on my bookcase for a few years, and I should probably take it down.
Pingback: Tecnologia de Internet (Locaweb) » Blog Archive » RailsConf’09: painel do primeiro dia

Comments are closed.