The PageRank for a webpage is a probability between 0 and 1. The general idea is that PageRank quantifies the importance of the page: the bigger the probability the more important the page. I was curious about what the PageRank distribution is across the web as a whole, and so I built a very simple model to investigate the question.
Here’s the three assumptions that went into the model.
1. I use just 200 webpages. The qualitative results don’t seem to vary much as the number of webpages increases further, and the results get both harder to interpret and harder to generate, so 200 seems like a good size.
2. I assume that the number [tex]L[/tex] of inbound links to any given webpage is governed by a Pareto probability distribution [tex]p(L) \propto 1/L^a[/tex], with [tex]a = 2[/tex]. This assumption is based on a paper by Adamic and Huberman. Note that the data in that paper is a decade old, and more recent data should really be used. (An independently interesting question is how that exponent is changing over time, and why it changes.)
3. I assume that the number of inbound links to each webpage is an independent random variable.
With these assumptions, the histogram of PageRanks for a typical random web looks like this:
Aggregating over multiple runs gives:
There are a few notable things about these histograms.
First, most pages have PageRank near the minimal possible value of [tex]0.15 * 1/200 = 0.00075[/tex] (see here for an explanation of why that’s the minimal possible value.
Second, the distribution of PageRanks drops off very fast. Because there are 200 webpages, and the PageRanks must sum to one (being probabilities), the “average” PageRank must be [tex]0.005[/tex]. You can see from the histograms that the distribution has already dropped off quite a bit by the time you get to this PageRank: most pages have a PageRank quite a bit below the average. It’s a few very high PageRank pages that restore the average.
Third, the page with the highest PageRank had a PageRank approximately 20 times higher than average PageRank.
I haven’t done the analysis, but it looks pretty likely that the distribution of PageRanks for this model is itself approximated by a power law distribution. Curious.
This post is part of an ongoing series about the Google Technology Stack, covering technologies such as PageRank, MapReduce, the Google File System, and Bigtable. Posts appear on this blog once a week; there is an associated FriendFeed room for discussion.