Suppose I add a hyperlink from a webpage $w$ to a webpage $w'$. In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web. For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, https://michaelnielsen.org, then not only will that increase the PageRank of my homepage, it will also increase (though in a smaller way) the PageRank of pages my homepage links to, and then the pages they link to, and so on, a cascading network of ever-smaller changes to PageRank.

In this post I investigate how PageRank changes when the structure of the web changes. We’ll look at the impact of adding and removing multiple links from the web, and also at what happens when entirely new webpages are created. In each case we’ll derive quantitative bounds on the way in which PageRank can change.

I’ve no doubt that most or all of the results I describe here are already known – PageRank has been studied in great detail in the academic literature, and presumably the bounds I describe (or better) have been obtained by others. This post is for my own pleasure (and to improve my own understanding) in attacking these questions from first principles; it is, essentially, a cleaned-up version of my notes on the problem, with the hope that the notes may also be of interest to others. For an entrée into the academic literature on these questions, see here.

The post requires a little undergraduate-level linear algebra to follow, as well as some basic familiarity with how PageRank is defined and calculated. I’ve written an introduction to PageRank here (see also here, here, here and here). However, I’ve included a short summary below that should be enough to follow this post.

One final note before we get into the analysis: I understand that there are some people reading my blog who are interested in search engine optimization (SEO). If you’re from that community, then I should warn you in advance that this post doesn’t address questions like how to link in such a way as to boost the PageRank of a particular page. Rather, the results I obtain are general bounds on *global* changes in PageRank as a result of local changes to the structure of the web. Understanding such global changes is of great interest if you’re trying to understand how a search engine using PageRank might work (which is my concern), but probably of less interest if your goal is to boost the PageRank of individual pages (SEO’s concern).

### PageRank summary

PageRank is defined by imagining a crazy websurfer, who surfs the web by randomly following links on any given page. They interrupt this random websurfing occasionally to “teleport” to a page chosen completely at random from the web. The probability of teleporting in any given step is assumed to be around $s = 0.15$, in line with the original PageRank paper. The PageRank of a given webpage is defined to be the long-run probability of the surfer being on the page. A high PageRank means the page is important; a low PageRank means the page is not important.

One minor wrinkle in this definition of PageRank is what our websurfer should do when they arrive at a *dangling* webpage, i.e., a webpage with no outgoing links. Obviously, they can’t just select a link to follow at random – there’s nothing to select! So, instead, in this situation what they do is *always* teleport to a randomly chosen webpage. Another, equivalent way of thinking about this is to imagine that we add outgoing links from the dangling page to *every* single other page. In this imagined world, our crazy websurfer simply chooses a webpage at random, regardless of whether they are teleporting or not.

With the above definition of the PageRank of a page in mind, the *PageRank vector* $q$ is defined to be the vector whose components are the PageRanks of all the different webpages. If we number the pages $1, 2, \ldots, N$, then the component $q_1$ is the PageRank of page number 1, $q_2$ is the PageRank of page number 2, and so on. The PageRank vector satisfies the *PageRank equation*, $q = Mq$, where $M$ is an $N \times N$ matrix which we’ll call the *PageRank matrix*. The element $M_{jk}$ of $M$ is just the probability that a crazy websurfer on page $k$ will go to page $j$. This probability is $s/N$ if there is no link between pages $k$ and $j$, i.e., it’s just the teleportation probability. The probability is $s/N + (1-s)/l_k$ if there is a link between page $k$ and page $j$, where $l_k$ is the number of outbound links from page $k$. Note that in writing these formulae I’ve assumed that $k$ is not a dangling page. If it is, then we will use the convention that $l_k = N$, i.e., the page is linked to every other page, and so $M_{jk} = s/N + (1-s)/N = 1/N$.
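To make these definitions concrete, here is a minimal numerical sketch. The helper names (`build_M`, `pagerank`) and the toy link structure are my own inventions for illustration, not from the post; pages are numbered from 0 purely for programming convenience, and numpy is assumed to be available.

```python
import numpy as np

def build_M(links, N, s=0.15):
    """Column k of M is the distribution over the surfer's next page,
    given that they are currently on page k."""
    M = np.full((N, N), s / N)              # teleportation part: s/N everywhere
    for k in range(N):
        out = links.get(k, [])
        if out:                              # ordinary page: spread (1-s) over links
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                # dangling page: column is uniform, 1/N
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-12):
    """Power iteration: apply M until the l^1 change per step is below tol."""
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

links = {0: [1, 2], 1: [2], 2: [0]}          # a toy 4-page web; page 3 is dangling
M = build_M(links, 4)
q = pagerank(M)
```

Every column of `M` sums to 1 (it is a probability distribution over the surfer's next page), and the computed `q` satisfies the PageRank equation $q = Mq$ to numerical precision.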

In this post I’ll consider several scenarios where we imagine altering the structure of the web in some way – by adding a link, deleting a link, adding a page, and so on. I’ll use $q$ to denote the PageRank vector *before* the alteration, and $q'$ to denote the PageRank vector *after* the alteration. What we’re interested in is understanding the change from $q$ to $q'$. The main quantity we’ll study is $\|q' - q\|_1$, the total change in PageRank across all webpages. This quantity is derived from the $l^1$ norm for vectors, $\|v\|_1 \equiv \sum_j |v_j|$. We’ll drop the subscript in the norm notation $\|\cdot\|_1$, since we’ll only be using the $l^1$ norm, never the Euclidean norm, which is more conventionally denoted $\|\cdot\|$.

A key fact about the $l^1$ norm and the PageRank matrix is that when we apply the PageRank matrix to any two probability vectors, $p$ and $r$, they are guaranteed to get closer by a factor of $1-s$:

$$\|Mp - Mr\| \leq (1-s)\|p - r\|.$$

Intuitively, applying the PageRank matrix is like our crazy websurfer doing a single step. And so if we think of $p$ and $r$ as possible starting distributions for the crazy websurfer, then this inequality shows that these distributions gradually get closer and closer, before finally converging to the PageRank vector. We’ll call this inequality the *contractivity property* for PageRank. I discuss the contractivity property and its consequences in much more detail in this post.
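The contractivity property is easy to check numerically. The sketch below is my own illustration: it builds a matrix of the PageRank form $M = s \cdot (\text{uniform}) + (1-s)A$, where $A$ is an arbitrary column-stochastic matrix (which is all the contraction argument actually uses; the dangling-page convention just folds into $A$), and verifies the inequality on random pairs of probability distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, s = 50, 0.15

# A PageRank matrix has the form M = (s/N)*ones + (1-s)*A, where A is
# column-stochastic (the dangling-page convention is folded into A).
A = rng.random((N, N))
A /= A.sum(axis=0)                            # normalise each column to sum to 1
M = s / N * np.ones((N, N)) + (1 - s) * A

# Check ||Mp - Mr|| <= (1-s)||p - r|| on random probability vectors.
for _ in range(100):
    p = rng.random(N); p /= p.sum()
    r = rng.random(N); r /= r.sum()
    assert np.abs(M @ p - M @ r).sum() <= (1 - s) * np.abs(p - r).sum() + 1e-12
```

The teleportation part of $Mp - Mr$ cancels exactly, because $p$ and $r$ both sum to 1; what remains is $(1-s)A(p-r)$, and a column-stochastic $A$ can never increase the $l^1$ norm.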

### What happens when we add a single link to the web?

Suppose we add a single new link to the web, from a page $w$ to a page $w'$. How does the PageRank vector change? In this section we’ll obtain a bound on the total change, $\|q' - q\|$. We’ll obtain this bound under the assumption that $w$ isn’t a dangling page, i.e., a page with no outgoing links. As we’ll see, dangling pages create some complications, which we’ll deal with in a later section.

To obtain the bound on $\|q' - q\|$, let $M$ be the PageRank matrix before the new link is inserted, so the PageRank equation is $q = Mq$, and let $M'$ be the PageRank matrix after the new link is inserted, when the PageRank equation becomes $q' = M'q'$. Note that we have $q' - q = M'q' - Mq$. If we introduce a new matrix $\Delta M \equiv M' - M$, then this equation may be rewritten:

$$q' - q = M'q' - M'q + \Delta M q.$$

Taking norms on both sides, and using the triangle inequality:

$$\|q' - q\| \leq \|M'q' - M'q\| + \|\Delta M q\|.$$

Using the contractivity property, $\|M'q' - M'q\| \leq (1-s)\|q' - q\|$, and so:

$$\|q' - q\| \leq (1-s)\|q' - q\| + \|\Delta M q\|.$$

Rearranging, we obtain:

$$\|q' - q\| \leq \frac{\|\Delta M q\|}{s}. \qquad [*]$$

Up to this point we haven’t used in any way the fact that we’re adding merely a single link to the web. The inequality [*] holds no matter how we change the structure of the web, and so in later sections we’ll use [*] (or an analogous inequality) to analyse more complex situations.

To analyse what happens in the specific case when we add a single new link, we will compute $\|\Delta M q\|$. Recall that $\Delta M = M' - M$ is the change between the two PageRank matrices. Suppose that $w$ starts with $l$ outgoing hyperlinks (where $l \geq 1$, reflecting the fact that $w$ is not a dangling page). By numbering pages appropriately, we can assume that $w$ is page number 1, that the new link is from page 1 to page 2, and that before the link was inserted, page 1 was linked to pages 3 through $l+2$. Under this numbering of pages, the only column of $M$ which changes in $M'$ is the first column, corresponding to the new link from page 1 to page 2. Before the link was added, this column had entries $s/N + (1-s)/l$ for the rows corresponding to pages 3 through $l+2$, i.e., for the outgoing links, and $s/N$ everywhere else. After the link is added, this column has entries $s/N + (1-s)/(l+1)$ for the rows corresponding to pages 2 through $l+2$, and $s/N$ everywhere else. Combining these facts and doing a little algebra we find that the entries in the first column of $\Delta M$ are:

$$\Delta M_{j1} = \begin{cases} \frac{1-s}{l+1} & j = 2 \\ -\frac{1-s}{l(l+1)} & j = 3, \ldots, l+2 \\ 0 & \text{otherwise.} \end{cases}$$

The other columns of $\Delta M$ are all zero, because $M$ and $M'$ don’t change in those other columns. Substituting for $\|\Delta M q\|$ in [*] and simplifying, we see that [*] becomes:

$$\|q' - q\| \leq \frac{2(1-s)}{s(l+1)} q_1. \qquad [**]$$

This is a strong result. The inequality [**] tells us that the *total* change in the PageRank vector is at most proportional to the PageRank $q_1$ of the page to which the outbound link is being added. Ordinarily, $q_1$ will be tiny – it’s a probability in an $N$-element probability distribution, and typically such probabilities are of size $1/N$. As a result, we learn that the total variation in PageRank will be minuscule in the typical case. [**] also tells us that the total variation in PageRank scales inversely with $l+1$, the number of outbound links. So adding an extra outbound link to a page which already has a large number of links will have little effect on the overall PageRank.

Incidentally, going carefully over the derivation of [**] we can see why we needed to assume that the page $w$ was not a dangling page. We get a hint that something must be wrong from the derivation itself: the quantity $(1-s)/l$ appearing there diverges for a dangling page, since $l = 0$. More to the point, during the derivation of [**] we assumed that $M$’s first column had entries $s/N + (1-s)/l$ for the rows corresponding to pages 3 through $l+2$, and $s/N$ everywhere else. In fact, for a dangling page, $M$’s first column has entries $1/N$ everywhere. We could fix this problem right now by redoing the analysis for the $l = 0$ case, but I’ll skip it, because we’ll get the result for free as part of a later analysis we need to do anyway.

One final note: in the derivation of [**] I assumed that the existing links from page 1 are to pages 3 through $l+2$, and that the new link is to page 2. This presumes that all the links are to pages *other* than the original page, i.e., that page 1 is not self-linking. Of course, in practice many pages are self-linking, and there’s no reason that couldn’t be the case here. However, it’s easy to redo the analysis for the case when the page is self-linking, and if we do so it turns out that we arrive at the same result, [**].
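The bound [**] is straightforward to test numerically. In the sketch below (entirely my own illustration: the helper functions, the toy web, and the choice of source and target page are invented for the test), we add a single link from a page with $l = 2$ existing outbound links and compare the actual $l^1$ change against the bound.

```python
import numpy as np

def build_M(links, N, s=0.15):
    M = np.full((N, N), s / N)               # teleportation part
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                 # dangling page convention
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-12):
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

N, s = 8, 0.15
links = {0: [1, 2], 1: [3], 2: [0, 3], 3: [4], 4: [0], 5: [0, 1], 6: [2]}
q = pagerank(build_M(links, N, s))

# Add one new link from page 0 (which has l = 2 outbound links) to page 5.
l = len(links[0])
links_after = {**links, 0: links[0] + [5]}
q_after = pagerank(build_M(links_after, N, s))

change = np.abs(q_after - q).sum()
bound = 2 * (1 - s) * q[0] / (s * (l + 1))    # the bound [**]
assert change <= bound + 1e-9
```

In practice the actual change is usually well below the bound, since the cascading corrections partly cancel.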

### Problems

- Verify the assertion in the last paragraph.
- The inequality [**] bounds the change $\|q' - q\|$ in terms of the initial PageRank of the linking page, $q_1$. A very similar derivation can be used to bound it in terms of the final PageRank, $q_1'$. Prove that $\|q' - q\| \leq \frac{2(1-s)}{s(l+1)} q_1'$.

### Problems for the author

- Is it possible to saturate the bound [**]? My guess is yes (or close to, maybe within a constant factor), but I haven’t explicitly proved this. What is the best possible bound of this form?

We’ve obtained a bound on the variation in the $l^1$ norm produced by adding a link to the web. We might also wonder whether we can say anything about the change in the PageRank of individual pages, i.e., whether we can bound $|q_j' - q_j|$? This is the type of question that is likely to be of interest to people who run webpages, or to people interested in search engine optimization. Note that it’s really three separate questions: we’d like bounds on $|q_j' - q_j|$ when (1) $j$ is the source page for the new link; (2) $j$ is the target page for the new link; and (3) $j$ is neither the source nor the target. Unfortunately, I don’t have any such bounds to report, beyond the obvious observation that in all three cases, $|q_j' - q_j| \leq \|q' - q\|$, and so at least the bound on the right-hand side of [**] always applies. But it’d obviously be nice to have a stronger result!

### Problems for the author

- Can I find bounds on $|q_j' - q_j|$ for the three scenarios described above? (Or perhaps for some different way of specifying the properties of $j$, e.g., when $j$ is another page linked to by the source page.)

### What happens when we add and remove multiple links from the same page?

Suppose now that instead of adding a single link from a page, we *remove* $b$ existing links outbound from that page, and add $a$ new links outbound from the page, for a total of $l + a - b$ outbound links, where $l$ is the number of outgoing links the page starts with. How does the PageRank change? Exactly as in our earlier analysis, we can show:

$$\|q' - q\| \leq \frac{\|\Delta M q\|}{s},$$

where $\Delta M$ is the change in the PageRank matrix caused by the modified link structure. Assuming as before that the page is not a dangling page, we number the pages so that: (1) the links are outbound from page 1; (2) the links to pages $2, \ldots, l-b+1$ are preserved, before and after; (3) the links to pages $l-b+2, \ldots, l+1$ are removed; and (4) the links to pages $l+2, \ldots, l+a+1$ are new links. Then all the columns of $\Delta M$ are zero, except the first column, which has entries:

$$\Delta M_{j1} = \begin{cases} (1-s)\left(\frac{1}{l+a-b} - \frac{1}{l}\right) & j = 2, \ldots, l-b+1 \\ -\frac{1-s}{l} & j = l-b+2, \ldots, l+1 \\ \frac{1-s}{l+a-b} & j = l+2, \ldots, l+a+1 \\ 0 & \text{otherwise.} \end{cases}$$

As a result, we have

$$\|q' - q\| \leq \frac{(1-s)q_1}{s}\left[(l-b)\left|\frac{1}{l+a-b} - \frac{1}{l}\right| + \frac{b}{l} + \frac{a}{l+a-b}\right].$$

To simplify the quantity on the right we analyse two cases separately: what happens when $a \geq b$, and what happens when $a < b$. When $a \geq b$ the quantity in brackets simplifies to $2a/(l+a-b)$, and when $a < b$ it simplifies to $2b/l$. Combining the two cases, we obtain:

$$\|q' - q\| \leq \frac{2(1-s)}{s}\max\left(\frac{a}{l+a-b}, \frac{b}{l}\right) q_1. \qquad [***]$$

### What happens when we add a link to a dangling page?

Recall our convention that a dangling page is treated as though it has outgoing links to every one of the $N$ pages on the web. Because of this, the addition of a single link to a dangling page is equivalent to the deletion of $N-1$ links from a page that started with $N$ outgoing links. From the inequality [***], with $a = 0$, $b = N-1$ and $l = N$, we see that:

$$\|q' - q\| \leq \frac{2(1-s)(N-1)}{sN} q_1 \leq \frac{2(1-s)}{s} q_1.$$

To generalize further, suppose instead that we’d added $k$ links to a dangling page. Then the same method of analysis shows:

$$\|q' - q\| \leq \frac{2(1-s)(N-k)}{sN} q_1 \leq \frac{2(1-s)}{s} q_1.$$

In general, we can always deal with the addition of $k$ links to a dangling page as being equivalent to the deletion of $N-k$ links from a page with $N$ outgoing links. Because of this, in the remainder of this post we will use the convention of always treating dangling pages as though they have $N$ outgoing links.
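Here is a numerical check of the dangling-page bound, again on an invented toy web (the helper functions and numbers are my own illustration, not from the post). Page 7 starts dangling; we add a single real link from it and compare the change against $2(1-s)(N-1)q_7/(sN)$.

```python
import numpy as np

def build_M(links, N, s=0.15):
    M = np.full((N, N), s / N)
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                 # dangling page: uniform column
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-12):
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

N, s = 8, 0.15
links = {0: [1, 2], 1: [3], 2: [0], 3: [4], 4: [0], 5: [0], 6: [2]}  # page 7 dangling
q = pagerank(build_M(links, N, s))

# Adding a single real link from dangling page 7 is equivalent to deleting
# N - 1 of its N imagined outbound links.
links_after = {**links, 7: [0]}
q_after = pagerank(build_M(links_after, N, s))

change = np.abs(q_after - q).sum()
bound = 2 * (1 - s) * (N - 1) * q[7] / (s * N)
assert change <= bound + 1e-9
```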

### What happens when we add and remove links from different pages?

In the last section we assumed that the links were all being added or removed from the same page. In this section we generalize these results to the case where the links aren’t all from the same page. To do this, we’ll start by considering the case where just two pages are involved, which we label 1 and 2. We suppose $a_j$ outbound links are added to page $j$ ($j = 1$ or $2$), and $b_j$ outbound links are removed. Observe that $\Delta M = \Delta M_1 + \Delta M_2$, where $\Delta M_1$ and $\Delta M_2$ are the changes in the PageRank matrix due to the modifications to page 1 and page 2, respectively. Then a similar argument to before shows that:

$$\|q' - q\| \leq \frac{\|(\Delta M_1 + \Delta M_2)q\|}{s}.$$

Applying the triangle inequality we get:

$$\|q' - q\| \leq \frac{\|\Delta M_1 q\| + \|\Delta M_2 q\|}{s}.$$

And so we can reuse our earlier analysis to conclude that:

$$\|q' - q\| \leq \frac{2(1-s)}{s}\left[\max\left(\frac{a_1}{l_1'}, \frac{b_1}{l_1}\right) q_1 + \max\left(\frac{a_2}{l_2'}, \frac{b_2}{l_2}\right) q_2\right],$$

where $l_j$ and $l_j' = l_j + a_j - b_j$ are just the number of links outbound from page $j$, before and after the modifications to the link structure, respectively. This expression is easily generalized to the case where we are modifying more than two pages, giving:

$$\|q' - q\| \leq \frac{2(1-s)}{s}\sum_j \max\left(\frac{a_j}{l_j'}, \frac{b_j}{l_j}\right) q_j.$$

This inequality is a quite general bound which can be applied even when very extensive modifications are made to the structure of the web. Note that, as discussed in the last section, if any of the pages $j$ are dangling pages then we treat the addition of $a_j$ links to that page as really being the deletion of $N - a_j$ links from a page with $N$ outgoing links, and so the corresponding term in the sum is $(N - a_j)q_j/N$. Indeed, we can rewrite the last inequality, splitting up the right-hand side into a sum over pages $j$ which are not dangling, and a sum over dangling pages $d$, to obtain:

$$\|q' - q\| \leq \frac{2(1-s)}{s}\left[\sum_j{}' \max\left(\frac{a_j}{l_j'}, \frac{b_j}{l_j}\right) q_j + \sum_d{}'' \frac{N - a_d}{N} q_d\right].$$

### Modified teleportation step

The original PageRank paper describes a way of modifying PageRank to use a teleportation step which doesn’t take us to a page chosen uniformly at random. Instead, some non-uniform probability distribution is used for teleportation. I don’t know for sure whether Google uses this idea, but I wouldn’t be surprised if they use it as a way of decreasing the PageRank of suspected spam and link-farm pages, by making them less likely to be the target of teleportation. It’s possible to redo all the analyses to date, and to prove that the bounds we’ve obtained hold even with this modified teleportation step.
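For concreteness, here is what the modified matrix looks like, together with a numerical check that contractivity (on which all the bounds above rest) survives. The construction with a teleportation vector `v` is sketched here as my own illustration of the standard personalization-vector setup.

```python
import numpy as np

rng = np.random.default_rng(1)
N, s = 50, 0.15

# Teleportation now lands on page j with probability v[j] instead of 1/N.
v = rng.random(N); v /= v.sum()
A = rng.random((N, N)); A /= A.sum(axis=0)    # column-stochastic link part
M = s * np.outer(v, np.ones(N)) + (1 - s) * A

# Contractivity survives: the teleportation part of M(p - r) still cancels,
# because p and r both sum to 1, leaving (1-s)*A*(p - r).
for _ in range(100):
    p = rng.random(N); p /= p.sum()
    r = rng.random(N); r /= r.sum()
    assert np.abs(M @ p - M @ r).sum() <= (1 - s) * np.abs(p - r).sum() + 1e-12
```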

### Problems

- Prove the assertion in the last paragraph.

### What happens when we add a new page to the web?

So far, our analysis has involved only links added between *existing* pages. How about when a new page is added to the web? How does the PageRank change then? To analyse this question we’ll begin by focusing on understanding how PageRank changes when a single page is created, with no incoming or outgoing links. Obviously, this is rather unrealistic, but by understanding this simple case first we’ll be in better position to understand more realistic cases.

It’s perhaps worth commenting on why we’d expect the PageRank to change at all when we create a new page with no incoming or outgoing links. After all, it seems as though the random walk taken by our crazy websurfer won’t be changed by the addition of this new and completely isolated page. In fact, things do change. The reason is because the teleportation step is modified. Instead of going to a page chosen uniformly at random from $N$ pages, each with probability $1/N$, teleportation now takes us to a page chosen uniformly at random from the full set of $N+1$ pages, each with probability $1/(N+1)$. It is this change which causes a change in the PageRank.

A slight quirk in analysing the change in the PageRank vector is that while the initial PageRank vector $q$ is an $N$-dimensional vector, the new PageRank vector $q'$ is an $(N+1)$-dimensional vector. Because of the different dimensionalities, the quantity $q' - q$ is not even defined! We resolve this problem by extending $q$ into the $(N+1)$-dimensional space, adding a PageRank of $0$ for the new page, page number $N+1$. We’ll denote this extended version of $q$ by $q_e$. We also need to extend $M$ so that it becomes an $(N+1)$ by $(N+1)$ matrix $M_e$ satisfying the PageRank equation for the extended $q_e$. One way of doing this is to extend $M$ by adding a row of zeros at the bottom, and $1/(N+1)$ everywhere in the final column, except the final entry:

$$M_e = \begin{bmatrix} M & \frac{1}{N+1}\mathbf{1} \\ \mathbf{0}^T & 0 \end{bmatrix}.$$

With the extended $q_e$ and $M_e$ it is easy to verify that the PageRank equation $q_e = M_e q_e$ is satisfied. As in our earlier discussions, we have the inequality

$$\|q' - q_e\| \leq \frac{\|\Delta M q_e\|}{s},$$

where now $\Delta M \equiv M' - M_e$, with $M'$ the $(N+1)$-dimensional PageRank matrix of the enlarged web. Computing $\|\Delta M q_e\|$, we have:

$$\|\Delta M q_e\| \leq \frac{2}{N+1},$$

since each column of $\Delta M$ corresponding to one of the original pages has entries whose absolute values sum to at most $2/(N+1)$, while the final column of $\Delta M$ makes no contribution, because the final entry of $q_e$ is $0$.

Substituting and simplifying, we obtain the bound:

$$\|q' - q_e\| \leq \frac{2}{s(N+1)}. \qquad [****]$$

Not surprisingly, this bound shows that creating a new and completely isolated page makes only a very small difference to PageRank. Still, it’s good to have this intuition confirmed and precisely quantified.
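As a sanity check, the sketch below (an invented toy example; the helper functions are my own) adds one isolated page to a small web and verifies that the actual change is within the bound $2/(s(N+1))$.

```python
import numpy as np

def build_M(links, N, s=0.15):
    M = np.full((N, N), s / N)
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                 # dangling page convention
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-12):
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

N, s = 8, 0.15
links = {0: [1, 2], 1: [3], 2: [0], 3: [4], 4: [0], 5: [0], 6: [2]}
q = pagerank(build_M(links, N, s))

# Same links, one extra page: page 8 is new and completely isolated.
q_after = pagerank(build_M(links, N + 1, s))
q_ext = np.append(q, 0.0)                    # extend q with a 0 for the new page

change = np.abs(q_after - q_ext).sum()
assert change <= 2 / (s * (N + 1)) + 1e-9    # the bound [****]
```

In this example the actual change is far smaller than the bound, matching the intuition that an isolated page barely perturbs anything.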

### Exercises

- Suppose we turn the question asked in this section around. Suppose we have a web of $N+1$ pages, and one of those pages is an isolated page which is deleted. Can you prove an analogous bound to [****] in this situation?

### Problems for the author

- How does the bound [****] change if we use a non-uniform probability distribution to teleport? The next problem may assist in addressing this question.
- An alternative approach to analysing the problem in this section is to use the formula $q = s(I - (1-s)A)^{-1}t$, where $t$ is the vector whose entries are the uniform probability distribution $1/N$ (or, more generally, the vector of teleportation probabilities), and $A$ is the matrix whose $jk$ entry is $1/l_k$ if page $k$ links to page $j$, and $0$ otherwise. This formula is well-suited to analysing the problem considered in the current section; without going into details, the essential reason is that the modified matrix $A$ has a natural block-triangular structure, and such block-triangular matrices are easy to invert. What is the outcome of doing such an analysis?

### Adding multiple pages to the web

Suppose now that instead of adding a single page to the web we add multiple pages to the web. How does the PageRank change? Again, we’ll analyse the case where no new links are created, on the understanding that by analysing this simple case first we’ll be in better position to understand more realistic cases.

We’ll consider first the case of just two new pages. Suppose $q$ is the original $N$-dimensional PageRank vector. Then we’ll use $q_e$ as before to denote the extension of $q$ by an extra $0$ entry, and $q_{ee}$ to denote the extension of $q$ by two $0$ entries. We’ll use $q'$ to denote the $(N+1)$-dimensional PageRank vector after adding a single webpage, and use $q'_e$ to denote the extension of $q'$ by appending an extra $0$ entry. Finally, we’ll use $q''$ to denote the $(N+2)$-dimensional PageRank vector after adding two new webpages.

Our interest is in analysing $\|q'' - q_{ee}\|$. To do this, note that

$$q'' - q_{ee} = (q'' - q'_e) + (q'_e - q_{ee}).$$

Applying the triangle inequality gives

$$\|q'' - q_{ee}\| \leq \|q'' - q'_e\| + \|q'_e - q_{ee}\|.$$

Observe that $\|q'_e - q_{ee}\| = \|q' - q_e\|$, since the extra $0$ appended to the end of both vectors makes no difference to the norm. And so the previous inequality may be rewritten

$$\|q'' - q_{ee}\| \leq \|q'' - q'_e\| + \|q' - q_e\|.$$

Now we can apply the results of the last section twice on the right-hand side to obtain

$$\|q'' - q_{ee}\| \leq \frac{2}{s(N+2)} + \frac{2}{s(N+1)} \leq \frac{4}{s(N+1)}.$$

More generally, suppose we add $k$ new pages to the web. Denote the final $(N+k)$-dimensional PageRank vector by $q^{(k)}$. Let $q$ denote the initial $N$-dimensional PageRank vector, and let $q_e$ denote the $(N+k)$-dimensional extension obtained by appending $k$ $0$ entries to $q$. Then applying an argument similar to that above, we obtain:

$$\|q^{(k)} - q_e\| \leq \sum_{j=1}^{k} \frac{2}{s(N+j)} \leq \frac{2k}{s(N+1)}.$$

### Adding multiple pages, adding and removing multiple links

Suppose now that we add many new pages to the web, containing links to existing webpages. How does the PageRank change? Not surprisingly, answering this kind of question can be done in a straightforward way using the techniques described already in this post. To see this, suppose we add a single new webpage, which contains $l$ new links back to existing webpages. As before, we use $q'$ to denote the new PageRank vector, $q$ to denote the old PageRank vector, and $q_e$ to denote the extension of $q$ obtained by appending a $0$ entry. We have

$$\|q' - q_e\| \leq \frac{\|\Delta M q_e\|}{s},$$

where $\Delta M \equiv M' - M_e$, and $M_e$ is the matrix obtained by adding a row of $0$s to the bottom of $M$, and a column (except the bottom right entry) of $1/(N+1)$ values to the right of $M$. We can write $\Delta M = \Delta M_1 + \Delta M_2$, where $\Delta M_1$ is the change in $M$ due to the added page, and $\Delta M_2$ is the change in $M$ due to the added links. Applying a similar analysis to before we obtain

$$\|q' - q_e\| \leq \frac{\|\Delta M_1 q_e\| + \|\Delta M_2 q_e\|}{s}.$$

We can bound the first term on the right-hand side by $2/(s(N+1))$, by our earlier argument. The second term vanishes since $\Delta M_2$ is zero everywhere except in the last column, and $q_e$’s final entry is zero. As a result, we have

$$\|q' - q_e\| \leq \frac{2}{s(N+1)},$$

i.e., the bound is exactly as if we had added a new page with no links.
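The conclusion that outbound links on the new page don’t worsen the bound can also be tested numerically; the toy example below is my own illustration.

```python
import numpy as np

def build_M(links, N, s=0.15):
    M = np.full((N, N), s / N)
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                 # dangling page convention
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-12):
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

N, s = 8, 0.15
links = {0: [1, 2], 1: [3], 2: [0], 3: [4], 4: [0], 5: [0], 6: [2], 7: [1]}
q = pagerank(build_M(links, N, s))

# New page 8 arrives with outbound links to existing pages 0 and 3.
links_after = {**links, 8: [0, 3]}
q_after = pagerank(build_M(links_after, N + 1, s))
q_ext = np.append(q, 0.0)

# Same bound as for an isolated page: the new page's column of Delta-M
# only ever multiplies the 0 we appended to q.
change = np.abs(q_after - q_ext).sum()
assert change <= 2 / (s * (N + 1)) + 1e-9
```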

### Exercises

- There is some ambiguity in the specification of $\Delta M_1$ and $\Delta M_2$ in the above analysis. Resolve this ambiguity by writing out explicit forms for $\Delta M_1$ and $\Delta M_2$, and then verifying that the remainder of the proof of the bound goes through.

Suppose, instead, that the new link had been added from an existing page. Then the term $\|\Delta M_2 q_e\|$ above would no longer vanish, and the bound would become

$$\|q' - q_e\| \leq \frac{2}{s(N+1)} + \frac{2(1-s)}{s(l+1)} q_j,$$

where $q_j$ was the initial PageRank for the page to which the link was added, and $l$ was the initial number of links from that page.

Generalizing this analysis, it’s possible to write a single inequality that unites all our results to date. In particular, suppose we add $k$ new pages to the web. Denote the final $(N+k)$-dimensional PageRank vector by $q'$. Let $q$ denote the initial $N$-dimensional PageRank vector, and let $q_e$ denote the $(N+k)$-dimensional extension obtained by appending $k$ $0$ entries to $q$. Suppose we add $a_j$ outbound links to page $j$, and remove $b_j$ outbound links from page $j$. Then an analysis similar to that above shows that:

$$\|q' - q_e\| \leq \frac{2k}{s(N+1)} + \frac{2(1-s)}{s}\left[\sum_j{}' \max\left(\frac{a_j}{l_j'}, \frac{b_j}{l_j}\right) q_j + \sum_d{}'' \frac{N - a_d}{N} q_d\right],$$

where, as before, the sum $\sum'$ is over those pages $j$ which are non-dangling, and the sum $\sum''$ is over dangling pages $d$. Note that we can omit all the new pages from this latter sum, for the same reason the quantity $\|\Delta M_2 q_e\|$ vanished earlier in this section. This inequality generalizes all our earlier results.

### Problems

- Fill in the details of the above proof. To do so it helps to consider the change in PageRank in two steps: (1) the change due to the addition of new links to existing pages; and (2) the change due to the addition of new webpages, and links on those webpages.

### Recomputing PageRank on the fly

Part of my reason for being interested in the questions discussed in this post is that I’m interested in how to quickly update PageRank during a web crawl. The most obvious way to compute the PageRank of a set of crawled webpages is as a batch job, doing the computation maybe once a week or once a month. But if you’re operating a web crawler, and constantly adding new pages to an index, you don’t want to have to wait a week or a month before computing the PageRank (or similar measure) of a newly crawled page. You’d like to compute – or at least estimate – the PageRank immediately upon adding the page to the index, without having to do an enormous matrix computation. Is there a way this can be done?

Unfortunately, I don’t know how to quickly update the PageRank. However, the bounds in this post do help at least a little in figuring out how to update the PageRank as a batch job. Suppose we had an extant multi-billion page crawl, and knew the corresponding values for PageRank. Suppose then that we did a mini-crawl to update the index, perhaps with a few million new pages and links. Suppose $q$ was the (known value of) the PageRank vector before the mini-crawl, and $q'$ is the PageRank after the mini-crawl, which we are now trying to compute. Suppose also that $M'$ is the new PageRank matrix. One natural way of updating our estimate for PageRank would be to apply $M'$ repeatedly to $q$ (more precisely, to its extension, $q_e$), in the hopes that it will quickly converge to $q'$. The bounds in this post can help establish how quickly this convergence will happen. To see this, observe that

$$\|M'^n q_e - q'\| = \|M'^n q_e - M'^n q'\| \leq (1-s)^n \|q_e - q'\|,$$

where we used the PageRank equation $q' = M'q'$ to get the first equality, and the contractivity of PageRank to get the second inequality. This inequality tells us that we can compute an estimate for $q'$ by applying $M'$ repeatedly to $q_e$, and it will converge exponentially quickly. This is all standard stuff about PageRank. However, the good news is that the results in this post tell us something about how to bound $\|q_e - q'\|$, and frequently this quantity will be very small to start with, and so the convergence will occur much more quickly than we would *a priori* expect.
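The warm-start idea is easy to simulate on a toy web (all names and numbers below are my own illustration): start power iteration from the extended old PageRank vector $q_e$ and watch the error contract by at least a factor of $1-s$ per step.

```python
import numpy as np

def build_M(links, N, s=0.15):
    M = np.full((N, N), s / N)
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - s) / len(out)
        else:                                 # dangling page convention
            M[:, k] = 1.0 / N
    return M

def pagerank(M, tol=1e-13):
    q = np.full(M.shape[0], 1.0 / M.shape[0])
    while np.abs(M @ q - q).sum() >= tol:
        q = M @ q
    return q

N, s = 8, 0.15
links = {0: [1, 2], 1: [3], 2: [0, 3], 3: [4], 4: [0], 5: [0, 1], 6: [2], 7: [5]}
q = pagerank(build_M(links, N, s))

# "Mini-crawl": one new page with two links appears.
links_after = {**links, 8: [0, 2]}
M_new = build_M(links_after, N + 1, s)
q_true = pagerank(M_new)

q_est = np.append(q, 0.0)                    # warm start from the old PageRank
err0 = np.abs(q_est - q_true).sum()
for n in range(1, 11):
    q_est = M_new @ q_est
    # the error contracts at least as fast as (1 - s)^n
    assert np.abs(q_est - q_true).sum() <= (1 - s) ** n * err0 + 1e-10
```

Since $\|q_e - q'\|$ is already small after a minor change, only a handful of iterations are needed, versus the hundred-plus required when starting from the uniform distribution.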

### Problems for the author

- Is there a better way of updating PageRank on the fly, rather than applying powers of ?

*Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.*