How changing the structure of the web changes PageRank
Suppose I add a hyperlink from a webpage to a webpage . In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web. For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, https://michaelnielsen.org, then not only will that increase the PageRank of my homepage, it will also increase (though in a smaller way) the PageRank of pages my homepage links to, and then the pages they link to, and so on, a cascading network of ever-smaller changes to PageRank.
In this post I investigate how PageRank changes when the structure of the web changes. We’ll look at the impact of adding and removing multiple links from the web, and also at what happens when entirely new webpages are created. In each case we’ll derive quantitative bounds on the way in which PageRank can change.
I’ve no doubt that most or all of the results I describe here are already known – PageRank has been studied in great detail in the academic literature, and presumably the bounds I describe (or better) have been obtained by others. This post is for my own pleasure (and to improve my own understanding) in attacking these questions from first principles; it is, essentially, a cleaned up version of my notes on the problem, with the hope that the notes may also be of interest to others. For an entree into the academic literature on these questions, see here.
The post requires a little undergraduate-level linear algebra to follow, as well as some basic familiarity with how PageRank is defined and calculated. I’ve written an introduction to PageRank here (see also here, here, here and here). However, I’ve included a short summary below that should be enough to follow this post.
One final note before we get into the analysis: I understand that there are some people reading my blog who are interested in search engine optimization (SEO). If you’re from that community, then I should warn you in advance that this post doesn’t address questions like how to link in such a way as to boost the PageRank of a particular page. Rather, the results I obtain are general bounds on global changes in PageRank as a result of local changes to the structure of the web. Understand such global changes is of great interest if you’re trying to understand how a search engine using PageRank might work (which is my concern), but probably of less interest if your goal is to boost the PageRank of individual pages (SEO’s concern).
PageRank is defined by imagining a crazy websurfer, who surfs the web by randomly following links on any given page. They interrupt this random websurfing occasionally to “teleport” to a page chosen completely at random from the web. The probability of teleporting in any given step is assumed to be around , in line with the original PageRank paper. The PageRank of a given webpage is defined to be the long-run probability of the surfer being on the page. A high PageRank means the page is important; a low PageRank means the page is not important.
One minor wrinkle in this definition of PageRank is what our websurfer should do when they arrive at a dangling webpage, i.e., a webpage with no outgoing links. Obviously, they can’t just select a link to follow at random – there’s nothing to select! So, instead, in this situation what they do is always teleport to a randomly chosen webpage. Another, equivalent way of thinking about this is to imagine that we add outgoing links from the dangling page to every single other page. In this imagined world, our crazy websurfer simply chooses a webpage at random, regardless of whether they are teleporting or not.
With the above definition of the PageRank of a page in mind, the PageRank vector is defined to be the vector whose components are the PageRanks of all the different webpages. If we number the pages then the component is the PageRank of page number 0, is the PageRank of page number 1, and so on. The PageRank vector satisfies the PageRank equation, , where is an matrix which we’ll call the PageRank matrix. The element of is just the probability that a crazy websurfer on page will go to page . This probability is if there is no link between pages and , i.e., it’s just the teleporation probability. The probability is if there is a link between page and , where is the number of outbound links from page . Note that in writing these formulae I’ve assumed that is not a dangling page. If it is, then we will use the convention that , i.e., the page is linked to every other page, and so .
In this post I’ll consider several scenarios where we imagine altering the structure of the web in some way – by adding a link, deleting a link, adding a page, and so on. I’ll use to denote the PageRank vector before the alteration, and to denote the PageRank vector after the alteration. What we’re interested in is understanding the change from to . The main quantity we’ll study is the total change in PageRank across all webpages. This quantity is derived from the norm for vectors, . We’ll drop the subscript in the norm notation , since we’ll only be using the norm, never the Euclidean norm, which is more conventionally denoted .
A key fact about the norm and the PageRank matrix is that when we apply the PageRank matrix to any two probability vectors, and , they are guaranteed to get closer by a factor of :
Intuitively, applying the PageRank matrix is like our crazy websurfer doing a single step. And so if we think of and as possible starting distributions for the crazy websurfer, then this inequality shows that these distributions gradually get closer and closer, before finally converging to PageRank. We’ll call this inequality the contractivity property for PageRank. I discuss the contractivity property and its consequences in much more detail in this post.
What happens when we add a single link to the web?
Suppose we add a single new link to the web, from a page to a page . How does the PageRank vector change? In this section we’ll obtain a bound on the total change, . We’ll obtain this bound under the assumption that isn’t a dangling page, i.e., a page with no outgoing links. As we’ll see, dangling pages create some complications, which we’ll deal with in a later section.
To obtain the bound on , let be the PageRank matrix before the new link is inserted, so the PageRank equation is , and let be the PageRank matrix after the new link is inserted, when the PageRank equation becomes . Note that we have . If we introduce a new matrix , then this equation may be rewritten:
Taking norms on both sides, and using the triangle inequality:
Using the contractivity property, , and so:
Rearranging, we obtain:
Up to this point we haven’t used in any way the fact that we’re adding merely a single link to the web. The inequality [*] holds no matter how we change the structure of the web, and so in later sections we’ll use [*] (or an analagous inequality) to analyse more complex situations.
To analyse what happens in the specific case when we add a single new link, we will compute . Recall that is the change between the two PageRank matrices. Suppose that starts with outgoing hyperlinks (where , reflecting the fact that is not a dangling page). By numbering pages appropriately, we can assume that is page number , that the new link is from page to page , and that before the link was inserted, the page was linked to pages . Under this numbering of pages, the only column of which changes in is the first column, corresponding to the new link from page to page . Before the link was added, this column had entries for the rows corresponding to pages through , i.e., for the outgoing links, and everywhere else. After the link is added, this column has entries for the rows corresponding to pages through , and everywhere else. Combining these facts and doing a little algebra we find that the entries in the first column of are:
The other columns of are all zero, because and don’t change in those other columns. Substituting for in [*] and simplifying, we see that [*] becomes:
This is a strong result. The inequality [**] tells us that the total PageRank vector changes at most in proportion to the PageRank of the page to which the outbound link is being added. Ordinarily, will be tiny – it’s a probability in an -element probability distribution, and typically such probabilities are of size . As a result, we learn that the total variation in PageRank will be miniscule in the typical case. [**] also tells us that the total variation in PageRank scales inversely with the number of outbound links. So adding an extra outbound link to a page which already has a large number of links will have little effect on the overall PageRank.
Incidentally, going carefully over the derivation of [**] we can see why we needed to assume that the page was not a dangling page. We get a hint that something must be wrong from the final result: the right-hand side of [**] diverges for a dangling page, since . In fact, during the derivation of [**] we assumed that ‘s first column had entries for the rows corresponding to pages through , and everywhere else. In fact, ‘s first column has entries everywhere in the case of a dangling page. We could fix this problem right now by redoing the analysis for the case, but I’ll skip it, because we’ll get the result for free as part of a later analysis we need to do anyway.
One final note: in the derivation of [**] I assumed that the existing links from page are to pages , and that the new link is to page . This presumes that all the links are to pages other than the original page, i.e., it’s not self-linking. Of course, in practice many pages are self-linking, and there’s no reason that couldn’t be the case here. However, it’s easy to redo the analysis for the case when the page is self-linking, and if we do so it turns out that we arrive at the same result, [**].
- Verify the assertion in the last paragraph.
- The inequality [**] bounds the change in terms of the initial PageRank of the linking page, . A very similar derivation can be used to bound it in terms of the final PageRank, . Prove that .
Problems for the author
- Is it possible to saturate the bound [**]? My guess is yes (or close to, maybe within a constant factor), but I haven’t explicitly proved this. What is the best possible bound of this form?
We’ve obtained a bound on the variation in norm produced by adding a link to the web. We might also wonder whether we can say anything about the change in the PageRank of individual pages, i.e., whether we can bound ? This is the type of question that is likely to be of interest to people who run webpages, or to people interested in search engine optimization. Note that it’s really three separate questions: we’d like bounds on when (1) is the source page for the new link; (2) is the target page for the new link; and (3) is neither the source nor the target. Unfortunately, I don’t have any such bounds to report, beyond the obvious observation that in all three cases, , and so at least the bound on the right-hand side of [**] always applies. But it’d obviously be nice to have a stronger result!
Problems for the author
- Can I find bounds on for the three scenarios described above? (Or perhaps for some different way of specifying the properties of ?, e.g., when is another page linked to by the source page.)
What happens when we add and remove multiple links from the same page?
Suppose now that instead of adding a single link from a page, we remove existing links outbound from that page, and add new links outbound from the page, for a total of links. How does the PageRank change? Exactly as in our earlier analysis, we can show:
where is the change in the PageRank matrix caused by the modified link structure. Assuming as before that the page is not a dangling page, we number the pages so that: (1) the links are outbound from page ; (2) the links to pages are preserved, before and after; (3) the links to pages are removed; and (4) the links to pages are new links. Then all the columns of are zero, except the first column, which has entries:
As a result, we have
To simplify the quantity on the right we analyse two cases separately: what happens when , and what happens when outgoing links from the dangling page. Because of this, the addition of a single link to a dangling page is equivalent to the deletion of links from a page that started with outgoing links. From the inequality [***], we see that:
To generalize further, suppose instead that we’d added links to a dangling page. Then the same method of analysis shows:
In general, we can always deal with the addition of links to a dangling page as being equivalent to the deletion of links from a page with outgoing links. Because of this, in the remainder of this post we will use the convention of always treating dangling pages as though they have outgoing links.
What happens when we add and remove links from different pages?
In the last section we assumed that the links were all being added or removed from the same page. In this section we generalize these results to the case where the links aren’t all from the same page. To do this, we’ll start by considering the case where just two pages are involved, which we label and . We suppose outbound links are added to page ( or ), and outbound links are removed. Observe that , where and are the changes in the PageRank matrix due to the modifications to page and page , respectively. Then a similar argument to before shows that:
Applying the triangle inequality we get:
And so we can reuse our earlier analysis to conclude that:
where and are just the number of links outbound from page , before and after the modifications to the link structure, respectively. This expression is easily generalized to the case where we are modifying more than two pages, giving:
This inequality is a quite general bound which can be applied even when very extensive modifications are made to the structure of the web. Note that, as discussed in the last section, if any of the pages are dangling pages then we treat the addition of links to that page as really being deleting links from a page with outgoing links, and so the corresponding term in the sum is . Indeed, we can rewrite the last inequality, splitting up the right-hand side into a sum over pages which are not dangling, and a sum over dangling pages to obtain:
Modified teleportation step
The original PageRank paper describes a way of modifying PageRank to use a teleportation step which doesn’t take us to a page chosen uniformly at random. Instead, some non-uniform probability distribution is used for teleportation. I don’t know for sure whether Google uses this idea, but I wouldn’t be suprised if they use it as a way of decreasing the PageRank of suspected spam and link-farm pages, by making them less likely to be the target of teleportation. It’s possible to redo all the analyses to date, and to prove that the bounds we’ve obtained hold even with this modified teleportation step.
- Prove the assertion in the last paragraph.
What happens when we add a new page to the web?
So far, our analysis has involved only links added between existing pages. How about when a new page is added to the web? How does the PageRank change then? To analyse this question we’ll begin by focusing on understanding how PageRank changes when a single page is created, with no incoming or outgoing links. Obviously, this is rather unrealistic, but by understanding this simple case first we’ll be in better position to understand more realistic cases.
It’s perhaps worth commenting on why we’d expect the PageRank to change at all when we create a new page with no incoming or outgoing links. After all, it seems as though the random walk taken by our crazy websurfer won’t be changed by the addition of this new and completely isolated page. In fact, things do change. The reason is because the teleportation step is modified. Instead of going to a page chosen uniformly at random from pages, each with probability , teleportation now takes us to a page chosen uniformly at random from the full set of pages, each with probability . It is this change which causes a change in the PageRank.
A slight quirk in analysing the change in the PageRank vector is that while the initial PageRank vector is an -dimensional vector, the new PageRank vector is an -dimensional vector. Because of the different dimensionalities, the quantity is not even defined! We resolve this problem by extending into the -dimensional space, adding a PageRank of for the new page, page number . We’ll denote this extended version of by . We also need to extend so that it becomes an by matrix satisfying the PageRank equation for the extended . One way of doing this is to extend by adding a row of zeros at the bottom, and everywhere in the final column, except the final entry:
With the extended and it is easy to verify that the PageRank equation is satisfied. As in our earlier discussions, we have the inequality
where now . Computing , we have:
Substituting and simplifying, we obtain the bound:
Not surprisingly, this bound shows that creating a new and completely isolated page makes only a very small difference to PageRank. Still, it’s good to have this intuition confirmed and precisely quantified.
- Suppose we turn the question asked in this section around. Suppose we have a web of pages, and one of those pages is an isolated page which is deleted. Can you prove an analogous bound to [****] in this situation?
Problems for the author
- How does the bound [****] change if we use a non-uniform probability distribution to teleport? The next problem may assist in addressing this question.
- An alternative approach to analysing the problem in this section is to use the formula , where is the vector whose entries are the uniform probability distribution (or, more generally, the vector of teleportation probabilities), and is the matrix whose entry is if links to , and otherwise. This formula is well-suited to analysing the problem considered in the current section; without going into details, the essential reason is that the modified matrix has a natural block-triangular structure, and such block-triangular matrices are easy to invert. What is the outcome of doing such an analysis?
Adding multiple pages to the web
Suppose now that instead of adding a single page to the web we add multiple pages to the web. How does the PageRank change? Again, we’ll analyse the case where no new links are created, on the understanding that by analysing this simple case first we’ll be in better position to understand more realistic cases.
We’ll consider first the case of just two new pages. Suppose is the original -dimensional PageRank vector. Then we’ll use as before to denote the extension of by an extra entry, and to denote the extension of by two entries. We’ll use to denote the -dimensional PageRank vector after adding a single webpage, and use to denote the extension of by appending an extra entry. Finally, we’ll use to denote the -dimensional PageRank vector after adding two new webpages.
Our interest is in analysing . To do this, note that
Applying the triangle inequality gives
Observe that , since the extra appended to the end of both vectors makes no difference to the norm. And so the previous inequality may be rewritten
Now we can apply the results of last section twice on the right-hand side to obtain
More generally, suppose we add new pages to the web. Denote the final -dimensional PageRank vector by . Let denote the initial -dimensional PageRank vector, and let denote the -dimensional extension obtained by appending entries to . Then applying an argument similar to that above, we obtain:
Adding multiple pages, adding and removing multiple links
Suppose now that we add many new pages to the web, containing links to existing webpages. How does the PageRank change? Not surprisingly, answering this kind of question can be done in a straightforward way using the techniques described already in this post. To see this, suppose we add a single new webpage, which contains new links back to existing webpages. As before, we use to denote the new PageRank vector, to denote the old PageRank vector, and to denote the extension of obtained by appending a entry. We have
where is the matrix obtained by adding a row of s to the bottom of , and a column (except the bottom right entry) of values to the right of . We can write , where is the change in due to the added page, and is the change in due to the added links. Applying a similar analysis to before we obtain
We can bound the first term on the right-hand side by , by our earlier argument. The second term vanishes since is zero everywhere except in the last column, and ‘s final entry is zero. As a result, we have
i.e., the bound is exactly as if we had added a new page with no links.
- There is some ambiguity in the specification of and in the above analysis. Resolve this ambiguity by writing out explicit forms for and , and then verifying that the remainder of the proof of the bound goes through.
Suppose, instead, that the new link had been added from an existing page. Then the term above would no longer vanish, and the bound would become
where was the initial PageRank for the page to which the link was added, and was the initial number of links from that page.
Generalizing this analysis, it’s possible to write a single inequality that unites all our results to date. In particular, suppose we add new pages to the web. Denote the final -dimensional PageRank vector by . Let denote the initial -dimensional PageRank vector, and let denote the -dimensional extension obtained by appending entries to . Suppose we add outbound links to page , and remove outbound links from page . Then an analysis similar to that above shows that:
where, as before, the sum is over those pages which are non-dangling, and the sum is over dangling pages . Note that we can omit all the new pages from this latter sum, for the same reason the quantity vanished earlier in this section. This inequality generalizes all our earlier results.
- Fill in the details of the above proof. To do so it helps to consider the change in PageRank in two steps: (1) the change due to the addition of new links to existing pages; and (2) the change due to the addition of new webpages, and links on those webpages.
Recomputing PageRank on the fly
Part of my reason for being interested in the questions discussed in this post is that I’m interested in how to quickly update PageRank during a web crawl. The most obvious way to compute the PageRank of a set of crawled webpages is as a batch job, doing the computation maybe once a week or once a month. But if you’re operating a web crawler, and constantly adding new pages to an index, you don’t want to have to wait a week or a month before computing the PageRank (or similar measure) of a newly crawled page. You’d like to compute – or at least estimate – the PageRank immediately upon adding the page to the index, without having to do an enormous matrix computation. Is there a way this can be done?
Unfortunately, I don’t know how to quickly update the PageRank. However, the bounds in this post do help at least a little in figuring out how to update the PageRank as a batch job. Suppose we had an extant multi-billion page crawl, and knew the corresponding values for PageRank. Suppose then that we did a mini-crawl to update the index, perhaps with a few million new pages and links. Suppose was the (known value of) the PageRank vector before the mini-crawl, and is the PageRank after the mini-crawl, which we are now trying to compute. Suppose also that is the new PageRank matrix. One natural way of updating our estimate for PageRank would be to apply repeatedly to (more precisely, its extension, ), in the hopes that it will quickly converge to . The bounds in this post can help establish how quickly this convergence will happen. To see this, observe that
where we used the PageRank equation to get the first equality, and the contractivity of PageRank to get the second inequality. This equation tells us that we can compute an estimate for by applying repeatedly to , and it will converge exponentially quickly. This is all standard stuff about PageRank. However, the good news is that the results in this post tell us something about how to bound , and frequently this quantity will be very small to start with, and so the convergence will occur much more quickly than we would a priori expect.
Problems for the author
- Is there a better way of updating PageRank on the fly, rather than applying powers of ?