Brian Eno on conservatism and creativity

A great quote from Brian Eno about the conservative force that comes from previous success:

I’m afraid to say that admirers can be a tremendous force for conservatism, for consolidation. Of course it’s really wonderful to be acclaimed for things you’ve done – in fact it’s the only serious reward, because it makes you think “it worked! I’m not isolated!” or something like that, and it makes you feel gratefully connected to your own culture.

But on the other hand, there’s a tremendously strong pressure to repeat yourself, to do more of that thing we all liked so much. I can’t do that – I don’t have the enthusiasm to push through projects that seem familiar to me ( – this isn’t so much a question of artistic nobility or high ideals: I just get too bloody bored), but at the same time I do feel guilt for ‘deserting my audience’ by not doing the things they apparently wanted. I’d rather not feel this guilt, actually, so I avoid finding out about situations that could cause it.

The problem is that people nearly always prefer what I was doing a few years earlier – this has always been true. The other problem is that so, often, do I! Discovering things is clumsy and sporadic, and the results don’t at first compare well with the glossy and lauded works of the past. You have to keep reminding yourself that they went through that as well, otherwise they become frighteningly accomplished.

That’s another problem with being made to think about your own past – you forget its genesis and start to feel useless awe towards your earlier self: “How did I do it? Wherever did these ideas come from?”. Now, the workaday everyday now, always looks relatively less glamorous than the rose-tinted then (except for those magic hours when your finger is right on the pulse, and those times only happen when you’ve abandoned the lifeline of your own history).

Similar forces operate within science, although there the pressure comes not so much from admirers as from peers. Institutions want their scientists to get grants; grant agencies want scientists with a “track record”; and the natural outcome is a lot of people doing work that’s only marginally different from what they’ve done before, with a concentration in fashionable areas.


The economics of scientific collaboration

What economics can tell us about scientific collaboration

In this and several future posts I’m going to discuss what economics can tell us about scientific collaboration.

This may sound like a strange topic. Why should economics tell us anything interesting about scientific collaboration? Most discussions of economics are couched in terms of money, interest rates, prices, and so on. While these are relevant to science in a shallow, who’s-paying-for-this-lab-space kind of way, it’s not obvious we can learn anything deep about scientific collaboration by thinking in economic terms.

At a deeper level, though, economics is about understanding how human beings behave when one or more resources are scarce. How are those resources allocated? Are there more efficient ways they might be allocated? What tradeoffs are incurred?

There is a fundamental scarce resource in science, one whose allocation largely determines how science progresses. That scarce resource is expert attention. Who pays attention to what problems? How long do they spend on those problems? What institutional structures determine the answers to those questions? In short, what determines the architecture of scientific attention?

We can learn interesting things by thinking about these questions using ideas from economics. In this post I pull apart the way scientific collaboration works, and put it back together again within the conceptual framework economists use to understand free trade, using concepts like comparative advantage, opportunity cost, and markets. I’m doing this because the way we structure scientific attention is currently changing rapidly (by historical standards), as networked tools like wikis, blogs, Twitter, email, online databases and FriendFeed change the architecture of scientific attention. Understanding these changes is in part an economic problem, and the point of this post is to begin developing an economic perspective.

Comparative advantage, opportunity cost, and the benefits of free trade

Scientific collaboration can be viewed as a type of trade in expert attention. I can, for example, trade some of my skill as a theoretical physicist for someone else’s skills as a computational physicist, enabling us to jointly write a paper neither of us could have written alone.

To understand this collaboration-as-trade perspective, let’s review some ideas about trade in the context where trade is most often discussed, namely, free trade of goods. We’ll start with a beautiful simplified model of free trade, a model that goes back to a famous 1817 book “On the Principles of Political Economy and Taxation”, by the economist David Ricardo. Like many useful models, it leaves out a lot that’s relevant to the real world, but it does capture an essential element of the world, and we can learn a great deal by thinking about the model. In particular, the model demonstrates vividly why all parties involved in free trade can benefit, and is one of the main reasons most economists strongly support free trade.

(A small digression: there’s a family connection in this post, since David Ricardo was my Great-Great-Great-Great-Great-Uncle.)

Here’s the model. Imagine there are just two people in the world, Alice the farmer, and Bob the car technician. Alice is good at producing potatoes, but not much good at assembling cars, while Bob is good at assembling cars, but not so good at producing potatoes. Pretty obviously, both Alice and Bob can benefit if Alice concentrates on producing potatoes, Bob concentrates on assembling cars, and they then trade potatoes for cars. While this is intuitively clear, it’s worth making precise with more concrete details. Let’s suppose the effort required for Alice to assemble a car is equal to the effort she requires to produce 20 tonnes of potatoes. Put another way, each car she assembles has an opportunity cost of 20 tonnes of potatoes, since that’s how much assembling a car will cost her in lost production of potatoes. Similarly, suppose the effort for Bob to assemble a car is equal to the effort he requires to produce 2 tonnes of potatoes. That is, each car has an opportunity cost for Bob of just 2 tonnes of potatoes.

In this situation, Bob has a comparative advantage over Alice in the production of cars, because Bob’s opportunity cost for producing cars is less than Alice’s. Equivalently, Alice has a comparative advantage over Bob in the production of potatoes, for her opportunity cost to produce a tonne of potatoes is 1/20th of a car, which is less than Bob’s opportunity cost of half a car.

Suppose Alice and Bob each concentrate in the areas where they have a comparative advantage, i.e., Alice concentrates on producing potatoes, and Bob concentrates on building cars. They then trade potatoes for cars. Both Alice and Bob benefit if the rate at which they trade is greater than 2 and less than 20 tonnes of potatoes per car, because they both will end up with more cars and potatoes than either could have produced on their own. Furthermore, the greater the comparative advantage, the more both parties benefit. Put another way, the more people specialize, the more possible benefit there is in free trade.
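
To make the arithmetic concrete, here’s a minimal sketch in Python of the comparison. The effort budgets and the trading rate of 10 tonnes per car are invented purely for illustration; only the opportunity costs of 20 and 2 tonnes per car come from the story above.

# Toy illustration of Ricardo's model.  Output per unit of labour:
# Alice can make 1 car or 20 tonnes of potatoes, Bob 1 car or 2 tonnes.
effort = 10  # units of labour each person has available
alice = {"cars": 1, "potatoes": 20}
bob = {"cars": 1, "potatoes": 2}

# Self-sufficiency: each splits their labour evenly between the goods.
alice_alone = (0.5 * effort * alice["cars"], 0.5 * effort * alice["potatoes"])
bob_alone = (0.5 * effort * bob["cars"], 0.5 * effort * bob["potatoes"])

# Specialization and trade: Alice grows only potatoes, Bob builds only
# cars, and Bob trades 5 cars to Alice at 10 tonnes of potatoes per car.
rate = 10.0
cars_traded = 5
alice_trade = (cars_traded, effort * alice["potatoes"] - cars_traded * rate)
bob_trade = (effort * bob["cars"] - cars_traded, cars_traded * rate)

print("Alice alone: %s cars, %s tonnes; with trade: %s cars, %s tonnes"
      % (alice_alone + alice_trade))
print("Bob alone: %s cars, %s tonnes; with trade: %s cars, %s tonnes"
      % (bob_alone + bob_trade))

At any rate strictly between 2 and 20 tonnes per car, both parties end up with as many cars as before and strictly more potatoes than they could manage alone; changing the rate only shifts how the surplus is divided.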

It’s worth stressing that the key factor determining the benefits of trade is comparative advantage, not Alice and Bob’s absolute abilities. It might be, for example, that Bob is a layabout who’s lousy both at assembling cars and producing potatoes. Perhaps he’s only capable of assembling one car (or producing 2 tonnes of potatoes) for every ten days of labour. Alice, despite being a farmer, might actually be better than layabout-Bob at assembling cars, producing one car (or twenty tonnes of potatoes) for every 5 days of labour. Even though Alice has an absolute advantage in producing both cars and potatoes, she and Bob are both better off if they concentrate on the areas where they have a comparative advantage, and then trade. Although this example is contrived, it has many implications in the real world. For example, differences in education and infrastructure mean that people in different countries often have enormous differences in their absolute ability to produce goods. Despite this, people in both countries may still benefit from trade if they all concentrate on areas where they have a comparative advantage.

This is all pretty simple, but it’s not universally understood. Much anti-business rhetoric assumes a zero-sum world in which evil captains of industry exploit the defenseless poor, i.e., if one person benefits from a transaction, the other person must lose. Very often, that’s a bad assumption. Good businesspeople look for transactions where both parties benefit; wouldn’t you prefer doing business with enthusiastic trading partners, rather than people who feel coerced or exploited? Of course, sometimes unethical businesspeople do coerce their trading partners, and sometimes trade between two parties can damage a third – environmental issues like pollution often have this nature. But Ricardo’s model is a good starting point to understand how free trade can work to the benefit of all parties.

Markets as a mechanism for aggregating information about comparative advantage

One question not answered in Ricardo’s model is how the trading rate is set. At what rate between 2 and 20 tonnes of potatoes per car should Alice and Bob trade? There are many possible ways to set the rate. In our society, the standard way is to use money as a medium of exchange, with markets determining the price of the relevant goods.

Let’s suppose Alice and Bob participate in such a market, and that the market price is 10,000 dollars per car, and 1,000 dollars per tonne of potatoes. The market thus provides a mechanism by which Alice and Bob can effectively trade cars for potatoes at a rate of one car for ten tonnes of potatoes. This is within the range where it is beneficial for both of them to trade, and so both may enter the market.

What if, instead, the market price were 5,000 dollars for a car, and 5,000 dollars for a tonne of potatoes? Then the effective trading rate is one car for one tonne of potatoes. Bob will be worse off if he enters the market at that rate: he’s better off making his own cars and growing his own potatoes. The result is that Bob will withdraw from the car market, reducing the supply of cars. This will drive the market price of cars up a little, though probably not by enough for Bob to re-enter. But if enough people withdraw, the price of cars will rise a lot, and it will once again make sense for Bob to enter.
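
As a quick sanity check on that reasoning, here’s a small sketch working out the effective trading rate implied by a pair of market prices and comparing it with Bob’s opportunity cost. The prices are just the two scenarios discussed above.

def effective_rate(price_car, price_potatoes):
  '''Tonnes of potatoes one car buys at these prices.'''
  return price_car / float(price_potatoes)

def bob_sells_cars(price_car, price_potatoes, opportunity_cost=2.0):
  # Bob gains by specializing in cars only if a car buys him more
  # potatoes on the market than he gives up by not growing his own.
  return effective_rate(price_car, price_potatoes) > opportunity_cost

for prices in [(10000, 1000), (5000, 5000)]:
  print("Prices %s: %s tonnes/car, Bob sells cars: %s"
        % (prices, effective_rate(*prices), bob_sells_cars(*prices)))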

Money and markets thus serve several purposes. First, the market determines the price of different goods, and thus effectively sets exchange rates between different goods.

Second, the market price automatically aggregates information about comparative advantage, because the people who enter the market are those with a comparative advantage large enough that they can benefit from being in the market. People with a smaller comparative advantage have no reason to do so.

Third, while it’s possible to set up a barter market without the use of money, it’s obviously a great deal more efficient to use money as an intermediary, since for each type of good in the market, we need only keep track of a single price, rather than exchange rates with all the other types of good.

In fact, digressing briefly, it’s possible to prove that in an efficient barter market, an effective currency does emerge. By efficient, I mean that it’s not possible to increase your holdings by conducting a series of trades in immediate succession, e.g., by trading one ox for two cows, the two cows for one horse, and then the horse for two oxen. If this kind of trade is impossible, then it’s possible to just fix on one type of good – say, cows – as the effective unit of commerce, like the dollar, and peg all trades to that unit. From there it’s a small step to forgo the cows, introducing an abstract entity (i.e., money) to replace them. Furthermore, it’s reasonable to argue that you’d expect efficiency in this kind of market; if the market was inefficient in the way described above, then you’d expect one of the intermediaries in the transaction to realize it, and raise their prices, and so smooth away the inefficiency.
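
The ox-cow-horse example can be made concrete. Here’s a rough sketch, with made-up barter rates, that looks for a cycle of trades whose rates multiply to more than one; that is exactly the inefficiency an effective currency smooths away.

import itertools

# rate[(a, b)] is how many units of b one unit of a buys.  The numbers
# are invented, and mirror the ox/cow/horse example in the text.
rate = {("ox", "cow"): 2.0, ("cow", "ox"): 0.5,
        ("cow", "horse"): 0.5, ("horse", "cow"): 2.0,
        ("horse", "ox"): 2.0, ("ox", "horse"): 0.5}
goods = ["ox", "cow", "horse"]

def arbitrage_cycles(goods, rate):
  '''Return all trading cycles whose rates multiply to more than one.'''
  cycles = []
  for length in range(2, len(goods) + 1):
    for cycle in itertools.permutations(goods, length):
      product = 1.0
      for a, b in zip(cycle, cycle[1:] + (cycle[0],)):
        product *= rate[(a, b)]
      if product > 1.000001:
        cycles.append((cycle, product))
  return cycles

print(arbitrage_cycles(goods, rate))  # finds the ox -> cow -> horse cycle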

It’s remarkable how effective the market is at aggregating information about comparative advantage in this way. It lets us all take advantage of the combined efforts of millions of individuals, most doing tasks for which they have a considerable comparative advantage. Think about the number of people involved in producing a laptop computer. Tens or hundreds of thousands of people participated directly in designing and producing the components in that laptop; most of those people had considerable (in some cases, enormous) comparative advantage in the skills they contributed. When you buy a laptop, your few hundred dollars buys you the accumulated wisdom from a design history of billions of hours, stretching all the way back to the earliest computers. Beyond that, hundreds of millions of people contribute capital (e.g., via retirement funds) used to build infrastructure like chip foundries. Chances are that anywhere between a few dollars and a few hundred dollars from your retirement fund was invested in the chip foundry that produced the processor for the computer that I’m typing these words on. We’re both benefiting right now from differences in comparative advantage.

By providing a way of identifying and taking advantage of comparative advantage, markets encourage people to specialize, creating even greater disparities in comparative advantage, and thus producing more mutual benefit. The better the market operates, the stronger this feedback effect becomes. Although it’s currently fashionable to bash markets (and economists), in fact many technologies we take for granted – cars, airliners, computers, telecommunications – would be near impossible without the modern market infrastructure.

Comparative advantage and scientific collaboration

Let’s construct a simple model of scientific collaboration inspired by Ricardo’s model of free trade. The model is, of course, a great oversimplification of how collaboration works; the point isn’t to capture the reality of collaboration exactly, but rather to illuminate some elements.

We’ll imagine two people, Alice and Bob, a physicist and a chemist, respectively. Alice is working on a problem in physics, but as she works an unanticipated problem arises, let’s say in chemistry. Let’s suppose for the sake of argument that the problem requires 100 hours of straightforward physics to solve, and 10 hours of straightforward chemistry. (The real numbers in most significant scientific problems are probably larger, but these numbers make the discussion below a little easier to read.) Unfortunately, Alice isn’t much of a chemist, and it would take her 200 hours to do the chemistry part of the problem, mostly spent bumbling around learning the required material. Alternately, if Bob got involved in the project, he could solve the chemistry problem in just ten hours.

There are two scenarios here. In the first, Alice does all the work, it takes 300 hours, and Alice gets all the credit for the paper published as a result. In the second, Alice does 100 hours of work, Bob does 10 hours of work, and they split the credit. Let’s say Alice ends up as first author on a paper describing the work, and Bob ends up as second author, and let’s further say that Alice gets two thirds of the credit as a result, and Bob gets one third of the credit.

Per hour worked, Alice is much better off in the collaborative scenario, getting two thirds of the reward for only one third of the effort. Bob is probably also better off, although the reason is more subtle: if Bob entered the collaboration freely, then it was presumably because Bob felt this was the best use of his time. This is not always the case – if Bob works for Alice he may have to do the work (or find another job), even though he’d do better science if he concentrated on other projects. This is a case where the trade is not completely free, but rather there is coercion. We’ll assume, though, that no coercion is involved, and that both parties benefit.

Let’s fill the model out a little more. Imagine that Bob’s alternative to collaboration is to go off and write straight-up chemistry papers, on his own, taking 110 hours to write each paper, and getting full credit for the paper. He is still better off working with Alice, for he gets one third of the credit for only 10 hours worth of work. Both Alice and Bob benefit, just as in Ricardo’s model.

Another similarity to Ricardo’s model is that it is comparative, not absolute, advantage which is critical. Let’s imagine Bob is actually a beginning chemistry student, and takes 100 hours to complete the work Alice needs done. He’s still better off working with Alice than working on his own, for on his own it would take 1100 hours to write a chemistry paper. Furthermore, Alice is still better off working with Bob than on her own, for the time she saves on doing chemistry is time she can put to work doing physics.

As an illustration of these ideas in a different context, consider the way many supervisors work with students and postdocs. The supervisors suggest problems, reading materials, likely angles of attack, and so on – all areas in which their experience gives them an enormous absolute advantage, and a sizable comparative advantage. The students do the detailed work in the lab. Many supervisors will have an absolute advantage in such lab work, but it is likely to be much smaller, and so the student likely has a comparative advantage in doing such work. Any time the supervisor spends doing such detailed lab work has an opportunity cost in time lost suggesting problems, reading materials, and the like for another student.

An important difference between this model and Ricardo’s lies in the way we define the benefit to the parties involved. In the case of Ricardo’s model, the benefit is entirely intrinsic: Alice and Bob both want cars and potatoes. In the scientific case, there’s no intrinsic desire the parties have for “expert attention”. Rather, the benefit lies in the reputational credit derived from publications. This difference complicates the analysis of when it is worth it to collaborate. Instead of a simple trading rate, one must consider the way in which reputational credit is split. It is the ratio of this split to the opportunity cost that determines when it makes sense to collaborate. If Alice got 95 percent of the credit, and Bob only 5 percent of the credit, obviously it would not be in Bob’s interest to collaborate. In a future post, I’ll address this more fully, as well as many other aspects of this model.
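
To make that concrete, here’s a small sketch using the numbers from the model above, treating credit earned per hour as the quantity each party compares. The break-even shares follow from the 300-hour and 110-hour solo alternatives.

# Alice: 300 hours solo for full credit, or 100 hours in the
# collaboration for a share (1 - c).  Bob: 110 hours per solo chemistry
# paper for full credit, or 10 hours in the collaboration for a share c.

def alice_gains(c):
  return (1.0 - c) / 100 > 1.0 / 300

def bob_gains(c):
  return c / 10.0 > 1.0 / 110

for c in [0.05, 1.0 / 3, 0.7]:
  print("Bob's share %.2f -- Alice gains: %s, Bob gains: %s"
        % (c, alice_gains(c), bob_gains(c)))

# Both gain whenever 10/110 < c < 2/3, i.e. roughly 0.09 < c < 0.67.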

For now, let me simply point out the relative lack of mechanisms science has for aggregating information about comparative advantage. Mostly, we do it by word of mouth and personal connection, the same way our ancestors traded goods, and so we don’t get the advantages that come from modern markets.

There are good reasons it’s difficult to set up efficient collaboration markets in expert attention. Creative problems are often highly specialized one-off problems, quite unlike the commodities traded in most markets. Until very recently, markets in such specialized goods were relatively uncommon and rather limited even in the realm of physical goods. This has recently changed, with online markets such as eBay showing that it is possible to set up markets which are highly specialized, provided suitable search and reputational tools are in place.

To the extent such collaboration markets do exist in science, they still operate very inefficiently compared with markets for trade in goods. There are considerable trust barriers that inhibit trading relationships from being set up. There is no medium of exchange (cf. the posts by Shirley Wu and Cameron Neylon on this topic). The end result is that mechanisms for identifying and aggregating comparative advantage are downright primitive compared with markets for physical goods.

Perhaps the best existing examples of collaboration markets occur in the open source programming community. No single model is used throughout that community, but for many open source projects the basic model is to set up one or more online fora (email discussion lists, wikis, bug-tracking software, etcetera) which is used to co-ordinate activity. The fora are used to advertise problems people are having, such as bugs they’d like fixed, or features they’d like added. People then volunteer to solve those problems, with self-selection ensuring that work is most often done by people with a considerable comparative advantage. The forum thus acts as a simple mechanism for aggregating information about comparative advantage. While this mechanism is primitive compared with modern markets, the success of open source is impressive, and the mechanisms for aggregating information about comparative advantage in expert attention will no doubt improve.

Let me conclude with a question that’s still puzzling me. As I mentioned before, markets have creative power: without them, it’s unlikely that sophisticated goods like laptops and aeroplanes could exist. I’d like to better understand whether more efficient collaboration markets can cause a similar shift in what scientific problems can be solved. Might scientific problems now regarded as out of reach become accessible with more effective ways of structuring scientific attention?

Using your laptop to compute PageRank for millions of webpages

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web as a whole, simply because the web contains so many webpages. While just a few lines of code can be used to implement PageRank on collections of a few thousand webpages, it’s trickier to compute PageRank for larger sets of pages.

The underlying problem is that the most direct way to compute the PageRank of [tex]n[/tex] webpages involves inverting an [tex]n \times n[/tex] matrix. Even when [tex]n[/tex] is just a few thousand, this means inverting a matrix containing millions or tens of millions of floating point numbers. This is possible on a typical personal computer, but it’s hard to go much further.

In this post, I describe how to compute PageRank for collections containing millions of webpages. My little laptop easily coped with two million pages, using about 650 megabytes of RAM and a few hours of computation – at a guess it would max out at about 4 to 5 million pages. More powerful machines could cope with upwards of 10 million pages. And that’s with a program written in Python, an interpreted language not noted for its efficiency! Of course, this number still falls a long way short of the size of the entire web, but it’s a lot better than the naive approach. In a future post I’ll describe a more scalable parallel implementation that can be used to compute PageRank for hundreds of millions of pages.

Let’s quickly run through the basic facts we need to know about PageRank. The idea is that we number webpages [tex]1,\ldots,n[/tex]. For webpage number [tex]j[/tex] there is an associated PageRank [tex]q_j[/tex] which measures the importance of page [tex]j[/tex]. The PageRank is a number between [tex]0[/tex] and [tex]1[/tex]; the bigger the PageRank, the more important the page. Furthermore, the vector [tex]q = (q_1,\ldots,q_n)[/tex] of all the PageRanks is a probability distribution, that is, the PageRanks sum to one.

How is the PageRank vector [tex]q[/tex] computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector [tex]q[/tex] can be defined by the equation (explanation below):

[tex] q \leftarrow M^j P. [/tex]

What this equation represents is a starting distribution [tex]P[/tex] for the crazy websurfer, and then [tex]j[/tex] steps of “surfing”, with each action of [tex]M[/tex] representing how the distribution changes in a single step. [tex]P[/tex] is an [tex]n[/tex]-dimensional vector, and [tex]M[/tex] is an [tex]n \times n[/tex] matrix whose entries reflect the link structure of the web in a way I’ll describe below. The PageRank [tex]q[/tex] is defined in the limit of large [tex]j[/tex] – in our examples, convergence occurs for [tex]j[/tex] in the range [tex]20[/tex] to [tex]30[/tex]. You might wonder how [tex]P[/tex] is chosen, but part of the magic of PageRank is that it doesn’t matter how [tex]P[/tex] is chosen, provided it’s a probability distribution. The intuition here is that the starting distribution for the websurfer doesn’t matter to their long-run behaviour. We’ll therefore choose the uniform probability distribution, [tex]P = (1/n,1/n,\ldots)[/tex].

How is the matrix [tex]M[/tex] defined? It can be broken up into three pieces

[tex] M = s A + s D + t E, [/tex]

which correspond to, respectively, a contribution [tex]sA[/tex] representing the crazy websurfer randomly picking links to follow, a contribution [tex]sD[/tex] due to the fact that the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), and so something else needs to be done in that case, and finally a contribution [tex]tE[/tex] representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set [tex]s = 0.85[/tex] and [tex]t = 1-s = 0.15[/tex] as the respective probabilities for randomly selecting a link and teleporting. See here for discussion of the reasons for this choice.

The matrix [tex]A[/tex] describes the link structure of the web. In particular, suppose we define [tex]\#(j)[/tex] to be the number of links outbound from page [tex]j[/tex]. Then [tex]A_{kj}[/tex] is defined to be [tex]0[/tex] if page [tex]j[/tex] does not link to [tex]k[/tex], and [tex]1/\#(j)[/tex] if page [tex]j[/tex] does link to page [tex]k[/tex]. Stated another way, the entries of the [tex]j[/tex]th column of [tex]A[/tex] are zero, except at locations corresponding to outgoing links, where they are [tex]1/\#(j)[/tex]. The intuition is that [tex]A[/tex] describes the action of a websurfer at page [tex]j[/tex] randomly choosing an outgoing link.

The matrix [tex]D[/tex] is included to deal with the problem of dangling pages, i.e., pages with no outgoing links, where it is ambiguous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if [tex]j[/tex] is a dangling page, then the [tex]j[/tex]th column of [tex]D[/tex] should have all its entries [tex]1/n[/tex], otherwise all the entries should be zero. A compact way of writing this is

[tex] D = e d^T/ n, [/tex]

where [tex]d[/tex] is the vector of dangling pages, i.e., the [tex]j[/tex]th entry of [tex]d[/tex] is [tex]1[/tex] if page [tex]j[/tex] is dangling, and otherwise is zero. [tex]e[/tex] is the vector whose entries are all [tex]1[/tex]s.

The final piece of [tex]M[/tex] is the matrix [tex]E[/tex], describing the bored websurfer teleporting somewhere else at random. This matrix has entries [tex]1/n[/tex] everywhere, representing a uniform probability of going to another webpage.
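
To see how the pieces fit together, here’s a small sketch that builds [tex]A[/tex], [tex]D[/tex], [tex]E[/tex], and [tex]M[/tex] explicitly for a made-up three-page web, and checks that each column of [tex]M[/tex] sums to one, as it must if [tex]M[/tex] is to map probability distributions to probability distributions.

import numpy

# A made-up three-page web: page 0 links to pages 1 and 2, page 1
# links to page 0, and page 2 is dangling (no outbound links).
n = 3
out_links = {0: [1, 2], 1: [0], 2: []}
s, t = 0.85, 0.15

# A: column j holds 1/#(j) in the rows of the pages j links to.
A = numpy.zeros((n, n))
for j in out_links:
  for k in out_links[j]:
    A[k, j] = 1.0 / len(out_links[j])

# D: columns for dangling pages are uniformly 1/n, other columns zero.
d = numpy.array([1.0 if not out_links[j] else 0.0 for j in range(n)])
D = numpy.outer(numpy.ones(n), d) / n

# E: uniform teleportation.
E = numpy.ones((n, n)) / n

M = s * A + s * D + t * E
print(M)
print(M.sum(axis=0))  # each column should sum to 1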

Exercises

  • This exercise is for people who’ve read my earlier post introducing PageRank. In that post you’ll see that I’ve broken up [tex]M[/tex] in a slightly different way. You should verify that the breakup in the current post and the breakup in the old post result in the same matrix [tex]M[/tex].

Now, as we argued above, you quickly run into problems in computing the PageRank vector if you naively try to store all [tex]n \times n = n^2[/tex] elements in the matrix [tex]M[/tex] or [tex]M^j[/tex]. Fortunately, it’s possible to avoid doing this. The reason is that the matrices [tex]A, D[/tex] and [tex]E[/tex] all have a special structure. [tex]A[/tex] is a very sparse matrix, since most webpages only link to a tiny fraction of the web as a whole, and so most entries are zero. [tex]D[/tex] is not ordinarily sparse, but it does have a very simple structure, determined by the vector (not matrix!) of dangling pages, [tex]d[/tex], and this makes it easy to do vector multiplications involving [tex]D[/tex]. Finally, the matrix [tex]E[/tex] is even more simple, and it’s straightforward to compute the action of [tex]E[/tex] on a vector.

The idea used in our algorithm is therefore to compute [tex]M^jP[/tex] by repeatedly multiplying the vector [tex]P[/tex] by [tex]M[/tex], and with each multiplication using the special form [tex]M = sA+sD+tE[/tex] to avoid holding anything more than [tex]n[/tex]-dimensional vectors in memory. This is the critical step that lets us compute PageRank for a web containing millions of webpages.

The crux of the algorithm, therefore, is to determine [tex](Mv)_j[/tex] for each index [tex]j[/tex], where [tex]v[/tex] is an arbitrary vector. We do this by computing [tex](Av)_j, (Dv)_j[/tex] and [tex](Ev)_j[/tex] separately, and then summing the results, with the appropriate prefactors, [tex]s[/tex] or [tex]t[/tex].

To compute [tex](Av)_j[/tex], just observe that [tex](Av)_j = \sum_k A_{jk} v_k[/tex]. This can be computed by summing over all the pages [tex]k[/tex] which have links to page [tex]j[/tex], with [tex]A_{jk} = 1/\#(k)[/tex] in each case. We see that the total number of operations involved in computing all the entries in [tex]Av[/tex] is [tex]\Theta(l)[/tex], where [tex]l[/tex] is the total number of links in the web graph, a number that scales far more slowly than [tex]n^2[/tex]. Furthermore, the memory consumed is [tex]\Theta(n)[/tex].

I’m being imprecise in analysing the number of operations and the memory consumption, not worrying about small constant factors. If we were really trying to optimize our algorithm, we’d be careful to keep track of exactly what resources are consumed, rather than ignoring those details. However, as we’ll see below, the advantages of the approach we’re now considering over the approach based on matrix inversion are enormous, and so ignoring a few small factors is okay for the purposes of comparison.

To compute [tex]Dv[/tex], observe that [tex]Dv = e (d \cdot v) / n[/tex], and so it suffices to compute the inner product [tex]d \cdot v[/tex], which in the worst case requires [tex]\Theta(n)[/tex] operations, and [tex]\Theta(n)[/tex] memory, to store [tex]d[/tex].

Finally, computing [tex](Ev)_j[/tex] is easy: it is just [tex](e \cdot v)/n[/tex], which requires virtually no additional computation or memory.
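
Putting the three pieces together, here’s a compact sketch of the matrix-free multiplication. It uses the same dictionary-of-lists representation as the full program below, but is written independently of it, so the names are illustrative.

import numpy

def multiply_M(v, in_links, num_out, dangling, s=0.85):
  '''Compute M v without forming the n x n matrix M.  in_links[j] is
  the list of pages linking to page j, num_out[k] is the number of
  outbound links on page k, and dangling is the set of pages with no
  outbound links.  v is a numpy array.'''
  n = len(v)
  t = 1.0 - s
  # (A v)_j: sum over the pages k linking to j of v_k / #(k).
  Av = numpy.array([sum(v[k] / num_out[k] for k in in_links[j])
                    for j in range(n)])
  # (D v)_j = (d . v) / n, the same value for every j.
  Dv = sum(v[k] for k in dangling) / n
  # (E v)_j = (e . v) / n.
  Ev = v.sum() / n
  return s * Av + s * Dv + t * Ev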

It follows that we can compute [tex]Mv[/tex] using a total number of operations [tex]\Theta(n+l)[/tex], and total memory [tex]\Theta(n)[/tex], and thus approximate the PageRank vector

[tex] q \approx M^j P [/tex]

with a total number of operations [tex]\Theta(j (n+l))[/tex], and with memory consumption [tex]\Theta(n)[/tex].

These are imposingly large numbers. Suppose we have [tex]n = 10^6[/tex] webpages, [tex]l = 10^7[/tex] links, and use [tex]j = 10^2[/tex] iterations. Then the total amount of memory involved is on the order of [tex]10^6[/tex] floating point numbers, and the total number of operations is on the order of [tex]10^9[/tex].

On the other hand, if we’d used the method that required us to store the full matrix [tex]M[/tex], we’d have needed to store on the order of [tex]10^{12}[/tex] floating point numbers. Furthermore, using Gaussian elimination the matrix inversion requires on the order of [tex]10^{18}[/tex] operations, give or take a few (but only a few!) orders of magnitude. More efficient methods for matrix inversion are known, but the numbers remain daunting.
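
For the record, the comparison in the last two paragraphs amounts to the following back-of-the-envelope arithmetic, using the figures quoted above:

n, l, j = 10**6, 10**7, 10**2
print("Iterative: memory ~ %.0e floats, operations ~ %.0e" % (n, j * (n + l)))
print("Naive:     memory ~ %.0e floats, operations ~ %.0e" % (n**2, n**3))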

The code

Let’s think about how we’d code this algorithm up in Python. If you’re not already familiar with Python, then a great way to learn is “Dive into Python”, an excellent free online introductory book about Python by Mark Pilgrim. You can read the first five chapters in just a few hours, and that will provide you with enough basic understanding of Python to follow most code, and to do a lot of coding yourself. Certainly, it’s more than enough for all the examples in this post series.

The most important part of our implementation of PageRank is to choose our representation of the web in a compact way, and so that the information we need to compute [tex]M^jP[/tex] can be retrieved quickly. This is achieved by the following definition for a class web, whose instances represent possible link structures for the web:

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

Suppose g is an object of type web, created using, e.g., g = web(3), to create a web of 3 pages. g.size = 3 for this web. g.in_links is a dictionary whose keys are numbers for the webpages, 0,1,2, and with corresponding values being lists of the inbound webpages. These lists are initially set to contain no webpages:

g.in_links = {0 : [], 1 : [], 2 : []}

Note that in Python lists are indexed [tex]0,1,2,\ldots[/tex], and this makes it convenient to number pages [tex]0,\ldots,n-1[/tex], rather than [tex]1,\ldots,n[/tex], as we have been doing up to now.

The advantages of storing the in-links as a dictionary of lists are two-fold. First, by using this representation, we store no more information than we absolutely need. E.g., we are not storing information about the absence of links, as we would be if we stored the entire adjacency matrix. Second, Python dictionaries are a convenient way to store the information, because they don’t require any ordering information, and so are in many ways faster to manipulate than lists.

You might be wondering why there is no attribute g.out_links in our class definition. The glib answer is that we don’t need it – after all, it can, in principle, be reconstructed from g.in_links. A better answer is that in real applications it would be quite sensible to also have g.out_links, but storing that extra dictionary would consume a lot of extra memory, and we avoid it since we can compute PageRank without it. An even better answer would be to make use of a data structure that gives us the speed and convenience provided by both g.in_links and g.out_links, but only requires links to be stored once. This can certainly be done, but requires more work, and so I’ve avoided it.

Let’s finish our description of the attributes of g. g.number_out_links is a dictionary such that g.number_out_links[j] is the number of out-links from page j. Initially these values are all set to zero, but they will be incremented when links are added. We need to store the number of outgoing links in order to compute the probabilities [tex]1/\#(k)[/tex] used in the PageRank algorithm, as described earlier. While this could in principle be computed from the dictionary g.in_links, in practice that would be extremely slow.

Finally, g.dangling_pages is a dictionary whose key set is the set of dangling webpages. The values of the keys are placeholders which are never used, and so we use True as a placeholder value. Initially, all webpages are dangling, but we remove keys when links are added.
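
The full program below fills in these dictionaries inside random_web, but it may be easier to see the bookkeeping in isolation. Here’s a hypothetical add_link helper (not part of the program that follows) showing how all three dictionaries are updated when a single link is added:

def add_link(g, source, target):
  '''Record a link from page source to page target in a web object g.'''
  g.in_links[target].append(source)
  # Once source has an outbound link it is no longer dangling.
  if g.number_out_links[source] == 0:
    g.dangling_pages.pop(source, None)
  g.number_out_links[source] += 1

g = web(3)
add_link(g, 0, 1)
print(g.in_links)          # {0: [], 1: [0], 2: []}
print(g.number_out_links)  # {0: 1, 1: 0, 2: 0}
print(g.dangling_pages)    # {1: True, 2: True}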

With these data structures in place, the rest of the program is a pretty straightforward implementation of the algorithm described earlier. Here’s the code:

# Generates a random web link structure, and finds the
# corresponding PageRank vector.  The number of inbound
# links for each page is controlled by a power law
# distribution.
#
# This code should work for up to a few million pages on a
# modest machine.
#
# Written and tested using Python 2.5.

import numpy
import random

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

def paretosample(n,power=2.0):
  '''Returns a sample from a truncated Pareto distribution
  with probability mass function p(l) proportional to
  1/l^power.  The distribution is truncated at l = n.'''
  m = n+1
  while m > n: m = numpy.random.zipf(power)
  return m

def random_web(n=1000,power=2.0):
  '''Returns a web object with n pages, and where each
  page k is linked to by L_k random other pages.  The L_k
  are independent and identically distributed random
  variables with a shifted and truncated Pareto
  probability mass function p(l) proportional to
  1/(l+1)^power.'''
  g = web(n)
  for k in xrange(n):
    lk = paretosample(n+1,power)-1
    values = random.sample(xrange(n),lk)
    g.in_links[k] = values
    for j in values: 
      if g.number_out_links[j] == 0: g.dangling_pages.pop(j)
      g.number_out_links[j] += 1
  return g

def step(g,p,s=0.85):
  '''Performs a single step in the PageRank computation,
  with web g and parameter s.  Applies the corresponding M
  matrix to the vector p, and returns the resulting
  vector.'''
  n = g.size
  v = numpy.matrix(numpy.zeros((n,1)))
  inner_product = sum([p[j] for j in g.dangling_pages.keys()])
  for j in xrange(n):
    v[j] = s*sum([p[k]/g.number_out_links[k] 
    for k in g.in_links[j]])+s*inner_product/n+(1-s)/n
  # We rescale the return vector, so it remains a
  # probability distribution even with floating point
  # roundoff.
  return v/numpy.sum(v)  

def pagerank(g,s=0.85,tolerance=0.00001):
  '''Returns the PageRank vector for the web g and
  parameter s, where the criterion for convergence is that
  we stop when M^(j+1)P-M^jP has length less than
  tolerance, in l1 norm.'''
  n = g.size
  p = numpy.matrix(numpy.ones((n,1)))/n
  iteration = 1
  change = 2
  while change > tolerance:
    print "Iteration: %s" % iteration
    new_p = step(g,p,s)
    change = numpy.sum(numpy.abs(p-new_p))
    print "Change in l1 norm: %s" % change
    p = new_p
    iteration += 1
  return p

print '''Now computing the PageRank vector for a random
web containing 10000 pages, with the number of inbound
links to each page controlled by a Pareto power law
distribution with parameter 2.0, and with the criterion
for convergence being a change of less than 0.0001 in l1
norm over a matrix multiplication step.'''

g = random_web(10000,2.0) # works up to several million
                          # pages.
pr = pagerank(g,0.85,0.0001)

As written, the code computes PageRank for a 10,000 page web, and I recommend you start with that number of pages, to check that the code runs. Once that’s done, you can change the parameter in the second last line to change the number of pages. If you do that, you may wish to change the tolerance 0.0001 in the final line, decreasing it to ensure a tolerance more appropriate to the number of pages, say [tex]10^{-7}[/tex].

What determines the tolerance that should be used? I have used the [tex]l_1[/tex] norm to quantify the rate of convergence in the code, but that is a rather arbitrary measure. In particular, an alternate measure might be to look at the magnitude of the differences between individual elements, computing [tex]\max_j |(Mv-v)_j|[/tex], and requiring that it be substantially smaller than the minimal possible value for PageRank, [tex]t/n[/tex]. This is less conservative, and arguably a more relevant measure; the code is easily changed.
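
Here’s a sketch of what that change might look like, as a variant of the pagerank function above, shown only to illustrate the alternative stopping rule (the name and the safety factor are arbitrary):

def pagerank_maxnorm(g, s=0.85, safety_factor=0.01):
  '''Variant of pagerank: stop when the largest change in any single
  component of the PageRank vector is much smaller than the minimum
  possible PageRank, (1-s)/n.'''
  n = g.size
  tolerance = safety_factor * (1 - s) / n
  p = numpy.matrix(numpy.ones((n, 1))) / n
  change = 2
  while change > tolerance:
    new_p = step(g, p, s)
    change = numpy.max(numpy.abs(p - new_p))  # max_j |(Mv - v)_j|
    p = new_p
  return p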

Exercises

  • The code above is complex enough that bugs are likely. As a test, use an alternate method to compute PageRank for small web sizes, and compare your results against the output from the code above. Do you believe the code is correct? (One possible cross-check is sketched just after this list.)
  • Modify the program described in this lecture so that it works with a non-uniform teleportation vector, as described in an earlier post.
  • Python is a great language, but it’s not designed with speed in mind. Write a program in C to compute the PageRank for a random web. Can you compute the PageRank for ten million pages? What’s the largest number you’re able to compute on a single machine?
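
For the first exercise, one natural cross-check is to build the matrix [tex]M[/tex] densely for a small web, apply it repeatedly, and compare against the output of pagerank. Here’s a rough sketch in the same Python 2.5 style as the program above; the function name and the choice of 100 iterations are arbitrary.

def dense_pagerank(g, s=0.85, iterations=100):
  '''Slow but simple cross-check: build the full n x n matrix M for a
  small web g and apply it repeatedly to the uniform distribution.'''
  n = g.size
  A = numpy.matrix(numpy.zeros((n, n)))
  for j in xrange(n):
    for k in g.in_links[j]:
      A[j, k] = 1.0 / g.number_out_links[k]
  d = numpy.matrix(numpy.zeros((n, 1)))
  for j in g.dangling_pages.keys():
    d[j] = 1.0
  e = numpy.matrix(numpy.ones((n, 1)))
  M = s * A + s * e * d.T / n + (1 - s) * e * e.T / n
  p = e / n
  for i in xrange(iterations):
    p = M * p
  return p

small = random_web(100, 2.0)
difference = dense_pagerank(small) - pagerank(small, 0.85, 1e-10)
print("Largest discrepancy: %s" % numpy.max(numpy.abs(difference)))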

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. You can subscribe to my blog here.


Biweekly links for 12/22/2008

Click here for all of my del.icio.us bookmarks.


Biweekly links for 12/19/2008

  • Do Tank and the Democracy Design Workshop
    • “The goal of the Virtual Company Project is to build online tools to help groups create and implement governance rules necessary for successful collaboration. The project is premised on the belief that the right graphical interfaces can translate the structures of the group into clear and intelligible procedures that will enable teams to make decisions, control assets and enter into contractual relationships with third parties. The Virtual Company project is creating the interfaces and designing the back-end functionality that is needed to enable group participants to see themselves in relation to the group as a whole, allocate roles, establish accountability to the group, make collective decisions, and administer group assets, expenditures and distributions.”
  • OnTheCommons.org » Online Collaboration Goes Legit
    • Good discussion of the recently passed Vermont law which allows online-only organizations to file for status as corporations.
  • Eddie A. Tejeda: Google might be doing more than making us stupid
    • Nice framing of the question, even if he ultimately goes off on a tangent: “While Socrates was right that writing made us lose our ability to memorize, he was unable to foresee (or maybe foresaw, but still did not approve) the rise of engaged and literate societies, which have been responsible for many of the great creations of humanity.

      So what type of world are we unable to yet see by high level and data extracting reading that the Internet so easily provides? Well, it’s hard to say, but there is another angle we can approach this to give us an idea.”

  • Benjamin Zander
    • Remarkable. A music lesson from the conductor of the Boston Philharmonic, live on stage.
  • Native Client: Google’s craziest idea yet | Fatal Exception | Neil McAllister | InfoWorld
    • Google’s Native Client uses sandboxing / virtualization to allow desktop(ish) software to run inside the browser. They apparently have a demo with Quake running, which is pretty impressive. Try doing that with Javascript!
  • Paul Allen: Piece of mind | The Economist
    • “The scientists used state-of-the-art technology to dissect a mouse brain, photographed it sliver section by section, then reassembled it in a computer database that would allow easy access. But it was the speed at which the project was accomplished and what they did with it afterwards that changed the game.

      They released it to the public. Over the internet. Free.

      When we first put the mouse-brain atlas online free, it was met by the research world with suspicion. People wondered what the catch was. Scientific research has long been a solitary endeavour—one researcher, one microscope. Findings are protected so that discovery credit can be clearly defined and awarded…

      … Today we have many scientists using the atlas for their research into Alzheimer’s, bipolar disorders, Down’s syndrome, Parkinson’s, fragile x mental retardation and epilepsy. The atlas is also giving scientists insight into alcoholism, obesity, sleep, hearing and memory.”

  • Synthese Recommender Service
    • Recommendation engine for scientific journal articles. The areas covered are too far outside my area of scientific expertise to know how good the recommendations are.
  • Publish in Wikipedia or perish : Nature News
    • “Wikipedia, meet RNA. Anyone submitting to a section of the journal RNA Biology will, in the future, be required to also submit a Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.”
  • Caveat Lector » Proto-librarians and computers
  • OpenNet Initiative
    • Tracks which governments are currently blocking what.

Click here for all of my del.icio.us bookmarks.


Building intuition about PageRank (video)

This is a video accompaniment to my post on building intuition about PageRank:




The video is a re-recording. The first recording, which had a live audience, failed, and so I decided to re-record. I was surprised at how difficult it is to speak to an empty room – I prefer a good discussion, with people asking clarifying questions and so on. In any case, the video for the next post should be with an audience.

The post is part of a series about the Google Technology Stack. The posts are designed to be self-contained, even as they build toward the overall goal of understanding many of the fundamental technologies used at Google, and so the video will hopefully make sense without watching the video for the earlier post. There’s also a FriendFeed room for discussion here.


Biweekly links for 12/15/2008

  • Mark Newman: The structure of scientific collaboration networks
    • “We investigate the structure of scientific collaboration networks. We consider two scientists to be connected if they have authored a paper together, and construct explicit networks of such connections using data drawn from a number of databases, including MEDLINE (biomedical research), the Los Alamos e-Print Archive (physics), and NCSTRL (computer science). We show that these collaboration networks form “small worlds” in which randomly chosen pairs of scientists are typically separated by only a short path of intermediate acquaintances. We further give results for mean and distribution of numbers of collaborators of authors, demonstrate the presence of clustering in the networks, and highlight a number of apparent differences in the patterns of collaboration between the fields studied. “

Click here for all of my del.icio.us bookmarks.


The PageRank distribution for the web

The PageRank for a webpage is a probability between 0 and 1. The general idea is that PageRank quantifies the importance of the page: the bigger the probability the more important the page. I was curious about what the PageRank distribution is across the web as a whole, and so I built a very simple model to investigate the question.

Here are the three assumptions that went into the model.

1. I use just 200 webpages. The qualitative results don’t seem to vary much as the number of webpages increases further, and the results get both harder to interpret and harder to generate, so 200 seems like a good size.

2. I assume that the number [tex]L[/tex] of inbound links to any given webpage is governed by a Pareto probability distribution [tex]p(L) \propto 1/L^a[/tex], with [tex]a = 2[/tex]. This assumption is based on a paper by Adamic and Huberman. Note that the data in that paper is a decade old, and more recent data should really be used. (An independently interesting question is how that exponent is changing over time, and why it changes.)

3. I assume that the number of inbound links to each webpage is an independent random variable.
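
For readers who want to reproduce the histograms that follow, here’s a rough sketch of how they might be generated. It assumes the random_web and pagerank functions from the earlier post have been saved in a module (called pagerank here, purely as an assumption), and uses matplotlib for plotting.

import matplotlib.pyplot as plt

from pagerank import random_web, pagerank  # the earlier post's code

n = 200
ranks = []
for run in range(10):        # aggregate over several random webs
  g = random_web(n, 2.0)     # inbound links follow a Pareto law, a = 2
  p = pagerank(g, 0.85, 1e-7)
  ranks.extend(float(x) for x in p.flat)

plt.hist(ranks, bins=50)
plt.xlabel("PageRank")
plt.ylabel("Number of pages")
plt.show()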

With these assumptions, the histogram of PageRanks for a typical random web looks like this:

Aggregating over multiple runs gives:

There are a few notable things about these histograms.

First, most pages have PageRank near the minimal possible value of [tex]0.15 * 1/200 = 0.00075[/tex] (see here for an explanation of why that’s the minimal possible value).

Second, the distribution of PageRanks drops off very fast. Because there are 200 webpages, and the PageRanks must sum to one (being probabilities), the “average” PageRank must be [tex]0.005[/tex]. You can see from the histograms that the distribution has already dropped off quite a bit by the time you get to this PageRank: most pages have a PageRank quite a bit below the average. It’s a few very high PageRank pages that restore the average.

Third, the page with the highest PageRank had a PageRank approximately 20 times higher than the average PageRank.

I haven’t done the analysis, but it looks pretty likely that the distribution of PageRanks for this model is itself approximated by a power law distribution. Curious.

This post is part of an ongoing series about the Google Technology Stack, covering technologies such as PageRank, MapReduce, the Google File System, and Bigtable. Posts appear on this blog once a week; there is an associated FriendFeed room for discussion.
