Biweekly links for 01/09/2009

Click here for all of my del.icio.us bookmarks.


Three myths about scientific peer review

What’s the future of scientific peer review? The way science is communicated is currently changing rapidly, leading to speculation that the peer review system itself might change. For example, the wildly successful physics preprint arXiv is only very lightly moderated, which has led many people to wonder if the peer review process might perhaps die out, or otherwise change beyond recognition.

I’m currently finishing up a post on the future of peer review, which I’ll post in the near future. Before I get to that, though, I want to debunk three widely-believed myths about peer review, myths which can derail sensible discussion of the future of peer review.

A brief terminological note before I get to the myths: the term “peer review” can mean many different things in science. In this post, I restrict my focus to the anonymous peer review system scientific journals use to decide whether to accept or reject scientific papers.

Myth number 1: Scientists have always used peer review

The myth that scientists adopted peer review broadly and early in the history of science is surprisingly widely believed, despite being false. It’s true that peer review has been used for a long time – a process recognizably similar to the modern system was in use as early as 1731, in the Royal Society of Edinburgh’s Medical Essays and Observations (ref). But in most scientific journals, peer review wasn’t routine until the middle of the twentieth century, a fact documented in historical papers by Burnham, Kronick, and Spier.

Let me give a few examples to illustrate the point.

As a first example, we’ll start with the career of Albert Einstein, who wasn’t just an outstanding scientist, but was also a prolific scientist, publishing more than 300 journal articles between 1901 and 1955. Many of Einstein’s most ground-breaking papers appeared in his “miracle year” of 1905, when he introduced new ways of understanding space, time, energy, momentum, light, and the structure of matter. Not bad for someone unable to secure an academic position, and working as a patent clerk in the Swiss patent office.

How many of Einstein’s 300 plus papers were peer reviewed? According to the physicist and historian of science Daniel Kennefick, it may well be that only a single paper of Einstein’s was ever subject to peer review. That was a paper about gravitational waves, jointly authored with Nathan Rosen, and submitted to the journal Physical Review in 1936. The Physical Review had at that time recently introduced a peer review system. It wasn’t always used, but when the editor wanted a second opinion on a submission, he would send it out for review. The Einstein-Rosen paper was sent out for review, and came back with a (correct, as it turned out) negative report. Einstein’s indignant reply to the editor is amusing to modern scientific sensibilities, and suggests someone quite unfamiliar with peer review:

Dear Sir,

We (Mr. Rosen and I) had sent you our manuscript for publication and had not authorized you to show it to specialists before it is printed. I see no reason to address the in any case erroneous comments of your anonymous expert. On the basis of this incident I prefer to publish the paper elsewhere.

Respectfully,

P.S. Mr. Rosen, who has left for the Soviet Union, has authorized me to represent him in this matter.

As a second example, consider the use of peer review at the journal Nature. The prestige associated with publishing in Nature is, of course, considerable, and so competition to get published there is tough. According to Nature’s website, only 8 percent of submissions are accepted, and the rest are rejected. Given this, you might suppose that Nature has routinely used peer review for a long time as a way of filtering submissions. In fact, a formal peer review system wasn’t introduced by Nature until 1967. Prior to that, some papers were refereed, but some weren’t, and many of Nature’s most famous papers were not refereed. Instead, it was up to editorial judgement to determine which papers should be published, and which papers should be rejected.

This was a common practice in the days before peer review became widespread: decisions about what to publish and what to reject were usually made by journal editors, often acting largely on their own. These decisions were often made rapidly, with papers appearing days or weeks after submission, after a cursory review by the editor. Rejection rates at most journals were low, with only obviously inappropriate or unsound material being rejected; indeed, for some Society journals, Society members even asserted a “right” to publication, which occasionally caused friction with unhappy editors (ref).

What caused the change to the modern system of near-ubiquitous peer review? There were three main factors. The first was the increasing specialization of science (ref). As science became more specialized in the early 20th century, editors gradually found it harder to make informed decisions about what was worth publishing, even by the relatively relaxed standards common in many journals at the time.

The second factor in the move to peer review was the enormous increase in the number of scientific papers being published (ref). In the 1800s and early 1900s, journals often had too few submissions. Journal editors would actively round up submissions to make sure their journals remained active. The role of many editorial boards was to make sure enough papers were being submitted; if the journal came up short, members of the editorial board would be asked to submit papers themselves. As late as 1938, the editor-in-chief of the prestigious journal Science relied on personal solicitations for most articles (ref).

The twentieth century saw a massive increase in the number of scientists, a much easier process for writing papers, due to technologies such as typewriters, photocopiers, and computers, and a gradually increasing emphasis on publication in decisions about jobs, tenure, grants and prizes. These factors greatly increased the number of papers being written, and added pressure for filtering mechanisms, such as peer review.

The third factor in the move to peer review (ref) was the introduction of technologies for copying papers. It’s just plain editorially difficult to implement peer review if you can’t easily make copies of papers. The first step along this road was the introduction of typewriters and carbon paper in the 1890s, followed by the commercial introduction of photocopiers in 1959. Both technologies made peer review much easier to implement.

Nowadays, of course, the single biggest factor preserving the peer review system is probably social inertia: in most fields of science, a journal that’s not peer-reviewed isn’t regarded as serious, and so new journals invariably promote the fact that they are peer reviewed. But it wasn’t always that way.

Myth number 2: Peer review is reliable

Update: Bill Hooker has pointed out that I’m using a very strong sense of “reliable” in this section, holding peer review to the standard that it nearly always picks up errors, is a very accurate gauge of quality, and rarely suppresses innovation. If you adopt a more relaxed notion of reliability, as many but not all scientists and members of the general public do, then I’d certainly back off describing this as a myth. As an approximate filter that eliminates or improves many papers, peer review may indeed function well.

Every scientist has a story (or ten) about how they were poorly treated by peer review – the important paper that was unfairly rejected, or the silly editor who ignored their sage advice as a referee. Despite this, many strongly presume that the system works “pretty well”, overall.

There’s not much systematic evidence for that presumption. In 2002 Jefferson et al (ref) surveyed published studies of biomedical peer review. After an extensive search, they found just 19 studies which made some attempt to eliminate obvious confounding factors. Of those, just two addressed the impact of peer review on quality, and just one addressed the impact of peer review on validity; most of the rest of the studies were concerned with questions like the effect of double-blind reviewing. Furthermore, for the three studies that addressed quality and validity, Jefferson et al concluded that there were other problems with the studies which meant the results were of limited general interest; as they put it, “Editorial peer review, although widely used, is largely untested and its effects are uncertain”.

In short, at least in biomedicine, there’s not much we know for sure about the reliability of peer review. My searches of the literature suggest that we don’t know much more in other areas of science. If anything, biomedicine seems to be unusually well served, in large part because several biomedical journals (perhaps most notably the Journal of the American Medical Association) have over the last 20 years put a lot of effort into building a community of people studying the effects of peer review; Jefferson et al’s study is one of the outcomes from that effort.

In the absence of compelling systematic studies, is there anything we can say about the reliability of peer review?

The question of reliability should, in my opinion, really be broken up into three questions. First, does peer review help verify the validity of scientific studies; second, does peer review help us filter scientific studies, making the higher quality ones easier to find, because they get into the “best” journals, i.e., the ones with the most stringent peer review; third, to what extent does peer review suppress innovation?

As regards validity and quality, you don’t have to look far to find striking examples suggesting that peer review is at best partially reliable as a check of validity and a filter of quality.

Consider the story of the German physicist Jan Hendrik Schoen. In 2000 and 2001 Schoen made an amazing series of breakthroughs in organic superconductivity, publishing his 2001 work at a rate of one paper every 8 days, many in prestigious journals such as Nature, Science, and the Physical Review. Eventually, it all seemed a bit too good to be true, and other researchers in his community began to ask questions. His work was investigated, and much of it found to be fraudulent. Nature retracted seven papers by Schoen; Science retracted eight papers; and the Physical Review retracted six. What’s truly breathtaking about this case is the scale of it: it’s not that a few referees failed to pick up on the fraud, but rather that the refereeing system at several of the top journals systematically failed to detect the fraud. Furthermore, what ultimately brought Schoen down was not the anonymous peer review system used by journals, but rather investigation by his broader community of peers.

You might object to using this as an example on the grounds that the Schoen case involved deliberate scientific fraud, and the refereeing system isn’t intended to catch fraud so much as it is to catch mistakes. I think that’s a pretty weak objection – it can be a thin line between honest mistakes and deliberate fraud – but it’s not entirely without merit. As a second example, consider an experiment conducted by the editors of the British Medical Journal (ref). They inserted eight deliberate errors into a paper already accepted for publication, and sent the paper to 420 potential reviewers. 221 responded, catching on average only two of the errors. None of the reviewers caught more than five of the errors, and 16 percent caught no errors at all.

None of these examples is conclusive. But they do suggest that the refereeing system is far from perfect as a means of checking validity or filtering the quality of scientific papers.

What about the suppression of innovation? Every scientist knows of major discoveries that ran into trouble with peer review. David Horrobin has a remarkable paper (ref) where he documents some of the discoveries almost suppressed by peer review; as he points out, he can’t list the discoveries that were in fact suppressed by peer review, because we don’t know what those were. His list makes horrifying reading. Here’s just a few instances that I find striking, drawn in part from his list. Note that I’m restricting myself to suppression of papers by peer review; I believe peer review of grants and job applications probably has a much greater effect in suppressing innovation.

  • George Zweig’s paper announcing the discovery of quarks, one of the fundamental building blocks of matter, was rejected by Physical Review Letters. It was eventually issued as a CERN report.
  • Berson and Yalow’s work on radioimmunoassay, which led to a Nobel Prize, was rejected by both Science and the Journal of Clinical Investigation. It was eventually published in the Journal of Clinical Investigation.
  • Krebs’ work on the citric acid cycle, which led to a Nobel Prize, was rejected by Nature. It was published in Experientia.
  • Wiesner’s paper introducing quantum cryptography was initially rejected, finally appearing well over a decade after it was written.

To sum up: there is very little reliable evidence about the effect of peer review available from systematic studies; peer review is at best an imperfect filter for validity and quality; and peer review sometimes has a chilling effect, suppressing important scientific discoveries.

At this point I expect most readers will have concluded that I don’t much like the current peer review system. Actually, that’s not true, a point that will become evident in my post about the future of peer review. There’s a great deal that’s good about the current peer review system, and that’s worth preserving. However, I do believe that many people, both scientists and non-scientists, have a falsely exalted view of how well the current peer review system functions. What I’m trying to do in this post is to establish a more realistic view, and that means understanding some of the faults of the current system.

Myth number 3: Peer review is the way we determine what’s right and wrong in science

By now, it should be clear that the peer review system must play only a partial role in determining what scientists think of as established science. There’s no sign, for example, that the lack of peer review in the 19th and early 20th century meant that scientists then were more confused than now about what results should be regarded as well established, and what should not. Nor does it appear that the unreliability of the peer review process leaves us in any great danger of collectively coming to believe, over the long run, things that are false.

In practice, of course, nearly all scientists understand that peer review is only part of a much more complex process by which we evaluate and refine scientific knowledge, gradually coming to (provisionally) accept some findings as well established, and discarding the rest. So, in that sense, this third myth isn’t one that’s widely believed within the scientific community. But in many scientists’ shorthand accounts of how science progresses, peer review is given a falsely exaggerated role, and this is reflected in the understanding many people in the general public have of how science works. Many times I’ve had non-scientists mention to me that a paper has been “peer-reviewed!”, as though that somehow establishes that it is correct, or high quality. I’ve encountered this, for example, in some very good journalists, and it’s a concern, for peer review is only a small part of a much more complex and much more reliable system by which we determine what scientific discoveries are worth taking further, and what should be discarded.

Further reading

I’m writing a book about “The Future of Science”; this post is part of a series where I try out ideas from the book in an open forum. A summary of many of the themes in the book is available in this essay. If you’d like to be notified when the book is available, please send a blank email to the.future.of.science@gmail.com with the subject “subscribe book”. I’ll email you to let you know in advance of publication. I will not use your email address for any other purpose! You can subscribe to my blog here.

Biweekly links for 01/05/2009

Click here for all of my del.icio.us bookmarks.


Write your first MapReduce program in 20 minutes

The slow revolution

Some revolutions are marked by a single, spectacular event: the storming of the Bastille during the French Revolution, or the destruction of the towers of the World Trade Center on September 11, 2001, which so changed the US’s relationship with the rest of the world. But often the most important revolutions aren’t announced with the blare of trumpets. They occur softly, too slow for a single news cycle, but fast enough that if you aren’t alert, the revolution is over before you’re aware it’s happening.

Such a revolution is happening right now in computing. Microprocessor clock speeds have stagnated since about 2000. Major chipmakers such as Intel and AMD continue to wring out improvements in speed by improving on-chip caching and other clever techniques, but they are gradually hitting the point of diminishing returns. Instead, as transistors continue to shrink in size, the chipmakers are packing multiple processing units onto a single chip. Most computers shipped today use multi-core microprocessors, i.e., chips with 2 (or 4, or 8, or more) separate processing units on the main microprocessor.

The result is a revolution in software development. We’re gradually moving from the old world in which multiple processor computing was a special case, used only for boutique applications, to a world in which it is widespread. As this movement happens, software development, so long tailored to single-processor models, is seeing a major shift in some of its basic paradigms, to make the use of multiple processors natural and simple for programmers.

This movement to multiple processors began decades ago. Projects such as the Connection Machine demonstrated the potential of massively parallel computing in the 1980s. In the 1990s, scientists became large-scale users of parallel computing, using it to simulate things like nuclear explosions and the dynamics of the Universe. Those scientific applications were a bit like the early scientific computing of the late 1940s and 1950s: specialized, boutique applications, built with heroic effort using relatively primitive tools. As computing with multiple processors becomes widespread, though, we’re seeing a flowering of general-purpose software development tools tailored to multiple processor environments.

One of the organizations driving this shift is Google. Google is one of the largest users of multiple processor computing in the world, with its entire computing cluster containing hundreds of thousands of commodity machines, located in data centers around the world, and linked using commodity networking components. This approach to multiple processor computing is known as distributed computing; the characteristic feature of distributed computing is that the processors in the cluster don’t necessarily share any memory or disk space, and so information sharing must be mediated by the relatively slow network.

In this post, I’ll describe a framework for distributed computing called MapReduce. MapReduce was introduced in a paper written in 2004 by Jeffrey Dean and Sanjay Ghemawat from Google. What’s beautiful about MapReduce is that it makes parallelization almost entirely invisible to the programmer who is using MapReduce to develop applications. If, for example, you allocate a large number of machines in the cluster to a given MapReduce job, the job runs in a highly parallelized way. If, on the other hand, you allocate only a small number of machines, it will run in a much more serial way, and it’s even possible to run jobs on just a single machine.

What exactly is MapReduce? From the programmer’s point of view, it’s just a library that’s imported at the start of your program, like any other library. It provides a single library call that you can make, passing in a description of some input data and two ordinary serial functions (the “mapper” and “reducer”) that you, the programmer, specify in your favorite programming language. MapReduce then takes over, ensuring that the input data is distributed through the cluster, and computing those two functions across the entire cluster of machines, in a way we’ll make precise shortly. All the details – parallelization, distribution of data, tolerance of machine failures – are hidden away from the programmer, inside the library.

What we’re going to do in this post is learn how to use the MapReduce library. To do this, we don’t need a big sophisticated version of the MapReduce library. Instead, we can get away with a toy implementation (just a few lines of Python!) that runs on a single machine. By using this single-machine toy library we can learn how to develop for MapReduce. The programs we develop will run essentially unchanged when, in later posts, we improve the MapReduce library so that it can run on a cluster of machines.

Our first MapReduce program

Okay, so how do we use MapReduce? I’ll describe it with a simple example, which is a program to count the number of occurrences of different words in a set of files. The example is simple, but you’ll find it rather strange if you’re not already familiar with MapReduce: the program we’ll describe is certainly not the way most programmers would solve the word-counting problem! What it is, however, is an excellent illustration of the basic ideas of MapReduce. Furthermore, what we’ll eventually see is that by using this approach we can easily scale our wordcount program up to run on millions or even billions of documents, spread out over a large cluster of computers, and that’s not something a conventional approach could do easily.

The input to a MapReduce job is just a set of (input_key,input_value) pairs, which we’ll implement as a Python dictionary. In the wordcount example, the input keys will be the filenames of the files we’re interested in counting words in, and the corresponding input values will be the contents of those files:

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

After this code is run the Python dictionary i will contain the input to our MapReduce job, namely, i has three keys containing the filenames, and three corresponding values containing the contents of those files. Note that I’ve used Windows’ filenaming conventions above; if you’re running a Mac or Linux you may need to tinker with the filenames. Also, to run the code you will of course need the text files text\a.txt, text\b.txt, and text\c.txt. You can create some simple example texts by cutting and pasting from the following:

text\a.txt:

The quick brown fox jumped over the lazy grey dogs.

text\b.txt:

That's one small step for a man, one giant leap for mankind.

text\c.txt:

Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

The MapReduce job will process this input dictionary in two phases: the map phase, which produces output which (after a little intermediate processing) is then processed by the reduce phase. In the map phase what happens is that for each (input_key,input_value) pair in the input dictionary i, a function mapper(input_key,input_value) is computed, whose output is a list of intermediate keys and values. This function mapper is supplied by the programmer – we’ll show how it works for wordcount below. The output of the map phase is just the list formed by concatenating the list of intermediate keys and values for all of the different input keys and values.

I said above that the function mapper is supplied by the programmer. In the wordcount example, what mapper does is take the input key and input value – a filename, and a string containing the contents of the file – and then move through the words in the file. For each word it encounters, it returns the intermediate key and value (word,1), indicating that it found one occurrence of word. So, for example, for the input key text\a.txt, a call to mapper("text\\a.txt",i["text\\a.txt"]) returns:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), 
 ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

Notice that everything has been lowercased, so we don’t count words with different cases as distinct. Furthermore, the same key gets repeated multiple times, because words like the appear more than once in the text. This, incidentally, is the reason we use a Python list for the output, and not a Python dictionary, for in a dictionary the same key can only be used once.

Here’s the Python code for the mapper function, together with a helper function used to remove punctuation:

def mapper(input_key,input_value):
  return [(word,1) for word in 
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)

mapper works by lowercasing the input file, removing the punctuation, splitting the resulting string around whitespace, and finally emitting the pair (word,1) for each resulting word. Note, incidentally, that I’m ignoring apostrophes, to keep the code simple, but you can easily extend the code to deal with apostrophes and other special cases.

With this specification of mapper, the output of the map phase for wordcount is simply the result of concatenating the lists returned by mapper for each of the three input files:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), 
 ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), 
 ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), 
 ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), 
 ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), 
 ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), 
 ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), 
 ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), 
 ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), 
 ('mankind', 1)]

The map phase of MapReduce is logically trivial, but when the input dictionary has, say 10 billion keys, and those keys point to files held on thousands of different machines, implementing the map phase is actually quite non-trivial. What the MapReduce library handles is details like knowing which files are stored on what machines, making sure that machine failures don’t affect the computation, making efficient use of the network, and storing the output in a useable form. We won’t worry about these issues for now, but we will come back to them in future posts.

What the MapReduce library now does in preparation for the reduce phase is to group together all the intermediate values which have the same key. In our example the result of doing this is the following intermediate dictionary:

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 
 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 
 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 
 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 
 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 
 'that': [1], 'little': [1], 'small': [1], 'step': [1], 
 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 
 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 
 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}

We see, for example, that the word ‘and’, which appears only once in the three files, has as its associated value a list containing just a single 1, [1]. By contrast, the word ‘one’, which appears twice, has [1,1] as its value.

The reduce phase now commences. A programmer-defined function reducer(intermediate_key,intermediate_value_list) is applied to each entry in the intermediate dictionary. For wordcount, reducer simply sums up the list of intermediate values, and returns both the intermediate_key and the sum as the output. This is done by the following code:

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

The output from the reduce phase, and from the total MapReduce computation, is thus:

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), 
 ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), 
 ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), 
 ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), 
 ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), 
 ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), 
 ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), 
 ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]

You can easily check that this is just a list of the words in the three files we started with, and the associated wordcounts, as desired.
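If you’d like to do that check mechanically rather than by eye, here’s a small snippet to do so. It isn’t part of the post’s toy library: it assumes the input dictionary i and the mapper function defined above, and uses output as a (hypothetical) name for the list just shown:

expected = {}
for filename in i:
  for word, count in mapper(filename,i[filename]):
    expected[word] = expected.get(word,0) + count

assert dict(output) == expected   # the reduce output agrees with a direct count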

We’ve looked at code defining the input dictionary i, and the mapper and reducer functions. Collecting it all up, and adding a call to the MapReduce library, here’s the complete wordcount.py program:

# wordcount.py

import string
import map_reduce

def mapper(input_key,input_value):
  return [(word,1) for word in 
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

print map_reduce.map_reduce(i,mapper,reducer)

The map_reduce module imported by this program implements MapReduce in pretty much the simplest possible way, using some useful functions from the itertools library:

# map_reduce.py
"'Defines a single function, map_reduce, which takes an input
dictionary i and applies the user-defined function mapper to each
(input_key,input_value) pair, producing a list of intermediate 
keys and intermediate values.  Repeated intermediate keys then 
have their values grouped into a list, and the user-defined 
function reducer is applied to the intermediate key and list of 
intermediate values.  The results are returned as a list."'

import itertools

def map_reduce(i,mapper,reducer):
  # Map phase: apply mapper to every (input_key,input_value) pair, and
  # concatenate the resulting lists of intermediate (key,value) pairs.
  intermediate = []
  for (key,value) in i.items():
    intermediate.extend(mapper(key,value))
  # Group the intermediate values by intermediate key.  Sorting puts pairs
  # with the same key next to one another, so groupby can collect them.
  groups = {}
  for key, group in itertools.groupby(sorted(intermediate),
                                      lambda x: x[0]):
    groups[key] = [y for x, y in group]
  # Reduce phase: apply reducer to each intermediate key and its value list.
  return [reducer(intermediate_key,groups[intermediate_key])
          for intermediate_key in groups]

(Credit to a nice blog post from Dave Spencer for the use of itertools.groupby to simplify the reduce phase.)

Obviously, on a single machine an implementation of the MapReduce library is pretty trivial! In later posts we’ll extend this library so that it can distribute the execution of the mapper and reducer functions across multiple machines on a network. The payoff is that, with enough improvement to the library, we can use our wordcount.py program essentially unchanged to count the words not just in 3 files, but in billions of files spread over thousands of computers in a cluster. What the MapReduce library does, then, is provide an approach to developing in a distributed environment where many simple tasks (like wordcount) remain simple for the programmer. Important (but boring) tasks like parallelization, getting the right data into the right places, dealing with the failure of computers and networking components, and even coping with racks of computers being taken offline for maintenance, are all taken care of under the hood of the library.

In the posts that follow, we’re thus going to do two things. First, we’re going to learn how to develop MapReduce applications. That means taking familiar tasks – things like computing PageRank – and figuring out how they can be done within the MapReduce framework. We’ll do that in the next post in this series. In later posts, we’ll also take a look at Hadoop, an open source platform that can be used to develop MapReduce applications.

Second, we’ll go under the hood of MapReduce, and look at how it works. We’ll scale up our toy implementation so that it can be used over small clusters of computers. This is not only fun in its own right, it will also make us better MapReduce programmers, in the same way as understanding the innards of an operating system (for example) can make you a better application programmer.

To finish off this post, though, we’ll do just two things. First, we’ll sum up what MapReduce does, stripping out the wordcount-specific material. It’s not any kind of a formal specification, just a brief informal summary, together with a few remarks. We’ll refine this summary a little in some future posts, but this is the basic MapReduce model. Second, we’ll give an overview of how MapReduce takes advantage of a distributed environment to parallelize jobs.

MapReduce in general

Summing up our earlier description of MapReduce, and with the details about wordcount removed, the input to a MapReduce job is a set of (input_key,input_value) pairs. Each pair is used as input to a function mapper(input_key,input_value) which produces as output a list of intermediate keys and intermediate values:

[(intermediate_key,intermediate_value),
 (intermediate_key',intermediate_value'),
 ...]

The output from all the different input pairs is then sorted, so that values associated with the same intermediate_key are grouped together in a list of intermediate values. The reducer(intermediate_key,intermediate_value_list) function is then applied to each intermediate key and list of intermediate values, to produce the output from the MapReduce job.

A natural question is whether the order of values in intermediate_value_list matters. I must admit I’m not sure of the answer to this question – if it’s discussed in the original paper, then I missed it. In most of the examples I’m familiar with, the order doesn’t matter, because the reducer works by applying a commutative, associative operation across all intermediate values in the list. As we’ll see in a minute, because the mapper computations are potentially done in parallel, on machines which may be of varying speed, it’d be hard to guarantee the ordering, and this suggests that the ordering doesn’t matter. It’d be nice to know for sure – if anyone reading this does know the answer, I’d appreciate hearing it, and will update the post!

One of the most striking things about MapReduce is how restrictive it is. A priori it’s by no means clear that MapReduce should be all that useful in practical applications. It turns out, though, that many interesting computations can be expressed either directly in MapReduce, or as a sequence of a few MapReduce computations. We’ve seen wordcount implemented in this post, and we’ll see how to compute PageRank using MapReduce in the next post. Many other important computations can also be implemented using MapReduce, including doing things like finding shortest paths in a graph, grepping a large document collection, or many data mining algorithms. For many such problems, though, the standard approach doesn’t obviously translate into MapReduce. Instead, you need to think through the problem again from scratch, and find a way of doing it using MapReduce.
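To make that claim a little more tangible, here’s one more job that fits the toy library above directly: building an inverted index, i.e., a listing of which files each word appears in. This sketch is mine, not from the original post; it reuses map_reduce.py and the same three text files as wordcount.py, and the function names are just illustrative:

# inverted_index.py

import map_reduce

def index_mapper(input_key,input_value):
  # emit (word, filename) for every word occurrence; for simplicity,
  # punctuation is not stripped here
  return [(word,input_key) for word in input_value.lower().split()]

def index_reducer(intermediate_key,intermediate_value_list):
  # collapse the filenames for each word into a sorted, de-duplicated list
  return (intermediate_key,sorted(set(intermediate_value_list)))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

print(map_reduce.map_reduce(i,index_mapper,index_reducer))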

Exercises

  • How would you implement grep in MapReduce?

Problems

  • Take a well-known algorithms book, e.g., Cormen-Leiserson-Rivest-Stein, or a list of well-known algorithms. Which algorithms lend themselves to being expressed in MapReduce? Which do not? This isn’t so much a problem as it is a suggestion for a research program.
  • The early years of serial computing saw the introduction of powerful general-purpose ideas like those that went into the Lisp and Smalltalk programming languages. Arguably, those ideas are still the most powerful in today’s modern programming languages. MapReduce isn’t a programming language as we conventionally think of them, of course, but it’s similar in that it introduces a way of thinking about and approaching programming problems. I don’t think MapReduce has quite the same power as the ideas that went into Lisp and Smalltalk; it’s more like the Fortran of distributed computing. Can we find similarly powerful ideas to Lisp or Smalltalk in distributed computing? What’s the hundred-year framework going to look like for distributed computing?

How MapReduce takes advantage of the distributed setting

You can probably already see how MapReduce takes advantage of a large cluster of computers, but let’s spell out some of the details. There are two key points. First, the mapper functions can be run in parallel, on different processors, because they don’t share any data. Provided the right data is in the local memory of the right processor – a task MapReduce manages for you – the different computations can be done in parallel. The more machines are in the cluster, the more mapper computations can be simultaneously executed. Second, the reducer functions can also be run in parallel, for the same reason, and, again, the more machines are available, the more computations can be done in parallel.
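As a rough single-machine illustration of the first point, here’s a sketch (mine, not the distributed implementation this series builds toward) that runs the map phase across a pool of local worker processes. Because each mapper call touches only its own (input_key,input_value) pair, the calls can be handed out to the pool freely; the mapper below is just a stand-in for whatever mapper your job uses:

# parallel_map_sketch.py

import itertools
import multiprocessing

def mapper(input_key,input_value):
  # stand-in mapper: emit (word, 1) for each whitespace-separated word
  return [(word,1) for word in input_value.lower().split()]

def call_mapper(pair):
  # helper so the pool can call the two-argument mapper on a (key,value) pair
  return mapper(pair[0],pair[1])

def parallel_map_phase(i,processes=2):
  pool = multiprocessing.Pool(processes)
  try:
    # one mapper call per input pair, farmed out to the worker processes
    per_input_lists = pool.map(call_mapper,list(i.items()))
  finally:
    pool.close()
    pool.join()
  # concatenate the per-input lists into a single list of intermediate pairs
  return list(itertools.chain.from_iterable(per_input_lists))

if __name__ == "__main__":
  i = {"a.txt": "the quick brown fox", "b.txt": "the lazy grey dogs"}
  print(parallel_map_phase(i))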

The difficulty in this story arises in the grouping step that takes place between the map phase and the reduce phase. For the reducer functions to work in parallel, we need to ensure that all the intermediate values corresponding to the same key get sent to the same machine. Obviously, this requires communication between machines, over the relatively slow network connections.

This looks tough to arrange, and you might think (as I did at first) that a lot of communication would be required to get the right data to the right machine. Fortunately, a simple and beautiful idea is used to make sure that the right data ends up in the right location, without there being too much communication overhead.

Imagine you’ve got 1000 machines that you’re going to use to run reducers on. As the mappers compute the intermediate keys and value lists, they compute hash(intermediate_key) mod 1000 for some hash function. This number is used to identify the machine in the cluster that the corresponding reducer will be run on, and the resulting intermediate key and value list is then sent to that machine. Because every machine running mappers uses the same hash function, this ensures that value lists corresponding to the same intermediate key all end up at the same machine. Furthermore, by using a hash we ensure that the intermediate keys end up pretty evenly spread over machines in the cluster. Now, I should warn you that in practice this hashing method isn’t literally quite what’s done (we’ll describe that in later lectures), but that’s the main idea.
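In code, the idea looks something like the following toy sketch (mine, and, as just noted, not literally what the production system does). The essential point is that the hash is deterministic and shared by every machine running mappers:

import hashlib

def reducer_machine(intermediate_key,num_reduce_machines=1000):
  # a deterministic hash (md5 here), so that every mapper machine computes
  # the same value for the same intermediate key
  digest = hashlib.md5(intermediate_key.encode("utf-8")).hexdigest()
  return int(digest,16) % num_reduce_machines

# every occurrence of the key 'the', no matter which mapper emits it,
# gets routed to the same reducer machine:
print(reducer_machine("the"))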

Needless to say, there’s a lot more to the story of how MapReduce works in practice, especially the way it handles data distribution and fault-tolerance. In future posts we’ll explore many more of the details. Nonetheless, hopefully this post gives you a basic understanding of how MapReduce works. In the next post we’ll apply MapReduce to the computation of PageRank.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. You can subscribe to my blog to follow future posts in the series.


Biweekly links for 01/02/2009

  • Why is the free market letting us down?
    • Interesting summary of some of the problems and promise of markets.
  • Eric Drexler: Greenhouse Gases and Advanced Nanotechnology
  • Building Nutch: Open Source Search
    • A paper describing Nutch, an open source search engine, from the creator, Doug Cutting.
  • Now that’s what I call LEGO
    • A guy has built a LEGO aircraft carrier that looks to be about 25 feet long.
  • Multi-core in the Source Engine
    • A fun article describing some of the challenges involved in moving the Source Game Engine (used in Half-Life 2 and other games) to a world in which multi-core processors are the norm.
  • Steve Koch Research blog
    • Steve Koch is building up an experimental biophysics lab at the University of New Mexico. He’s blogging here about some of the work going on in the lab, in an open-sciencey way. The first post links to his lab’s “research principles”, which includes a discussion of some of the concrete problems surrounding open science.
  • Backreaction: We are Einstein
    • Excellent thoughtful post. Short quote (there’s much more): “The larger the body of knowledge grows that we are working with, the more important it thus becomes for scientists – as for everybody else – to not only have information available, but also have the tools to search, filter, and structure this information, and to direct it to where it is useful. Given this possibility, we could save a lot of time and effort by more efficiently sorting through available sources of information, by faster finding people with the right knowledge to complement our work, by outsourcing specialized tasks to those who have the best skills. And while the first of these points is readily under way with ever more powerful search tools, the latter two are only in the beginning and will need to bring changes in the way science is presently done.”
  • Daniel Lemire: Grabbing attention or building a reputation?
    • “However, I do not blog or write research papers merely to grab attention. Instead, I seek to increase my reputation. While attention fluctuates depending on your current actions, reputation builds up over time based on your reliability, your honesty, and your transparency. To build a good reputation, you do not need to do anything extraordinary: you just need to be consistent over a long time.”
  • Cloth Physics
    • Wonderful little demo that lets you manipulate a piece of cloth(!) by dragging bits of it around.
  • The Quantum Pontiff : A Curmudgeon’s and Improv’s Guide to Outliers: Introduction
    • Dave Bacon punches some holes in the opening story from Malcolm Gladwell’s new book.

Click here for all of my del.icio.us bookmarks.


Brian Eno on conservatism and creativity

A great quote from Brian Eno about the conservative force that comes from previous success:

I’m afraid to say that admirers can be a tremendous force for conservatism, for consolidation. Of course it’s really wonderful to be acclaimed for things you’ve done – in fact it’s the only serious reward, because it makes you think “it worked! I’m not isolated!” or something like that, and it makes you feel gratefully connected to your own culture.

But on the other hand, there’s a tremendously strong pressure to repeat yourself, to do more of that thing we all liked so much. I can’t do that – I don’t have the enthusiasm to push through projects that seem familiar to me ( – this isn’t so much a question of artistic nobility or high ideals: I just get too bloody bored), but at the same time I do feel guilt for ‘deserting my audience’ by not doing the things they apparently wanted. I’d rather not feel this guilt, actually, so I avoid finding out about situations that could cause it.

The problem is that people nearly always prefer what I was doing a few years earlier – this has always been true. The other problem is that so, often, do I! Discovering things is clumsy and sporadic, and the results don’t at first compare well with the glossy and lauded works of the past. You have to keep reminding yourself that they went through that as well, otherwise they become frighteningly accomplished.

That’s another problem with being made to think about your own past – you forget its genesis and start to feel useless awe towards your earlier self: “How did I do it? Wherever did these ideas come from?”. Now, the workaday everyday now, always looks relatively less glamorous than the rose-tinted then (except for those magic hours when your finger is right on the pulse, and those times only happen when you’ve abandoned the lifeline of your own history).

Similar forces operate within science, although the pressure comes not so much from admirers as from peers. Institutions want their scientists to get grants; grant agencies want scientists with a “track record”, and the natural outcome is a lot of people doing stuff that’s only marginally different from what they’ve done before, with a concentration in fashionable areas.


The economics of scientific collaboration

What economics can tell us about scientific collaboration

In this and several future posts I’m going to discuss what economics can tell us about scientific collaboration.

This may sound like a strange topic. Why should economics tell us anything interesting about scientific collaboration? Most discussions of economics are couched in terms of money, interest rates, prices, and so on. While these are relevant to science in a shallow, who’s-paying-for-this-lab-space kind of way, it’s not obvious we can learn anything deep about scientific collaboration by thinking in economic terms.

At a deeper level, though, economics is about understanding how human beings behave when one or more resources are scarce. How are those resources allocated? Are there more efficient ways they might be allocated? What tradeoffs are incurred?

There is a fundamental scarce resource in science, one whose allocation largely determines how science progresses. That scarce resource is expert attention. Who pays attention to what problems? How long do they spend on those problems? What institutional structures determine the answers to those questions? In short, what determines the architecture of scientific attention?

We can learn interesting things by thinking about these questions using ideas from economics. In this post I pull apart the way scientific collaboration works, and put it back together again within the conceptual framework economists use to understand free trade, using concepts like comparative advantage, opportunity cost, and markets. I’m doing this because the way we structure scientific attention is currently changing rapidly (by historical standards), as networked tools like wikis, blogs, twitter, email, online databases and friendfeed change the architecture of scientific attention. To understand these changes is in part an economic problem, and the point of this post is to begin developing an economic perspective.

Comparative advantage, opportunity cost, and the benefits of free trade

Scientific collaboration can be viewed as a type of trade in expert attention. I can, for example, trade some of my skill as a theoretical physicist for someone else’s skills as a computational physicist, enabling us to jointly write a paper neither of us could have written alone.

To understand this collaboration-as-trade perspective, let’s review some ideas about trade in the context where trade is most often discussed, namely, free trade of goods. We’ll start with a beautiful simplified model of free trade, a model that goes back to a famous 1817 book “On the Principles of Political Economy and Taxation”, by the economist David Ricardo. Like many useful models, it leaves out a lot that’s relevant to the real world, but it does capture an essential element of the world, and we can learn a great deal by thinking about the model. In particular, the model demonstrates vividly why all parties involved in free trade can benefit, and is one of the main reasons most economists strongly support free trade.

(A small digression: there’s a family connection in this post, since David Ricardo was my Great-Great-Great-Great-Great-Uncle.)

Here’s the model. Imagine there are just two people in the world, Alice the farmer, and Bob the car technician. Alice is good at producing potatoes, but not much good at assembling cars, while Bob is good at assembling cars, but not so good at producing potatoes. Pretty obviously, both Alice and Bob can benefit if Alice concentrates on producing potatoes, Bob concentrates on assembling cars, and they then trade potatoes for cars. While this is intuitively clear, it’s worth making precise with more concrete details. Let’s suppose the effort required for Alice to assemble a car is equal to the effort she requires to produce 20 tonnes of potatoes. Put another way, each car she assembles has an opportunity cost of 20 tonnes of potatoes, since that’s how much assembling a car will cost her in lost production of potatoes. Similarly, suppose the effort for Bob to assemble a car is equal to the effort he requires to produce 2 tonnes of potatoes. That is, each car has an opportunity cost for Bob of just 2 tonnes of potatoes.

In this situation, Bob has a comparative advantage over Alice in the production of cars, because Bob’s opportunity cost for producing cars is less than Alice’s. Equivalently, Alice has a comparative advantage over Bob in the production of potatoes, for her opportunity cost to produce a tonne of potatoes is 1/20th of a car, which is less than Bob’s opportunity cost of half a car.

Suppose Alice and Bob each concentrate in the areas where they have a comparative advantage, i.e., Alice concentrates on producing potatoes, and Bob concentrates on building cars. They then trade potatoes for cars. Both Alice and Bob benefit if the rate at which they trade is greater than 2 and less than 20 tonnes of potatoes per car, because they both will end up with more cars and potatoes than either could have produced on their own. Furthermore, the greater the comparative advantage, the more both parties benefit. Put another way, the more people specialize, the more possible benefit there is in free trade.
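For concreteness, here’s a tiny numerical check of that claim (a sketch of mine, not part of the original post, using the figures from the example): a trading rate benefits both parties exactly when it lies strictly between the two opportunity costs.

ALICE_TONNES_PER_CAR = 20.0   # Alice's opportunity cost of one car
BOB_TONNES_PER_CAR = 2.0      # Bob's opportunity cost of one car

def both_benefit(rate):
  # Alice (buying cars) gains when a car costs her fewer potatoes on the
  # market than in foregone farming; Bob (selling cars) gains when a car
  # earns him more potatoes than he could have grown in the same time.
  return BOB_TONNES_PER_CAR < rate < ALICE_TONNES_PER_CAR

for rate in [1, 2, 5, 10, 19, 20, 30]:
  print("%4.1f tonnes/car: both benefit = %s" % (rate, both_benefit(rate)))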

It’s worth stressing that the key factor determining the benefits of trade is comparative advantage, not Alice and Bob’s absolute abilities. It might be, for example, that Bob is a layabout who’s lousy both at assembling cars and producing potatoes. Perhaps he’s only capable of assembling one car (or producing 2 tonnes of potatoes) for every ten days of labour. Alice, despite being a farmer, might actually be better than layabout-Bob at assembling cars, producing one car (or twenty tonnes of potatoes) for every 5 days of labour. Even though Alice has an absolute advantage in producing both cars and potatoes, she and Bob are both better off if they concentrate on the areas where they have a comparative advantage, and then trade. Although this example is contrived, it has many implications in the real world. For example, differences in education and infrastructure mean that people in different countries often have enormous differences in their absolute ability to produce goods. Despite this, people in both countries may still benefit from trade if they all concentrate on areas where they have a comparative advantage.

This is all pretty simple, but it’s not universally understood. Much anti-business rhetoric assumes a zero-sum world in which evil captains of industry exploit the defenseless poor, i.e., if one person benefits from a transaction, the other person must lose. Very often, that’s a bad assumption. Good businesspeople look for transactions where both parties benefit; wouldn’t you prefer doing business with enthusiastic trading partners, rather than people who feel coerced or exploited? Of course, sometimes unethical businesspeople do coerce their trading partners, and sometimes trade between two parties can damage a third – environmental issues like pollution often have this nature. But Ricardo’s model is a good starting point to understand how free trade can work to the benefit of all parties.

Markets as a mechanism for aggregating information about comparative advantage

One question not answered in Ricardo’s model is how the trading rate is set. At what rate between 2 and 20 tonnes of potatoes per car should Alice and Bob trade? There are many possible ways to set the rate. In our society, the standard way is to use money as a medium of exchange, with markets determining the price of the relevant goods.

Let’s suppose Alice and Bob participate in such a market, and that the market price is 10,000 dollars per car, and 1,000 dollars per tonne of potatoes. The market thus provides a mechanism by which Alice and Bob can effectively trade cars for potatoes at a rate of one car for ten tonnes of potatoes. This is within the range where it is beneficial for both of them to trade, and so both may enter the market.

What if, instead, the market price was 5,000 dollars for a car, and 5,000 dollars for a tonne of potatoes? Then the effective trading rate is one car for one tonne of potatoes. Bob will be worse off if he enters the market: he’s better off making his own cars and growing his own potatoes. The result is that Bob will withdraw from the car market, reducing the supply of cars. This will drive the market price of cars up a little, though probably not enough for Bob to re-enter the market. But if enough people withdraw, the price of cars will go up a lot, and it will then make sense for Bob to re-enter.

Money and markets thus serve several purposes. First, the market determines the price of different goods, and thus effectively sets exchange rates between different goods.

Second, the market price automatically aggregates information about comparative advantage, because the people who enter the market are those with a comparative advantage large enough that they can benefit from being in the market. People with a smaller comparative advantage have no reason to do so.

Third, while it’s possible to set up a barter market without the use of money, it’s obviously a great deal more efficient to use money as an intermediary, since for each type of good in the market, we need only keep track of a single price, rather than exchange rates with all the other types of good.

In fact, digressing briefly, it’s possible to prove that in an efficient barter market, an effective currency does emerge. By efficient, I mean that it’s not possible to increase your holdings by conducting a series of trades in immediate succession, e.g., by trading one ox for two cows, the two cows for one horse, and then the horse for two oxen. If this kind of trade is impossible, then it’s possible to just fix on one type of good – say, cows – as the effective unit of commerce, like the dollar, and peg all trades to that unit. From there it’s a small step to forgo the cows, introducing an abstract entity (i.e., money) to replace them. Furthermore, it’s reasonable to argue that you’d expect efficiency in this kind of market; if the market was inefficient in the way described above, then you’d expect one of the intermediaries in the transaction to realize it, and raise their prices, and so smooth away the inefficiency.
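Spelled out with the figures from that hypothetical round trip (a quick sketch, not in the original post):

holdings = 1.0                  # start with one ox
holdings = holdings * 2         # one ox trades for two cows
holdings = holdings * (1.0/2)   # two cows trade for one horse
holdings = holdings * 2         # one horse trades for two oxen
print(holdings)                 # 2.0 -- the round trip doubles your oxen, so the market is inefficient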

It’s remarkable how effective the market is at aggregating information about comparative advantage in this way. It lets us all take advantage of the combined efforts of millions of individuals, most doing tasks for which they have a considerable comparative advantage. Think about the number of people involved in producing a laptop computer. Tens or hundreds of thousands of people participated directly in designing and producing the components in that laptop; most of those people had considerable (in some cases, enormous) comparative advantage in the skills they contributed. When you buy a laptop, your few hundred dollars buys you the accumulated wisdom from a design history of billions of hours, stretching all the way back to the earliest computers. Beyond that, hundreds of millions of people contribute capital (e.g., via retirement funds) used to build infrastructure like chip foundries. Chances are that anywhere between a few dollars and a few hundred dollars from your retirement fund was invested in the chip foundry that produced the processor for the computer that I’m typing these words on. We’re both benefiting right now from differences in comparative advantage.

By providing a way of identifying and taking advantage of comparative advantage, markets encourage people to specialize, creating even greater disparities in comparative advantage, and thus producing more mutual benefit. The better the market operates, the stronger this feedback effect becomes. Although it’s currently fashionable to bash markets (and economists), in fact many technologies we take for granted – cars, airliners, computers, telecommunications – would be near impossible without the modern market infrastructure.

Comparative advantage and scientific collaboration

Let’s construct a simple model of scientific collaboration inspired by Ricardo’s model of free trade. The model is, of course, a great oversimplification of how collaboration works; the point isn’t to capture the reality of collaboration exactly, but rather to illuminate some elements.

We’ll imagine two people, Alice and Bob, a physicist and a chemist, respectively. Alice is working on a problem in physics, but as she works an unanticipated problem arises, let’s say in chemistry. Let’s suppose for the sake of argument that the problem requires 100 hours of straightforward physics to solve, and 10 hours of straightforward chemistry. (The real numbers in most significant scientific problems are probably larger, but these numbers make the discussion below a little easier to read.) Unfortunately, Alice isn’t much of a chemist, and it would take her 200 hours to do the chemistry part of the problem, mostly spent bumbling around learning the required material. Alternatively, if Bob got involved in the project, he could solve the chemistry problem in just 10 hours.

There are two scenarios here. In the first, Alice does all the work; it takes her 300 hours, and she gets all the credit for the resulting paper. In the second, Alice does 100 hours of work, Bob does 10 hours of work, and they split the credit. Let’s say Alice ends up as first author on a paper describing the work and Bob ends up as second author, and let’s further say that as a result Alice gets two thirds of the credit and Bob gets one third.

Per hour worked, Alice is much better off in the collaborative scenario, getting two thirds of the reward for only one third of the effort. Bob is probably also better off, although the reason is more subtle: if Bob entered the collaboration freely, then it was presumably because Bob felt this was the best use of his time. This is not always the case – if Bob works for Alice he may have to do the work (or find another job), even though he’d do better science if he concentrated on other projects. This is a case where the trade is not completely free, but rather there is coercion. We’ll assume, though, that no coercion is involved, and that both parties benefit.

Let’s fill the model out a little more. Imagine that Bob’s alternative to collaboration is to go off and write straight-up chemistry papers on his own, taking 110 hours to write each paper, and getting full credit for it. He is still better off working with Alice, for he gets one third of the credit for only 10 hours’ worth of work. Both Alice and Bob benefit, just as in Ricardo’s model.

Another similarity to Ricardo’s model is that it is comparative, not absolute, advantage which is critical. Let’s imagine Bob is actually a beginning chemistry student, and takes 100 hours to complete the work Alice needs done. He’s still better off working with Alice than working on his own, for on his own it would take 1100 hours to write a chemistry paper. Furthermore, Alice is still better off working with Bob than on her own, for the time she saves on doing chemistry is time she can put to work doing physics.
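
To make this arithmetic concrete, here’s a small sketch (using only the numbers from the discussion above; the function itself is purely illustrative) computing the reputational credit earned per hour of work in each case:

# Credit earned per hour of work, using the numbers from the
# text.  The function is purely illustrative.

def credit_per_hour(credit, hours):
  return credit / float(hours)

# Alice alone: 100 hours of physics plus 200 hours of
# unfamiliar chemistry, full credit.
print(credit_per_hour(1.0, 300))      # ~0.0033

# Alice collaborating: 100 hours, two thirds of the credit.
print(credit_per_hour(2.0/3, 100))    # ~0.0067

# Bob collaborating: 10 hours, one third of the credit.
print(credit_per_hour(1.0/3, 10))     # ~0.033

# Bob writing solo chemistry papers: 110 hours, full credit.
print(credit_per_hour(1.0, 110))      # ~0.0091

# Beginner Bob: 100 hours for one third of the credit still
# beats 1100 hours for full credit on a solo paper.
print(credit_per_hour(1.0/3, 100))    # ~0.0033
print(credit_per_hour(1.0, 1100))     # ~0.00091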

As an illustration of these ideas in a different context, consider the way many supervisors work with students and postdocs. The supervisors suggest problems, reading materials, likely angles of attack, and so on – all areas in which their experience gives them an enormous absolute advantage, and a sizable comparative advantage. The students do the detailed work in the lab. Many supervisors will have an absolute advantage in such lab work as well, but it is likely to be much smaller, and so the student likely has a comparative advantage in doing it. Any time the supervisor spends doing detailed lab work has an opportunity cost in lost time that could be spent suggesting problems, reading materials, and the like for another student.

An important difference between this model and Ricardo’s lies in the way we define the benefit to the parties involved. In Ricardo’s model, the benefit is entirely intrinsic: Alice and Bob both want cars and potatoes. In the scientific case, the parties have no intrinsic desire for “expert attention” as such. Rather, the benefit lies in the reputational credit derived from publications. This difference complicates the analysis of when collaboration is worthwhile. Instead of a simple trading rate, one must consider the way in which reputational credit is split, and it is the ratio of this split to the opportunity cost that determines when it makes sense to collaborate. If Alice got 95 percent of the credit, and Bob only 5 percent, it would obviously not be in Bob’s interest to collaborate. In a future post, I’ll address this more fully, along with many other aspects of this model.
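
To get a feel for the numbers, here is a rough back-of-the-envelope version of that criterion, using the figures above; the symbol [tex]c[/tex] for Bob’s share of the credit is introduced here purely for illustration. Collaborating, Bob earns [tex]c[/tex] credit for 10 hours of work; his opportunity cost is a solo paper earning full credit for 110 hours. Collaboration pays for Bob only when

[tex] \frac{c}{10} > \frac{1}{110}, \qquad \mbox{i.e.,} \qquad c > \frac{10}{110} \approx 0.09. [/tex]

A 5 percent share fails this test, which is why the 95-5 split would drive Bob out, while the one-third share above comfortably clears the threshold.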

For now, let me simply point out the relative lack of mechanisms science has for aggregating information about comparative advantage. Mostly, we do it by word of mouth and personal connection, the same way our ancestors traded goods, and so we don’t get the advantages that come from modern markets.

There are good reasons it’s difficult to set up efficient collaboration markets in expert attention. Creative problems are often highly specialized one-off problems, quite unlike the commodities traded in most markets. Until very recently, markets in such specialized goods were relatively uncommon and rather limited, even in the realm of physical goods. This has recently changed, with online markets such as eBay showing that it is possible to set up highly specialized markets, provided suitable search and reputational tools are in place.

To the extent such collaboration markets do exist in science, they operate very inefficiently compared with markets for trade in goods. There are considerable trust barriers that inhibit trading relationships from being set up. There is no medium of exchange (cf. the posts by Shirley Wu and Cameron Neylon on this topic). The end result is that mechanisms for identifying and aggregating comparative advantage are downright primitive compared with markets for physical goods.

Perhaps the best existing examples of collaboration markets occur in the open source programming community. No single model is used throughout that community, but for many open source projects the basic model is to set up one or more online fora (email discussion lists, wikis, bug-tracking software, etcetera) which are used to coordinate activity. The fora are used to advertise problems people are having, such as bugs they’d like fixed, or features they’d like added. People then volunteer to solve those problems, with self-selection ensuring that work is most often done by people with a considerable comparative advantage. The forum thus acts as a simple mechanism for aggregating information about comparative advantage. While this mechanism is primitive compared with modern markets, the success of open source is impressive, and the mechanisms for aggregating information about comparative advantage in expert attention will no doubt improve.

Let me conclude with a question that’s still puzzling me. As I mentioned before, markets have creative power: without them, it’s unlikely that sophisticated goods like laptops and aeroplanes could exist. I’d like to better understand whether more efficient collaboration markets can cause a similar shift in what scientific problems can be solved. Might scientific problems now regarded as out of reach become accessible with more effective ways of structuring scientific attention?

Biweekly links for 12/29/2008

Click here for all of my del.icio.us bookmarks.

Published

Using your laptop to compute PageRank for millions of webpages

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web as a whole, simply because the web contains so many webpages. While just a few lines of code can be used to implement PageRank on collections of a few thousand webpages, it’s trickier to compute PageRank for larger sets of pages.

The underlying problem is that the most direct way to compute the PageRank of [tex]n[/tex] webpages involves inverting an [tex]n \times n[/tex] matrix. Even when [tex]n[/tex] is just a few thousand, this means inverting a matrix containing millions or tens of millions of floating point numbers. This is possible on a typical personal computer, but it’s hard to go much further.

In this post, I describe how to compute PageRank for collections containing millions of webpages. My little laptop easily coped with two million pages, using about 650 megabytes of RAM and a few hours of computation – at a guess it would max out at about 4 to 5 million pages. More powerful machines could cope with upwards of 10 million pages. And that’s with a program written in Python, an interpreted language not noted for its efficiency! Of course, this number still falls a long way short of the size of the entire web, but it’s a lot better than the naive approach. In a future post I’ll describe a more scalable parallel implementation that can be used to compute PageRank for hundreds of millions of pages.

Let’s quickly run through the basic facts we need to know about PageRank. The idea is that we number webpages [tex]1,\ldots,n[/tex]. For webpage number [tex]j[/tex] there is an associated PageRank [tex]q_j[/tex] which measures the importance of page [tex]j[/tex]. The PageRank is a number between [tex]0[/tex] and [tex]1[/tex]; the bigger the PageRank, the more important the page. Furthermore, the vector [tex]q = (q_1,\ldots,q_n)[/tex] of all the PageRanks is a probability distribution, that is, the PageRanks sum to one.

How is the PageRank vector [tex]q[/tex] computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector [tex]q[/tex] can be defined by the equation (explanation below):

[tex] q \leftarrow M^j P. [/tex]

What this equation represents is a starting distribution [tex]P[/tex] for the crazy websurfer, and then [tex]j[/tex] steps of “surfing”, with each action of [tex]M[/tex] representing how the distribution changes in a single step. [tex]P[/tex] is an [tex]n[/tex]-dimensional vector, and [tex]M[/tex] is an [tex]n \times n[/tex] matrix whose entries reflect the link structure of the web in a way I’ll describe below. The PageRank [tex]q[/tex] is defined in the limit of large [tex]j[/tex] – in our examples, convergence occurs for [tex]j[/tex] in the range [tex]20[/tex] to [tex]30[/tex]. You might wonder how [tex]P[/tex] is chosen, but part of the magic of PageRank is that it doesn’t matter how [tex]P[/tex] is chosen, provided it’s a probability distribution. The intuition here is that the starting distribution for the websurfer doesn’t matter to their long-run behaviour. We’ll therefore choose the uniform probability distribution, [tex]P = (1/n,1/n,\ldots)[/tex].
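
As a quick sanity check on the claim that the starting distribution doesn’t matter, here’s a toy sketch, with a small made-up matrix whose columns sum to one (as the real [tex]M[/tex]’s do) standing in for [tex]M[/tex], showing two very different starting distributions being driven to the same limit by repeated multiplication:

import numpy

# A made-up 3 x 3 matrix standing in for M; each column sums
# to one, so it maps probability distributions to probability
# distributions.
M = numpy.matrix([[0.5, 0.3, 0.2],
                  [0.4, 0.4, 0.3],
                  [0.1, 0.3, 0.5]])

uniform = numpy.matrix([[1.0/3], [1.0/3], [1.0/3]])
lopsided = numpy.matrix([[1.0], [0.0], [0.0]])

for _ in range(30):
  uniform = M * uniform
  lopsided = M * lopsided

print(uniform.T)   # both print the same limiting distribution,
print(lopsided.T)  # up to floating point error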

How is the matrix [tex]M[/tex] defined? It can be broken up into three pieces

[tex] M = s A + s D + t E, [/tex]

which correspond, respectively, to: a contribution [tex]sA[/tex] representing the crazy websurfer randomly picking links to follow; a contribution [tex]sD[/tex] arising because the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), so something else needs to be done in that case; and finally a contribution [tex]tE[/tex] representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set [tex]s = 0.85[/tex] and [tex]t = 1-s = 0.15[/tex] as the respective probabilities for randomly selecting a link and teleporting. See here for discussion of the reasons for this choice.

The matrix [tex]A[/tex] describes the link structure of the web. In particular, suppose we define [tex]\#(j)[/tex] to be the number of links outbound from page [tex]j[/tex]. Then [tex]A_{kj}[/tex] is defined to be [tex]0[/tex] if page [tex]j[/tex] does not link to [tex]k[/tex], and [tex]1/\#(j)[/tex] if page [tex]j[/tex] does link to page [tex]k[/tex]. Stated another way, the entries of the [tex]j[/tex]th column of [tex]A[/tex] are zero, except at locations corresponding to outgoing links, where they are [tex]1/\#(j)[/tex]. The intuition is that [tex]A[/tex] describes the action of a websurfer at page [tex]j[/tex] randomly choosing an outgoing link.

The matrix [tex]D[/tex] is included to deal with the problem of dangling pages, i.e., pages with no outgoing links, where it is ambiguous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if [tex]j[/tex] is a dangling page, then the [tex]j[/tex]th column of [tex]D[/tex] should have all its entries [tex]1/n[/tex], otherwise all the entries should be zero. A compact way of writing this is

[tex] D = e d^T/ n, [/tex]

where [tex]d[/tex] is the vector of dangling pages, i.e., the [tex]j[/tex]th entry of [tex]d[/tex] is [tex]1[/tex] if page [tex]j[/tex] is dangling, and otherwise is zero. [tex]e[/tex] is the vector whose entries are all [tex]1[/tex]s.

The final piece of [tex]M[/tex] is the matrix [tex]E[/tex], describing the bored websurfer teleporting somewhere else at random. This matrix has entries [tex]1/n[/tex] everywhere, representing a uniform probability of going to another webpage.
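
To make the decomposition concrete, here’s a sketch building the dense [tex]M[/tex] explicitly for a made-up three-page web, with page 0 linking to pages 1 and 2, page 1 linking to page 2, and page 2 dangling. (Building the full matrix is, of course, only feasible for toy examples; the whole point of the algorithm below is to avoid it.)

import numpy

n = 3
s, t = 0.85, 0.15

# Column j of A holds 1/#(j) at the locations page j links to.
# Made-up web: page 0 -> {1, 2}, page 1 -> {2}, page 2 dangling.
A = numpy.matrix([[0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])

e = numpy.matrix([[1.0], [1.0], [1.0]])  # all-ones vector
d = numpy.matrix([[0.0], [0.0], [1.0]])  # marks dangling pages

D = e * d.T / n   # spreads a dangling page's weight uniformly
E = e * e.T / n   # teleportation: 1/n everywhere

M = s * A + s * D + t * E

print(M)
print(M.sum(axis=0))  # every column sums to 1, as it should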

Exercises

  • This exercise is for people who’ve read my earlier post introducing PageRank. In that post you’ll see that I’ve broken up [tex]M[/tex] in a slightly different way. You should verify that the breakup in the current post and the breakup in the old post result in the same matrix [tex]M[/tex].

Now, as we argued above, you quickly run into problems in computing the PageRank vector if you naively try to store all [tex]n \times n = n^2[/tex] elements in the matrix [tex]M[/tex] or [tex]M^j[/tex]. Fortunately, it’s possible to avoid doing this. The reason is that the matrices [tex]A, D[/tex] and [tex]E[/tex] all have a special structure. [tex]A[/tex] is a very sparse matrix, since most webpages only link to a tiny fraction of the web as a whole, and so most entries are zero. [tex]D[/tex] is not ordinarily sparse, but it does have a very simple structure, determined by the vector (not matrix!) of dangling pages, [tex]d[/tex], and this makes it easy to do vector multiplications involving [tex]D[/tex]. Finally, the matrix [tex]E[/tex] is even simpler, and it’s straightforward to compute the action of [tex]E[/tex] on a vector.

The idea used in our algorithm is therefore to compute [tex]M^jP[/tex] by repeatedly multiplying the vector [tex]P[/tex] by [tex]M[/tex], and with each multiplication using the special form [tex]M = sA+sD+tE[/tex] to avoid holding anything more than [tex]n[/tex]-dimensional vectors in memory. This is the critical step that lets us compute PageRank for a web containing millions of webpages.

The crux of the algorithm, therefore, is to determine [tex](Mv)_j[/tex] for each index [tex]j[/tex], where [tex]v[/tex] is an arbitrary vector. We do this by computing [tex](Av)_j, (Dv)_j[/tex] and [tex](Ev)_j[/tex] separately, and then summing the results, with the appropriate prefactors, [tex]s[/tex] or [tex]t[/tex].

To compute [tex](Av)_j[/tex], just observe that [tex](Av)_j = \sum_k A_{jk} v_k[/tex]. This can be computed by summing over all the pages [tex]k[/tex] which have links to page [tex]j[/tex], with [tex]A_{jk} = 1/\#(k)[/tex] in each case. We see that the total number of operations involved in computing all the entries in [tex]Av[/tex] is [tex]\Theta(l)[/tex], where [tex]l[/tex] is the total number of links in the web graph, a number that scales far more slowly than [tex]n^2[/tex]. Furthermore, the memory consumed is [tex]\Theta(n)[/tex].

I’m being imprecise in analysing the number of operations and the memory consumption, not worrying about small constant factors. If we were really trying to optimize our algorithm, we’d be careful to keep track of exactly what resources are consumed, rather than ignoring those details. However, as we’ll see below, the advantages of the approach we’re now considering over the approach based on matrix inversion are enormous, and so ignoring a few small factors is okay for the purposes of comparison.

To compute [tex]Dv[/tex], observe that [tex]Dv = e (d \cdot v) / n[/tex], and so it suffices to compute the inner product [tex]d \cdot v[/tex], which in the worst case requires [tex]\Theta(n)[/tex] operations, and [tex]\Theta(n)[/tex] memory, to store [tex]d[/tex].

Finally, computing [tex](Ev)_j[/tex] is easy, since it is just [tex](e \cdot v)/n[/tex], which requires virtually no additional computation or memory.

It follows that we can compute [tex]Mv[/tex] using a total number of operations [tex]\Theta(n+l)[/tex], and total memory [tex]\Theta(n)[/tex], and thus approximate the PageRank vector

[tex] q \approx M^j P [/tex]

with a total number of operations [tex]\Theta(j (n+l))[/tex], and with memory consumption [tex]\Theta(n)[/tex].

These are imposingly large numbers. Suppose we have [tex]n = 10^6[/tex] webpages, [tex]l = 10^7[/tex] links, and use [tex]j = 10^2[/tex] iterations. Then the total amount of memory involved is on the order of [tex]10^6[/tex] floating point numbers, and the total number of operations is on the order of [tex]10^9[/tex].

On the other hand, if we’d used the method that required us to store the full matrix [tex]M[/tex], we’d have needed to store on the order of [tex]10^{12}[/tex] floating point numbers. Furthermore, using Gaussian elimination the matrix inversion requires on the order of [tex]10^{18}[/tex] operations, give or take a few (but only a few!) orders of magnitude. More efficient methods for matrix inversion are known, but the numbers remain daunting.

The code

Let’s think about how we’d code this algorithm up in Python. If you’re not already familiar with Python, then a great way to learn is “Dive into Python”, an excellent free online introductory book about Python by Mark Pilgrim. You can read the first five chapters in just a few hours, and that will provide you with enough basic understanding of Python to follow most code, and to do a lot of coding yourself. Certainly, it’s more than enough for all the examples in this post series.

The most important part of our implementation of PageRank is to choose a representation of the web that is compact, and that lets us quickly retrieve the information we need to compute [tex]M^jP[/tex]. This is achieved by the following definition for a class web, whose instances represent possible link structures for the web:

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

Suppose g is an object of type web, created using, e.g., g = web(3) for a web of 3 pages; then g.size is 3. g.in_links is a dictionary whose keys are the numbers for the webpages, 0,1,2, with corresponding values being lists of the inbound webpages. These lists are initially set to contain no webpages:

g.in_links = {0 : [], 1 : [], 2 : []}

Note that in Python lists are indexed [tex]0,1,2,\ldots[/tex], and this makes it convenient to number pages [tex]0,\ldots,n-1[/tex], rather than [tex]1,\ldots,n[/tex], as we have been doing up to now.

The advantages of storing the in-links as a dictionary of lists are two-fold. First, by using this representation, we store no more information than we absolutely need. E.g., we are not storing information about the absence of links, as we would be if we stored the entire adjacency matrix. Second, Python dictionaries are a convenient way to store the information, because they don’t require any ordering information, and so are in many ways faster to manipulate than lists.

You might be wondering why there is no attribute g.out_links in our class definition. The glib answer is that we don’t need it – after all, it can, in principle, be reconstructed from g.in_links. A better answer is that in real applications it would be quite sensible to also have g.out_links, but storing that extra dictionary would consume a lot of extra memory, and we avoid it since we can compute PageRank without it. An even better answer would be to make use of a data structure that gives us the speed and convenience provided by both g.in_links and g.out_links, but only requires links to be stored once. This can certainly be done, but requires more work, and so I’ve avoided it.

Let’s finish our description of the attributes of g. g.number_out_links is a dictionary such that g.number_out_links[j] is the number of out-links from page j. Initially these values are all set to zero, but they will be incremented when links are added. We need to store the number of outgoing links in order to compute the probabilities [tex]1/\#(k)[/tex] used in the PageRank algorithm, as described earlier. While this could in principle be computed from the dictionary g.in_links, in practice that would be extremely slow.

Finally, g.dangling_pages is a dictionary whose key set is the set of dangling webpages. The values of the keys are placeholders which are never used, and so we use True as a placeholder value. Initially, all webpages are dangling, but we remove keys when links are added.
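
The class deliberately has no method for adding links; in the program below, the random_web function manipulates the three dictionaries directly. If you wanted to build a small web by hand, a helper along the following lines would keep the bookkeeping consistent (add_link is a hypothetical addition of mine, not part of the program below):

def add_link(g, source, target):
  '''Hypothetical helper: record a link from page source to
  page target in the web object g, assuming that particular
  link hasn't been added before.'''
  g.in_links[target].append(source)
  if g.number_out_links[source] == 0:
    g.dangling_pages.pop(source)  # source is no longer dangling
  g.number_out_links[source] += 1

# Example: a 3-page web with links 0 -> 1, 0 -> 2, 1 -> 2.
# Page 2 remains dangling.
g = web(3)
add_link(g, 0, 1)
add_link(g, 0, 2)
add_link(g, 1, 2)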

With these data structures in place, the rest of the program is a pretty straightforward implementation of the algorithm described earlier. Here’s the code:

# Generates a random web link structure, and finds the
# corresponding PageRank vector.  The number of inbound
# links for each page is controlled by a power law
# distribution.
#
# This code should work for up to a few million pages on a
# modest machine.
#
# Written and tested using Python 2.5.

import numpy
import random

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

def paretosample(n,power=2.0):
  '''Returns a sample from a truncated Pareto distribution
  with probability mass function p(l) proportional to
  1/l^power.  The distribution is truncated at l = n.'''
  m = n+1
  while m > n: m = numpy.random.zipf(power)
  return m

def random_web(n=1000,power=2.0):
  '''Returns a web object with n pages, and where each
  page k is linked to by L_k random other pages.  The L_k
  are independent and identically distributed random
  variables with a shifted and truncated Pareto
  probability mass function p(l) proportional to
  1/(l+1)^power.'''
  g = web(n)
  for k in xrange(n):
    lk = paretosample(n+1,power)-1
    values = random.sample(xrange(n),lk)
    g.in_links[k] = values
    for j in values: 
      if g.number_out_links[j] == 0: g.dangling_pages.pop(j)
      g.number_out_links[j] += 1
  return g

def step(g,p,s=0.85):
  '''Performs a single step in the PageRank computation,
  with web g and parameter s.  Applies the corresponding M
  matrix to the vector p, and returns the resulting
  vector.'''
  n = g.size
  v = numpy.matrix(numpy.zeros((n,1)))
  inner_product = sum([p[j] for j in g.dangling_pages.keys()])
  for j in xrange(n):
    v[j] = s*sum([p[k]/g.number_out_links[k] 
    for k in g.in_links[j]])+s*inner_product/n+(1-s)/n
  # We rescale the return vector, so it remains a
  # probability distribution even with floating point
  # roundoff.
  return v/numpy.sum(v)  

def pagerank(g,s=0.85,tolerance=0.00001):
  '''Returns the PageRank vector for the web g and
  parameter s, where the criterion for convergence is that
  we stop when M^(j+1)P-M^jP has length less than
  tolerance, in l1 norm.'''
  n = g.size
  p = numpy.matrix(numpy.ones((n,1)))/n
  iteration = 1
  change = 2
  while change > tolerance:
    print "Iteration: %s" % iteration
    new_p = step(g,p,s)
    change = numpy.sum(numpy.abs(p-new_p))
    print "Change in l1 norm: %s" % change
    p = new_p
    iteration += 1
  return p

print '''Now computing the PageRank vector for a random
web containing 10000 pages, with the number of inbound
links to each page controlled by a Pareto power law
distribution with parameter 2.0, and with the criterion
for convergence being a change of less than 0.0001 in l1
norm over a matrix multiplication step.'''

g = random_web(10000,2.0) # works up to several million
                          # pages.
pr = pagerank(g,0.85,0.0001)

As written, the code computes PageRank for a 10,000 page web, and I recommend you start with that number of pages, to check that the code runs. Once that’s done, you can change the parameter in the second last line to change the number of pages. If you do that, you may wish to change the tolerance 0.0001 in the final line, decreasing it to ensure a tolerance more appropriate to the number of pages, say [tex]10^{-7}[/tex].

What determines the tolerance that should be used? I have used the [tex]l_1[/tex] norm to quantify the rate of convergence in the code, but that is a rather arbitrary measure. In particular, an alternate measure might be to look at the magnitude of the differences between individual elements, computing [tex]\max_j |(Mv-v)_j|[/tex], and requiring that it be substantially smaller than the minimal possible value for PageRank, [tex]t/n[/tex]. This is less conservative, and arguably a more relevant measure; the code is easily changed.
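
For instance, a variant of pagerank using this elementwise criterion might look like the following sketch; epsilon is a new parameter I’ve introduced, controlling how much smaller than [tex]t/n[/tex] the largest single-entry change must be before we stop.

def pagerank_maxnorm(g, s=0.85, epsilon=0.01):
  '''Variant of pagerank using the elementwise convergence
  criterion discussed above: iterate until the largest change
  in any single entry of the PageRank vector is less than
  epsilon*(1-s)/n, i.e., much smaller than the minimum
  possible PageRank t/n.'''
  n = g.size
  p = numpy.matrix(numpy.ones((n,1)))/n
  change = 1.0
  while change >= epsilon*(1-s)/n:
    new_p = step(g,p,s)
    change = numpy.max(numpy.abs(p-new_p))
    p = new_p
  return p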

Exercises

  • The code above is complex enough that bugs are likely. As a test, use an alternate method to compute PageRank for small web sizes, and compare your results against the output from the code above. Do you believe the code is correct?
  • Modify the program described in this lecture so that it works with a non-uniform teleportation vector, as described in an earlier post.
  • Python is a great language, but it’s not designed with speed in mind. Write a program in C to compute the PageRank for a random web. Can you compute the PageRank for ten million pages? What’s the largest number you’re able to compute on a single machine?

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. You can subscribe to my blog here.

Published
Categorized as GTS