When can the long tail be leveraged?

In 2006, Chris Anderson, the editor-in-chief of Wired magazine, wrote a bestselling book about an idea he called the long tail. The long tail is nicely illustrated by the bookselling business. Until recently, the conventional wisdom in bookselling was to stock only bestsellers. But internet bookstores such as Amazon.com take a different approach, stocking everything in print. According to Anderson, about a quarter of Amazon’s sales come from the long tail of books outside the top 100,000 bestselling titles (see here for the original research). While books in the long tail don’t individually sell many copies, they greatly outnumber the bestsellers, and so what they lack in individual sales they make up in total sales volume.

The long tail attracted attention because it suggested a new business model, selling into the long tail. Companies like Amazon, Netflix, and Lulu have built businesses doing just that. It also attracted attention because it suggested that online collaborations like Wikipedia and Linux might be benefitting greatly from the long tail of people who contribute just a little.

The problem if you’re building a business or online collaboration is that it can be difficult to tell whether participation is dominated by the long tail or not. Take a look at these two graphs:

[Two graphs: an idealized plot of Amazon’s book sales versus sales rank, and an idealized plot of the number of Wikipedia edits versus contributor rank.]

The first graph is an idealized graph of Amazon’s book sales versus the sales rank, [tex]r[/tex], of the book. The second graph is an idealized graph of the number of edits made by the [tex]r[/tex]th most prolific contributor to Wikipedia. Superficially, the two graphs look similar, and it’s tempting to conclude that both graphs have a long tail. In fact, the two have radically different behaviour. In this post I’ll describe a general-purpose test that shows that Amazon.com makes it (just!) into the long tail regime, but in Wikipedia contributions from the short head dominate. Furthermore, this difference isn’t just an accident, but is a result of design decisions governing how people find information and make contributions.

Let’s get into more detail about the specifics of the Amazon and Wikipedia cases, before turning to the big picture. The first graph above shows the function

[tex]a / r^{0.871},[/tex]

where [tex]a[/tex] is a constant of proportionality, and [tex]r[/tex] is the rank of the book. The exponent is chosen to be [tex]0.871[/tex] because, as of 2003, that made the function a pretty close approximation to the number of copies Amazon sold of the book with sales rank [tex]r[/tex]. For our analysis, it doesn’t much matter what the value of [tex]a[/tex] is, so we won’t worry about pinning it down. All the important stuff is contained in the [tex]r^{0.871}[/tex] in the denominator.

The second graph shows the function

[tex]a / r^{1.7}.[/tex]

As with the Amazon sales formula, the Wikipedia edit formula isn’t exact, but rather is an approximation. I extracted the formula from a blog post written by a researcher studying Wikipedia at the Xerox PARC Augmented Cognition Center. I mention this because they don’t actually determine the exponent 1.7 themselves – I backed it out from one of their graphs. Note that, as for the Amazon formula, [tex]a[/tex] is a constant of proportionality whose exact value doesn’t matter. There’s no reason the values of the Wikipedia [tex]a[/tex] and the Amazon [tex]a[/tex] should be the same; I’m using the same letter in both formulas simply to avoid a profusion of different letters.

(A little parenthetical warning: figuring out power law exponents is a surprisingly subtle problem. It’s possible that my estimate of the exponent in the last paragraph may be off. See, e.g., this paper for a discussion of some of the subtleties, and references to the literature. If someone with access to the raw data wants to do a proper analysis, I’d be interested to know the results. In any case, we’ll see that the correct value for the exponent would need to be wildly different from my estimate before it could make any difference to the qualitative conclusions we’ll reach.)

Now suppose the total number of different books Amazon stocks in their bookstore is [tex]N[/tex]. We’ll show a bit later that the total number of books sold is given approximately by:

[tex]7.75 \times a \times N^{0.129}.[/tex]

The important point in this formula is that as [tex]N[/tex] increases the total number of books sold grows fairly rapidly. Double [tex]N[/tex] and you get a nearly ten percent increase in total sales. There’s a big benefit to being in the business of the long tail of books.

Let’s move to the second graph, the number of Wikipedia edits. If the total number of editors is [tex]N[/tex], then we’ll show below that the total number of edits made is approximately

[tex]2.05 \times a - O\left( \frac{a}{N^{0.7}} \right).[/tex]

The important point here is that, in contrast to the Amazon example, as [tex]N[/tex] increases it makes little difference to the total number of edits made. In Wikipedia, the total number of edits is dominated by the short head of editors who contribute a great deal.
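
To make this concrete, here’s a quick numerical check (setting the constant [tex]a = 1[/tex] in both formulas, since it doesn’t affect the comparison) which computes the total [tex]\sum_{r=1}^N 1/r^b[/tex] for the two exponents at a range of values of [tex]N[/tex]:

# Partial sums of 1/r^b for the Amazon-like exponent (b = 0.871)
# and the Wikipedia-like exponent (b = 1.7).  The first keeps
# growing as N grows; the second quickly saturates near the
# constant zeta(1.7), which is roughly 2.05.

def total(b,N):
  return sum(1.0/r**b for r in xrange(1,N+1))

for N in [10**3,10**4,10**5,10**6]:
  print N, total(0.871,N), total(1.7,N)

The [tex]b = 0.871[/tex] total keeps climbing as [tex]N[/tex] grows, while the [tex]b = 1.7[/tex] total barely moves beyond about [tex]2.05[/tex].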

A general rule to decide whether the long tail or the short head dominates

Let’s generalize the above discussion. We’ll find a simple general rule that can be used to determine whether the long tail or the short head dominates. Suppose the pattern of participation is governed by a power law distribution, with the general form

[tex]\frac{a}{r^b},[/tex]

where [tex]a[/tex] and [tex]b[/tex] are both constants. Both the Amazon and Wikipedia data can be described in this way, and it turns out that many other phenomena are described similarly – if you want to dig into this, I recommend the review papers on power laws by Mark Newman and Michael Mitzenmacher.

Let’s also suppose the total number of “participants” is [tex]N[/tex], where I use the term participants loosely – it might mean the total number of books on sale, the total number of contributors to Wikipedia, or whatever is appropriate to the situation. Our interest will be in summing the contributions of all participants.

When [tex]b < 1[/tex], the sum over all values of [tex]r[/tex] is approximately

[tex]\frac{a N^{1-b}}{1-b}.[/tex]

Thus, this case is tail-dominated, with the sum continuing to grow reasonably rapidly as [tex]N[/tex] grows. As we saw earlier, this is the case for Amazon's book sales, so Amazon really is a case where the long tail is in operation.

When [tex]b = 1[/tex], the total over all values of [tex]r[/tex] is approximately

[tex]a \log N.[/tex]

This also grows as [tex]N[/tex] grows, but extremely slowly. It's really an edge case between tail-dominated and head-dominated. Finally, when [tex]b > 1[/tex], the total over all values of [tex]r[/tex] is approximately

[tex]a\zeta(b)-O\left(\frac{a}{N^{b-1}}\right),[/tex]

where [tex]\zeta(b)[/tex] is just a constant (actually, the Riemann zeta function, evaluated at [tex]b[/tex]), and the size of the corrections is of order [tex]a/N^{b-1}[/tex]. It follows that for large [tex]N[/tex] this approaches a constant value, and increasing the value of [tex]N[/tex] has little effect, i.e., this case is head-dominated. So, for example, it means that the great majority of edits to Wikipedia really are made by a small handful of dedicated contributors.
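
Where do these approximations come from? A quick sketch: the sums can be estimated by the corresponding integrals, so that for [tex]b \neq 1[/tex]

[tex] \sum_{r=1}^N \frac{a}{r^b} \approx \int_1^N \frac{a}{r^b} \, dr = \frac{a\left(N^{1-b}-1\right)}{1-b}. [/tex]

When [tex]b < 1[/tex] the [tex]N^{1-b}[/tex] term dominates, giving the tail-dominated behaviour described above. When [tex]b > 1[/tex] the integral approaches a constant, and the exact sum approaches [tex]a \zeta(b)[/tex]; the part of the sum beyond [tex]r = N[/tex] is roughly [tex]a N^{1-b}/(b-1)[/tex], which is where the [tex]O(a/N^{b-1})[/tex] correction term comes from. The [tex]b = 1[/tex] case gives the logarithm, since [tex]\int_1^N dr/r = \log N[/tex].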

There is a caveat to all this discussion, which is that in the real world power laws are usually just an approximation. For many real world cases, the power law breaks down at the end of the tail, and at the very head of the distribution. The practical implication is that the quantitative values predicted by the above formula may be somewhat off. In practice, though, I don’t think this caveat much matters. Provided the great bulk of the distribution is governed by a power law, this analysis gives insight into whether it’s dominated by the head or by the tail.

Implications

If you’re developing a long tail business or collaboration, you need to make sure the exponent [tex]b[/tex] in the power law is less than one. The smaller the exponent, the better off you’ll be.

How can you make the exponent as small as possible? In particular, how can you make sure it’s smaller than the magic value of one? To understand the answer to this question, we need to understand what actually determines the value of the exponent. There are some nice simple mathematical models explaining how power laws emerge, and in particular how the power law exponent emerges. At some point in the future I’d like to come back and discuss those in detail, and what implications they have for site architecture. This post is already long enough, though, so let me just make three simple comments.

First, focus on developing recommendation and search systems which spread attention out, rather than concentrating it in the short head of what’s already popular. This is difficult to do without sacrificing quality, but there’s some interesting academic work now being done on such recommendation systems – see, for example, some of the work described in this recent blog post by Daniel Lemire.

Second, in collaborative projects, ensure a low barrier to entry for newcomers. One problem Wikipedia faces is a small minority of established Wikipedians who are hostile to new editors. It’s not common, but it is there. This drives newcomers away, and so concentrates edits within the group of established editors, effectively increasing the exponent in the power law.

Third, the essence of these and other similar recommendations is that they are systematic efforts to spread attention and contribution out, not one-off efforts toward developing a long tail of sales or contributions. The problem with one-off efforts is that they do nothing to change the systematic architectural factors which actually determine the exponent in the power law, and it is that exponent which is the critical factor.

Further reading

I’m writing a book about “The Future of Science”; this post is part of a series where I try out ideas from the book in an open forum. A summary of many of the themes in the book is available in this essay. If you’d like to be notified when the book is available, please send a blank email to the.future.of.science@gmail.com with the subject “subscribe book”. I’ll email you to let you know in advance of publication. I will not use your email address for any other purpose! You can subscribe to my blog here.

The role of open licensing in open science

The open science movement encourages scientists to make scientific information freely available online, so other scientists may reuse and build upon that information. Open science has many striking similarities to the open culture movement, developed by people like Lawrence Lessig and Richard Stallman. Both movements share the idea that powerful creative forces are unleashed when creative artifacts are freely shared in a creative commons, enabling other people to build upon and extend those artifacts. The artifact in question might be a set of text documents, like Wikipedia; it might be open source software, like Linux; or open scientific data, like the data from the Sloan Digital Sky Survey, used by services such as Galaxy Zoo. In each case, open information sharing enables creative acts not conceived by the originators of the information content.

The advocates of open culture have developed a set of open content licenses, essentially a legal framework, based on copyright law, which strongly encourages and in some cases forces the open sharing of information. This open licensing strategy has been very successful in strengthening the creative commons, and so moving open culture forward.

When talking to some open science advocates, I hear a great deal of interest and enthusiasm for open licenses for science. This enthusiasm seems prompted in part by the success of open licenses in promoting open culture. I think this is great – with a few minor caveats, I’m a proponent of open licenses for science – but the focus on open licenses sometimes bothers me. It seems to me that while open licenses are important for open science, they are by no means as critical as they are to open culture; open access is just the beginning of open science, not the end. This post discusses to what extent open licenses can be expected to play a role in open scientific culture.

Open licenses and open culture

Let me review the ideas behind the licensing used in the open culture movement. If you’re familiar with the open culture movement, you’ll have heard this all before; if you haven’t, hopefully it’s a useful introduction. In any case, it’s worth getting all this fixed in our heads before addressing the connection to open science.

The obvious thing for advocates of open culture to do is to get to work building a healthy public domain: writing software, producing movies, writing books and so on, releasing all that material into the public domain, and encouraging others to build upon those works. They could then use a moral suasion argument to encourage others to contribute back to the public domain.

The problem is that many people and organizations don’t find this kind of moral suasion very compelling. Companies take products from the public domain, build upon them, and then, for perfectly understandable reasons, fiercely protect the intellectual property they produce. Disney was happy to make use of the old tale of Cinderella, but they take a distinctly dim view of people taking their Cinderella movie and remixing it.

People like Richard Stallman and Lawrence Lessig figured out how to add legal teeth to the moral suasion argument. Instead of relying on goodwill to get people to contribute back to the creative commons, they invented a new type of licensing that compels people to contribute back. There’s now a whole bunch of such open licenses – the various varieties of the GNU General Public License (GPL), Creative Commons licenses, and many others – with various technical differences between them. But there’s a basic idea of viral licensing that’s common to many (though not all) of the open licenses. This is the idea that anyone who extends a product released under such a license must release the extension under the same terms. Using such an open license is thus a lot like putting material into the public domain, in that both result in content being available in the creative commons, but the viral open licenses differ from the public domain in compelling people to contribute back into the creative commons.

The consequences of this compulsion are interesting. In the early days of open licensing, the creative commons grew slowly. As the amount of content with an open license grew, though, things began to change. This has been most obvious in software development, which was where viral open licenses first took hold. Over time it became more tempting for software developers to start development with an existing open source product. Why develop a new product from scratch, when you can start with an existing codebase? This means that you can’t use the most obvious business model – limit distribution to executable files, and charge for them – but many profitable open source companies have shown that alternate business models are possible. The result is that as time has gone on, even the most resolutely closed source companies (e.g., Microsoft) have found it difficult to avoid infection by open source. The upshot has been a gradually accelerating expansion of the creative commons, an expansion that has enabled extraordinary creativity.

Open licenses and open science

I’m not sure what role licensing will play in open science, but I do think there are some clear signs that it’s not going to be as central a role as it’s played in open culture.

The first reason for thinking this is that a massive experiment in open licensing has already been tried within science. By law, works produced by US Federal Government employees are, with some caveats, automatically put into the public domain. Every time I’ve signed a “Copyright Transfer” agreement with an academic journal, there’s always been in the fine print a clause exempting US Government employees from having to transfer copyright. You can’t give away what you don’t own.

This policy has greatly enriched the creative commons. And it’s led to enormous innovation – for example, I’ve seen quite a few mapping services that build upon US Government data, presumably simply because that data is in the public domain. But in the scientific realm I don’t get the impression that this is doing all that much to promote the growth of the same culture of mass collaboration as open licenses are enabling.

(A similar discussion can be had about open access journals. The discussion there is more complex, though, because (a) many of the journals have only been open access for a few years, and (b) the way work is licensed varies a lot from journal to journal. That’s why I’ve focused on the US Government.)

The second reason for questioning the centrality of open licenses is the observation that the main barriers to remixing and extension of scientific content aren’t legal barriers. They are, instead, cultural barriers. If someone copies my work, as a scientist, I don’t sue them. If I were to do that, it’s in any case doubtful that the courts would do more than slap the violator on the wrist – it’s not as though they’ll directly make money. Instead, there’s a strong cultural prohibition against such copying, expressed through widely-held community norms about plagiarism and acceptable forms of attribution. If someone copies my work, the right way to deal with it is to inform their colleagues, their superiors, and so on – in short, to deal with it by cultural rather than legal means.

That’s not to say there isn’t a legal issue here. But it’s a legal issue for publishers, not individual scientists. Many journal publishers have business models which are vulnerable to systematic large-scale attempts to duplicate their content. Someone could, for example, set up a “Pirate Bay” for scientific journal articles, making the world’s scientific articles freely available. That’s something those journals have to worry about, for legitimate short-term business reasons, and copyright law provides them with some form of protection and redress.

My own opinion is that over the long run, it’s likely that the publishers will move to open access business models, and that will be a good thing for open science. I might be wrong about that; I can imagine a world in which that doesn’t happen, yet certain varieties of open science still flourish. Regardless of what you think about the future of journals, the larger point is that the legal issues around openness are only a small part of a much larger set of issues, issues which are mostly cultural. The key to moving to a more open scientific system is changing scientists’ hearts and minds about the value and desirability of more openly sharing information, not reforming the legal rights under which they publish content.

So, what’s the right approach to licensing? John Wilbanks has argued, persuasively in my opinion, that data should be held in the public domain. I’ve sometimes wondered if this argument shouldn’t be extended beyond data, to all forms of scientific content, including papers, provided (and this is a big “provided”) the publisher’s business interests can be met in a way that adequately serves all parties. After all, if the scientific community is primarily a reputation economy, built around cultural norms, then why not simply remove the complication of copyright from the fray?

Now, I should say that this is speculation on my part, and my thinking is incomplete on this set of issues. I’m most interested to hear what others have to say! I’m especially interested in efforts to craft open research licenses, like the license Victoria Stodden has been developing. But I must admit that it’s not yet clear to me why, exactly, we need such licenses, or what interests they serve.

Biweekly links for 01/19/2009

  • Towards a Wiki For Formally Verified Mathematics
    • “the wiki will state all of known mathematics in a machine-readable language and verify all theorems for correctness, thus providing a knowledge base for interactive proof assistants.”
  • The Art Of Community | jonobacon@home
    • “The Art of Community” is a new book being written by Jono Bacon, the Ubuntu Community Manager. It’s not yet done, but will be published by O’Reilly, and will also be released under a CC license.
  • AcaWiki
    • Site intended to be a “Wikipedia for academic research”, summarizing academic papers.


Travel and talks

I’ll be in Santa Fe, New Mexico, on Wednesday and Thursday of this coming week. While there, I’m giving a talk about scientific collaboration on the internet at the Santa Fe Institute on Wednesday afternoon, and the after-dinner banquet speech (related topic, very different talk) at the Quantum Information Processing (QIP) 2009 conference. QIP was one of my favourite annual events when I worked in quantum computing, and I’m really looking forward to the chance to catch up with old friends.

On Friday through Sunday I’ll be at the third annual science blogging conference, otherwise known as Science Online 2009, in Raleigh-Durham, North Carolina. People who went to the first two all say they had a great time, so I’m really looking forward to this!

Finally, while advertising talks, I may as well add that I’ll be giving the University of Toronto Physics Colloquium Thursday, January 22, at 4pm, in the McLennan Building, room 102, at the main University of Toronto Campus.

Published
Categorized as Travel

Using MapReduce to compute PageRank

In this post I explain how to compute PageRank using the MapReduce approach to parallelization. This gives us a way of computing PageRank that can in principle be automatically parallelized, and so potentially scaled up to very large link graphs, i.e., to very large collections of webpages. In this post I describe a single-machine implementation which easily handles a million or so pages. In future posts we’ll use a cluster to scale out much further – it’ll be interesting to see how far we can get.

I’ve discussed PageRank and MapReduce at length in earlier posts – see here for MapReduce, and here and here for PageRank – so in this post we’ll just quickly review the basic facts. Let’s start with PageRank. The idea is that we number webpages [tex]0,\ldots,n-1[/tex]. For webpage number [tex]j[/tex] there is an associated PageRank [tex]q_j[/tex] which measures the importance of page [tex]j[/tex]: the bigger the PageRank, the more important the page. The vector [tex]q = (q_0,\ldots,q_{n-1})[/tex] of PageRanks is a probability distribution, i.e., the PageRanks are numbers between [tex]0[/tex] and [tex]1[/tex] which sum to one.

How is the PageRank vector [tex]q[/tex] computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector [tex]q[/tex] can be defined by the equation (explanation below):

[tex] q \leftarrow M^j P. [/tex]

What this equation represents is a starting distribution [tex]P[/tex] for the crazy websurfer, and then [tex]j[/tex] steps of “surfing”, with each action of [tex]M[/tex] representing how the distribution changes in a single step. [tex]P[/tex] is an [tex]n[/tex]-dimensional vector, and [tex]M[/tex] is an [tex]n \times n[/tex] matrix whose entries reflect the link structure of the web in a way I’ll make precise below. The PageRank [tex]q[/tex] is defined in the limit of large [tex]j[/tex] – in our examples, convergence typically occurs for [tex]j[/tex] in the range [tex]10[/tex] to [tex]30[/tex]. You might wonder how [tex]P[/tex] is chosen, but part of the magic of PageRank is that it doesn’t matter how [tex]P[/tex] is chosen, provided it’s a probability distribution. The intuition is that the starting distribution for the websurfer doesn’t matter to the websurfer’s long-run behaviour. We’ll start with the uniform probability distribution, [tex]P = (1/n,1/n,\ldots)[/tex], since it’s easy to generate.

How is the matrix [tex]M[/tex] defined? It can be broken up into three pieces

[tex] M = s A + s D + t E, [/tex]

including: a contribution [tex]sA[/tex] representing the crazy websurfer randomly picking links to follow; a contribution [tex]sD[/tex] due to the fact that the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), and so something else needs to be done in that case; and finally a contribution [tex]tE[/tex] representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set [tex]s = 0.85[/tex] and [tex]t = 1-s = 0.15[/tex] as the respective probabilities for randomly selecting a link and teleporting. See this post for a discussion of the reasons for this choice.

The matrix [tex]A[/tex] describes the crazy websurfer’s link-following behaviour, and so, in some sense, encodes the link structure of the web. In particular, suppose we define [tex]\#(j)[/tex] to be the number of links outbound from page [tex]j[/tex]. Then [tex]A_{kj}[/tex] is defined to be [tex]0[/tex] if page [tex]j[/tex] does not link to page [tex]k[/tex], and [tex]1/\#(j)[/tex] if page [tex]j[/tex] does link to page [tex]k[/tex]. Stated another way, the entries of the [tex]j[/tex]th column of [tex]A[/tex] are zero, except at locations corresponding to outgoing links, where they are [tex]1/\#(j)[/tex]. The intuition is that [tex]A[/tex] describes the action of a websurfer at page [tex]j[/tex] randomly choosing an outgoing link.

The matrix [tex]D[/tex] is included to deal with dangling pages, i.e., pages with no outgoing links. For such pages it is obviously ambiguous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if [tex]j[/tex] is a dangling page, then the [tex]j[/tex]th column of [tex]D[/tex] should have all its entries [tex]1/n[/tex], otherwise, if [tex]j[/tex] is not dangling, all the entries should be zero. A compact way of writing this is

[tex] D = e d^T/ n, [/tex]

where [tex]d[/tex] is the vector of dangling pages, i.e., the [tex]j[/tex]th entry of [tex]d[/tex] is [tex]1[/tex] if page [tex]j[/tex] is dangling, and otherwise is zero. [tex]e[/tex] is the vector whose entries are all [tex]1[/tex]s.

The final piece of [tex]M[/tex] is the matrix [tex]E[/tex], describing the bored websurfer teleporting somewhere else at random. This matrix has entries [tex]1/n[/tex] everywhere, representing a uniform probability of going to another webpage.
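
Before moving on, here’s a tiny sketch of these definitions in ordinary matrix arithmetic (not the MapReduce approach developed below; the three-page web is invented purely for illustration): page 0 links to pages 1 and 2, page 1 links to page 2, and page 2 is dangling.

# A small sketch of M = s A + s D + t E for a made-up three-page web,
# iterated on a uniform starting distribution.  Column j of A holds
# the probabilities for following links out of page j.

import numpy

n = 3
s = 0.85
t = 0.15

A = numpy.array([[0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0],
                 [0.5, 1.0, 0.0]])
d = numpy.array([0.0, 0.0, 1.0])      # page 2 is dangling
D = numpy.outer(numpy.ones(n), d)/n   # D = e d^T / n
E = numpy.ones((n,n))/n
M = s*A + s*D + t*E

q = numpy.ones(n)/n                   # uniform starting distribution P
for j in xrange(30):                  # 30 iterations is ample here
  q = numpy.dot(M,q)
print q                               # approximate PageRank vector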

Okay, that’s PageRank in a mathematical nutshell. What about MapReduce? Again, I’ll just remind you of the basic details – if you want an introduction, see this post. MapReduce is one of those ideas where understanding is really helped by first working through an example, rather than starting with an abstract description, like I’m about to give, so if you’re not familiar with MapReduce, I strongly suggest reading the earlier post.

The input to a MapReduce job is a set of (input_key,input_value) pairs. Each pair is used as input to a function mapper(input_key,input_value) which produces as output a list of intermediate keys and intermediate values:

[(intermediate_key,intermediate_value),
 (intermediate_key',intermediate_value'),
 ...]

The output from all the different input pairs is then sorted, so that intermediate values associated with the same intermediate_key are grouped together in a list of intermediate values. The reducer(intermediate_key,intermediate_value_list) function is then applied to each intermediate key and list of intermediate values, to produce the output from the MapReduce job.
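
To make the shape of that interface concrete, here’s a toy mapper and reducer (a made-up illustration, not the word count example from the earlier post): the input values are strings, the mapper emits a (character, 1) pair for each character in the string, and the reducer adds up the counts for each character.

# Toy example of the mapper / reducer interface described above.

def mapper(input_key,input_value):
  # emit one (character, 1) pair for each character in the value
  return [(c,1) for c in input_value]

def reducer(intermediate_key,intermediate_value_list):
  # total number of occurrences of the character across all inputs
  return (intermediate_key,sum(intermediate_value_list))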

Computing PageRank with MapReduce

Okay, so how can we compute PageRank using MapReduce? The approach we’ll take is to use MapReduce to repeatedly multiply a vector by the matrix [tex]M[/tex]. In particular, we’re going to show that if [tex]p[/tex] is a probability distribution, then we can easily compute [tex]Mp[/tex] using MapReduce. We can thus compute [tex]M^jP[/tex] using repeated invocations of MapReduce. Those invocations have to be done serially, but the individual MapReduce jobs are themselves all easily parallelized, and so we can potentially get a speedup by running those jobs on a big cluster. Much more about doing that in later posts.

The nub of the problem, then, is figuring out how to compute [tex]Mp[/tex], given a starting probability distribution [tex]p[/tex]. Let’s start out with a rough approach that gets the basic idea right, essentially using MapReduce to compute [tex]Ap[/tex]. We’ll see below that it’s easy to fix this up to take dangling nodes and teleportation into account. The fix involves introducing an additional MapReduce job, though, so each multiplication step [tex]p \rightarrow Mp[/tex] actually involves two MapReduce jobs, not just one. For now, though, let’s concentrate on roughing out a MapReduce job that computes [tex]Ap[/tex].

As input to the MapReduce computation, we’ll use (key,value) pairs where the key is just the number of the webpage, let’s call it j, and value contains several items of data describing the page, including [tex]p_j[/tex], the number [tex]\#(j)[/tex] of outbound links, and a list [k_1,k_2,...] of the pages that j links to.

For each of the pages k_l that j links to, the mapper outputs an intermediate key-value pair, with the intermediate key being k_l and the value just the contribution [tex]p_j/\#(j)[/tex] made to the PageRank. Intuitively, this corresponds to the crazy websurfer randomly moving to page k_l, with the probability [tex]p_j/\#(j)[/tex] combining both the probability [tex]p_j[/tex] that they start at page [tex]j[/tex], and the probability [tex]1/\#(j)[/tex] that they then move to page k_l.

Between the map and reduce phases, MapReduce collects up all intermediate values corresponding to any given intermediate key, k, i.e., the list of all the probabilities of moving to page k. The reducer simply sums up all those probabilities, outputting the result as the second entry in the pair (k,p_k'), and giving us the entries of [tex]Ap[/tex], as was desired.
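
In code, the rough job just described might look something like this (a sketch only, ignoring dangling pages and teleportation; the full implementation below handles both):

# Rough mapper and reducer for computing Ap.  The value for page j is
# assumed to be [p_j, number_of_outlinks, list_of_outlinks], matching
# the description above.

def rough_mapper(j,value):
  p_j, num_outlinks, outlinks = value
  return [(k,p_j/num_outlinks) for k in outlinks]

def rough_reducer(k,probabilities):
  return (k,sum(probabilities))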

To modify this so it computes [tex]Mp[/tex] we need to make three changes.

The first change is to make sure we deal properly with dangling pages, i.e., we include the term [tex]sD[/tex]. One possible way is to treat dangling pages as though they have outgoing links to every single other page, [0,1,2,...]. While this works, it would require us to maintain many very large lists of links, and would be extremely inefficient.

A better way to go is to use our earlier expression [tex]D = e d^T/n[/tex], and thus [tex]Dp = e (d\cdot p)/n[/tex], where [tex]d \cdot p[/tex] is the inner product between the vector [tex]d[/tex] of dangling pages, and [tex]p[/tex]. Computing [tex]Dp[/tex] then really boils down to computing [tex]d \cdot p[/tex].

We can compute [tex]d \cdot p[/tex] using a separate MapReduce job which we run first. This job computes the inner product, and then passes it as a parameter to the second MapReduce job, which is based on the earlier rough description, and which finishes off the computation. This first MapReduce job uses the same input as the earlier job – a set of keys j corresponding to pages, and values describing the pages, i.e., containing the value for p_j, and a description of the outbound links from page j. If page j is dangling the mapper outputs the intermediate pair (1,p_j), otherwise it outputs nothing. All the intermediate keys are the same, so the reducer acts on just one big list, summing up all the values p_j for dangling pages, giving us the inner product we wanted.

As an aside, while this prescription for computing the inner product using MapReduce is obviously correct, you might worry about the fact that all the intermediate keys have the same value. This means all the intermediate values will go to a single reducer, running on just one machine in the cluster. If there are a lot of dangling pages, that means a lot of communication and computation overhead associated with that single machine – it doesn’t seem like a very parallel solution. There’s actually a simple solution to this problem, which is to modify the MapReduce framework just a little, introducing a “combine” phase in between map and reduce, which essentially runs little “mini-reducers” directly on the output from all the mappers, offloading some of the reduce functionality onto the machines used as mappers. We won’t explore this idea in detail here, but we will implement it in future posts, and we’ll see that in practice having just a single key isn’t a bottleneck.

The second change we need to make in our rough MapReduce job is to include the teleportation step. This can be done easily by modifying the reducer to include a contribution from teleportation.

The third change we need to make in our rough MapReduce job is somewhat subtle; I actually didn’t realize I needed to make this change until after I ran the code, and realized I had a bug. Think about the set of intermediate keys produced by the mappers. The only way a given page can appear as an intermediate key is if it’s linked to by some other page. Pages with no links to them won’t appear in the list of intermediate keys, and so won’t appear in the output from the MapReduce job. The way we deal with this problem is by modifying the mapper so that it emits one extra key-value pair as output. Namely, if it takes as input (j,value), then it emits all the intermediate keys and values described earlier, and an additional pair (j,0), which represents a probability [tex]0[/tex] of moving to page j. This ensures that every page j will appear in the list of intermediate keys, but doesn’t have any impact on the probability of moving to page j; you can think of it simply as a placeholder output.

That completes the high-level theoretical description of computing PageRank using MapReduce. In the next section of the post I’ll describe a simple Python implementation of this MapReduce-based approach to PageRank. If you’re not interested in the implementation, you can skip to the final section, where I talk about how to think about programming with MapReduce – general heuristics you can use to put problems into a form where MapReduce can be used to attack them.

Implementation

The Python code to implement the above PageRank algorithm is straightforward. To run it on just a single machine we can use the exact same MapReduce module I described in my earlier post; for convenience, here’s the code:

# map_reduce.py
# Defines a single function, map_reduce, which takes an input
# dictionary i and applies the user-defined function mapper to each
# (input_key,input_value) pair, producing a list of intermediate 
# keys and intermediate values.  Repeated intermediate keys then 
# have their values grouped into a list, and the user-defined 
# function reducer is applied to the intermediate key and list of 
# intermediate values.  The results are returned as a list.

import itertools

def map_reduce(i,mapper,reducer):
  intermediate = []
  for (key,value) in i.items():
    intermediate.extend(mapper(key,value))
  groups = {}
  for key, group in itertools.groupby(sorted(intermediate), 
                                      lambda x: x[0]):
    groups[key] = list([y for x, y in group])
  return [reducer(intermediate_key,groups[intermediate_key])
          for intermediate_key in groups] 

With that code put in a file somewhere your Python interpreter can find it, here’s the code implementing PageRank:

# pagerank_mr.py
#
# Computes PageRank, using a simple MapReduce library.
#
# MapReduce is used in two separate ways: (1) to compute
# the inner product between the vector of dangling pages
# (i.e., pages with no outbound links) and the current
# estimated PageRank vector; and (2) to actually carry
# out the update of the estimated PageRank vector.
#
# For a web of one million webpages the program consumes
# about one gig of RAM, and takes an hour or so to run,
# on a (slow) laptop with 3 gig of RAM, running Vista and
# Python 2.5.

import map_reduce
import numpy.random
import random

def paretosample(n,power=2.0):
  # Returns a sample from a truncated Pareto distribution
  # with probability mass function p(l) proportional to
  # 1/l^power.  The distribution is truncated at l = n.

  m = n+1
  while m > n: m = numpy.random.zipf(power)
  return m

def initialize(n,power):
  # Returns a Python dictionary representing a web
  # with n pages, and where each page k is linked to by
  # L_k random other pages.  The L_k are independent and
  # identically distributed random variables with a
  # shifted and truncated Pareto probability mass function
  # p(l) proportional to 1/(l+1)^power.

  # The representation used is a Python dictionary with
  # keys 0 through n-1 representing the different pages.
  # i[j][0] is the estimated PageRank, initially set at 1/n,
  # i[j][1] the number of outlinks, and i[j][2] a list of
  # the outlinks.

  # This dictionary is used to supply (key,value) pairs to
  # both mapper tasks defined below.

  # initialize the dictionary
  i = {} 
  for j in xrange(n): i[j] = [1.0/n,0,[]]
  
  # For each page, generate inlinks according to the Pareto
  # distribution. Note that this is somewhat tedious, because
  # the Pareto distribution governs inlinks, NOT outlinks,
  # which is what our representation is adapted to represent.
  # A smarter representation would give easy
  # access to both, while remaining memory efficient.
  for k in xrange(n):
    lk = paretosample(n+1,power)-1
    values = random.sample(xrange(n),lk)
    for j in values:
      i[j][1] += 1 # increment the outlink count for page j
      i[j][2].append(k) # insert the link from j to k
  return i

def ip_mapper(input_key,input_value):
  # The mapper used to compute the inner product between
  # the vector of dangling pages and the current estimated
  # PageRank.  The input is a key describing a webpage, and
  # the corresponding data, including the estimated pagerank.
  # The mapper returns [(1,pagerank)] if the page is dangling,
  # and otherwise returns nothing.
  
  if input_value[1] == 0: return [(1,input_value[0])]
  else: return []

def ip_reducer(input_key,input_value_list):
  # The reducer used to compute the inner product.  Simply
  # sums the pageranks listed in the input value list, which
  # are all the pageranks for dangling pages.

  return sum(input_value_list)

def pr_mapper(input_key,input_value):
  # The mapper used to update the PageRank estimate.  Takes
  # as input a key for a webpage, and as a value the corresponding
  # data, as described in the function initialize.  It returns a
  # list with all outlinked pages as keys, and corresponding values
  # just the PageRank of the origin page, divided by the total
  # number of outlinks from the origin page.  Also appended to
  # that list is a pair with key the origin page, and value 0.
  # This is done to ensure that every single page ends up with at
  # least one corresponding (intermediate_key,intermediate_value)
  # pair output from a mapper.
  
  return [(input_key,0.0)]+[(outlink,input_value[0]/input_value[1])
          for outlink in input_value[2]]

def pr_reducer_inter(intermediate_key,intermediate_value_list,
                     s,ip,n):
  # This is a helper function used to define the reducer used
  # to update the PageRank estimate.  Note that the helper differs
  # from a standard reducer in having some additional inputs:
  # s (the PageRank parameter), ip (the value of the inner product
  # between the dangling pages vector and the estimated PageRank),
  # and n, the number of pages.  Other than that the code is
  # self-explanatory.
  
  return (intermediate_key,
          s*sum(intermediate_value_list)+s*ip/n+(1.0-s)/n)

def pagerank(i,s=0.85,tolerance=0.00001):
  # Returns the PageRank vector for the web described by i,
  # using parameter s.  The criterion for convergence is that
  # we stop when M^(j+1)P-M^jP has length less than tolerance,
  # in l1 norm.
  
  n = len(i)
  iteration = 1
  change = 2 # initial estimate of error
  while change > tolerance:
    print "Iteration: "+str(iteration)
    # Run the MapReduce job used to compute the inner product
    # between the vector of dangling pages and the estimated
    # PageRank.
    ip_list = map_reduce.map_reduce(i,ip_mapper,ip_reducer)

    # the if-else clause is needed in case there are no dangling
    # pages, in which case MapReduce returns ip_list as the empty
    # list.  Otherwise, set ip equal to the first (and only)
    # member of the list returned by MapReduce.
    if ip_list == []: ip = 0
    else: ip = ip_list[0]

    # Dynamically define the reducer used to update the PageRank
    # vector, using the current values for s, ip, and n.
    pr_reducer = lambda x,y: pr_reducer_inter(x,y,s,ip,n)

    # Run the MapReduce job used to update the PageRank vector.
    new_i = map_reduce.map_reduce(i,pr_mapper,pr_reducer)

    # Compute the new estimate of error.
    change = sum([abs(new_i[j][1]-i[j][0]) for j in xrange(n)])
    print "Change in l1 norm: "+str(change)

    # Update the estimate PageRank vector.
    for j in xrange(n): i[j][0] = new_i[j][1]
    iteration += 1
  return i

n = 1000 # works up to about 1000000 pages
i = initialize(n,2.0)
new_i = pagerank(i,0.85,0.0001)

Mostly, the code is self-explanatory. But there are three points that deserve some comment.

First, we represent the web using a Python dictionary i, with keys 0,...,n-1 representing the different pages. Each corresponding value is a list: the first element i[j][0] is the current probability estimate, which we called earlier p_j; the second element i[j][1] is the number of links outbound from page j; and the third element i[j][2] is another list, containing all the pages that page j links to.

This representation is, frankly, pretty ugly, and leaves you having to keep track of the meaning of the different indices. I considered instead defining a Python class, say page_description, and using an instance of that class as the value, with sensible attributes like page_description.number_outlinks. This would have made the program a bit longer, but also more readable, and would perhaps be a better choice on those grounds.

Part of the reason I don’t do this is that the way the data is stored in this example already has other problems, problems that wouldn’t be helped by using a Python class. Observe that the MapReduce job takes as input a dictionary with keys 0,...,n-1, and corresponding values describing those pages; the output has the same key set, but the values are just the new values for Mp_j, not the entire page description. That is, the input dictionary and the output dictionary have the same key set, but their values are of quite a different nature. This is a problem, because we want to apply our MapReduce job iteratively, and it’s the reason that at the end of the pagerank function we have to go through and laboriously update our current estimate for the PageRank vector. This is not a good thing – it’s ugly, and it means that part of the job is not automatically parallelizable.

One way of solving this problem would be to pass through the entire MapReduce job a lot of extra information about page description. Doing that has some overhead, though, both conceptually and computationally. What we’ll see in later posts is that by choosing the way we represent data a bit more carefully, we can have our cake and eat it too. I’ll leave that for later posts, because it’s a fairly minor point, and I don’t want to distract from the big picture, which is the focus of this post.

Second, you’ll notice that in the pagerank function, we dynamically define the pr_reducer function, using the pr_reducer_inter function. As you can see from the code, the only difference between the two is that pr_reducer effectively has some of pr_reducer_inter’s slots filled in, most notably, the value ip for the inner product, produced by the first MapReduce job. The reason we need to do this is because the map_reduce function we’ve defined expects the reducer function to take just two arguments, an intermediate key, and a list of intermediate values.

There are other ways we could achieve the same effect, of course. Most obviously, we could modify the map_reduce function so that extra parameters can be passed to the mapper and reducer. There shouldn’t be too many extra parameters, of course, because those parameters will need to be communicated to all computers in the cluster, but a small set would be perfectly acceptable. I went with the dynamic definition of pr_reducer simply because it seemed fun and elegant.
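
Just to illustrate one such alternative (a sketch, not what the code above actually does), Python’s standard functools module could fill in the extra slots without a lambda:

import functools

# Equivalent to the lambda used in pagerank(): fix s, ip and n,
# leaving a two-argument function for map_reduce to call.
pr_reducer = functools.partial(pr_reducer_inter,s=s,ip=ip,n=n)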

Exercises

  • The dynamic definition of pr_reducer is very convenient in our code. Can you think of any problems that might arise in using such dynamic definitions on a cluster? Can you think of any ways you might avoid those problems, retaining the ability to use dynamically defined mappers and reducers?

Third, and finally, the way we compute the error estimate is not obviously parallelizable. It’s easy to see how you could parallelize it using MapReduce, but, as above, the particular data representation we’re using makes this inconvenient. This will also be easily fixed when we move to our new data representation, in a later post.

A MapReduce programming heuristic

We’ve now seen two examples of using MapReduce to solve programming problems. The first, in an earlier post, showed how to use MapReduce to count word occurrences in a collection of files. The second is the example of this post, namely, to compute PageRank.

As a general rule, when you take a programming task, even one that’s very familiar, it may be challenging to figure out how to implement the algorithm using MapReduce. Not only do you need to find a way of fitting it into the MapReduce framework, you need to make sure the resulting algorithm is well adapted to take advantage of the framework. Think of how we dealt with dangling pages in the PageRank example – we could easily have modelled a dangling page as being connected to every other page, but the overhead in MapReduce would be enormous. We needed to take another approach to get the advantages of MapReduce.

With that said, it’s worth stepping back and distilling out a heuristic for attacking problems using MapReduce. The heuristic is already implicit in earlier discussion, but I’ve found it has helped my thinking to make the heuristic really explicit.

Think back to the wordcount example. There are some interesting patterns in that example, patterns that we’ll see are also repeated in other examples of MapReduce:

  1. There is a large set of questions we want to answer: for each word w in our set of documents, how many times does w appear? The intermediate keys are simply labels for those questions, i.e., there is one intermediate key for each question we want answered. Naturally enough, we use the word itself as the label.
  2. What the map phase does is take a piece of input data (a file), and then identify all the questions to which the input data might be relevant, i.e., all the words whose count might be affected by that document. For each such question it outputs the corresponding intermediate key (the word), and whatever information seems relevant to that particular question (in this case, a count).
  3. What the reduce phase receives as input for a particular intermediate key (i.e., question) is simply all the information relevant to that question, which it can process to produce the answer to the question.

The same pattern is followed in the computation of PageRank using MapReduce. We have a large set of questions we’d like answered: what are the values for Mp_j? We label those questions using j, and so the j are the intermediate keys. What the map phase does is take a piece of input data (a particular page and its description), and identify all the pages it links to, and therefore might contribute probability to, outputting the corresponding intermediate key (the page linked to), and the relevant information (in this case, the amount of probability that needs to be sent to the linked page). The reducer for any given page k thus receives all information relevant to computing the updated probability distribution.

This same pattern is also followed in the little MapReduce job we described for computing the inner product. There, it’s just a single question that we’re interested in: what’s the value of the inner product between [tex]p[/tex] and the vector of dangling pages? There is thus just a single intermediate key, for which we use the placeholder 1 – we could use anything. The mappers output all the information that’s relevant to that question, meaning they output nothing if a page isn’t dangling, and they output p_j if it is dangling. The reducer combines all this information to get the answer.

I should stress that this is just a heuristic for writing MapReduce programs; there are other ways of organizing computations within the MapReduce framework. Furthermore, if you’re having trouble fitting your programming problem into the MapReduce approach, you’d be advised to consider things like changing the set of questions you’re considering, or otherwise changing the way you represent the data in the problem. It may also be that there’s no good way of solving your problem using MapReduce; MapReduce is a hammer, but not every programming problem is a nail. With these caveats in mind, the heuristic I’ve described can be a useful way of thinking about how to approach putting familiar problems into a form where they can be tackled using MapReduce.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

Published
Categorized as GTS

How Are the Mighty Fallen

Joshua Gans writes to point out a paper he wrote with George Shepherd, “How Are the Mighty Fallen: Rejected Classic Articles by Leading Economists” (link to pdf), about the experience some leading economists have had with peer review:

We asked over 140 leading economists, including all living winners of the Nobel Prize and John Bates Clark Medal, to describe instances in which journals rejected their papers. We hit a nerve. More than 60 percent responded, many with several blistering pages. Paul Krugman expressed the tone of many letters: “Thanks for the opportunity to let off a bit of steam.”

The paper is extremely readable (and entertaining), if you have any interest in peer review. Among other tidbits: an extraordinary list of rejected papers, many of them among the classics of economics; the estimate from Krugman that 60% of his papers are rejected on first try; the remarkable story of George Akerlof’s Nobel Prize-winning paper “The Market for Lemons”, rejected by three separate journals before being published; the two rejections of the famous Black-Scholes options pricing paper, also Nobel Prize-winning; Krugman’s comment that “I am having a terrible time with my current work on economic geography: referees tell me that it’s obvious, it’s wrong, and anyway they said it years ago.” There’s much more.

Addendum: Joshua also pointed me to a retrospective on the article (pdf here), which makes for interesting followup reading.