The Logic of Collective Action

It is a curious fact that one of the seminal works on open culture and open science was published in 1965 (2nd edition 1971), several decades before the modern open culture and open science movements began in earnest. Mancur Olson’s book “The Logic of Collective Action” is a classic of economics and political science, and it contains much of interest for anyone thinking about open science.

At the heart of Olson’s book is a very simple question: “How can collective goods be provided to a group?” Here, a “collective good” is something that all participants in the group desire (though possibly unevenly), and that, by its nature, is inherently shared among all the members of the group.

For example, airlines may collectively desire a cut in airport taxes, since such a cut would benefit all airlines. Supermarkets may collectively desire a rise in the market price of bread; such a rise would be, to them, a collective good, since it would be by its nature shared. Most of the world’s countries desire a stable climate, even if they are not necessarily willing to individually take the action necessary to ensure a stable climate. Music-lovers desire a free and legal online version of the Beatles’ musical repertoire. Scientists desire shared access to scientific data, e.g., from the Sloan Digital Sky Survey or the Allen Brain Atlas.

What Olson shows in the book is that although all parties in a group may strongly desire and benefit from a particular collective good (e.g., a stable climate), under many circumstances they will not take individual action to achieve that collective good. In particular, they often find it in their individual best interest to act against their collective interest. The book has a penetrating analysis of what conditions can cause individual and collective interests to be aligned, and what causes them to fall out of alignment.

The notes in the present essay are much more fragmented than my usual essays. Rather than a single thesis, or a few interwoven themes, these are more in the manner of personal working notes, broken up into separate fragments, each exploring an idea presented by Olson and explaining how (if at all) I see it relating to open science. I hope they’ll be of interest to others interested in open science. I’m very open to discussion, but please do note that what I present here is a greatly abbreviated version (and my own interpretation) of part of what Olson wrote, omitting many important caveats that he discusses in detail; for the serious open scientist, I strongly recommend reading Olson’s book, as well as some of the related literature.

Why individuals may not act to obtain a collective good: Consider a situation in which many companies are all producing some type of widget, with each company’s product essentially indistinguishable from that being produced by the other companies. Obviously, the entire group of companies would benefit from a rise in the market price of the widget; such a rise would be for them a collective good. One way that price could rise would be for the supply of the widget to be restricted. Despite this fact, it is very unlikely that any single company will act on their own to restrict their supply of widgets, for their restriction of supply is likely to have a substantial negative impact on their individual profit, but a negligible impact on the market price.

This analysis is surprisingly general. As a small player in a big pond, why voluntarily act to provide a collective good, when your slice of any benefit will be quite small (e.g., due to an infinitesimal rise in prices), but the cost to you is quite large? A farmer who voluntarily restricted output to cause a rise in the price of farm products (a collective good for farmers) would be thought a loon by their farming peers, because of (not despite) their altruistic behaviour. Open scientists will recognize a familiar problem: a scientist who voluntarily shares their best ideas and data (making them a collective good for scientists) in a medium that is not yet regarded as scientifically meritorious does not do their individual prospects any good. One of the major questions of open science is how to obtain this collective good.

Small groups and big players: Olson points out that the analysis of the last two paragraphs fails to hold in the case of small groups, or in any situation where there are one or more “big players”. To see this, let’s return to the case of a restriction in supply leading to a rise in market price. Suppose a very large company decides to restrict supply of a good, perhaps causing a drop in supply of 1 percent. Suppose that the market responds with a 4 percent rise in price. Provided the company has greater than one quarter market share, the result will actually be an increase in profitability for the company. That is, in this case the company’s individual interest and the collective interest are aligned, and so the collective interest can be achieved through voluntary action on the part of the company.
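
To make the arithmetic explicit, here is a back-of-envelope version of that calculation (my own sketch, not Olson’s; it ignores production costs, so revenue stands in as a rough proxy for profit). Suppose the company has market share [tex]s[/tex], the total market supply is [tex]Q[/tex], and the price is [tex]p[/tex]. If the company cuts its own output by [tex]0.01Q[/tex] (producing the 1 percent drop in total supply), and the price rises to [tex]1.04p[/tex], its revenue changes from [tex]psQ[/tex] to [tex]1.04\,p\,(s - 0.01)\,Q[/tex]. The cut pays precisely when

[tex]\frac{1.04\,(s - 0.01)}{s} > 1,[/tex]

which rearranges to [tex]s > 0.26[/tex], a market share of roughly one quarter. The same formula shows why a small player stays put: at [tex]s = 0.02[/tex] the ratio is [tex]0.52[/tex], so the company would nearly halve its own revenue.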

This argument obviously holds only if one actor is sufficiently large that the benefit they reap from the collective good is sufficient, on its own, to justify their action. Furthermore, the fact that the large company takes this action by no means ensures that smaller companies will engage in the same action on behalf of the collective good, although the smaller companies will certainly be happy to reap the benefit of the larger company’s actions; Olson speaks, for this reason, of an “exploitation of the great by the small”. Indeed, notice that the impact of this strategy is to cause the market share of the large company to shrink slightly, moving them closer to a world in which their individual benefit from collective action no longer justifies voluntary action on their part. (This shrinkage in market share also acts as a disincentive for them to act initially, despite the fact that in the short run profits will rise; this is a complication I won’t consider here.)

A closely related example may be seen in open source software. Many large companies – perhaps most famously, IBM and Sun – invest enormous quantities of money in open source software. Why do they provide this collective good for programmers and (sometimes) consumers? The answer is not as simple as the answer given in the last paragraph, because open source software is not a pure collective good. Many companies (including IBM and Sun) have developed significant revenue streams associated with open source, and they may benefit in other ways – community goodwill, and the disruption to the business models of competitors (e.g., Microsoft). Nonetheless, it seems likely that at least part of the reason they pour resources into open source is because purchasing tens of thousands of Windows licenses each year costs a company like IBM millions or tens of millions of dollars. At that scale, they can benefit substantially by instead putting that money to work making Linux better, and then using Linux for their operating system needs; the salient point is that because of IBM’s scale, it’s a large enough sum of money that they can expect to significantly improve Linux.

There is a similarity to some of the patterns seen in open data. Many open data projects are very large, and I would go so far as to speculate that a quite disproportionate fraction of open data projects are very large projects – out of the at most hundreds (more likely dozens) of projects funded at the hundred-million-dollar-plus level, I can think offhand of several that have open data; I’d be shocked if a similar percentage of “small science” experiments had open data policies.

Why is this the case? A partial explanation may be as follows. Imagine you are heading a big multi-institution collaboration that’s trying to get a one hundred million dollar experiment funded. You estimate that adopting an open data policy will increase your chances by three percent – i.e., it’s worth about 3 million dollars to your project. (I doubt many people really think quite this way, but in practice it probably comes to the same thing.) Now, making the data publicly available will increase the chances of outsiders “scooping” members of the collaboration. But the chance of this happening for any single member of the collaboration is rather small, especially if there is a brief embargo period before data is publicly released. By contrast, for a small experiment run in a single lab, the benefits of open data are much smaller, but the costs are comparable.

This analysis can be extended into a more sophisticated three-part analysis. First, the person running the collaboration often isn’t concerned about being scooped themselves. This isn’t always true, but it is often true, for the leader or leaders of such projects often become more invested in the big picture than in making individual discoveries. They will instead tend to view any discovery made from data produced by the project as a victory for the project, regardless of who actually makes the discovery. To the extent that the leadership is unconcerned about being scooped, they have every incentive to go for open data. Second, someone who wants to join the collaboration may have reservations about an open data policy, but may also feel that it is worth giving up exclusive rights over data in exchange for a more limited type of exclusive access to a much richer data set. Third, as I argued in the previous paragraph, the trade-offs involved in open data are in any case more favourable for large collaborations than for small experiments.

Olson’s analysis suggests asking whether it might be easier to transition to a more open scientific culture in small, relatively close-knit research communities. If a community has only a dozen or so active research groups, might a few of those groups decide to “go open”, and then perhaps convince their peers to do so as well? With passionate, persuasive and generous leadership, maybe this would be possible.

When is collective action possible? Roughly speaking, Olson identifies the following possibilities:

  • When it is made compulsory. This is the case in many trade unions, with government taxes, and so on.
  • When social pressure is brought to bear. This is usually more effective in small groups that are already bound by a common interest. With suitable skills, it can also have an impact in larger groups, but this is usually much harder to achieve.
  • When it is in people’s own best interests, and so occurs voluntarily. Olson argues that this mostly occurs in small groups, and that there is a tendency for “exploitation of the great by the small”. More generally, he argues that in a voluntary situation, while some collective action may take place, the level is usually distinctly suboptimal.
  • When people are offered some other individual incentive. Olson offers many examples: one of the more amusing was the report that some trade unions spend more than ten percent of their budget on Christmas parties, simply to convince their members that membership is worthwhile.

Many of these ideas will already be familiar in the context of open science. Compulsion can be used to force people to share openly, as in the NIH public access policy. Alternatively, by providing ways of measuring scientific contributions made in the open, it is possible to incentivize researchers to take a more open approach. This has contributed to the success of the preprint arXiv, with citation services such as Citebase making it straightforward to measure the impact a preprint is having.

This use of incentives means that the provision of open data (and other open knowledge) can gradually change from being a pure collective good to being a blend of a collective and a non-collective good. It becomes non-collective in the sense that the individual sharing the data derives some additional (unshared) benefit due to the act of sharing.

A similar transition occurred early in the history of science. As I have described elsewhere, early scientists such as Galileo, Hooke and Newton often went to great lengths to avoid sharing their scientific discoveries with others. They preferred to hoard their discoveries, and continue working in secret. The reason, of course, was that at the time shared results were close to a pure collective good; there was little individual incentive to share. With the introduction of the journal system, and the gradual professionalization of science, this began to change, with individuals having an incentive to share. Of course, that change occurred only very gradually, over a period of many decades. Nowadays, we take the link between publication and career success for granted, but that was something early journal editors (and others) had to fight for.

Similarly, online media are today going through a grey period. For example, a few years back, blogging was in many ways quite a disreputable activity for a scientist, fine for a hobby, but certainly not seen as a way of making a serious scientific contribution. It’s still a long way from being mainstream, but I think there are many signs that it’s becoming more accepted. As this process continues, online open science will shift from being a pure collective good to being a blend of a collective and non-collective good. As Olson suggests, this is a good way to thrive!

So, what use are networked tools for science? I’m occasionally asked: “If networked tools are so good for science, why haven’t we seen more aggressive adoption of those tools by scientists? Surely that shows that we’ve already hit the limits of what can be done, with email, Skype, and electronic journals?” Underlying this question is a presumption, the presumption that if the internet really has the potential to be as powerful a tool for science as I and others claim, then surely we scientists would have gotten together already to achieve it. More generally, it’s easy to presume that if a group of people (e.g., scientists) have a common goal (advancing science), then they will act together to achieve that goal. What’s important about Olson’s work is that it comprehensively shows the flaws in this argument. A group of people may all benefit greatly from some collective action, yet be unable to act together to achieve it. Olson shows that far from being unusual, this is in many ways to be expected.

Further reading

I’m writing a book about “The Future of Science”; this post is part of a series where I try out ideas from the book in an open forum. A summary of many of the themes in the book is available in this essay. If you’d like to be notified when the book is available, please send a blank email to the.future.of.science@gmail.com with the subject “subscribe book”. I’ll email you to let you know in advance of publication. I will not use your email address for any other purpose! You can subscribe to my blog here.

Doing science online

This post is the text for an invited after-dinner talk about doing science online, given at the banquet for the Quantum Information Processing 2009 conference, held in Santa Fe, New Mexico, January 12-16, 2009.

Good evening.

Let me start with a few questions. How many people here tonight know what a blog is?

How many people read blogs, say once every week or so, or more often?

How many people actually run a blog themselves, or have contributed to one?

How many people read blogs, but won’t admit it in polite company?

Let me show you an example of a blog. It’s a blog called What’s New, run by UCLA mathematician Terence Tao. Tao, as many of you are probably aware, is a Fields Medal-winning mathematician. He’s known for solving many important mathematical problems, but is perhaps best known as the co-discoverer of the Green-Tao theorem, which proved the existence of arbitrarily long arithmetic progressions of primes.

Tao is also a prolific blogger, writing, for example, 118 blog posts in 2008. Popular stereotypes to the contrary, he’s not just sharing cat pictures with his mathematician buddies. Instead, his blog is a firehose of mathematical information and insight. To understand how valuable Tao’s blog is, let’s look at an example post, about the Navier-Stokes equations. As many of you know, these are the standard equations used by physicists to describe the behaviour of fluids, i.e., inside these equations is a way of understanding an entire state of matter.

The Navier-Stokes equations are notoriously difficult to understand. People such as Feynman, Landau, and Kolmogorov struggled for years with their implications, mostly without much success. One of the Clay Millennium Prize problems is to prove the existence of a global smooth solution to the Navier-Stokes equations, for reasonable initial data.

Now, this isn’t a talk about the Navier-Stokes equations, and there’s far too much in Terry Tao’s blog post for me to do it justice! But I do want to describe some of what the post contains, just to give you the flavour of what’s possible in the blog medium.

Tao begins his post with a brief statement explaining what the Clay Millennium Prize problem asks. He shares the interesting tidbit that in two spatial dimensions the solution to the problem is known(!), and asks why it’s so much harder in three dimensions. He tells us that the standard answer is turbulence, and explains what that means, but then says that he has a different way of thinking about the problem, in terms of what he calls supercriticality. I can’t do his explanation justice here, but very roughly, he’s looking for invariants which can be used to control the behaviour of solutions to the equations at different length scales. He points out that all the known invariants give weaker and weaker control at short length scales. What this means is that the invariants give us a lot of control over solutions at long length scales, where things look quite regular, but little control at short length scales, where you see the chaotic variation characteristic of turbulence. He then surveys all the known approaches to proving global existence results for nonlinear partial differential equations – he says there are just three broad approaches – and points out that supercriticality is a pretty severe obstruction if you want to use one of these approaches.

The post has loads more in it, so let me speed this up. He describes the known invariants for the equations, and what they can be used to prove. He surveys and critiques existing attempts on the problem. He makes six suggestions for ways of attacking the problem, including one which may be interesting to some of the people in this audience: he suggests that pseudorandomness, as studied by computer scientists, may be connected to the chaotic, almost random behaviour that is seen in the solutions of the Navier-Stokes equations.

The post is filled to the brim with clever perspective, insightful observations, ideas, and so on. It’s like having a chat with a top-notch mathematician, who has thought deeply about the Navier-Stokes problem, and who is willingly sharing their best thinking with you.

Following the post, there are 89 comments. Many of the comments are from well-known professional mathematicians, people like Greg Kuperberg, Nets Katz, and Gil Kalai. They bat the ideas in Tao’s post backwards and forwards, throwing in new insights and ideas of their own. It spawned posts on other mathematical blogs, where the conversation continued.

That’s just one post. Terry Tao has hundreds of other posts, on topics like Perelman’s proof of the Poincaré conjecture, quantum chaos, and gauge theory. Many posts contain remarkable insights, often related to open research problems, and they frequently stimulate wide-ranging and informative conversations in the comments.

That’s just one blogger. There are, of course, many other top-notch mathematician bloggers. Cambridge’s Tim Gowers, another Fields Medallist, also runs a blog. Like Tao’s blog, it’s filled with interesting mathematical insights and conversation, on topics like how to use Zorn’s lemma, dimension arguments in combinatorics, and a thought-provoking post on what makes some mathematics particularly deep.

Alain Connes, another Fields Medallist, is also a blogger. He only posts occasionally, but when he does his posts are filled with interesting mathematical tidbits. For example, I greatly enjoyed this post, where he talks about his dream of solving one of the deepest problems in mathematics – the problem of proving the Riemann Hypothesis – using non-commutative geometry, a field Connes played a major role in inventing.

Berkeley’s Richard Borcherds, another Fields Medallist, is also a blogger, although he is perhaps better described as an ex-blogger, as he hasn’t updated in about a year.

I’ve picked on Fields Medallists, in part because at least four of the 42 living Fields Medallists have blogs. But there are also many other excellent mathematical blogs, including blogs from people closely connected to the quantum information community, like Scott Aaronson, Dave Bacon, Gil Kalai, and many others.

Let me make a few observations about blogging as a medium.

It’s informal.

It’s rapid-fire.

Many of the best blog posts contain material that could not easily be published in a conventional way: small, striking insights, or perhaps general thoughts on how to approach a problem. These are the kinds of ideas that may be too small or incomplete to be published, but which often contain the seed of later progress.

You can think of blogs as a way of scaling up scientific conversation, so that conversations can become widely distributed in both time and space. Instead of just a few people listening as Terry Tao muses aloud in the hall or the seminar room about the Navier-Stokes equations, why not have a few thousand talented people listen in? Why not enable the most insightful to contribute their insights back?

You can also think of blogs as a way of making scientific conversation searchable. If you type “Navier-Stokes problem” into Google, the third hit is Terry Tao’s blog post about it. That means future mathematicians can easily benefit from his insight, and that of his commenters.

You might object that the most important papers about the Navier-Stokes problem should show up first in the search. There is some truth to this, but it’s not quite right. Rather, insofar as Google is doing its job well, the ranking should reflect the importance and significance of the respective hits, regardless of whether those hits are papers, blog posts, or some other form. If you look at it this way, it’s not so surprising that Terry Tao’s blog post is near the top. As all of us know, when you’re working on a problem, a good conversation with an insightful colleague may be worth as much as (and sometimes more than) reading the classic papers. Furthermore, as search engines become better personalized, the search results will better reflect your personal needs; in a search utopia, if Terry Tao’s blog post is what you most need to see, it’ll come up first, while if someone else’s paper on the Navier-Stokes problem is what you most need to see, then that will come up first.

I’ve started this talk by discussing blogs because they are familiar to most people. But ideas about doing science in the open, online, have been developed far more systematically by people who are explicitly doing open notebook science. People such as Garrett Lisi are using mathematical wikis to develop their thinking online; Garrett has referred to the site as “my brain online”. People such as chemists Jean-Claude Bradley and Cameron Neylon are doing experiments in the open, immediately posting their results for all to see. They’re developing ideas like lab equipment that posts data in real time, posting data in formats that are machine-readable, enabling data mining, automated inference, and other additional services.

Stepping back, what tools like blogs, open notebooks and their descendants enable is filtered access to new sources of information, and to new conversation. The net result is a restructuring of expert attention. This is important because expert attention is the ultimate scarce resource in scientific research, and the more efficiently it can be allocated, the faster science can progress.

How many times have you been obstructed in your research by the need to prove or disprove a small result that is a little outside your core expertise, and so would take you days or weeks, but which you know, of a certainty, the right person could resolve in minutes, if only you knew who that person was, and could easily get their attention? This may sound like a fantasy, but if you’ve worked on the right open source software projects, you’ll know that this is exactly what happens in those projects – discussion forums for open source projects often have a constant flow of messages posing what seem like tough problems; quite commonly, someone with a great comparative advantage quickly posts a clever way to solve the problem.

If new online tools offer us the opportunity to restructure expert attention, then how exactly might it be restructured? One of the things we’ve learnt from economics is that markets can be remarkably effective ways of efficiently allocating scarce resources. I’ll talk now about an interesting market in expert attention that has been set up by a company named InnoCentive.

To explain InnoCentive, let me start with an example involving an Indian not-for-profit called the ASSET India Foundation. ASSET helps at-risk girls escape the Indian sex industry by training them in technology. To do this, they’ve set up training centres in several large cities across India. They’ve received many requests to set up training centres in smaller towns, but many of those towns don’t have the electricity needed to power technologies like the wireless routers that ASSET uses in its training centres.

On the other side of the world, in the town of Waltham, just outside Boston, is the company InnoCentive. InnoCentive is, as I said, an online market in expert attention. It enables companies like Eli Lilly and Procter & Gamble to pose “Challenges” over the internet, scientific research problems they’d like solved, with a prize for the solution, often many thousands of dollars. Anyone in the world can download a detailed description of the Challenge, and attempt to win the prize. More than 160,000 people from 175 countries have signed up for the site, and prizes for more than 200 Challenges have been awarded.

What does InnoCentive have to do with ASSET India? Well, ASSET got in touch with the Rockefeller Foundation, and explained their desire for a low-cost solar-powered wireless router. Rockefeller put up $20,000 in prize money to post an InnoCentive Challenge to design a suitable wireless router. The Challenge was posted for two months at InnoCentive. 400 people downloaded the Challenge, and 27 people submitted solutions. The prize was awarded to a 31-year-old Texan software engineer named Zacary Brown, who delivered exactly the kind of design that ASSET was looking for; a prototype is now being built by engineering students at the University of Arizona.

Let’s come back to the big picture. These new forms of contribution – blogs, wikis, online markets and so forth – might sound wonderful, but you might reasonably ask whether they are a distraction from the real business of doing science? Should you blog, as a young postdoc trying to build up a career, rather than writing papers? Should you contribute to Wikipedia, as a young Assistant Professor, when you could be writing grants instead? Crucially, why would you share ideas in the manner of open notebook science, when other people might build on your ideas, maybe publishing papers on the subjects you’re investigating, but without properly giving you credit?

In the short term, these are all important questions. But I think a lot of insight into these questions can be obtained by thinking first of the long run.

At the beginning of the 17th century, Galileo Galilei constructed one of the first astronomical telescopes, looked up at the sky, and turned his new instrument to Saturn. He saw, for the first time in human history, Saturn’s astonishing rings. Did he share this remarkable discovery with the rest of the world? He did not, for at the time that kind of sharing of scientific discovery was unimaginable. Instead, he announced his discovery by sending a letter to Kepler and several other early scientists, containing a Latin anagram, “smaismrmilmepoetaleumibunenugttauiras”. When unscrambled, this may be translated, roughly, as “I have discovered Saturn three-formed”. The reason Galileo announced his discovery in this way was so that he could establish priority, should anyone after him see the rings, while avoiding revealing the discovery.

Galileo could not imagine a world in which it made sense for him to freely share a discovery like the rings of Saturn, rather than hoarding it for himself. Certainly, he couldn’t share the discovery in a journal article, for the journal system was not invented until more than 20 years after Galileo died. Even then, journals took decades to establish themselves as a legitimate means of sharing scientific discoveries, and many early scientists looked upon journals with some suspicion. The parallel to the suspicion many scientists have of online media today is striking.

Think of all the knowledge we have, which we do not share. Theorists hoard clever observations and questions, little insights which might one day mature into a full-fledged paper. Entirely understandably, we hoard those insights against that day, doling them out only to trusted friends and close colleagues. Experimentalists hoard data; computational scientists hoard code. Most scientists, like Galileo, can’t conceive of a world in which it makes sense to share all that information, in which sharing information on blogs, wikis, and their descendants is viewed as being (potentially, at least) an important contribution to science.

Over the short term, things will only change slowly. We are collectively very invested in the current system. But over the long run, a massive change is, in my opinion, inevitable. The advantages of change are simply too great.

There’s a story, almost certainly apocryphal, that the physicist Michael Faraday was approached after a lecture by Queen Victoria, and asked to justify his research on electricity. Faraday supposedly replied “Of what use is a newborn baby?”

Blogs, wikis, open notebooks, InnoCentive and the like aren’t the end of online innovation. They’re just the beginning. The coming years and decades will see far more powerful tools developed. We really will enormously scale up scientific conversation; we will scale up scientific collaboration; we will, in fact, change the entire architecture of expert attention, developing entirely new ways of navigating data, making connections and inferences from data, and making connections between people.

When we look back at the second half of the 17th century, it’s obvious that one of the great changes of the time was the invention of modern science. When historians look back at the early part of the twenty-first century, they will also see several major changes. I know many of you in this room believe that one of those changes will be related to the sustainability of how humans live on this planet. But I think there are at least two other major historical changes. The first is the fact that this is the time in history when the world’s information is being transformed from an inert, passive, widely separated state into a single, unified, active system that can make connections, that brings that information alive. The world’s information is waking up.

The second of those changes, closely related to the first, is that we are going to change the way scientists work; we are going to change the way scientists share information; we are going to change the way expert attention itself is allocated, developing new methods for connecting people, for organizing people, for leveraging people’s skills. They will be redirected, organized, and amplified. The result will speed up the rate at which discoveries are made, not in one small corner of science, but across all of science.

Quantum information and computation is a wonderful field. I was touched and surprised by the invitation to speak tonight. I have, I think, never felt more honoured in my professional life. But, I trust you can understand when I say that I am also tremendously excited by the opportunities that lie ahead in doing science online.



When can the long tail be leveraged?

In 2006, Chris Anderson, the editor-in-chief of Wired magazine, wrote a bestselling book about an idea he called the long tail. The long tail is nicely illustrated by the bookselling business. Until recently, the conventional wisdom in bookselling was to stock only bestsellers. But internet bookstores such as Amazon.com take a different approach, stocking everything in print. According to Anderson, about a quarter of Amazon’s sales come from the long tail of books outside the top 100,000 bestselling titles (see here for the original research). While books in the long tail don’t individually sell many copies, they greatly outnumber the bestsellers, and so what they lack in individual sales they make up in total sales volume.

The long tail attracted attention because it suggested a new business model, selling into the long tail. Companies like Amazon, Netflix, and Lulu have built businesses doing just that. It also attracted attention because it suggested that online collaborations like Wikipedia and Linux might be benefitting greatly from the long tail of people who contribute just a little.

The problem, if you’re building a business or an online collaboration, is that it can be difficult to tell whether participation is dominated by the long tail or not. Take a look at these two graphs:

[Two graphs, not reproduced here: idealized plots of Amazon’s book sales versus sales rank, and of Wikipedia edits versus contributor rank.]

The first graph is an idealized graph of Amazon’s book sales versus the sales rank, [tex]r[/tex], of the book. The second graph is an idealized graph of the number of edits made by the [tex]r[/tex]th most prolific contributor to Wikipedia. Superficially, the two graphs look similar, and it’s tempting to conclude that both graphs have a long tail. In fact, the two have radically different behaviour. In this post I’ll describe a general-purpose test which shows that Amazon.com makes it (just!) into the long tail regime, but that in Wikipedia, contributions from the short head dominate. Furthermore, this difference isn’t just an accident, but is a result of design decisions governing how people find information and make contributions.

Let’s get into more detail about the specifics of the Amazon and Wikipedia cases, before turning to the big picture. The first graph above shows the function

[tex]a / r^{0.871},[/tex]

where [tex]a[/tex] is a constant of proportionality, and [tex]r[/tex] is the rank of the book. The exponent is chosen to be [tex]0.871[/tex] because as of 2003 that makes the function a pretty close approximation to the number of books sold by Amazon. For our analysis, it doesn’t much matter what the value of [tex]a[/tex] is, so we won’t worry about pinning it down. All the important stuff is contained in the [tex]r^{0.871}[/tex] in the denominator.

The second graph shows the function

[tex]a / r^{1.7}.[/tex]

As with the Amazon sales formula, the Wikipedia edit formula isn’t exact, but rather is an approximation. I extracted the formula from a blog post written by a researcher studying Wikipedia at the Xerox PARC Augmented Cognition Center. I mention this because they don’t actually determine the exponent 1.7 themselves – I backed it out from one of their graphs. Note that, as for the Amazon formula, [tex]a[/tex] is a constant of proportionality whose exact value doesn’t matter. There’s no reason the values of the Wikipedia [tex]a[/tex] and the Amazon [tex]a[/tex] should be the same; I’m using the same letter in both formulas simply to avoid a profusion of different letters.

(A little parenthetical warning: figuring out power law exponents is a surprisingly subtle problem. It’s possible that my estimate of the exponent in the last paragraph is off. See, e.g., this paper for a discussion of some of the subtleties, and references to the literature. If someone with access to the raw data wants to do a proper analysis, I’d be interested to know the results. In any case, we’ll see that the correct value for the exponent would need to be wildly different from my estimate before it could make any difference to the qualitative conclusions we’ll reach.)

Now suppose the total number of different books Amazon stocks in their bookstore is [tex]N[/tex]. We’ll show a bit later that the total number of books sold is given approximately by:

[tex]7.75 \times a \times N^{0.129}.[/tex]

The important point in this formula is that as [tex]N[/tex] increases, the total number of books sold grows fairly rapidly. Double [tex]N[/tex] and you get a nearly ten percent increase in total sales, since [tex]2^{0.129} \approx 1.09[/tex]. There’s a big benefit to being in the business of the long tail of books.

Let’s move to the second graph, the number of Wikipedia edits. If the total number of editors is [tex]N[/tex], then we’ll show below that the total number of edits made is approximately

[tex]2.05 \times a - O\left( \frac{a}{N^{0.7}} \right).[/tex]

The important point here is that, in contrast to the Amazon example, as [tex]N[/tex] increases it makes little difference to the total number of edits made. In Wikipedia, the total number of edits is dominated by the short head of editors who contribute a great deal.
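
As a quick sanity check on these two formulas, here’s a small Python sketch (my own, not from the original analysis) that sums each curve directly; the constant [tex]a[/tex] is set to 1, since it only scales the totals:

```python
# Compare total participation for the Amazon-like and Wikipedia-like
# power laws, with the proportionality constant a set to 1.

def partial_sum(b, N):
    """Sum of 1/r^b for r = 1, ..., N."""
    return sum(r ** -b for r in range(1, N + 1))

for b, label in [(0.871, "Amazon-like (b = 0.871)"),
                 (1.7, "Wikipedia-like (b = 1.7)")]:
    print(label)
    for N in (10**3, 10**4, 10**5, 10**6):
        print(f"  N = {N:>9,}  total = {partial_sum(b, N):8.3f}")

# The first total keeps climbing as N doubles and redoubles
# (tail-dominated); the second flattens out near zeta(1.7),
# roughly 2.05 (head-dominated).
```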

A general rule to decide whether the long tail or the short head dominates

Let’s generalize the above discussion. We’ll find a simple general rule that can be used to determine whether the long tail or the short head dominates. Suppose the pattern of participation is governed by a power law distribution, with the general form

[tex]\frac{a}{r^b},[/tex]

where [tex]a[/tex] and [tex]b[/tex] are both constants. Both the Amazon and Wikipedia data can be described in this way, and it turns out that many other phenomena are described similarly – if you want to dig into this, I recommend the review papers on power laws by Mark Newman and Michael Mitzenmacher.

Let’s also suppose the total number of “participants” is [tex]N[/tex], where I use the term participants loosely – it might mean the total number of books on sale, the total number of contributors to Wikipedia, or whatever is appropriate to the situation. Our interest will be in summing the contributions of all participants.
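
Before treating the cases separately, it’s worth noting where the approximations below come from. The standard trick (a step I’m spelling out here for completeness) is to compare the sum with the corresponding integral:

[tex]\sum_{r=1}^{N} \frac{a}{r^b} \approx \int_1^N \frac{a}{r^b}\,dr = \frac{a\,(N^{1-b} - 1)}{1-b},[/tex]

valid for [tex]b \neq 1[/tex]. For [tex]b < 1[/tex] the [tex]N^{1-b}[/tex] term dominates; for [tex]b = 1[/tex] the integral gives [tex]a \log N[/tex]; and for [tex]b > 1[/tex] the integral converges as [tex]N \rightarrow \infty[/tex], which is why a constant dominates the total.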

When [tex]b < 1[/tex], the sum over all values of [tex]r[/tex] is approximately

[tex]\frac{a N^{1-b}}{1-b}.[/tex]

Thus, this case is tail-dominated, with the sum continuing to grow reasonably rapidly as [tex]N[/tex] grows. As we saw earlier, this is the case for Amazon’s book sales, so Amazon really is a case where the long tail is in operation.

When [tex]b = 1[/tex], the total over all values of [tex]r[/tex] is approximately

[tex]a \log N.[/tex]

This also grows as [tex]N[/tex] grows, but extremely slowly. It’s really an edge case between tail-dominated and head-dominated. Finally, when [tex]b > 1[/tex], the total over all values of [tex]r[/tex] is approximately

[tex]a\zeta(b)-O\left(\frac{a}{N^{b-1}}\right),[/tex]

where [tex]\zeta(b)[/tex] is just a constant (actually, the Riemann zeta function, evaluated at [tex]b[/tex]), and the size of the corrections is of order [tex]a/N^{b-1}[/tex]. It follows that for large [tex]N[/tex] this approaches a constant value, and increasing the value of [tex]N[/tex] has little effect, i.e., this case is head-dominated. So, for example, it means that the great majority of edits to Wikipedia really are made by a small handful of dedicated contributors.

There is a caveat to all this discussion, which is that in the real world power laws are usually just an approximation. For many real world cases, the power law breaks down at the end of the tail, and at the very head of the distribution. The practical implication is that the quantitative values predicted by the above formula may be somewhat off. In practice, though, I don’t think this caveat much matters. Provided the great bulk of the distribution is governed by a power law, this analysis gives insight into whether it’s dominated by the head or by the tail.
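
If you want to apply this test to real data, you first need an estimate of the exponent [tex]b[/tex]. Here’s a minimal sketch of the naive approach, a least-squares fit on log-log axes; as the parenthetical warning above cautions, proper power law fitting is subtler than this, so treat the result as indicative only:

```python
import math

def estimate_exponent(counts):
    """Fit counts[r] ~ a / r^b by least squares on log-log axes,
    given contribution counts sorted in decreasing order; return b."""
    xs = [math.log(r) for r in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # counts ~ a / r^b gives slope -b on log-log axes

# Synthetic data drawn from the Wikipedia-like curve, as a check:
counts = [1000 / r ** 1.7 for r in range(1, 501)]
b = estimate_exponent(counts)
print(f"estimated b = {b:.2f} -> {'head' if b > 1 else 'tail'}-dominated")
```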

Implications

If you’re developing a long tail business or collaboration, you need to make sure the exponent [tex]b[/tex] in the power law is less than one. The smaller the exponent, the better off you’ll be.

How can you make the exponent as small as possible? In particular, how can you make sure it’s smaller than the magic value of one? To understand the answer to this question, we need to understand what actually determines the value of the exponent. There are some nice, simple mathematical models explaining how power laws emerge, and in particular how the power law exponent emerges. At some point in the future I’d like to come back and discuss those in detail, and what implications they have for site architecture. This post is already long enough, though, so let me just make three simple comments.

First, focus on developing recommendation and search systems which spread attention out, rather than concentrating it in the short head of what’s already popular. This is difficult to do without sacrificing quality, but there’s some interesting academic work now being done on such recommendation systems – see, for example, some of the work described in this recent blog post by Daniel Lemire.

Second, in collaborative projects, ensure a low barrier to entry for newcomers. One problem Wikipedia faces is a small minority of established Wikipedians who are hostile to new editors. It’s not common, but it is there. This drives newcomers away, and so concentrates edits within the group of established editors, effectively increasing the exponent in the power law.

Third, the essence of these and other similar recommendations is that they are systematic efforts to spread attention and contribution out, not one-off efforts toward developing a long tail of sales or contributions. The problem with one-off efforts is that they do nothing to change the systematic architectural factors which actually determine the exponent in the power law, and it is that exponent which is the critical factor.


The role of open licensing in open science

The open science movement encourages scientists to make scientific information freely available online, so other scientists may reuse and build upon that information. Open science has many striking similarities to the open culture movement, developed by people like Lawrence Lessig and Richard Stallman. Both movements share the idea that powerful creative forces are unleashed when creative artifacts are freely shared in a creative commons, enabling other people to build upon and extend those artifacts. The artifact in question might be a set of text documents, like Wikipedia; it might be open source software, like Linux; or open scientific data, like the data from the Sloan Digital Sky Survey, used by services such as Galaxy Zoo. In each case, open information sharing enables creative acts not conceived by the originators of the information content.

The advocates of open culture have developed a set of open content licenses, essentially a legal framework, based on copyright law, which strongly encourages and in some cases forces the open sharing of information. This open licensing strategy has been very successful in strengthening the creative commons, and so moving open culture forward.

When talking to some open science advocates, I hear a great deal of interest and enthusiasm for open licenses for science. This enthusiasm seems prompted in part by the success of open licenses in promoting open culture. I think this is great – with a few minor caveats, I’m a proponent of open licenses for science – but the focus on open licenses sometimes bothers me. It seems to me that while open licenses are important for open science, they are by no means as critical as they are to open culture; open access is just the beginning of open science, not the end. This post discusses to what extent open licenses can be expected to play a role in open scientific culture.

Open licenses and open culture

Let me review the ideas behind the licensing used in the open culture movement. If you’re familiar with the open culture movement, you’ll have heard this all before; if you haven’t, hopefully it’s a useful introduction. In any case, it’s worth getting all this fixed in our heads before addressing the connection to open science.

The obvious thing for advocates of open culture to do is to get to work building a healthy public domain: writing software, producing movies, writing books and so on, releasing all that material into the public domain, and encouraging others to build upon those works. They could then use a moral suasion argument to encourage others to contribute back to the public domain.

The problem is that many people and organizations don’t find this kind of moral suasion very compelling. Companies take products from the public domain, build upon them, and then, for perfectly understandable reasons, fiercely protect the intellectual property they produce. Disney was happy to make use of the old tale of Cinderella, but they take a distinctly dim view of people taking their Cinderella movie and remixing it.

People like Richard Stallman and Lawrence Lessig figured out how to add legal teeth to the moral suasion argument. Instead of relying on goodwill to get people to contribute back to the creative commons, they invented a new type of licensing that compels people to contribute back. There’s now a whole bunch of such open licenses – the various versions of the GNU General Public License (GPL), Creative Commons licenses, and many others – with various technical differences between them. But there’s a basic idea of viral licensing that’s common to many (though not all) of the open licenses. This is the idea that anyone who extends a product released under such a license must release the extension under the same terms. Using such an open license is thus a lot like putting material into the public domain, in that both result in content being available in the creative commons, but the viral open licenses differ from the public domain in compelling people to contribute back into the creative commons.

The consequences of this compulsion are interesting. In the early days of open licensing, the creative commons grew slowly. As the amount of content with an open license grew, though, things began to change. This has been most obvious in software development, which was where viral open licenses first took hold. Over time it became more tempting for software developers to start development with an existing open source product. Why develop a new product from scratch, when you can start with an existing codebase? This means that you can’t use the most obvious business model – limit distribution to executable files, and charge for them – but many profitable open source companies have shown that alternate business models are possible. The result is that as time has gone on, even the most resolutely closed source companies (e.g., Microsoft) have found it difficult to avoid infection by open source. The result has been a gradually accelerating expansion of the creative commons, an expansion that has enabled extraordinary creativity.

Open licenses and open science

I’m not sure what role licensing will play in open science, but I do think there are some clear signs that it’s not going to be as central a role as it’s played in open culture.

The first reason for thinking this is that a massive experiment in open licensing has already been tried within science. By law, works produced by US Federal Government employees are, with some caveats, automatically put into the public domain. Every time I’ve signed a “Copyright Transfer” agreement with an academic journal, there’s always been in the fine print a clause excluding US Government employees from having to transfer copyright. You can’t give away what you don’t own.

This policy has greatly enriched the creative commons. And it’s led to enormous innovation – for example, I’ve seen quite a few mapping services that build upon US Government data, presumably simply because that data is in the public domain. But in the scientific realm I don’t get the impression that this is doing much to promote the kind of mass collaboration that open licenses are enabling in open culture.

(A similar discussion can be had about open access journals. The discussion there is more complex, though, because (a) many of the journals have only been open access for a few years, and (b) the way work is licensed varies a lot from journal to journal. That’s why I’ve focused on the US Government.)

The second reason for questioning the centrality of open licenses is the observation that the main barriers to remixing and extension of scientific content aren’t legal barriers. They are, instead, cultural barriers. If someone copies my work, as a scientist, I don’t sue them. If I were to do that, it’s in any case doubtful that the courts would do more than slap the violator on the wrist – it’s not as though they’ll directly make money. Instead, there’s a strong cultural prohibition against such copying, expressed through widely-held community norms about plagiarism and acceptable forms of attribution. If someone copies my work, the right way to deal with it is to inform their colleagues, their superiors, and so on – in short, to deal with it by cultural rather than legal means.

That’s not to say there isn’t a legal issue here. But it’s a legal issue for publishers, not individual scientists. Many journal publishers have business models which are vulnerable to systematic large-scale attempts to duplicate their content. Someone could, for example, set up a “Pirate Bay” for scientific journal articles, making the world’s scientific articles freely available. That’s something those journals have to worry about, for legitimate short-term business reasons, and copyright law provides them with some form of protection and redress.

My own opinion is that over the long run, it’s likely that the publishers will move to open access business models, and that will be a good thing for open science. I might be wrong about that; I can imagine a world in which that doesn’t happen, yet certain varieties of open science still flourish. Regardless of what you think about the future of journals, the larger point is that the legal issues around openness are only a small part of a much larger set of issues, issues which are mostly cultural. The key to moving to a more open scientific system is changing scientists’ hearts and minds about the value and desirability of more openly sharing information, not reforming the legal rights under which they publish content.

So, what’s the right approach to licensing? John Wilbanks has argued, persuasively in my opinion, that data should be held in the public domain. I’ve sometimes wondered if this argument shouldn’t be extended beyond data, to all forms of scientific content, including papers, provided (and this is a big “provided”) the publisher’s business interests can be met in a way that adequately serves all parties. After all, if the scientific community is primarily a reputation economy, built around cultural norms, then why not simply remove the complication of copyright from the fray?

Now, I should say that this is speculation on my part, and my thinking is incomplete on this set of issues. I’m most interested to hear what others have to say! I’m especially interested in efforts to craft open research licenses, like the license Victoria Stodden has been developing. But I must admit that it’s not yet clear to me why, exactly, we need such licenses, or what interests they serve.


Biweekly links for 01/19/2009

  • Towards a Wiki For Formally Verified Mathematics
    • “the wiki will state all of known mathematics in a machine-readable language and verify all theorems for correctness, thus providing a knowledge base for interactive proof assistants.”
  • The Art Of Community | jonobacon@home
    • “The Art of Community” is a new book being written by Jono Bacon, the Ubuntu Community Manager. It’s not yet done, but will be published by O’Reilly, and will also be released under a CC license.
  • AcaWiki
    • Site intended to be a “Wikipedia for academic research”, summarizing academic papers.



Travel and talks

I’ll be in Santa Fe, New Mexico, on Wednesday and Thursday of this coming week. While there, I’m giving a talk about scientific collaboration on the internet at the Santa Fe Institute on Wednesday afternoon, and the after-dinner banquet speech (related topic, very different talk) at the Quantum Information Processing (QIP) 2009 conference. QIP was one of my favourite annual events when I worked in quantum computing, and I’m really looking forward to the chance to catch up with old friends.

On Friday through Sunday I’ll be at the third annual science blogging conference, otherwise known as Science Online 2009, in Raleigh-Durham, North Carolina. People who went to the first two all say they had a great time, so I’m really looking forward to this!

Finally, while advertising talks, I may as well add that I’ll be giving the University of Toronto Physics Colloquium on Thursday, January 22, at 4pm, in the McLennan Building, room 102, on the main University of Toronto campus.
