January 2009 – Michael Nielsen

Is massively collaborative mathematics possible?

This is the title of a thought-provoking essay by Tim Gowers, which seems to have been stimulated in part by my recent essay on doing science online. What follows are some excerpts from Gowers’ essay, and some thoughts by me:

Of course, one might say, there are certain kinds of problems that lend themselves to huge collaborations. One has only to think of the proof of the classification of finite simple groups, or of a rather different kind of example such as a search for a new largest prime carried out during the downtime of thousands of PCs around the world. But my question is a different one. What about the solving of a problem that does not naturally split up into a vast number of subtasks? Are such problems best tackled by n people for some n that belongs to the set {1,2,3}? (Examples of famous papers with four authors do not count as an interesting answer to this question.)

It seems to me that, at least in theory, a different model could work: different, that is, from the usual model of people working in isolation or collaborating with one or two others. Suppose one had a forum (in the non-technical sense, but quite possibly in the technical sense as well) for the online discussion of a particular problem. The idea would be that anybody who had anything whatsoever to say about the problem could chip in. And the ethos of the forum – in whatever form it took – would be that comments would mostly be kept short. In other words, what you would not tend to do, at least if you wanted to keep within the spirit of things, is spend a month thinking hard about the problem and then come back and write ten pages about it. Rather, you would contribute ideas even if they were undeveloped and/or likely to be wrong.

A similar approach is used in the open-source software community – essentially, a dynamic division of labour that is not planned entirely in advance, but rather arises in response to the exigencies of the problem at hand. This dynamic division of labour is typically co-ordinated through one or more online forums. Examples close to this spirit, and also somewhat close in spirit to modern mathematics, include Kasparov versus the World and the Matlab programming competition.

On the subject of the desirable size of contributions, in open source the most frequent contributions change just a single line of code. The second most frequent contributions change two lines of code, and so on. One study suggests the number of contributions [tex]n[/tex] scales as [tex]n(l) \propto l^{-1.13}[/tex], where [tex]l[/tex] is the number of lines of code changed or added (“committed”) in a single contribution.

It’s notable that with this distribution the total line count is still dominated by the larger contributions. Despite this, my guess is that the smaller contributions are still very significant for maintaining momentum and morale, which are so important in creative projects. In this regard, it’s a little like a good creative conversation – not all contributions to the conversation need to be world-shaking, some are simply needed to keep the conversation moving.
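To make that concrete, here is a minimal numerical sketch in Python. It assumes the quoted exponent of 1.13 holds exactly for every contribution size from 1 up to a cutoff; the cutoff `max_lines`, the threshold `small_cutoff` and the function name `power_law_shares` are illustrative assumptions, not figures from the study. If contributions of size l occur with relative frequency l^(-1.13), then the lines contributed at size l scale as l × l^(-1.13) = l^(-0.13), which falls off so slowly that the running total of lines keeps growing as larger sizes are included.

```python
def power_law_shares(exponent=1.13, max_lines=10_000, small_cutoff=10):
    """Share of contributions, and share of total lines, coming from
    contributions of at most `small_cutoff` lines.

    Assumes n(l) is proportional to l**(-exponent) for l = 1..max_lines.
    The values of max_lines and small_cutoff are hypothetical choices.
    """
    sizes = range(1, max_lines + 1)
    # Relative number of contributions of each size l.
    counts = [l ** -exponent for l in sizes]
    # Relative number of lines contributed at each size l.
    lines = [l * c for l, c in zip(sizes, counts)]

    count_share = sum(counts[:small_cutoff]) / sum(counts)
    line_share = sum(lines[:small_cutoff]) / sum(lines)
    return count_share, line_share


if __name__ == "__main__":
    count_share, line_share = power_law_shares()
    print(f"share of contributions with <= 10 lines: {count_share:.3f}")
    print(f"share of total lines from those contributions: {line_share:.4f}")
```

Under these assumptions the small contributions make up a large fraction of all contributions by count, but only a tiny fraction of the total lines. Both shares depend on the assumed cutoff, so the numbers should be read as qualitative.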

This suggestion raises several questions immediately. First of all, what would be the advantage of proceeding in this way? My answer is that I don’t know for sure that there would be an advantage. However, I can see the following potential advantages.

(i) Sometimes luck is needed to have the idea that solves a problem. If lots of people think about a problem, then just on probabilistic grounds there is more chance that one of them will have that bit of luck.

(ii) Furthermore, we don’t have to confine ourselves to a purely probabilistic argument: different people know different things, so the knowledge that a large group can bring to bear on a problem is significantly greater than the knowledge that one or two individuals will have. This is not just knowledge of different areas of mathematics, but also the rather harder to describe knowledge of particular little tricks that work well for certain types of subproblem, or the kind of expertise that might enable someone to say, “That idea that you thought was a bit speculative is rather similar to a technique used to solve such-and-such a problem, so it might well have a chance of working,” or “The lemma you suggested trying to prove is known to be false,” and so on – the type of thing that one can take weeks or months to discover if one is working on one’s own.

I think of this as the “annoying little conjecture” problem: many conjectures that arise in the course of research are essentially routine to prove or disprove, but it can take days or weeks to determine which it’s going to be. If you talk to just the right person, they can often cut that down to minutes or hours. Ordinarily, though, finding that right person is just as laborious as (and may be less enlightening than) solving the problem yourself. Having a mechanism to find the right person, even if it’s essentially just broadcast search, would be enormously beneficial.

The next obvious question is this. Why would anyone agree to share their ideas? Surely we work on problems in order to be able to publish solutions and get credit for them. And what if the big collaboration resulted in a very good idea? Isn’t there a danger that somebody would manage to use the idea to solve the problem and rush to (individual) publication?

Here is where the beauty of blogs, wikis, forums etc. comes in: they are completely public, as is their entire history. To see what effect this might have, imagine that a problem was being solved via comments on a blog post. Suppose that the blog was pretty active and that the post was getting several interesting comments. And suppose that you had an idea that you thought might be a good one. Instead of the usual reaction of being afraid to share it in case someone else beat you to the solution, you would be afraid not to share it in case someone beat you to that particular idea. And if the problem eventually got solved, and published under some pseudonym like Polymath, say, with a footnote linking to the blog and explaining how the problem had been solved, then anybody could go to the blog and look at all the comments. And there they would find your idea and would know precisely what you had contributed. There might be arguments about which ideas had proved to be most important to the solution, but at least all the evidence would be there for everybody to look at.

The open source world demonstrates this in action. You can see every single contribution a person has made to a project – the code, the conversations in online forums, and so on. There are even beautiful visualizations that let you see different people’s contributions to a project. As a result, it’s very difficult to fool people about the extent of your contributions. I’m sure people are sometimes dishonest about this, but I’ll bet they’re a lot more honest than some scientists are about what they contributed to some papers.

True, it might be quite hard to say on your CV, “I had an idea that proved essential to Polymath’s solution of the *** problem,” but if you made significant contributions to several collaborative projects of this kind, then you might well start to earn a reputation amongst people who read mathematical blogs, and that is likely to count for something. (Even if it doesn’t count for all that much now, it is likely to become increasingly important.) And it might not be as hard as all that to put it on your CV: you could think of yourself as a joint author, with the added advantage that people could find out exactly what you had contributed.

And what about the person who tries to cut and run when the project is 85 [percent] finished? Well, it might happen, but everyone would know that they had done it. The referee of the paper would, one hopes, say, “Erm, should you not credit Polymath for your crucial Lemma 13?” And that would be rather an embarrassing thing to have to do.

Now I don’t believe that this approach to problem solving is likely to be good for everything. For example, it seems highly unlikely that one could persuade lots of people to share good ideas about the Riemann hypothesis.

At present, this is undoubtedly true. However, if this sort of approach takes off and comes to be seen as a legitimate and orthodox way of making a contribution to mathematics, the kind of thing valued by (for example) hiring committees, then I think it may eventually be possible to do this for some of the more famous problems. There’s no intrinsic difference between sharing your ideas in a paper, or in a blog comment: a good idea is a good idea. The difference at present is mainly social: one is seen as legitimate, while the other is questionable.

At the other end of the scale, it seems unlikely that anybody would bother to contribute to the solution of a very minor and specialized problem. Nevertheless, I think there is a middle ground that might well be worth exploring, so as an experiment I am going to suggest a problem and see what happens.

I think it is important to do more than just say what the problem is. In order to try to get something started, I shall describe a very preliminary idea I once had for solving a problem that interests me (and several other people) greatly, but that isn’t the holy grail of my area. Like many mathematical ideas, mine runs up against a brick wall fairly quickly. However, like many brick walls, this one doesn’t quite prove that the approach is completely hopeless – just that it definitely needs a new idea.

It may be that somebody will almost instantly be able to persuade me that the idea is completely hopeless. But that would be great – I could stop thinking about it. And if that happens I’ll dig out another idea for a different problem and try that instead.

I’ve been toying for quite some time with doing something similar, though with problems from theoretical physics. I’ll be utterly fascinated to see the result of this experiment, and will certainly follow along. Not sure I’ll have much of mathematical interest to contribute, though – combinatorics is a long way from my expertise.

It’s probably best to keep this post separate from the actual mathematics, so that comments about collaborative problem-solving in general don’t get mixed up with mathematical thoughts about the particular problem I have in mind. So I’ll describe the project in my next post. Actually, make that my next post but one. The next post will say what the problem is and give enough background information about it to make it possible for anybody with a modest knowledge of combinatorics (or more than a modest knowledge) to think about it and understand my preliminary idea. The following post will explain what that preliminary idea is, and where it runs into difficulties. Then it will be over to you, or rather over to us. I’ve already written the background-information post, but will hold it back for a few days in case the responses to this post affect how I decide to do things.

The blog medium is almost certainly not optimal for this purpose, so if a serious discussion starts with lots of worthwhile contributions, then I’ll look into the possibility of migrating it over to some purpose-built site. If anyone has any suggestions for this (apart from the obvious one of using the Tricki – I’m not sure that’s appropriate just yet though) then I’d be delighted to receive them. My feelings at the moment are that blogs are too linear – it would be quite hard to see which comments relate to which, which ones are most worth reading, and so on. A wiki, on the other hand, seems not to be linear enough – it would be quite hard to see what order the comments come in. So my guess is that the ideal forum would probably be a forum: if someone knows an easy way to set up a mathematical forum, I might even do that. But if the discussion is on this blog, then I might from time to time try to assess where it has got to and create new posts if I feel that genuine progress has been made that can be summarized and then built on.

I’ve been thinking of doing this for a long time. The reason I’ve suddenly decided to go ahead is that I followed a couple of links from this post on Michael Nielsen’s blog, and discovered that, unsurprisingly, others have had similar ideas, and some people are already doing research in public. But the idea still seems pretty new, particularly when applied to one single mathematics problem, so I wanted to try it out when it was still fresh. (I would distinguish what I am proposing from what goes on at the n-category café, which is an excellent example of collaborative mathematics, but focused on an entire research programme rather than just one problem.)

To finish, here is a set of ground rules that I hope it will be possible to abide by. At this stage I’m just guessing what will work, so these rules are subject to change. If you can see obvious flaws let me know.

1. The aim will be to produce a proof in a top-down manner. Thus, at least to start with, comments should be short and not too technical: they would be more like feasibility studies of various ideas.

2. Comments should be as easy to understand as is humanly possible. For a truly collaborative project it is not enough to have a good idea: you have to express it in such a way that others can build on it.

Points 3-5 all concern norms of behaviour, and the problem of maintaining a civil tone:

3. When you do research, you are more likely to succeed if you try out lots of stupid ideas. Similarly, stupid comments are welcome here. (In the sense in which I am using “stupid”, it means something completely different from “unintelligent”. It just means not fully thought through.)

4. If you can see why somebody else’s comment is stupid, point it out in a polite way. And if someone points out that your comment is stupid, do not take offence: better to have had five stupid ideas than no ideas at all. And if somebody wrongly points out that your idea is stupid, it is even more important not to take offence: just explain gently why their dismissal of your idea is itself stupid.

5. Don’t actually use the word “stupid”, except perhaps of yourself.

Clay Shirky has pointed out that this problem – the problem of maintaining healthy conduct in online communities – has been around for decades, yet because no-one has synthesized all that is known, the same mistakes keep being made over and over (and over and over) again. The closest thing I know is a short blog post(!) from Teresa Nielsen Hayden, which is nice, but hardly comprehensive. Two suggestions:

Don’t allow anonymous posting. Forums which do seem inevitably to degenerate. At the least, people should use a consistent handle, and ideally they should be strongly encouraged to use their real name.
People who want the forum to thrive need to take ownership of social problems. If someone is behaving inappropriately, they should step up to the plate, and gently (at first) suggest alternate conduct. If someone’s behaving like an ass at a dinner party, you don’t leave it all on the host’s shoulders; you try to help out yourself, in whatever ways seem appropriate.

6. The ideal outcome would be a solution of the problem with no single individual having to think all that hard. The hard thought would be done by a sort of super-mathematician whose brain is distributed amongst bits of the brains of lots of interlinked people. So try to resist the temptation to go away and think about something and come back with carefully polished thoughts: just give quick reactions to what you read and hope that the conversation will develop in good directions.

At a talk last year by Mike Beltzner (who manages the development of the front-end for Firefox), he made a case that open-source projects where people went away and coded a lot on their own, only occasionally coming back to add big polished chunks, almost invariably failed.

7. If you are convinced that you could answer a question, but it would just need a couple of weeks to go away and try a few things out, then still resist the temptation to do that. Instead, explain briefly, but as precisely as you can, why you think it is feasible to answer the question and see if the collective approach gets to the answer more quickly. (The hope is that every big idea can be broken down into a sequence of small ideas. The job of any individual collaborator is to have these small ideas until the big idea becomes obvious – and therefore just a small addition to what has gone before.) Only go off on your own if there is a general consensus that that is what you should do.

8. Similarly, suppose that somebody has an imprecise idea and you think that you can write out a fully precise version. This could be extremely valuable to the project, but don’t rush ahead and do it. First, announce in a comment what you think you can do. If the responses to your comment suggest that others would welcome a fully detailed proof of some substatement, then write a further comment with a fully motivated explanation of what it is you can prove, and give a link to a pdf file that contains the proof.

9. Actual technical work, as described in 8, will mainly be of use if it can be treated as a module. That is, one would ideally like the result to be a short statement that others can use without understanding its proof.

If the project thrives, a wiki may be a good place to keep reference materials like this. It seems to be a pretty common pattern for big online collaborations to use a discussion forum to manage the basic conversation, and a wiki for reference materials. Initially, a wiki might not be necessary, and probably shouldn’t be added until there is real demand.

Some wiki software that seems pretty good for mathematical use is Instiki and TiddlyWiki. Instiki is very well suited for mathematical use; TiddlyWiki wasn’t so much designed for that purpose, but as you can see here seems to work pretty well in practice.

10. Keep the discussion focused. For instance, if the project concerns a particular approach to a particular problem (as it will do at first), and it causes you to think of a completely different approach to that problem, or of a possible way of solving a different problem, then by all means mention this, but don’t disappear down a different track.

11. However, if the different track seems to be particularly fruitful, then it would perhaps be OK to suggest it, and if there is widespread agreement that it would in fact be a good idea to abandon the original project (possibly temporarily) and pursue a new one – a kind of decision that individual mathematicians make all the time – then that is permissible.

I’m not sure what I think about this. It seems rather constraining – why not do some preliminary exploration of the alternate track, if it seems promising? I agree that it would be problematic if it distracted other people too much, but that seems like a problem that could be dealt with, probably in real time, if it comes up.

12. Suppose the experiment actually results in something publishable. Even if only a very small number of people contribute the lion’s share of the ideas, the paper will still be submitted under a collective pseudonym with a link to the entire online discussion.

A couple of final comments.

First, in many ways this (like most open source projects) seems to be primarily a community-building project. If you look at a successful open source-style project – Wikipedia, Linux, the Matlab competition, Kasparov versus the World – at the centre there is always a person who spends a great deal of time simply building and maintaining a healthy community of contributors. I can’t imagine this will be any different.

Second, the systems used need to be easily integrable into people’s workflow. I like the idea of starting on the blog, simply because many people are already in the habit of checking blogs. Migrations to other platforms will need to be handled carefully, to ensure that everyone does start using the new platform successfully. Providing things like RSS feeds or email update services might help greatly with this.

Biweekly links for 01/30/2009

Datawocky: More data usually beats better algorithms
The Dominance of Small Code Contributions
- Links to a study of a big open-source corpus, showing that small code contributions dominate by number (though not by total volume).
Academic Earth – Video lectures from the world’s top scholars
- “Thousands of video lectures from the world’s top scholars.” – Very interesting. Found a few problems, but this has potential.
BBC NEWS | Calls for open source government
- The new White House has asked Scott McNealy (Sun) to prepare a paper on open source.
Ruby on Rails on Vimeo
- A beautiful and informative visualization of Ruby on Rails commit history. Make sure to watch it in HD, in full-screen mode. After you’ve watched it for a bit, it’s worth skipping forward to 4:45 and watching the unbelievable explosion of activity that takes place when they moved to GitHub.
Open Access, Open Data. Open Research?
- Great summary talk about open science, from Cameron Neylon.
Is massively collaborative mathematics possible? « Gowers’s Weblog
- A fascinating post from Tim Gowers, with a plan for some action.
Winning the Gnu
- Microsoftie Joey deVilla buys a gnu from Richard Stallman. No animals were harmed in the making of this presentation…
Dive into Python 3
- New version of a classic introduction to Python, by Mark Pilgrim, adapted for Python 3. Just the table of contents at present, with the content to be gradually filled in.


Connecting scientists to scientists

I’ve been struggling for some time with a writing problem. This is the problem of finding a really sharp way of conveying one of the most powerful ideas of open science: all the untapped creative potential existing in latent connections between scientists, and which could be released using suitable tools to activate the most valuable of those latent connections. I’ve discussed this idea in previous essays, but something was always lacking. In this post I take another shot at it, this time confronting the problem head on.

A fact of any scientist’s life is that you carry a lot of unsolved problems around in your head. Some of those problems are big (“find a quantum theory of gravity”), some of them are small (“where’d that damned minus sign disappear in my calculation?”), but all are grist for future progress. Mostly, it’s up to you to solve those problems yourself. If you’re lucky, you might also have a few supportive colleagues who can sometimes help you out.

Very occasionally, though, you’ll solve a problem in a completely different way. You’ll be chatting with a new acquaintance, when one of your problems (or something related) comes up. You’re chatting away when all of a sudden, BANG, you realize that this is just the right person to be talking to. Maybe they can just outright solve your problem. Or maybe they give you some crucial insight that provides the momentum needed to vanquish the problem.

Every working scientist recognizes this type of fortuitous, serendipitous interaction. The problem is that such interactions occur too rarely.

A few years ago, I started participating in various open source forums. Over time, I noticed something surprising going on in the healthiest of those forums. When people had a problem that was bugging them, rather than keeping silent about it, they’d post a description of the problem to the forum. Often, I’d look at their question and think to myself “yeah, I can see why they posted, that looks like a tough problem.” Then, forty minutes later, someone would come in and say “Oh, that’s easy, you just do X, Y, and Z”. Very often, X, Y and Z were quite ingenious, or at the least relied on knowledge that neither I nor the original questioner possessed. The original problem had been trivial all along.

What’s going on is similar to the fortuitous scientific exchange. A problem that’s difficult or impossible for most people can be trivial or routine to just the right person. But what was interesting and surprising about the open source forums was this: it seemed to be happening all the time. People who I’d never heard of would pop up, ask an interesting question, then someone else I’d never heard of would pop up, and provide an insightful answer. It didn’t happen every time, but it was happening over and over again.

A big “ahah!” moment for me occurred when I understood what was going on. By scaling up the creative conversation, those open source projects were providing a systematic mechanism that enabled people to find other people with just the right expertise to make their problem easy. Most of us spend much of our time stymied by problems that would be routine, if only we could find the right person to help us. As recently as 20 years ago, finding that right person was likely to be difficult. But what open source forums show is that it is possible to scale up conversation in this way, and significantly increase the likelihood of such serendipitous interaction.

Needless to say, scientists mostly don’t work this way. Many skeptics of open science say they never could, that scientists will forever be unwilling to share their problems and ideas in the way necessary to make this work. For the present post, it’s fine if you hold that position, for my purpose here isn’t to discuss the practicality of doing this. That’s a post for another day.

The question I’m concerned with is, instead, what is lost because we don’t do this? How much do we lose because so many scientists waste their time struggling with problems that some other scientist would find entirely routine?

I don’t know how to answer these questions quantitatively. What I do know is that as a practicing scientist, much of my time was spent working on problems that were hard for me, yet which I absolutely knew would be routine for someone else. The time I spent working on such problems was time lost to the whole scientific enterprise. Yet the tools and culture of science were such that I couldn’t easily outsource those problems to a person with a comparative advantage over me. When I talk about topics like restructuring expert attention, collaboration markets and open source research, this is what I’m talking about: tools and norms which allow us to trade in expert attention, and so to concentrate in areas where we have a comparative advantage.

Now, there are many caveats to this story. Most open source projects fail. Many problems – including many of the “big problems” of science – are intrinsically non-routine, and it may be extremely difficult to identify who (if anyone) has a comparative advantage in solving such problems. Furthermore, even for routine problems, there may be considerable intrinsic transaction costs associated with trade in expert attention – finding a common language, coming to a common understanding of the problem, and so on. The market for a problem may be thin (“find the screwdriver yourself!”) – for example, many of the problems facing benchtop experimentalists are problems exclusive to their own laboratories. Finally, finding ways to successfully scale up scientific conversation is not at all trivial. These are all important caveats, deserving extended discussion in their own right. Despite this, I believe the key idea – developing tools to aggregate information about comparative advantage, and to connect people who might benefit from a trade in attention – is worth taking seriously.

I started this post off with a discussion of the difficulty of describing what I believe is a latent potential for discovery within the scientific community. As I finish the post off, I must say that the post falls short of the strength and sharpness I’d like. What’s really needed is a detailed example that shows the mechanics of open source in action: how the dynamic division of labour actually works in a successful open source project. At present, so far as I’m aware there are no really successful examples within science; the culture of science remains too closed. There are, however, some extremely encouraging nascent examples, like open notebook science, and open source biology, and one day hopefully these and others will bloom.

The Logic of Collective Action

It is a curious fact that one of the seminal works on open culture and open science was published in 1965 (2nd edition 1971), several decades before the modern open culture and open science movements began in earnest. Mancur Olson’s book “The Logic of Collective Action” is a classic of economics and political science, a classic that contains much of interest for people interested in open science.

At the heart of Olson’s book is a very simple question: “How can collective goods be provided to a group?” Here, a “collective good” is something that all participants in the group desire (though possibly unevenly), and that, by its nature, is inherently shared between all the members of the group.

For example, airlines may collectively desire a cut in airport taxes, since such a cut would benefit all airlines. Supermarkets may collectively desire a rise in the market price of bread; such a rise would be, to them, a collective good, since it would be by its nature shared. Most of the world’s countries desire a stable climate, even if they are not necessarily willing to individually take the action necessary to ensure a stable climate. Music-lovers desire a free and legal online version of the Beatles’ musical repertoire. Scientists desire shared access to scientific data, e.g., from the Sloan Digital Sky Survey or the Allen Brain Atlas.

What Olson shows in the book is that although all parties in a group may strongly desire and benefit from a particular collective good (e.g., a stable climate), under many circumstances they will not take individual action to achieve that collective good. In particular, they often find it in their individual best interest to act against their collective interest. The book has a penetrating analysis of what conditions can cause individual and collective interests to be aligned, and what causes them to be out of alignment.

The notes in the present essay are much more fragmented than my standard essays. Rather than a single thesis, or a few interwoven themes, these are more in the manner of personal working notes, broken up into separate fragments, each one exploring some idea presented by Olson, and explaining how (if at all) I see it relating to open science. I hope they’ll be of interest to others who are interested in open science. I’m very open to discussion, but please do note that what I present here is a greatly abbreviated version (and my own interpretation) of what is merely part of what Olson wrote, omitting many important caveats that he discusses in detail; for the serious open scientist, I strongly recommend reading Olson’s book, as well as some of the related literature.

Why individuals may not act to obtain a collective good: Consider a situation in which many companies are all producing some type of widget, with each company’s product essentially indistinguishable from that being produced by the other companies. Obviously, the entire group of companies would benefit from a rise in the market price of the widget; such a rise would be for them a collective good. One way that price could rise would be for the supply of the widget to be restricted. Despite this fact, it is very unlikely that any single company will act on their own to restrict their supply of widgets, for their restriction of supply is likely to have a substantial negative impact on their individual profit, but a negligible impact on the market price.

This analysis is surprisingly general. As a small player in a big pond, why voluntarily act to provide a collective good, when your slice of any benefit will be quite small (e.g., due to an infinitesimal rise in prices), but the cost to you is quite large? A farmer who voluntarily restricted output to cause a rise in the price of farm products (a collective good for farmers) would be thought a loon by their farming peers, because of (not despite) their altruistic behaviour. Open scientists will recognize a familiar problem: a scientist who voluntarily shares their best ideas and data (making it a collective good for scientists) in a medium that is not yet regarded as scientifically meritorious does not do their individual prospects any good. One of the major questions of open science is how to obtain this collective good.

Small groups and big players: Olson points out that the analysis of the last two paragraphs fails to hold in the case of small groups, or in any situation where there are one or more “big players”. To see this, let’s return to the case of a restriction in supply leading to a rise in market price. Suppose a very large company decides to restrict supply of a good, perhaps causing a drop in supply of 1 percent. Suppose that the market responds with a 4 percent rise in price. Provided the company has greater than one quarter market share, the result will actually be an increase in profitability for the company. That is, in this case the company’s individual interest and the collective interest are aligned, and so the collective interest can be achieved through voluntary action on the part of the company.
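To make the arithmetic concrete, here is a minimal Python sketch of the example above. It rests on simplifying assumptions of my own, not Olson's: the large company absorbs the entire cut in supply, and revenue stands in for profitability (production costs are ignored). The function names `revenue_change` and `break_even_share` are hypothetical, chosen just for illustration.

```python
def revenue_change(market_share, supply_drop=0.01, price_rise=0.04):
    """Fractional change in one company's revenue when it alone cuts output.

    Assumption: the company absorbs the whole cut, so a drop of
    `supply_drop` in total supply is a drop of supply_drop / market_share
    in the company's own output. Costs are ignored.
    """
    own_output_factor = 1.0 - supply_drop / market_share
    return own_output_factor * (1.0 + price_rise) - 1.0


def break_even_share(supply_drop=0.01, price_rise=0.04):
    """Market share at which the cut leaves revenue exactly unchanged.

    Solves (1 - d/s) * (1 + p) = 1 for s, giving s = d * (1 + p) / p.
    """
    return supply_drop * (1.0 + price_rise) / price_rise


if __name__ == "__main__":
    for share in (0.10, 0.25, 0.30, 0.50):
        print(f"share {share:.0%}: revenue change {revenue_change(share):+.2%}")
    print(f"break-even share: {break_even_share():.0%}")
```

Under these assumptions the break-even share works out to d(1 + p)/p, which is a little above one quarter for a 1 percent cut and a 4 percent price rise; the "one quarter" figure is the first-order approximation d/p. If cutting output also cuts costs, the threshold for an increase in profit would be lower than this revenue-only estimate.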

This argument obviously holds only if one actor is sufficiently large that the benefit they reap from the collective good is sufficient, on its own, to justify their action. Furthermore, the fact that the large company takes this action by no means ensures that smaller companies will engage in the same action on behalf of the collective good, although the smaller companies will certainly be happy to reap the benefit of the larger company’s actions; Olson speaks, for this reason, of an “exploitation of the great by the small”. Indeed, notice that the impact of this strategy is to cause the market share of the large company to shrink slightly, moving them closer to a world in which their individual benefit from collective action no longer justifies voluntary action on their part. (This shrinkage in market share also acts as a disincentive for them to act initially, despite the fact that in the short run profits will rise; this is a complication I won’t consider here.)

A closely related example may be seen in open source software. Many large companies – perhaps most famously, IBM and Sun – invest enormous quantities of money in open source software. Why do they provide this collective good for programmers and (sometimes) consumers? The answer is not as simple as the answer given in the last paragraph, because open source software is not a pure collective good. Many companies (including IBM and Sun) have developed significant revenue streams associated with open source, and they may benefit in other ways – community goodwill, and the disruption to the business models of competitors (e.g., Microsoft). Nonetheless, it seems likely that at least part of the reason they pour resources into open source is that purchasing tens of thousands of Windows licenses each year costs a company like IBM millions or tens of millions of dollars. At that scale, they can benefit substantially by instead putting that money to work making Linux better, and then using Linux for their operating system needs; the salient point is that because of IBM’s scale, it’s a large enough sum of money that they can expect to significantly improve Linux.

There is a similarity to some of the patterns seen in open data. Many open data projects are very large projects. I would go so far as to speculate that a quite disproportionate fraction of open data projects are very large projects – out of at most hundreds (more likely dozens) of projects funded at the one hundred million dollar plus level, I can think offhand of several that have open data; I’d be shocked if a similar percentage of “small science” experiments have open data policies.

Why is this the case? A partial explanation may be as follows. Imagine you are heading a big multi-institution collaboration that’s trying to get a one hundred million dollar experiment funded. You estimate that adopting an open data policy will increase your chances by three percent – i.e., it’s worth about 3 million dollars to your project. (I doubt many people really think quite this way, but in practice it probably comes to the same thing.) Now, making the data publicly available will increase the chances of outsiders “scooping” members of the collaboration. But the chance of this happening for any single member of the collaboration is rather small, especially if there is a brief embargo period before data is publicly released. By contrast, for a small experiment run in a single lab, the benefits of open data are much smaller, but the costs are comparable.

This analysis can be slotted into a more sophisticated three-part analysis. First, the person running the collaboration often isn’t concerned about being scooped themselves. This isn’t always true, but it is often true, for the leader or leaders of such projects often become more invested in the big picture than they are in making individual discoveries. They will instead tend to view any discovery from data produced by the project as a victory for the project, regardless of who actually makes the discovery. To the extent that the leadership is unconcerned about being scooped, they therefore have every incentive to go for open data. Second, if someone wants to join the collaboration, while they have reservations about an open data policy, they may also feel that it is worth giving up exclusive rights over data in exchange for a more limited type of exclusive access to a much richer data set. Third, as I argued in the previous paragraph, the trade-offs involved in open data are in any case more favourable for large collaborations than they are in small experiments.

Olson’s analysis suggests asking whether it might be easier to transition to a more open scientific culture in small, relatively close-knit research communities? If a community has only a dozen or so active research groups, might a few of those groups decide to “go open”, and then perhaps convince their peers to do so as well? With passionate, persuasive and generous leadership maybe this would be possible.

When is collective action possible? Roughly speaking, Olson identifies the following possibilities:

When it is made compulsory. This is the case in many trade unions, with Government taxes, and so on.
When social pressure is brought to bear. This is usually more effective in small groups that are already bound by a common interest. With suitable skills, it can also have an impact in larger groups, but this is usually much harder to achieve.
When it is in people’s own best interests, and so occurs voluntarily. Olson argues that this mostly occurs in small groups, and that there is a tendency for “exploitation of the great by the small”. More generally, he argues that in a voluntary situation while some collective action may take place, the level is usually distinctly suboptimal.
When people are offered some other individual incentive. Olson offers many examples: one of the more amusing was the report that some trade unions spend more than ten percent of their budget on Christmas parties, simply to convince their members that membership is worthwhile.

Many of these ideas will already be familiar in the context of open science. Compulsion can be used to force people to share openly, as in the NIH public access policy. Alternately, by providing ways of measuring scientific contributions made in the open, it is possible to incentivize researchers to take a more open approach. This has contributed to the success of the preprint arXiv, with citation services such as Citebase making it straightforward to measure the impact a preprint is having.

This use of incentives means that the provision of open data (and other open knowledge) can gradually change from being a pure collective good to being a blend of a collective and a non-collective good. It becomes non-collective in the sense that the individual sharing the data derives some additional (unshared) benefit due to the act of sharing.

A similar transition occurred early in the history of science. As I have told elsewhere, early scientists such as Galileo, Hooke and Newton often went to great lengths to avoid sharing their scientific discoveries with others. They preferred to hoard their discoveries, and continue working in secret. The reason, of course, was that at the time shared results were close to a pure collective good; there was little individual incentive to share. With the introduction of the journal system, and the gradual professionalization of science, this began to change, with individuals having an incentive to share. Of course, that change only occurred very gradually, over a period of many decades. Nowadays, we take the link between publication and career success for granted, but that was something early journal editors (and others) had to fight for.

Similarly, online media are today going through a grey period. For example, a few years back, blogging was in many ways quite a disreputable activity for a scientist, fine for a hobby, but certainly not seen as a way of making a serious scientific contribution. It’s still a long way from being mainstream, but I think there are many signs that it’s becoming more accepted. As this process continues, online open science will shift from being a pure collective good to being a blend of a collective and non-collective good. As Olson suggests, this is a good way to thrive!

So, what use are networked tools for science? I’m occasionally asked: “If networked tools are so good for science, why haven’t we seen more aggressive adoption of those tools by scientists? Surely that shows that we’ve already hit the limits of what can be done, with email, Skype, and electronic journals?” Underlying this question is a presumption, the presumption that if the internet really has the potential to be as powerful a tool for science as I and others claim, then surely we scientists would have gotten together already to achieve it. More generally, it’s easy to presume that if a group of people (e.g., scientists) have a common goal (advancing science), then they will act together to achieve that goal. What’s important about Olson’s work is that it comprehensively shows the flaws in this argument. A group of people may all benefit greatly from some collective action, yet be unable to act together to achieve it. Olson shows that far from being unusual, this is in many ways to be expected.

Doing science online

This post is the text for an invited after-dinner talk about doing science online, given at the banquet for the Quantum Information Processing 2009 conference, held in Santa Fe, New Mexico, January 12-16, 2009.

Good evening.

Let me start with a few questions. How many people here tonight know what a blog is?

How many people read blogs, say once every week or so, or more often?

How many people actually run a blog themselves, or have contributed to one?

How many people read blogs, but won’t admit it in polite company?

Let me show you an example of a blog. It’s a blog called What’s New, run by UCLA mathematician Terence Tao. Tao, as many of you are probably aware, is a Fields Medal-winning mathematician. He’s known for solving many important mathematical problems, but is perhaps best known as the co-discoverer of the Green-Tao theorem, which proved the existence of arbitrarily long arithmetic progressions of primes.

Tao is also a prolific blogger, writing, for example, 118 blog posts in 2008. Popular stereotypes to the contrary, he’s not just sharing cat pictures with his mathematician buddies. Instead, his blog is a firehose of mathematical information and insight. To understand how valuable Tao’s blog is, let’s look at an example post, about the Navier-Stokes equations. As many of you know, these are the standard equations used by physicists to describe the behaviour of fluids, i.e., inside these equations is a way of understanding an entire state of matter.

The Navier-Stokes equations are notoriously difficult to understand. People such as Feynman, Landau, and Kolmogorov struggled for years attempting to understand their implications, mostly without much success. One of the Clay Millennium Prize problems is to prove the existence of a global smooth solution to the Navier-Stokes equations, for reasonable initial data.

Now, this isn’t a talk about the Navier-Stokes equations, and there’s far too much in Terry Tao’s blog post for me to do it justice! But I do want to describe some of what the post contains, just to give you the flavour of what’s possible in the blog medium.

Tao begins his post with a brief statement explaining what the Clay Millennium Problem asks. He shares the interesting tidbit that in two spatial dimensions the solution to the problem is known(!), and asks why it’s so much harder in three dimensions. He tells us that the standard answer is turbulence, and explains what that means, but then says that he has a different way of thinking about the problem, in terms of what he calls supercriticality. I can’t do his explanation justice here, but very roughly, he’s looking for invariants which can be used to control the behaviour of solutions to the equations at different length scales. He points out that all the known invariants give weaker and weaker control at short length scales. What this means is that the invariants give us a lot of control over solutions at long length scales, where things look quite regular, but little control at short length scales, where you see the chaotic variation characteristic of turbulence. He then surveys all the known approaches to proving global existence results for nonlinear partial differential equations – he says there are just three broad approaches – and points out that supercriticality is a pretty severe obstruction if you want to use one of these approaches.

The post has loads more in it, so let me speed this up. He describes the known invariants for the equations, and what they can be used to prove. He surveys and critiques existing attempts on the problem. He makes six suggestions for ways of attacking the problem, including one which may be interesting to some of the people in this audience: he suggests that pseudorandomness, as studied by computer scientists, may be connected to the chaotic, almost random behaviour that is seen in the solutions of the Navier-Stokes equations.

The post is filled to the brim with clever perspective, insightful observations, ideas, and so on. It’s like having a chat with a top-notch mathematician, who has thought deeply about the Navier-Stokes problem, and who is willingly sharing their best thinking with you.

Following the post, there are 89 comments. Many of the comments are from well-known professional mathematicians, people like Greg Kuperberg, Nets Katz, and Gil Kalai. They bat the ideas in Tao’s post backwards and forwards, throwing in new insights and ideas of their own. It spawned posts on other mathematical blogs, where the conversation continued.

That’s just one post. Terry Tao has hundreds of other posts, on topics like Perelman’s proof of the Poincaré conjecture, quantum chaos, and gauge theory. Many posts contain remarkable insights, often related to open research problems, and they frequently stimulate wide-ranging and informative conversations in the comments.

That’s just one blogger. There are, of course, many other top-notch mathematician bloggers. Cambridge’s Tim Gowers, another Fields Medallist, also runs a blog. Like Tao’s blog, it’s filled with interesting mathematical insights and conversation, on topics like how to use Zorn’s lemma, dimension arguments in combinatorics, and a thought-provoking post on what makes some mathematics particularly deep.

Alain Connes, another Fields Medallist, is also a blogger. He only posts occasionally, but when he does his posts are filled with interesting mathematical tidbits. For example, I greatly enjoyed this post, where he talks about his dream of solving one of the deepest problems in mathematics – the problem of proving the Riemann Hypothesis – using non-commutative geometry, a field Connes played a major role in inventing.

Berkeley’s Richard Borcherds, another Fields Medallist, is also a blogger, although he is perhaps better described as an ex-blogger, as he hasn’t updated in about a year.

I’ve picked on Fields Medallists, in part because at least four of the 42 living Fields Medallists have blogs. But there are also many other excellent mathematical blogs, including blogs from people closely connected to the quantum information community, like Scott Aaronson, Dave Bacon, Gil Kalai, and many others.

Let me make a few observations about blogging as a medium.

It’s informal.

It’s rapid-fire.

Many of the best blog posts contain material that could not easily be published in a conventional way: small, striking insights, or perhaps general thoughts on approach to a problem. These are the kinds of ideas that may be too small or incomplete to be published, but which often contain the seed of later progress.

You can think of blogs as a way of scaling up scientific conversation, so that conversations can become widely distributed in both time and space. Instead of just a few people listening as Terry Tao muses aloud in the hall or the seminar room about the Navier-Stokes equations, why not have a few thousand talented people listen in? Why not enable the most insightful to contribute their insights back?

You can also think of blogs as a way of making scientific conversation searchable. If you type “Navier-Stokes problem” into Google, the third hit is Terry Tao’s blog post about it. That means future mathematicians can easily benefit from his insight, and that of his commenters.

You might object that the most important papers about the Navier-Stokes problem should show up first in the search. There is some truth to this, but it’s not quite right. Rather, insofar as Google is doing its job well, the ranking should reflect the importance and significance of the respective hits, regardless of whether those hits are papers, blog posts, or some other form. If you look at it this way, it’s not so surprising that Terry Tao’s blog post is near the top. As all of us know, when you’re working on a problem, a good conversation with an insightful colleague may be worth as much as (and sometimes more than) reading the classic papers. Furthermore, as search engines become better personalized, the search results will better reflect your personal needs; in a search utopia, if Terry Tao’s blog post is what you most need to see, it’ll come up first, while if someone else’s paper on the Navier-Stokes problem is what you most need to see, then that will come up first.

I’ve started this talk by discussing blogs because they are familiar to most people. But ideas about doing science in the open, online, have been developed far more systematically by people who are explicitly doing open notebook science. People such as Garrett Lisi are using mathematical wikis to develop their thinking online; Garrett has referred to the site as “my brain online”. People such as chemists Jean-Claude Bradley and Cameron Neylon are doing experiments in the open, immediately posting their results for all to see. They’re developing ideas like lab equipment that posts data in real time, posting data in formats that are machine-readable, enabling data mining, automated inference, and other additional services.

Stepping back, what tools like blogs, open notebooks and their descendants enable is filtered access to new sources of information, and to new conversation. The net result is a restructuring of expert attention. This is important because expert attention is the ultimate scarce resource in scientific research, and the more efficiently it can be allocated, the faster science can progress.

How many times have you been obstructed in your research by the need to prove or disprove a small result that is a little outside your core expertise, and so would take you days or weeks, but which you know, of a certainty, the right person could resolve in minutes, if only you knew who that person was, and could easily get their attention? This may sound like a fantasy, but if you’ve worked on the right open source software projects, you’ll know that this is exactly what happens in those projects – discussion forums for open source projects often have a constant flow of messages posing what seem like tough problems; quite commonly, someone with a great comparative advantage quickly posts a clever way to solve the problem.

If new online tools offer us the opportunity to restructure expert attention, then how exactly might it be restructured? One of the things we’ve learnt from economics is that markets can be remarkably effective ways of efficiently allocating scarce resources. I’ll talk now about an interesting market in expert attention that has been set up by a company named InnoCentive.

To explain InnoCentive, let me start with an example involving an Indian not-for-profit called the ASSET India Foundation. ASSET helps at-risk girls escape the Indian sex industry, by training them in technology. To do this, they’ve set up training centres in several large cities across India. They’ve received many requests to set up training centres in smaller towns, but many of those towns don’t have the electricity needed to power technologies like the wireless routers that ASSET uses in its training centres.

On the other side of the world, in the town of Waltham, just outside Boston, is the company InnoCentive. InnoCentive is, as I said, an online market in expert attention. It enables companies like Eli Lilly and Procter & Gamble to pose “Challenges” over the internet, scientific research problems they’d like solved, with a prize for solution, often many thousands of dollars. Anyone in the world can download a detailed description of the Challenge, and attempt to win the prize. More than 160,000 people from 175 countries have signed up for the site, and prizes for more than 200 Challenges have been awarded.

What does InnoCentive have to do with ASSET India? Well, ASSET got in touch with the Rockefeller Foundation, and explained their desire for a low-cost solar-powered wireless router. Rockefeller put up $20,000 in prize money to post an InnoCentive Challenge to design a suitable wireless router. The Challenge was posted for two months at InnoCentive. 400 people downloaded the Challenge, and 27 people submitted solutions. The prize was awarded to a 31-year-old Texan software engineer named Zacary Brown, who delivered exactly the kind of design that ASSET was looking for; a prototype is now being built by engineering students at the University of Arizona.

Let’s come back to the big picture. These new forms of contribution – blogs, wikis, online markets and so forth – might sound wonderful, but you might reasonably ask whether they are a distraction from the real business of doing science? Should you blog, as a young postdoc trying to build up a career, rather than writing papers? Should you contribute to Wikipedia, as a young Assistant Professor, when you could be writing grants instead? Crucially, why would you share ideas in the manner of open notebook science, when other people might build on your ideas, maybe publishing papers on the subjects you’re investigating, but without properly giving you credit?

In the short term, these are all important questions. But I think a lot of insight into these questions can be obtained by thinking first of the long run.

At the beginning of the 17th century, Galileo Galilei constructed the first astronomical telescope, looked up at the sky, and turned his new instrument to Saturn. He saw, for the first time in human history, Saturn’s astonishing rings. Did he share this remarkable discovery with the rest of the world? He did not, for at the time that kind of sharing of scientific discovery was unimaginable. Instead, he announced his discovery by sending a letter to Kepler and several other early scientists, containing a Latin anagram, “smaismrmilmepoetaleumibunenugttauiras”. When unscrambled this may be translated, roughly, as “I have discovered Saturn three-formed”. The reason Galileo announced his discovery in this way was so that he could establish priority, should anyone after him see the rings, while avoiding revealing the discovery.

Galileo could not imagine a world in which it made sense for him to freely share a discovery like the rings of Saturn, rather than hoarding it for himself. Certainly, he couldn’t share the discovery in a journal article, for the journal system was not invented until more than 20 years after Galileo died. Even then, journals took decades to establish themselves as a legitimate means of sharing scientific discoveries, and many early scientists looked upon journals with some suspicion. The parallel to the suspicion many scientists have of online media today is striking.

Think of all the knowledge we have, which we do not share. Theorists hoard clever observations and questions, little insights which might one day mature into a full-fledged paper. Entirely understandably, we hoard those insights against that day, doling them out only to trusted friends and close colleagues. Experimentalists hoard data; computational scientists hoard code. Most scientists, like Galileo, can’t conceive of a world in which it makes sense to share all that information, in which sharing information on blogs, wikis, and their descendants is viewed as being (potentially, at least) an important contribution to science.

Over the short term, things will only change slowly. We are collectively very invested in the current system. But over the long run, a massive change is, in my opinion, inevitable. The advantages of change are simply too great.

There’s a story, almost certainly apocryphal, that the physicist Michael Faraday was approached after a lecture by Queen Victoria, and asked to justify his research on electricity. Faraday supposedly replied “Of what use is a newborn baby?”

Blogs, wikis, open notebooks, InnoCentive and the like aren’t the end of online innovation. They’re just the beginning. The coming years and decades will see far more powerful tools developed. We really will enormously scale up scientific conversation; we will scale up scientific collaboration; we will, in fact, change the entire architecture of expert attention, developing entirely new ways of navigating data, making connections and inferences from data, and making connections between people.

When we look back at the second half of the 17th century, it’s obvious that one of the great changes of the time was the invention of modern science. When historians look back at the early part of the twenty-first century, they will also see several major changes. I know many of you in this room believe that one of those changes will be related to the sustainability of how humans live on this planet. But I think there are at least two other major historical changes. The first is the fact that this is the time in history when the world’s information is being transformed from an inert, passive, widely separated state, and put into a single, unified, active system that can make connections, that brings that information alive. The world’s information is waking up.

The second of those changes, closely related to the first, is that we are going to change the way scientists work; we are going to change the way scientists share information; we are going to change the way expert attention itself is allocated, developing new methods for connecting people, for organizing people, for leveraging people’s skills. They will be redirected, organized, and amplified. The result will speed up the rate at which discoveries are made, not in one small corner of science, but across all of science.

Quantum information and computation is a wonderful field. I was touched and surprised by the invitation to speak tonight. I have, I think, never felt more honoured in my professional life. But, I trust you can understand when I say that I am also tremendously excited by the opportunities that lie ahead in doing science online.

Biweekly links for 01/26/2009

Building an Inverted Index with Hadoop and Pig « SquareCog’s SquareBlog
- “In this post, I present a (very) brief description of the Pig project and demonstrate how one can construct an inverted index from a collection of text files using just a few lines of PigLatin.
  Pig offers SQL-like data processing instructions (select, project, filter, group), while being both more flexible by allowing simple integration of user-defined functions, and more straightforward by allowing users to issue command proceduraly, rather than declaratively, as in SQL.”
Yahoo! Hadoop Tutorial
Comparison of biological wikis
- Andrew Su’s survey of biological wikis (if you click again it links through to a spreadsheet). Lots of very interesting data about number of edits, number of editors, etc.
Datawocky: The Real Long Tail: Why both Chris Anderson and Anita Elberse are Wrong

Click here for all of my del.icio.us bookmarks.

Biweekly links for 01/23/2009

Visual Wikipedia
- This works surprisingly well, showing visually what different Wikipedia articles are linked to. The example I’ve chosen is the open notebook science article; many others work very well also.
the physics arXiv blog » How Google’s PageRank predicts Nobel Prize winners
- The title is over the top, but the results from the paper are very interesting.
A New Kind of Big Science – Olivia Judson Blog – NYTimes.com
- Thoughtful piece on big science, citizen science, and the relationship between them, from Aaron Hirsh.
The Inner Ring, by C.S. Lewis
- “When you invite a middle-aged moralist to address you, I suppose I must conclude, however unlikely the conclusion seems, that you have a taste for middle-aged moralizing. I shall do my best to gratify it.” The essay is entertaining throughout; confused in a couple of places, and enlightening in others. Well worth the read.
Evaluating MapReduce for Multi-core and Multiprocessor Systems
Machine Learning (Theory) » Adversarial Academia
- Nice discussion of the idea that academia is a zero-sum game.
Controversial Tell-All Book Reveals Wrestling Fans Are Fake | The Onion
- Who knew?
European Commission » Report on the Copyright Law for Protection of Databases
- In the late 90s, the EU introduced a copyright law intended to protect some kinds of databases. This report is an evaluation of the impact of that law on innovation in the EU.
MediaWiki database schema
- Lovely visualization.
ISIS Biolab
- FriendFeed room aggregating some (all?) of Cameron Neylon’s open notebook activities
The Semantic Web in Action: Scientific American
Virtual conferences in Second Life « Buried Treasure

When can the long tail be leveraged?

In 2006, Chris Anderson, the editor-in-chief of Wired magazine, wrote a bestselling book about an idea he called the long tail. The long tail is nicely illustrated by the bookselling business. Until recently, the conventional wisdom in bookselling was to stock only bestsellers. But internet bookstores such as Amazon.com take a different approach, stocking everything in print. According to Anderson, about a quarter of Amazon’s sales come from the long tail of books outside the top 100,000 bestselling titles (see here for the original research). While books in the long tail don’t individually sell many copies, they greatly outnumber the bestsellers, and so what they lack in individual sales they make up in total sales volume.

The long tail attracted attention because it suggested a new business model, selling into the long tail. Companies like Amazon, Netflix, and Lulu have built businesses doing just that. It also attracted attention because it suggested that online collaborations like Wikipedia and Linux might be benefitting greatly from the long tail of people who contribute just a little.

The problem if you’re building a business or online collaboration is that it can be difficult to tell whether participation is dominated by the long tail or not. Take a look at these two graphs:

The first graph is an idealized graph of Amazon’s book sales versus the sales rank, [tex]r[/tex], of the book. The second graph is an idealized graph of the number of edits made by the [tex]r[/tex]th most prolific contributor to Wikipedia. Superficially, the two graphs look similar, and it’s tempting to conclude that both graphs have a long tail. In fact, the two have radically different behaviour. In this post I’ll describe a general-purpose test that shows that Amazon.com makes it (just!) into the long tail regime, but in Wikipedia contributions from the short head dominate. Furthermore, this difference isn’t just an accident, but is a result of design decisions governing how people find information and make contributions.

Let’s get into more detail about the specifics of the Amazon and Wikipedia cases, before turning to the big picture. The first graph above shows the function

[tex]a / r^{0.871},[/tex]

where [tex]a[/tex] is a constant of proportionality, and [tex]r[/tex] is the rank of the book. The exponent is chosen to be [tex]0.871[/tex] because as of 2003 that makes the function a pretty close approximation to the number of books sold by Amazon. For our analysis, it doesn’t much matter what the value of [tex]a[/tex] is, so we won’t worry about pinning it down. All the important stuff is contained in the [tex]r^{0.871}[/tex] in the denominator.

The second graph shows the function

[tex]a / r^{1.7}.[/tex]

As with the Amazon sales formula, the Wikipedia edit formula isn’t exact, but rather is an approximation. I extracted the formula from a blog post written by a researcher studying Wikipedia at the Xerox PARC Augmented Cognition Center. I mention this because they don’t actually determine the exponent 1.7 themselves – I backed it out from one of their graphs. Note that, as for the Amazon formula, [tex]a[/tex] is a constant of proportionality whose exact value doesn’t matter. There’s no reason the values of the Wikipedia [tex]a[/tex] and the Amazon [tex]a[/tex] should be the same; I’m using the same letter in both formulas simply to avoid a profusion of different letters.

(A little parenthetical warning: figuring out power law exponents is a surprisingly subtle problem. It’s possible that my estimate of the exponent in the last paragraph may be off. See, e.g., this paper for a discussion of some of the subtleties, and references to the literature. If someone with access to the raw data wants to do a proper analysis, I’d be interested to know the results. In any case, we’ll see that the correct value for the exponent would need to be wildly different from my estimate before it could make any difference to the qualitative conclusions we’ll reach.)
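To make the idea of "backing out" an exponent concrete, here is a minimal sketch of the naive approach: a straight-line least-squares fit on log-log axes. The function name `fit_exponent` and the synthetic, noise-free data are made up for illustration; this is not the procedure used for the estimate above, and on real, noisy data this kind of fit can be unreliable, which is the point of the warning.

```python
import math


def fit_exponent(values):
    """Estimate b in value = a / rank**b by least squares on log-log axes.

    values[0] is the value at rank 1, values[1] at rank 2, and so on.
    """
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(value) for value in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The slope of log(value) against log(rank) is -b.
    return -sxy / sxx


# Synthetic data that follows a / r**1.7 exactly (a = 1000 is arbitrary).
synthetic = [1000.0 / rank ** 1.7 for rank in range(1, 501)]
print(f"fitted exponent: {fit_exponent(synthetic):.4f}")
```

Because the synthetic data is an exact power law, the fit recovers the exponent it was built with; the value of [tex]a[/tex] only shifts the line up or down and does not affect the slope.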

Now suppose the total number of different books Amazon stocks in their bookstore is [tex]N[/tex]. We’ll show a bit later that the total number of books sold is given approximately by:

[tex]7.75 \times a \times N^{0.129}.[/tex]

The important point in this formula is that as [tex]N[/tex] increases the total number of books sold grows fairly rapidly. Double [tex]N[/tex] and you get a nearly ten percent increase in total sales. There’s a big benefit to being in the business of the long tail of books.
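As a rough numerical check, the sketch below sums [tex]a / r^{0.871}[/tex] directly and compares it with the leading-order estimate [tex]a N^{0.129} / 0.129[/tex]. The function names and the choice [tex]a = 1[/tex] are assumptions made for illustration. The direct sum and the estimate differ by a roughly constant offset, so the estimate is best read as a guide to how the total grows rather than as an exact count.

```python
B_AMAZON = 0.871  # exponent quoted above for Amazon's book sales


def power_law_total(b, n, a=1.0):
    """Sum a / r**b over ranks r = 1..n by direct summation."""
    return a * sum(r ** -b for r in range(1, n + 1))


def integral_estimate(b, n, a=1.0):
    """Leading-order estimate a * n**(1 - b) / (1 - b), meaningful for b < 1."""
    return a * n ** (1 - b) / (1 - b)


for n in (50_000, 100_000, 200_000):
    direct = power_law_total(B_AMAZON, n)
    estimate = integral_estimate(B_AMAZON, n)
    print(f"N={n:>7}  direct sum={direct:7.2f}  estimate={estimate:7.2f}")

growth = power_law_total(B_AMAZON, 200_000) / power_law_total(B_AMAZON, 100_000)
print(f"doubling N multiplies the direct sum by {growth:.3f}")
```

Either way of counting gives an increase of the order of ten percent each time [tex]N[/tex] doubles, which is the behaviour that matters here.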

Let’s move to the second graph, the number of Wikipedia edits. If the total number of editors is [tex]N[/tex], then we’ll show below that the total number of edits made is approximately

[tex]2.05 \times a - O\left( \frac{a}{N^{0.7}} \right).[/tex]

The important point here is that, in contrast to the Amazon example, as [tex]N[/tex] increases it makes little difference to the total number of edits made. In Wikipedia, the total number of edits is dominated by the short head of editors who contribute a great deal.
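One way to see the difference between the two regimes is to ask what fraction of the total comes from the top-ranked participants. The sketch below assumes the idealized power laws hold exactly across all ranks; the function name `head_share`, the population of 100,000 and the cut-off at the top 1,000 are arbitrary choices for illustration. The constant [tex]a[/tex] cancels out of the ratio, so it does not appear.

```python
def head_share(b, n, k):
    """Fraction of the sum of 1 / r**b over r = 1..n that comes from ranks 1..k."""
    weights = [r ** -b for r in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)


N = 100_000   # hypothetical number of books or editors
TOP = 1_000   # the "short head": the top 1% by rank

print(f"b = 1.7   (Wikipedia edits): top 1% share = {head_share(1.7, N, TOP):.3f}")
print(f"b = 0.871 (Amazon sales):    top 1% share = {head_share(0.871, N, TOP):.3f}")
```

With the larger exponent nearly all of the total sits in the head, while with the smaller exponent the remaining 99% of ranks account for more than half of it.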

A general rule to decide whether the long tail or the short head dominates

Let’s generalize the above discussion. We’ll find a simple general rule that can be used to determine whether the long tail or the short head dominates. Suppose the pattern of participation is governed by a power law distribution, with the general form

[tex]\frac{a}{r^b},[/tex]

where [tex]a[/tex] and [tex]b[/tex] are both constants. Both the Amazon and Wikipedia data can be described in this way, and it turns out that many other phenomena are described similarly – if you want to dig into this, I recommend the review papers on power laws by Mark Newman and Michael Mitzenmacher.

Let’s also suppose the total number of “participants” is [tex]N[/tex], where I use the term participants loosely – it might mean the total number of books on sale, the total number of contributors to Wikipedia, or whatever is appropriate to the situation. Our interest will be in summing the contributions of all participants.

When [tex]b < 1[/tex], the sum over all values of [tex]r[/tex] is approximately [tex]\frac{a N^{1-b}}{1-b}.[/tex] Thus, this case is tail-dominated, with the sum continuing to grow reasonably rapidly as [tex]N[/tex] grows. As we saw earlier, this is the case for Amazon's book sales, so Amazon really is a case where the long tail is in operation. When [tex]b = 1[/tex], the total over all values of [tex]r[/tex] is approximately [tex]a \log N.[/tex] This also grows as [tex]N[/tex] grows, but extremely slowly. It's really an edge case between tail-dominated and head-dominated. Finally, when [tex]b > 1[/tex], the total over all values of [tex]r[/tex] is approximately

[tex]a\zeta(b)-O\left(\frac{a}{N^{b-1}}\right),[/tex]

where [tex]\zeta(b)[/tex] is just a constant (actually, the Riemann zeta function, evaluated at [tex]b[/tex]), and the size of the corrections is of order [tex]a/N^{b-1}[/tex]. It follows that for large [tex]N[/tex] this approaches a constant value, and increasing the value of [tex]N[/tex] has little effect, i.e., this case is head-dominated. So, for example, it means that the great majority of edits to Wikipedia really are made by a small handful of dedicated contributors.
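For readers who want to see where the three cases come from, here is a sketch of the standard argument, approximating the sum by an integral. This is a textbook estimate rather than something spelled out in the post. For [tex]b \neq 1[/tex],

[tex]\sum_{r=1}^{N} \frac{a}{r^b} \approx a \int_1^N \frac{dr}{r^b} = \frac{a \left( N^{1-b} - 1 \right)}{1-b},[/tex]

while for [tex]b = 1[/tex] the same integral gives [tex]a \log N[/tex]. When [tex]b < 1[/tex] the [tex]N^{1-b}[/tex] term dominates for large [tex]N[/tex], leaving [tex]a N^{1-b} / (1-b)[/tex]; with [tex]b = 0.871[/tex] the prefactor is [tex]1 / 0.129 \approx 7.75[/tex]. When [tex]b > 1[/tex] the infinite sum converges to [tex]a \zeta(b)[/tex], and the part missing from the first [tex]N[/tex] terms is

[tex]\sum_{r > N} \frac{a}{r^b} \approx a \int_N^\infty \frac{dr}{r^b} = \frac{a}{(b-1) N^{b-1}},[/tex]

which shrinks as [tex]N[/tex] grows; with [tex]b = 1.7[/tex], [tex]\zeta(1.7) \approx 2.05[/tex].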

There is a caveat to all this discussion, which is that in the real world power laws are usually just an approximation. For many real world cases, the power law breaks down at the end of the tail, and at the very head of the distribution. The practical implication is that the quantitative values predicted by the above formula may be somewhat off. In practice, though, I don’t think this caveat much matters. Provided the great bulk of the distribution is governed by a power law, this analysis gives insight into whether it’s dominated by the head or by the tail.

Implications

If you’re developing a long tail business or collaboration, you need to make sure the exponent [tex]b[/tex] in the power law is less than one. The smaller the exponent, the better off you’ll be.

How can you make the exponent as small as possible? In particular, how can you make sure it’s smaller than the magic value of one? To understand the answer to this question, we need to understand what actually determines the value of the exponent. There are some nice simple mathematical models explaining how power laws emerge, and in particular how the power law exponent emerges. At some point in the future I’d like to come back and discuss those in detail, and what implications they have for site architecture. This post is already long enough, though, so let me make just three simple comments.

First, focus on developing recommendation and search systems which spread attention out, rather than concentrating it in the short head of what’s already popular. This is difficult to do without sacrificing quality, but there’s some interesting academic work now being done on such recommendation systems – see, for example, some of the work described in this recent blog post by Daniel Lemire.

Second, in collaborative projects, ensure a low barrier to entry for newcomers. One problem Wikipedia faces is a small minority of established Wikipedians who are hostile to new editors. It’s not common, but it is there. This drives newcomers away, and so concentrates edits within the group of established editors, effectively increasing the exponent in the power law.

Third, the essence of these and other similar recommendations is that they are systematic efforts to spread attention and contribution out, not one-off efforts toward developing a long tail of sales or contributions. The problem with one-off efforts is that they do nothing to change the systematic architectural factors which actually determine the exponent in the power law, and it is that exponent which is the critical factor.

The role of open licensing in open science

The open science movement encourages scientists to make scientific information freely available online, so other scientists may reuse and build upon that information. Open science has many striking similarities to the open culture movement, developed by people like Lawrence Lessig and Richard Stallman. Both movements share the idea that powerful creative forces are unleashed when creative artifacts are freely shared in a creative commons, enabling other people to build upon and extend those artifacts. The artifact in question might be a set of text documents, like Wikipedia; it might be open source software, like Linux; or open scientific data, like the data from the Sloan Digital Sky Survey, used by services such as Galaxy Zoo. In each case, open information sharing enables creative acts not conceived by the originators of the information content.

The advocates of open culture have developed a set of open content licenses, essentially a legal framework, based on copyright law, which strongly encourages and in some cases forces the open sharing of information. This open licensing strategy has been very successful in strengthening the creative commons, and so moving open culture forward.

When talking to some open science advocates, I hear a great deal of interest and enthusiasm for open licenses for science. This enthusiasm seems prompted in part by the success of open licenses in promoting open culture. I think this is great – with a few minor caveats, I’m a proponent of open licenses for science – but the focus on open licenses sometimes bothers me. It seems to me that while open licenses are important for open science, they are by no means as critical as they are to open culture; open access is just the beginning of open science, not the end. This post discusses to what extent open licenses can be expected to play a role in open scientific culture.

Open licenses and open culture

Let me review the ideas behind the licensing used in the open culture movement. If you’re familiar with the open culture movement, you’ll have heard this all before; if you haven’t, hopefully it’s a useful introduction. In any case, it’s worth getting all this fixed in our heads before addressing the connection to open science.

The obvious thing for advocates of open culture to do is to get to work building a healthy public domain: writing software, producing movies, writing books and so on, releasing all that material into the public domain, and encouraging others to build upon those works. They could then use a moral suasion argument to encourage others to contribute back to the public domain.

The problem is that many people and organizations don’t find this kind of moral suasion very compelling. Companies take products from the public domain, build upon them, and then, for perfectly understandable reasons, fiercely protect the intellectual property they produce. Disney was happy to make use of the old tale of Cinderella, but they take a distinctly dim view of people taking their Cinderella movie and remixing it.

People like Richard Stallman and Lawrence Lessig figured out how to add legal teeth to the moral suasion argument. Instead of relying on goodwill to get people to contribute back to the creative commons, they invented a new type of licensing that compels people to contribute back. There’s now a whole bunch of such open licenses – the various varieties of the GNU General Public License (GPL), Creative Commons licenses, and many others – with various technical differences between them. But there’s a basic idea of viral licensing that’s common to many (though not all) of the open licenses. This is the idea that anyone who extends a product released under such a license must release the extension under the same terms. Using such an open license is thus a lot like putting material into the public domain, in that both result in content being available in the creative commons, but the viral open licenses differ from the public domain in compelling people to contribute back into the creative commons.

The consequences of this compulsion are interesting. In the early days of open licensing, the creative commons grew slowly. As the amount of content with an open license grew, though, things began to change. This has been most obvious in software development, which was where viral open licenses first took hold. Over time it became more tempting for software developers to start development with an existing open source product. Why develop a new product from scratch, when you can start with an existing codebase? This means that you can’t use the most obvious business model – limit distribution to executable files, and charge for them – but many profitable open source companies have shown that alternate business models are possible. The result is that as time has gone on, even the most resolutely closed source companies (e.g., Microsoft) have found it difficult to avoid infection by open source. The result has been a gradually accelerating expansion of the creative commons, an expansion that has enabled extraordinary creativity.

Open licenses and open science

I’m not sure what role licensing will play in open science, but I do think there are some clear signs that it’s not going to be as central a role as it’s played in open culture.

The first reason for thinking this is that a massive experiment in open licensing has already been tried within science. By law, works produced by US Federal Government employees are, with some caveats, automatically put into the public domain. Every time I’ve signed a “Copyright Transfer” agreement with an academic journal, there’s always been in the fine print a clause excluding US Government employees from having to transfer copyright. You can’t give away what you don’t own.

This policy has greatly enriched the creative commons. And it’s led to enormous innovation – for example, I’ve seen quite a few mapping services that build upon US Government data, presumably simply because that data is in the public domain. But in the scientific realm I don’t get the impression that this is doing all that much to promote the growth of the kind of mass collaboration culture that open licenses are enabling in open culture.

(A similar discussion can be had about open access journals. The discussion there is more complex, though, because (a) many of the journals have only been open access for a few years, and (b) the way work is licensed varies a lot from journal to journal. That’s why I’ve focused on the US Government.)

The second reason for questioning the centrality of open licenses is the observation that the main barriers to remixing and extension of scientific content aren’t legal barriers. They are, instead, cultural barriers. If someone copies my work, as a scientist, I don’t sue them. If I were to do that, it’s in any case doubtful that the courts would do more than slap the violator on the wrist – it’s not as though they’ll directly make money. Instead, there’s a strong cultural prohibition against such copying, expressed through widely-held community norms about plagiarism and acceptable forms of attribution. If someone copies my work, the right way to deal with it is to inform their colleagues, their superiors, and so on – in short, to deal with it by cultural rather than legal means.

That’s not to say there isn’t a legal issue here. But it’s a legal issue for publishers, not individual scientists. Many journal publishers have business models which are vulnerable to systematic large-scale attempts to duplicate their content. Someone could, for example, set up a “Pirate Bay” for scientific journal articles, making the world’s scientific articles freely available. That’s something those journals have to worry about, for legitimate short-term business reasons, and copyright law provides them with some form of protection and redress.

My own opinion is that over the long run, it’s likely that the publishers will move to open access business models, and that will be a good thing for open science. I might be wrong about that; I can imagine a world in which that doesn’t happen, yet certain varieties of open science still flourish. Regardless of what you think about the future of journals, the larger point is that the legal issues around openness are only a small part of a much larger set of issues, issues which are mostly cultural. The key to moving to a more open scientific system is changing scientists’ hearts and minds about the value and desirability of more openly sharing information, not reforming the legal rights under which they publish content.

So, what’s the right approach to licensing? John Wilbanks has argued, persuasively in my opinion, that data should be held in the public domain. I’ve sometimes wondered if this argument shouldn’t be extended beyond data, to all forms of scientific content, including papers, provided (and this is a big “provided”) the publisher’s business interests can be met in a way that adequately serves all parties. After all, if the scientific community is primarily a reputation economy, built around cultural norms, then why not simply remove the complication of copyright from the fray?

Now, I should say that this is speculation on my part, and my thinking is incomplete on this set of issues. I’m most interested to hear what others have to say! I’m especially interested in efforts to craft open research licenses, like the license Victoria Stodden has been developing. But I must admit that it’s not yet clear to me why, exactly, we need such licenses, or what interests they serve.

The sincerest form of flattery…

… probably isn’t copying without attribution. Someone named Will Choi has apparently copied without attribution my essay about Shirky’s Law. (All links given a “nofollow” tag.) I’ve had this kind of copying done by spammers before, many times, but it’s a little more annoying to see on someone’s personal blog.
