Micropublication and open source research

This is an extract from my (very early) draft book on the way the internet is changing how science is done.

I would like to legitimize a new kind of proof: “The Incomplete Proof”. The reason that the output of mathematicians is so meager is that we only publish that tiny part of our work that ended up in complete success. The rest goes to the recycling bin. […] Why did [the great mathematician] Paul Cohen stop publishing at the age of 30? My guess is that he was trying, and probably still is, to prove [the Riemann Hypothesis]. I would love to be able to see his “failed attempts”. […] So here is my revolutionary proposal. Publish all your (good) thoughts and ideas, regardless of whether they are finished or not.
Doron Zeilberger

Imagine you are reading a research article. You notice a minor typo in the article, which you quickly fix, using a wiki-like editing system to create a new temporary “branch” of the article – i.e., a copy of the article, but with some modifications that you’ve made. The original authors of the article are notified of the branch, and one quickly contacts you to thank you for the fix. The default version of the article is now updated to point to your branch, and your name is automatically added to a list of people who have contributed to the article, as part of a complete version history of the article. This latter information is also collected by an aggregator which generates statistics about contributions, statistics which you can put into your curriculum vitae, grant applications, and so on.

Later on while reading, you notice a more serious ambiguity, an explanation that could be interpreted in several inconsistent ways. After some time, you figure out which explanation the authors intend, and prepare a corrected version of the article in a temporary branch. Once again, the original authors are notified. Soon, one contacts you with some queries about your fix, pointing out some subtleties that you’d failed to appreciate. After a bit of back and forth, you revise your branch further, until both you and the author agree that the result is an improvement on both the original article and on your first attempt at a branch. The author-approved default version of the article is updated to point to the improved version, and you are recognized appropriately for your contribution.

Still later, you notice a serious error in the article – maybe a flaw in the logic, or a serious omission that is material to the argument – which you don’t immediately see how to fix. You prepare a temporary branch of the article, but this time, rather than correcting the error, you insert a warning explaining the existence and the nature of the error, and how you think it affects the conclusions of the article.

Once again, the original authors are notified of your branch. This time they aren’t so pleased with your modifications. Even after multiple back and forth exchanges, and some further revisions on your part, they disagree with your assessment that there is an error. Despite this, you remain convinced that they are missing your point.

Believing that the situation is not readily resolvable, you create a more permanent branch of the article. Now there are two branches of the article visible to the public, with slightly differing version histories. Of course, these version histories are publicly accessible, and so who contributed what is a matter of public record, and there is no danger that there will be any ambiguity about the origins of the new material, nor about the origin of the disagreement between the two branches.

Initially, most readers look only at the original branch of the article, but a few look at yours as well. Favourable commentary and a gradual relative increase in traffic to your branch (made suitably visible to potential readers) encourages still more people to read your version preferentially. Your branch gradually becomes more highly visible, while the original fades. Someone else fixes the error you noticed, leading to your branch being replaced by a still further improved version, and still more traffic. After some months, reality sets in and the original authors come around to your point of view, removing their original branch entirely, leaving just the new improved version of the article. Alternatively, perhaps the original authors, alarmed by their diminution, decide to strike back with a revised version of their article, explaining in detail why you are wrong.

These stories illustrate a few uses of micropublication and open source research. These are simple ideas for research publication, but ones that have big consequences. The idea of micropublication is to enable publication in smaller increments and more diverse formats than in the standard scientific research paper. The idea of open source research is to open up the licensing model of scientific publication, providing more flexible ways in which prior work can be modified and re-used, while ensuring that all contributions are fully recognized and acknowledged.

Let’s examine a few more potential applications of micropublication and open source research.

Imagine you are reading an article about the principles of population control. As you read, you realize that you can develop a simulator which illustrates in a vivid visual form one of the main principles described in the article, and provides a sandbox for readers to play with and better understand that principle. After dropping a (favourably received) note to the authors, and a little work, you’ve put together a nice simulation. After a bit of back and forth with the authors, a link to your simulation is now integrated into the article. Anyone reading the article can now click on the relevant equation and will immediately see your simulation (and, if they like, the source code). A few months later, someone takes up your source code and develops the simulation further, improving the reader experience still further.

Imagine reading Einstein’s original articles on special relativity, and being able to link directly to simulations (or, even better, fully-fledged computer games) that vividly demonstrate the effects of length contraction, time dilation, and so on. In mathematical disciplines, this kind of content enhancement might even be done semi-automatically. The tools could gradually integrate the ability to make inferences and connections – “The automated reasoning software has discovered a simplification of Equation 3; would you like to view the simplification now?”

Similar types of content enhancement could, of course, be used in all disciplines. Graphs, videos, explanations, commentary, background material, data sets, source code, experimental procedures, links to wikipedia, links to other related papers, links to related pedagogical materials, talks, media releases – all these and more could be integrated more thoroughly into research publishing. Furthermore, rather than being second-class add-ons to “real” research publications, a well-designed citation and archival system would ensure that all these forms have the status of first-class research publications, raising their stature, and helping ensure that people put more effort into adding value in these ways.

Another use for open source research is more pedagogical in flavour. Imagine you are a student assigned to rewrite Einstein’s article on general relativity in the language of modern differential geometry. Think of the excitement of working with the master’s original text, fully inhabiting it, and then improving it still further! Of course, such an assignment is technologically possible even now. However, academia has strong cultural inhibitions against making such modifications to original research articles. I will argue that with properly authenticated archival systems these issues could be addressed, the inhibitions could be removed, and a world of new possibilities opened up.

Having discussed micropublication and open source research in concrete terms, let’s now describe them in more abstract terms, and briefly discuss some of the problems that must be overcome if they are to become viable modes of publication. More detailed resolutions to these problems will be discussed in a later post.

Micropublication does three things. First, it decreases the size of the smallest publishable unit of research. Second, it broadens the class of objects considered as first-class publishable objects so that it includes not just papers, but also items such as data, computer code, simulations, commentary, and so on. Third, it eliminates the barrier of peer review, a point we’ll come back to shortly. The consequence is to greatly reduce the friction slowing down the progress of the research community, by lowering the barriers to publication. Although promising, this lowering of the barriers to publication also creates three problems that must be addressed if the research community is to adopt the concept of micropublication.

The first problem is providing appropriate recognition for people’s contributions. This can be achieved through appropriate archival and citation systems, and is described in detail in a later post.

The second problem is quality assurance. The current convention in science is to filter content before publishing it through a system of peer review. In principle, this ensures that only the best research gets published in the top journals. While this system has substantial failures in practice, on the whole it has improved our access to high-quality research. To ensure similar quality, micropublication must use a publish-then-filter model which enables the highest quality research to be accurately identified. We will discuss the development of such filtering systems in a later post. Note, however, that publish-then-filter already works surprisingly well on the web, due to tools such as Google, which is capable of picking out high value webpages. Such filtering systems are far from perfect, of course, and there are serious obstacles to be overcome if this is to be a successful model.
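As a toy illustration of how such link-based filtering works, here is a minimal PageRank-style power iteration in Python. The three-page link graph is invented for the example, and this is only a sketch of the idea behind ranking by links, not Google’s actual implementation:

```python
# Toy PageRank: rank pages by repeatedly applying the "random surfer"
# update until the scores settle down. The link graph is hypothetical.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:  # dangling page: spread its rank evenly
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # "c" is linked to by both "a" and "b", so it ranks highest
```

The point of the sketch is that quality assessment here happens after publication: every page is "published", and the filter ranks rather than gates.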

The third problem is providing tools to organize and search through the mass of publication data. This is, in some sense, the flip side of the quality assurance problem, since it is also about organizing information in meaningful and useful ways, and there is considerable overlap in how these tools must work. Once again, we will discuss the development of these tools in a later post.

Open source research opens up the licensing model used in research publication so that people may make more creative reuse of existing work, and thus speed the process of research. It removes the cumbersome quote-and-cite licensing model currently used in science. That model makes sense when publishing on paper, but is unnecessary in electronic publication. Instead, it is replaced by a trustworthy authenticated archive of publication data which allows one to see an entire version history of a document, so that we can see who contributed what and when. This will allow people to rapidly improve, extend and enhance other people’s work, in all the ways described above.
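As a concrete sketch of what such an archive might record, here is a minimal Python illustration. The `Article` and `Revision` structures are hypothetical, invented for this example rather than taken from any existing system; the essential point is that attribution can be derived mechanically from the version history:

```python
# A minimal sketch of an article archive that records every revision,
# so "who contributed what, and when" is a matter of public record.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Revision:
    author: str
    summary: str
    parent: Optional[int] = None  # index of the revision branched from

@dataclass
class Article:
    title: str
    revisions: list = field(default_factory=list)

    def commit(self, author, summary, parent=None):
        """Record a revision; by default it extends the latest version."""
        if parent is None and self.revisions:
            parent = len(self.revisions) - 1
        self.revisions.append(Revision(author, summary, parent))
        return len(self.revisions) - 1

    def contributors(self):
        """Everyone who ever touched the article, in order of first edit."""
        seen = []
        for rev in self.revisions:
            if rev.author not in seen:
                seen.append(rev.author)
        return seen

article = Article("On widgets")
article.commit("Alice", "initial version")
article.commit("Bob", "fix typo in section 2")
fork = article.commit("Carol", "flag possible error in the main argument",
                      parent=0)  # a branch from the original version
print(article.contributors())  # ['Alice', 'Bob', 'Carol']
```

Even in this toy form, Carol’s dissenting branch and Bob’s small fix are both first-class, permanently attributed events in the article’s history.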

Academics have something of a horror of the informal re-use that I may appear to be advocating. The reason is that the principal currency of research is attention and reputation, not (directly) money. In such a system, not properly citing sources is taken very seriously; even very illustrious researchers have fallen from grace over accusations of plagiarism. For these reasons, it is necessary to design the archival system carefully to ensure that one can gain the benefits of a more informal licensing model, while still adequately recognizing people’s contributions.

Overarching and unifying all these problems is one main problem, the problem of migration, i.e., convincing researchers that it is in their best interest to move to the new system. How can this possibly be achieved? The most obvious implementations of micropublication and open source research will require researchers to give up their participation in the standard recognition system of science — the existing journal system. Such a requirement will undoubtedly result in the migration failing. Fortunately, I believe it is possible to find a migratory path which integrates and extends the standard recognition system of science in such a way that researchers have only positive incentives to make the migration. This path does not start with a single jump to micropublication and open source research, but rather involves a staged migration, with each stage integrating support for legacy systems such as citation and peer review, but also building on new systems that can take the place of the legacy systems, and which are better suited for the eventual goals of micropublication and open source research. This process is quite flexible, but involves many separate ideas, which will be described in subsequent posts.

Incentives

In the comments, Franklin writes on the subject of open source research:

On the other side of the coin, what would be the incentives for contributing to other people’s research?

This is an excellent question. Generalizing, any proposed change to how people do research, collaborate, or publish must face the question: what are the incentives to participate in the change? One must find a migration path which provides positive incentives at each step of the way, or else the migration is doomed to failure. I am proposing very significant changes to how research is done, and so the incentives along the migration path necessarily require considerable thought. Addressing these issues systematically is one of the main reasons I’ve written a book.

Published
Categorized as General

What I’m imbibing

Commenter Martin points to the January 2007 issue of Physics World, which contains a lot of very interesting information about Web 2.0 and Science. In a similar vein, Corie Lok has some thoughtful recent reflections on getting scientists to adopt new tools for research. Finally, let me mention Jon Udell’s interview with Lewis Shepherd, talking about the US Defense Intelligence Agency’s use of wikis, blogs, Intellipedia, and many other interesting things. Some of the challenges he faced in bringing such social tools to Defense are similar to the problems in bringing them to science.

On a completely different topic, let me mention a fantastic presentation about green technology given earlier this year by John Doerr at the TED conference. I’ve been working my way through all the online TED talks, many of which are really good. While I’m at it, I may as well plug the Long Now talks, which is also a great series, with talks by people like Danny Hillis, John Baez, Stewart Brand, Jimmy Wales and many others.

Published
Categorized as General

More on funding

Chad Orzel has some thoughtful comments on my earlier questions about research funding. Here’s a few excerpts and some further thoughts:

… a good deal of the image problems that science in general has at the moment can be traced to a failure to grapple more directly with issues of funding and the justification of funding… In the latter half of the 20th century, we probably worked out the quantum details of 1000 times as many physical systems as in the first half, but that sort of thing feels a little like stamp collecting– adding one new element to a mixture and then re-measuring the band structure of the resulting solid doesn’t really seem to be on the same level as, say, the Schrödinger equation, but I’m at a loss for how to quantify the difference… The more important question, though, is should we really expect or demand that learning be proportional to funding?

This really gets to the nub of it. In research, as in so many other things, funding may hit a point of diminishing returns beyond which what we learn becomes more and more marginal. However, it is by no means obvious where the threshold is beyond which society as a whole would be better off allocating its resources to other more worthy causes.

And what, exactly, do we as a society expect to get out of fundamental research?

For years, the argument has been based on technology– that fundamental research is necessary to understand how to build the technologies of the future, and put a flying car in every garage. This has worked well for a long time, and it’s still true in a lot of fields, but I think it’s starting to break down in the really big-ticket areas. You can make a decent case that, say, a major neutron diffraction facility will provide materials science information that will allow better understanding of high-temperature superconductors, and make life better for everyone. It’s a little harder to make that case for the Higgs boson, and you’re sort of left with the Tang and Velcro argument– that working on making the next generation of whopping huge accelerators will lead to spin-off technologies that benefit large numbers of people. It’s not clear to me that this is a winning argument– we’ve gotten some nice things out of CERN, the Web among them, but I don’t know that the return on investment really justifies the expense.

The spinoff argument also has the problem that it’s hard to argue that these things wouldn’t have happened anyway. No disrespect to Tim Berners-Lee’s wonderful work, but it’s hard to believe that if he hadn’t started the web, some MIT student in a dorm room wouldn’t have done so shortly thereafter.

Of course, it’s not like I have a sure-fire argument. Like most scientists, I think that research is inherently worth funding– it’s practically axiomatic. Science is, at a fundamental level, what sets us apart from other animals. We don’t just accept the world around us as inscrutable and unchangeable, we poke at it until we figure out how it works, and we use that knowledge to our advantage. No matter what poets and musicians say, it’s science that makes us human, and that’s worth a few bucks to keep going. And if it takes millions or billions of dollars, well, we’re a wealthy society, and we can afford it.

We really ought to have a better argument than that, though.

As for the appropriate level of funding, I’m not sure I have a concrete number in mind. If we’ve got half a trillion to piss away on misguided military adventures, though, I think we can throw a few billion to the sciences without demanding anything particular in return.

One could attempt to frame this in purely economic terms: what’s the optimal rate at which to invest in research in order to maximize utility, under reasonable assumptions? This framing misses some of the other social benefits that Chad alludes to – all other things being equal, I’d rather live in a world where we understand general relativity, just because – but has the benefit of being at least passably well posed. I don’t know a lot about their conclusions, but I believe this kind of question has recently come under a lot of scrutiny from economists like Paul Romer, under the name endogenous growth theory.

Published
Categorized as Science

The Future of Science

How is the web going to impact science?

At present, the impact of the web on science has mostly been to make access to existing information easier, using tools such as online journals and databases such as the ISI Web of Knowledge and Google Scholar. There have also been some interesting attempts at developing other forms of tools, although so far as I am aware none of them have gained a lot of traction with the wider scientific community. (There are signs of exceptions to this rule on the horizon, especially some of the tools being developed by Timo Hannay’s team at Nature.)

The contrast with the internet at large is striking. Ebay, Google, Wikipedia, Facebook, Flickr and many others are new types of institution enabling entirely new forms of co-operation. Furthermore, the rate of innovation in creating such new institutions is enormous, and these examples only scratch the surface of what will soon be possible.

Over the past few months I’ve drafted a short book on how I think science will change over the next few years as a result of the web. Although I’m still revising and extending the book, over the next few weeks I’ll be posting self-contained excerpts here that I think might be of some interest. Thoughtful feedback, argument, and suggestions are very welcome!

A few of the things I discuss in the book and will post about here include:

  • Micropublication: Allowing immediate publication in small incremental steps, both of conventional text, and in more diverse media formats (e.g. commentary, code, data, simulations, explanations, suggestions, criticism and correction). All are to be treated as first class fully citable publications, creating an incentive for people to contribute far more rapidly and in a wider range of ways than is presently the case.
  • Open source research: Using version control systems to open up scientific publications so they can be extended, modified, reused, refactored and recombined by other users, all the while preserving a coherent and citable record of who did what, and when.
  • The future of peer review: The present quality assurance system relies on refereeing as a filtering system, prior to publication. Can we move to a system where the filtering is done after publication?
  • Collaboration markets: How can we fully leverage individual expertise? Most researchers spend much of their time reinventing the wheel, or doing tasks at which they have relatively little comparative advantage. Can we provide mechanisms to easily outsource work like this?
  • Legacy systems and migration: Why is it that the scientific community has been so slow to innovate on the internet? Many of the ideas above no doubt look like pipedreams. Nonetheless, I believe that by carefully considering and integrating with today’s legacy incentive systems (citation, peer review, and journal publication), it will be possible to construct a migration path that incentivizes scientists to make the jump to new tools for doing research.

The Research Funding “Crisis”

If you talk with academics for long, sooner or later you’ll hear one of them talk about a funding crisis in fundamental research (e.g. Google and Cosmic Variance).

There are two related questions that bother me.

First, how much funding is enough for fundamental research? What criterion should be used to decide how much money is the right amount to spend on fundamental research?

Second, the human race spent a lot more on fundamental research in the second half of the twentieth century than it did in the first. It’s hard to get a good handle on exactly how much, in part because it depends on what you mean by fundamental research. At a guess, I’d say at least 1000 times as much was spent in the second half of the twentieth century. Did we learn 1000 times as much? In fact, did we learn as much, even without a multiplier?

Question for Marc Andreessen

A few weeks ago, Marc Andreessen invited his readers to submit a question to him. Here’s mine:

My question is whether you think a technological singularity of the type Vernor Vinge has proposed is likely in the near-term future? If so, what shape do you think the singularity is likely to take? If not, why do you think it won’t occur?

I hope you have time to answer. My own (outsider’s) perspective is that an awfully large number of people (Google, Ebay, Wikipedia, etc) now seem to be working more or less directly towards such a singularity, and it is very suggestive that more and more of the world’s resources are being directed toward this end. Of course, Ebay, Google etc don’t look at it that way, but from the perspective of a posthuman historian 50 years from now that may well be how it looks.

Andreessen hasn’t replied, but I think this fact about the growing commercial utility of AI is fascinating. Here are a couple of quotes from Google co-founder Larry Page that could easily be quoted by my putative posthuman historian:

We have some people at Google who are really trying to build artificial intelligence and to do it on a large scale […] to do the perfect job of search you could ask any query and it would give you the perfect answer and that would be artificial intelligence […] I don’t think it’s as far off as people think.

You think Google is good, I still think it’s terrible. […] There’s still a huge number of things that we can’t answer. You might have a more complicated question. Like why did the GNP of Uganda decline relative to the weather last year? You type that into Google, the keywords for that, and you might get a reasonable answer. But there is probably something there that explains that, which we may or may not find. Doing a good job doing search is basically artificial intelligence, we want it to be smart.

It’s interesting that the Director of Google research, Peter Norvig, wrote what appears to be the standard text on artificial intelligence. He’s also got a pretty interesting page of book reviews.

Open source Google

Why can’t we ask arbitrarily complex questions of the whole web?

Consider the questions we can ask the web. Type a name into Google and you see, very roughly, the top sites mentioning that name, and how often it is mentioned on the web. At a more sophisticated level, Google makes available a limited API (see here, here, and here) that lets you send simple queries to their back-end database.

Compare that to what someone working internally for Google can do. They can ask arbitrarily complex questions of the web as a whole, using powerful database query techniques. They can even apply algorithms that leverage all the information available on the web, incorporating ideas from fields like machine learning to extract valuable information. This ability to query the web as a whole, together with Google’s massive computer cluster, enables not only Google search, but also many of the dozens of other applications offered by Google. To do all this, Google constructs a local mirror of the web, which they then enhance by indexing and structuring it to make complex queries of the web possible.
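To make the contrast concrete, here is a hypothetical miniature of such a mirror: a few invented pages stored in a local SQLite database, against which one can pose arbitrary SQL queries of the kind only insiders can currently ask of the real web. The URLs, page text, and link counts are all made up for illustration:

```python
# A toy "web mirror": pages in a local SQLite database, queryable with
# arbitrary SQL. All the data below is invented for the example.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, text TEXT, inbound_links INTEGER)")
db.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("http://example.org/a", "quantum computing and error correction", 120),
    ("http://example.org/b", "recipes for sourdough bread", 15),
    ("http://example.org/c", "quantum algorithms survey", 64),
])

# "Of all pages mentioning quantum, which are most linked to?"
rows = db.execute(
    """SELECT url FROM pages
       WHERE text LIKE '%quantum%'
       ORDER BY inbound_links DESC""").fetchall()
print([url for (url,) in rows])  # pages mentioning "quantum", most-linked first
```

A public mirror of the web would of course be vastly larger and need richer structure than this three-row table, but the ability it grants is the same in kind: whole-corpus questions, not just keyword lookups through a narrow API.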

What I want is for all developers to have full access to such a mirror, enabling anyone to query the web as a whole. Such a mirror would be an amazing development platform, leading to many entirely new types of applications and services. If developed correctly it would, in my opinion, eventually become a public good on a par with the electricity grid.

A related idea was announced last week by Wikipedia’s Jimbo Wales: the Search Wikia search engine is making available an open source web crawler which can be improved by the community at large. This great idea is, however, just the tip of a much larger iceberg. Sure, an open source search tool might improve the quality and transparency of search, and provide some serious competition to Google. But search is just a single application, no matter how important; it would be far more valuable to open up the entire underlying platform and computing infrastructure to developers. I predict that if Search Wikia is successful, then the developers contributing to it will inevitably drive it away from being a search application, and towards being a development platform.

I believe such a platform can be developed as an open source project, albeit a most unconventional one. So far as I am aware, no-one has ever attempted to develop an open source massively distributed computing platform. Many of the required ideas can of course be found in massively distributed applications such as SETI@Home, Folding@Home, and Bram Cohen’s BitTorrent. However, this project has many very challenging additional problems, such as privacy (who gets to see what data?) and resource allocation (how much time does any party get on the platform?)

Once these problems are overcome, such an open source platform will enable us to query not only the web as a whole, but also what John Battelle has called the “database of human intentions” – all the actions ever taken by any user of the platform. Indeed, Google’s most powerful applications increasingly integrate their mirror of the web with their proprietary database of human intentions. It’d be terrific if these two databases – the web as a whole, and the database of human intentions – were available to and fully queryable by humanity at large.

Was the Universe formerly a black hole?

This is a question that’s bugged me for a while.

First, here’s why I think this is a reasonable question to ask.

Suppose you cram a mass M into a spherical volume of radius R such that R is less than the Schwarzschild radius, i.e., R \leq 2GM/c^2. Then it’s a pretty well understood consequence of general relativity that the mass will collapse to form a black hole.

Current estimates of the mass of the (observable) Universe vary quite a bit. This webpage seems pretty representative, though, giving a value for the mass of 3 x 10^52 kg.

This gives a corresponding Schwarzschild radius of roughly 4.7 billion light years.
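The corresponding arithmetic, taking the mass estimate above at face value (a sketch, with constants rounded to four significant figures):

```python
# Schwarzschild radius for the estimated mass of the observable
# Universe, R_s = 2GM/c^2.

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8            # speed of light, m/s
LIGHT_YEAR = 9.461e15  # metres in one light year

M = 3e52               # estimated mass of the observable Universe, kg
r_s = 2 * G * M / c**2
print(f"{r_s / LIGHT_YEAR / 1e9:.1f} billion light years")  # → 4.7 billion light years
```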

The radius of the observable Universe is, of course, quite a bit bigger than this. But the Universe is also expanding, and at some point in the past its radius was well below this Schwarzschild radius.

If that was the case, why didn’t it collapse to form a singularity? In short, how come we’re still here?

Any cosmologists out there who can enlighten me?

Update: In comments, Dave Bacon points to an enlightening essay from John Baez, explaining some of what’s going on.

My interpretation of the essay is that the standard lore I learned as an undergraduate (namely, that if you take a mass M and compress it into a smaller radius than the Schwarzschild radius then a black hole must inevitably form) is wrong, and that the FRW cosmology provides a counterexample.

This raises the question of when, exactly, a black hole can be guaranteed to form.

Published
Categorized as Physics

The Academic Reader and RSS readers

In comments, Yue Li writes of the Academic Reader:

This seems very similar with a production of Google, the google reader, a quantum specific google reader:)

On the existing site, the main difference between the Academic Reader and RSS readers like the Google Reader is that we have a variety of ways of searching and browsing older papers. This means the Academic Reader allows you to both (1) keep abreast of your current reading, and (2) look back into the past, discovering older papers and so on. RSS readers typically focus on just the first of these problems.

This functionality will be greatly extended in coming months!