Micropublication and open source research

This is an extract from my (very early) draft book on the way the internet is changing how science is done.

I would like to legitimize a new kind of proof: ‘The Incomplete Proof’. The reason that the output of mathematicians is so meager is that we only publish that tiny part of our work that ended up in complete success. The rest goes to the recycling bin. […] Why did [the great mathematician] Paul Cohen stop publishing at the age of 30? My guess is that he was trying, and probably still is, to prove [the Riemann Hypothesis]. I would love to be able to see his ‘failed attempts’. […] So here is my revolutionary proposal. Publish all your (good) thoughts and ideas, regardless of whether they are finished or not.
Doron Zeilberger

Imagine you are reading a research article. You notice a minor typo in the article, which you quickly fix, using a wiki-like editing system to create a new temporary “branch” of the article – i.e., a copy of the article, but with some modifications that you’ve made. The original authors of the article are notified of the branch, and one quickly contacts you to thank you for the fix. The default version of the article is now updated to point to your branch, and your name is automatically added to a list of people who have contributed to the article, as part of a complete version history of the article. This latter information is also collected by an aggregator which generates statistics about contributions, statistics which you can put into your curriculum vitae, grant applications, and so on.

Later on while reading, you notice a more serious ambiguity, an explanation that could be interpreted in several inconsistent ways. After some time, you figure out which explanation the authors intend, and prepare a corrected version of the article in a temporary branch. Once again, the original authors are notified. Soon, one contacts you with some queries about your fix, pointing out some subtleties that you’d failed to appreciate. After a bit of back and forth, you revise your branch further, until both you and the author agree that the result is an improvement on both the original article and on your first attempt at a branch. The author-approved default version of the article is updated to point to the improved version, and you are recognized appropriately for your contribution.

Still later, you notice a serious error in the article – maybe a flaw in the logic, or a serious error of omission material to the argument – which you don’t immediately see how to fix. You prepare a temporary branch of the article, but this time, rather than correcting the error, you insert a warning explaining the existence and the nature of the error, and how you think it affects the conclusions of the article.

Once again, the original authors are notified of your branch. This time they aren’t so pleased with your modifications. Even after multiple back and forth exchanges, and some further revisions on your part, they disagree with your assessment that there is an error. Despite this, you remain convinced that they are missing your point.

Believing that the situation is not readily resolvable, you create a more permanent branch of the article. Now there are two branches of the article visible to the public, with slightly differing version histories. Of course, these version histories are publicly accessible, and so who contributed what is a matter of public record, and there is no danger that there will be any ambiguity about the origins of the new material, nor about the origin of the disagreement between the two branches.

Initially, most readers look only at the original branch of the article, but a few look at yours as well. Favourable commentary and a gradual relative increase in traffic to your branch (made suitably visible to potential readers) encourages still more people to read your version preferentially. Your branch gradually becomes more highly visible, while the original fades. Someone else fixes the error you noticed, leading to your branch being replaced by a still further improved version, and still more traffic. After some months, reality sets in and the original authors come around to your point of view, removing their original branch entirely, leaving just the new improved version of the article. Alternatively, perhaps the original authors, alarmed by their diminution, decide to strike back with a revised version of their article, explaining in detail why you are wrong.

These stories illustrate a few uses of micropublication and open source research. These are simple ideas for research publication, but ones that have big consequences. The idea of micropublication is to enable publication in smaller increments and more diverse formats than in the standard scientific research paper. The idea of open source research is to open up the licensing model of scientific publication, providing more flexible ways in which prior work can be modified and re-used, while ensuring that all contributions are fully recognized and acknowledged.

Let’s examine a few more potential applications of micropublication and open source research.

Imagine you are reading an article about the principles of population control. As you read, you realize that you can develop a simulator which illustrates in a vivid visual form one of the main principles described in the article, and provides a sandbox for readers to play with and better understand that principle. After dropping a (favourably received) note to the authors, and a little work, you’ve put together a nice simulation. After a bit of back and forth with the authors, a link to your simulation is now integrated into the article. Anyone reading the article can now click on the relevant equation and will immediately see your simulation (and, if they like, the source code). A few months later, someone takes up your source code and develops the simulation further, improving the reader experience still further.

Imagine reading Einstein’s original articles on special relativity, and being able to link directly to simulations (or, even better, fully-fledged computer games) that vividly demonstrate the effects of length contraction, time dilation, and so on. In mathematical disciplines, this kind of content enhancement might even be done semi-automatically. The tools could gradually integrate the ability to make inferences and connections – “The automated reasoning software has discovered a simplification of Equation 3; would you like to view the simplification now?”

Similar types of content enhancement could, of course, be used in all disciplines. Graphs, videos, explanations, commentary, background material, data sets, source code, experimental procedures, links to wikipedia, links to other related papers, links to related pedagogical materials, talks, media releases – all these and more could be integrated more thoroughly into research publishing. Furthermore, rather than being second-class add-ons to “real” research publications, a well-designed citation and archival system would ensure that all these forms have the status of first-class research publications, raising their stature, and helping ensure that people put more effort into adding value in these ways.

Another use for open source research is more pedagogical in flavour. Imagine you are a student assigned to rewrite Einstein’s article on general relativity in the language of modern differential geometry. Think of the excitement of working with the master’s original text, fully inhabiting it, and then improving it still further! Of course, such an assignment is technologically possible even now. However, academia has strong cultural inhibitions against making such modifications to original research articles. I will argue that with properly authenticated archival systems these issues could be addressed, the inhibitions could be removed, and a world of new possibilities opened up.

Having discussed micropublication and open source research in concrete terms, let’s now describe them in more abstract terms, and briefly discuss some of the problems that must be overcome if they are to become viable modes of publication. More detailed resolutions to these problems will be discussed in a later post.

Micropublication does three things. First, it decreases the size of the smallest publishable unit of research. Second, it broadens the class of objects considered as first-class publishable objects so that it includes not just papers, but also items such as data, computer code, simulations, commentary, and so on. Third, it eliminates the barrier of peer review, a point we’ll come back to shortly. The consequence is to greatly reduce the friction slowing down the progress of the research community, by lowering the barriers to publication. Although promising, this lowering of the barriers to publication also creates three problems that must be addressed if the research community is to adopt the concept of micropublication.

The first problem is providing appropriate recognition for people’s contributions. This can be achieved through appropriate archival and citation systems, and is described in detail in a later post.

The second problem is quality assurance. The current convention in science is to filter content before publishing it through a system of peer review. In principle, this ensures that only the best research gets published in the top journals. While this system has substantial failures in practice, on the whole it has improved our access to high-quality research. To ensure similar quality, micropublication must use a publish-then-filter model which enables the highest quality research to be accurately identified. We will discuss the development of such filtering systems in a later post. Note, however, that publish-then-filter already works surprisingly well on the web, due to tools such as Google, which is capable of picking out high value webpages. Such filtering systems are far from perfect, of course, and there are serious obstacles to be overcome if this is to be a successful model.

The third problem is providing tools to organize and search through the mass of publication data. This is, in some sense, the flip side of the quality assurance problem, since it is also about organizing information in meaningful and useful ways, and there is considerable overlap in how these tools must work. Once again, we will discuss the development of these tools in a later post.

Open source research opens up the licensing model used in research publication so that people may make more creative reuse of existing work, and thus speed the process of research. It removes the cumbersome quote-and-cite licensing model in current use in science. That model makes sense if one is publishing on paper, but is not necessary in electronic publication. Instead, it is replaced by a trustworthy authenticated archive of publication data which allows one to see an entire version history of a document, so that we can see who contributed what and when. This will allow people to rapidly improve, extend and enhance other people’s work, in all the ways described above.
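To make the idea of an authenticated version history a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a design for the archive itself: the names `Revision`, `history` and `contributors` are hypothetical, and chaining SHA-256 digests is just one plausible way to make a history tamper-evident.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    """One immutable entry in an article's version history (hypothetical)."""
    author: str
    text: str
    parent: Optional["Revision"] = None  # None for the first version

    @property
    def digest(self) -> str:
        # The hash covers the author, the text and the parent's digest,
        # so altering any earlier revision changes every later digest.
        parent_digest = self.parent.digest if self.parent else ""
        payload = "\n".join([parent_digest, self.author, self.text])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def history(tip: Revision) -> list[Revision]:
    """Return the revisions from the first version up to `tip`."""
    chain = []
    node: Optional[Revision] = tip
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


def contributors(tip: Revision) -> list[str]:
    """Authors in order of first contribution, without duplicates."""
    seen: list[str] = []
    for rev in history(tip):
        if rev.author not in seen:
            seen.append(rev.author)
    return seen


original = Revision("Original Author", "A theorem with a tpyo.")
typo_fix = Revision("Reader", "A theorem with a typo.", parent=original)
# A second branch from the same parent: a disagreement kept side by side.
rebuttal = Revision("Original Author", "A theorem, defended.", parent=original)

print(contributors(typo_fix))  # ['Original Author', 'Reader']
print(contributors(rebuttal))  # ['Original Author']
```

Because each branch is just a chain of parent links, two competing branches share their common history, and who contributed what on each branch can be read off directly.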

Academics have something of a horror of the informal re-use that I may appear to be advocating. The reason is that the principal currency of research is attention and reputation, not (directly) money. In such a system, not properly citing sources is taken very seriously; even very illustrious researchers have fallen from grace over accusations of plagiarism. For these reasons, it is necessary to design the archival system carefully to ensure that one can gain the benefits of a more informal licensing model, while still adequately recognizing people’s contributions.

Overarching and unifying all these problems is one main problem, the problem of migration, i.e., convincing researchers that it is in their best interest to move to the new system. How can this possibly be achieved? The most obvious implementations of micropublication and open source research will require researchers to give up their participation in the standard recognition system of science — the existing journal system. Such a requirement will undoubtedly result in the migration failing. Fortunately, I believe it is possible to find a migratory path which integrates and extends the standard recognition system of science in such a way that researchers have only positive incentives to make the migration. This path does not start with a single jump to micropublication and open source research, but rather involves a staged migration, with each stage integrating support for legacy systems such as citation and peer review, but also building on new systems that can take the place of the legacy systems, and which are better suited for the eventual goals of micropublication and open source research. This process is quite flexible, but involves many separate ideas, which will be described in subsequent posts.

24 comments

  1. I know that you are starting in the middle here, but perhaps you can back up and say what you think the advantages of this kind of system are?

    (Or is that even your point? Are you arguing that this kind of system is inevitable, something to prepare for — or that it is something to aim for?)

    The initial problem you describe could be handled with today’s system. The person who notices a typo emails the authors and they correct it (on the arxiv). If he notices an error in a proof and corrects it, then he gets his acknowledgment. 🙂 (If the paper is already in press at that point.. well it shouldn’t be.)

    Links to simulations can be maintained on the author’s webpage. No, not everybody can add to it, but I think that is a benefit. The person most qualified to decide what to link to is the author. Aside from that, there’s google and no need to centralize everything on one page.

    Judging from the arxiv, there is already a lot of micropublication going on. In my opinion, there can certainly be too much of it. Benefits are easily overwhelmed by disadvantages: more quantity than quality, flaws in proofs, poorly written and edited papers that would have benefited from a serious review, … I think that extended micropublication might lead to some benefits, but it would increase these problems more: a net negative.

  2. http://scottaaronson.com/blog/?p=5

    Wikipedia is closer to what you describe than journals.

    Suppose it could have articles with a primary author who had the power to choose the default version in the edit history – which would of course require preserving alternate branches in a tree-like rather than linear structure, but that is not so hard.

    Why does Wikipedia not allow this? Because if people want to change your protected article, they will simply copy the content and the copied page will become the new default. The net result would probably be a trail of dead protected pages.

  3. bhauth: I don’t think dead pages are necessarily a problem. Most pages on the web / most blogs are dead, yet you rarely see them, because our search and filtering tools are not too bad. (By contrast, while wikipedia does many things well, its search functionality isn’t very good, and it is very easy to find yourself looking at low quality pages.) A well-designed publication platform will direct people to high-quality branches, and away from low-quality dead ends. Of course, it’s a really interesting problem to develop good methods for doing such filtering.

  4. “w foo” in my address bar becomes
    http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=site%3Aen.wikipedia.org+foo&btnI=I%27%27m+Feeling+Lucky

    Works OK.

    Wikipedia works because it lumps all vandalism and all fixing together, rather than compartmentalizing them. Then fixing must only be overall greater than vandalism. Roughly.

    Perhaps what you are thinking of is a trust network, or recommendation system, that allows you to value certain people’s edits more. I agree that is an interesting problem, and really should finally finish that Netflix prize entry soon. =/

  5. One concern I have is about the writing quality. I think that single-author manuscripts have a certain focus and style that gets watered down when documents are written by committee. Some have said this is the reason that Hollywood scripts are often so weak, even though in principle they have enough money to pay for better writing. Likewise, the Wiki approach might be more appropriate for producing encyclopedias than for producing crisp research papers.

  6. That is rather inconsistent with the way things currently work with coauthors and the current role reviewers play in publication. You can see it as a continuum if you want.

  7. Nice extract! Looking forward to the book.

    I’m really interested in the possibilities offered by some clever linking between articles. It should be possible to set up a citation mechanism that works both backwards in time (as with current papers) and forward in time, i.e., what articles cite this one? The links could have an associated context too. Maybe the way to go is to establish official link servers and a static id for each article. Then questions of which are the important articles and threads of ideas become questions of analysing the resulting graph. This together with micropublishing and properly assigning credit could have a drastic effect on how science is done.
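As a rough illustration of the forward-in-time links, the sketch below inverts a table of ordinary backward citations. The article ids and the `invert` helper are made up for the example; a real link server would presumably hold far richer records.

```python
from collections import defaultdict

# Hypothetical static article ids mapped to the ids they cite
# (the "backward in time" links every paper already carries).
cites = {
    "paper-A": [],
    "paper-B": ["paper-A"],
    "paper-C": ["paper-A", "paper-B"],
}


def invert(cites):
    """Build the "forward in time" index: article id -> articles citing it."""
    cited_by = defaultdict(list)
    for source, targets in cites.items():
        for target in targets:
            cited_by[target].append(source)
    return dict(cited_by)


cited_by = invert(cites)
print(cited_by["paper-A"])  # ['paper-B', 'paper-C']
```

Once both directions are available, questions about important articles become ordinary graph questions over `cites` and `cited_by`.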

  8. Aram,

    Yeah, that’s a real potential problem, notwithstanding the fact that “crisp” research papers aren’t as common as we all might like.

    Two partial solutions suggest themselves: (1) better incentives for well-written work; and (2) filters that help suppress poorly-written stuff. (2) is, in a sense, a special case of (1). On the general web, the combination of Google and pageviews already works this way, to some extent, providing an incentive for people to produce relatively high quality stuff.

  9. bhauth: “Perhaps what you are thinking of is a trust network, or recommendation system, that allows you to value certain people’s edits more. I agree that is an interesting problem, and really should finally finish that Netflix prize entry soon. =/”

    Sort of. My ideas aren’t yet fixed, but it does seem likely that some sort of filtering will help prevent dead branches becoming too much of a problem. Forked open software is not unusual. What typically seems to happen is that either (a) one project dies (and usually gradually drops off Google), and the other continues, or (b) both projects thrive, but ultimately serve different communities. I can imagine something similar happening with open source articles, with no ill effects.

    With that said, although it’s relatively easy to fork much open source software, there are certainly some barriers to forking, and that prevents too much forking occurring. It seems like you might want some such barriers in research articles as well, to prevent proliferation out of control.

    (The 14 year old in me delights in reading that last paragraph aloud.)

  10. bhauth: Incidentally, are you participating in the Netflix prize? It seems really interesting, although I’ve been put off by the relatively long timeline.

  11. Hey Alexei!!

    As I’m sure you realize, DOI / CrossRef comes pretty close to the sort of service you suggest. In particular, they seem to have a lot of citation data in their records, from which the full graph can be reconstructed. They don’t have forward data (I think), but the way their XML works with the DOI identifier it looks dead simple to reconstruct the forward links. It’s a pity it’s not an open platform, although it wouldn’t be that hard to build a similar open platform.

    On a related note, there’s a service that’s just been launched here in Waterloo called AideRSS that does some very interesting link analysis for blog posts using “PostRank” (PageRank for blogs). (See also Ilya’s blog for a lot more on this.)

    It’d be great if you got a blog and posted some of your thoughts on Science and the web! Funnily enough, I’ve spent a lot of time over the last week telling people (Robin Blume-Kohout, Chris Fuchs, and Matt Leifer) about your experiments with version control and paper authoring. There’s obviously a lot of interest in stuff like this…

  12. I had an idea that is orthogonal to micropublication, but applies to publishing technical research on the web. The idea is being able to publish a paper where a reader can “zoom in” on the mathematics to whatever level of detail they would like.

    For example, in the body of a paper, each step of a proof would be presented with only sparse details, but enough to see the major steps of the proof. This way readers could see the forest for the trees. Then a reader could click on each step to expand it, and see the substeps. A substep could in fact be a proof that is contained in another paper. If this is the case, the contents of that proof could appear inline. Then, for more novice readers, even the substeps could be expanded into sub-substeps. At some level, the substeps could be automatically generated.
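One way to picture such a zoomable proof is as a tree of steps that is rendered only down to the depth the reader has asked for. The sketch below is a toy version under that assumption; `Step` and `render` are hypothetical names, and the example proof is chosen only for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """A proof step with optional, more detailed substeps (hypothetical)."""
    claim: str
    substeps: list["Step"] = field(default_factory=list)


def render(step, depth, indent=0):
    """Return the lines shown when the reader has expanded `depth` levels."""
    lines = ["  " * indent + step.claim]
    if depth > 0:
        for sub in step.substeps:
            lines.extend(render(sub, depth - 1, indent + 1))
    return lines


proof = Step("If n^2 is even then n is even", [
    Step("Suppose n is odd; then n^2 is odd", [
        Step("Write n = 2k + 1"),
        Step("Then n^2 = 2(2k^2 + 2k) + 1"),
    ]),
    Step("This contradicts n^2 being even"),
])

# Depth 0 shows only the statement; each extra level reveals more detail.
for depth in range(3):
    print("\n".join(render(proof, depth)))
    print()
```

A substep that lives in another paper would simply be a `Step` fetched from that paper's own tree.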

  13. Hi Jon,

    This kind of zooming seems very attractive. Wikipedia does something related already, with large topics often having a single main page, and many subpages, with the subpages typically being summarized on the main page.

    I suspect that having a very large number of levels may be unattractive to authors – it would be a lot of effort to prepare (although open sourcing the license may help with this). But having two or three levels of detail may be quite an attractive proposition to authors. At some level, many already try to write this way, both providing an outline of their argument, and providing the details in separate sections (or appendices). Of course, at present publishing tools aren’t very well adapted to presenting multiple levels, and so you lose a lot of the impact.

  14. In addition to the equation simplifier you mention, two more AI based helpers come to mind:
    – Math checker (verify that the math, especially the statistics, is sound; this could go some way toward addressing the shoddy statistics problem raised by John Ioannidis and covered in the Atlantic article, “Lies, Damned Lies, and Medical Science,” http://bit.ly/bU6o49).
    – Reference checker: show possible citations that the author might have missed, by examining the text.

    Also, I feel the staged adoption approach you mention at the end is logical but too safe. One or more bolder leaps have to be taken (maybe Tahrir has whetted the appetite for disruptive changes). Perhaps these leaps have already been taken by Google, Wikipedia et al. and just need to be identified and integrated.
