(Some) garbage in, gold out

During a recent talk, David Weinberger asked me (paraphrasing) whether and how the nature of scientific knowledge will change when it’s produced by large networked collaborations.

It’s a great question. Suppose it’s announced in the next few years that the LHC has discovered the Higgs boson. There will, no doubt, be a peer-reviewed scientific paper describing the result.

How should we regard such an announcement?

The chain of evidence behind the result will no doubt be phenomenally complex. The LHC analyses about 600 million particle collisions per second. The data analysis is done using a cluster of more than 200,000 processing cores, and tens of millions of lines of software code. That code is built on all sorts of extremely specialized knowledge and assumptions about detector and beam physics, statistical inference, quantum field theory, and so on. What’s more, that code, like any large software package, no doubt has many bugs, despite enormous effort to eliminate them.

No one person in the world will understand in detail the entire chain of evidence that led to the discovery of the Higgs. In fact, it’s possible that very few (no?) people will understand in much depth even just the principles behind the chain of evidence. How many people have truly mastered quantum field theory, statistical inference, detector physics, and distributed computing?

What, then, should we make of any paper announcing that the Higgs boson has been found?

Standard pre-publication peer review will mean little. Yes, it’ll be useful as an independent sanity check of the work. But all it will show is that there are no glaringly obvious holes. It certainly won’t involve more than a cursory inspection of the evidence.

A related situation arose in the 1980s in mathematics. It was announced in the early 1980s that an extremely important mathematical problem had been solved: the classification of the finite simple groups. The proof had taken about 30 years, and involved an effort by 100 or so mathematicians, spread across many papers and thousands of pages of proof.

Unfortunately, the original proof had gaps. Most of them were not serious. But at least one serious gap remained. In 2004, two mathematicians published a two-volume, 1,200 page supplement to the original proof, filling in the gap. (At least, we hope they’ve filled in the gap!)

When discoveries rely on hundreds of pieces of evidence or steps of reasoning, we can be pretty sure of our conclusions, provided our error rate is low, say one part in a hundred thousand. But when we start to use a million or a billion (or a trillion or more) pieces of evidence or steps of reasoning, an error rate of one part in a million becomes a guarantee of failure, unless we develop systems that can tolerate those errors.
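
To make the arithmetic concrete, here is a minimal sketch. It assumes that every step fails independently and with the same probability, which is a simplification introduced purely for illustration, and the function name is hypothetical. Under that assumption the chance of at least one error in n steps is 1 - (1 - p)^n.

```python
import math


def prob_at_least_one_error(error_rate, steps):
    """Chance that at least one of `steps` steps is wrong.

    Assumes (for illustration only) that steps fail independently,
    each with probability `error_rate`.
    """
    # 1 - (1 - p)^n, computed with log1p/expm1 so that a tiny p and a
    # huge n do not lose precision.
    return -math.expm1(steps * math.log1p(-error_rate))


for error_rate, steps in [(1e-5, 100), (1e-6, 10**6), (1e-6, 10**9)]:
    p = prob_at_least_one_error(error_rate, steps)
    print(f"error rate {error_rate:g}, {steps:,} steps: "
          f"P(at least one error) = {p:.4f}")
```

With these inputs the first case comes out at roughly one in a thousand, the second at a bit under two in three, and the third is indistinguishable from certainty. Real errors are correlated and unevenly distributed, so this is only a rough guide to how quickly the odds turn.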

It seems to me that one of the core questions the scientific community will wrestle with over the next few decades is what principles and practices we use to judge whether or not a conclusion drawn from a large body of networked knowledge is correct. To put it another way, how can we ensure that we reliably come to correct conclusions, despite the fact that some of our evidence or reasoning is almost certainly wrong?

At the moment each large-scale collaboration addresses this in its own way. The people at the LHC and those responsible for the classification of finite simple groups are certainly very smart, and I’ve no doubt they’re doing lots of smart things to eliminate or greatly reduce the impact of errors. But it’d be good to have a principled way of understanding how and when we can come to correct scientific conclusions, in the face of low-level errors in the evidence and reasoning used to arrive at those conclusions.

If you doubt there’s a problem here, then think about the mistakes that led to the Pentium floating point bug. Or think of the loss of the Mars Climate Orbiter. That’s often described as a failure to convert between metric and imperial units, which makes it sound trivial, like the people at NASA are fools. The real problem was deeper. As a NASA official said:

People sometimes make errors. The problem here was not the error [of unit conversion], it was the failure of NASA’s systems engineering, and the checks and balances in our processes to detect the error. That’s why we lost the spacecraft.

In other words, when you’re working at NASA scale, problems that are unlikely at small scale, like failing to do a unit conversion, are certain to occur. It’s foolish to act as though they won’t happen. Instead, you need to develop systems which limit the impact of such errors.

In the context of science, what this means is that we need new methods of fault-tolerant discovery.

I don’t have well-developed answers to the questions I’ve raised above, riffing off David Weinberger’s original question. But I will finish with the notion that one useful source of ideas may be systems and safety engineering, which are responsible for the reliable performance of complex systems such as modern aircraft. According to Boeing, a 747-400 has six million parts, and the first 747 required 75,000 engineering drawings. Not to mention all the fallible human “components” in a modern aircraft. Yet aircraft systems and safety engineers have developed checks and balances that let us draw with very high probability the conclusion “The plane will get safely from point A to B”. Sounds like a promising source of insights to me!

Further reading: An intriguing experiment in distributed verification of a mathematical proof has been done in an article by Doron Zeilberger. Even if you can’t follow the mathematics, it’s stimulating to look through. I’ve taken a stab at some of the issues in this post before, in my essay Science beyond individual understanding. I’m also looking forward to David Weinberger’s new book about networked knowledge, Too Big To Know. Finally, my new book Reinventing Discovery is about the promise and the challenges of networked science.

Open Access: a short summary

I wrote the following essay for one of my favourite online forums, Hacker News, which over the past few months has seen more and more discussion of the issue of open access to scientific publication. It seems like it might have broader interest, so I’m reposting it here. Original link here.

The topic of open access to scientific papers comes up often on Hacker News.

Unfortunately, those discussions sometimes bog down in misinformation and misunderstandings.

Although it’s not exactly my area of expertise, it’s close — I’ve spent the last three years working on open science.

So I thought it might be useful to post a summary of the current state of open access. There’s a lot going on, so even though this essay appears lengthy, it’s actually a very brief and incomplete summary of what’s happening. I have links to further reading at the end.

This is not a small stakes game. The big scientific publishers are phenomenally profitable. In 2009, Elsevier made a profit of 1.1 billion dollars on revenue of 3.2 billion dollars. That’s a margin (and business model) they are very strongly motivated to protect. They’re the biggest commercial journal publisher, but the other big publishers are also extremely profitable.

Even not-for-profit societies often make an enormous profit on their journals. In 2004 (the most recent year for which I have figures) the American Chemical Society made a profit of 40 million dollars on revenues of 340 million dollars. Not bad! This money is reinvested in other society activities, including salaries. Top execs receive salaries in the 500k to 1m range (as of 2006; I’m sure it’s quite a bit higher now).

The traditional publishers make money by charging journal subscription fees to libraries. Why they make so much money is a matter for much discussion, but I will merely point out one fact: there are big systematic inefficiencies built into the market. University libraries for the most part pay the subscription fees, but they rely on guidance (and often respond to pressure) from faculty members in deciding what journals to subscribe to. In practice, faculty often have a lot of power in making these decisions, without bearing the costs. And so they can be quite price-insensitive.

The journal publishers have wildly varying (and changing) responses to the notion of open access.

For example, most Springer journals are closed access, but in 2008 Springer bought BioMedCentral, one of the original open access publishers, and by some counts the world’s largest. BioMedCentral continues to operate. (More on the deal here.)

[Update: It has been pointed out to me in email that Springer now uses a hybrid open access model for most of their journals, whereby authors can opt to pay a fee to make their articles open access. If the authors don’t pay that fee, the articles remain closed. The other Springer journals, including BioMedCentral, are fully open access.]

Nature Publishing Group is also mostly closed access, but has recently started an open access journal called Scientific Reports, apparently modeled after the (open access) Public Library of Science’s journal PLoS One.

It is sometimes stated that big commercial publishers don’t allow authors to put free-to-access copies of their papers on the web. In fact, policies vary quite a bit from publisher to publisher. Elsevier and Springer, for example, do allow authors to put copies of their papers on their websites, and into institutional repositories. This doesn’t mean that always (or even often) happens, but it’s at least in principle possible.

Comments on HN sometimes assume that open access is somehow a new issue, or an issue that no-one has been doing anything about until recently.

This is far from the case. Take a look at the Open Access Newsletters and you’ll realize that there’s a community of people working very, very hard for open access. They’re just not necessarily working in ways that are visible to hackers.

Nonetheless, as a result of the efforts of people in the open access movement, a lot of successes have been achieved, and there is a great deal of momentum toward open access.

Here are a few examples of success:

In 2008 the US National Institutes of Health (NIH), by far the world’s largest funding agency, with a budget of more than $30 billion a year, adopted a policy requiring that all NIH-funded research be made openly accessible within 12 months of publication. See here for more.

All 7 UK Research Councils have adopted similar open access policies requiring researchers they fund to make their work openly accessible.

Many universities have adopted open access policies. Examples include: Harvard’s Faculty of Arts and Sciences, MIT, and Princeton.

As a result of policies like these, in years to come you should see more and more freely downloadable papers showing up in search results.

Note that there are a lot of differences of detail in the different policies, and those details can make a big difference to the practical impact of the policies. I won’t try to summarize all the nuances here; I’m merely pointing out that there is a lot of institutional movement.

Many more pointers to open access policies may be found at ROARMAP. That site notes 52 open access policies from grant agencies, and 135 from academic institutions.

There’s obviously still a long way to go before there is universal open access to publicly-funded research, but there has been a lot of progress, and a lot of momentum.

One thing that I hope will happen is that the US Federal Research Public Access Act passes. First proposed in 2006 (and again in 2010), this Act would essentially extend the NIH policy to all US Government-funded research (from agencies with budgets over 100 million). My understanding is that at present the Act is tied up in committee.

Despite (or because of) this progress, there is considerable pushback on the open access movement from some scientific publishers. As just one instance, in 2007 some large publishers hired a very aggressive PR firm to wage a campaign to publicly discredit open access.

I will not be surprised if this pushback escalates.

What can hackers do to help out?

One great thing to do is start a startup in this space. Startups (and former startups) like Mendeley, ChemSpider, BioMedCentral, PLoS and others have had a big impact over the past ten or so years, but there are even bigger opportunities for hackers to really redefine scientific publishing. Ideas like text mining, recommender systems, open access to data, automated inference, and many others can be pushed much, much further.

I’ve written about this in the following essay: Is Scientific Publishing About to be Disrupted? Many of those ideas are developed in much greater depth in my book on open science, Reinventing Discovery.

For less technical (and less time-consuming!) ways of getting involved, you may want to subscribe to the RSS feed at the Alliance for Taxpayer Access. This organization was crucial in helping get the NIH open access policy passed, and they’re helping do the same for the Federal Research Public Access Act, as well as other open access efforts.

If you want to know more, the best single resource I know is Peter Suber’s website.

Suber has, for example, written an extremely informative introduction to open access. His still-active Open Access Newsletter is a goldmine of information, as is his (no longer active) blog. He also runs the open access tracking project.

If you got this far, thanks for reading! Corrections are welcome.

Reinventing Discovery

I’m very excited to say that my new book, “Reinventing Discovery: The New Era of Networked Science”, has just been released!

The book is about networked science: the use of online tools to transform the way science is done. In the book I make the case that networked science has the potential to dramatically speed up the rate of scientific discovery, not just in one field, but across all of science. Furthermore, it won’t just speed up discovery, but will actually amplify our collective intelligence, expanding the range of scientific problems which can be attacked at all.

But, as I explain in the book, there are cultural obstacles that are blocking networked science from achieving its full potential. And so the book is also a manifesto, arguing that networked science must be open science if it is to realize its potential.

Making the change to open science is a big challenge. In my opinion it’s one of the biggest challenges our society faces, one that requires action on many fronts. One of those fronts is to make sure that everyone (including scientists, but also grant agencies, governments, libraries, and, especially, the general public) understands how high the stakes are, and how urgent the need for change is. And so my big hope for this book is that it will help raise the profile of open science. I want open science to become a part of our general culture, a subject every educated layperson is familiar with, and has an opinion about. If we can cause that to happen, then I believe that a big and positive shift in the culture of science is inevitable. And that will benefit everyone.

The book is shipping in hardcover from Amazon.com, and should ship through other booksellers by October 21. Note that the Kindle edition isn’t out as I write, but should arrive by October 21. A few relevant links:

Two caveats. First, I’m occasionally asked if the book is being released under a Creative Commons license. I discussed this option at length with my publisher, who ultimately declined. A couple of people have said to me that they find this ironic. This isn’t so, since the book argues as a broad principle that publicly funded science should be open science; the book is neither publicly funded nor, strictly speaking, science. However, as a personal preference I’d still like to see it enter the commons sooner rather than later. After the paperback has been out for a while, I will approach my publisher again to see what can be done.

Second, the book is not meant to be a reference work on open science. Instead, I’ve highlighted a small set of focused examples, inevitably leaving many great open science projects out. I hope the people running those other projects can forgive me. My aim wasn’t to write a reference work, but rather to write the kind of book that people will enjoy reading, and which enthusiasts of open science can give to their friends and family to help explain what open science is all about, and why it matters so very much.

Let me conclude by quoting one of my favorite lines from Tolkien: “The praise of the praiseworthy is above all reward”. And so it gives me great delight to finish with quotes from a few of the endorsements and reviews the book has received:

Science has always been a contact sport; the interaction of many minds is the engine of the discipline. Michael Nielsen has given us an unparalleled account of how new tools for collaboration are transforming scientific practice. Reinventing Discovery doesn’t just help us understand how the sciences are changing, it shows us how we can participate in the change. – Clay Shirky

This is the book on how networks will drive a revolution in scientific discovery; definitely recommended. – Tyler Cowen

Anyone who has followed science in recent years has noticed something odd: science is less and less about a solitary scientist working alone in a lab. Scientists are working in networks, and those networks are gaining scope, speed, and power through the internet. Nonscientists have been getting in on the act, too, folding proteins and identifying galaxies. Michael Nielsen has been watching these developments too, but he’s done much more: he’s provided the best synthesis I’ve seen of this new kind of science, and he’s also thought deeply about what it means for the future of how we understand the world. Reinventing Discovery is a delightfully written, thought-provoking book. – Carl Zimmer

Reinventing Discovery will frame serious discussion and inspire wild, disruptive ideas for the next decade. – Chris Lintott in Nature

Nielsen has created perhaps the most compelling and comprehensive case so far for a new approach to science in the Internet age. – Timo Hannay in Nature Physics

Georgia Tech, Duke University, and University of North Carolina

I’ll be speaking about open science at events at Georgia Tech, Duke University, and the University of North Carolina over the next few days. Here’s my current schedule of public and semi-public events:

  • Events on Monday, October 3, at 11:30am and 3:00pm at Georgia Tech: details of both events.
  • Event on Tuesday, October 4, at 4pm at Duke University: details.
  • I will be at the University of North Carolina on Wednesday, October 5. I am not currently doing any public events, but let me know if you’d like to meet, and it’s possible something can be arranged.

Berlin, New York, Boston

I’ll be speaking about open science at events in Berlin, New York and Boston over the next week. Here’s my current schedule of public and semi-public events:

  • Berlin, Friday 16 September, 5pm, event at the Freie Universität of Berlin: more details
  • New York, Courant Institute Colloquium, NYU, Monday 19 September, 3:45pm.
  • New York, event organized by the Coles Science Center and the NYU Libraries Information Futures Group, Monday 19 September, 6:30pm: more details
  • Boston, Harvard, Colloquium at the Institute for Theory and Computation in the Center for Astrophysics, Thursday 22 September: more details.
  • Boston, MIT Physics Colloquium (MIT only), Thursday 22 September: more details

Visiting Europe

I’ll be in Europe for the next couple of weeks, and will be giving several talks about open science. Here’s a rough schedule of where I’ll be and when:

Please come and say hello if you’re at one of the events!

I’m interested in adding more events to my schedule, so if you’re interested in having me speak, or would like to arrange for me to attend some sort of meetup (perhaps with a group), please let me know (mn@michaelnielsen.org).

I’ll add more details over the next couple of days, as details become available.

Public talk about open science in San Francisco

I’m pleased to say that I’ll be giving a public talk about open science in San Francisco, next Wednesday, June 29, at 6pm. The talk is being hosted by the Public Library of Science, and there will be wine, beer and cheese after the event.

The talk is entitled “Why the net doesn’t work for science – and how to fix it” [*]. Here’s my abstract for the talk:

The net is transforming many aspects of our society, from finance to friendship. And yet scientists, who helped create the net, are extremely conservative in how they use it. Although the net has great potential to transform science, most scientists remain stuck in a centuries-old system for the construction of knowledge. I will describe some leading-edge projects that show how online tools can radically change and improve science (using projects in Mathematics and Citizen Science as examples), and will then go on to discuss why these tools haven’t spread to all corners of science, and how we can change that.

The talk will be thematically similar to my recent talk about open science for TEDxWaterloo, but will go much deeper into the challenge and promise of open science.

For more details on the talk, including the address and a map, please see the PLoS blog. Please RSVP to rshah@plos.org if you plan to attend.

Hope to see you there!

[*] The title is a riff on the wonderful phrase “making the web work for science”, which I believe originated with James Boyle. For a recent talk on the subject by Boyle, see here (see also Creative Commons’ work on science).

Data-driven intelligence

I’ve started a new blog, on data-driven intelligence. In future, this is where technical posts like my Google Technology Stack posts will go. There are two posts up:

  • A post describing Pregel, Google’s system for implementing graph-based algorithms on large clusters of machines. In addition to describing how Pregel works, I give a toy single-machine Python implementation which can be used to play with Pregel. The code is up on GitHub.
  • Sex, Einstein, and Lady Gaga: what’s discussed on the most popular blogs. I crawled 50,000 pages from Technorati’s list of the top 1,000 blogs, and determined the percentage of pages containing words such as “sex”, “Einstein”, “Gaga”, and many others. The results were entertaining.

The blog, of course, has an RSS feed.

Quantum computing for the determined

I’ve posted to YouTube a series of 22 short videos giving an introduction to quantum computing. Here’s the first video:

Below I list the remaining 21 videos, which cover subjects including the basic model of quantum computing, entanglement, superdense coding, and quantum teleportation.

To work through the videos you need to be comfortable with basic linear algebra, and with assimilating new mathematical terminology. If you’re not, working through the videos will be arduous at best! Apart from that background, the main prerequisite is determination, and the willingness to work more than once over material you don’t fully understand.

In particular, you don’t need a background in quantum mechanics to follow the videos.

The videos are short, from 5 to 15 minutes, and each video focuses on explaining one main concept from quantum mechanics or quantum computing. In taking this approach I was inspired by the excellent Khan Academy.

The course is not complete — I originally planned about 8 more videos. The extra videos would complete my summary of basic quantum mechanics (+2 videos), and cover reversible computing (+2 videos), and Grover’s quantum search algorithm (+4 videos). Unfortunately, work responsibilities that couldn’t be put aside meant I had to put the remaining videos on hold. If lots of people work through the existing videos and are keen for more, then I’ll find time to finish them off. As it is, I hope the incomplete series is still useful.

One minor gotcha: originally, I was hoping to integrate the videos with a set of exercises. Again, time prevented me from doing this: there are no exercises. But as a remnant of this plan, in at least one video (video 7, the video on unitary matrices preserving length, and possibly elsewhere) I leave something “to the exercises”. Hopefully it’s pretty clear what needs to be filled in at this point, and viewers can supply the missing details.
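
As a small numerical illustration of the fact named above, that unitary matrices preserve length, here is a minimal sketch. It is not the exercise from the video, just a sanity check under stated assumptions: the helper names and the example state are hypothetical, chosen for illustration, and the matrix used is the standard Hadamard matrix.

```python
import math


def apply(matrix, vector):
    """Multiply a matrix (a list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean length of a vector of complex amplitudes."""
    return math.sqrt(sum(abs(v) ** 2 for v in vector))


s = 1 / math.sqrt(2)
HADAMARD = [[s, s],
            [s, -s]]  # a standard example of a unitary matrix

# A hypothetical single-qubit state: amplitudes 0.6 and 0.8i, length 1.
state = [0.6, 0.8j]
new_state = apply(HADAMARD, state)

print("length before:", norm(state))
print("length after: ", norm(new_state))
```

The two printed lengths should agree up to floating-point rounding. A check on one vector is of course not a proof; it only shows what the claim means in practice.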

Let me finish with two comments on approach. First, the videos treat quantum bits — qubits — as abstract mathematical entities, in a way similar to how we can think of conventional (classical) bits as 0 or 1, not as voltages in a circuit, or magnetic domains on a hard disk. I don’t get into the details of physical implementation at all. This approach bugs some people a lot, and others not at all. If you think it’ll bug you, these videos aren’t for you.

Second, the videos focus on the nuts-and-bolts of how things work. If you want a high-level overview of quantum computing, why it’s interesting, and what quantum computers may be capable of, there are many available online, a Google search away. Here’s a nice one, from Scott Aaronson. You may also enjoy David Deutsch’s original paper about quantum computing. It’s a bit harder to read than an article in Wired or Scientific American, but it’s worth the effort, for the paper gives a lot of insight into some of the fundamental reasons for thinking about quantum computing in the first place. Such higher-level articles may be helpful to read in conjunction with the videos.

Here’s the full list of videos, including the first one above. Note that because this really does get into the nuts and bolts of how things work, it also builds cumulatively. You can’t just skip straight to the quantum teleportation video and hope to understand it; you’ll need to work through the earlier videos, unless you already understand their content.

The basics

Superdense coding

Quantum teleportation

The postulates of quantum mechanics (TBC)

Thanks to Jen Dodd, Ilya Grigorik and Hassan Masum for feedback on the videos, and for many enjoyable discussions about open education.

If you enjoyed these videos, you may be interested in my forthcoming book, Reinventing Discovery, where I describe how online tools and open science are transforming the way scientific discoveries are made.