(Some) garbage in, gold out

During a recent talk, David Weinberger asked me (paraphrasing) whether and how the nature of scientific knowledge will change when it’s produced by large networked collaborations.

It’s a great question. Suppose it’s announced in the next few years that the LHC has discovered the Higgs boson. There will, no doubt, be a peer-reviewed scientific paper describing the result.

How should we regard such an announcement?

The chain of evidence behind the result will no doubt be phenomenally complex. The LHC analyses about 600 million particle collisions per second. The data analysis is done using a cluster of more than 200,000 processing cores, and tens of millions of lines of software code. That code is built on all sorts of extremely specialized knowledge and assumptions about detector and beam physics, statistical inference, quantum field theory, and so on. What’s more, that code, like any large software package, no doubt has many bugs, despite enormous efforts to eliminate them.

No one person in the world will understand in detail the entire chain of evidence that led to the discovery of the Higgs. In fact, it’s possible that very few (no?) people will understand in much depth even just the principles behind the chain of evidence. How many people have truly mastered quantum field theory, statistical inference, detector physics, and distributed computing?

What, then, should we make of any paper announcing that the Higgs boson has been found?

Standard pre-publication peer review will mean little. Yes, it’ll be useful as an independent sanity check of the work. But all it will show is that there are no glaringly obvious holes. It certainly won’t involve more than a cursory inspection of the evidence.

A related situation arose in the 1980s in mathematics. It was announced in the early 1980s that an extremely important mathematical problem had been solved: the classification of the finite simple groups. The proof had taken about 30 years, and involved an effort by 100 or so mathematicians, spread across many papers and thousands of pages of proof.

Unfortunately, the original proof had gaps. Most of them were not serious. But at least one serious gap remained. In 2004, two mathematicians published a two-volume, 1,200 page supplement to the original proof, filling in the gap. (At least, we hope they’ve filled in the gap!)

When discoveries rely on hundreds of pieces of evidence or steps of reasoning, we can be pretty sure of our conclusions, provided our error rate is low, say one part in a hundred thousand. But when we start to use a million or a billion (or a trillion or more) pieces of evidence or steps of reasoning, an error rate of one part in a million becomes a guarantee of failure, unless we develop systems that can tolerate those errors.
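To make the arithmetic concrete, here’s a quick back-of-the-envelope calculation (my illustration, not a model of any particular experiment): if each of n independent steps fails with probability p, the chance of at least one error is 1 − (1 − p)^n.

```python
# Probability of at least one error among n independent steps of reasoning
# or pieces of evidence, each with per-step error rate p.
def failure_probability(n, p):
    return 1 - (1 - p) ** n

# A few hundred steps at an error rate of 1 in 100,000: errors are unlikely.
print(round(failure_probability(100, 1e-5), 3))     # 0.001
# A billion steps at an error rate of 1 in a million: essentially certain.
print(round(failure_probability(10**9, 1e-6), 3))   # 1.0
```

The independence assumption is crude, but it captures the scaling: the failure probability grows roughly like n × p until it saturates near certainty.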

It seems to me that one of the core questions the scientific community will wrestle with over the next few decades is this: what principles and practices should we use to judge whether or not a conclusion drawn from a large body of networked knowledge is correct? To put it another way, how can we ensure that we reliably come to correct conclusions, despite the fact that some of our evidence or reasoning is almost certainly wrong?

At the moment each large-scale collaboration addresses this in its own way. The people at the LHC and those responsible for the classification of the finite simple groups are certainly very smart, and I’ve no doubt they’re doing lots of smart things to eliminate or greatly reduce the impact of errors. But it’d be good to have a principled way of understanding how and when we can come to correct scientific conclusions, in the face of low-level errors in the evidence and reasoning used to arrive at those conclusions.

If you doubt there’s a problem here, then think about the mistakes that led to the Pentium floating point bug. Or think of the loss of the Mars Climate Orbiter. That’s often described as a failure to convert between metric and imperial units, which makes it sound trivial, like the people at NASA are fools. The real problem was deeper. As a NASA official said:

People sometimes make errors. The problem here was not the error [of unit conversion], it was the failure of NASA’s systems engineering, and the checks and balances in our processes to detect the error. That’s why we lost the spacecraft.

In other words, when you’re working at NASA scale, problems that are unlikely at small scale, like failing to do a unit conversion, are certain to occur. It’s foolish to act as though they won’t happen. Instead, you need to develop systems which limit the impact of such errors.

In the context of science, what this means is that we need new methods of fault-tolerant discovery.

I don’t have well-developed answers to the questions I’ve raised above, riffing off David Weinberger’s original question. But I will finish with the notion that one useful source of ideas may be systems and safety engineering, which are responsible for the reliable performance of complex systems such as modern aircraft. According to Boeing, a 747-400 has six million parts, and the first 747 required 75,000 engineering drawings. Not to mention all the fallible human “components” in a modern aircraft. Yet aircraft systems and safety engineers have developed checks and balances that let us draw with very high probability the conclusion “The plane will get safely from point A to B”. Sounds like a promising source of insights to me!

Further reading: An intriguing experiment in distributed verification of a mathematical proof is described in an article by Doron Zeilberger. Even if you can’t follow the mathematics, it’s stimulating to look through. I’ve taken a stab at some of the issues in this post before, in my essay Science beyond individual understanding. I’m also looking forward to David Weinberger’s new book about networked knowledge, Too Big To Know. Finally, my new book Reinventing Discovery is about the promise and the challenges of networked science.

Open Access: a short summary

I wrote the following essay for one of my favourite online forums, Hacker News, which over the past few months has seen more and more discussion of the issue of open access to scientific publication. It seems like it might have broader interest, so I’m reposting it here. Original link here.

The topic of open access to scientific papers comes up often on Hacker News.

Unfortunately, those discussions sometimes bog down in misinformation and misunderstandings.

Although it’s not exactly my area of expertise, it’s close — I’ve spent the last three years working on open science.

So I thought it might be useful to post a summary of the current state of open access. There’s a lot going on, so even though this essay appears lengthy, it’s actually a very brief and incomplete summary of what’s happening. I have links to further reading at the end.

This is not a small stakes game. The big scientific publishers are phenomenally profitable. In 2009, Elsevier made a profit of 1.1 billion dollars on revenue of 3.2 billion dollars. That’s a margin (and business model) they are very strongly motivated to protect. They’re the biggest commercial journal publisher, but the other big publishers are also extremely profitable.

Even not-for-profit societies often make an enormous profit on their journals. In 2004 (the most recent year for which I have figures) the American Chemical Society made a profit of 40 million dollars on revenues of 340 million dollars. Not bad! This money is reinvested in other society activities, including salaries. Top execs receive salaries in the 500k to 1m range (as of 2006; I’m sure it’s quite a bit higher now).

The traditional publishers make money by charging journal subscription fees to libraries. Why they make so much money is a matter for much discussion, but I will merely point out one fact: there are big systematic inefficiencies built into the market. University libraries for the most part pay the subscription fees, but they rely on guidance (and often respond to pressure) from faculty members in deciding what journals to subscribe to. In practice, faculty often have a lot of power in making these decisions, without bearing the costs. And so they can be quite price-insensitive.

The journal publishers have wildly varying (and changing) responses to the notion of open access.

For example, most Springer journals are closed access, but in 2008 Springer bought BioMedCentral, one of the original open access publishers, and by some counts the world’s largest. They continue to operate. (More on the deal here.)

[Update: It has been pointed out to me in email that Springer now uses a hybrid open access model for most of their journals, whereby authors can opt to pay a fee to make their articles open access. If the authors don’t pay that fee, the articles remain closed. The other Springer journals, including BioMedCentral, are fully open access.]

Nature Publishing Group is also mostly closed access, but has recently started an open access journal called Scientific Reports, apparently modeled after the (open access) Public Library of Science’s journal PLoS One.

It is sometimes stated that big commercial publishers don’t allow authors to put free-to-access copies of their papers on the web. In fact, policies vary quite a bit from publisher to publisher. Elsevier and Springer, for example, do allow authors to put copies of their papers on their websites, and into institutional repositories. This doesn’t mean that always (or even often) happens, but it’s at least in principle possible.

Comments on HN sometimes assume that open access is somehow a new issue, or an issue that no-one has been doing anything about until recently.

This is far from the case. Take a look at the Open Access Newsletters and you’ll realize that there’s a community of people working very, very hard for open access. They’re just not necessarily working in ways that are visible to hackers.

Nonetheless, as a result of the efforts of people in the open access movement, a lot of successes have been achieved, and there is a great deal of momentum toward open access.

Here are a few examples of success:

In 2008 the US National Institutes of Health (NIH) — by far the world’s largest funding agency, with a budget of more than $30 billion a year — adopted a policy requiring that all NIH-funded research be made openly accessible within 12 months of publication. See here for more.

All 7 UK Research Councils have adopted similar open access policies requiring researchers they fund to make their work openly accessible.

Many universities have adopted open access policies. Examples include: Harvard’s Faculty of Arts and Sciences, MIT, and Princeton.

As a result of policies like these, in years to come you should see more and more freely downloadable papers showing up in search results.

Note that there are a lot of differences of detail in the different policies, and those details can make a big difference to the practical impact of the policies. I won’t try to summarize all the nuances here, I’m merely pointing out that there is a lot of institutional movement.

Many more pointers to open access policies may be found at ROARMAP. That site notes 52 open access policies from grant agencies, and 135 from academic institutions.

There’s obviously still a long way to go before there is universal open access to publicly-funded research, but there has been a lot of progress, and a lot of momentum.

One thing that I hope will happen is that the US Federal Research Public Access Act passes. First proposed in 2006 (and again in 2010), this Act would essentially extend the NIH policy to all US Government-funded research (from agencies with budgets over $100 million). My understanding is that at present the Act is tied up in committee.

Despite (or because of) this progress, there is considerable pushback on the open access movement from some scientific publishers. As just one instance, in 2007 some large publishers hired a very aggressive PR firm to wage a campaign to publicly discredit open access.

I will not be surprised if this pushback escalates.

What can hackers do to help out?

One great thing to do is start a startup in this space. Startups (and former startups) like Mendeley, ChemSpider, BioMedCentral, PLoS and others have had a big impact over the past ten or so years, but there’s even bigger opportunities for hackers to really redefine scientific publishing. Ideas like text mining, recommender systems, open access to data, automated inference, and many others can be pushed much, much further.

I’ve written about this in the following essay: Is Scientific Publishing About to be Disrupted? Many of those ideas are developed in much greater depth in my book on open science, Reinventing Discovery.

For less technical (and less time-consuming!) ways of getting involved, you may want to subscribe to the RSS feed at the Alliance for Taxpayer Access. This organization was crucial in helping get the NIH open access policy passed, and they’re helping do the same for the Federal Research Public Access Act, as well as other open access efforts.

If you want to know more, the best single resource I know is Peter Suber’s website.

Suber has, for example, written an extremely informative introduction to open access. His still-active Open Access Newsletter is a goldmine of information, as is his (no longer active) blog. He also runs the open access tracking project.

If you got this far, thanks for reading! Corrections are welcome.

Georgia Tech, Duke University, and University of North Carolina

I’ll be speaking about open science at events at Georgia Tech, Duke University, and the University of North Carolina over the next few days. Here’s my current schedule of public and semi-public events:

  • Events on Monday, October 3, at 11:30am and 3:00pm at Georgia Tech: details of both events.
  • Event on Tuesday, October 4, at 4pm at Duke University: details.
  • I will be at the University of North Carolina on Wednesday, October 5. I am not currently doing any public events, but let me know if you’d like to meet, and it’s possible something can be arranged.

Berlin, New York, Boston

I’ll be speaking about open science at events in Berlin, New York and Boston over the next week. Here’s my current schedule of public and semi-public events:

  • Berlin, Friday 16 September, 5pm, event at the Freie Universität of Berlin: more details
  • New York, Courant Institute Colloquium, NYU, Monday 19 September, 3:45pm.
  • New York, event organized by the Coles Science Center and the NYU Libraries Information Futures Group, Monday 19 September, 6:30pm: more details
  • Boston, Harvard, Colloquium at the Institute for Theory and Computation in the Center for Astrophysics, Thursday 22 September: more details.
  • Boston, MIT Physics Colloquium (MIT only), Thursday 22 September: more details

Visiting Europe

I’ll be in Europe for the next couple of weeks, and will be giving several talks about open science. Here’s a rough schedule of where I’ll be and when:

Please come and say hello if you’re at one of the events!

I’m interested in adding more events to my schedule, so if you’re interested in having me speak, or would like to arrange for me to attend some sort of meetup (perhaps with a group), please let me know (mn@michaelnielsen.org).

I’ll add more details over the next couple of days, as details become available.

Public talk about open science in San Francisco

I’m pleased to say that I’ll be giving a public talk about open science in San Francisco, next Wednesday, June 29, at 6pm. The talk is being hosted by the Public Library of Science, and there will be wine, beer and cheese after the event.

The talk is entitled “Why the net doesn’t work for science – and how to fix it” [*]. Here’s my abstract for the talk:

The net is transforming many aspects of our society, from finance to friendship. And yet scientists, who helped create the net, are extremely conservative in how they use it. Although the net has great potential to transform science, most scientists remain stuck in a centuries-old system for the construction of knowledge. I will describe some leading-edge projects that show how online tools can radically change and improve science (using projects in Mathematics and Citizen Science as examples), and will then go on to discuss why these tools haven’t spread to all corners of science, and how we can change that.

The talk will be thematically similar to my recent talk about open science for TEDxWaterloo, but will go much deeper into the challenge and promise of open science.

For more details on the talk, including the address and a map, please see the PLoS blog. Please RSVP to rshah@plos.org if you plan to attend.

Hope to see you there!

[*] The title is a riff on the wonderful phrase “making the web work for science”, which I believe originated with James Boyle. For a recent talk on the subject by Boyle, see here (see also Creative Commons’ work on science).

Data-driven intelligence

I’ve started a new blog, on data-driven intelligence. In future, this is where technical posts like my Google Technology Stack posts will go. There are two posts up:

  • A post describing Pregel, Google’s system for implementing graph-based algorithms on large clusters of machines. In addition to describing how Pregel works, I give a toy single-machine Python implementation which can be used to play with Pregel. The code is up on GitHub.
  • Sex, Einstein, and Lady Gaga: what’s discussed on the most popular blogs. I crawled 50,000 pages from Technorati’s list of the top 1,000 blogs, and determined the percentage of pages containing words such as “sex”, “Einstein”, “Gaga”, and many others. The results were entertaining.

The blog, of course, has an RSS feed.
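For readers curious about the vertex-centric model the Pregel post describes, here’s a minimal single-machine sketch of the idea (my own toy illustration, not the implementation from the post): in each superstep, every active vertex processes its incoming messages, optionally updates its state and sends messages to its neighbours, and votes to halt when nothing changes. This example propagates the maximum value through a graph.

```python
# Toy Pregel-style computation: every vertex learns the maximum value
# held by any vertex in the (strongly connected) graph.
def pregel_max(values, edges):
    """values: {vertex: initial value}; edges: {vertex: [neighbours]}."""
    inbox = {v: [] for v in values}
    active = set(values)                     # all vertices start active
    while active:
        outbox = {v: [] for v in values}
        next_active = set()
        for v in active:
            incoming = inbox[v]
            new_value = max([values[v]] + incoming)
            if incoming and new_value == values[v]:
                continue                     # no change: vote to halt
            values[v] = new_value
            for w in edges.get(v, []):       # notify neighbours
                outbox[w].append(new_value)
                next_active.add(w)           # a message reactivates a vertex
        inbox = outbox
        active = next_active
    return values

# A 4-vertex cycle: every vertex converges to the global maximum, 6.
print(pregel_max({1: 3, 2: 6, 3: 2, 4: 1},
                 {1: [2], 2: [3], 3: [4], 4: [1]}))
```

Real Pregel partitions the vertices across many machines and delivers the messages over the network, but the superstep-and-message structure is the same.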

Quantum computing for the determined

I’ve posted to YouTube a series of 22 short videos giving an introduction to quantum computing. Here’s the first video:

Below I list the remaining 21 videos, which cover subjects including the basic model of quantum computing, entanglement, superdense coding, and quantum teleportation.

To work through the videos you need to be comfortable with basic linear algebra, and with assimilating new mathematical terminology. If you’re not, working through the videos will be arduous at best! Apart from that background, the main prerequisite is determination, and the willingness to work more than once over material you don’t fully understand.

In particular, you don’t need a background in quantum mechanics to follow the videos.

The videos are short, from 5 to 15 minutes, and each focuses on explaining one main concept from quantum mechanics or quantum computing. In taking this approach I was inspired by the excellent Khan Academy.

The course is not complete — I originally planned about 8 more videos. The extra videos would complete my summary of basic quantum mechanics (+2 videos), and cover reversible computing (+2 videos), and Grover’s quantum search algorithm (+4 videos). Unfortunately, work responsibilities that couldn’t be put aside meant I had to put the remaining videos on hold. If lots of people work through the existing videos and are keen for more, then I’ll find time to finish them off. As it is, I hope the incomplete series is still useful.

One minor gotcha: originally, I was hoping to integrate the videos with a set of exercises. Again, time prevented me from doing this: there are no exercises. But as a remnant of this plan, in at least one video (video 7, the video on unitary matrices preserving length, and possibly elsewhere) I leave something “to the exercises”. Hopefully it’s pretty clear what needs to be filled in at this point, and viewers can supply the missing details.

Let me finish with two comments on approach. First, the videos treat quantum bits — qubits — as abstract mathematical entities, in a way similar to how we can think of conventional (classical) bits as 0 or 1, not as voltages in a circuit, or magnetic domains on a hard disk. I don’t get into the details of physical implementation at all. This approach bugs some people a lot, and others not at all. If you think it’ll bug you, these videos aren’t for you.
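As a small illustration of that abstract view (my own sketch, not code from the videos), here’s a qubit represented as a 2-component complex vector, with the Hadamard gate as a unitary matrix acting on it:

```python
import numpy as np

# The computational basis state |0>, as an abstract 2-component complex vector.
ket0 = np.array([1, 0], dtype=complex)

# The Hadamard gate, a unitary matrix.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

# Applying the gate gives the superposition (|0> + |1>) / sqrt(2).
psi = H @ ket0

# Unitary matrices preserve length (the subject of video 7): the norm stays 1.
print(np.isclose(np.linalg.norm(psi), 1))   # True

# Measurement probabilities are the squared amplitudes: 1/2 for each outcome.
print(np.abs(psi) ** 2)                     # [0.5 0.5]
```

Nothing here refers to voltages, photons, or ion traps: the qubit is just a vector, exactly as in the classical-bit analogy above.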

Second, the videos focus on the nuts-and-bolts of how things work. If you want a high-level overview of quantum computing, why it’s interesting, and what quantum computers may be capable of, there are many available online, a Google search away. Here’s a nice one, from Scott Aaronson. You may also enjoy David Deutsch’s original paper about quantum computing. It’s a bit harder to read than an article in Wired or Scientific American, but it’s worth the effort, for the paper gives a lot of insight into some of the fundamental reasons for thinking about quantum computing in the first place. Such higher-level articles may be helpful to read in conjunction with the videos.

Here’s the full list of videos, including the first one above. Note that because this really does get into the nuts and bolts of how things work, it also builds cumulatively. You can’t just skip straight to the quantum teleportation video and hope to understand it; you’ll need to work through the earlier videos, unless you already understand their content.

The basics

Superdense coding

Quantum teleportation

The postulates of quantum mechanics (TBC)

Thanks to Jen Dodd, Ilya Grigorik and Hassan Masum for feedback on the videos, and for many enjoyable discussions about open education.

If you enjoyed these videos, you may be interested in my forthcoming book, Reinventing Discovery, where I describe how online tools and open science are transforming the way scientific discoveries are made.

Survey notes on fermi algebras and the Jordan-Wigner transform now on GitHub

Continuing the theme of my last post, I’ve now put my old survey notes on fermi algebras and the Jordan-Wigner transform up on GitHub.

The Jordan-Wigner transform is an amazing tool. It lets you move back and forth between two seemingly very different ways of describing a physical system, either as a collection of qubits, or as a collection of fermions. To give you an idea of the power of the Jordan-Wigner transform, in his famous 1982 paper on quantum computing, Richard Feynman wrote the following:

could we [use a quantum computer to] imitate every quantum mechanical system which is discrete and has a finite number of degrees of freedom? I know, almost certainly, that we could do that for any quantum mechanical system which involves Bose particles. I’m not sure whether Fermi particles could be described by such a system. So I leave that open.

As shown in the notes, once you understand the Jordan-Wigner transform, the answer to Feynman’s question is obvious: yes, we can use quantum computers to simulate systems of fermions. The reason is that the Jordan-Wigner transform lets us view the fermi system as a system of qubits, which is easy to simulate using standard simulation techniques. Obviously, the point here isn’t that Feynman was silly: it’s that tools like the Jordan-Wigner transform can make formerly hard things very simple.
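To make this concrete, here’s a small numerical check (my illustration, using the standard textbook convention rather than anything specific to the notes): the Jordan-Wigner transform represents the fermionic annihilation operator a_j as a string of Pauli Z matrices on qubits 0 through j−1, followed by the lowering operator on qubit j. The resulting matrices satisfy the canonical anticommutation relations that define fermions.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)       # Pauli Z
lower = np.array([[0, 1], [0, 0]], dtype=complex)    # lowering operator |0><1|

def jw_annihilation(j, n):
    """Jordan-Wigner form of the fermionic operator a_j on n qubits:
    Z on qubits 0..j-1, the lowering operator on qubit j, identity after."""
    factors = [Z] * j + [lower] + [I2] * (n - j - 1)
    return reduce(np.kron, factors)

def anticommutator(A, B):
    return A @ B + B @ A

n = 3
a = [jw_annihilation(j, n) for j in range(n)]

# Canonical anticommutation relations: {a_j, a_k^dagger} = delta_jk * I
# and {a_j, a_k} = 0, for all j, k.
for j in range(n):
    for k in range(n):
        target = np.eye(2**n) if j == k else np.zeros((2**n, 2**n))
        assert np.allclose(anticommutator(a[j], a[k].conj().T), target)
        assert np.allclose(anticommutator(a[j], a[k]), 0)
print("canonical anticommutation relations verified for", n, "modes")
```

The Z strings are what enforce the fermionic minus signs across different qubits; drop them and you get commuting hard-core bosons instead.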

The notes assume familiarity with elementary quantum mechanics, comfort with elementary linear algebra, and a little familiarity with the basic nomenclature of quantum information science (qubits, the Pauli matrices).

I’m releasing the notes under a Creative Commons Attribution license (CC BY 3.0). That means anyone can copy, distribute, transmit and adapt/remix the work, provided my contribution is attributed. The notes could be used, for example, to help flesh out Wikipedia’s article about the Jordan-Wigner transform. Or perhaps they could usefully be adapted into course notes, or part of a review article.