(Some) garbage in, gold out

During a recent talk, David Weinberger asked me (I’m paraphrasing) whether and how the nature of scientific knowledge will change when it’s produced by large networked collaborations.

It’s a great question. Suppose it’s announced in the next few years that the LHC has discovered the Higgs boson. There will, no doubt, be a peer-reviewed scientific paper describing the result.

How should we regard such an announcement?

The chain of evidence behind the result will no doubt be phenomenally complex. The LHC analyses about 600 million particle collisions per second. The data analysis is done using a cluster of more than 200,000 processing cores, and tens of millions of lines of software code. That code is built on all sorts of extremely specialized knowledge and assumptions about detector and beam physics, statistical inference, quantum field theory, and so on. What’s more, that code, like any large software package, almost certainly contains many bugs, despite enormous efforts to eliminate them.

No one person in the world will understand in detail the entire chain of evidence that led to the discovery of the Higgs. In fact, it’s possible that very few (no?) people will understand in much depth even just the principles behind the chain of evidence. How many people have truly mastered quantum field theory, statistical inference, detector physics, and distributed computing?

What, then, should we make of any paper announcing that the Higgs boson has been found?

Standard pre-publication peer review will mean little. Yes, it’ll be useful as an independent sanity check of the work. But all it will show is that there are no glaringly obvious holes. It certainly won’t involve more than a cursory inspection of the evidence.

A related situation arose in mathematics in the early 1980s, when it was announced that an extremely important problem had been solved: the classification of the finite simple groups. The proof had taken about 30 years, and involved an effort by 100 or so mathematicians, spread across many papers and thousands of pages of proof.

Unfortunately, the original proof had gaps. Most of them were not serious. But at least one serious gap remained. In 2004, two mathematicians published a two-volume, 1,200-page supplement to the original proof, filling in the gap. (At least, we hope they’ve filled in the gap!)

When discoveries rely on hundreds of pieces of evidence or steps of reasoning, we can be pretty sure of our conclusions, provided our error rate is low, say one part in a hundred thousand. But when we start to use a million or a billion (or a trillion or more) pieces of evidence or steps of reasoning, an error rate of one part in a million becomes a guarantee of failure, unless we develop systems that can tolerate those errors.
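
To make the arithmetic concrete, here’s a minimal back-of-the-envelope sketch in Python. The numbers are illustrative choices of my own, and it makes the simplifying assumption that errors strike each piece of evidence independently; `chance_all_correct` is just a hypothetical helper for this post, not anyone’s actual analysis code.

```python
def chance_all_correct(n, error_rate):
    """Probability that n independent pieces of evidence are all error-free."""
    return (1 - error_rate) ** n

# A few hundred steps with an error rate of 1 in 100,000:
# the whole chain is almost certainly error-free.
print(chance_all_correct(300, 1e-5))            # ~0.997

# A billion steps with an error rate of 1 in a million:
# at least one error is essentially certain (the value is ~exp(-1000)).
print(chance_all_correct(1_000_000_000, 1e-6))  # ~0.0
```

On those assumptions, a billion-step chain with a one-in-a-million error rate expects roughly a thousand errors, so the question isn’t whether errors creep in, but whether the conclusion can survive them.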

It seems to me that one of the core questions the scientific community will wrestle with over the next few decades is what principles and practices we should use to judge whether or not a conclusion drawn from a large body of networked knowledge is correct. To put it another way, how can we ensure that we reliably come to correct conclusions, despite the fact that some of our evidence or reasoning is almost certainly wrong?

At the moment each large-scale collaboration addresses this in its own way. The people at the LHC and those responsible for the classification of finite simple groups are certainly very smart, and I’ve no doubt they’re doing lots of smart things to eliminate or greatly reduce the impact of errors. But it’d be good to have a principled way of understanding how and when we can come to correct scientific conclusions, in the face of low-level errors in the evidence and reasoning used to arrive at those conclusions.

If you doubt there’s a problem here, then think about the mistakes that led to the Pentium floating point bug. Or think of the loss of the Mars Climate Orbiter. That’s often described as a failure to convert between metric and imperial units, which makes it sound trivial, as though the people at NASA were fools. The real problem was deeper. As a NASA official said:

People sometimes make errors. The problem here was not the error [of unit conversion], it was the failure of NASA’s systems engineering, and the checks and balances in our processes to detect the error. That’s why we lost the spacecraft.

In other words, when you’re working at NASA scale, problems that are unlikely at small scale, like failing to do a unit conversion, are certain to occur. It’s foolish to act as though they won’t happen. Instead, you need to develop systems which limit the impact of such errors.

In the context of science, what this means is that we need new methods of fault-tolerant discovery.

I don’t have well-developed answers to the questions I’ve raised above, riffing off David Weinberger’s original question. But I will finish with the notion that one useful source of ideas may be systems and safety engineering, which are responsible for the reliable performance of complex systems such as modern aircraft. According to Boeing, a 747-400 has six million parts, and the first 747 required 75,000 engineering drawings. Not to mention all the fallible human “components” in a modern aircraft. Yet aircraft systems and safety engineers have developed checks and balances that let us draw with very high probability the conclusion “The plane will get safely from point A to B”. Sounds like a promising source of insights to me!

Further reading: An intriguing experiment in distributed verification of a mathematical proof is described in an article by Doron Zeilberger. Even if you can’t follow the mathematics, it’s stimulating to look through. I’ve taken a stab at some of the issues in this post before, in my essay Science beyond individual understanding. I’m also looking forward to David Weinberger’s new book about networked knowledge, Too Big To Know. Finally, my new book Reinventing Discovery is about the promise and the challenges of networked science.

15 comments

  1. More further reading: http://arxiv.org/abs/0810.5515 “Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes”

    > In this paper, we argue that there are important new methodological problems which arise when assessing global catastrophic risks and we focus on a problem regarding probability estimation. When an expert provides a calculation of the probability of an outcome, they are really providing the probability of the outcome occurring, given that their argument is watertight. However, their argument may fail for a number of reasons such as a flaw in the underlying theory, a flaw in the modeling of the problem, or a mistake in the calculations. If the probability estimate given by an argument is dwarfed by the chance that the argument itself is flawed, then the estimate is suspect. We develop this idea formally, explaining how it differs from the related distinctions of model and parameter uncertainty. Using the risk estimates from the Large Hadron Collider as a test case, we show how serious the problem can be when it comes to catastrophic risks and how best to address it.

    One of my favorite papers on the topic.

  2. @greg and @gwern: thanks for the suggestions, they look very interesting.

    @Jason: Unfortunately, I don’t know what the best texts are on these subjects. John Sidles probably knows, though; maybe he can say. Some useful background that I’ve enjoyed includes books like Richard Rhodes’ “Making of the Atomic Bomb” (and several other books about Los Alamos), and “The Pentium Chronicles” (whose author escapes me right now). Neither is systems engineering exactly, but they certainly overlap. I got interested by chatting with several systems engineers, notably a couple of people who worked for Intel, and my father-in-law, who was an aircraft safety engineer.

  3. Michael – an old-fashioned example (at least for the LHC case) is that an independent group should be able to verify the result. This would usually be required before, e.g., a Nobel prize is awarded. I don’t see a fundamental issue with this – although taxpayers may balk at paying for a second different experiment with a price tag similar to the LHC’s. One reason Bednorz & Muller got their Nobel so quickly is that everyone could make their own high-temperature superconductor once they read B&M’s recipe – so we knew that their result was correct. It will be interesting to see if this is required if the LHC finds the Higgs. (And certainly more interesting than the inevitable theorist priority fight.)

  4. Replication will certainly be helpful, but there are some significant problems. First, it might take 10 or more years and a few billion dollars to do such a replication. Furthermore, it won’t be truly independent, since it will depend on all kinds of shared assumptions, sometimes in hard-to-control-for ways (e.g., statistical analyses might be done using the same numerical libraries, that kind of thing). So what should we do in the meantime? And how independent is independent enough?

  5. A few thoughts:

    Sussman’s propagator model is an ambitious attempt to answer this question. Section 6 of his report is quite interesting.

    Knight and Leveson ran an experiment on independent replication of software. The traditional summary is that the programmers all make the same mistakes, and it doesn’t work. I haven’t read the paper for a while, so I don’t know how accurate that summary is.

    Boeing invented a method to verify redundancy in the 747 design, called fault tree analysis. They went through each part in turn, and checked that the plane would keep flying if it broke. One could take a similar approach to verifying a proof: for each lemma, find a way to fix the proof if it were false. Similarly, for every measurement taken at the LHC, someone could ask why they’d still believe in the Higgs boson if they didn’t know the result. Sussman’s work is partly about systematic ways to do that.

  6. Each proof is the result of the combination of a number of subordinate lemmas. Focusing on this set, presumably the elements have previously been combined in various subsets to determine proofs that are more easily verified in other experimental ways and have therefore established their own reliability independently. The more verifiable proofs they have contributed to, the more confidence in the lemma. One hopes that there are scientists determining the weak members of the set, and endeavouring to establish their usefulness with projects less complex than the Higgs boson.

  7. You need not be worried about large collaborations. Large collaborations between humans, where no one human knows how everything works, are nothing new. For example, you do not need to know how your clothes are made to wear them, or how the sewer system works to use it. I have worked in a large collaboration of 500+ people on a high-energy physics experiment like the LHC, and I think I can shed some light onto how those collaborations operate. It is true that no one person knows everything about everything, but the people doing the physics analysis do know everything they need to about quantum field theory, statistical inference, beam physics, etc. They may not know exactly how the cluster of computers they use works, but they don’t really need to. This situation is no different from anything else in the world.

  8. Dear Michael,

    you might enjoy reading John Rushby, a computer scientist and expert in formal verification and system safety, on the topic of aircraft hardware and software certification processes and practices.

    Recent papers about currently mandatory hardware and software development process standards, which were initially developed as consensus documents within the avionics community, and which are now a regulatory requirement for critical software in aviation:
    http://fm.csl.sri.com/~rushby/abstracts/sefm09
    http://fm.csl.sri.com/~rushby/papers/sss10.pdf

    A classic paper of John’s introducing the application of formal verification to FAA and avionics engineers:
    http://fm.csl.sri.com/~rushby/abstracts/csl-93-7

    regards,
    Martin

    [MN: Thanks for the pointers, Martin!]
