Introduction to the Polymath Project and “Density Hales-Jewett and Moser Numbers”

In January of 2009, Tim Gowers initiated an experiment in massively collaborative mathematics, the Polymath Project. The initial stage of this project was extremely successful, and led to two scientific papers: “A new proof of the density Hales-Jewett theorem” and “Density Hales-Jewett and Moser numbers”. The second of these papers will soon appear in a birthday volume in honour of Endre Szemeredi. The editor of the Szemeredi birthday volume, Jozsef Solymosi, invited me to submit an introduction to that paper, and to the Polymath Project more generally. The following is a draft of my introductory piece. I’d be very interested in hearing feedback. Note that the early parts of the article briefly discuss some mathematics, but if you’re not mathematically inclined the remainder of the article should be comprehensible. Many of the themes of the article will be discussed at much greater length in my book about open science, “Reinventing Discovery”, to be published early in 2011.

At first appearance, the paper which follows this essay appears to be a typical mathematical paper. It poses and partially answers several combinatorial questions, and follows the standard forms of mathematical discourse, with theorems, proofs, conjectures, and so on. Appearances are deceiving, however, for the paper has an unusual origin, a clue to which is in the name of the author, one D. H. J. Polymath. Behind this unusual name is a bold experiment in how mathematics is done. This experiment was initiated in January of 2009 by W. Timothy Gowers, and was an experiment in what Gowers termed “massively collaborative mathematics”. The idea, in brief, was to attempt to solve a mathematical research problem working entirely in the open, using Gowers’s blog as a medium for mathematical collaboration. The hope was that a large number of mathematicians would contribute, and that their collective intelligence would make easy work of what would ordinarily be a difficult problem. Gowers dubbed the project the “Polymath Project”. In this essay I describe how the Polymath Project proceeded, and reflect on similarities to online collaborations in the open source and open science communities. Although I followed the Polymath Project closely, my background is in theoretical physics, not combinatorics, and so I did not participate directly in the mathematical discussions. The perspective is that of an interested outsider, one whose main creative interests are in open science and collective intelligence.

Gowers began the Polymath Project with a description of the problem to be attacked (see below), a list of rules of collaboration, and a list of 38 brief observations he’d made about the problem, intended to serve as starting inspiration for discussion. At that point, on February 1, 2009, other people were invited to contribute their thoughts on the problem. Anyone with an interest and an internet connection could follow along and, if they wished, contribute their ideas in the comment section of Gowers’s blog. In just the first 24 hours after Gowers opened his blog up for discussion, six people offered 24 comments. In a sign of things to come, those contributors came from four countries on three continents, and included a high-school teacher, a graduate student, and four professors of mathematics. A collaboration was underway, a collaboration which expanded in the weeks that followed to involve more than twenty people.

The problem originally posed by Gowers was to investigate a new approach to a special case of the density Hales-Jewett theorem (DHJ). Let me briefly describe the statement of the theorem, before describing the special case Gowers proposed to attack. Let [tex][k]^n[/tex] be the set of all length [tex]n[/tex] strings over the alphabet [tex]1,2,\ldots,k[/tex]. A combinatorial line is a set of [tex]k[/tex] points in [tex][k]^n[/tex], formed by taking a string with one or more wildcards (“[tex]x[/tex]”) in it, e.g., 14x1xx3, and replacing those wildcards by [tex]1, 2,\ldots,k[/tex], respectively. In the example I’ve given, the resulting combinatorial line is: [tex]\{ 1411113, 1421223, \ldots, 14k1kk3 \}[/tex]. The density Hales-Jewett theorem says that as [tex]n[/tex] becomes large, even very low density subsets of [tex][k]^n[/tex] must contain a combinatorial line. More precisely, let us define the density Hales-Jewett number [tex]c_{n,k}[/tex] to be the size of the largest subset of [tex][k]^n[/tex] which does not contain a combinatorial line. Then the density Hales-Jewett theorem may be stated as:

Theorem (DHJ): [tex]\lim_{n\rightarrow \infty} c_{n,k}/k^n = 0[/tex].

DHJ was originally proved in 1991 by Furstenberg and Katznelson, using techniques from ergodic theory. Gowers proposed to find a combinatorial proof of the [tex]k=3[/tex] case of the theorem, using a strategy that he outlined on his blog. As the Polymath Project progressed, that goal gradually evolved. Four days after Gowers opened his blog up for discussion, Terence Tao used his blog to start a discussion aimed at understanding the behaviour of [tex]c_{n,3}[/tex] for small [tex]n[/tex]. This discussion rapidly gained momentum, and the Polymath Project split into two subprojects, largely carried out, respectively, on Gowers’s blog and Tao’s blog. The first subproject pursued and eventually found an elementary combinatorial proof of the full DHJ theorem. The results of second subproject are described in the paper which follows, “Density Hales-Jewett and Moser Numbers”. As mentioned, this second subproject began with the goal of understanding the behaviour of [tex]c_{n,3}[/tex] for small [tex]n[/tex]. It gradually broadened to consider several related questions, including the behaviour of [tex]c_{n,k}[/tex] for small [tex]n[/tex] and [tex]k[/tex], as well as the behaviour of the Moser numbers, [tex]c_{n,k}'[/tex], defined to be the size of the largest subset of [tex][k]^n[/tex] which contains no geometric line. As for a combinatorial line, a geometric line is defined by taking a strinq in [tex][k]^n[/tex] with one or more wildcard characters present. But unlike a combinatorial line, there are two distinct types of wildcards allowed (“[tex]x[/tex]” and “[tex]\overline x[/tex]”), with [tex]x[/tex] taken to vary over the range [tex]1,\ldots,k[/tex], and [tex]\overline x = k+1-x[/tex]. So, for example, [tex]13x\overline x2[/tex] generates the geometric line [tex]\{131k2,132(k-1)2,\ldots,13k12\}[/tex].

Both subprojects of the Polymath Project progressed quickly. On March 10, Gowers announced that he was confident that the polymaths had found a new combinatorial proof of DHJ. Just 37 days had passed since the collaboration began, and 27 people had contributed approximately 800 mathematical comments, containing 170,000 words. Much work remained to be done, but the original goal had already been surpassed, and this was a major milestone for the first subproject. By contrast, the goals of the second subproject were more open-ended, and no similarly decisive announcement was possible. Work on both continued for months thereafter, gradually shifting to focus on the writeup of results for publication.

Although the Polymath Project is unusual from the perspective of current practice in mathematics, there is another perspective from which it does not appear so unusual. That is the tradition of open source software development in the computer programming community. Perhaps the best known example of open source software is the Linux operating system. Begun by Linus Torvalds in 1991 as a hobby project, Linux has since grown to become one of the world’s most popular operating systems. Although not as widely used in the consumer market as Microsoft Windows, Linux is the primary operating system used on the giant computer clusters at companies such as Google, Yahoo! and Amazon, and also dominates in markets such as the movie industry, where it plays a major role at companies such as Dreamworks and Pixar.

A key feature of Linux is that, unlike proprietary software such as Microsoft Windows, the original source code for the operating system is freely available to be downloaded and modified. In his original message announcing Linux, Torvalds commented that “I’ve enjouyed [sic] doing it, and somebody might enjoy looking at it and even modifying it for their own needs. It is still small enough to understand, use and modify, and I’m looking forward to any comments you might have.” Because he had made the code publicly available, other people could add features if they desired. People began emailing code to Torvalds, who incorporated the changes he liked best into the main Linux code base. A Linux kernel discussion group was set up to co-ordinate work, and the number of people contributing code to Linux gradually increased. By 1994, 80 people were named in the Linux credits file as contributors.

Today, nearly twenty years later, Linux has grown enormously. The kernel of Linux contains 13 million lines of code. On an average day in 2007 and 2008, Linux developers added 4,300 lines of code, deleted 1,800 lines, and modified 1,500 lines. The social processes and tools used to create Linux have also changed enormously. In its early days, Linux used off-the-shelf tools and ad hoc social processes to manage development. But as Linux and the broader open source community have grown, that community has developed increasingly powerful tools to share and integrate code, and to manage discussion of development. They have also evolved increasingly sophisticated social structures to govern the process of large-scale open source development. None of this was anticipated at the outset by Torvalds – in 2003 he said “If someone had told me 12 years ago what would happen, I’d have been flabbergasted” – but instead happened organically.

Linux is just one project in a much broader ecosystem of open source projects. Deshpande and Riehle have conservatively estimated that more than a billion lines of open source software have been written, and more than 300 million lines are being added each year. Many of these are single-person projects, often abandoned soon after being initiated. But there are hundreds and perhaps thousands of projects with many active developers.

Although it began in the programming community, the open source collaboration process can in principle be applied to any digital artifact. It’s possible, for example, for a synthetic biologist to do open source biology, by freely sharing their DNA designs for living things, and then allowing others to contribute back changes that improve upon those designs. It’s possible for an architect to do open source architecture, by sharing design files, and then accepting contributions back from others. And, it’s possible to write an open source encyclopedia, by freely sharing the text of articles, and making it possible for others to contribute back changes. That’s how Wikipedia was written: Wikipedia is an open source project.

The Polymath Project is a natural extension of open source collaboration to mathematics. At first glance it appears to differ in one major way, for in programming the open source process aims to produce an artifact, the source code for the desired software. Similarly, in synthetic biology, architecture and the writing of an encyclopedia the desired end is an artifact of some sort. At least in the early stages of the Polymath Project, there was no obviously analogous artifact. It’s tempting to conclude that the two papers produced by the polymaths play this role, but I don’t think that’s quite right. In mathematics, the desired end isn’t an artifact, it’s mathematical understanding. And the Polymath process was a way of sharing that understanding openly, and gradually improving it through the contributions of many people.

The Polymath Project’s open approach to collaboration is part of a broader movement toward open science. Other prominent examples include the human genome project and the Sloan Digital Sky Survey, which use the internet to openly share data with the entire scientific community. This enables other scientists to find ingenius ways of reusing that data, often posing and answering questions radically different to those that motivated the people who originally took the data.

An example which gives the flavour of this reuse is the recent work by Boroson and Lauer, who used a computer algorithm to search through the spectra of 17,000 quasars from the Sloan Digital Sky Survey, looking for a subtle signature that they believed would indicate a pair of orbiting black holes. The result was the discovery of a candidate quasar containing a pair of supermassive black holes, 20 million and 800 million times the mass of the sun, respectively, and a third of a light year apart, orbiting one another roughly once every 100 years. This is just one of more than 3,000 papers to have cited the Sloan data, most of those papers coming from outside the Sloan collaboration.

People practicing open notebook science have carried this open data approach to its logical conclusion, sharing their entire laboratory record in real time. The Polymath Project and the open data and open notebook projects are all examples of scientists sharing information which, historically, has not been openly available, whether it be raw experimental data, observations made in a laboratory notebook, or ideas for the solution of a mathematical problem.

There is, however, a historical parallel to the early days of modern science. For example, when Galileo first observed what would later be recognized as Saturn’s rings, he sent an anagram to the astronomer Kepler so that if Kepler (or anyone else) later made the same discovery, Galileo could disclose the anagram and claim the credit. Such secretive behaviour was common at the time, and other scientists such as Huygens and Hooke also used devices such as anagrams to “publish” their discoveries. Many scientists waited decades before genuine publication, if they published at all. What changed this situation – the first open science revolution – was the gradual establishment of a link between the act of publishing a scientific discovery and the scientist’s prospects for employment. This establishment of scientific papers as a reputational currency gave scientists an incentive to share their knowledge. Today, we take this reputational currency for granted, yet it was painstakingly developed over a period of many decades in the 17th and 18th centuries. During that time community norms around authorship, citation, and attribution were slowly worked out by the scientific community.

A similar process is beginning today. Will pseudonyms such as D. H. J. Polymath become a commonplace? How should young scientists report their role in such collaborations, for purposes of job and grant applications? How should new types of scientific contribution – contributions such as data or blog comments or lab notebook entries – be valued by other scientists? All these questions and many more will need answers, if we are to take full advantage of the potential of new ways of working together to generate knowledge.

Polymath4

The Polymath4 Project is now underway, with the first formal post here.

The basic problem is very simple and appealing: it’s to find a deterministic algorithm which will quickly generate a prime of at least some given length, ideally in time polynomial in that length. There are fast algorithms which will generate such a prime with high probability – cryptography algorithms like RSA wouldn’t work if that weren’t true. But there’s no known deterministic algorithm.

I’m going to miss the first week of the project – I’ll be camping in a field in the Netherlands, surrounded by 1000+ hackers. But I’m looking forward to catching up when I come back.

On a related note, John Baez asks what mathematicians need to know about blogs.