Introduction to the Polymath Project and “Density Hales-Jewett and Moser Numbers”

In January of 2009, Tim Gowers initiated an experiment in massively collaborative mathematics, the Polymath Project. The initial stage of this project was extremely successful, and led to two scientific papers: “A new proof of the density Hales-Jewett theorem” and “Density Hales-Jewett and Moser numbers”. The second of these papers will soon appear in a birthday volume in honour of Endre Szemeredi. The editor of the Szemeredi birthday volume, Jozsef Solymosi, invited me to submit an introduction to that paper, and to the Polymath Project more generally. The following is a draft of my introductory piece. I’d be very interested in hearing feedback. Note that the early parts of the article briefly discuss some mathematics, but if you’re not mathematically inclined the remainder of the article should be comprehensible. Many of the themes of the article will be discussed at much greater length in my book about open science, “Reinventing Discovery”, to be published early in 2011.

At first appearance, the paper which follows this essay appears to be a typical mathematical paper. It poses and partially answers several combinatorial questions, and follows the standard forms of mathematical discourse, with theorems, proofs, conjectures, and so on. Appearances are deceiving, however, for the paper has an unusual origin, a clue to which is in the name of the author, one D. H. J. Polymath. Behind this unusual name is a bold experiment in how mathematics is done. This experiment was initiated in January of 2009 by W. Timothy Gowers, and was an experiment in what Gowers termed “massively collaborative mathematics”. The idea, in brief, was to attempt to solve a mathematical research problem working entirely in the open, using Gowers’s blog as a medium for mathematical collaboration. The hope was that a large number of mathematicians would contribute, and that their collective intelligence would make easy work of what would ordinarily be a difficult problem. Gowers dubbed the project the “Polymath Project”. In this essay I describe how the Polymath Project proceeded, and reflect on similarities to online collaborations in the open source and open science communities. Although I followed the Polymath Project closely, my background is in theoretical physics, not combinatorics, and so I did not participate directly in the mathematical discussions. The perspective is that of an interested outsider, one whose main creative interests are in open science and collective intelligence.

Gowers began the Polymath Project with a description of the problem to be attacked (see below), a list of rules of collaboration, and a list of 38 brief observations he’d made about the problem, intended to serve as starting inspiration for discussion. At that point, on February 1, 2009, other people were invited to contribute their thoughts on the problem. Anyone with an interest and an internet connection could follow along and, if they wished, contribute their ideas in the comment section of Gowers’s blog. In just the first 24 hours after Gowers opened his blog up for discussion, six people offered 24 comments. In a sign of things to come, those contributors came from four countries on three continents, and included a high-school teacher, a graduate student, and four professors of mathematics. A collaboration was underway, a collaboration which expanded in the weeks that followed to involve more than twenty people.

The problem originally posed by Gowers was to investigate a new approach to a special case of the density Hales-Jewett theorem (DHJ). Let me briefly describe the statement of the theorem, before describing the special case Gowers proposed to attack. Let [tex][k]^n[/tex] be the set of all length [tex]n[/tex] strings over the alphabet [tex]1,2,\ldots,k[/tex]. A combinatorial line is a set of [tex]k[/tex] points in [tex][k]^n[/tex], formed by taking a string with one or more wildcards (“[tex]x[/tex]”) in it, e.g., 14x1xx3, and replacing every wildcard simultaneously by [tex]1[/tex], then [tex]2[/tex], and so on up to [tex]k[/tex]. In the example I’ve given, the resulting combinatorial line is: [tex]\{ 1411113, 1421223, \ldots, 14k1kk3 \}[/tex]. The density Hales-Jewett theorem says that as [tex]n[/tex] becomes large, even very low density subsets of [tex][k]^n[/tex] must contain a combinatorial line. More precisely, let us define the density Hales-Jewett number [tex]c_{n,k}[/tex] to be the size of the largest subset of [tex][k]^n[/tex] which does not contain a combinatorial line. Then the density Hales-Jewett theorem may be stated as:

Theorem (DHJ): [tex]\lim_{n\rightarrow \infty} c_{n,k}/k^n = 0[/tex].
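To make the wildcard-substitution rule concrete, here is a minimal Python sketch (my own illustration, not part of the Polymath paper; the function name is invented for this example) that generates the combinatorial line determined by a template string. For [tex]k=3[/tex] a valid template uses only the symbols 1, 2, 3 and the wildcard x:

```python
def combinatorial_line(template, k):
    """Return the k points of the combinatorial line given by `template`.

    Every wildcard "x" in the template is replaced simultaneously by
    1, then 2, and so on up to k.
    """
    if "x" not in template:
        raise ValueError("a line template needs at least one wildcard")
    return [template.replace("x", str(i)) for i in range(1, k + 1)]

# For k = 3, the template "12x1xx3" generates the line
# {1211113, 1221223, 1231333}:
print(combinatorial_line("12x1xx3", 3))
```

A brute-force search over subsets of [tex][3]^n[/tex] using a generator like this quickly becomes infeasible as [tex]n[/tex] grows, which is one reason that pinning down [tex]c_{n,3}[/tex] even for small [tex]n[/tex] was a substantial undertaking.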

DHJ was originally proved in 1991 by Furstenberg and Katznelson, using techniques from ergodic theory. Gowers proposed to find a combinatorial proof of the [tex]k=3[/tex] case of the theorem, using a strategy that he outlined on his blog. As the Polymath Project progressed, that goal gradually evolved. Four days after Gowers opened his blog up for discussion, Terence Tao used his blog to start a discussion aimed at understanding the behaviour of [tex]c_{n,3}[/tex] for small [tex]n[/tex]. This discussion rapidly gained momentum, and the Polymath Project split into two subprojects, largely carried out, respectively, on Gowers’s blog and Tao’s blog. The first subproject pursued and eventually found an elementary combinatorial proof of the full DHJ theorem. The results of the second subproject are described in the paper which follows, “Density Hales-Jewett and Moser Numbers”. As mentioned, this second subproject began with the goal of understanding the behaviour of [tex]c_{n,3}[/tex] for small [tex]n[/tex]. It gradually broadened to consider several related questions, including the behaviour of [tex]c_{n,k}[/tex] for small [tex]n[/tex] and [tex]k[/tex], as well as the behaviour of the Moser numbers, [tex]c_{n,k}'[/tex], defined to be the size of the largest subset of [tex][k]^n[/tex] which contains no geometric line. As with a combinatorial line, a geometric line is defined by taking a string in [tex][k]^n[/tex] with one or more wildcard characters present. But unlike a combinatorial line, there are two distinct types of wildcards allowed (“[tex]x[/tex]” and “[tex]\overline x[/tex]”), with [tex]x[/tex] taken to vary over the range [tex]1,\ldots,k[/tex], and [tex]\overline x = k+1-x[/tex]. So, for example, [tex]13x\overline x2[/tex] generates the geometric line [tex]\{131k2,132(k-1)2,\ldots,13k12\}[/tex].
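The two wildcard types are just as easy to illustrate in code. In the Python sketch below (again my own illustration, not part of the paper; an uppercase “X” serves as an ASCII stand-in for [tex]\overline x[/tex]), each point of the geometric line is obtained by substituting [tex]x[/tex] and its complement [tex]k+1-x[/tex] simultaneously:

```python
def geometric_line(template, k):
    """Return the k points of the geometric line given by `template`.

    "x" is replaced by each value 1..k in turn, while "X" (standing in
    for x-bar) simultaneously takes the complementary value k + 1 - x.
    """
    points = []
    for x in range(1, k + 1):
        point = template.replace("x", str(x)).replace("X", str(k + 1 - x))
        points.append(point)
    return points

# For k = 3, the template "13xX2" (the text's 13 x x-bar 2)
# generates {13132, 13222, 13312}:
print(geometric_line("13xX2", 3))
```

Since a template containing only [tex]x[/tex] wildcards is also a valid geometric template, every combinatorial line is in particular a geometric line.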

Both subprojects of the Polymath Project progressed quickly. On March 10, Gowers announced that he was confident that the polymaths had found a new combinatorial proof of DHJ. Just 37 days had passed since the collaboration began, and 27 people had contributed approximately 800 mathematical comments, containing 170,000 words. Much work remained to be done, but the original goal had already been surpassed, and this was a major milestone for the first subproject. By contrast, the goals of the second subproject were more open-ended, and no similarly decisive announcement was possible. Work on both continued for months thereafter, gradually shifting to focus on the writeup of results for publication.

Although the Polymath Project is unusual from the perspective of current practice in mathematics, there is another perspective from which it does not appear so unusual. That is the tradition of open source software development in the computer programming community. Perhaps the best known example of open source software is the Linux operating system. Begun by Linus Torvalds in 1991 as a hobby project, Linux has since grown to become one of the world’s most popular operating systems. Although not as widely used in the consumer market as Microsoft Windows, Linux is the primary operating system used on the giant computer clusters at companies such as Google, Yahoo! and Amazon, and also dominates in markets such as the movie industry, where it plays a major role at companies such as Dreamworks and Pixar.

A key feature of Linux is that, unlike proprietary software such as Microsoft Windows, the original source code for the operating system is freely available to be downloaded and modified. In his original message announcing Linux, Torvalds commented that “I’ve enjouyed [sic] doing it, and somebody might enjoy looking at it and even modifying it for their own needs. It is still small enough to understand, use and modify, and I’m looking forward to any comments you might have.” Because he had made the code publicly available, other people could add features if they desired. People began emailing code to Torvalds, who incorporated the changes he liked best into the main Linux code base. A Linux kernel discussion group was set up to co-ordinate work, and the number of people contributing code to Linux gradually increased. By 1994, 80 people were named in the Linux credits file as contributors.

Today, nearly twenty years later, Linux has grown enormously. The kernel of Linux contains 13 million lines of code. On an average day in 2007 and 2008, Linux developers added 4,300 lines of code, deleted 1,800 lines, and modified 1,500 lines. The social processes and tools used to create Linux have also changed enormously. In its early days, Linux used off-the-shelf tools and ad hoc social processes to manage development. But as Linux and the broader open source community have grown, that community has developed increasingly powerful tools to share and integrate code, and to manage discussion of development. They have also evolved increasingly sophisticated social structures to govern the process of large-scale open source development. None of this was anticipated at the outset by Torvalds – in 2003 he said “If someone had told me 12 years ago what would happen, I’d have been flabbergasted” – but instead happened organically.

Linux is just one project in a much broader ecosystem of open source projects. Deshpande and Riehle have conservatively estimated that more than a billion lines of open source software have been written, and more than 300 million lines are being added each year. Many of these are single-person projects, often abandoned soon after being initiated. But there are hundreds and perhaps thousands of projects with many active developers.

Although it began in the programming community, the open source collaboration process can in principle be applied to any digital artifact. It’s possible, for example, for a synthetic biologist to do open source biology, by freely sharing their DNA designs for living things, and then allowing others to contribute back changes that improve upon those designs. It’s possible for an architect to do open source architecture, by sharing design files, and then accepting contributions back from others. And, it’s possible to write an open source encyclopedia, by freely sharing the text of articles, and making it possible for others to contribute back changes. That’s how Wikipedia was written: Wikipedia is an open source project.

The Polymath Project is a natural extension of open source collaboration to mathematics. At first glance it appears to differ in one major way, for in programming the open source process aims to produce an artifact, the source code for the desired software. Similarly, in synthetic biology, architecture and the writing of an encyclopedia the desired end is an artifact of some sort. At least in the early stages of the Polymath Project, there was no obviously analogous artifact. It’s tempting to conclude that the two papers produced by the polymaths play this role, but I don’t think that’s quite right. In mathematics, the desired end isn’t an artifact, it’s mathematical understanding. And the Polymath process was a way of sharing that understanding openly, and gradually improving it through the contributions of many people.

The Polymath Project’s open approach to collaboration is part of a broader movement toward open science. Other prominent examples include the human genome project and the Sloan Digital Sky Survey, which use the internet to openly share data with the entire scientific community. This enables other scientists to find ingenious ways of reusing that data, often posing and answering questions radically different to those that motivated the people who originally took the data.

An example which gives the flavour of this reuse is the recent work by Boroson and Lauer, who used a computer algorithm to search through the spectra of 17,000 quasars from the Sloan Digital Sky Survey, looking for a subtle signature that they believed would indicate a pair of orbiting black holes. The result was the discovery of a candidate quasar containing a pair of supermassive black holes, 20 million and 800 million times the mass of the sun, respectively, and a third of a light year apart, orbiting one another roughly once every 100 years. This is just one of more than 3,000 papers to have cited the Sloan data, most of those papers coming from outside the Sloan collaboration.

People practicing open notebook science have carried this open data approach to its logical conclusion, sharing their entire laboratory record in real time. The Polymath Project and the open data and open notebook projects are all examples of scientists sharing information which, historically, has not been openly available, whether it be raw experimental data, observations made in a laboratory notebook, or ideas for the solution of a mathematical problem.

There is, however, a historical parallel to the early days of modern science. For example, when Galileo first observed what would later be recognized as Saturn’s rings, he sent an anagram to the astronomer Kepler so that if Kepler (or anyone else) later made the same discovery, Galileo could disclose the anagram and claim the credit. Such secretive behaviour was common at the time, and other scientists such as Huygens and Hooke also used devices such as anagrams to “publish” their discoveries. Many scientists waited decades before genuine publication, if they published at all. What changed this situation – the first open science revolution – was the gradual establishment of a link between the act of publishing a scientific discovery and the scientist’s prospects for employment. This establishment of scientific papers as a reputational currency gave scientists an incentive to share their knowledge. Today, we take this reputational currency for granted, yet it was painstakingly developed over a period of many decades in the 17th and 18th centuries. During that time community norms around authorship, citation, and attribution were slowly worked out by the scientific community.

A similar process is beginning today. Will pseudonyms such as D. H. J. Polymath become a commonplace? How should young scientists report their role in such collaborations, for purposes of job and grant applications? How should new types of scientific contribution – contributions such as data or blog comments or lab notebook entries – be valued by other scientists? All these questions and many more will need answers, if we are to take full advantage of the potential of new ways of working together to generate knowledge.

15 comments

  1. It was a great article to read. Though I am a strong supporter of open source ideas and I try to contribute to it in my little ways, I would like to raise one issue with open-source, especially with Linux which I see as a major problem in expanding its reach to developing countries.

    Take India, say, where most research institutes are heavily focussed on the experimental side. Policy-level decisions about what software should be installed are made with a view to which software has the best compatibility with laboratory equipment. Here Windows is a clear winner, since almost all machine interfaces are made for it.

    And experimental physicists (at least the ones I know) are completely averse to spending time in developing interfaces for Linux for all the hundreds of equipments that they use.

    Hence Linux and open source remain confined to small communities, like researchers in mathematics, theoretical physics and theoretical computer science.

    I would like to see your take on this issue.

  2. Thanks for the comment, Anirbit.

    This issue seems to arise everywhere – I’ve worked in places in the US and Canada where there is a “Windows-only” policy. (Interestingly, I spent some time visiting Microsoft Research, and they do NOT have such a policy, offering what seemed like excellent support for Mac, Linux, and other systems.)

    There are two observations I have about this issue.

    First is that it obviously hurts the researchers involved, who may end up using substandard tools as a result. And so a long-run effect is to make the institutions involved less successful than they might otherwise be.

    Second, and perhaps more importantly, a great deal of computing work seems to be migrating to the cloud, making one’s desktop environment less important. I’ve used Amazon EC2 quite a bit, and there I simply use my development environment of choice (Ubuntu). And, of course, there are services such as Google Docs and gmail, which make the choice of local platform irrelevant.

    Still, I agree with you, this is a big challenge facing open source.

  3. Hi Michael,

    Three small pieces of feedback.

    1) You write: “Perhaps the best known example of open source software is the Linux operating system”. I would probably go for the more accurate but wordy “the Linux operating system kernel” (especially since you speak about Linus’s project, and not about whole operating systems built around the Linux kernel – like Debian GNU/Linux or Google Android or some of the embedded Linux/Busybox stuff). Alternatively you can just say “Linux”.

    2) I think for a long time Yahoo ran primarily on FreeBSD, not Linux, but I am not sure if it has changed.

    http://lists.freebsd.org/pipermail/freebsd-questions/2005-February/078297.html

    3) Maybe you will like this quote

    http://www.technollama.co.uk/socrates-and-free-software

    Best
    Anders

  4. Hi Anders – thanks for the feedback.

    I considered your point 1 while I was originally drafting the piece, and ended up deciding that for this piece I wouldn’t make the distinction. In a longer piece, or if I thought the audience was likely to be more oriented toward programming, I’d make the distinction, but here it seemed out of place. (The problem is this: many readers are unlikely to know what a kernel is. And I think that in a broad introduction like this, explaining it is an unnecessary and unenlightening detail.)

    Thanks for the tip on point 2. I’ve forgotten my original source for Yahoo-on-Linux (and, for that matter, for Amazon on Linux). I’ll look into it, and see what I can find.

    The Socrates quote is amusing.

  5. Your article seems just fine, but it’s kind of boring for somebody who has already read your other writings or your joint work with Gowers in Nature.

    One problem is that your “proof by analogy” with Linux and the Human Genome Project is far from rigorous and not at all convincing. Also, your language at times seems all too reminiscent of the Web 2.0/commons-based peer production evangelism (Benkler-Lessig-Godin-Shirky) — which might be great for those who love reading such stuff but is off-putting to the rest.

    Such an analogy is great for motivation when there aren’t too many live examples of polymath projects to point to. But it seems a distraction considering that (i) it’s over a year since the first polymath project (ii) many other polymath projects have been tried since then (though none as successful so far as the original one).

    Talking about the extent to which the success, problems, and other features of the first polymath project were replicated in later ones might be more illuminative (as opposed to inspirational) for those who are already thorough with the basics.

    If your goal is not the polymath project but peer production in general, perhaps you might consider renaming your article to reflect this.

  6. Vipul Naik: “Talking about the extent to which the success, problems, and other features of the first polymath project were replicated in later ones might be… illuminative”

    I’ll add a few sentences on this.

    On your comment about proof-by-analogy, to clarify: it’s not intended as a proof by analogy, but rather (a) as a description of some of the broader context in which Polymath arose, and (b) as an imperfect analogy. As I say in the article, there are real differences in the object of open source programming and the Polymath process that make the analogy only imperfect, albeit still stimulating.

  7. What I wonder about is how we can resolve the apparent disconnect between open source science, and the credit distribution system.

    I ask: is it possible to envision a world where most (if not all) of science is done in this open, collaborative fashion? It seems to me that very few scientists would actually want to take part in such a scheme. In a world where scientists are human and have to worry about livelihoods, it becomes difficult to judge the extent of a scientist’s contribution to an open source project.

    Also, there is the problem of incentive. Coming up with the right question/problem to work on is usually the most difficult and most valuable aspect of research. However, that in itself is rarely praiseworthy. Rather, delivering a completed project (question and answer) is usually the praiseworthy thing. In an open source science system, one would have very little time to work on one’s own questions before some competent researcher came along and solved them. That seems difficult to swallow.

    Does this indicate that our norms regarding credit in science should be changed (beyond the issues they already have)?

  8. Hi Henry – Thanks for the comment. What you say is spot on: changing norms and incentives is absolutely crucial. I wrote an essay a couple of years ago on this issue, which you may find of interest: http://michaelnielsen.org/blog/?p=448 In a more developed form how to achieve these changes is also a major theme of my book, which will be published early next year.

  9. Hi Michael – Please use the name “GNU/Linux” when you refer to the operating system and not just to the kernel. I know it sounds pedantic, but it is politically important.

    For the readers who understand Italian, Luca Trevisan is going to talk about the topic of this post in a free ‘webcast’ on May 18. More details on his blog: http://lucatrevisan.wordpress.com/2010/05/10/mathematical-research-and-new-collaboration-tools/

    Anirbit – I don’t think we can blame open source ideas for the failure of GNU/Linux in interfacing with hardware devices. Both practically and theoretically, having open source code can only help in developing drivers for new devices. Instead, the problem is that the hardware producers are still too tied to proprietary software companies. Also, patents on hardware and technical standards imposed by the companies don’t help the open source community. More important, open source developers pay a lot of attention to making their systems run on almost all platforms, including most older models. For example, Debian supports 12 (twelve!) different architectures, which is very impressive in my opinion. Windows runs only on x86.

    Michael – I am sorry to hear that I will have to wait until the next year to read Reinventing Discovery. Good luck with writing!

  10. Michael, you and your readers may be interested in my new book, The New Polymath.

    While it catalogs several modern-day tech/science individual polymaths and several from history, it is about the “enterprise polymath” – those who are taking 3, 5, or 10 strands of infotech, biotech, cleantech, and nanotech and creating brand-new solutions, leveraging multiple talent pools to solve some of our Grand Challenges. Over 150 innovative companies, products, people, and places are cataloged.

    The book website is at http://www.thenewpolymath.com, and Amazon is starting to ship in North America, with Europe to follow in early July. Thanks

  11. A different spin on open source mathematics is Sage: http://www.sagemath.org/

    It is an attempt to create an open source mathematics software framework.

    Why bother? Won’t Mathematica and others do just as well?

    Well, beyond “free”, it provides a form of peer review! Another way to look at it: would you trust a theorem you could not read? With commercial math software, you cannot look at the source, and thus cannot be certain of its assumptions and scope.

    A similar thing happened in computer security. The community decided ALL the software used in crypto systems needed to be open, so that it could be examined for vulnerabilities. This turned out to be quite important: several closed crypto systems were later found to be flawed.

Comments are closed.