March 2009 – Michael Nielsen

Conscious modularity and scaling open collaboration

I’ve recently been reviewing the history of open source software, and one thing I’ve been struck by is the enormous effort many open source projects put it into making their development modular. They do this so work can be divided up, making it easier to scale the collaboration, and so get the benefits of diverse expertise and more aggregate effort.

I’m struck by this because I’ve sometimes heard sceptics of open science assert that software has a natural modularity which makes it easy to scale open source software projects, but that difficult science problems often have less natural modularity, and this makes it unlikely that open science will scale.

It looks to me like what’s really going on is that the open sourcers have adopted a posture of conscious modularity. They’re certainly not relying on any sort of natural modularity, but are instead working hard to achieve and preserve a modular structure. Here are three striking examples:

The open source Apache webserver software was originally a fork of a public domain webserver developed by the US National Center for Supercomputing Applications (NCSA). The NCSA project was largely abandoned in 1994, and the group that became Apache took over. It quickly became apparent that the old code base was far too monolithic for a distributed effort, and the code base was completely redesigned and overhauled to make it modular.
In September 1998 and June 2002 crises arose in Linux because of community unhappiness at the slow rate new code contributions were being accepted into the kernel. In some cases contributions from major contributors were being ignored completely. The problem in both 1998 and 2002 was that an overloaded Linus Torvalds was becoming a single point of failure. The situation was famously summed up in 1998 by Linux developer Larry McVoy, who said simply “Linus doesn’t scale”. This was a phrase repeated in a 2002 call-to-arms by Linux developer Rob Landley. The resolution in both cases was major re-organization of the project that allowed tasks formerly managed by Torvalds to be split up among the Linux community. In 2002, for instance, Linux switched to an entirely new way of managing code, using a package called BitKeeper, designed in part to make modular development easier.
One of the Mozilla projects is an issue tracking system (bugzilla), designed to make modular development easy, and which Mozilla uses to organize development of the Firefox web browswer. Developing bugzilla is a considerable overhead for Mozilla, but it’s worth it to keep development modular.

The right lesson to learn from open source software, I think, is that it may be darned hard to achieve modularity in software development, but it can be worth it to reap the benefits of large-scale collaboration. Some parts of science may not be “naturally” modular, but that doesn’t mean they can’t be made modular with conscious effort on the part of scientists. It’s a problem to be solved, not to give up on.

First Principles

How would you use 100 million dollars if someone asked you to set up and run an Institute for Theoretical Physics?Â My friend Howard Burton has written a memoir of his 8 years as the founding Executive Director of the Perimeter Institute, taking it from conception to being one of the world’s best known institutes for theoretical physics. I’ve heard many people theorize about how a scientific institution ideally should be organized (“consider a spherical physicist…”), and I’ve contributed more than a few thoughts of my own to such discussions. What I really liked about this book, and what gives it a unique perspective, is that it’s from someone who was actually in the hot seat, from the get-go.

Biweekly links for 03/30/2009

Reflections on Trusting Trust: Ken Thompson
- One of my favourite essays. Written by the co-creator of Unix, Ken Thompson, this is the text of his Turing Award lecture. In it, he explains a beautiful hack to put a backdoor into the Unix login program, a backdoro that would let him login to any Unix system. I doubt his superiors at AT&T thought it was all that funny.
Python pdfminer
- “PDFMiner is a suite of programs that aims to help extracting or analyzing text data from PDF documents. Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other layout information such as font size or font name, which could be useful for analyzing the document. It can be also used as a basis for a full-fledged PDF interpreter.”
Science in the open Â» Capturing the record of research process – Part I
- Thoughtful discussion of the difficulty of capturing _process_, not just facts, or ideas, or relationships. Ties in with many other important issues in open science: the importance of the context in which data was taken, and how tacit knowledge affects that data; how to represent process online in a way that is useful.
Galaxy Zoo Blog Â» This is my first timeâ€¦.
- “[I]f one wants to use a telescope to study something in the sky, one must write a proposal 6 months in advance, submit it for scrutiny, and then await your allocation of time on a telescope. The process can take nearly a year and then after your night staring at the stars, it can take a further year to analyse the data (assuming it wasnâ€™t cloudy!). Only then are we ready to ask questions of the data and test our observations against our original hypothesis written two years ago in a haste!
  With the Zoo, itâ€™s all a little too quick! For example, I can ask the question â€œhow many galaxies have a bar through the middle of themâ€ and typically I would embark on a career-long quest to answer this fundamental question. I may even recruit some poor graduate student to eyeball 50,000 galaxies to answer the question (like they did with Kevin!). But now, two days after the launch, we already have the data to address this question and itâ€™s a little too fast for an old-timer like me. “
Galaxy Zoo Blog Â» Lost in Space â€¦ and more (An Italian in the zoo)
- The story of how someone became addicted to Galaxy Zoo.
[0803.3247] Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey
- Analyses the rotation axes of spiral galaxies, and claims that, contra Longo, there is no statistically significant anisotropy.
[0812.3437] Does the Universe Have a Handedness?
- The latest in Longo’s series of papers suggesting an anisotropy in the rotation axes of spiral galaxies. Big news if true, but a team using Galaxy Zoo’s data suggests that Longo is mistaken.
Galaxy Zoo Blog Â» Everything youâ€™ve ever wanted to know about Blue Early-type galaxies!
Academic publishing house Springer put up for sale in teeth of recession | guardian.co.uk
Jacques Distler on Instiki
- I’ve only fiddled with Instiki, but it looks like Jacques has done a great job making this a truly math-aware wiki.
Yuja Wang plays the Flight of the Bumble-Bee
- Don’t try this at home.
ErdÃ¶s and the Quantum Method Â« GÃ¶delâ€™s Lost Letter and P=NP
Daring Fireball: Obsession Times Voice
- “My muse for the session was this quote from Walt Disney: â€œWe donâ€™t make movies to make money, we make money to make more movies.â€ To me, thatâ€™s it. Thatâ€™s the thing… Obsession times voice is a pretty good stab at a simple formula for doing it right.”
Machine Learning (Theory) Â» Machine Learning is too easy
Whitepaper: Estimating the Total Development Cost of a Linux Distribution Â· October 2008
- “we estimate that it would take approximately $10.8 billion to build the Fedora 9 distribution in todayâ€™s dollars, with todayâ€™s software development costs. Additionally, it would take $1.4 billion to develop the Linux kernel alone. This paper outlines our technique and highlights the latest costs of developing Linux. “
Cheating Goes Global as Essay Mills Multiply – Chronicle.com
- “One customer, for example, identifies himself as a Ph.D. student in aerospace engineering at the Massachusetts Institute of Technology. He or she (there is no name on the order) is interested in purchasing a 200-page dissertation. The student writes that the dissertation must be “well-researched” and includes format requirements and a general outline. Attached to the order is a one-page description of Ph.D. requirements taken directly from MIT’s Web site. The student also suggests areas of emphasis like “static and dynamic stability of aircraft controls.””

Click here for all of my del.icio.us bookmarks.

On scaling up the Polymath project

Tim Gowers has an interesting post on the problem of scaling up the Polymath project to involve more contributors. Here are a few comments on the start of Tim’s post. I’ll return to the remainder of the post tomorrow:

As I have already commented, the outcome of the Polymath experiment differed in one important respect from what I had envisaged: though it was larger than most mathematical collaborations, it could not really be described as massive. However, I havenâ€™t given up all hope of a much larger collaboration, and in this post I want to think about ways that that might be achieved.

As discussed in my earlier post, I think part of the reason for the limited size was the short time-frame of the project. The history of open source software suggests that building a large community usually takes considerably more time than Polymath had available – Polymath’s community of contributors likely grew faster than open source projects like Linux and Wikipedia. In that sense, Polymath’s limited scale may have been in part a consequence of its own rapid success.

With that said, it’s not clear that the Polymath community could have scaled up much further, even had it taken much longer for the problem to be solved, without significant changes to the collaborative design. The trouble with scaling conversation is that as the number of people participating goes up, the effort required to track the conversation also goes up. The result is that beyond a certain point, participants are no longer working on the problem at hand, but instead simply trying to follow the conversation (c.f. Brooks’ law). My guess is that Polymath was near that limit, and, crucially, was beyond that limit for some people who would otherwise like to have been involved. The only way to avoid this problem is to introduce new social and technical means for structuring the conversation, limiting the amount of attention participants need to pay to each other, and so increasing the scale at which conversation can take place. The trick is to do this without simultaneously destroying the effectiveness of the medium as a means of problem-solving.

(As an aside, it’s interesting to think about what properties of a technological platform make it easy to rapidly assemble and grow communities. I’ve noticed, for example, that the communities in FriendFeed rooms can grow incredibly rapidly, under the right circumstances, and this growth seems to be a result of some very particular and clever features of the way information is propagated in FriendFeed. But that’s a discussion for another day.)

First, let me say what I think is the main rather general reason for the failure of Polymath1 to be genuinely massive. I had hoped that it would be possible for many people to make small contributions, but what I had not properly thought through was the fact that even to make a small contribution one must understand the big picture. Or so it seems: that is a question I would like to consider here.

One thing that is undeniable is that it was necessary to have a good grasp of the big picture to contribute to Polymath1. But was that an essential aspect of any large mathematical collaboration, or was it just a consequence of the particular way that Polymath1 was organized? To make this question more precise, I would like to make a comparison with the production of open-source software (which was of course one of the inspirations for the Polymath idea). There, it seems, it is possible to have a large-scale collaboration in which many of the collaborators work on very small bits of code that get absorbed into a much larger piece of software. Now it has often struck me that producing an elaborate mathematical proof is rather like producing a complex piece of software (not that I have any experience of the latter): in both cases there is a clearly defined goal (in one case, to prove a theorem, and in the other, to produce a program that will perform a certain task); in both cases this is achieved by means of a sequence of strings written in a formal language (steps of the proof, or lines of code) that have to obey certain rules; in both cases the main product often splits into smaller parts (lemmas, subroutines) that can be treated as black boxes, and so on.

This makes me want to ask what it is that the producers of open software do that we did not manage to do.

Here’s two immediate thoughts inspired by that question, both of which are ways large open-source projects (a) reduce barriers to entry, and (b) limit the amount of attention required from potential contributors.

Clear separation of what is known from how it is known: In some sense, to get involved in an open source project, all you need do is understand the current source code. (In many projects, the code is modular, which means you may only need to understand a small part of the code.) You don’t need to understand all the previous versions of the code, or read through all the previous discussion that led to those versions. By contrast, it was, I think, somewhat difficult to follow the Polymath project without also following a considerable fraction of the discussion along the way.

Bugtracking: One of the common answers to the question “How can I get involved in open source?” is “Submit a bug report to your favourite open source project’s bugtracking system”. The next step up the ladder is: “Fix a bug in the bugtracking system”. Bugtracking systems are a great way of providing an entry point for new contributors, because they narrow the scope of problems down, limiting what a new contributor needs to learn, and how many other contributors they need to pay attention to. Of course, many bugs will be beyond a beginning contributor to fix. But it’s easy to browse through the bug database to find something within your ability to solve. While I don’t think bugtracking is quite the right model for doing mathematics, it’s possible that a similar system for managing problems of limited scope may help in projects like Polymath.

More tomorrow.

Implementing Statistical Machine Translation Using MapReduce

This post is part of my series on the Google Technology Stack. It’s something of a digression, and is optional (but fun, and worth reading!) We’ll return to the main line of development in the next post.

This post is a followup to an earlier post about the beautiful subject of statistical machine translation. In the earlier post I described how to build a statistical model that can be used to automatically translate texts from one language to another. The ideas are beautiful, but require a lot of computing power – the more computing power, the better the translation. In this post I describe a way of implementing statistical machine translation in parallel on a large cluster of machines. The post assumes: (1) that you’re comfortable with my previous post on statistical machine translation; and (2) that you’re comfortable with the MapReduce framework for distributed computing.

Before going any further, a warning: I haven’t implemented the system described here. I’ve no idea if Google does machine translation with a MapReduce implementiation, although Google translate definitely uses statistical machine translation. It’s possible there’s a bottleneck in my implementation that I’m not noticing. Ideally, we should carefully analyse the resources used in my proposed implementation, and, if that reveals no problems, build a working system, and test to see how well it scales in practice. I haven’t done any of that. Still, putting statistical machine translation into the MapReduce framework is a great way of deepening our understanding of both sets of ideas, and I’m pretty confident that many of these ideas could be used in a production machine translation system.

With that said, it should be noted that the simple statistical machine translation model described in the previous post can actually be implemented on a single machine. However, as discussed in the earlier post, it’s easy to tweak that model so it performs far better, at the cost of making it too complex to run on a single machine. The ideas described in this post are easily adapted to many of those tweaked models of machine translation.

One final point before we jump into the meat of the post. Earlier posts in this series have been pretty detailed. In this post I’ve glossed over some details so as not to obscure the forest with too many tress – there’s only so many Map and MapReduce jobs you can read about before your eyes start to glaze over. But the missing details should be easily filled in by the reader, and I’ve tried to indicate where there are details which need to be filled in.

Overview: what our system needs to do

Recall the basic elements of statistical machine translation. It’s split into two main phases. The first phase is the training phase, in which a statistical model of translation is built, using a corpus of texts in both the source and target language (French and English, for the sake of concreteness). The training phase is itself split into three parts: (i) document collection, where we assemble the corpus of texts from which the statistical model will be inferred; (ii) building the language model for the target language; and (iii) building the translation model from the target language to the source language (yes, that’s the right way round, see the earlier post!) The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. It’s this phase which is actually used directly by the end user – the training phase all happens offline, beforehand.

Altogether, that makes four separate procedures we need to implement. We’ll go through them one by one in the remainder of this post. Before doing that, though, there is one idea implicit in my MapReduce post that needs to be made more explicit. This is the idea of a distributed hash of key-value pairs. Given a set of keys and corresponding values [tex](k_1:v_1),(k_2:v_2),\ldots[/tex], a crucial part of MapReduce is distributing the key-value pairs evenly over all computers in the cluster.

Here’s a simple way this may be achieved. Suppose there are [tex]n[/tex] computers in the cluster. Number them [tex]0,\ldots,n-1[/tex]. Let [tex]\#(\cdot)[/tex] be a hash function. An unfortunate fact is that this use of the term “hash” is completely different than the useage of “hash” to denote a set of key-value pairs. In practice, this isn’t as confusing as you might think, but you should note the difference. In any case, the way we set up our distributed hash is to store the key-value pair [tex](k:v)[/tex] on computer number [tex]\#(k) \mod n[/tex]. Because the hash function distributes key values in a uniform way, the result is to spread the key-value pairs through the cluster in a uniform way.

In a practical implementation, this approach to storing a distributed hash has some drawbacks, which we’ll come back to in later posts about the Google File System and BigTable. But for now, you can think of a distributed hash in this way.

Document collection

There are many possible ways we could be presented with the documents we’ll use to build our statistical model of translation. To make things concrete, and to illustrate the general ideas used to manage the distribution of documents across the cluster, let’s imagine we’re presented with them as a file containing several million URLs, with each URL pointing to an English language text which we’ll use to build a language model. Alternately, the file might contain URLs pointing to pairs of translated texts, in English and in French, which we’ll use to build the translation model. The same basic ideas apply in both cases, so to be specific, I’ll suppose the URLs point to English-only texts.

We use the URLs as keys in a distributed hash. Assuming that they start on a single machine, we first need to distribute the keys across the cluster, and then initialize the corresponding values to empty. The result is a distributed hash with key-value pairs (URL:empty). We now use that distributed hash as input to a Map job, which applies a mapper to (URL:empty), downloading the webpage for the URL, and storing the resulting document as the value corresponding to the key. The resulting distributed hash has key-value pairs (URL:document). We’ll use that distributed hash in constructing the language model, as described below.

Language model

Recall from the previous post that in the bigram model we compute the probability of a sentence as:

[tex] \mbox{pr}(e_1,\ldots,e_m) = \mbox{pr}(e_m|e_{m-1}) \mbox{pr}(e_{m-1}|e_{m-2}) \ldots \mbox{pr}(e_2|e_1) \mbox{pr}(e_1). [/tex]

To compute this probability, we need estimates for the single-xword probabilities [tex]\mbox{pr}(e_1)[/tex], and for the bigram conditional probabilities, [tex]\mbox{pr}(e_2|e_1)[/tex].

To estimate the single-word probabilities [tex]\mbox{pr}(e)[/tex] we use two MapReduce jobs. The first MapReduce job counts the total number of words [tex]N[/tex] in all the documents. The input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#(document)), where the intermediate key [tex]0[/tex] is simply an arbitrarily chosen constant, and #(document) is the total number of words in the document. The reducer simply sums over all these intermediate values, giving [tex]N[/tex] as output.

The second MapReduce job is a simple variant on the wordcount example used in the earlier post. It takes as input the hash with entries (URL:document). The mapper parses the document, and each time it comes to a word [tex]e[/tex] emits the intermediate key-value pair [tex](e:1)[/tex], indicating one occurrence of [tex]e[/tex]. The reducers sum over all those counts of [tex]1[/tex], and divide by the total number of words [tex]N[/tex], giving as output a hash with entries [tex](e:\mbox{pr}(e))[/tex], where [tex]\mbox{pr}(e)[/tex] is the relative frequency of [tex]e[/tex] in the document collection.

To estimate [tex]\mbox{pr}(e_2|e_1)[/tex] we also use two MapReduce jobs. The first MapReduce job compues the total number of bigrams, [tex]N'[/tex]. Once again, the input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#'(document)), where the intermediate key [tex]0[/tex] is again an arbitrary constant, and #'(document) is the total number of bigrams in the document. The reducer sums over all these values, giving [tex]N'[/tex] as output.

The second MapReduce job computes the frequency of each bigram, and divides by [tex]N'[/tex] to give [tex]\mbox{pr}(e_2|e_1)[/tex]. Again, the input to this job is the hash with entries (URL:document). The mapper emits an intermediate key-value pair [tex]((e_2|e_1):1)[/tex] for each bigram it finds in the document, possibly with some repetitions. Note that the intermediate keys are of the form [tex](e_2|e_1)[/tex], simply for consistency with our notation elswhere. The reducer sums over the intermediate values and divides by [tex]N'[/tex] to get the relative frequency of occurrence, so the output hash has entries [tex]((e_e|e_1):\mbox{pr}(e_2|e_1))[/tex], as was desired.

At this point we’ve constructed distributed hashes with key-value pairs [tex](e:\mbox{pr}(e))[/tex] and [tex]((e_2|e_1):\mbox{pr}(e_2|e_1))[/tex], effectively encoding our language model. In particular, access to these distributed hashes makes it is easy to compute [tex]\mbox{pr}(E)[/tex] for an entire sentence [tex]E[/tex].

Exercises

In the earlier post about statistical machine translation I mentioned that problems can arise in estimating bigram probabilities, because bigrams not in the corpus have their probability systematically underestimated. How might you address this problem on a distributed cluster?

Translation model

Recall from the previous post that our translation model was determined by the following sets of parameters:

The fertility probability [tex]\mbox{pr}(n|e)[/tex], the probability that the English word [tex]e[/tex] has fertility [tex]n[/tex].
The distortion probability [tex]\mbox{pr}(t|s,l)[/tex], which is the probability that an English word at position [tex]s[/tex] corresponds to a French word at position [tex]t[/tex] in a French sentence of length [tex]l[/tex].
The translation probability [tex]\mbox{pr}(f|e)[/tex], one for each French word [tex]f[/tex] and English word [tex]e[/tex].

To determine these parameters we simply start with guesses, and then use an iterative procedure to gradually get better estimates, stopping when the iterative procedure converges on a relatively stable set of parameters.

To implement this strategy, we start by setting up three hashes to store estimates of the parameters. Let’s start with the fertility probabilities. As our starting guess, we’ll assume:

[tex] \mbox{pr}(n|e) \approx \frac{1}{2^{n+1}}. [/tex]

I’ve put the approximation sign in because we’ll actually truncate the distribution at some high value for the fertility, say [tex]n = 10[/tex], so the hash is of a manageable size. Obviously, these probabilities should then be modified slightly, to be properly normalized, but that’s a detail I won’t worrry about.

We’ll set up a hash with keys [tex](n|e)[/tex] to store our initial guesses for the fertility probabilities. This can be done in many ways. One way is to use a single MapReduce job which takes as input the hash storing the corpus of English-French texts and produces as output a hash with key-value pairs [tex]((0|e):1/2),((1|e):1/4),((2|e):1/8),\ldots[/tex], where [tex]e[/tex] ranges over the set of English words in our corpus.

At this point, you’ve probably noticed that I’m assuming you’re getting comfortable enough with Map and MapReduce that I can just describing what the jobs do, and leave it to you to fill in the details of how they work. This will continue.

Next, we set up a distributed hash for the distortion probabilities [tex]\mbox{pr}(t|s,l)[/tex]. To do this, we first set up an empty hash (i.e., one with empty values) with keys [tex](t|s,l)[/tex], where [tex]t \leq l \leq 40[/tex], and [tex]s \leq 50[/tex], say. As before, these truncations are rather arbitrary, but their purpose is to limit the size of the hash. Now run a Map job with input the empty hash, [tex]((t|s,l):\mbox{empty})[/tex], and where the mapper simply outputs the value [tex]1/l[/tex], giving us a hash [tex]((t|s,l):1/l)[/tex], i.e., a uniform distribution for target positions.

In the last paragraph you might wonder why we don’t start by initializing the values in the hash as [tex]1/l[/tex], rather than using a Map job to do it. The model I have in mind is that the setup of the initial empty hash is co-ordinated from a single centralized server, with the values automatically being set to a default of empty. We could instead set the values on the centralized server as well, before distribution, but the problem is that this might cause the central server to become a bottleneck. That probably won’t happen here – computing and distributing [tex]1/l[/tex] is pretty easy – but it could easily happen if the computation or values were more complex. As a matter of general principle, it’s better to compute the values using a Map job which can be distributed across the cluster.

The third and final distributed hash we need to set up has keys [tex](f|e)[/tex] and values the translation probabilities [tex]\mbox{pr}(f|e)[/tex]. As our initial guess for these probabilities, we’ll use our corpus of starting documents, and set [tex]\mbox{pr}(f|e)[/tex] to be the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] occurs. We do this using two MapReduce jobs. The first MapReduce job constructs a hash with entries [tex](e: \mbox{number of sentences with } e \mbox{ in them})[/tex], over all words [tex]e[/tex] in the corpus. We use this hash and a second MapReduce job to construct a hash with keys [tex](f|e)[/tex] and values the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] is in the starting text.

We’ve now set up the three hashes describing our starting guess for the parameters of our translation model. Next, we need to understand how to iteratively update those guesses, hopefully improving their quality.

The first step in the iterative update is to use Map to process our translation corpus, producing as output a hash with keys [tex](E,F)[/tex], where [tex]E[/tex] and [tex]F[/tex] are corresponding English and French sentences in the corpus. To compute the corresponding value, we consider all possible alignments between the sentence pair, and compute probabilities for those alignments, using our initial estimate of the parameters above. The alignment [tex]a[/tex] with the highest probability is the corresponding value, so we are producing a hash with pairs [tex]((E,F):a)[/tex].

The second step in the iterative update is to use the sentence-pair hash [tex]((E,F):a)[/tex] and three MapReduce jobs to compute the updated estimates for the fertility, distortion and translation probabilities, [tex]\mbox{pr}(n|e), \mbox{pr}(t|s,l)[/tex] and [tex]\mbox{pr}(f|e)[/tex]. This completes the iterative update.

We now repeat this iterative update process many times. I’ll leave it as an exercise to figure out how you could use MapReduce to determine how much the parameters of the model are changing, per iteration, and what would be a reasonable criterion for terminating the iterations.

The translation step: heuristic search

The translation step is the final stage. It’s the stage at which we take our translation model, and apply it to translate a French document into English, sentence by sentence. Recall from the previous post that to translate the French sentence [tex]F[/tex] we search for the English sentence [tex]E[/tex] and alignment [tex]a[/tex] which maximizes the product

[tex] \mbox{pr}(F,a|E) \mbox{pr}(E). [/tex]

The idea of the heuristic search is to consider partial sentences and partial alignments, maintaining a stack of particularly promising candidates. We’ll set the stack size to be [tex]S[/tex], let’s say [tex]S = 10^6[/tex], so we’ll never consider more than a million possible candidate alignments and sentences.

The procedure involves three steps: (1) Construct all possible single-word extensions of existing partial sentences and partial alignments, and compute the corresponding product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]; (2) Choose the [tex]S[/tex] most likely of those, and discard the rest; and (3) Terminate when we have a complete alignment which is substantially more likely than any of the remaining partial alignments, otherwise repeat steps (1) and (2).

Step (1) can easily be done using a Map job. Step (2) can be done using multiple MapReduce steps. This is essentially a binary search to find the top [tex]S[/tex] values for the product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]. Step (3) can be done using the same ideas as step (2).

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

Interesting event in Waterloo

This might be of interest to people in Waterloo and nearby. It’s a day of talks about “The Economic Crisis and its Implications for The Science of Economics”. It’ll be held May 1, 2009, from 9am-5pm at the Perimeter Institute; registration is apparently required. See the website for more. The lineup of speakers is interesting:
Â

Nouriel Roubini, New York University
Emanuel Derman, Columbia University
Brian Arthur, Santa Fe Institute and PARC
Andrew Lo, MIT
Richard Alexander, University of Michigan
Eric Weinstein, Natron Group

Biweekly links for 03/23/2009

Introduction to “Cloud Computing” – Course at UMD
End of Free Access Â« The Scholarly Kitchen
- Here’s a quote from the article, but see also the first comment, from the Journal’s editor: “A venerable Open Access title in medical research has instituted a subscription model for its online journal.
  The Journal of Clinical Investigation (JCI), a prestigious title in continuous publication since 1924, implemented a subscription model beginning with its January 2009 issue. Original articles, however, continue to be freely available.”
Command-line Fu < The best UNIX commands on the web
- Brilliant.
Business Week: Will Google’s Purity Pay Off?
- Entertaining Business Week article from 2000. The gist is that they need to use banner ads to survive: “But how will Google ever make money? There’s the rub. The company’s adamant refusal to use banner or other graphical ads eliminates what is the most lucrative income stream for rival search engines. Although Google does have other revenue sources, such as licensing and text-based advertisements, the privately held company’s business remains limited compared with its competitors’. “
Factoring Could be Easy Â« GÃ¶delâ€™s Lost Letter and P=NP

Click here for all of my del.icio.us bookmarks.

The Polymath project: scope of participation

As I’ve mentioned before, over the past seven weeks mathematician Tim Gowers has been running a remarkable experiment in how mathematics is done, a project he dubbed the Polymath1 project. Using principles similar to those employed in open source programming projects, he used blogs and a wiki to organize an open mathematical collaboration attempting to find a new proof of an important mathematical theorem known as the density Hales-Jewett (DHJ) theorem. The problem was a tough one. Gowers, an accomplished professional mathematician, initially thought that the chances of success were “substantially less than 100%”, even adopting a quite modest criterion for success.

Last week, Gowers announced that the problem was “probably solved”. In fact, if the results hold up, the project has exceeded expectations. The original goal was to find a new proof of an important special case of the DHJ theorem using a particular approach Gowers suggested (or explain why that approach failed). This goal broadened over time, and the project appears to have found a new proof of the full theorem, using an approach different to that Gowers originally proposed. A writeup is in progress. Inspired by the work of Polymath1, mathematician Tim Austin has released a preprint claiming another proof of DHJ, and citing Polymath1 as crucial to his work.

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

The original hope was that the project would be a “massive collaboration”. Let’s suppose we take the number above (23) as representative of the number of people who made notable mathematical contributions, bearing in mind that there are obviously substantial limitations to using the timeline in this way. (The timeline contains some pointers to notable general comments, which I have not included in this count.) It’s certainly true that 23 people is a very large number for a mathematical collaboration – a few days into the project, Tim Gowers remarked that “this process is to normal research as driving is to pushing a car” – but it also falls well short of mass collaborations such as Linux and Wikipedia. Gowers has remarked that “I thought that there would be dozens of contributors, but instead the number settled down to a handful, all of whom I knew personally”.

These numbers take on a different flavour, however, when you note that the number of people involved compares favourably even to very successful open source collaborations, at the seven-week mark. 7 weeks after the inception of Wikipedia it had approximately 150 articles. This is considerably more than the Polymath1 wiki, but keep in mind that (a) the Polymath1 wiki is only a small part of the overall project; and (b) I doubt anyone would disagree that the average quality on the Polymath1 wiki is considerably higher. Similarly, while Linux has now received contributions from several thousand people, it took years to build that community. 6 months after Linus Torvalds first announced Linux there were 196 people on the Linux activists mailing list, but most were merely watching. Many had not even installed Linux (at the time, a tricky operation), much less contributed substantively. I’m willing to bet more than 196 people were following Polymath1.

A great deal can be said about scaling up future projects. I believe this can be done, and that there are potentially substantial further benefits. For now, I’ll just make one remark. Long-lived open-source collaborations sometimes start with narrowly focused goals, but they typically broaden over time, and become more open-ended, allowing the community of contributors to continue to grow over time. That’s certainly true of Linux, whose goal – the construction of an operating system kernel – is extremely broad. At the same time, that broad goal naturally gives rise to many focused and to some extent independent problems, which can be tackled in parallel by the development community. It may be possible to broaden Polymath1’s goals in a natural way at this point, but it seems like an interesting challenge to at the same time retain the sharp problem-oriented focused that characterized the collaboration.

Biweekly links for 03/20/2009

Bad Correspondent: Neal Stephenson
- “The quality of my e-mails and public speaking is, in my view, nowhere near that of my novels. So for me it comes down to the following choice: I can distribute material of bad-to-mediocre quality to a small number of people, or I can distribute material of higher quality to more people. But I can’t do both; the first one obliterates the second.”
Attacked from Within || kuro5hin.org
- An essay about online community. Over-the-top and overall not convincing, but stimulating and insightful in places.
Australian Government adds Wikileaks to banned website list | News | TechRadar UK
Caveat Lector Â» A post-Roach-Motel world
- Dorothea Salo, musing about the future of institutional repositories.
To share or not to share: Publication and quality assurance of research data outputs. A report commissioned by the Research Information Network – ECS EPrints Repository
- “A study on current practices with respect to data creation, use, sharing and publication in eight research disciplines (systems biology, genomics, astronomy, chemical crystallography, rural economy and land use, classics, climate science and social and public health science). The study looked at data creation and care, motivations for sharing data, discovery, access and usability of datasets and quality assurance of data in each discipline.”
Jody Williams – Wikipedia, the free encyclopedia
- Jody Williams won the Nobel Peace Prize for her work on the landmine treaty. She has some interesting things to say about group-building: “Imagine trying to get hundreds of organizations â€“ each one independent and working on many, many issues â€“ to feel that each is a critical element of the development of a new movement. I wanted each to feel that what they had to say about campaign planning, thinking, programs, actions was important. So, instead of sending letters, Iâ€™d send everyone faxes. People got in the habit of faxing back. This served two purposes â€“ people would really have to think about what they were committing to doing before writing it down, and we have a permanent, written record of almost everything in the development of the campaign from day one. “
Open Source Cinema – An Open Source Documentary Film about Copyright
picoup:
- Like twitter, but with an 18 character limit. Amusing. Presumably this is pico-blogging.
Demographics, Career Concerns or Social Comparison: Who Games SSRN Download Counts? by Benjamin Edelman, Ian Larkin
- A study of gaming of online metrics: “We use a unique database of every SSRN paper download over the course of seven years, along with detailed resume data on a random sample of SSRN authors, to examine the role of demographic factors, career concerns, and social comparisons on the commission of a particular type of gaming: the self-downloading of an author’s own SSRN working paper solely to inflate the paper’s reported download count. We find significant evidence that authors are more likely to inflate their papers’ download counts when a higher count greatly improves the visibility of a paper on the SSRN network. We also find limited evidence of gaming due to demographic factors and career concerns, and strong evidence of gaming driven by social comparisons with various peer groups. These results indicate the importance of including psychological factors in the study of deceptive behavior. “
The International Campaign to Ban Landmines – A Model for Disarmament Initiatives
- Very interesting look at how the ICBL worked, and how it helped achieve the landmine treaty.
Translating “The Economist” Behind China’s Great Firewall – Waxy.org
- “a group of dedicated fans of The Economist newsmagazine are translating each weekly issue cover-to-cover, splitting up the work among a team of volunteers, and redistributing the finished translations as complete PDFs for a Chinese audience. “
Interview with Clay Shirky, Part I : Columbia Journalism Review
A Wiki for the Planet: Clay Shirky on Open Source Environmentalism | Wired Science from Wired.com
- “Wired.com: What sort of online tools would enable that kind of collaboration? What’s missing?
  Shirky: What’s missing is there’s no license. There’s no equivalent of the GPL [GNU General Public License]. There’s been some tools for collaborative production of thinking. Anything from mailing lists to Wikis. What we don’t have is tools for acting once a decision has been taken. In Linux the final action is the compile step. In Wikipedia the action is the latest edit.
  
  Now we need to turn around and do X out in the world. I don’t think that there’s anything digital that we could do that would solve this gap. I think the gap is produced by the difficulty of translating thought into action. I think the kind of things that help people turn thoughts into action are much more about social and legal structures. “

Click here for all of my del.icio.us bookmarks.

wiki_tex.py

I’ve written a simple python script called wiki_tex.py to help convert LaTeX to the wiki-tex used on MediaWikis like the Polymath1 wiki, and Wikipedia. It’s a stripped-down port of a similar script I use to put LaTeX on my blog – the original script is heavily customized, which is why I haven’t made it generally available.

A zipped version of the script can be downloaded here.

To use the script, put a copy of the script in the directory you’re writing LaTeX in. You also need to make sure Python is installed on your machine. Mac and most Linuxes have it pre-installed. Windows doesn’t, but you can get a one-click installer at the python site. I wrote the script under Python 2.5, but would be surprised if it doesn’t work under Python 2.x. I don’t know how it runs under the recently released Python 3.

To run the script, you need a command line prompt, and to have the directory set to be whatever directory you’re writing LaTeX in. Once in the right directory, just run:

python wiki_tex.py filename

where filename.tex is the file you’re texing. Note the omission of the .tex suffix is intentional. The script will extract everything between \begin{document} and \end{document}, convert to wiki-tex, and output a file filename0.wiki which you can cut and paste into the MediaWiki.

As a convenience, if you put lines of the form “%#break” into the file (these are comments, so LaTeX ignores them) then the script breaks up the resulting output into multiple files, filename0.wiki, filename1.wiki, and so on. This is useful for compiling smaller snippets of LaTeX into wiki-tex.

The script is very basic. There’s many conversions it won’t handle, but there’s quite a bit of basic stuff that it copes with just fine. I’d suggest starting with some small examples, and gradually working up in complexity.

The script is also not very battle-tested. It’s undergone a little testing, but this is well and truly alpha software. Let me know in comments if you find bugs, and I’ll try to fix them.