How changing the technology of collaboration can change the nature of collaboration

Update: Ilya (who made the video) reports in comments that some fraction of the effect described below is an artifact. It’s hard to say how much. Here’s Ilya’s comment:

Michael, apparently one of the reasons you see the explosion in commits is because Git correctly attributes the changeset to the author. In [Subversion] days, the patch would be submitted by some author, but then one of the core team members would merge it in (under his/her name). Basically, same workflow with Git, but with proper attribution.

Having said that, I think seeing other people commit and get their changes merged in also encourages other developers to join in on the fray!

So what’s said in the post may or may not be true. But the video shown isn’t evidence for it. A pity. It’d be nice to have a clearly visualizable demonstration of this general phenomenon.

Ilya Grigorik recently pointed me to a great example which shows vividly how even relatively modest changes in the technology of collaboration can change the nature of collaboration. The example is from an open source project called Ruby on Rails. Ruby on Rails is a web development framework famous within the web development community – it’s been used to develop well-known sites such as Twitter – but, unlike, say, Linux, it’s largely unknown outside that community. The original developer of Ruby on Rails is a programmer named David Heinemeier Hansson, who for a long time worked on the framework on his own before other people gradually began to join him.

The short video below shows the history of the collaboration graphically – what you see are pieces of code being virtually shuttled backward and forward between different contributors to the collaboration. There’s no need to watch the whole video, although it’s fun to do so: in the bottom right of the video you’ll see a date ticking over, and you can simply fast forward to January 2008, and watch until June 2008. Here’s the video:

(Edit: It’s better in high definition at Vimeo. As it is, it’s hard to see the dates – the relevant part of the video is roughly from 4:00 to 5:30.)

What you see, very vividly, is that in April 2008, a qualitative change occurs in the collaboration. Before April, you see a relatively slow and stately development process. After April, that process explodes with vastly more contributors and activity. What happened in April was this: the Ruby on Rails developers changed the tool they used to share code. Before April they used a tool called Subversion. In April of 2008 they switched to a new tool called Git (managed through GitHub). As changes go, this was similar to a group of writers changing the wiki software they use to collaborate on a shared set of documents. What’s interesting is that the effect on the collaboration was so dramatic, out of proportion to our everyday experience; it’s almost as though Ernest Hemingway had gotten a qualitative improvement in his writing by changing the type of pen he used.

I won’t say much here about the details of what caused the change. Briefly, Git and Github are a lot more social than Subversion, making it easier for people to go off and experiment with code on their own, to merge useful changes back in, and to track the activity of other people. Git was, in fact, developed by Linus Torvalds, to help Linux development scale better.

The background to all this is that I’ve been collecting some thoughts about the ongoing Polymath project, an experiment in open source mathematics, and the question of how projects like Polymath can be scaled up further. I’ll have more to say about that in future posts, but for now it seemed worth describing this striking example of how changes in technology can result in changes in the nature of collaboration.


Biweekly links for 03/13/2009

  • Polymath1 and open collaborative mathematics « Gowers’s Weblog
    • “… for me personally this has been one of the most exciting six weeks of my mathematical life. That is partly because it is always exciting to solve a problem, but a much more important reason is the way this problem was solved, with people chipping in with their thoughts, provoking other people to have other thoughts (sometimes almost accidentally, and sometimes more logically), and ideas gradually emerging as a result. Incidentally, many of these ideas are still to be properly explored: at some point the main collaboration will probably be declared to be over (though I suppose in theory it could just go on and on, since its seems almost impossible to clear up every interesting question that emerges) and then I hope that the comments will be a useful resource…

      The sheer speed at which all this happened contributed to the excitement. In my own case it led to my becoming fairly obsessed with the project and working on it to the exclusion of almost everything else… “

  • Twenty years of the world wide web | The Economist
    • “Science inspired the world wide web. Two decades on, the web has repaid the compliment by changing science”
  • Get our full university data | guardian.co.uk
  • Paul Graham: Be Relentlessly Resourceful
    • One of my favourites of pg’s essays: “A couple days ago I finally got being a good startup founder down to two words: relentlessly resourceful.”

Click here for all of my del.icio.us bookmarks.


Introduction to Statistical Machine Translation

How do computers translate texts from one language to another? Human translators use a great deal of detailed knowledge about how the world works to correctly translate all the different meanings the same word or phrase can have in different contexts. This makes automated translation seem like it might be a hard problem to solve, the kind of problem that may require genuine artificial intelligence. Yet although translators like Google Translate and Yahoo!’s Babel Fish are far from perfect, they do a surprisingly good job. How is that possible?

In this post, I describe the basic ideas behind the most successful approach to automated machine translation, an approach known as statistical machine translation. Statistical machine translation starts with a very large data set of good translations, that is, a corpus of texts (e.g., United Nations documents) which have already been translated into multiple languages, and then uses those texts to automatically infer a statistical model of translation. That statistical model is then applied to new texts to make a guess as to a reasonable translation.

I’m not an expert in statistical machine translation. I’ve written these notes to help me internalize the basic ideas of the area. However, I hope the notes are useful to others as an introductory overview of statistical machine translation, and as a starting place to learn more.

Formulating the problem

Imagine you’re given a French text, [tex]f[/tex], and you’d like to find a good English translation, [tex]e[/tex]. There are many possible translations of [tex]f[/tex] into English, of course, and different translators will have different opinions about what the best translation, [tex]e[/tex], is. We can model these differences of opinion with a probability distribution [tex]\mbox{pr}(e|f)[/tex] over possible translations, [tex]e[/tex], given that the French text was [tex]f[/tex]. A reasonable way of choosing the “best” translation is to choose [tex]e[/tex] which maximizes the conditional probability [tex]\mbox{pr}(e|f)[/tex].

The problem with this strategy is that we don’t know the conditional probability [tex]\mbox{pr}(e|f)[/tex]. To solve this problem, suppose we’re in possession of an initial corpus of documents that are in both French and English, e.g., United Nations documents, or the Canadian parliamentary proceedings. We’ll use that corpus to infer a model estimating the conditional probabilities [tex]\mbox{pr}(e|f)[/tex]. The model we’ll construct is far from perfect, but with a large and high-quality initial corpus it yields pretty good translations. To simplify the discussion we assume [tex]e[/tex] and [tex]f[/tex] are single sentences, and we’ll ignore punctuation; obviously, the translator can be applied serially to a text containing many sentences.

Now, how do we start from the corpus and infer a model for [tex]\mbox{pr}(e|f)[/tex]? The standard approach is to use Bayes’ theorem to first rewrite [tex]\mbox{pr}(e|f)[/tex] as

[tex] \mbox{pr}(e|f) = \frac{\mbox{pr}(f|e) \mbox{pr}(e)}{\mbox{pr}(f)}. [/tex]

Because [tex]f[/tex] is fixed, maximizing over [tex]e[/tex] is equivalent to maximizing

[tex] \mbox{pr}(f|e) \mbox{pr}(e). [/tex]

What we’re going to do is to use our data set to infer models of [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex], and then use those models to search for [tex]e[/tex] maximizing [tex]\mbox{pr}(f|e) \mbox{pr}(e)[/tex].

The procedure outlined in the last paragraph seems a priori like a strange thing to do. Why not save time and difficulty by inferring a model of [tex]\mbox{pr}(e|f)[/tex] and then maximizing directly? It turns out that the procedure I described actually works quite a bit better. Now, to be blunt, I don’t entirely understand the explanations of this I’ve read in the published literature. But what I think is going on is as follows. It turns out that if we work directly with [tex]\mbox{pr}(e|f)[/tex], then many malformed English sentences [tex]e[/tex] are given a high probability by our model. You’ll see why this is the case later – it’s obvious from the construction – but for now just accept that’s the case. If we worked directly with [tex]\mbox{pr}(e|f)[/tex], those sentences would often be issued as “translations”, which is no good. Now, why should maximizing [tex]\mbox{pr}(f|e) \mbox{pr}(e)[/tex] work any better? After all, both [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex] may give high probabilistic weight to sentences [tex]e[/tex] which are not well-formed. So what gives? I think there are two things going on. First, although [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex] may each give high weight to some malformed sentences [tex]e[/tex], the two models are constructed very differently, so it’s unlikely that they both simultaneously give high weight to the same malformed sentence. Second, we can build grammatical checks directly into the construction of [tex]\mbox{pr}(e)[/tex], so very few malformed sentences have high values for [tex]\mbox{pr}(e)[/tex]. We won’t actually do that in this post, but it’s relatively straightforward to do.

We’ve broken machine translation up into three problems: (1) build a language model which allows us to estimate [tex]\mbox{pr}(e)[/tex]; (2) build a translation model which allows us to estimate [tex]\mbox{pr}(f|e)[/tex]; and (3) search for [tex]e[/tex] maximizing the product [tex]\mbox{pr}(f|e)\mbox{pr}(e)[/tex]. Each of these problems is itself a rich problem which can be solved in many different ways. In the next three sections I’ll describe simple approaches to solving each of the three problems.
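To make the decomposition concrete, here’s a minimal Python sketch of that three-part structure. It’s only a skeleton, under my own naming: language_model_prob, translation_model_prob and candidate_translations are hypothetical placeholders standing in for the components built in the following sections, not functions from any real library.

    # A skeleton of the noisy-channel decomposition.  The three helper names
    # below are hypothetical placeholders for the components built in the
    # following sections, not functions from any real library.

    def language_model_prob(e):
        """Return an estimate of pr(e) for an English sentence e (a list of words)."""
        raise NotImplementedError  # e.g. the bigram model sketched below

    def translation_model_prob(f, e):
        """Return an estimate of pr(f|e) for a French sentence f given English e."""
        raise NotImplementedError  # e.g. the alignment-based model sketched below

    def candidate_translations(f):
        """Yield candidate English sentences for the French sentence f."""
        raise NotImplementedError  # in practice this is the search problem itself

    def translate(f):
        """Return the candidate e maximizing pr(f|e) * pr(e)."""
        return max(candidate_translations(f),
                   key=lambda e: translation_model_prob(f, e) * language_model_prob(e))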

The language model

Suppose we break an English sentence [tex]e[/tex] up into words [tex]e = e_1e_2\ldots e_m[/tex]. Then we can write the probability for [tex]e[/tex] as a product of conditional probabilities:

[tex] \mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) [/tex]

So, for example, in the sentence fragment “To be or not to…”, you would probably assign a very high conditional probability to the final word being “be”, certainly much higher than the probability of its occurrence at a random point in a piece of text.
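For concreteness, here is the chain rule written out for that fragment, treating it as the complete six-word sentence “to be or not to be” (my own worked example); the final factor is the one we’d expect to be close to one:

[tex] \mbox{pr}(\mbox{to be or not to be}) = \mbox{pr}(\mbox{to}) \times \mbox{pr}(\mbox{be}|\mbox{to}) \times \mbox{pr}(\mbox{or}|\mbox{to},\mbox{be}) \times \mbox{pr}(\mbox{not}|\mbox{to},\mbox{be},\mbox{or}) \times \mbox{pr}(\mbox{to}|\mbox{to},\mbox{be},\mbox{or},\mbox{not}) \times \mbox{pr}(\mbox{be}|\mbox{to},\mbox{be},\mbox{or},\mbox{not},\mbox{to}). [/tex]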

The challenge in building a good language model [tex]\mbox{pr}(e)[/tex] is that there are so many distinct conditional probabilities that need to be estimated. To simplify the problem we’ll make some assumptions about the form of the probabilities. The most drastic assumption we could make is to assume that the probability of seeing a word is independent of what came before it, i.e.,

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j), [/tex]

so

[tex] \mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j). [/tex]

We could estimate the probabilities [tex]\mbox{pr}(e_j)[/tex] by taking a very large corpus of English text, and counting words. The problem is that this model is not very realistic. A more realistic model is the bigram model, which assumes that the probability of a word occurring depends only on the word immediately before it:

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-1}). [/tex]

More realistic still is the trigram model, which assumes that the probability of a word occurring depends only on the two words immediately before it:

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-2},e_{j-1}). [/tex]

To make the discussion which follows as concrete as possible, let’s concentrate on the problem of how to estimate conditional probabilities in the bigram model. Similar remarks can be made for the trigram model. The obvious way to proceed is to take a large corpus of English text, count the number of occurrences [tex]\#(e_1,e_2)[/tex] of a particular word pair in that corpus, and then set [tex]\mbox{pr}(e_2|e_1) = \#(e_1,e_2)/\#(e_1)[/tex].
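In code, that counting estimate might look like the following Python sketch (my own illustration, with a deliberately tiny toy “corpus” and no handling of sentence boundaries):

    from collections import Counter

    def bigram_model(corpus_words):
        """Estimate pr(e2|e1) = #(e1,e2) / #(e1) by counting word pairs."""
        unigram_counts = Counter(corpus_words)
        bigram_counts = Counter(zip(corpus_words, corpus_words[1:]))

        def pr(e2, e1):
            if unigram_counts[e1] == 0:
                return 0.0  # unseen history; the smoothing discussed below addresses this
            return bigram_counts[(e1, e2)] / unigram_counts[e1]

        return pr

    # Toy usage:
    words = "to be or not to be that is the question".split()
    pr = bigram_model(words)
    print(pr("be", "to"))  # 1.0 on this toy corpus: "to" is always followed by "be"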

The problem with this procedure is that there are a lot of bigrams. If we assume that there are 50,000 words in English, say, then there are 2.5 billion possible bigrams. Even if you take a large corpus of training data (say, a billion words), it’s reasonably likely that there will be some bigrams which don’t appear in your corpus, and thus are assigned zero probability, yet which you would a priori like to appear in translations of some French sentences. That is, this kind of training procedure is likely to underestimate the probability of bigrams which don’t appear in the training set, and overestimate the probability of those which do. The problem is even worse for trigrams.

There’s no obvious best solution to this problem. Many different ad hoc solutions have been tried, and my quick survey of the literature suggests that there’s as yet no broad agreement about the best way of solving the problem. To give you the flavour of the solutions people use, though, let me describe two basic approaches, adapted from the survey in section 2.3 of Philip Clarkson’s PhD thesis.

The first approach is to move away from a pure bigram model, and instead to use linear interpolation between the unigram (single-word) and bigram models. A large and diverse enough corpus of text is likely to give pretty good estimates for nearly all single-word probabilities [tex]\mbox{pr}(e_1)[/tex]. So one way of estimating [tex]\mbox{pr}(e_2|e_1)[/tex] is as:

[tex] \mbox{pr}(e_2|e_1) = \lambda \frac{\#(e_2)}{N} + (1-\lambda) \frac{\#(e_1,e_2)}{\#(e_1)}, [/tex]

where [tex]N[/tex] is the total number of words in the corpus. [tex]\lambda[/tex] is a parameter in the range [tex]0 < \lambda < 1[/tex] which needs to be determined. This can be done using a second corpus of text, and setting [tex]\lambda[/tex] so that the average probability of the bigrams in that corpus is maximized.

A second approach is to apply a discount factor to the conditional probabilities for the bigrams which appear in the training corpus, essentially reducing their probability. The easiest way to proceed is simply to multiply them all by some constant [tex]c[/tex] in the range [tex]0 < c < 1[/tex], and then, for each [tex]e_1[/tex], to spread the remaining probability [tex]1-c[/tex] uniformly over all words [tex]e_2[/tex] for which the bigram [tex](e_1,e_2)[/tex] doesn’t appear in the corpus.
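Here’s a sketch of the interpolated estimate in the same style as the previous snippet (again my own illustration; the default value of [tex]\lambda[/tex] is arbitrary, and in practice would be tuned on a second corpus as described above):

    from collections import Counter

    def interpolated_bigram_model(corpus_words, lam=0.2):
        """Estimate pr(e2|e1) = lam * #(e2)/N + (1 - lam) * #(e1,e2)/#(e1),
        i.e. a linear interpolation of the unigram and bigram estimates."""
        n = len(corpus_words)
        unigram_counts = Counter(corpus_words)
        bigram_counts = Counter(zip(corpus_words, corpus_words[1:]))

        def pr(e2, e1):
            unigram_term = unigram_counts[e2] / n
            if unigram_counts[e1] == 0:
                return lam * unigram_term  # unseen history: only the unigram term helps
            bigram_term = bigram_counts[(e1, e2)] / unigram_counts[e1]
            return lam * unigram_term + (1 - lam) * bigram_term

        return pr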

Exercises

  • The constant [tex]c[/tex] can be viewed as an estimate of the fraction of all English bigrams that appear in the training corpus. How might you go about obtaining an estimate for [tex]c[/tex]?

To conclude this section, let me mention that the problem of building a good language model is also of great interest to people working on speech recognition software. In fact, much of the research literature on language models can be applied equally to either speech recognition or statistical machine translation.

The translation model

Note: English is my only language, which makes it hard for me to construct translation examples! All my examples are taken from a paper by Brown et al.

In this section we’ll construct a simple translation model allowing us to compute [tex]\mbox{pr}(f|e)[/tex]. Intuitively, when we translate a sentence, words in the source text generate (possibly in a context-dependent way) words in the target language. In the sentence pair (Jean aime Marie | John loves Mary) we intuitively feel that John corresponds to Jean, loves to aime, and Mary to Marie. Of course, there is no need for the word correspondence to be one-to-one, nor for the ordering of words to be preserved. Sometimes, a word in English may generate two or more words in French; sometimes it may generate no word at all.

Despite these complications, the notion of a correspondence between words in the source language and in the target language is so useful that we’ll formalize it through what is called an alignment. Rather than give a precise definition, let me explain alignments through an example, the sentence pair (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)). In this example alignment, the numbers in parentheses tell us that John corresponds to the 6th word in the French sentence, i.e., Jean. The word does has no trailing parentheses, and so doesn’t correspond to any of the words in the French sentence. However, beat corresponds to not one, but two separate words in the French sentence, the 3rd and 4th words, est and battu. And so on.

Two notions derived from alignments are particularly useful in building up our translation model. The first is fertility, defined as the number of French words generated by a given English word. So, in the example above, does has fertility [tex]0[/tex], since it doesn’t generate anything in the French sentence. On the other hand, beat has fertility [tex]2[/tex], since it generates two separate words in the French sentence.

The second notion is distortion. In many sentences, the English word and its corresponding French word or words appear in the same part of the sentence – near the beginning, perhaps, or near the end. We say that such words are translated with little distortion, while words which move a great deal have high distortion. We’ll encode this notion more formally shortly.

We’ll build up our translation model using some simple parameters related to fertility and distortion:

  • The fertility probability [tex]\mbox{pr}(n|e)[/tex], the probability that the English word [tex]e[/tex] has fertility [tex]n[/tex].
  • The distortion probability [tex]\mbox{pr}(t|s,l)[/tex], which is the probability that an English word at position [tex]s[/tex] corresponds to a French word at position [tex]t[/tex] in a French sentence of length [tex]l[/tex].
  • The translation probability [tex]\mbox{pr}(f|e)[/tex], one for each French word [tex]f[/tex] and English word [tex]e[/tex]. This should not be confused with the case when [tex]f[/tex] and [tex]e[/tex] are sentences!

I’ve described the translation model as a way of computing [tex]\mbox{pr}(f|e)[/tex], where [tex]e[/tex] is an English sentence, and [tex]f[/tex] is a French sentence. In fact, we’re going to modify that definition a little, defining the translation model as the probability [tex]\mbox{pr}(f,a|e)[/tex] that the French sentence [tex]f[/tex] is the correct translation of [tex]e[/tex], with a particular alignment, which we’ll denote by [tex]a[/tex]. I’ll return to the question of how this change in definition affects translation in the next section. Rather than specify the probability for our translation model formally, I’ll simply show how it works for the example alignment (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)):

[tex]\mbox{pr}(1|John) \times \mbox{pr}(Jean|John) \times \mbox{pr}(6|1,6) \times [/tex]

[tex]\mbox{pr}(0|does) \times [/tex]

[tex]\mbox{pr}(2|beat) \times \mbox{pr}(est|beat) \times \mbox{pr}(3|3,6) \times \mbox{pr}(battu|beat) \times \mbox{pr}(4|3,6) \times [/tex]

[tex]\mbox{pr}(1|the) \times \mbox{pr}(Le|the) \times \mbox{pr}(1|5,6) \times [/tex]

[tex]\mbox{pr}(1|dog) \times \mbox{pr}(chien|dog) \times \mbox{pr}(2|6,6) \times [/tex]

[tex]\mbox{pr}(1|\mbox{null}) \times \mbox{pr}(par|\mbox{null}) [/tex]

This should be self-explanatory except for the final line, for the French word par. This word has no corresponding word in the English sentence, which we model as being generated by a special null word.
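To show how the pieces fit together, here’s a Python sketch that assembles [tex]\mbox{pr}(f,a|e)[/tex] from the three parameter tables, following the definition of the distortion probability given above. The dictionaries fertility_prob, translation_prob and distortion_prob, and the "<null>" token, are hypothetical placeholders; a real system would estimate these parameters from a corpus, as described next.

    def alignment_prob(f_words, e_words, alignment, null_aligned,
                       fertility_prob, translation_prob, distortion_prob):
        """Compute pr(f, a | e) as a product of fertility, translation and
        distortion probabilities.

        f_words      : the French sentence as a list of words
        e_words      : the English sentence as a list of words
        alignment    : alignment[s-1] is the list of French positions (1-based)
                       generated by the English word at position s
        null_aligned : French positions generated by the special null word
        The three probability tables are plain dictionaries:
            fertility_prob[(n, e)]     ~ pr(n | e)
            translation_prob[(f, e)]   ~ pr(f | e) for single words
            distortion_prob[(t, s, l)] ~ pr(t | s, l), l = French sentence length
        """
        l = len(f_words)
        prob = 1.0
        for s, french_positions in enumerate(alignment, start=1):
            e = e_words[s - 1]
            prob *= fertility_prob[(len(french_positions), e)]
            for t in french_positions:
                prob *= translation_prob[(f_words[t - 1], e)]
                prob *= distortion_prob[(t, s, l)]
        # French words with no English counterpart are generated by the null word.
        prob *= fertility_prob[(len(null_aligned), "<null>")]
        for t in null_aligned:
            prob *= translation_prob[(f_words[t - 1], "<null>")]
        return prob

    # The worked example above would be expressed as (probability tables omitted):
    # e_words = ["John", "does", "beat", "the", "dog"]
    # f_words = ["Le", "chien", "est", "battu", "par", "Jean"]
    # alignment = [[6], [], [3, 4], [1], [2]]; null_aligned = [5]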

What remains to be done is to estimate the parameters used in constructing the translation model – the fertility, distortion and translation probabilities. I’ll briefly outline an iterative procedure that can be used to do this estimation. It starts with a simple guess of the parameters. E.g., we might guess that the distortion probabilities [tex]\mbox{pr}(t|s,l) = 1/l[/tex] are uniform across the sentence. Similar guesses could be made for the other parameters. For each pair [tex](e,f)[/tex] of sentences in our corpus we can use this guess to compute the probability of all possible alignments between the sentences. We then estimate the “true” alignment to be whichever alignment has the highest probability. Applying this procedure to the entire corpus gives us many estimated alignments. We can then use those estimated alignments to compute new estimates for all the parameters in our model. E.g., if we find that [tex]1/10[/tex]th of the alignments of sentences of length [tex]25[/tex] have the first word mapped to the first word, then our new estimate is [tex]\mbox{pr}(1|1,25) = 1/10[/tex]. This gives us a procedure for iteratively updating the parameters of our model, which can be repeated many times. Empirically we find (and it can be proved) that the parameters gradually converge to a fixed point.
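As a very rough illustration of that iterative procedure, here’s a Python sketch that estimates only the word-translation probabilities, ignores fertility and distortion, assumes each French word aligns to exactly one English word, and enumerates alignments by brute force. It’s my own toy simplification of the idea, not the algorithm used in real systems (which rely on more careful expectation-maximization methods):

    import itertools
    from collections import defaultdict

    def estimate_translation_probs(sentence_pairs, iterations=5):
        """A heavily simplified sketch of the iterative estimation procedure.

        Only the word-translation probabilities pr(f|e) are estimated (fertility
        and distortion are ignored), each French word is assumed to align to
        exactly one English word, and alignments are enumerated by brute force,
        so this is only practical for toy sentence pairs.  sentence_pairs is a
        list of (french_words, english_words) tuples.
        """
        # Initial guess: pr(f|e) proportional to how often f and e co-occur in a
        # sentence pair.  (A completely uniform guess would make the argmax over
        # alignments arbitrary in the first iteration.)
        t = defaultdict(float)
        e_totals = defaultdict(float)
        for f_words, e_words in sentence_pairs:
            for f in f_words:
                for e in e_words:
                    t[(f, e)] += 1.0
                    e_totals[e] += 1.0
        for (f, e) in t:
            t[(f, e)] /= e_totals[e]

        for _ in range(iterations):
            counts = defaultdict(float)
            totals = defaultdict(float)
            for f_words, e_words in sentence_pairs:
                # Find the most probable alignment by brute-force enumeration:
                # alignment[j] = index of the English word generating French word j.
                best, best_prob = None, -1.0
                for alignment in itertools.product(range(len(e_words)),
                                                   repeat=len(f_words)):
                    prob = 1.0
                    for j, i in enumerate(alignment):
                        prob *= t[(f_words[j], e_words[i])]
                    if prob > best_prob:
                        best, best_prob = alignment, prob
                # Treat that alignment as the "true" one and count its word pairs.
                for j, i in enumerate(best):
                    counts[(f_words[j], e_words[i])] += 1.0
                    totals[e_words[i]] += 1.0
            # Re-estimate pr(f|e) from the counted alignments.
            t = defaultdict(float, {(f, e): c / totals[e]
                                    for (f, e), c in counts.items()})
        return t

    # Toy usage (two very short sentence pairs; results on such tiny data are crude):
    pairs = [("Jean aime Marie".split(), "John loves Mary".split()),
             ("Jean aime le chien".split(), "John loves the dog".split())]
    probs = estimate_translation_probs(pairs)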

Exercises

  • A problem I’ve glossed over above is figuring out which sentences in the English corpus correspond to which sentences in the French corpus. How could you figure out which sentences correspond to which? What would be the effect of mistakes?

Searching for the maximizing [tex]e[/tex]

In my introductory remarks I explained how translation from French to English can be viewed as the problem of finding [tex]e[/tex] which maximizes [tex]\mbox{pr}(e|f)[/tex], and that this was equivalent to maximizing [tex]\mbox{pr}(f|e)\mbox{pr}(e)[/tex]. Let’s revisit the reasoning which led to that framing of the problem, and recast it in a form more amenable to our translation model based on alignments. Instead of searching for [tex]e[/tex] which maximizes [tex]\mbox{pr}(e|f)[/tex], let’s instead search for a pair [tex](e,a)[/tex] which maximizes [tex]\mbox{pr}(e,a|f)[/tex]. Using Bayes’ theorem this is equivalent to finding [tex](e,a)[/tex] which maximizes

[tex] \mbox{pr}(e,a|f) = \frac{\mbox{pr}(f,a|e) \mbox{pr}(e)} {\mbox{pr}(f)}. [/tex]

As earlier, we can omit the factor [tex]\mbox{pr}(f)[/tex] in the denominator, and so our search is equivalent to finding a pair [tex](e,a)[/tex] which maximizes

[tex] \mbox{pr}(f,a|e) \mbox{pr}(e). [/tex]

Our translation problem is to find [tex](e,a)[/tex] maximizing this expression. Unfortunately, there are far too many possibilities to do an exhaustive search. But there are some heuristics which work pretty well. One is to use a greedy algorithm which gradually builds up partial translations and alignments. Consider the following partial translations and alignments:

(Jean aime Marie | John (1) *)

(Jean aime Marie | * loves (2) *)

(Jean aime Marie | Petroleum (1) *)

Using the translation and language models described earlier we can compute the probabilities for each of these partial alignments, and so compute the product [tex]\mbox{pr}(f,a|e) \mbox{pr}(e)[/tex]. It would be very surprising if the first two candidates weren’t assigned a high probability, and the third example a very low probability. Our heuristic for search is thus to maintain a stack of promising partial translations and alignments, and to continually extend those partial translations and alignments until we find a complete translation and alignment with a significantly higher probability than any of the partial alignments.
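Here’s a toy Python sketch of that stack idea, simplified so that hypotheses translate the French words left to right, one word per step, with no reordering, dropped words or multi-word phrases. The helpers candidate_words and score are hypothetical stand-ins for proposals from the translation model and for the product [tex]\mbox{pr}(f,a|e)\mbox{pr}(e)[/tex]:

    import heapq

    def stack_search(f_words, candidate_words, score, beam_width=10):
        """A toy version of the stack-based search heuristic.

        f_words         : the French sentence as a list of words
        candidate_words : candidate_words(f_word) returns plausible English
                          words for one French word (hypothetical helper)
        score           : score(e_words) returns a stand-in for
                          pr(f,a|e) * pr(e) of a partial translation
                          (also a hypothetical helper)
        Hypotheses translate the French words left to right, one English word
        per French word; real decoders also allow reordering, dropped words
        and multi-word phrases.
        """
        stack = [(0.0, [])]  # entries are (negative score, partial translation)
        for f_word in f_words:
            extended = []
            for _, partial in stack:
                for e_word in candidate_words(f_word):
                    new_partial = partial + [e_word]
                    extended.append((-score(new_partial), new_partial))
            # Keep only the most promising partial translations.
            stack = heapq.nsmallest(beam_width, extended)
        return min(stack)[1]  # the highest-scoring complete translation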

Commentary

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data. Worry about the data first before you worry about the algorithm.

Peter Norvig, Director of Research at Google, as reported by Greg Linden.

That completes our description of a simple statistical machine translation system. The system I’ve described is essentially the same as the system described in 1990 by Brown et al at IBM research, a system which translated roughly half of all sentences in an acceptable way! Personally, I find it remarkable that such simple ideas can be used to make so much progress on what appears to be an enormously difficult problem. (I’ll describe other examples in future posts.) Furthermore, although the basic ideas remain the same, many improvements to these ideas have been made since 1990. The biggest single advance seems to have been a movement away from words as the unit of language, and towards phrase-based models, which give greatly improved performance.

If you’re interested in understanding more, I strongly suggest checking out an accessible and very influential 1993 follow-up paper from the IBM group, which describes a whole plethora of translation models. I don’t have a single canonical reference to give on phrase-based statistical translation models, but found several of the top hits at Google Scholar quite informative. Finally, and just for fun, I recommend Peter Norvig’s excellent short introduction to automatic spelling correction, which uses similar ideas in a much (!) simpler setting, but has the big advantage that it is illustrated by real code.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

