Implementing Statistical Machine Translation Using MapReduce

This post is part of my series on the Google Technology Stack. It’s something of a digression, and is optional (but fun, and worth reading!) We’ll return to the main line of development in the next post.

This post is a followup to an earlier post about the beautiful subject of statistical machine translation. In the earlier post I described how to build a statistical model that can be used to automatically translate texts from one language to another. The ideas are beautiful, but require a lot of computing power – the more computing power, the better the translation. In this post I describe a way of implementing statistical machine translation in parallel on a large cluster of machines. The post assumes: (1) that you’re comfortable with my previous post on statistical machine translation; and (2) that you’re comfortable with the MapReduce framework for distributed computing.

Before going any further, a warning: I haven’t implemented the system described here. I’ve no idea if Google does machine translation with a MapReduce implementation, although Google Translate definitely uses statistical machine translation. It’s possible there’s a bottleneck in my implementation that I’m not noticing. Ideally, we should carefully analyse the resources used in my proposed implementation, and, if that reveals no problems, build a working system, and test to see how well it scales in practice. I haven’t done any of that. Still, putting statistical machine translation into the MapReduce framework is a great way of deepening our understanding of both sets of ideas, and I’m pretty confident that many of these ideas could be used in a production machine translation system.

With that said, it should be noted that the simple statistical machine translation model described in the previous post can actually be implemented on a single machine. However, as discussed in the earlier post, it’s easy to tweak that model so it performs far better, at the cost of making it too complex to run on a single machine. The ideas described in this post are easily adapted to many of those tweaked models of machine translation.

One final point before we jump into the meat of the post. Earlier posts in this series have been pretty detailed. In this post I’ve glossed over some details so as not to obscure the forest with too many trees – there are only so many Map and MapReduce jobs you can read about before your eyes start to glaze over. But the missing details should be easy for the reader to fill in, and I’ve tried to indicate where they are needed.

Overview: what our system needs to do

Recall the basic elements of statistical machine translation. It’s split into two main phases. The first phase is the training phase, in which a statistical model of translation is built, using a corpus of texts in both the source and target language (French and English, for the sake of concreteness). The training phase is itself split into three parts: (i) document collection, where we assemble the corpus of texts from which the statistical model will be inferred; (ii) building the language model for the target language; and (iii) building the translation model from the target language to the source language (yes, that’s the right way round, see the earlier post!) The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. It’s this phase which is actually used directly by the end user – the training phase all happens offline, beforehand.

Altogether, that makes four separate procedures we need to implement. We’ll go through them one by one in the remainder of this post. Before doing that, though, there is one idea implicit in my MapReduce post that needs to be made more explicit. This is the idea of a distributed hash of key-value pairs. Given a set of keys and corresponding values [tex](k_1:v_1),(k_2:v_2),\ldots[/tex], a crucial part of MapReduce is distributing the key-value pairs evenly over all computers in the cluster.

Here’s a simple way this may be achieved. Suppose there are [tex]n[/tex] computers in the cluster. Number them [tex]0,\ldots,n-1[/tex]. Let [tex]\#(\cdot)[/tex] be a hash function. An unfortunate fact is that this use of the term “hash” is completely different from the usage of “hash” to denote a set of key-value pairs. In practice, this isn’t as confusing as you might think, but you should note the difference. In any case, the way we set up our distributed hash is to store the key-value pair [tex](k:v)[/tex] on computer number [tex]\#(k) \mod n[/tex]. Because the hash function distributes keys in a uniform way, the result is to spread the key-value pairs through the cluster in a uniform way.
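To make the assignment rule concrete, here’s a tiny Python sketch. The md5 digest is just a convenient stand-in for the hash function [tex]\#(\cdot)[/tex]; nothing here is specific to any real system.

import hashlib

def machine_for_key(key, n_machines):
    """Return the index of the machine that stores the key-value pair for `key`.

    md5 is just a stand-in for a hash function that mixes keys well; any
    reasonably uniform hash would serve the same purpose.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_machines

# With a 5-machine cluster, each key lands on one fixed machine.
for key in ["the", "quick", "brown", "fox"]:
    print(key, "->", machine_for_key(key, 5))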

In a practical implementation, this approach to storing a distributed hash has some drawbacks, which we’ll come back to in later posts about the Google File System and BigTable. But for now, you can think of a distributed hash in this way.

Document collection

There are many possible ways we could be presented with the documents we’ll use to build our statistical model of translation. To make things concrete, and to illustrate the general ideas used to manage the distribution of documents across the cluster, let’s imagine we’re presented with them as a file containing several million URLs, with each URL pointing to an English language text which we’ll use to build a language model. Alternatively, the file might contain URLs pointing to pairs of translated texts, in English and in French, which we’ll use to build the translation model. The same basic ideas apply in both cases, so to be specific, I’ll suppose the URLs point to English-only texts.

We use the URLs as keys in a distributed hash. Assuming that they start on a single machine, we first need to distribute the keys across the cluster, and then initialize the corresponding values to empty. The result is a distributed hash with key-value pairs (URL:empty). We now use that distributed hash as input to a Map job, which applies a mapper to (URL:empty), downloading the webpage for the URL, and storing the resulting document as the value corresponding to the key. The resulting distributed hash has key-value pairs (URL:document). We’ll use that distributed hash in constructing the language model, as described below.
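As a rough sketch, a mapper for this job might look like the following. The emit callback, and the way the framework hands (key, value) pairs to the mapper, are my assumptions about the surrounding MapReduce machinery, not a description of any particular system.

from urllib.request import urlopen

def fetch_mapper(url, value, emit):
    """Mapper for the document-collection Map job: (URL:empty) -> (URL:document)."""
    try:
        document = urlopen(url, timeout=30).read().decode("utf-8", "replace")
    except Exception:
        document = ""  # a real system would log and retry failed downloads
    emit(url, document)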

Language model

Recall from the previous post that in the bigram model we compute the probability of a sentence as:

[tex] \mbox{pr}(e_1,\ldots,e_m) = \mbox{pr}(e_m|e_{m-1}) \mbox{pr}(e_{m-1}|e_{m-2}) \ldots \mbox{pr}(e_2|e_1) \mbox{pr}(e_1). [/tex]

To compute this probability, we need estimates for the single-word probabilities [tex]\mbox{pr}(e_1)[/tex], and for the bigram conditional probabilities, [tex]\mbox{pr}(e_2|e_1)[/tex].

To estimate the single-word probabilities [tex]\mbox{pr}(e)[/tex] we use two MapReduce jobs. The first MapReduce job counts the total number of words [tex]N[/tex] in all the documents. The input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#(document)), where the intermediate key [tex]0[/tex] is simply an arbitrarily chosen constant, and #(document) is the total number of words in the document. The reducer simply sums over all these intermediate values, giving [tex]N[/tex] as output.

The second MapReduce job is a simple variant on the wordcount example used in the earlier post. It takes as input the hash with entries (URL:document). The mapper parses the document, and each time it comes to a word [tex]e[/tex] emits the intermediate key-value pair [tex](e:1)[/tex], indicating one occurrence of [tex]e[/tex]. The reducers sum over all those counts of [tex]1[/tex], and divide by the total number of words [tex]N[/tex], giving as output a hash with entries [tex](e:\mbox{pr}(e))[/tex], where [tex]\mbox{pr}(e)[/tex] is the relative frequency of [tex]e[/tex] in the document collection.
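In code, the mapper and reducer for this second job might look roughly as follows, with emit again standing in for whatever the framework provides for producing key-value pairs, and N the word total computed by the first job. This is a sketch, not a real MapReduce API.

def unigram_mapper(url, document, emit):
    """Emit (word : 1) for every word occurrence in the document."""
    for word in document.lower().split():
        emit(word, 1)

def unigram_reducer(word, counts, N, emit):
    """Sum the counts for `word` and normalize by the total number of words N."""
    emit(word, sum(counts) / N)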

To estimate [tex]\mbox{pr}(e_2|e_1)[/tex] we also use two MapReduce jobs. The first MapReduce job computes the total number of bigrams, [tex]N'[/tex]. Once again, the input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#'(document)), where the intermediate key [tex]0[/tex] is again an arbitrary constant, and #'(document) is the total number of bigrams in the document. The reducer sums over all these values, giving [tex]N'[/tex] as output.

The second MapReduce job computes the relative frequency of each bigram. Again, the input to this job is the hash with entries (URL:document). The mapper emits an intermediate key-value pair [tex]((e_2|e_1):1)[/tex] for each bigram it finds in the document, possibly with some repetitions. Note that the intermediate keys are of the form [tex](e_2|e_1)[/tex], simply for consistency with our notation elsewhere. The reducer sums over the intermediate values and divides by [tex]N'[/tex] to get the relative frequency of occurrence, [tex]\mbox{pr}(e_1,e_2)[/tex]; the conditional probability we’re after is then [tex]\mbox{pr}(e_2|e_1) = \mbox{pr}(e_1,e_2)/\mbox{pr}(e_1)[/tex] (the difference between [tex]N[/tex] and [tex]N'[/tex] is negligible), using the single-word hash built above, so the output hash has entries [tex]((e_2|e_1):\mbox{pr}(e_2|e_1))[/tex], as was desired.
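In code, the bigram mapper is nearly identical to the unigram mapper above; only the keys change, with each key playing the role of [tex](e_2|e_1)[/tex]. The reducer has the same sum-and-normalize shape as before.

def bigram_mapper(url, document, emit):
    """Emit ((e2|e1) : 1) for every adjacent word pair in the document."""
    words = document.lower().split()
    for e1, e2 in zip(words, words[1:]):
        emit((e2, e1), 1)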

At this point we’ve constructed distributed hashes with key-value pairs [tex](e:\mbox{pr}(e))[/tex] and [tex]((e_2|e_1):\mbox{pr}(e_2|e_1))[/tex], effectively encoding our language model. In particular, access to these distributed hashes makes it easy to compute [tex]\mbox{pr}(E)[/tex] for an entire sentence [tex]E[/tex].

Exercises

  • In the earlier post about statistical machine translation I mentioned that problems can arise in estimating bigram probabilities, because bigrams not in the corpus have their probability systematically underestimated. How might you address this problem on a distributed cluster?

Translation model

Recall from the previous post that our translation model was determined by the following sets of parameters:

  • The fertility probability [tex]\mbox{pr}(n|e)[/tex], the probability that the English word [tex]e[/tex] has fertility [tex]n[/tex].
  • The distortion probability [tex]\mbox{pr}(t|s,l)[/tex], which is the probability that an English word at position [tex]s[/tex] corresponds to a French word at position [tex]t[/tex] in a French sentence of length [tex]l[/tex].
  • The translation probability [tex]\mbox{pr}(f|e)[/tex], one for each French word [tex]f[/tex] and English word [tex]e[/tex].

To determine these parameters we simply start with guesses, and then use an iterative procedure to gradually get better estimates, stopping when the iterative procedure converges on a relatively stable set of parameters.

To implement this strategy, we start by setting up three hashes to store estimates of the parameters. Let’s start with the fertility probabilities. As our starting guess, we’ll assume:

[tex] \mbox{pr}(n|e) \approx \frac{1}{2^{n+1}}. [/tex]

I’ve put the approximation sign in because we’ll actually truncate the distribution at some high value for the fertility, say [tex]n = 10[/tex], so the hash is of a manageable size. Obviously, these probabilities should then be modified slightly, to be properly normalized, but that’s a detail I won’t worry about.

We’ll set up a hash with keys [tex](n|e)[/tex] to store our initial guesses for the fertility probabilities. This can be done in many ways. One way is to use a single MapReduce job which takes as input the hash storing the corpus of English-French texts and produces as output a hash with key-value pairs [tex]((0|e):1/2),((1|e):1/4),((2|e):1/8),\ldots[/tex], where [tex]e[/tex] ranges over the set of English words in our corpus.
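Here’s a small sketch of how the initial guesses for a single word could be generated, with the truncation at [tex]n = 10[/tex] and the renormalization mentioned above made explicit. The function name is mine, purely for illustration.

def initial_fertility_guess(max_fertility=10):
    """Return a dict n -> pr(n|e) proportional to 1/2**(n+1), truncated and normalized."""
    raw = {n: 1.0 / 2 ** (n + 1) for n in range(max_fertility + 1)}
    total = sum(raw.values())
    return {n: p / total for n, p in raw.items()}

# The initialization mapper would emit ((n|e) : pr) pairs with these values
# for every English word e appearing in the corpus.
print(initial_fertility_guess()[0])  # just over 1/2 after renormalization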

At this point, you’ve probably noticed that I’m assuming you’re getting comfortable enough with Map and MapReduce that I can just describe what the jobs do, and leave it to you to fill in the details of how they work. This will continue.

Next, we set up a distributed hash for the distortion probabilities [tex]\mbox{pr}(t|s,l)[/tex]. To do this, we first set up an empty hash (i.e., one with empty values) with keys [tex](t|s,l)[/tex], where [tex]t \leq l \leq 40[/tex], and [tex]s \leq 50[/tex], say. As before, these truncations are rather arbitrary, but their purpose is to limit the size of the hash. Now run a Map job with input the empty hash, [tex]((t|s,l):\mbox{empty})[/tex], and where the mapper simply outputs the value [tex]1/l[/tex], giving us a hash [tex]((t|s,l):1/l)[/tex], i.e., a uniform distribution for target positions.

In the last paragraph you might wonder why we don’t start by initializing the values in the hash as [tex]1/l[/tex], rather than using a Map job to do it. The model I have in mind is that the setup of the initial empty hash is co-ordinated from a single centralized server, with the values automatically being set to a default of empty. We could instead set the values on the centralized server as well, before distribution, but the problem is that this might cause the central server to become a bottleneck. That probably won’t happen here – computing and distributing [tex]1/l[/tex] is pretty easy – but it could easily happen if the computation or values were more complex. As a matter of general principle, it’s better to compute the values using a Map job which can be distributed across the cluster.

The third and final distributed hash we need to set up has keys [tex](f|e)[/tex] and values the translation probabilities [tex]\mbox{pr}(f|e)[/tex]. As our initial guess for these probabilities, we’ll use our corpus of starting documents, and set [tex]\mbox{pr}(f|e)[/tex] to be the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] occurs. We do this using two MapReduce jobs. The first MapReduce job constructs a hash with entries [tex](e: \mbox{number of sentences with } e \mbox{ in them})[/tex], over all words [tex]e[/tex] in the corpus. We use this hash and a second MapReduce job to construct a hash with keys [tex](f|e)[/tex] and values the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] is in the starting text.

We’ve now set up the three hashes describing our starting guess for the parameters of our translation model. Next, we need to understand how to iteratively update those guesses, hopefully improving their quality.

The first step in the iterative update is to use Map to process our translation corpus, producing as output a hash with keys [tex](E,F)[/tex], where [tex]E[/tex] and [tex]F[/tex] are corresponding English and French sentences in the corpus. To compute the corresponding value, we consider all possible alignments between the sentence pair, and compute probabilities for those alignments, using our initial estimate of the parameters above. The alignment [tex]a[/tex] with the highest probability is the corresponding value, so we are producing a hash with pairs [tex]((E,F):a)[/tex].
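Here’s a rough sketch of the computation each mapper would perform for a single sentence pair. To keep it short, it scores alignments using only the translation and distortion probabilities (ignoring fertilities), and it enumerates alignments by brute force, which is only feasible for short sentences; the dictionaries pr_translate and pr_distort are my names for the parameter hashes, not anything standardized.

from itertools import product

def best_alignment(english, french, pr_translate, pr_distort):
    """Return the most probable alignment of a (short) sentence pair.

    An alignment is a tuple a where a[t-1] is the 1-based position of the
    English word generating the French word at position t, or 0 for the
    null word.  pr_translate[(f, e)] and pr_distort[(t, s, l)] hold the
    current parameter estimates.
    """
    l = len(french)
    best, best_prob = None, 0.0
    for a in product(range(len(english) + 1), repeat=l):
        prob = 1.0
        for t, s in enumerate(a, start=1):
            e = english[s - 1] if s > 0 else "<null>"
            prob *= pr_translate.get((french[t - 1], e), 1e-9)
            if s > 0:
                prob *= pr_distort.get((t, s, l), 1e-9)
        if prob > best_prob:
            best, best_prob = a, prob
    return best, best_prob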

The second step in the iterative update is to use the sentence-pair hash [tex]((E,F):a)[/tex] and three MapReduce jobs to compute the updated estimates for the fertility, distortion and translation probabilities, [tex]\mbox{pr}(n|e), \mbox{pr}(t|s,l)[/tex] and [tex]\mbox{pr}(f|e)[/tex]. This completes the iterative update.

We now repeat this iterative update process many times. I’ll leave it as an exercise to figure out how you could use MapReduce to determine how much the parameters of the model are changing, per iteration, and what would be a reasonable criterion for terminating the iterations.

The translation step: heuristic search

The translation step is the final stage. It’s the stage at which we take our translation model, and apply it to translate a French document into English, sentence by sentence. Recall from the previous post that to translate the French sentence [tex]F[/tex] we search for the English sentence [tex]E[/tex] and alignment [tex]a[/tex] which maximizes the product

[tex] \mbox{pr}(F,a|E) \mbox{pr}(E). [/tex]

The idea of the heuristic search is to consider partial sentences and partial alignments, maintaining a stack of particularly promising candidates. We’ll set the stack size to be [tex]S[/tex], let’s say [tex]S = 10^6[/tex], so we’ll never consider more than a million possible candidate alignments and sentences.

The procedure involves three steps: (1) Construct all possible single-word extensions of existing partial sentences and partial alignments, and compute the corresponding product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]; (2) Choose the [tex]S[/tex] most likely of those, and discard the rest; and (3) Terminate when we have a complete alignment which is substantially more likely than any of the remaining partial alignments, otherwise repeat steps (1) and (2).

Step (1) can easily be done using a Map job. Step (2) can be done using multiple MapReduce steps. This is essentially a binary search to find the top [tex]S[/tex] values for the product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]. Step (3) can be done using the same ideas as step (2).
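A skeleton of that search, ignoring how the work is spread over the cluster and treating the extension, scoring, and completion tests as black boxes supplied by the rest of the system, might look something like this. The function names extend_hypotheses, score, and is_complete are placeholders of mine, not real APIs, and the termination threshold is arbitrary.

import heapq

def stack_search(french, extend_hypotheses, score, is_complete, stack_size=10**6):
    """Heuristic stack search for a translation of the French sentence `french`.

    Each hypothesis is a (partial English sentence, partial alignment) pair.
    extend_hypotheses yields all single-word extensions of a hypothesis,
    score computes pr(F,a|E) * pr(E), and is_complete tests whether the
    hypothesis accounts for the whole French sentence.
    """
    stack = [((), ())]  # start from the empty hypothesis
    while True:
        candidates = [h for hyp in stack for h in extend_hypotheses(hyp, french)]
        stack = heapq.nlargest(stack_size, candidates, key=score)
        complete = [h for h in stack if is_complete(h, french)]
        partial = [h for h in stack if not is_complete(h, french)]
        # "substantially more likely": the factor of 100 here is an arbitrary choice
        if complete and (not partial or score(complete[0]) > 100 * score(partial[0])):
            return complete[0]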

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

Interesting event in Waterloo

This might be of interest to people in Waterloo and nearby. It’s a day of talks about “The Economic Crisis and its Implications for The Science of Economics”. It’ll be held May 1, 2009, from 9am-5pm at the Perimeter Institute; registration is apparently required. See the website for more. The lineup of speakers is interesting:
 

  1. Nouriel Roubini, New York University
  2. Emanuel Derman, Columbia University
  3. Brian Arthur, Santa Fe Institute and PARC
  4. Andrew Lo, MIT
  5. Richard Alexander, University of Michigan
  6. Eric Weinstein, Natron Group

Biweekly links for 03/23/2009

Click here for all of my del.icio.us bookmarks.


The Polymath project: scope of participation

As I’ve mentioned before, over the past seven weeks mathematician Tim Gowers has been running a remarkable experiment in how mathematics is done, a project he dubbed the Polymath1 project. Using principles similar to those employed in open source programming projects, he used blogs and a wiki to organize an open mathematical collaboration attempting to find a new proof of an important mathematical theorem known as the density Hales-Jewett (DHJ) theorem. The problem was a tough one. Gowers, an accomplished professional mathematician, initially thought that the chances of success were “substantially less than 100%”, even adopting a quite modest criterion for success.

Last week, Gowers announced that the problem was “probably solved”. In fact, if the results hold up, the project has exceeded expectations. The original goal was to find a new proof of an important special case of the DHJ theorem using a particular approach Gowers suggested (or explain why that approach failed). This goal broadened over time, and the project appears to have found a new proof of the full theorem, using an approach different to that Gowers originally proposed. A writeup is in progress. Inspired by the work of Polymath1, mathematician Tim Austin has released a preprint claiming another proof of DHJ, and citing Polymath1 as crucial to his work.

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

The original hope was that the project would be a “massive collaboration”. Let’s suppose we take the number above (23) as representative of the number of people who made notable mathematical contributions, bearing in mind that there are obviously substantial limitations to using the timeline in this way. (The timeline contains some pointers to notable general comments, which I have not included in this count.) It’s certainly true that 23 people is a very large number for a mathematical collaboration – a few days into the project, Tim Gowers remarked that “this process is to normal research as driving is to pushing a car” – but it also falls well short of mass collaborations such as Linux and Wikipedia. Gowers has remarked that “I thought that there would be dozens of contributors, but instead the number settled down to a handful, all of whom I knew personally”.

These numbers take on a different flavour, however, when you note that the number of people involved compares favourably even to very successful open source collaborations, at the seven-week mark. 7 weeks after the inception of Wikipedia it had approximately 150 articles. This is considerably more than the Polymath1 wiki, but keep in mind that (a) the Polymath1 wiki is only a small part of the overall project; and (b) I doubt anyone would disagree that the average quality on the Polymath1 wiki is considerably higher. Similarly, while Linux has now received contributions from several thousand people, it took years to build that community. 6 months after Linus Torvalds first announced Linux there were 196 people on the Linux activists mailing list, but most were merely watching. Many had not even installed Linux (at the time, a tricky operation), much less contributed substantively. I’m willing to bet more than 196 people were following Polymath1.

A great deal can be said about scaling up future projects. I believe this can be done, and that there are potentially substantial further benefits. For now, I’ll just make one remark. Long-lived open-source collaborations sometimes start with narrowly focused goals, but they typically broaden over time, and become more open-ended, allowing the community of contributors to continue to grow. That’s certainly true of Linux, whose goal – the construction of an operating system kernel – is extremely broad. At the same time, that broad goal naturally gives rise to many focused and to some extent independent problems, which can be tackled in parallel by the development community. It may be possible to broaden Polymath1’s goals in a natural way at this point, but it seems like an interesting challenge to do so while retaining the sharp problem-oriented focus that characterized the collaboration.

Biweekly links for 03/20/2009

  • Bad Correspondent: Neal Stephenson
    • “The quality of my e-mails and public speaking is, in my view, nowhere near that of my novels. So for me it comes down to the following choice: I can distribute material of bad-to-mediocre quality to a small number of people, or I can distribute material of higher quality to more people. But I can’t do both; the first one obliterates the second.”
  • Attacked from Within || kuro5hin.org
    • An essay about online community. Over-the-top and overall not convincing, but stimulating and insightful in places.
  • Australian Government adds Wikileaks to banned website list | News | TechRadar UK
  • Caveat Lector » A post-Roach-Motel world
    • Dorothea Salo, musing about the future of institutional repositories.
  • To share or not to share: Publication and quality assurance of research data outputs. A report commissioned by the Research Information Network – ECS EPrints Repository
    • “A study on current practices with respect to data creation, use, sharing and publication in eight research disciplines (systems biology, genomics, astronomy, chemical crystallography, rural economy and land use, classics, climate science and social and public health science). The study looked at data creation and care, motivations for sharing data, discovery, access and usability of datasets and quality assurance of data in each discipline.”
  • Jody Williams – Wikipedia, the free encyclopedia
    • Jody Williams won the Nobel Peace Prize for her work on the landmine treaty. She has some interesting things to say about group-building: “Imagine trying to get hundreds of organizations – each one independent and working on many, many issues – to feel that each is a critical element of the development of a new movement. I wanted each to feel that what they had to say about campaign planning, thinking, programs, actions was important. So, instead of sending letters, I’d send everyone faxes. People got in the habit of faxing back. This served two purposes – people would really have to think about what they were committing to doing before writing it down, and we have a permanent, written record of almost everything in the development of the campaign from day one. “
  • Open Source Cinema – An Open Source Documentary Film about Copyright
  • picoup:
    • Like twitter, but with an 18 character limit. Amusing. Presumably this is pico-blogging.
  • Demographics, Career Concerns or Social Comparison: Who Games SSRN Download Counts? by Benjamin Edelman, Ian Larkin
    • A study of gaming of online metrics: “We use a unique database of every SSRN paper download over the course of seven years, along with detailed resume data on a random sample of SSRN authors, to examine the role of demographic factors, career concerns, and social comparisons on the commission of a particular type of gaming: the self-downloading of an author’s own SSRN working paper solely to inflate the paper’s reported download count. We find significant evidence that authors are more likely to inflate their papers’ download counts when a higher count greatly improves the visibility of a paper on the SSRN network. We also find limited evidence of gaming due to demographic factors and career concerns, and strong evidence of gaming driven by social comparisons with various peer groups. These results indicate the importance of including psychological factors in the study of deceptive behavior. “
  • The International Campaign to Ban Landmines – A Model for Disarmament Initiatives
    • Very interesting look at how the ICBL worked, and how it helped achieve the landmine treaty.
  • Translating “The Economist” Behind China’s Great Firewall – Waxy.org
    • “a group of dedicated fans of The Economist newsmagazine are translating each weekly issue cover-to-cover, splitting up the work among a team of volunteers, and redistributing the finished translations as complete PDFs for a Chinese audience. “
  • Interview with Clay Shirky, Part I : Columbia Journalism Review
  • A Wiki for the Planet: Clay Shirky on Open Source Environmentalism | Wired Science from Wired.com
    • “Wired.com: What sort of online tools would enable that kind of collaboration? What’s missing?
      Shirky: What’s missing is there’s no license. There’s no equivalent of the GPL [GNU General Public License]. There’s been some tools for collaborative production of thinking. Anything from mailing lists to Wikis. What we don’t have is tools for acting once a decision has been taken. In Linux the final action is the compile step. In Wikipedia the action is the latest edit.

      Now we need to turn around and do X out in the world. I don’t think that there’s anything digital that we could do that would solve this gap. I think the gap is produced by the difficulty of translating thought into action. I think the kind of things that help people turn thoughts into action are much more about social and legal structures. “

Click here for all of my del.icio.us bookmarks.


wiki_tex.py

I’ve written a simple python script called wiki_tex.py to help convert LaTeX to the wiki-tex used on MediaWikis like the Polymath1 wiki, and Wikipedia. It’s a stripped-down port of a similar script I use to put LaTeX on my blog – the original script is heavily customized, which is why I haven’t made it generally available.

A zipped version of the script can be downloaded here.

To use the script, put a copy of the script in the directory you’re writing LaTeX in. You also need to make sure Python is installed on your machine. Mac and most Linuxes have it pre-installed. Windows doesn’t, but you can get a one-click installer at the python site. I wrote the script under Python 2.5, but would be surprised if it doesn’t work under Python 2.x. I don’t know how it runs under the recently released Python 3.

To run the script, you need a command line prompt, with the working directory set to whatever directory you’re writing LaTeX in. Once in the right directory, just run:

python wiki_tex.py filename

where filename.tex is the file you’re texing. Note the omission of the .tex suffix is intentional. The script will extract everything between \begin{document} and \end{document}, convert to wiki-tex, and output a file filename0.wiki which you can cut and paste into the MediaWiki.

As a convenience, if you put lines of the form “%#break” into the file (these are comments, so LaTeX ignores them) then the script breaks up the resulting output into multiple files, filename0.wiki, filename1.wiki, and so on. This is useful for compiling smaller snippets of LaTeX into wiki-tex.
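I won’t reproduce the script here, but the core extract-and-split step it performs is straightforward. The following is just a toy illustration of that step, not wiki_tex.py itself.

def extract_and_split(tex_source):
    """Return the chunks of a LaTeX document body, split at %#break comments."""
    body = tex_source.split(r"\begin{document}", 1)[1]
    body = body.split(r"\end{document}", 1)[0]
    return [chunk.strip() for chunk in body.split("%#break")]

# Each chunk would then be run through the LaTeX-to-wiki-tex substitutions
# and written out as filename0.wiki, filename1.wiki, and so on.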

The script is very basic. There are many conversions it won’t handle, but it copes with quite a bit of basic stuff just fine. I’d suggest starting with some small examples, and gradually working up in complexity.

The script is also not very battle-tested. It’s undergone a little testing, but this is well and truly alpha software. Let me know in comments if you find bugs, and I’ll try to fix them.


How changing the technology of collaboration can change the nature of collaboration

Update: Ilya (who made the video) reports in comments that some fraction of the effect described below is an artifact. It’s hard to say how much. Here’s Ilya’s comment:

Michael, apparently one of the reasons you see the explosion in commits is because Git correctly attributes the changeset to the author. In [Subversion] days, the patch would be submitted by some author, but then one of the core team members would merge it in (under his/her name). Basically, same workflow with Git, but with proper attribution.

Having said that, I think seeing other people commit and get their changes merged in also encourages other developers to join in on the fray!

So it may or may not be that what’s said in the post is true. But the video shown isn’t evidence for it. A pity. It’d be nice to have a clearly visualizable demonstration of this general phenomenon.

Ilya Grigorik recently pointed me to a great example which shows vividly how even relatively modest changes in the technology of collaboration can change the nature of collaboration. The example is from an open source project called Ruby on Rails. Ruby on Rails is a web development framework famous within the web development community – it’s been used to develop well-known sites such as twitter – but, unlike, say, Linux, it’s largely unknown outside its own community. The original developer of Ruby on Rails is a programmer named David Heinemeier Hansson who for a long time worked on the framework on his own, before other people gradually began to join him.

The short video below shows the history of the collaboration graphically – what you see are pieces of code being virtually shuttled backward and forward between different contributors to the collaboration. There’s no need to watch the whole video, although it’s fun to do so: in the bottom right of the video you’ll see a date ticking over, and you can simply fast forward to January 2008, and watch until June 2008. Here’s the video:

(Edit: It’s better in high definition at Vimeo. As it is, it’s hard to see the dates – the relevant part of the video is roughly from 4:00 to 5:30.)

What you see, very vividly, is that in April 2008, a qualitative change occurs in the collaboration. Before April, you see a relatively slow and stately development process. After April, that process explodes with vastly more contributors and activity. What happened in April was this: the Ruby on Rails developers changed the tool they used to share code. Before April they used a tool called Subversion. In April of 2008 they switched to a new tool called Git (managed through Github). As changes go, this was similar to a group of writers changing the wiki software they use to collaborate on a shared set of documents. What’s interesting is that the effect on the collaboration was so dramatic, out of proportion to our everyday experience; it’s almost as though Ernest Hemingway had gotten a qualitative improvement in his writing by changing the type of pen he used to write.

I won’t say much here about the details of what caused the change. Briefly, Git and Github are a lot more social than Subversion, making it easier for people to go off and experiment with code on their own, to merge useful changes back in, and to track the activity of other people. Git was, in fact, developed by Linus Torvalds, to help Linux development scale better.

The background to all this is that I’ve been collecting some thoughts about the ongoing Polymath project, an experiment in open source mathematics, and the question of how projects like Polymath can be scaled up further. I’ll have more to say about that in future posts, but for now it seemed worth describing this striking example of how changes in technology can result in changes in the nature of collaboration.

Biweekly links for 03/16/2009

Click here for all of my del.icio.us bookmarks.


Biweekly links for 03/13/2009

  • Polymath1 and open collaborative mathematics « Gowers’s Weblog
    • “… for me personally this has been one of the most exciting six weeks of my mathematical life. That is partly because it is always exciting to solve a problem, but a much more important reason is the way this problem was solved, with people chipping in with their thoughts, provoking other people to have other thoughts (sometimes almost accidentally, and sometimes more logically), and ideas gradually emerging as a result. Incidentally, many of these ideas are still to be properly explored: at some point the main collaboration will probably be declared to be over (though I suppose in theory it could just go on and on, since its seems almost impossible to clear up every interesting question that emerges) and then I hope that the comments will be a useful resource…

      The sheer speed at which all this happened contributed to the excitement. In my own case it led to my becoming fairly obsessed with the project and working on it to the exclusion of almost everything else… “

  • Twenty years of the world wide web | The Economist
    • “Science inspired the world wide web. Two decades on, the web has repaid the compliment by changing science”
  • Get our full university data | guardian.co.uk
  • Paul Graham: Be Relentlessly Resourceful
    • One of my favourites of pg’s essays: “A couple days ago I finally got being a good startup founder down to two words: relentlessly resourceful.”

Click here for all of my del.icio.us bookmarks.


Introduction to Statistical Machine Translation

How do computers translate texts from one language to another? Human translators use a great deal of detailed knowledge about how the world works to correctly translate all the different meanings the same word or phrase can have in different contexts. This makes automated translation seem like it might be a hard problem to solve, the kind of problem that may require genuine artificial intelligence. Yet although translators like Google Translate and Yahoo!’s Babel Fish are far from perfect, they do a surprisingly good job. How is that possible?

In this post, I describe the basic ideas behind the most successful approach to automated machine translation, an approach known as statistical machine translation. Statistical machine translation starts with a very large data set of good translations, that is, a corpus of texts (e.g., United Nations documents) which have already been translated into multiple languages, and then uses those texts to automatically infer a statistical model of translation. That statistical model is then applied to new texts to make a guess as to a reasonable translation.

I’m not an expert in statistical machine translation. I’ve written these notes to help me internalize the basic ideas of the area. However, I hope the notes are useful to others as an introductory overview of statistical machine translation, and as a starting place to learn more.

Formulating the problem

Imagine you’re given a French text, [tex]f[/tex], and you’d like to find a good English translation, [tex]e[/tex]. There are many possible translations of [tex]f[/tex] into English, of course, and different translators will have different opinions about what the best translation, [tex]e[/tex], is. We can model these differences of opinion with a probability distribution [tex]\mbox{pr}(e|f)[/tex] over possible translations, [tex]e[/tex], given that the French text was [tex]f[/tex]. A reasonable way of choosing the “best” translation is to choose [tex]e[/tex] which maximizes the conditional probability [tex]\mbox{pr}(e|f)[/tex].

The problem with this strategy is that we don’t know the conditional probability [tex]\mbox{pr}(e|f)[/tex]. To solve this problem, suppose we’re in possession of an initial corpus of documents that are in both French and English, e.g., United Nations documents, or the Canadian parliamentary proceedings. We’ll use that corpus to infer a model estimating the conditional probabilities [tex]\mbox{pr}(e|f)[/tex]. The model we’ll construct is far from perfect, but with a large and high quality initial corpus, yields pretty good translations. To simplify the discussion we assume [tex]e[/tex] and [tex]f[/tex] are single sentences, and we’ll ignore punctuation; obviously, the translator can be applied serially to a text containing many sentences.

Now, how do we start from the corpus and infer a model for [tex]\mbox{pr}(e|f)[/tex]? The standard approach is to use Bayes’ theorem to first rewrite [tex]\mbox{pr}(e|f)[/tex] as

[tex] \mbox{pr}(e|f) = \frac{\mbox{pr}(f|e) \mbox{pr}(e)}{\mbox{pr}(f)}. [/tex]

Because [tex]f[/tex] is fixed, the maximization over [tex]e[/tex] is thus equivalent to maximizing

[tex] \mbox{pr}(f|e) \mbox{pr}(e). [/tex]

What we’re going to do is to use our data set to infer models of [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex], and then use those models to search for [tex]e[/tex] maximizing [tex]\mbox{pr}(f|e) \mbox{pr}(e)[/tex].

The procedure outlined in the last paragraph seems a priori like a strange thing to do. Why not save time and difficulty by inferring a model of [tex]\mbox{pr}(e|f)[/tex] and then maximizing directly? It turns out that the procedure I described actually works quite a bit better. Now, to be blunt, I don’t entirely understand the explanations of this I’ve read in the published literature. But what I think is going on is as follows. It turns out that if we work directly with [tex]\mbox{pr}(e|f)[/tex], then many malformed English sentences [tex]e[/tex] are given a high probability by our model. You’ll see why this is the case later – it’s obvious from the construction – but for now just accept that’s the case. If we worked directly with [tex]\mbox{pr}(e|f)[/tex] those sentences would often be issued as “translations”, which is no good. Now, why should maximizing [tex]\mbox{pr}(f|e) \mbox{pr}(e)[/tex] work any better? After all, both [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex] may give high probabilistic weight to sentences [tex]e[/tex] which are not well-formed. So what gives? I think there are two things going on. First, although [tex]\mbox{pr}(f|e)[/tex] and [tex]\mbox{pr}(e)[/tex] may give high weight to malformed sentences [tex]e[/tex], the models are constructed very differently. For both to give high weight simultaneously, it’s quite unlikely that [tex]e[/tex] is malformed. Second, we can build grammatical checks directly into the construction of [tex]\mbox{pr}(e)[/tex], so very few malformed sentences have high values for [tex]\mbox{pr}(e)[/tex]. We won’t actually do that in this post, but it’s relatively straightforward to do.

We’ve broken machine translation up into three problems: (1) build a language model which allows us to estimate [tex]\mbox{pr}(e)[/tex]; (2) build a translation model which allows us to estimate [tex]\mbox{pr}(f|e)[/tex]; and (3) search for [tex]e[/tex] maximizing the product [tex]\mbox{pr}(f|e)\mbox{pr}(e)[/tex]. Each of these problems is itself a rich problem which can be solved in many different ways. In the next three sections I’ll describe simple approaches to solving each of the three problems.

The language model

Suppose we break an English sentence [tex]e[/tex] up into words [tex]e = e_1e_2\ldots e_m[/tex]. Then we can write the probability for [tex]e[/tex] as a product of conditional probabilities:

[tex] \mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) [/tex]

So, for example, in the sentence fragment “To be or not to…”, you would probably assign a very high conditional probability to the final word being “be”, certainly much higher than the probability of its occurrence at a random point in a piece of text.

The challenge in building a good language model [tex]\mbox{pr}(e)[/tex] is that there are so many distinct conditional probabilities that need to be estimated. To simplify the problem we’ll make some assumptions about the form of the probabilities. The most drastic assumption we could make is to assume that the probability of seeing a word is independent of what came before it, i.e.,

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j), [/tex]

so

[tex] \mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j). [/tex]

We could estimate the probabilities [tex]\mbox{pr}(e_j)[/tex] by taking a very large corpus of English text, and counting words. The problem is that this model is not very realistic. A more realistic model is the bigram model, which assumes that the probability of a word occurring depends only on the word immediately before it:

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-1}). [/tex]

More realistic still is the trigram model, which assumes that the probability of a word occurring depends only on the two words immediately before it:

[tex] \mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-2},e_{j-1}). [/tex]

To make the discussion which follows as concrete as possible, let’s concentrate on the problem of how to estimate conditional probabilities in the bigram model. Similar remarks can be made for the trigram model. The obvious way to proceed is to take a large corpus of English text, count the number of occurrences [tex]\#(e_1,e_2)[/tex] of a particular word pair in that corpus, and then set [tex]\mbox{pr}(e_2|e_1) = \#(e_1,e_2)/\#(e_1)[/tex].
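Concretely, the counting procedure amounts to a few lines of Python; the maximum-likelihood estimate [tex]\mbox{pr}(e_2|e_1) = \#(e_1,e_2)/\#(e_1)[/tex] falls straight out of two counters. This is just a sketch on a toy corpus.

from collections import Counter

def bigram_estimates(corpus_text):
    """Return maximum-likelihood estimates pr(e2|e1) = #(e1,e2)/#(e1)."""
    words = corpus_text.lower().split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {(e1, e2): count / unigrams[e1] for (e1, e2), count in bigrams.items()}

model = bigram_estimates("to be or not to be that is the question")
print(model[("to", "be")])  # 1.0 -- in this tiny corpus "to" is always followed by "be"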

The problem with this procedure is that there are a lot of bigrams. If we assume that there are 50,000 words in English, say, then there are 2.5 billion possible bigrams. Even if you take a large corpus of training data (say, a billion words), it’s reasonably likely that there will be some bigrams which don’t appear in your corpus, and thus are assigned zero probability, yet which you would a priori like to appear in translations of some French sentences. That is, this kind of training procedure is likely to underestimate the probability of bigrams which don’t appear in the training set, and overestimate the probability of those which do. The problem is even worse for trigrams.

There’s no obvious best solution to this problem. Many different ad hoc solutions have been tried, and my quick survey of the literature suggests that there’s as yet no broad agreement about the best way of solving the problem. To give you the flavour of the solutions people use, though, let me describe two basic approaches, adapted from the survey in section 2.3 of Philip Clarkson’s PhD thesis.

The first approach is to move away from a pure bigram model, and instead to use linear interpolation between the monogram and bigram models. A large and diverse enough corpus of text is likely to give pretty good estimates for nearly all single-word probabilities [tex]\mbox{pr}(e_1)[/tex]. So one way of estimating [tex]\mbox{pr}(e_2|e_1)[/tex] is as:

[tex] \mbox{pr}(e_2|e_1) = \lambda \frac{\#(e_2)}{N} + (1-\lambda) \frac{\#(e_1,e_2)}{\#(e_1)}, [/tex]

where [tex]N[/tex] is the total number of words in the corpus. [tex]\lambda[/tex] is a parameter in the range [tex]0 < \lambda < 1[/tex] which needs to be determined. This can be done using a second corpus of text, and setting [tex]\lambda[/tex] so that the average probability of the bigrams in that corpus is maximized.

A second approach is to apply a discount factor to the conditional probabilities for the bigrams which appear in the training corpus, essentially reducing their probability. The easiest way to proceed is simply to multiply them all by some constant [tex]c[/tex] in the range [tex]0 < c < 1[/tex], and then to spread the remaining probability [tex]1-c[/tex] uniformly over all bigrams [tex](e_1,e_2)[/tex] which don’t appear in the corpus.
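As a quick sketch of the first (interpolation) approach, the smoothed estimate can be computed directly from the two sets of counts. The value of lambda below is just a placeholder; as described above, it would really be tuned on a second corpus.

from collections import Counter

words = "the cat sat on the mat".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
N = len(words)

def interpolated_bigram(e1, e2, lam=0.3):
    """pr(e2|e1) = lam * #(e2)/N + (1 - lam) * #(e1,e2)/#(e1)."""
    unigram_part = unigrams[e2] / N
    bigram_part = bigrams[(e1, e2)] / unigrams[e1] if unigrams[e1] else 0.0
    return lam * unigram_part + (1 - lam) * bigram_part

print(interpolated_bigram("the", "cat"))  # seen bigram: dominated by the bigram counts
print(interpolated_bigram("cat", "the"))  # unseen bigram: still non-zero, thanks to the unigram term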

Exercises

  • The constant [tex]c[/tex] can be viewed as an estimate of the fraction of all English bigrams that appear in the training corpus. How might you go about obtaining an estimate for [tex]c[/tex]?

To conclude this section, let me mention that the problem of building a good language model is also of great interest to people working on speech recognition software. In fact, much of the research literature on language models can be applied equally to either speech recognition or statistical machine translation.

The translation model

Note: English is my only language, which makes it hard for me to construct translation examples! All my examples are taken from a paper by Brown et al.

In this section we’ll construct a simple translation model allowing us to compute [tex]\mbox{pr}(f|e)[/tex]. Intuitively, when we translate a sentence, words in the source text generate (possibly in a context-dependent way) words in the target language. In the sentence pair (Jean aime Marie | John loves Mary) we intuitively feel that John corresponds to Jean, loves to aime, and Mary to Marie. Of course, there is no need for the word correspondence to be one-to-one, nor for the ordering of words to be preserved. Sometimes, a word in English may generate two or more words in French; sometimes it may generate no word at all.

Despite these complications, the notion of a correspondence between words in the source language and in the target language is so useful that we’ll formalize it through what is called an alignment. Rather than give a precise definition, let me explain alignments through an example, the sentence pair (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)). In this example alignment, the numbers in parentheses tell us that John corresponds to the 6th word in the French sentence, i.e., Jean. The word does has no trailing parentheses, and so doesn’t correspond to any of the words in the French sentence. However, beat corresponds to not one, but two separate words in the French sentence, the 3rd and 4th words, est and battu. And so on.

Two notions derived from alignments are particularly useful in building up our translation model. The first is fertility, defined as the number of French words generated by a given English word. So, in the example above, does has fertility [tex]0[/tex], since it doesn’t generate anything in the French sentence. On the other hand, beat has fertility [tex]2[/tex], since it generates two separate words in the French sentence.

The second notion is distortion. In many sentences, the English word and its corresponding French word or words appear in the same part of the sentence – near the beginning, perhaps, or near the end. We say that such words are translated roughly undistorted, while words which move a great deal have high distortion. We’ll encode this notion more formally shortly.

We’ll build up our translation model using some simple parameters related to fertility and distortion:

  • The fertility probability [tex]\mbox{pr}(n|e)[/tex], the probability that the English word [tex]e[/tex] has fertility [tex]n[/tex].
  • The distortion probability [tex]\mbox{pr}(t|s,l)[/tex], which is the probability that an English word at position [tex]s[/tex] corresponds to a French word at position [tex]t[/tex] in a French sentence of length [tex]l[/tex].
  • The translation probability [tex]\mbox{pr}(f|e)[/tex], one for each French word [tex]f[/tex] and English word [tex]e[/tex]. This should not be confused with the case when [tex]f[/tex] and [tex]e[/tex] are sentences!

I’ve described the translation model as a way of computing [tex]\mbox{pr}(f|e)[/tex], where [tex]e[/tex] is an English sentence, and [tex]f[/tex] is a French sentence. In fact, we’re going to modify that definition a little, defining the translation model as the probability [tex]\mbox{pr}(f,a|e)[/tex] that the French sentence [tex]f[/tex] is the correct translation of [tex]e[/tex], with a particular alignment, which we’ll denote by [tex]a[/tex]. I’ll return to the question of how this change in definition affects translation in the next section. Rather than specify the probability for our translation model formally, I’ll simply show how it works for the example alignment (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)):

[tex]\mbox{pr}(1|John) \times \mbox{pr}(Jean|John) \times \mbox{pr}(6|1,6) \times [/tex]

[tex]\mbox{pr}(0|does) \times [/tex]

[tex]\mbox{pr}(2|beat) \times \mbox{pr}(est|beat) \times \mbox{pr}(3|3,6) \times \mbox{pr}(battu|beat) \times \mbox{pr}(4|3,6) \times [/tex]

[tex]\mbox{pr}(1|the) \times \mbox{pr}(Le|the) \times \mbox{pr}(1|4,6) \times [/tex]

[tex]\mbox{pr}(1|dog) \times \mbox{pr}(chien|dog) \times \mbox{pr}(2|5,6) \times [/tex]

[tex]\mbox{pr}(1|null) \times \mbox{pr}(par|null) [/tex]

This should be self-explanatory except for the final line, for the French word par. This word has no corresponding word in the English sentence, which we model as being generated by a special null word.

What remains to be done is to estimate the parameters used in constructing the translation model – the fertility, distortion and translation probabilities. I’ll briefly outline an iterative procedure that can be used to do this estimation. It starts with a simple guess of the parameters. E.g., we might guess that the distortion probabilities [tex]\mbox{pr}(t|s,l) = 1/l[/tex] are uniform across the sentence. Similar guesses could be made for the other parameters. For each pair [tex](e,f)[/tex] of sentences in our corpus we can use this guess to compute the probability for all possible alignments between the sentences. We then estimate the “true” alignment to be whichever alignment has highest probability. Applying this procedure to the entire corpus gives us many estimated alignments. We can then use those estimated alignments to compute new estimates for all the parameters in our model. E.g., if we find that [tex]1/10[/tex]th of the alignments of sentences of length [tex]25[/tex] have the first word mapped to the first word then our new estimate for [tex]\mbox{pr}(1|1,25) = 1/10[/tex]. This gives us a procedure for iteratively updating the parameters of our model, which can be repeated many times. Empirically we find (and it can be proved) that the parameters gradually converge to a fixed point.
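Here’s a compact sketch of one round of that procedure for the translation probabilities alone; the fertility and distortion updates follow the same count-and-normalize pattern. The best_alignment helper, which picks the highest-probability alignment under the current parameter estimates, is assumed to be supplied, and none of this is the exact procedure used by the IBM models; it’s just meant to make the iteration concrete.

from collections import Counter

def update_translation_probs(sentence_pairs, params, best_alignment):
    """One iteration: align each pair under the current params, then re-count.

    sentence_pairs is a list of (english_words, french_words) pairs, and
    best_alignment(english, french, params) returns, for each French position
    (1-based), the position of the English word generating it, or 0 for the
    null word.
    """
    counts = Counter()
    totals = Counter()
    for english, french in sentence_pairs:
        alignment = best_alignment(english, french, params)
        for t, s in enumerate(alignment, start=1):
            e = english[s - 1] if s > 0 else "<null>"
            counts[(french[t - 1], e)] += 1
            totals[e] += 1
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}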

Exercises

  • A problem I’ve glossed over above is figuring out which sentences in the English corpus correspond to which sentences in the French corpus. How could you figure out which sentences correspond to which? What would be the effect of mistakes?

Searching for the maximizing [tex]e[/tex]

In my introductory remarks I explained how translation from French to English can be viewed as the problem of finding [tex]e[/tex] which maximizes [tex]\mbox{pr}(e|f)[/tex], and that this was equivalent to maximizing [tex]\mbox{pr}(f|e)\mbox{pr}(e)[/tex]. Let’s revisit the reasoning which led to that framing of the problem, and recast it in a form more amenable to our translation model based on alignments. Instead of searching for [tex]e[/tex] which maximizes [tex]\mbox{pr}(e|f)[/tex], let’s instead search for a pair [tex](e,a)[/tex] which maximizes [tex]\mbox{pr}(e,a|f)[/tex]. Using Bayes’ theorem this is equivalent to finding [tex](e,a)[/tex] which maximizes

[tex] \mbox{pr}(e,a|f) = \frac{\mbox{pr}(f,a|e) \mbox{pr}(e)} {\mbox{pr}(f)}. [/tex]

As earlier, we can omit the factor [tex]\mbox{pr}(f)[/tex] in the denominator, and so our search is equivalent to finding a pair [tex](e,a)[/tex] which maximizes

[tex] \mbox{pr}(f,a|e) \mbox{pr}(e) [/tex]

Our translation problem is to find [tex](e,a)[/tex] maximizing this expression. Unfortunately, there are far too many possibilities to do an exhaustive search. But there are some heuristics which work pretty well. One is to use a greedy algorithm which gradually builds up partial translations and alignments. Consider the following partial translations and alignments:

(Jean aime Marie | John (1) *)

(Jean aime Marie | * loves (2) *)

(Jean aime Marie | Petroleum (1) *)

Using the translation and language models described earlier we can compute the probabilities for each of these partial alignments, and so compute the product [tex]\mbox{pr}(f,a|e) \mbox{pr}(e)[/tex]. It would be very surprising if the first two candidates weren’t assigned high probabilities, and the third a very low probability. Our heuristic for search is thus to maintain a stack of promising partial translations and alignments, and to continually extend those partial translations and alignments until we find a complete translation and alignment with a significantly higher probability than any of the partial alignments.

Commentary

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data. Worry about the data first before you worry about the algorithm.

Peter Norvig, Director of Research at Google, as reported by Greg Linden.

That completes our description of a simple statistical machine translation system. The system I’ve described is essentially the same as the system described in 1990 by Brown et al at IBM research, a system which translated roughly half of all sentences in an acceptable way! Personally, I find it remarkable that such simple ideas can be used to make so much progress on what appears to be an enormously difficult problem. (I’ll describe other examples in future posts.) Furthermore, although the basic ideas remain the same, many improvements to these ideas have been made since 1990. The biggest single advance seems to have been a movement away from words as the unit of language, and towards phrase-based models, which give greatly improved performance.

If you’re interested in understanding more, I strongly suggest checking out an accessible and very influential 1993 follow-up paper from the IBM group, which describes a whole plethora of translation models. I don’t have a single canonical reference to give on phrase-based statistical translation models, but found several of the top hits at Google Scholar quite informative. Finally, and just for fun, I recommend Peter Norvig’s excellent short introduction to automatic spelling correction, which uses similar ideas in a much (!) simpler setting, but has the big advantage that it is illustrated by real code.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.
