Implementing Statistical Machine Translation Using MapReduce

This post is part of my series on the Google Technology Stack. It’s something of a digression, and is optional (but fun, and worth reading!) We’ll return to the main line of development in the next post.

This post is a followup to an earlier post about the beautiful subject of statistical machine translation. In the earlier post I described how to build a statistical model that can be used to automatically translate texts from one language to another. The ideas are beautiful, but require a lot of computing power – the more computing power, the better the translation. In this post I describe a way of implementing statistical machine translation in parallel on a large cluster of machines. The post assumes: (1) that you’re comfortable with my previous post on statistical machine translation; and (2) that you’re comfortable with the MapReduce framework for distributed computing.

Before going any further, a warning: I haven’t implemented the system described here. I’ve no idea whether Google does machine translation with a MapReduce implementation, although Google Translate definitely uses statistical machine translation. It’s possible there’s a bottleneck in my implementation that I’m not noticing. Ideally, we should carefully analyse the resources used in my proposed implementation, and, if that reveals no problems, build a working system and test how well it scales in practice. I haven’t done any of that. Still, putting statistical machine translation into the MapReduce framework is a great way of deepening our understanding of both sets of ideas, and I’m pretty confident that many of these ideas could be used in a production machine translation system.

With that said, it should be noted that the simple statistical machine translation model described in the previous post can actually be implemented on a single machine. However, as discussed in the earlier post, it’s easy to tweak that model so it performs far better, at the cost of making it too complex to run on a single machine. The ideas described in this post are easily adapted to many of those tweaked models of machine translation.

One final point before we jump into the meat of the post. Earlier posts in this series have been pretty detailed. In this post I’ve glossed over some details so as not to obscure the forest with too many trees – there are only so many Map and MapReduce jobs you can read about before your eyes start to glaze over. But the missing details should be easy for the reader to fill in, and I’ve tried to indicate where they’re needed.

Overview: what our system needs to do

Recall the basic elements of statistical machine translation. It’s split into two main phases. The first phase is the training phase, in which a statistical model of translation is built, using a corpus of texts in both the source and target language (French and English, for the sake of concreteness). The training phase is itself split into three parts: (i) document collection, where we assemble the corpus of texts from which the statistical model will be inferred; (ii) building the language model for the target language; and (iii) building the translation model from the target language to the source language (yes, that’s the right way round, see the earlier post!) The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. It’s this phase which is actually used directly by the end user – the training phase all happens offline, beforehand.

Altogether, that makes four separate procedures we need to implement. We’ll go through them one by one in the remainder of this post. Before doing that, though, there is one idea implicit in my MapReduce post that needs to be made more explicit. This is the idea of a distributed hash of key-value pairs. Given a set of keys and corresponding values [tex](k_1:v_1),(k_2:v_2),\ldots[/tex], a crucial part of MapReduce is distributing the key-value pairs evenly over all computers in the cluster.

Here’s a simple way this may be achieved. Suppose there are [tex]n[/tex] computers in the cluster. Number them [tex]0,\ldots,n-1[/tex]. Let [tex]\#(\cdot)[/tex] be a hash function. An unfortunate fact is that this use of the term “hash” is completely different from the usage of “hash” to denote a set of key-value pairs. In practice, this isn’t as confusing as you might think, but you should note the difference. In any case, the way we set up our distributed hash is to store the key-value pair [tex](k:v)[/tex] on computer number [tex]\#(k) \mod n[/tex]. Because the hash function distributes keys in a roughly uniform way, the result is to spread the key-value pairs through the cluster in a roughly uniform way.
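Here’s a rough Python sketch of this assignment of keys to machines. The cluster size of 16, the choice of md5 as the hash function, and the example URLs are arbitrary, purely for illustration:

    import hashlib

    N_MACHINES = 16  # illustrative cluster size

    def machine_for_key(key):
        # Hash the key, then reduce modulo the number of machines.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % N_MACHINES

    # Each key is assigned to one of the machines, roughly uniformly.
    for url in ["http://example.com/a", "http://example.com/b"]:
        print(url, "->", machine_for_key(url))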

In a practical implementation, this approach to storing a distributed hash has some drawbacks, which we’ll come back to in later posts about the Google File System and BigTable. But for now, you can think of a distributed hash in this way.

Document collection

There are many possible ways we could be presented with the documents we’ll use to build our statistical model of translation. To make things concrete, and to illustrate the general ideas used to manage the distribution of documents across the cluster, let’s imagine we’re presented with them as a file containing several million URLs, with each URL pointing to an English language text which we’ll use to build a language model. Alternately, the file might contain URLs pointing to pairs of translated texts, in English and in French, which we’ll use to build the translation model. The same basic ideas apply in both cases, so to be specific, I’ll suppose the URLs point to English-only texts.

We use the URLs as keys in a distributed hash. Assuming that they start on a single machine, we first need to distribute the keys across the cluster, and then initialize the corresponding values to empty. The result is a distributed hash with key-value pairs (URL:empty). We now use that distributed hash as input to a Map job, which applies a mapper to (URL:empty), downloading the webpage for the URL, and storing the resulting document as the value corresponding to the key. The resulting distributed hash has key-value pairs (URL:document). We’ll use that distributed hash in constructing the language model, as described below.
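Here’s a rough sketch of what the mapper for this Map job might look like. Using urllib to fetch the page is just one possibility, and error handling and the surrounding MapReduce plumbing are omitted:

    from urllib.request import urlopen

    def mapper(url, _empty):
        # Download the page and emit the document as the new value for this URL.
        with urlopen(url) as response:
            document = response.read().decode("utf-8", errors="replace")
        yield url, document

    # Feeding every (URL:empty) pair through this mapper produces the
    # (URL:document) hash used below.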

Language model

Recall from the previous post that in the bigram model we compute the probability of a sentence as:

[tex] \mbox{pr}(e_1,\ldots,e_m) = \mbox{pr}(e_m|e_{m-1}) \mbox{pr}(e_{m-1}|e_{m-2}) \ldots \mbox{pr}(e_2|e_1) \mbox{pr}(e_1). [/tex]

To compute this probability, we need estimates for the single-word probabilities [tex]\mbox{pr}(e_1)[/tex], and for the bigram conditional probabilities, [tex]\mbox{pr}(e_2|e_1)[/tex].

To estimate the single-word probabilities [tex]\mbox{pr}(e)[/tex] we use two MapReduce jobs. The first MapReduce job counts the total number of words [tex]N[/tex] in all the documents. The input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#(document)), where the intermediate key [tex]0[/tex] is simply an arbitrarily chosen constant, and #(document) is the total number of words in the document. The reducer simply sums over all these intermediate values, giving [tex]N[/tex] as output.

The second MapReduce job is a simple variant on the wordcount example used in the earlier post. It takes as input the hash with entries (URL:document). The mapper parses the document, and each time it comes to a word [tex]e[/tex] emits the intermediate key-value pair [tex](e:1)[/tex], indicating one occurrence of [tex]e[/tex]. The reducers sum over all those counts of [tex]1[/tex], and divide by the total number of words [tex]N[/tex], giving as output a hash with entries [tex](e:\mbox{pr}(e))[/tex], where [tex]\mbox{pr}(e)[/tex] is the relative frequency of [tex]e[/tex] in the document collection.
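To make the two jobs concrete, here’s a small local sketch, with a toy in-memory “corpus” standing in for the distributed (URL:document) hash, and a naive split() standing in for a real parser:

    from collections import defaultdict

    corpus = {"url1": "the cat sat on the mat", "url2": "the dog sat"}

    # Job 1: each mapper emits (0 : number of words in its document);
    # a single reducer sums these to give N.
    N = sum(len(doc.split()) for doc in corpus.values())

    # Job 2: each mapper emits (e : 1) for every word occurrence; the
    # reducer for word e sums the 1s and divides by N.
    counts = defaultdict(int)
    for doc in corpus.values():          # mapper phase
        for word in doc.split():
            counts[word] += 1
    pr = {e: c / N for e, c in counts.items()}   # reducer phase

    print(pr["the"])   # relative frequency of "the" in the toy corpus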

To estimate [tex]\mbox{pr}(e_2|e_1)[/tex] we also use two MapReduce jobs. The first MapReduce job computes the total number of bigrams, [tex]N'[/tex]. Once again, the input to this job is the hash with entries (URL:document). The mapper emits a single intermediate pair, (0:#'(document)), where the intermediate key [tex]0[/tex] is again an arbitrary constant, and #'(document) is the total number of bigrams in the document. The reducer sums over all these values, giving [tex]N'[/tex] as output.

The second MapReduce job computes the frequency of each bigram. Again, the input to this job is the hash with entries (URL:document). The mapper emits an intermediate key-value pair [tex]((e_2|e_1):1)[/tex] for each bigram it finds in the document, possibly with some repetitions. Note that the intermediate keys are of the form [tex](e_2|e_1)[/tex], simply for consistency with our notation elsewhere. The reducer sums over the intermediate values and divides by [tex]N'[/tex] to get the relative frequency of each bigram. Strictly speaking, that relative frequency estimates the joint probability [tex]\mbox{pr}(e_1,e_2)[/tex]; to get the conditional probability we divide by the single-word probability of [tex]e_1[/tex], [tex]\mbox{pr}(e_2|e_1) = \mbox{pr}(e_1,e_2) / \mbox{pr}(e_1)[/tex], which is easy since we already have the hash of single-word probabilities. The output hash thus has entries [tex]((e_2|e_1):\mbox{pr}(e_2|e_1))[/tex], as was desired.
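Here’s the corresponding local sketch for the bigram job, on the same toy corpus. For simplicity it estimates [tex]\mbox{pr}(e_2|e_1)[/tex] directly by dividing each bigram count by the count of [tex]e_1[/tex], which comes to the same thing as the two-step estimate above, up to the small difference between [tex]N[/tex] and [tex]N'[/tex]:

    from collections import defaultdict

    corpus = {"url1": "the cat sat on the mat", "url2": "the dog sat"}

    unigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for doc in corpus.values():                      # mapper phase
        words = doc.split()
        for e in words:
            unigram_counts[e] += 1
        for e1, e2 in zip(words, words[1:]):
            bigram_counts[(e1, e2)] += 1

    N_prime = sum(bigram_counts.values())            # total number of bigrams

    # Joint relative frequency of each bigram, and the conditional probability.
    pr_joint = {b: c / N_prime for b, c in bigram_counts.items()}
    pr_cond = {(e2, e1): c / unigram_counts[e1]
               for (e1, e2), c in bigram_counts.items()}

    print(pr_cond[("cat", "the")])   # an estimate of pr(cat | the)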

At this point we’ve constructed distributed hashes with key-value pairs [tex](e:\mbox{pr}(e))[/tex] and [tex]((e_2|e_1):\mbox{pr}(e_2|e_1))[/tex], effectively encoding our language model. In particular, access to these distributed hashes makes it easy to compute [tex]\mbox{pr}(E)[/tex] for an entire sentence [tex]E[/tex].
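For instance, here’s a toy sketch of scoring a whole sentence with the bigram model, with ordinary dictionaries standing in for the two distributed hashes (the numbers are made up):

    pr_word = {"the": 0.05, "cat": 0.01, "sat": 0.008}
    pr_bigram = {("cat", "the"): 0.3, ("sat", "cat"): 0.2}   # keyed as (e2, e1)

    def sentence_probability(words):
        # pr(e1) * pr(e2|e1) * ... * pr(em|e_{m-1})
        p = pr_word[words[0]]
        for e1, e2 in zip(words, words[1:]):
            p *= pr_bigram[(e2, e1)]
        return p

    print(sentence_probability(["the", "cat", "sat"]))   # 0.05 * 0.3 * 0.2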

Exercises

  • In the earlier post about statistical machine translation I mentioned that problems can arise in estimating bigram probabilities, because bigrams not in the corpus have their probability systematically underestimated. How might you address this problem on a distributed cluster?

Translation model

Recall from the previous post that our translation model was determined by the following sets of parameters:

  • The fertility probability [tex]\mbox{pr}(n|e)[/tex], the probability that the English word [tex]e[/tex] has fertility [tex]n[/tex].
  • The distortion probability [tex]\mbox{pr}(t|s,l)[/tex], which is the probability that an English word at position [tex]s[/tex] corresponds to a French word at position [tex]t[/tex] in a French sentence of length [tex]l[/tex].
  • The translation probability [tex]\mbox{pr}(f|e)[/tex], one for each French word [tex]f[/tex] and English word [tex]e[/tex].

To determine these parameters we simply start with guesses, and then use an iterative procedure to gradually get better estimates, stopping when the iterative procedure converges on a relatively stable set of parameters.

To implement this strategy, we start by setting up three hashes to store estimates of the parameters. Let’s start with the fertility probabilities. As our starting guess, we’ll assume:

[tex] \mbox{pr}(n|e) \approx \frac{1}{2^{n+1}}. [/tex]

I’ve put the approximation sign in because we’ll actually truncate the distribution at some high value for the fertility, say [tex]n = 10[/tex], so the hash is of a manageable size. Obviously, these probabilities should then be modified slightly, to be properly normalized, but that’s a detail I won’t worry about.
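For concreteness, with the truncation at [tex]n = 10[/tex] the properly normalized guess would be

[tex] \mbox{pr}(n|e) = \frac{1/2^{n+1}}{\sum_{k=0}^{10} 1/2^{k+1}} = \frac{1/2^{n+1}}{1-1/2^{11}}, \qquad n = 0,1,\ldots,10. [/tex]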

We’ll set up a hash with keys [tex](n|e)[/tex] to store our initial guesses for the fertility probabilities. This can be done in many ways. One way is to use a single MapReduce job which takes as input the hash storing the corpus of English-French texts and produces as output a hash with key-value pairs [tex]((0|e):1/2),((1|e):1/4),((2|e):1/8),\ldots[/tex], where [tex]e[/tex] ranges over the set of English words in our corpus.
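Here’s a sketch of a mapper that would emit those initial guesses, using the truncation and normalization above; the single example word is of course just for illustration:

    MAX_FERTILITY = 10
    NORM = 1.0 - 0.5 ** (MAX_FERTILITY + 1)   # normalization for the truncation

    def fertility_mapper(e):
        # Emit the key (n|e), represented as a tuple (n, e), with the
        # normalized guess as value, for n = 0, ..., MAX_FERTILITY.
        for n in range(MAX_FERTILITY + 1):
            yield (n, e), (0.5 ** (n + 1)) / NORM

    for key, value in fertility_mapper("house"):
        print(key, value)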

At this point, you’ve probably noticed that I’m assuming you’re getting comfortable enough with Map and MapReduce that I can just describe what the jobs do, and leave it to you to fill in the details of how they work. This will continue.

Next, we set up a distributed hash for the distortion probabilities [tex]\mbox{pr}(t|s,l)[/tex]. To do this, we first set up an empty hash (i.e., one with empty values) with keys [tex](t|s,l)[/tex], where [tex]t \leq l \leq 40[/tex], and [tex]s \leq 50[/tex], say. As before, these truncations are rather arbitrary, but their purpose is to limit the size of the hash. Now run a Map job with input the empty hash, [tex]((t|s,l):\mbox{empty})[/tex], and where the mapper simply outputs the value [tex]1/l[/tex], giving us a hash [tex]((t|s,l):1/l)[/tex], i.e., a uniform distribution for target positions.
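Here’s a sketch of that initialization, with the key enumeration and the Map step rolled into a few lines of Python; the particular cutoffs just follow the truncations above:

    MAX_L, MAX_S = 40, 50

    def distortion_keys():
        # Enumerate all keys (t|s,l) with t <= l <= MAX_L and s <= MAX_S.
        for l in range(1, MAX_L + 1):
            for s in range(1, MAX_S + 1):
                for t in range(1, l + 1):
                    yield (t, s, l)

    def distortion_mapper(key, _empty):
        t, s, l = key
        yield key, 1.0 / l   # uniform over the l target positions

    # Example: the value stored for the key (3|2,10) is 0.1.
    print(next(distortion_mapper((3, 2, 10), None)))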

In the last paragraph you might wonder why we don’t start by initializing the values in the hash as [tex]1/l[/tex], rather than using a Map job to do it. The model I have in mind is that the setup of the initial empty hash is co-ordinated from a single centralized server, with the values automatically being set to a default of empty. We could instead set the values on the centralized server as well, before distribution, but the problem is that this might cause the central server to become a bottleneck. That probably won’t happen here – computing and distributing [tex]1/l[/tex] is pretty easy – but it could easily happen if the computation or values were more complex. As a matter of general principle, it’s better to compute the values using a Map job which can be distributed across the cluster.

The third and final distributed hash we need to set up has keys [tex](f|e)[/tex] and values the translation probabilities [tex]\mbox{pr}(f|e)[/tex]. As our initial guess for these probabilities, we’ll use our corpus of starting documents, and set [tex]\mbox{pr}(f|e)[/tex] to be the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] occurs. We do this using two MapReduce jobs. The first MapReduce job constructs a hash with entries [tex](e: \mbox{number of sentences with } e \mbox{ in them})[/tex], over all words [tex]e[/tex] in the corpus. We use this hash and a second MapReduce job to construct a hash with keys [tex](f|e)[/tex] and values the fraction of sentences with [tex]f[/tex] in them, given that [tex]e[/tex] is in the starting text.
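Here’s a small local sketch of those two jobs, run over a toy corpus of aligned sentence pairs; the corpus and the whitespace tokenization are illustrative only:

    from collections import defaultdict

    sentence_pairs = [
        ("the house", "la maison"),
        ("the blue house", "la maison bleue"),
    ]

    # Job 1: for each English word e, count the sentence pairs containing e.
    e_count = defaultdict(int)
    for english, _french in sentence_pairs:
        for e in set(english.split()):
            e_count[e] += 1

    # Job 2: for each (f|e), count the pairs containing both, divide by e's count.
    fe_count = defaultdict(int)
    for english, french in sentence_pairs:
        for e in set(english.split()):
            for f in set(french.split()):
                fe_count[(f, e)] += 1
    pr_f_given_e = {key: c / e_count[key[1]] for key, c in fe_count.items()}

    print(pr_f_given_e[("maison", "house")])   # 1.0 on this toy corpus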

We’ve now set up the three hashes describing our starting guess for the parameters of our translation model. Next, we need to understand how to iteratively update those guesses, hopefully improving their quality.

The first step in the iterative update is to use Map to process our translation corpus, producing as output a hash with keys [tex](E,F)[/tex], where [tex]E[/tex] and [tex]F[/tex] are corresponding English and French sentences in the corpus. To compute the corresponding value, we consider all possible alignments between the sentence pair, and compute probabilities for those alignments, using our initial estimate of the parameters above. The alignment [tex]a[/tex] with the highest probability is the corresponding value, so we are producing a hash with pairs [tex]((E,F):a)[/tex].
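Here’s a much-simplified sketch of how the mapper might score and choose alignments, with dictionaries standing in for the three parameter hashes. An “alignment” here is a tuple mapping each French position to an English position, the fertility factors are handled crudely, and the brute-force search over alignments is only sensible for toy sentences; a real implementation would need something cleverer:

    from collections import Counter
    from itertools import product

    def alignment_probability(english, french, a, pr_fert, pr_dist, pr_trans):
        l = len(french)
        p = 1.0
        fertilities = Counter(a)   # French words assigned to each English position
        for s, e in enumerate(english):
            p *= pr_fert.get((fertilities[s], e), 1e-6)
        for t, f in enumerate(french):
            s = a[t]
            p *= pr_trans.get((f, english[s]), 1e-6)
            p *= pr_dist.get((t, s, l), 1e-6)
        return p

    def best_alignment(english, french, pr_fert, pr_dist, pr_trans):
        # Brute force over all |E|^|F| alignments -- toy sentences only.
        candidates = product(range(len(english)), repeat=len(french))
        return max(candidates,
                   key=lambda a: alignment_probability(
                       english, french, a, pr_fert, pr_dist, pr_trans))

The mapper for the key [tex](E,F)[/tex] would then emit [tex]((E,F):a)[/tex], where [tex]a[/tex] is the alignment returned by best_alignment.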

The second step in the iterative update is to use the sentence-pair hash [tex]((E,F):a)[/tex] and three MapReduce jobs to compute the updated estimates for the fertility, distortion and translation probabilities, [tex]\mbox{pr}(n|e), \mbox{pr}(t|s,l)[/tex] and [tex]\mbox{pr}(f|e)[/tex]. This completes the iterative update.
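As an example, here’s a sketch of one of those three jobs: re-estimating the translation probabilities [tex]\mbox{pr}(f|e)[/tex] from the aligned sentence pairs. It’s written as a single local function, but it splits naturally into a mapper that emits counts and a reducer that normalizes them:

    from collections import defaultdict

    def reestimate_translation(aligned_pairs):
        # aligned_pairs: iterable of (english_words, french_words, alignment),
        # where alignment[t] is the English position aligned to French position t.
        pair_counts = defaultdict(int)
        e_totals = defaultdict(int)
        for english, french, a in aligned_pairs:       # mapper phase
            for t, f in enumerate(french):
                e = english[a[t]]
                pair_counts[(f, e)] += 1
                e_totals[e] += 1
        return {key: c / e_totals[key[1]]              # reducer phase
                for key, c in pair_counts.items()}

    # Example on a single toy pair, with "la" and "maison" both aligned to "house":
    print(reestimate_translation([(["the", "house"], ["la", "maison"], (1, 1))]))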

We now repeat this iterative update process many times. I’ll leave it as an exercise to figure out how you could use MapReduce to determine how much the parameters of the model are changing, per iteration, and what would be a reasonable criterion for terminating the iterations.

The translation step: heuristic search

The translation step is the final stage. It’s the stage at which we take our translation model, and apply it to translate a French document into English, sentence by sentence. Recall from the previous post that to translate the French sentence [tex]F[/tex] we search for the English sentence [tex]E[/tex] and alignment [tex]a[/tex] which maximizes the product

[tex] \mbox{pr}(F,a|E) \mbox{pr}(E). [/tex]

The idea of the heuristic search is to consider partial sentences and partial alignments, maintaining a stack of particularly promising candidates. We’ll set the stack size to be [tex]S[/tex], let’s say [tex]S = 10^6[/tex], so we’ll never consider more than a million possible candidate alignments and sentences.

The procedure involves three steps: (1) Construct all possible single-word extensions of existing partial sentences and partial alignments, and compute the corresponding product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]; (2) Choose the [tex]S[/tex] most likely of those, and discard the rest; and (3) Terminate when we have a complete alignment which is substantially more likely than any of the remaining partial alignments, otherwise repeat steps (1) and (2).

Step (1) can easily be done using a Map job. Step (2) can be done using multiple MapReduce steps, which together carry out what is essentially a binary search over a probability threshold to find the top [tex]S[/tex] values of the product [tex]\mbox{pr}(F,a|E) \mbox{pr}(E)[/tex]. Step (3) can be done using the same ideas as step (2).
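Here’s a local sketch of the kind of threshold-based selection step (2) describes, with the scores held in an ordinary dictionary; on a cluster, each count of candidates above a trial threshold would itself be a small MapReduce job:

    def top_S(candidates, scores, S, iterations=50):
        # Binary search for a score threshold that keeps roughly S candidates.
        lo, hi = min(scores.values()), max(scores.values())
        for _ in range(iterations):
            mid = (lo + hi) / 2
            count = sum(1 for c in candidates if scores[c] >= mid)
            if count > S:
                lo = mid
            else:
                hi = mid
        return [c for c in candidates if scores[c] >= lo]

    scores = {"cand%d" % i: i / 100.0 for i in range(100)}
    print(len(top_S(list(scores), scores, S=10)))   # roughly 10 survivors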

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is a FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

A simple experiment in cognitive enhancement: what effect does the web and Google have on IQ?

A psychology experiment I’ve long wanted done is to give a group of people an IQ test where they have web access, and to compare the results to a control group without web access. I suspect those with web access would do quite a bit better, but it’d be interesting to see just how much better. “The web raises your IQ x points”, where x = … I’ve heard of a few experiments vaguely in this vein, but nothing seems to be quite like it. It could be run pretty cheaply on a University campus.

The dangers of deliberation: suppressing rather than amplifying hidden knowledge

Continuing yesterday’s theme, Cass Sunstein’s “Infotopia” provides a remarkable example of the things that can go wrong when groups collaborate. From an experiment in simulated elections:

Information was parceled out to group members about three candidates for political office; if the information had been properly pooled, the group would have selected Candidate A, who was clearly the best choice. The experiment involved two conditions. In the first condition, each member of the four-person group was given most of the relevant information (66 percent of the information about each candidate). In that condition, 67 percent of the group members favored Candidate A before discussion, and 85 percent after discussion…

In the second condition, by contrast, the information that favored Candidate A was parceled out to various members of the group, rather than shared by all. As this condition was designed, the shared information favored the two unambiguously inferior candidates, B and C. If the unshared information emerged through discussion and was taken seriously, Candidate A would be chosen.

In that condition, fewer than 25 percent of group members favored Candidate A before discussions, a natural product of the initial distribution of information. But (and this is the key result) the number of people favoring Candidate A actually fell after discussion, simply because the shared information had disproportionate influence on group members. In other words, groups did worse, not better, than individuals when the key information was not initially shared by group members. The commonly held information was far more influential than the unshared information, to the detriment of the group’s ultimate decision.

Sunstein gives many similar examples where groups behave worse than individuals; all failures of collective cognition. Of course, the examples are of limited import, because he’s talking about groups using rather unstructured processes. With good facilitation, good process, or simply good tools, many of these collective cognitive biases can likely be avoided.

Aristotle and Sunstein on collective intelligence

In Politics, as quoted in Cass Sunstein’s excellent Infotopia: How Many Minds Produce Knowledge, Aristotle suggested that when diverse groups

all come together… they may surpass – collectively and as a body, although not individually – the quality of the few best… When there are many who contribute to the process of deliberation, each can bring his share of goodness and moral prudence… some appreciate one part, some another, and all together appreciate all.

Sunstein goes on to discuss the many ways in which Aristotle got it wrong (and right), detailing in particular many ways in which deliberation can make group functioning worse. The short summary is that deliberation often produces more confidence, a greater sense of legitimacy, and more homogeneous thinking, but doesn’t get it right as often as processes like voting without discussion. The most striking exception is for very well posed problems (“What is the population of Singapore?”), where deliberation can help (“I took a course on Singapore in college, and the population is…”).

This is thought-provoking for anyone interested in collective intelligence online, and suggests the great value of providing independent objective measures of quality.