# Introduction to Statistical Machine Translation

How do computers translate texts from one language to another? Human translators use a great deal of detailed knowledge about how the world works to correctly translate all the different meanings the same word or phrase can have in different contexts. This makes automated translation seem like it might be a hard problem to solve, the kind of problem that may require genuine artificial intelligence. Yet although translators like Google Translate and Yahoo!’s Babel Fish are far from perfect, they do a surprisingly good job. How is that possible?

In this post, I describe the basic ideas behind the most successful approach to automated machine translation, an approach known as statistical machine translation. Statistical machine translation starts with a very large data set of good translations, that is, a corpus of texts (e.g., United Nations documents) which have already been translated into multiple languages, and then uses those texts to automatically infer a statistical model of translation. That statistical model is then applied to new texts to make a guess as to a reasonable translation.

I’m not an expert in statistical machine translation. I’ve written these notes to help me internalize the basic ideas of the area. However, I hope the notes are useful to others as an introductory overview of statistical machine translation, and as a starting place to learn more.

### Formulating the problem

Imagine you’re given a French text, $$f$$, and you’d like to find a good English translation, $$e$$. There are many possible translations of $$f$$ into English, of course, and different translators will have different opinions about what the best translation, $$e$$, is. We can model these differences of opinion with a probability distribution $$\mbox{pr}(e|f)$$ over possible translations, $$e$$, given that the French text was $$f$$. A reasonable way of choosing the “best” translation is to choose $$e$$ which maximizes the conditional probability $$\mbox{pr}(e|f)$$.

The problem with this strategy is that we don’t know the conditional probability $$\mbox{pr}(e|f)$$. To solve this problem, suppose we’re in possession of an initial corpus of documents that are in both French and English, e.g., United Nations documents, or the Canadian parliamentary proceedings. We’ll use that corpus to infer a model estimating the conditional probabilities $$\mbox{pr}(e|f)$$. The model we’ll construct is far from perfect, but with a large and high quality initial corpus, yields pretty good translations. To simplify the discussion we assume $$e$$ and $$f$$ are single sentences, and we’ll ignore punctuation; obviously, the translator can be applied serially to a text containing many sentences.

Now, how do we start from the corpus and infer a model for $$\mbox{pr}(e|f)$$? The standard approach is to use Bayes’ theorem to first rewrite $$\mbox{pr}(e|f)$$ as

$$\mbox{pr}(e|f) = \frac{\mbox{pr}(f|e) \mbox{pr}(e)}{\mbox{pr}(f)}.$$

Because $$f$$ is fixed, the maximization over $$e$$ is thus equivalent to maximizing

$$\mbox{pr}(f|e) \mbox{pr}(e).$$

What we’re going to do is to use our data set to infer models of $$\mbox{pr}(f|e)$$ and $$\mbox{pr}(e)$$, and then use those models to search for $$e$$ maximizing $$\mbox{pr}(f|e) \mbox{pr}(e)$$.

The procedure outlined in the last paragraph seems a priori like a strange thing to do. Why not save time and difficulty by inferring a model of $$\mbox{pr}(e|f)$$ and then maximizing directly? It turns out that the procedure I described actually works quite a bit better. Now, to be blunt, I don’t entirely understand the explanations of this I’ve read in the published literature. But what I think is going on is as follows. It turns out that if we work directly with $$\mbox{pr}(e|f)$$, then many malformed English sentences $$e$$ are given a high probability by our model. You’ll see why this is the case later – it’s obvious from the construction – but for now just accept that’s the case. If we worked directly with $$\mbox{pr}(e|f)$$ those sentences would often be issued as “translations”, which is no good. Now, why should maximizing $$\mbox{pr}(f|e) \mbox{pr}(e)$$ work any better? After all, both $$\mbox{pr}(f|e)$$ and $$\mbox{pr}(e)$$ may give high probabilistic weight to sentences $$e$$ which are not well-formed. So what gives? I think there are two things going on. First, although $$\mbox{pr}(f|e)$$ and $$\mbox{pr}(e)$$ may give high weight to malformed sentences $$e$$, the models are constructed very differently. For both to give high weight simultaneously, it’s quite unlikely that $$e$$ is malformed. Second, we can build grammatical checks directly into the construction of $$\mbox{pr}(e)$$, so very few malformed sentences have high values for $$\mbox{pr}(e)$$. We won’t actually do that in this post, but it’s relatively straightforward to do.

We’ve broken machine translation up into three problems: (1) build a language model which allows us to estimate $$\mbox{pr}(e)$$; (2) build a translation model which allows us to estimate $$\mbox{pr}(f|e)$$; and (3) search for $$e$$ maximizing the product $$\mbox{pr}(f|e)\mbox{pr}(e)$$. Each of these problems is itself a rich problem which can be solved in many different ways. In the next three sections I’ll describe simple approaches to solving each of the three problems.

### The language model

Suppose we break an English sentence $$e$$ up into words $$e = e_1e_2\ldots e_m$$. Then we can write the probability for $$e$$ as a product of conditional probabilities:

$$\mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j|e_1,\ldots,e_{j-1})$$

So, for example, in the sentence fragment “To be or not to…”, you would probably assign a very high conditional probability to the final word being “be”, certainly much higher than the probability of its occurrence at a random point in a piece of text.

The challenge in building a good language model $$\mbox{pr}(e)$$ is that there are so many distinct conditional probabilities that need to be estimated. To simplify the problem we’ll make some assumptions about the form of the probabilities. The most drastic assumption we could make is to assume that the probability of seeing a word is independent of what came before it, i.e.,

$$\mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j),$$

so

$$\mbox{pr}(e) = \prod_{j=1}^m \mbox{pr}(e_j).$$

We could estimate the probabilities $$\mbox{pr}(e_j)$$ by taking a very large corpus of English text, and counting words. The problem is that this model is not very realistic. A more realistic model is the bigram model, which assumes that the probability of a word occurring depends only on the word immediately before it:

$$\mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-1}).$$

More realistic still is the trigram model, which assumes that the probability of a word occurring depends only on the two words immediately before it:

$$\mbox{pr}(e_j|e_1,\ldots,e_{j-1}) = \mbox{pr}(e_j|e_{j-2},e_{j-1}).$$

To make the discussion which follows as concrete as possible, let’s concentrate on the problem of how to estimate conditional probabilities in the bigram model. Similar remarks can be made for the trigram model. The obvious way to proceed is to take a large corpus of English text, count the number of occurrences $$\#(e_1,e_2)$$ of a particular word pair in that corpus, and then set $$\mbox{pr}(e_2|e_1) = \#(e_1,e_2)/\#(e_1)$$.

The problem with this procedure is that there are a lot of bigrams. If we assume that there are 50,000 words in English, say, then there are 2.5 billion possible bigrams. Even if you take a large corpus of training data (say, a billion words), it’s reasonably likely that there will be some bigrams which don’t appear in your corpus, and thus are assigned zero probability, yet which you would a priori like to appear in translations of some French sentences. That is, this kind of training procedure is likely to underestimate the probability of bigrams which don’t appear in the training set, and overestimate the probability of those which do. The problem is even worse for trigrams.

There’s no obvious best solution to this problem. Many different ad hoc solutions have been tried, and my quick survey of the literature suggests that there’s as yet no broad agreement about the best way of solving the problem. To give you the flavour of the solutions people use, though, let me describe two basic approaches, adapted from the survey in section 2.3 of Philip Clarkson’s PhD thesis.

The first approach is to move away from a pure bigram model, and instead to use linear interpolation between the monogram and bigram models. A large and diverse enough corpus of text is likely to give pretty good estimates for nearly all single-word probabilities $$\mbox{pr}(e_1)$$. So one way of estimating $$\mbox{pr}(e_2|e_1)$$ is as:

$$\mbox{pr}(e_2|e_1) = \lambda \frac{\#(e_2)}{N} + (1-\lambda) \frac{\#(e_1,e_2)}{\#(e_1)},$$

where $$N$$ is the total number of words in the corpus. $$\lambda$$ is a parameter in the range $$0 < \lambda < 1$$ which needs to be determined. This can be done using a second corpus of text, and setting $$\lambda$$ so that the average probability of the bigrams in that corpus is maximized. A second approach is to apply a discount factor to the conditional probabilities for the bigrams which appear in the training corpus, essentially reducing their probability. The easist way to proceed is simply to multiply them all by some constant $$c$$ in the range $$0 < c < 1$$, and then to spread the remaining probability $$1-c$$ uniformly over all bigrams $$(e_1,e_2)$$ which don't appear in the corpus.

### Exercises

• The constant $$c$$ can be viewed as an estimate of the fraction of all English bigrams that appear in the training corpus. How might you go about obtaining an estimate for $$c$$?

To conclude this section, let me mention that the problem of building a good language model is also of great interest to people working on speech recognition software. In fact, much of the research literature on language models can be applied equally to either speech recognition or statistical machine translation.

### The translation model

Note: English is my only language, which makes it hard for me to construct translation examples! All my examples are taken from a paper by Brown et al.

In this section we’ll construct a simple translation model allowing us to compute $$\mbox{pr}(f|e)$$. Intuitively, when we translate a sentence, words in the source text generate (possibly in a context-dependent way) words in the target language. In the sentence pair (Jean aime Marie | John loves Mary) we intuitively feel that John corresponds to Jean, loves to aime, and Mary to Marie. Of coure, there is no need for the word correspondence to be one-to-one, nor for the ordering of words to be preserved. Sometimes, a word in English may generate two or more words in French; sometimes it may generate no word at all.

Despite these complications, the notion of a correspondence between words in the source language and in the target language is so useful that we’ll formalize it through what is called an alignment. Rather than give a precise definition, let me explain alignments through an example, the sentence pair (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)). In this example alignment, the numbers in parentheses tell us that John corresponds to the 6th word in the French sentence, i.e., Jean. The word does has no trailing parentheses, and so doesn’t correspond to any of the words in the French sentence. However, beat corresponds to not one, but two separate words in the French sentence, the 3rd and 4th words, est and battu. And so on.

Two notions derived from alignments are particularly useful in building up our translation model. The first is fertility, defined as the number of French words generated by a given English word. So, in the example above, does has fertility $$0$$, since it doesn’t generate anything in the French sentence. On the other hand, beat has fertility $$2$$, since it generates two separate words in the French sentence.

The second notion is distortion. In many sentences, the English word and its corresponding French word or words appear in the same part of the sentence – near the beginning, perhaps, or near the end. We say that such words are translated roughly undistorted, while words which move a great deal have high distortion. We’ll encoude this notion more formally shortly.

We’ll build up our translation model using some simple parameters related to fertility and distortion:

• The fertility probability $$\mbox{pr}(n|e)$$, the probability that the English word $$e$$ has fertility $$n$$.
• The distortion probability $$\mbox{pr}(t|s,l)$$, which is the probability that an English word at position $$s$$ corresponds to a French word at position $$t$$ in a French sentence of length $$l$$.
• The translation probability $$\mbox{pr}(f|e)$$, one for each French word $$f$$ and English word $$e$$. This should not be confused with the case when $$f$$ and $$e$$ are sentences!

I’ve described the translation model as a way of computing $$\mbox{pr}(f|e)$$, where $$e$$ is an English sentence, and $$f$$ is a French sentence. In fact, we’re going to modify that definition a little, defining the translation model as the probability $$\mbox{pr}(f,a|e)$$ that the French sentence $$f$$ is the correct translation of $$e$$, with a particular alignment, which we’ll denote by $$a$$. I’ll return to the question of how this change in definition affects translation in the next section. Rather than specify the probability for our translation model formally, I’ll simply show how it works for the example alignment (Le chien est battu par Jean |John (6) does beat (3,4) the (1) dog (2)):

$$\mbox{pr}(1|John) \times \mbox{pr}(Jean|John) \times \mbox{pr}(6|1,6) \times$$

$$\mbox{pr}(0|does) \times$$

$$\mbox{pr}(2|beat) \times \mbox{pr}(est|beat) \times \mbox{pr}(3|3,6) \times \mbox{pr}(battu|beat) \times \mbox{pr}(4|3,6) \times$$

$$\mbox{pr}(1|the) \times \mbox{pr}(Le|the) \times \mbox{pr}(1|5,6) \times$$

$$\mbox{pr}(1|dog) \times \mbox{pr}(chien|dog) \times \mbox{pr}(2|6,6) \times$$

$$\mbox{pr}(1|) \times \mbox{pr}(par|)$$

This should be self-explanatory except the final line, for the French word par. This word has no corresponding word in the English sentence, which we model using a special word .

What remains to be done is to estimate the parameters used in constructing the translation model – the fertility, distortion and translation probabilities. I’ll briefly outline an iterative procedure that can be used to do this estimation. It starts with a simple guess of the parameters. E.g., we might guess that the distortion probabilities $$\mbox{pr}(t|s,l) = 1/l$$ are uniform across the sentence. Similar guesses could be made for the other parameters. For each pair $$(e,f)$$ of sentences in our corpus we can use this guess to compute the probability for all possible alignments between the sentences. We then estimate the “true” alignment to be whichever alignment has highest probability. Applying this procedure to the entire corpus gives us many estimated alignments. We can then use those estimated alignments to compute new estimates for all the parameters in our model. E.g., if we find that $$1/10$$th of the alignments of sentences of length $$25$$ have the first word mapped to the first word then our new estimate for $$\mbox{pr}(1|1,25) = 1/10$$. This gives us a procedure for iteratively updating the parameters of our model, which can be repeated many times. Empirically we find (and it can be proved) that the parameters gradually converge to a fixed point.

### Exercises

• A problem I’ve glossed over above is figuring out which sentences in the English corpus correspond to which sentences in the French corpus. How could you figure out which sentences correspond to which? What would be the effect of mistakes?

### Searching for the maximizing $$e$$

In my introductory remarks I explained how translation from French to English can be viewed as the problem of finding $$e$$ which maximizes $$\mbox{pr}(e|f)$$, and that this was equivalent to maximizing $$\mbox{pr}(f|e)\mbox{pr}(e)$$. Let’s revisit the reasoning which led to that framing of the problem, and recast it in a form more amenable to our translation model based on alignments. Instead of searching for $$e$$ which maximizes $$\mbox{pr}(e|f)$$, let’s instead search for a pair $$(e,a)$$ which maximizes $$\mbox{pr}(e,a|f)$$. Using Bayes’ theorem this is equivalent to finding $$(e,a)$$ which maximizes

$$\mbox{pr}(e,a|f) = \frac{\mbox{pr}(f,a|e) \mbox{pr}(e)} {\mbox{pr}(f)}.$$

As earlier, we can omit the factor $$\mbox{pr}(f)$$ in the denominator, and so our search is equivalent to finding a pair $$(e,a)$$ which maximizes

$$\mbox{pr}(f,a|e) \mbox{pr}(e)$$

Our translation problem is to find $$(e,a)$$ maximizing this expression. Unfortunately, there are far too many possibility to do an exhaustive search. But there are some heauristics which work pretty well. One is to use a greedy algorithm which gradually builds up partial translations and alignments. Consider the following partial translations and alighments:

(Jean aime Marie | John (1) *)

(Jean aime Marie | * loves (2) *)

(Jean aime Marie | Petroleum (1) *)

Using the translation and language models described earlier we can compute the probabilities for each of these partial alignments, and so compute the product $$\mbox{pr}(f,a|e) \mbox{pr}(e)$$. It would be very surprising if the first two candidates weren’t assigned a high probability, and the third example a very low probability. Our heuristic for search is thus to maintain a stack of promising partial translations and alignments, and to continually extend those partial translations and alignments until we find a complete translation and alignment with a significantly higher probability than any of the partial alignments.

### Commentary

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm … is performing better than the best algorithm on less training data. Worry about the data first before you worry about the algorithm.

Peter Norvig, Director of Research at Google, as reported by Greg Linden.

That completes our description of a simple statistical machine translation system. The system I’ve described is essentially the same as the system described in 1990 by Brown et al at IBM research, a system which translated roughly half of all sentences in an acceptable way! Personally, I find it remarkable that such simple ideas can be used to make so much progress on what appears to be an enormously difficult problem. (I’ll describe other examples in future posts.) Furthermore, although the basic ideas remain the same, many improvements to these ideas have been made since 1990. The biggest single advance seems to have been a movement away from words as the unit of language, and towards phrase-based models, which give greatly improved performance.

If you’re interested in understanding more, I strongly suggest checking out an accessible and very influential 1993 follow-up paper from the IBM group, which describes a whole plethora of translation models. I don’t have a single canonical reference to give on phrase-based statistical translation models, but found several of the top hits at Google Scholar quite informative. Finally, and just for fun, I recommend Peter Norvig’s excellent short introduction to automatic spelling correction, which uses similar ideas in a much (!) simpler setting, but has the big advantage that it is illustrated by real code.

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. Subscribe to my blog to follow future posts in the series.

Published
Categorized as GTS

1. great survey! a few comments.

re language modeling, (first, please call them unigrams not monograms :P) i would say that there is a fair amount of consensus that using kneser-ney smoothing is pretty much okay. (though the googlers argue that using stupider techniques works just as well too.) almost exactly the KN model can be justified either in a bayesian setting or a maximum likelihood sets setting. regarding speech versus translation, there’s a big open question whether ngram models are good for doing word reordering (evidence suggests “no”), which is needed for translation but not for speech.

regarding the translation process, it’s easy to show NP-hardness (reduction to traveling salesman for word reordering) and greedy worked okay back in the late 90s but not really any more. it’s a search problem confounded by the fact that you really want to sum over latent variables (the alignments).

at any rate, there are great tutorials that go into the nuts and bolts of the process, including kevin knight’s tutorial (which promises free beer!) and a tutorial kevin and philipp koehn have given over the past few years. (also, i feel i should plug philipp’s upcoming book which is awesome, based on drafts.)

2. hal – Thankyou, those are some great links, and some great tutorials! I’ve only had a brief skim so far, but I’ll look at several in more depth later. I’ll also be sure to use “unigram” in future ðŸ™‚

3. Or says:

Great post (as usual)

Regarding estimating Pr(e_2|e_1), didn’t you forget
to divide by an estimator for Pr(e_1)?
Taking #(e_1,e_2)/N gives you an estimate of the joint prob. Pr(e_1,e_2)

4. Rajiv Das – No, I’d never heard of the book before; my post was based on the original papers. But based on Peter Norvig’s review, I’ll take a look at the book.

5. Dear Michael, is it the case that this method involves no parsing? (And very little linguistics.)

6. Gil – Yes, that’s right. More recent iterations of statistical machine translation incorporate much more linguistics. But it seems like you get a much bigger gain by starting with statistical ideas, and then later grafting on the linguistic ideas, not the other way round. I found this very surprising.

7. Koos van der Wilt says:

Thanks for a very nice primer in Statistical Machine Translation. Over the past few days, I have attempted to think deeply about the Noisy Channel model, which you present here, and which also holds sway in POS-tagging, Named Entity Recognition, and Speech Recogntion.

You discuss this in discussing P(E|F)=P(F|E).P(E) and by asking why we would not use P(E|F) directly. I would very much like to discuss this with someone knowledgeable.