Using evaluation to improve our question-answering system
It’s tempting to think IBM’s Jeopardy-playing machine Watson must have relied on some huge algorithmic advance or silver bullet idea in order to beat top human Jeopardy players. But the researchers behind Watson have written a very interesting paper about how Watson works, and a different picture emerges. It’s not that they found any super-algorithm for answering questions. Instead, Watson combined a large number of different algorithms, most of them variations on standard algorithms from natural language processing and machine learning. Individually, none of these algorithms was particularly good at solving Jeopardy puzzles, or the sub-problems that needed to be solved along the way. But by integrating a suite of not-very-good algorithms in just the right way, the Watson team got superior performance. As they write in the paper:
rapid integration and evaluation of new ideas and new components against end-to-end metrics [were] essential to our progress… [Question answering benefits from] a single extensible architecture that [allows] component results to be consistently evaluated in a common technical context against a growing variety of… “Challenge Problems.”… Our commitment to regularly evaluate the effects of specific techniques on end-to-end-performance, and to let that shape our research investment, was necessary for our rapid progress.
In other words, they built an extensive evaluation environment that gave detailed and clear metrics that let them see how well their system was doing. In turn, they used these metrics to determine where to invest time and money in improving their system and, equally important, to determine what ideas to abandon. We will call this style of development evaluation-driven development. It’s not a new idea, of course – anyone who’s ever run an A/B test is doing something similar – but the paper implies that Watson took the approach to a remarkable extreme, and that it was this approach which was responsible for much of Watson’s success.
In the last post I developed a very simple question-answering system based on the AskMSR system developed at Microsoft Research in 2001. I’ve since been experimenting – in a very small-scale way! – with the use of evaluation-driven development to improve the performance of that system. In this post I describe some of my experiments. Of course, my experiments don’t compare in sophistication to what the Watson team did, nor with what is done in many other machine learning systems. Still, I hope that the details are of interest to at least a few readers. The code for the experiments may be found on GitHub here. Note that you don’t need to have read the last post to understand this one, although of course it wouldn’t hurt!
First evaluation: The system described in my last post took questions like “Who is the world’s richest person?” and rewrote those questions as search queries for Google. E.g., for the last question it might use Google to find webpages matching “the world’s richest person is *”, and then look for name-like strings in the wildcard position. It would extract possible strings, and score them based on a number of factors, such as whether the strings were capitalized, how likely the search query was to give the correct results, and so on. Finally, it returned the highest-scoring strings as candidate answers.
To evaluate this system, I constructed a list of 100 questions, with each question having a list of one or more acceptable answers. Example questions and acceptable answers included:
(1) Who married the Prince of Wales in 2005?: Camilla Parker Bowles
(2) Who was the first woman to climb the mountain K2?: Wanda Rutkiewicz
(3) Who was the bass guitarist for the band Nirvana?: Krist Novoselic
(4) Who wrote ‘The Waste Land’?: T. S. Eliot or Thomas Stearns Eliot
(5) Who won the 1994 Formula One Grand Prix championship?: Michael Schumacher
The system described in my last post got 41 of the 100 questions exactly right, i.e., it returned an acceptable answer as its top-scoring answer. Furthemore, the system returned a correct answer as one of its top 20 scored responses for 75 of the 100 questions. Among these, the average rank was 3.48: i.e., when Google could find the answer at all, it tended to rank it very highly.
(Actually, to be precise, I used a slightly modified version of the system in the last post: I fixed a bug in how the program parsed strings. The bug is conceptually unimportant, but fixing it increased performance slightly.)
Comparison to Wolfram Alpha: As a comparison, I tried submitting the evaluation questions also to Wolfram Alpha, and determining if the answers returned by Wolfram were in the list of acceptable answers. Now, Wolfram doesn’t always return an answer. But for 27 of those questions it did return an answer. And 20 of those answers were correct.
Hybrid system: Two things are notable about Wolfram’s performance versus the Google-based system: (1) Wolfram gets answers correct much less often than the Google-based system; and (2) When Wolfram returns an answer, it is much more likely to be correct (20 / 27, or 74 percent) than the Google-based system. This suggests using a hybrid system which applies the following procedure:
if Wolfram Alpha returns an answer: return Wolfram Alpha's answer else: return the Google-based answer
Let’s see if we can guess how well this system will work. One possible assumption is that the correctness of the Google-based system and the Wolfram-based system are uncorrelated. If that’s the case, then we’d expect that the hybrid system would get 20 / 27 questions correct, for those questions answered by Wolfram, and the Google-based system would get 41 percent of the remaining 73 questions correct. That gives a total of 50 questions correct.
Now, in practice, when I ran the hybrid procedure it only gave 45 correct answers. That’s a considerable improvement (4 questions) over the Google-based system alone. On the other hand, it’s less than half the improvement we’d expect if the Google-based system and the Wolfram-based system were uncorrelated. What’s going on is that the evaluation questions which Wolfram Alpha is good at answering are, for the most part, questions which Google is also good at answering. In other words, what we learn from the evaluation is that there is considerable overlap (or correlation) in the type of questions these two systems are good at answering, and so we get less improvement than we might have thought.
Problems for the author
- In developing the hybrid algorithm we relied on the fact that when Wolfram Alpha returned an answer it was highly likely to be correct – much more likely than the Google-based system. But suppose the Google-based system says that there is a very big gap in score between the top-ranked and second-ranked answer. Might that signal that we can have high confidence in the Google-based answer? Determine where it is possible to gain better performance by returning the Google-based answer preferentially when the ratio of the top-ranked and second-ranked score is sufficiently large.
- Are there other problem features that would enable us to determine whether a question is more or less likely to be answered correctly by the Google-based system or by the Wolfram-based system? Maybe, for example, long questions are more likely to be answered correctly by Google. Modify the code to determine whether this is the case.
Improving the scoring of proper nouns in the Google-based system: In the Google-based system, the scoring mechanism increase the scores of candidate answers by a factor for each word in the answer which is capitalized. The idea is to score more highly candidate answers which are likely to be proper nouns.
Originally, I chose the capitalization factor to be 3.0, based on nothing more than an intuitive guess. Let’s use the evaluation system to see if we can improve this guess. The results are below; you may wish to skip straight to the list of lessons learned, however, as an entree into the results.
Factor: 4.0 Top-ranking answer is correct: 33 A correct answer in the top 20: 74 Average rank for correct answers in the top 20: 4.46 Factor: 3.0 (original parameter choice) Top-ranking answer is correct: 41 A correct answer in the top 20: 75 Average rank for correct answers in the top 20: 3.48 Factor: 2.5 Top-ranking answer is correct: 42 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.41 Factor: 2.4 Top-ranking answer is correct: 47 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.38 Factor: 2.3 Top-ranking answer is correct: 47 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.25 Factor: 2.2 Top-ranking answer is correct: 47 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.21 Factor: 2.1 Top-ranking answer is correct: 47 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.24 Factor: 2.0 Top-ranking answer is correct: 42 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.13 Factor: 1.9 Top-ranking answer is correct: 44 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.13 Factor: 1.8 Top-ranking answer is correct: 43 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.12 Factor: 1.7 Top-ranking answer is correct: 38 A correct answer in the top 20: 76 Average rank for correct answers in the top 20: 3.39 Factor: 1.6 Top-ranking answer is correct: 34 A correct answer in the top 20: 74 Average rank for correct answers in the top 20: 3.22 Factor: 1.5 Top-ranking answer is correct: 29 A correct answer in the top 20: 73 Average rank for correct answers in the top 20: 3.38 Factor: 1.0 Top-ranking answer is correct: 1 A correct answer in the top 20: 65 Average rank for correct answers in the top 20: 7.74 Factor: 0.8 Top-ranking answer is correct: 1 A correct answer in the top 20: 52 Average rank for correct answers in the top 20: 10.25
Looking at the above, we learn a great deal:
(1) Increasing the capitalization factor to much above my original guess at a value (3.0) starts to significantly degrade the system.
(2) Removing the capitalization factor (i.e., setting it to 1.0) makes the system much, much worse. It still does okay at getting things approximately correct — for 65 of the questions a correct answer is in the top 20 returned responses. But for only 1 of the 100 questions is the top-ranked answer actually correct. In other words, we learn that paying attention to capitalization makes a huge difference to how likely our algorithm is to return a correct answer as the top result. It’s also interesting to note that even ignoring capitalization the algorithm does pretty well at returning a high-ranked answer that is correct.
(3) Actually penalizing capitalized answers (by choosing a capitalization factor of 0.8) makes things even worse. This is not surprising, but it’s useful to see explicitly, and further confirmation that paying attention to capitalization matters.
(4) The optimal parameter range for the capitalization factor depends on whether you want to maximize the number of perfect answers, in which case the capitalization factor should be 2.1-2.4, or to get the best possible rank, in which case the capitalization factor should be 1.8-2.0.
(5) Performance only varies a small amount over the range of capitalization factors 1.8-2.3. Even taking the capitalization factor out to its original value (3.0) decreases performance only a small amount.
Improving the scoring of proper nouns in the hybrid system: We’ve just used our evaluation suite to improve the value of the capitalization factor for the Google-based question-answering system. What happens when go through the same procedure for the hybrid question-answering system? A priori the optimal value for the capitalization factor may be somewhat different. I ran the evaluation algorithm for many different values of the capitaliation factor. Here are the results:
Factor: 2.6 Top-ranking answer is correct: 47 Factor: 2.5 Top-ranking answer is correct: 46 Factor: 2.4 Top-ranking answer is correct: 51 Factor: 2.3 Top-ranking answer is correct: 51 Factor: 2.2 Top-ranking answer is correct: 51 Factor: 2.1 Top-ranking answer is correct: 51 Factor: 2.0 Top-ranking answer is correct: 47 Factor: 1.9 Top-ranking answer is correct: 50 Factor: 1.8 Top-ranking answer is correct: 49 Factor: 1.7 Top-ranking answer is correct: 47
The pattern is quite similar to the Google-based system, with the highest scores being obtained for capitalization factors in the range 2.1-2.4. As a result, I’ve change the value for the capitalization factor from its original value of 3.0 to a somewhat lower value, 2.2.
Varying the weights of the rewrite rules: Recall from the last post that questions of the form “Who VERB w1 w2 w3…” got rewritten as Google queries:
"VERB w1 w2 w3 ... " "w1 VERB w2 w3 ... " "w1 w2 VERB w3 ... " ...
The candidate answers extracted from the search result for these rewritten queries were given a score of 5, following the scoring used in the original AskMSR research paper. Another rewrite rule formulated a single Google query which omitted both the quotes and VERB, i.e., it was just:
w1 w2 w3 ...
The candidate answers extracted from the search results for these rewritten queries were given a score of 2.
The scores 5 and 2 seem pretty arbitrary. Can we improve our system by changing the scores? Let’s experiment, and find out. To do the experiment, let me first mention that all that matters to the final ranking is the ratio of the two scores (initially, 5/2 = 2.5). So I’ll give the results of running our evaluation suite for different choices of that ratio:
Ratio: infinite Top-ranking answer is correct: 50 Ratio: 10.0 Top-ranking answer is correct: 50 Ratio: 6.0 Top-ranking answer is correct: 50 Ratio: 5.0 Top-ranking answer is correct: 50 Ratio: 4.0 Top-ranking answer is correct: 50 Ratio: 3.5 Top-ranking answer is correct: 51 Ratio: 3.0 Top-ranking answer is correct: 51 Ratio: 2.5 Top-ranking answer is correct: 51 Ratio: 2.0 Top-ranking answer is correct: 51 Ratio: 1.5 Top-ranking answer is correct: 51 Ratio: 1.0 Top-ranking answer is correct: 50 Ratio: 0.75 Top-ranking answer is correct: 49 Ratio: 0.0 Top-ranking answer is correct: 37
Again, we learn some interesting things:
(1) The performance is surprisingly insensitive to changes in this ratio.
(2) Even removing the quoted rewritten queries entirely (ratio = 0.0) only drops the performance from 51 to 37 questions answered correctly.
(3) Dropping the unquoted rewritten queries has almost no effect at all, dropping the number of correct answers from 51 to 50.
(4) Choosing the ratio to be 2.5, as was done in the original AskMSR paper, seems to provide good performance.
Two problems in the way we’ve used evaluation questions to improve the system
(1) It’s not clear that the evaluation questions are truly a random sample of well-posed “Who…?” questions. They’re simply a set of questions that happened to occur to me while I was working on the program, and they naturally reflect my biases and interests. In fact, the problem is worse than that: it’s not even clear what it would mean for the evaluation questions to be a random sample. A random sample from what natural population?
(2) There’s something fishy about using the same data repeatedly to improve the parameters in our system. It can lead to the problem of overfitting, where the parameters we end up choosing reflect accidental features of the evaluation set. A standard solution to this problem is cross-validation, dividing evaluation questions up into training data (which are used to estimate parameters) and validation data (which are used to check that overfitting hasn’t occurred).
Problem (2) could be solved with a much larger set of evaluation questions, and that’s a good goal for the future. If anyone knows of a good source of evaluation questions, I’d like to hear about it! It’s not at all clear to me how to solve problem (1), and I’d be interested to hear people’s ideas.
Automating: What we’ve done here is very similar to what people routinely do with statistical models or machine learning, where you use training data to find the best possible parameters for your model. What’s interesting, though, is that we’re not just varying parameters, we’re changing the algorithm. It’s interesting to learn that Wolfram Alpha and Google overlap more than you might a priori expect in the types of questions they’re good (and bad) at answering. And it tells you that effort invested in improving one or the other approach might have less impact than you expect, unless you’re careful to aim your improvements at questions not already well-answered by the other system.
- Find three ways of modifying the hybrid algorithm, and use the evaluation algorithm to see how well the modifications work. Can you improve the algorithm so it gets more than 51 of the evaluation questions exactly right?
Acknowledgements: Thanks to Nathan Hoffman for gently suggesting that it might be a good idea to cache results from my screen-scraping!