It’s a Poor Workman Who Blames Yogi Berra: Artificial Intelligence and Jeopardy!

Last week, an IBM computer named Watson beat Ken Jennings and Brad Rutter, the two greatest Jeopardy! players of all time, in a nationally televised event. The Man vs. Machine construct is a powerful one (I’ve even used it myself), as these contests have always captured progressive imaginations. Are humans powerful enough to build a rock so heavy, not even we can lift it?

Watson was named for Thomas J. Watson, IBM’s first president. But he could just as easily have been named after John B. Watson, the American psychologist who is considered to be the father of behaviorism. Behaviorism is a view of psychology that disregards the inner workings of the mind and focuses only on stimuli and responses. This input leads to that output. Watson was heavily influenced by the salivating dog experiments of Ivan Pavlov, and was himself influential in the operant conditioning experiments of B.F. Skinner. Though there are few strict behaviorists today, the movement was quite dominant in the early 20th century.

The behaviorists would have loved the idea of a computer playing Jeopardy! as well as a human. They would have considered it a validation of their theory that the mind could be viewed as merely generating a series of predictable outputs when given a specific set of inputs. Playing Jeopardy! is qualitatively different from playing chess. The rules of chess are discrete and unambiguous, and the possibilities are ultimately finite. As Noam Chomsky argues, language possibilities are infinite. Chess may one day be solved, but Jeopardy! never will be. So Watson’s victory here is a significant milestone.

Much has been made of whether or not the contest was “fair.” Well, of course it wasn’t fair. How could that word possibly have any meaning in this context? There are things computers naturally do much better than humans, and vice versa. The question instead should have been in which direction the unfairness would be decisive. Some complained that the computer’s superior buzzer speed gave it the advantage, but buzzer speed is the whole point.

Watson has to do three things before buzzing in: 1) understand what question the clue is asking, 2) retrieve that information from its database, and 3) develop a sufficient confidence level for its top answer. In order to achieve a win, IBM had to build a machine that could do those things fast enough to beat the humans to the buzzer. Quick reflexes are an important part of the game to be sure, but if that were the whole story, computers would have dominated quiz shows decades ago.
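
Just to make that three-step picture concrete, here is a minimal sketch of what a buzz-or-pass decision could look like in code. To be clear, this is my own toy illustration, not IBM’s DeepQA architecture; the function names, the stub data, and the 50% threshold are all invented.

```python
# A minimal, hypothetical sketch of the buzz-or-pass decision described above.
# Nothing here reflects IBM's actual DeepQA code; the function names, the stub
# data, and the 0.5 threshold are invented for illustration.

from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    confidence: float  # 0.0 to 1.0

def interpret(clue: str) -> str:
    """Step 1: figure out what the clue is actually asking (stubbed out here)."""
    return clue  # a real system would produce some structured query

def retrieve(query: str) -> list[Candidate]:
    """Step 2: pull candidate answers from the knowledge base (stubbed out here)."""
    return [Candidate("keratin", 0.99), Candidate("porcupine", 0.36), Candidate("fur", 0.08)]

def decide_to_buzz(clue: str, threshold: float = 0.5) -> Candidate | None:
    """Step 3: buzz in only if the top candidate clears the confidence threshold."""
    candidates = retrieve(interpret(clue))
    best = max(candidates, key=lambda c: c.confidence)
    return best if best.confidence >= threshold else None

clue = ("Hedgehogs are covered with quills or spines, "
        "which are hollow hairs made stiff by this protein")
print(decide_to_buzz(clue))  # Candidate(answer='keratin', confidence=0.99)
```

The point of the sketch is only that all three steps have to finish before a human thumb hits the button, which is why speed and comprehension can’t really be separated.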

To my way of thinking, it’s actually the comprehensive database of information that gives Watson the real edge. We may think of Ken and Brad as walking encyclopedias, but that status was hard earned. Think of the hours upon hours they must have spent studying classical composers, vice-presidential nicknames, and foods that start with the letter Q. Even a prepared human might temporarily forget the Best Picture Oscar winner for 1959 when the moment comes, but Watson never will. (It was Ben-Hur.)

In fact, from what I could see, Watson’s biggest challenge seemed to be understanding what the clue was asking. To avoid the complications introduced by Searle’s Chinese Room thought experiment, we’ll adopt a behaviorist, pragmatic definition of “understanding” and take it to mean that Watson is able to give the correct response to a clue, or at least a reasonable guess. (After all, you can understand a question and still get it wrong.) Watching the show on television, we are able to see Watson’s top three responses, and his confidence level for each. This gives us remarkable insight into the machine’s process, allowing us a deeper level of analysis.

A lot of my own work lately has been in training school-based data inquiry teams to examine testing data and learn where students need extra help, and that work involves looking closely at individual test items. So naturally, when I see three responses to a prompt, I want to figure out what they mean. In this case, Watson was generating the choices rather than simply choosing among them, but that actually makes them more helpful in sifting through his method.

One problem I see a lot in schools is that students are often unable to correctly identify what kind of answer the question is asking for. Inasmuch as Watson has what we would call a student learning problem, this is it. When a human is asked to come up with three responses to a clue, all three will presumably be of the correct answer type. See if you can come up with three possible responses to this clue:

Category: Hedgehog-Podge
Clue: Hedgehogs are covered with quills or spines, which are hollow hairs made stiff by this protein

Watson correctly answered Keratin with a confidence rating of 99%, but his other two answers were Porcupine (36%) and Fur (8%). I would have expected all three candidate answers to be proteins, especially since the words “this protein” ended the clue. In many cases, the three potential responses seemed to reflect three possible questions being asked rather than three possible answers to a correct question, for example:

Category: One Buck or Less
Clue: In 2002, Eminem signed this rapper to a 7-figure deal, obviously worth a lot more than his name implies

Ken was first to the buzzer on this one and Alex confirmed the correct response, both men pronouncing 50 Cent as “Fiddy Cent” to the delight of humans everywhere. Watson’s top three responses were 50 Cent (39%), Marshall Mathers (20%), and Dr. Dre (14%). This time, the words “this rapper” prompted Watson to consider three rappers, but not three potential rappers that could have been signed by Eminem in 2002. It was Dr. Dre who signed Eminem, and Marshall Mathers is Eminem’s real name. So again, Watson wasn’t considering three possible answers to a question; he was considering three possible questions. And alas, we will never know if Watson would have said “Fiddy.”
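
To make the answer-type idea concrete, here is a toy sketch of what filtering candidates by the type named in the clue (“this protein,” “this rapper”) could look like. Again, this is my own invented illustration; the tiny type lexicon and the function names are made up, and nothing here is a claim about how Watson actually worked internally.

```python
# A toy illustration of filtering candidate answers by the answer type named
# in the clue ("this protein", "this rapper"). The tiny type lexicon below is
# invented for the example and is obviously nowhere near complete.

TYPE_LEXICON = {
    "protein": {"keratin", "collagen", "myosin"},
    "rapper": {"50 cent", "dr. dre", "marshall mathers"},
}

def expected_type(clue: str) -> str | None:
    """Look for a 'this <type>' phrase in the clue and return the type."""
    words = clue.lower().split()
    for i, word in enumerate(words[:-1]):
        if word == "this" and words[i + 1].strip(".,") in TYPE_LEXICON:
            return words[i + 1].strip(".,")
    return None

def filter_by_type(candidates: list[str], clue: str) -> list[str]:
    """Keep only the candidates that match the type the clue asks for."""
    answer_type = expected_type(clue)
    if answer_type is None:
        return candidates  # no type detected; keep everything
    allowed = TYPE_LEXICON[answer_type]
    return [c for c in candidates if c.lower() in allowed]

clue1 = ("Hedgehogs are covered with quills or spines, which are hollow "
         "hairs made stiff by this protein")
print(filter_by_type(["Keratin", "Porcupine", "Fur"], clue1))  # ['Keratin']

clue2 = ("In 2002, Eminem signed this rapper to a 7-figure deal, "
         "obviously worth a lot more than his name implies")
print(filter_by_type(["50 Cent", "Marshall Mathers", "Dr. Dre"], clue2))
# All three pass the type check (they are all rappers), which is exactly the
# behavior a human would show, and Watson apparently didn't rely on.
```
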

It seemed as though the more confident Watson was in his first guess, the more likely the second and third guesses would be way off base:

Category: Familiar Sayings
Clue: It’s a poor workman who blames these

Watson’s first answer Tools (84%) was correct, but his other answer candidates were Yogi Berra (10%) and Explorer (3%). However Watson is processing these clues, it isn’t the way humans do it. The confidence levels seemed to be a pretty good predictor of whether or not a response was correct, which is why we can forgive Watson his occasional lapses into the bizarre. Yeah, he put down Toronto when the category was US Cities, but it was Final Jeopardy!, where answers are forced, and his multiple question marks were an indicator that his confidence was low. Similarly cornered in a Daily Double, he prefaced his answer with “I’ll take a guess.” That time, he got it right. I’m just looking into how the program works, not making excuses for Watson. After all, it’s a poor workman who blames Yogi Berra.

But the fact that Watson interpreted so many clues accurately was impressive, especially since Jeopardy! clues sometimes contain so much wordplay that even the sharpest of humans need an extra moment to unpack what’s being asked, and understanding language is our thing. Watson can’t hear the other players, which means he can’t eliminate their incorrect responses when he buzzes in second. It also means that he doesn’t learn the correct answer unless he gives it, which makes it difficult for him to catch on to category themes. He managed it pretty well, though. After stumbling blindly through the category “Also on Your Computer Keys,” Watson finally caught on for the last clue:

Category: Also on Your Computer Keys
Clue: Proverbially, it’s “where the heart is”

Watson’s answers were Home is where the heart is (20%), Delete Key (11%), and Elvis Presley quickly changed to Encryption (8%). The fact that Watson was considering “Delete Key” as an option means that he was starting to understand that all of the correct responses in the category were also names of keys on the keyboard.

Watson also is not emotionally affected by game play. After giving the embarrassingly wrong answer “Dorothy Parker” when the Daily Double clue was clearly asking for the title of a book, Watson just jumped right back in like nothing had happened. A human would likely have been thrown by that. And while Alex and the audience may have laughed at Watson’s precise wagers, that was a cultural expectation on their part. There’s no reason a wager needs to be rounded off to the nearest hundred, other than the limitations of human mental calculation under pressure. This wasn’t a Turing test. Watson was trying to beat the humans, not emulate them. And he did.
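
For what it’s worth, the exact-dollar figures fall straight out of arithmetic. Here is a sketch of the textbook wagering bounds a leader might compute before Final Jeopardy!; the scores are made up, and this is standard fan arithmetic, not a claim about how Watson’s actual wagering module worked.

```python
# A sketch of the textbook Final Jeopardy! wagering bounds for the leader.
# The scores below are made up, and this is standard fan arithmetic, not a
# claim about how Watson's actual wagering logic worked.

def leader_wager_bounds(leader: int, second: int) -> tuple[int, int]:
    """Return (min_wager_to_win_if_right, max_wager_to_survive_if_wrong).

    The first number is the smallest wager that still beats a doubled
    second-place score when the leader answers correctly; the second is the
    largest wager the leader can lose and still stay ahead of a doubled
    second-place score (negative means no wager is safe).
    """
    min_if_right = max(0, 2 * second - leader + 1)
    max_if_wrong = leader - 2 * second - 1
    return min_if_right, max_if_wrong

# With made-up scores of $23,400 and $13,700, the "right" wager is an exact,
# unrounded figure: bet at least $4,001 to win outright if correct, and no
# wager at all is safe if wrong (the -4001 signals that).
print(leader_wager_bounds(23400, 13700))  # (4001, -4001)
```
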

So where does that leave us? Computers that can understand natural language requests and retrieve information accurately could make for a very interesting decade to come. As speech recognition improves, we might start to see computers that can hold up their end of a conversation. Watson wasn’t hooked up to the Internet, but developing technologies could be. The day may come when I have a Bluetooth headset hooked up to my smartphone and I can just ask it questions like the computer on Star Trek. As programs get smarter about interpreting language, it may be easier to make connections across ideas, creating a new kind of Web. One day, we may even say “Thank you, Autocorrect.”

It’s important to keep in mind, though, that these will be human achievements. Humans are amazing. Humans can organize into complex societies. Humans can form research teams and develop awesome technologies. Humans can program computers to understand natural language clues and access a comprehensive database of knowledge. Who won here? Humanity did.

Ken Jennings can do things beyond any computer’s ability. He can tie his shoes, ride a bicycle, develop a witty blog post comparing Proust translations, appreciate a sunset, write a trivia book, raise two children, and so on. At the end of the tournament, he walked behind Watson and waved his arms around to make it look like they were Watson’s arms. That still takes a human.

UPDATE: I’m told (by no less of an authority than Millionaire winner Ed Toutant) that Watson was given the correct answer at the end of every clue, after it was out of play. I had been going crazy wondering where “Delete Key” came from, and now it makes a lot more sense. Thanks, Ed!

5 Responses to “It’s a Poor Workman Who Blames Yogi Berra: Artificial Intelligence and Jeopardy!”

  1. Shakespeare Geek Says:

    Ken Jennings would be the first to tell you that Watson won because of its physical ability to click in faster than any human. It’s a reasonably safe bet, though we’ll never know for sure, that both human contestants knew most, if not all, of the answers that Watson beat them to.

    Think of it like this. Say that three humans of equal caliber are playing. Now imagine that, instead of manually clicking in, Ken Jennings is given a box that will automatically click him in as soon as the buzzer is turned on. He will win the game. But what did that prove? That’s hardly a fair contest. Tell all three contestants that they are allowed to bring an auto-clicker box of their own design, and then maybe you’ve got competition. But in this particular example, Jeopardy pitted two oranges against an apple in a game of “Which of you tastes most like an apple?”

    A demonstration like this makes me wonder what exactly we just witnessed. Once upon a time we would have called it artificial intelligence, but our standards for that game have changed. Now we see it as a parsing and searching experiment. Imagine if chess were solved – that is, if the entire tree of possible chess moves could be calculated and stored. The computer would never lose (assuming that a perfect game existed). But would that be intelligence?

    I’d like to see more work done on getting computers to exhibit intelligence by taking the same path that humans do, and not just leaping to a simulation of the same end result. You don’t get to just plug the works of Shakespeare into it – you make it read Shakespeare and do the work itself, and all that implies. Then you take that same engine – you did make it general knowledge, right? and not Shakespeare specific? – and feed it the history of the Academy Awards, and let it figure out how to store that knowledge. And so on for as many books as you can get your hands on.

    Or, here’s an even more interesting example. Give the computer a novel to read. Give that same novel to a human. Then give them both a test on it, including essay questions. At a fixed time later (so the computer doesn’t give itself away by finishing in 30 seconds), have a panel of professors grade both tests, without knowing who wrote which one. See who gets the better grade.
    Repeat a few times with a few different books. That, to me, more closely approaches the spirit of the original Turing Test.

  2. Bill Says:

    But this wasn’t intended to be a Turing Test. Nor was it meant to be the final word on man vs. machine intelligence. The goal was to build a machine that could win a game of Jeopardy! and not anything else. Therefore, nothing beyond that should be read into the accomplishment.

    But the accomplishment itself is noteworthy.

    Imagine that you and a friend go to see a taping of Jeopardy!, and it turns out that one of the contestants is a cat. What? A cat? That’s crazy! But then the game starts, and it turns out that the feline contestant can bring it. By the end of the game, the new Jeopardy! champion has whiskers and a tail.

    “Wow,” you say to your friend, “that’s incredible.” Your friend shrugs. “Well, of course the cat’s going to win. He’s a cat. He has faster reflexes and will beat the humans to the buzzer every time.” You’re flabbergasted. “Dude, IT’S A CAT. How does a cat even know that much about opera anyway?” Your friend points out that the human players almost certainly knew more of the answers than your furry friend, and reminds you of how the cat thought that Benjamin Franklin was a US President. “If the cat hadn’t been so quick to the buzzer, it would have had no chance against a human player. I’m not impressed.”

    You may not find that a convincing analogy, but that’s how I see Watson, albeit to a much lesser degree, obviously. When I first heard that Ken and Brad were playing against a computer, I thought it was absolutely ridiculous. In my mind, AI wasn’t nearly there yet. And then I saw it. And it didn’t look like anything I had thought it would.

    Was reaction time the decisive factor? Probably. But the accomplishment here wasn’t that the computer beat the humans to the buzzer. The accomplishment was that the computer was just good enough at responding to Jeopardy! clues that reaction time could be the decisive factor.

  3. Shakespeare Geek Says:

    I know what you’re saying, and I agree, I just think that the accomplishment was not that big a deal given the context. This has often been the case for AI projects – they choose a single task, design themselves exclusively for that task, and then win at it. But what advancements were made, really? How are they applied going forward, and how is it different from what Google does every day?

    You get a database of the few hundred thousand ways Jeopardy! has posed questions over the years. You do some pattern recognition so that you can group them and reduce the structure of the sentences down to things like “When the question is phrased like X, they’re probably looking for a Y.” This is hardly different from Google knowing that when you say “translate blah into French” it should kick you over to translate.google.com.

    What they ended up making was a slightly-less-general-purpose google-in-a-box. Storage and speed were nothing, because with no constraint on budget and resources you could always add hardware to it until it held as much knowledge and ran as fast as you wanted.

    I’m not sure how your cat analogy works, because that would be the first talking cat I’d seen. Watson is hardly the first computer capable of spotting a pattern in the structure of a question and then performing a search. Watson, while amusing to watch, was really no more impressive than Google is when you type in stuff like “What time is it in Malaysia?”

  4. Bill Says:

    The purpose of the cat analogy was to illustrate how dwelling on response time is – ahem – a red herring.

    See what I did there? I used “cat” and “herring” in the same sentence, and to acknowledge the unfortunate pun, I inserted a throat-clearing sound. From a human, this might elicit a smile or a groan; from a computer, likely confusion. Illustrate? Dwelling? Ahem?

    Language is infinitely complex. Google can answer simple questions, as long as you are careful to use the handful of pre-programmed phrases that communicate the exact nature of the question to the magical sprites who live in the computer. If you use your own language, Google can usually give you back a list of sites which have a good chance of containing the answer to your question. But it takes a human to read those sites and pick the answer out of the text. Google doesn’t claim to be able to do that.

    Watson does. He reads complex clues written on their own terms (not his), picks the most likely response, and develops a confidence rating. That’s new. And if the technology exists to break down intricate Jeopardy! clues with no Internet access, how far are we from a computer that can research basic information for us when we ask it to?

    There may even emerge a slew of online databases formatted specifically for these kinds of automated queries, and in a variety of disciplines. Imagine having a conversation with WebMD as it asks you specific questions about the nature of your symptoms. Similar databases could consult with lawyers about precedents, doctoral candidates about prior research conducted in their fields, and law enforcement personnel about potential suspects.

    Watson’s victory isn’t what makes all of this possible. All Watson can do is play Jeopardy! well. But it is a milestone in programming computers to respond to natural language prompts. This is a task that makes the complexities of chess pale in comparison. It is precisely because language is so complex that the veneration of Shakespeare makes any sense.

    We can agree to disagree on the scope of this particular accomplishment. I expect there will be many more to come.

  5. Ian Berger Says:

    I love the concept here of a blog dedicated to the subject of teaching Shakespeare, one of my favorite subjects to teach. As a middle-school teacher I’m a little more limited in the Shakespeare plays I can teach, but I manage a great unit with either “Much Ado About Nothing”, “The Tempest”, or “Twelfth Night”. My administration even gives me the support to create a full-fledged performance for the school and then the community.

    My point of view on teaching Shakespeare is that it must be dynamic and performance-based. When I teach the plays, students must get up in class and act as part of their unit grade. By the time we do the actual performance, the kids are already very used to speaking the language out loud. We do read the plays, of course, but we read them as a “play”, not a piece of literature.

    I’ve been blogging about teaching my other favorite subject, fantasy and science fiction literature. Lately I’ve been writing about “The Tempest” and its links to both fantasy and science fiction. Here’s my link:

    http://www.teachthefantastic.blogspot.com

    Best regards!

    -Ian
