Why Google Duplex Did Not Pass the Turing Test


Google’s Duplex Artificial Intelligence (AI) bot has recently received a lot of press. Watch a video of the demonstration here. The Duplex bot calls a business, such as a restaurant, on your behalf and makes a reservation. It does this by actually having a conversation with the human on the other end of the line. The voice is totally realistic, and in test cases the humans receiving the call were not aware that they had been talking to an AI bot.

We’re going to save for another day the potential ethical issues raised by AI bots fooling people into believing they are talking to a person. Instead, let’s focus on whether or not Duplex has successfully passed the Turing test as some have suggested.

A QUICK REVIEW OF THE TURING TEST

The Turing test is a famous set of criteria for assessing whether AI has become realistic enough to fool humans into thinking that they are interacting with another person. You can find information on the Turing test here. The crux of the test is the ability of an AI process to seem human to a human.

By the precise letter of the rules, one could argue that Duplex passed the Turing test. But I don’t believe that it passed in the spirit of the rules. To me, the spirit of the Turing test is having an AI bot hold a typical, rambling, somewhat random conversation with a human and pull it off. For all the success of Duplex in scheduling restaurant reservations, it would fail miserably in this more general test.

CONTEXT IS EVERYTHING

In the world of AI, context is everything. There are already many examples of AI processes that initially seem quite smart but that have some major issues. See here and here for two discussions on this topic. Bias can accidentally be built into AI models through skewed training data. An equally big issue is that AI is only accurate in the exact context in which it is trained.

I often use the example of two AI processes that are taught to identify two different scenarios from a photo. One is taught to determine if someone is just about to hit a tennis ball. The other is taught to determine if someone has just hit a tennis ball. When shown a picture of a person swinging a tennis racquet next to a ball, both models will claim, with very high confidence, to see what they are looking for.

However, as humans, we know that while the picture may be ambiguous, only one answer is possible. Either the person just hit the ball, OR they are just about to hit the ball. Both can’t be true simultaneously. Further, it may not be possible to tell from the picture which is correct. The AI bots don’t know this and don’t understand that context. So, they give answers that individually seem excellent, but that are clearly incorrect when taken together.
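To make the tennis example concrete, here is a minimal sketch in Python. The raw scores are made-up numbers, not output from any real model, and the two "models" are reduced to a single sigmoid each. The point is only to show how two independently trained detectors can both report high confidence on the same ambiguous photo, and how a model that is told the two outcomes are mutually exclusive (an assumption added here for contrast) would instead split its confidence.

```python
import math

def sigmoid(x: float) -> float:
    """Turn a raw score into an independent probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Turn raw scores into probabilities that compete and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores each model assigns to the same ambiguous photo.
about_to_hit_score = 3.0
just_hit_score = 3.0

# Two separate models: each answers its own question in isolation.
p_about = sigmoid(about_to_hit_score)
p_just = sigmoid(just_hit_score)

# Taken together the answers are contradictory: the two events cannot
# both be true, yet the confidences add up to well over 100%.
contradictory = p_about + p_just > 1.0

# One model that knows the outcomes are mutually exclusive has to split
# its confidence, which exposes the ambiguity instead of hiding it.
joint = softmax([about_to_hit_score, just_hit_score])

print(f"about to hit: {p_about:.2f}, just hit: {p_just:.2f}")
print(f"contradictory: {contradictory}")
print(f"joint view: {joint[0]:.2f} vs {joint[1]:.2f}")
```

Neither separate model is wrong about what it was trained to see. The contradiction only appears once a human, or a system that carries the broader context, looks at both answers side by side.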

THE SHORTCOMING OF DUPLEX

This gets to the heart of why Duplex really hasn’t passed the Turing test in spirit. Yes, Duplex can successfully fool a restaurant hostess when making a reservation. However, the scope and context of that conversation are very, very limited. If the hostess asked an unexpected question or used some unusual slang, the bot would have no idea what to do.

The scope of a typical dinner reservation conversation is so small that you could almost pull it off with a range of more classic business logic. After all, you’re really just looking for a date and time in the request, then either confirming a reservation if the slot is available or offering back alternatives if it is not. Once the alternatives are offered, the person either accepts one or declines. In practice, a rules-based system could probably handle a conversation of this scope almost as well as an AI system.
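As a rough illustration of how little logic that conversation needs, here is a minimal rules-based sketch in Python. It is not how Duplex works; the availability table, the function names, and the fixed "day plus HH:MM" phrasing it expects are all hypothetical, chosen only to keep the example short.

```python
import re
from typing import Optional, Tuple

# Hypothetical availability table: (day, time) -> is the slot free?
AVAILABLE = {
    ("friday", "19:00"): False,
    ("friday", "19:30"): True,
    ("friday", "20:00"): True,
}

DAY_PATTERN = r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
TIME_PATTERN = r"\b(\d{1,2}):(\d{2})\b"

def parse_request(text: str) -> Optional[Tuple[str, str]]:
    """Pull a (day, time) pair out of the request, or None if either is missing."""
    day = re.search(DAY_PATTERN, text.lower())
    time = re.search(TIME_PATTERN, text)
    if not day or not time:
        return None
    return day.group(1), f"{int(time.group(1)):02d}:{time.group(2)}"

def respond(text: str) -> str:
    """Confirm the slot, offer alternatives, or give up on anything else."""
    slot = parse_request(text)
    if slot is None:
        # Anything outside the expected script lands here.
        return "Sorry, I did not understand. What day and time would you like?"
    day, time = slot
    if AVAILABLE.get(slot, False):
        return f"Confirmed for {day} at {time}."
    alternatives = sorted(
        t for (d, t), free in AVAILABLE.items() if d == day and free
    )
    if alternatives:
        return f"That slot is taken. Available on {day}: {', '.join(alternatives)}."
    return f"Sorry, nothing is available on {day}."

print(respond("A table for Friday at 19:30, please"))
print(respond("How about Friday at 19:00?"))
print(respond("Do you guys validate parking?"))
```

The third request shows the limitation. The moment the caller steps outside the expected date-and-time script, the rules have nothing to fall back on.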

In reality, today’s AI bots will likely learn to navigate the typical discussions held around making a reservation more rapidly, accurately, and completely than rules-based systems. My point is not to suggest that we should stop making progress with AI chat bots, but simply to point out that the success of Duplex isn’t as amazing as it might at first seem once the narrow scope is taken into account.

PASSING THE TURING TEST WITHIN A GENERAL CONTEXT

In order for Duplex or similar bots to pass the Turing test in spirit, they’ll need to handle much more than a discussion that fits perfectly within a tightly predefined and expected context. The bots will also need to handle any random thoughts and requests that a person might state without locking up or giving nonsense responses. Duplex’s feat, while impressive, is successful within a very narrow context.

An important point to take away here is that as AI evolves and we read or hear about impressive new achievements, we must be sure to consider the context in which the models were built and tested. It is far easier to spoof people in a narrow context than it is to have a truly free-flowing and spontaneous discussion. One day we’ll likely achieve the latter, but today we’ve only achieved the former.

Bill Franks