The gateway between you and your AI assistant is the Wake Word. It’s a small thing that a lot of people take for granted.
But it is no small thing to create an algorithm that’s always listening for a particular utterance that lasts less than a second, that can be said by anybody, that can run on everything from a wristwatch to a car, and that maintains the privacy and security of everyone it can hear while it’s doing its job.
Wake word technology is a specific branch of AI, with its own unique challenges. For companies like Cisco, which are developing AI assistant products, there are teams of people who work on nothing but tools to extract these sub-second signals from a continuous stream of noise.
Here are some of their biggest challenges.
1. Wake word technology has to run on everything. Except the cloud.
We expect our watches to respond to our voice, along with our phones, laptops, cars, and big communication devices like dual-70” screen room systems.
On small devices, the environment the wake word system has to deal with is relatively simple. A watch or smartphone can assume that only one person will be speaking to it at a time, and that the person will be fairly close. Room-based systems have a more complicated challenge: they have to pick out vocalizations from multiple overlapping speakers who might be far away, and in acoustically complex spaces like big conference rooms.
In all cases, the system has to respond very quickly, which means the processing has to happen locally. The system can’t stream what its microphone hears to a cloud service continuously; the round-trip lag and the cloud-based processing would slow down wake word recognition enough to impact the user experience.
More importantly, no one wants everything they say to be sent over the Internet to a cloud service, where it might be recorded, analyzed, or possibly stolen. Using a cloud service to pick wake words out of an always-on audio stream is a security risk.
2. Wake word technology requires specialized hardware
In some ways, it is more difficult to build accurate wake word technology that works at a distance than it is to do continuous speech recognition. To trigger on a wake word — and nothing else — you need to grab a clear audio signal in a very small time window. On the input side, that means using microphone arrays or some other method to surgically extract potential wake word utterances from other noise in the room. Humans are very good at picking out a single voice in a crowd or at a distance. Machines using just standard microphones, much less so. And in meeting rooms, where several people might be talking at once, it’s that much harder to make this work.
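To make the microphone array idea concrete, here is a minimal sketch of delay-and-sum beamforming, one common way arrays favor sound from a chosen direction. It is an illustration only, not a description of any particular product. The function name `delay_and_sum` and the whole-sample delays are assumptions made for the example; real systems estimate delays from the array geometry and usually work with fractional delays.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its delay, then average the result.

    channels: list of equal-length lists of samples, one list per microphone.
    delays:   how many samples late the target sound arrives at each microphone
              (hypothetical values; a real array would estimate these).
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for samples, d in zip(channels, delays):
            idx = t + d  # read ahead to undo this microphone's arrival delay
            if 0 <= idx < n:
                total += samples[idx]
        out.append(total / len(channels))
    return out


# The same click reaches the second microphone two samples later.
mic_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
steered = delay_and_sum([mic_a, mic_b], [0, 2])    # aligned: the click adds up
unsteered = delay_and_sum([mic_a, mic_b], [0, 0])  # misaligned: the click is smeared
```

Sound arriving with the expected delays adds up at full strength, while sound from other directions is averaged down. That is the basic reason an array can hear a distant talker better than a single microphone can.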
And obviously, your wake word process needs to be running continuously, always analyzing the last second or so of sound for its cue. It’s not so difficult to find the processing cycles for this on a device with excess power, like a wall-mounted video system or a car, but running the AI continuously on a handheld, battery-powered device requires special algorithm tuning or hardware, or both. Apple, for example, runs its wake word process on the iPhone’s “Always On Processor,” a “low-power auxiliary processor,” which in turn is embedded in the Motion Coprocessor. Keeping the phone’s main CPU running constantly just to listen for the wake word would use too much power.
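The "always analyzing the last second or so of sound" loop can be sketched roughly as follows. This is a simplified illustration, not how Apple or anyone else implements it. The sample rate, the hop size, the threshold, and the `score_window` callable (standing in for the trained model) are all assumptions.

```python
from collections import deque

SAMPLE_RATE = 16_000       # assumed sample rate, in Hz
WINDOW = SAMPLE_RATE       # keep roughly the last second of audio
HOP = SAMPLE_RATE // 10    # assumed: re-score ten times per second


def listen(frames, score_window, threshold=0.5):
    """Yield the running sample count each time a detection fires.

    frames:       iterable of lists of samples (HOP samples each in this sketch).
    score_window: hypothetical callable mapping a list of samples to a score
                  between 0 and 1; a real system would run its neural net here.
    """
    ring = deque(maxlen=WINDOW)  # oldest samples drop off automatically
    seen = 0
    for frame in frames:
        ring.extend(frame)
        seen += len(frame)
        if len(ring) == WINDOW and score_window(list(ring)) >= threshold:
            yield seen
            ring.clear()  # don't re-trigger on the same utterance
```

Only the most recent window of audio is ever held in memory, and nothing leaves the device unless the detector fires. The cost that matters on a battery is how often `score_window` runs and how expensive it is, which is why tuning or dedicated low-power hardware comes into play.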
3. Diversity is quality
Wake word algorithms are based on neural nets, which need to be trained. The more data you provide for training, the better they’ll be. It’s not enough to just provide a lot of data, though. The datasets must be diverse. If training data is just men speaking, even if it’s millions of them, there will be more errors when women try to invoke the system. You’ll have the same problem if people with different accents or native languages try to use your system, and your training set didn’t include them. More diverse training cohorts make for better AIs — for everyone.
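One way to see whether a training set is diverse enough is to measure errors per group instead of as a single overall number. The sketch below computes a false-reject rate (genuine wake word attempts that the system missed) for each group. The function name and the group labels are hypothetical, and the trial data is made up purely to show the shape of the calculation.

```python
from collections import defaultdict


def false_reject_rate_by_group(trials):
    """Return {group: fraction of genuine wake word attempts that were missed}.

    trials: iterable of (group, detected) pairs, one per genuine attempt.
    """
    attempts = defaultdict(int)
    misses = defaultdict(int)
    for group, detected in trials:
        attempts[group] += 1
        if not detected:
            misses[group] += 1
    return {group: misses[group] / attempts[group] for group in attempts}


# Made-up trial log: group labels are placeholders, not real measurements.
trials = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", False), ("group_b", True),
]
rates = false_reject_rate_by_group(trials)
```

A system can look fine on average while one group's rate is several times worse than another's, so the per-group breakdown is the number to watch.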
I could put some examples here showing where people-recognizing algorithms from the past have ended up favoring one group over another, but they are so cringe-worthy I don’t even want to link to them. Diverse training sets are required for building AI algorithms.
4. There are good wake words and bad wake words
There’s a subset of wake word theory that overlaps with linguistics, which is about creating the best wake words for the algorithms to pick out. Wake words need to be long enough to be distinct, but not so long that they vary a lot between speakers. They also need combinations of phonemes that are both easy to discern by a machine, and easy to say by a human. And what’s easy to say will vary based on which languages a speaker is comfortable with.
A good wake word has an uncommon collection of phonemes, with both fricatives (hissing, friction-based consonants like “s” and “f”) and distinct vowels. There are numerous things to avoid, at least in theory, like preceding a “sh” sound (technically, the unvoiced post-alveolar fricative) with an “h” as in “hey” (a voiceless glottal fricative).
5. Many of the wake words we use are terrible — thank goodness
Engineers don’t have the last say on a product’s wake word, though. The wake word is a huge part of a product’s brand, and marketers, lawyers, and other non-technical people have more say in what a product is named. Voice assistants’ invocation names need to be memorable and on-brand.
When engineers and marketing disagree on naming, marketing usually wins. Engineers don’t always get the wake words they want. But they make them work. I have seen that the challenges imposed on wake word coders for non-technical reasons actually make for more robust code.
But this and the two other reasons above are why you can’t change your smartphone’s wake word to anything you want. (See also this story from Salon: Don’t call it “Siri”: Why the wake word should be “computer.”)
6. Want to make your own wake word technology? There’s an app for that, sort of
If you’re a developer and you want to make an AI product that starts up when you call its name, you can use code available from several sources, like Snowboy, Sensory, and other companies.
These tools will get you off the ground, but when building an AI-powered tool, the much bigger challenge is providing it with training data. It takes a well-funded team to recruit the thousands of people needed to provide the voice recordings and additional human training time to coach a wake word neural net until it works well. These tools aren’t as easy to come by for a smaller company or a hobbyist. At least not today.
7. Wake words are a blip in time
Wake words won’t always be the sole method to get the attention of a speech-recognizing AI. Over time, we will have new invocation mechanisms, like intention based on context, intonation, gaze direction, a “wake wave,” and so on.
Wake word algorithms will also advance to be able to hold on to the conversational state for longer, just like one person does when talking to another person: If I start a conversation with Kathy, I don’t have to start every single sentence to her with her name: “Kathy, good morning. Kathy, want to go get some coffee with me? Kathy, how are your kids?”
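Holding on to conversational state can be as simple as a follow-up window: once the assistant has been addressed, it keeps listening for a short time, and each handled utterance renews that time. The sketch below is one hypothetical way to model it. The class name `Attention` and the eight-second window are assumptions for illustration, not a description of any shipping assistant.

```python
class Attention:
    """Tracks whether speech should be treated as addressed to the assistant."""

    def __init__(self, follow_up_seconds=8.0):  # assumed window length
        self.follow_up_seconds = follow_up_seconds
        self.awake_until = float("-inf")  # asleep until the wake word is heard

    def on_wake_word(self, now):
        """Open the follow-up window when the wake word is detected."""
        self.awake_until = now + self.follow_up_seconds

    def on_utterance(self, now):
        """Return True if the utterance should be handled; renew the window if so."""
        if now <= self.awake_until:
            self.awake_until = now + self.follow_up_seconds
            return True
        return False


attention = Attention()
attention.on_wake_word(now=1.0)            # "Kathy, good morning."
handled = attention.on_utterance(now=5.0)  # follow-up without repeating the name
```

Times here are plain seconds passed in by the caller, which keeps the sketch easy to test. A real system would also need a way to tell when the speaker has turned to talk to someone else.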
We will likely have wake words for the foreseeable future. But for some commands, like “turn on the lights,” saying them in a commanding tone might be enough to trigger the speech recognizer without requiring a specific wake word. In other cases, a more robust always-listening AI will be able to know when to chime in based on what you’re talking about (although we can’t build this without also working extremely hard on keeping our human conversations secure and private).
Waking up to the challenge
Everyone working on AI assistants is applying these lessons to their products. We just released a bot with wake-word technology (Spark Assistant) but we’re also going to roll out face recognition to tune and personalize the responsiveness of our AI. We have built a “wake wave” invocation for our always-on video connection experiment, TeamTV. We are also working on “eye contact wake” in Spark Assistant VR.
There’s still a lot to learn. We are at a very interesting point in the development of how we co-exist with AI-powered assistants. What do you think? Drop a note in the comments.