Speech Recognition – Some Basic Q&A
With the commercial release of products like Ford Sync, Xbox Kinect, and the iPhone 4S (Siri), speech recognition has finally gone mainstream. As consumers of these devices, our expectations are high, and it may take a few awkward moments to realize that technical challenges still force us to adapt our behavior while the technology advances toward more natural interactions. In this article I'll draw on my experience working on Xbox Kinect to discuss a few of these challenges. I'll take your perspective and frame them as answers to common questions and experiences:
- Why do I have to press a button first to talk to the system?
- Why is it that with some systems I can speak in full sentences and with others I have to know and speak specifically worded commands or phrases?
- Why does the speech system’s responsiveness vary so much?
- When a system doesn’t understand what I’ve said, why does it sometimes say “What?” and other times offer choices and ask me to pick?
- When the system doesn’t understand me, why doesn’t it help when I speak slower or add emphasis?
- Doesn’t the system understand that an emotional “YES” is different from “yes”?
Triggers
Why do I have to press a button first to talk to the system?
Imagine if these mainstream speech recognition systems were always listening. They cannot tell when they are being spoken to, so they have no way of knowing whether every little sound they "hear" is meant for them. Such systems would constantly respond to everything, including non-speech sounds! Even with a modest amount of intelligence, the number of intrusions and falsely activated responses would be so irritating that I'm confident most of us would turn the system off within a few short minutes.
So, for these systems to know that we are talking to them, they need a trigger: an event or action to signal that what follows is meant for them. Most devices use a simple trigger, a 'push-to-talk' button: you push a button, the system acknowledges it's ready, and then you speak to it.
Admittedly, however, it does seem strange that speech, a hands-free form of interaction, should start by requiring you to use your hand first (hand trigger + speech command)! Additionally, for Xbox, we want anyone and everyone to jump in and interact without having to reach for or pass around a hardware device first. So we designed it so that you can use speech as the trigger to engage Xbox Kinect: simply start the conversation by saying a trigger word followed by a command (speech trigger + speech command).
Of course, using a speech-based trigger comes at a cost… reliability. Since the system is constantly listening for a specific trigger word, it could occasionally recognize the word when it was not spoken, or miss it entirely when it was actually spoken. To improve the reliability of our speech-based trigger, we needed a unique-sounding word to minimize the chances of the system hearing something similar (confusable words or near-homophones) and being falsely activated. We chose "Xbox" as our trigger word not only because it is unique sounding, but also because it is the name of the console, and by speaking its name you're addressing it. The concept is very similar to how you might speak with a friend: you call out her name first to get her attention, and then you say something. Similarly, with Xbox Kinect you would say "Xbox" + command (e.g., "Xbox, go home").
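To make this concrete, here's a minimal sketch (in Python, with invented names) of how an always-listening system might gate input on a trigger word; anything that doesn't begin with the trigger is simply ignored:

```python
import string

# Hypothetical trigger-word gate: only utterances that begin with the trigger
# word are treated as being addressed to the system.
TRIGGER = "xbox"

def extract_command(transcript: str):
    """Return the command portion if the utterance begins with the trigger word."""
    # Normalize: lowercase and strip punctuation so "Xbox, go home" still matches.
    cleaned = transcript.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    if not words or words[0] != TRIGGER:
        return None                      # not addressed to the system; ignore it
    return " ".join(words[1:]) or None   # the trigger alone just wakes the system

print(extract_command("Xbox, go home"))  # -> "go home"
print(extract_command("go home"))        # -> None (no trigger, so it's ignored)
```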
So, does a trigger have to be used each and every time?
It could get rather tedious if a trigger had to precede every command or spoken phrase, so a well-designed system should minimize the use of triggers. One technique is automatic triggering, wherein the system uses context to predict that a spoken response is coming and automatically begins listening. For example, if the system asks a question, it's reasonable to expect a response, so it could begin listening automatically.
Another technique is to require a trigger only to start a conversation, and not again until that conversation has ended and a new one begins. In other words, as long as you and the system are still in a conversation, the system keeps listening and no trigger is needed. For example, let's say you just searched for Batman using Xbox Kinect, and are looking at a list of results for all things Batman…
Then, a four-part conversation may go like this:
- “Xbox, The Dark Knight” → system goes to a details layer for The Dark Knight movie;
- “Related” → system goes to The Dark Knight’s related movies page;
- “Spider Man 2” → system goes to a details layer for Spider Man 2 movie;
- “Go Home” → system returns user to the top level home page
Notice that across all four phrases of that conversation you only had to say "Xbox" once. The key is to understand when the conversation ends, requiring you to say "Xbox" again only to start another conversation:
- A conversation could end naturally when a task is completed. For example, if you were watching a movie, you could start a conversation by saying "Xbox, fast forward". Now, the speech recognition microphone can remain open and actively listen for follow-up commands… "Xbox, play" or "Xbox, stop." Once it hears "Xbox, play", the movie stops fast forwarding and begins playing, and the task has ended along with the conversation.
- A conversation could time out. For example, once you start a conversation, a timer starts and the speech recognition system listens for input. Each new input restarts the timer, but if the timer hits zero the system ends the conversation.
- A conversation could be explicitly ended by you. For example, whether you started a conversation intentionally or accidentally, you should still have a way to manually and explicitly end it. That way, you don't need to keep quiet to avoid inadvertently issuing commands until the system stops listening. With Kinect, you would say "Cancel".
Of course, to make this work well for you, the system has to offer great feedback. It’s important to ensure you can easily tell when the system is listening and when you need to use a trigger to get it to listen. I’ll save this UI design topic for another day.
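Putting the pieces together, here's a rough sketch of that conversation lifecycle: a trigger opens the conversation, each new input restarts a timer, "cancel" ends it explicitly, and silence lets it time out. The ten-second timeout is an assumed value for illustration, not Kinect's actual behavior:

```python
import time

CONVERSATION_TIMEOUT = 10.0  # seconds of silence before the conversation ends (assumed value)

class Conversation:
    """Minimal sketch of 'trigger once per conversation' with timeout and cancel."""

    def __init__(self):
        self.active = False
        self.deadline = 0.0

    def hear(self, utterance: str, now: float):
        """Return the command to act on, or None if there's nothing to do."""
        utterance = utterance.lower().strip()
        if not self.active:
            # Only the trigger word can start a conversation.
            if utterance.startswith("xbox"):
                self.active = True
                self.deadline = now + CONVERSATION_TIMEOUT
                return utterance[len("xbox"):].strip(" ,") or None
            return None
        if now > self.deadline:
            self.active = False                         # timed out; a new trigger is required
            return None
        if utterance == "cancel":
            self.active = False                         # explicitly ended by the user
            return None
        self.deadline = now + CONVERSATION_TIMEOUT      # each new input restarts the timer
        return utterance or None

convo = Conversation()
t = time.time()
print(convo.hear("Xbox, The Dark Knight", t))  # -> "the dark knight"
print(convo.hear("Related", t + 3))            # -> "related" (no trigger needed)
print(convo.hear("Go Home", t + 30))           # -> None (conversation timed out)
```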
Vocabulary and Syntax
Why is it that with some systems I can speak in full sentences and with others I have to know and speak specifically worded commands or phrases?
Different systems (or even parts of the same system) use different strategies to recognize what you're saying. Some systems let you say whatever you want and try to pick out and recognize only keywords (i.e., keyword spotting). For example, if you said "I want to watch the movie Batman", the system might be programmed to recognize and pick out only "watch", "movie", and "Batman" to figure out your intent, ignoring everything else. These systems come across as supporting more natural interactions because you can say what you want, how you want. Thus, these similar phrases might be equally successful:
- “Show us Batman the movie”
- “Play the film Batman”
- “Start the first Batman motion picture”
Success, however, depends on several factors:
- Having lots of keyword synonyms for robust flexibility. Using the previous example, a robust system would let you say “movie”, “film”, “flick”, “picture”, or “motion picture.”
- Identifying the right words. If you told the system you wanted to watch a particular movie, could it accurately identify the movie's actual title, or might it mistake the other, superfluous words in your command for part of the title? For example, for the phrase "Start the first Batman motion picture", the system would have to figure out whether the title of the movie is "the first Batman", "first Batman", or just "Batman".
- Correctly interpreting intent. If the system is keyword-spotting, then it makes precarious assumptions about your intent. For example, if you said…
“I want to watch the movie Batman tomorrow night”, or
“Have my friends recommended I watch the movie Batman?”
In both cases, the system would pull out “watch”, “movie”, and “Batman”, and would probably just start playing the movie and miss your intent.
If such a system cannot correctly interpret your intent, then it is sending you mixed signals. On the one hand, it’s giving you the perception of freedom to say what you want, but on the other hand there are limits to what it can understand. Helping you understand the limits of your “freedom” can be tricky. If you find yourself constantly hitting the system’s limit, then you may feel the system is over promising and under delivering.
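To see how fragile this can be, consider a hypothetical keyword-spotting sketch: it scans the transcript for known keyword synonyms and a known title, ignores everything else, and therefore makes exactly the precarious intent assumptions described above. The keyword sets and title catalog here are invented for illustration:

```python
# Hypothetical keyword-spotting sketch: pick out known keywords and a known
# title from a free-form utterance, ignore everything else, and guess an intent.
WATCH_SYNONYMS = {"watch", "show", "play", "start", "see"}
MOVIE_SYNONYMS = {"movie", "film", "flick", "picture"}
KNOWN_TITLES = {"batman", "the dark knight", "spider man 2"}

def spot_intent(transcript: str):
    text = transcript.lower()
    words = set(text.replace(",", "").replace("?", "").split())
    wants_to_watch = bool(words & WATCH_SYNONYMS)
    mentions_movie = bool(words & MOVIE_SYNONYMS)
    title = next((t for t in KNOWN_TITLES if t in text), None)
    if wants_to_watch and mentions_movie and title:
        # The precarious assumption: every other word is ignored, so
        # "...tomorrow night" or "Have my friends recommended..." still
        # collapses to "play this movie right now".
        return ("play_movie", title)
    return ("unknown", None)

print(spot_intent("Show us Batman the movie"))
# -> ('play_movie', 'batman')
print(spot_intent("Have my friends recommended I watch the movie Batman?"))
# -> also ('play_movie', 'batman'), missing the real intent
```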
Of course we don't want limits, but when they exist, should they be hidden from us? If the design is such that you rarely encounter them, then perhaps they can be hidden. Otherwise, system designers should more clearly define and teach what's possible in order to properly set and manage your expectations.

Such is the case when some systems take a different approach: less freedom, and a better-defined vocabulary and syntax/grammar. While it may be costly and unnatural for you to learn what the system expects and needs, if those restrictions are reasonable (i.e., easily learned, remembered, and used) they can lead to more predictable interactions in which the system more accurately interprets your intent. For example, a smaller vocabulary and a verb-noun syntax may sound restrictive, but they make things easier for the system and can improve its performance in ways that reduce downstream costs for you. Thus, "Play Batman" and "Play Indiana Jones and the Temple of Doom" are handled equally well because the system knows that anything following the verb "Play" will be the title of a movie. It doesn't have to guess which part of your phrase is the title, so it can avoid coming back with clarifying questions. Of course, you must know the vocabulary and syntax; otherwise, if you say "Start Batman" it won't work, and if you say "Play the first Batman motion picture", the system will try to find a movie titled "the first Batman motion picture"!
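By contrast, here's a similarly small sketch of the verb-noun approach, assuming a catalog of known titles: because the grammar dictates that everything after the verb is the title, the system never has to guess where the title begins, but anything outside the grammar simply fails:

```python
# Sketch of a constrained verb-noun grammar (vocabulary and catalog are invented).
VERBS = {"play", "watch"}                                     # small, well-defined vocabulary
CATALOG = {"batman", "indiana jones and the temple of doom"}

def parse_command(utterance: str):
    words = utterance.lower().split()
    if not words or words[0] not in VERBS:
        return None                       # "Start Batman" fails: "start" isn't in the grammar
    title = " ".join(words[1:])           # everything after the verb is taken as the title
    return title if title in CATALOG else None

print(parse_command("Play Batman"))                                 # -> "batman"
print(parse_command("Play Indiana Jones and the Temple of Doom"))   # -> the full title
print(parse_command("Play the first Batman motion picture"))        # -> None (no such title)
print(parse_command("Start Batman"))                                # -> None (unknown verb)
```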
Context
Why does the speech system's responsiveness vary so much?
When you do a web search, the system can seem to take a long time to recognize what you've said; yet when you perform a specific activity with predefined speech choices, it seems to respond almost immediately. What's going on here?
We’re talking about system performance and the speed and accuracy of recognizing what was said. When the system hears you say something, it basically compares the acoustic pattern it captured against its database of things it’s supposed to recognize (i.e., its vocabulary). As you can imagine, the size of that vocabulary can affect the system’s performance. The larger the vocabulary, the longer it will take to find the best match (speed) and the greater the chances that more than one entry in the vocabulary may be a good match (accuracy).
So, one way to improve the system’s performance is to reduce the vocabulary to the smallest size possible. In creating that vocabulary, how would engineers know what to cut out and what to leave in? Like us, engineers and designers can use context. We rely on context to help us better understand events and activities around us. For example, if you heard something unclearly, sometimes context helps you make sense of it. This is possible because context helps you reduce the set of possibilities to a more relevant set of choices. Today’s speech recognition systems can employ the same strategy. Instead of trying to recognize every possible thing you may say at any given time, some systems use context to create a smaller more likely set of spoken commands and phrases. With a smaller vocabulary and grammar set, the system’s speed and chances of more accurately recognizing your utterances improve.
Using this strategy, designers can create different sets of vocabulary that will be available only in specific areas / situations (context sensitive). Then, as you navigate from one area of your device or application to another area, the system can dynamically unload the vocabulary for the former area and load a new vocabulary for the new area.
The additional advantage of a smaller, contextual vocabulary is that speech recognition is less computationally intensive, so it can happen quickly on your device. For example, when navigating to different parts of a system or interacting with on-screen elements, a small contextual vocabulary processed locally on your device lets the system respond immediately to a spoken command. When doing web searches, however, there may be no such context, so the system uses a larger vocabulary and recognition program in the cloud, and it takes more time until results are returned to your device.
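As a rough illustration (the area names and phrase lists below are made up), a context-sensitive recognizer might swap in a small grammar for whichever area you're in and treat anything outside it as something a larger, slower recognizer would have to handle:

```python
# Each UI area loads only the small vocabulary that is relevant there.
VOCABULARIES = {
    "home":           {"xbox go home", "xbox bing", "xbox settings"},
    "video_player":   {"xbox play", "xbox pause", "xbox stop", "xbox fast forward"},
    "search_results": {"xbox the dark knight", "xbox related", "xbox go home"},
}

class ContextualRecognizer:
    def __init__(self):
        self.active_vocabulary = set()

    def enter_area(self, area: str):
        # Unload the previous area's grammar and load the new one.
        self.active_vocabulary = VOCABULARIES.get(area, set())

    def recognize(self, utterance: str):
        phrase = utterance.lower().strip()
        # A small local vocabulary can be matched quickly on the device; anything
        # outside it (e.g., an open-ended web search) would be sent to a larger,
        # slower recognizer in the cloud instead.
        return phrase if phrase in self.active_vocabulary else None

rec = ContextualRecognizer()
rec.enter_area("video_player")
print(rec.recognize("Xbox pause"))    # -> "xbox pause" (fast local match)
print(rec.recognize("Xbox related"))  # -> None (not in this area's vocabulary)
```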
As the tech improves, one can easily imagine that all commands and phrases will be available everywhere all the time and context will be one part of an algorithm used to recognize and understand what you said.
Disambiguation
When a system doesn’t understand what I’ve said, why does it sometimes say “What?” and other times offer choices and ask me to pick?
First, the system has to actually detect something before it can attempt to recognize and understand it. When some systems hear nothing within a specific period of time, they may prompt you again instead of simply ending the conversation. For example, phone menu systems may say "I'm sorry, I didn't hear you. Please say your answer again." Other systems designed for more open-ended use take a quieter approach; Xbox Kinect, for example, simply lets the conversation time out if it doesn't detect anything.
When a system does detect something, it can compare and contrast the acoustic patterns it heard against words or phrases in its database and generate a numeric value representing match confidence. For example, if you speak “Batman”, the system may come back with something like…
- 95% “Batman” (read as: system is 95% confident it thinks you said “Batman”)
- 80% “Hat Man”
- 60% “Darkman”
- …
- 40% “Superman”
- 35% “Spider Man”
- …
- 5% “The Hulk”
- Etc.
The system can now rule out any word or phrase whose confidence value isn't high enough. If no value is high enough, then it seems prudent for the system to respond with a "What?" or "Please try again" type message. Note that in such cases the system is obliged to respond, since the response confirms it is listening and that something was detected but could not be understood. This is also a good time for the system to provide hints or tips about what can be said, to help ensure you are trying to say something valid.
If several words or phrases match with high enough confidence values, then should the system just return the highest matching word/phrase as being recognized? Yes, but only if there’s enough distance between it and the next highest value. Using the example above, let’s say both “Batman” and “Hat Man” score high enough to be considered a good match, and since the 15% gap between them is rather large the system can return “Batman” confidently. However, if the gap between them were very small, say only 1%, then there’s a good chance you could have spoken either word/phrase. At this point, the system could present you with its top matches and ask you to pick one. Offering these choices is much better than asking you to try again. The assumption here is that any repeated attempts will not create any greater distinction between choices that sound alike. Lastly, of course, when asking you to pick from similar sounding choices, the system should offer different labels instead of the actual words/phrases themselves! For example, say “one” for Batman or “two” for Hat Man.
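The decision logic just described might look something like the sketch below; the confidence threshold and minimum gap are illustrative numbers, not values from any real product:

```python
# Reject, accept, or disambiguate based on confidence scores and the gap
# between the top two candidates (thresholds are assumptions for illustration).
ACCEPT_THRESHOLD = 0.70   # minimum confidence to consider a candidate at all
MIN_GAP = 0.10            # top match must beat the runner-up by at least this much

def decide(candidates):
    """candidates: list of (phrase, confidence), e.g. [("Batman", 0.95), ...]."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    viable = [c for c in ranked if c[1] >= ACCEPT_THRESHOLD]
    if not viable:
        return ("reprompt", None)               # "What? Please try again."
    if len(viable) == 1 or viable[0][1] - viable[1][1] >= MIN_GAP:
        return ("accept", viable[0][0])         # a clear winner
    # Similar-sounding candidates: ask the user to pick one by label ("one", "two", ...)
    return ("disambiguate", [phrase for phrase, _ in viable])

print(decide([("Batman", 0.95), ("Hat Man", 0.80), ("Darkman", 0.60)]))
# -> ('accept', 'Batman'): both clear the threshold, but the 15% gap is large enough
print(decide([("Batman", 0.81), ("Hat Man", 0.80)]))
# -> ('disambiguate', ['Batman', 'Hat Man']): the 1% gap is too small to trust
```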
Recognition Reliability
When the system doesn’t understand me, why doesn’t it help when I speak slower or add emphasis?
In the previous section, we talked about the system's ability to recognize what you said. Sometimes, even if you say exactly what the system expects, there's a chance it won't recognize it with 100% certainty. Why is that? Obviously, there's more at work here than just the actual words or phrases being spoken.
People speak differently; articulation, pronunciation, accent, pitch, and more all factor into a system's ability to recognize what you say. Consequently, speech systems recognize some people very well while having difficulty with others. Of course, all of this holds true in person-to-person conversation as well. As masters of communication, when another person doesn't understand what you're saying, you naturally adjust your delivery when trying again: you may slow down, articulate more, add emphasis, and use simpler words and grammar. When a speech recognition system doesn't accurately recognize your commands, you instinctively fall back on the same adjustments. Unfortunately, almost all of these tactics will actually make things worse! Speech recognition systems are trained or tuned to recognize words and phrases spoken "normally", without added emphasis or emotion. Making all of those "helpful" adjustments changes the sound patterns, making them less likely to match the patterns in the database. This, of course, is why we come to believe we should speak robotically when addressing speech systems.
There are systems out there that can be tuned or trained to your specific voice. In most cases, you train the system by speaking a predefined set of phrases so the system can tune itself to you. In the near future, systems should be able to identify who is speaking and dynamically tune that person’s speech profile on an ongoing basis as a by-product of normal everyday use.
Emotional Understanding
Doesn’t the system understand that an emotional “YES” is different from “yes”?
We know communication involves a lot more than just the words we speak. We communicate through movement and facial expressions, and how we say things can be just as important as what we say. For example, a timid “yes” can imply uncertainty and mean “maybe” more so than “yes”, providing an opportunity for the system to be more helpful. If I say “YES” joyfully and with exuberance, then the system could forego the follow-up confirmation “Are you sure?” If I sound hurried, I want the system to feel my sense of being rushed and also respond quicker. If I speak in a hushed manner, I want the system to lower its voice when it responds.
Today, almost all systems seem blind to these richer, multi-channel forms of expressive communication, reducing them to a single channel. Of course, any emotional expression also risks worse recognition, but have faith that as technology advances, we’ll move closer and closer to more natural and expressive interactions.
– – – – –
While much of the information in this article may apply generally to speech recognition systems, it’s not my intent to represent the technical underpinnings of other products. Besides, mainstream adoption pushes technology to advance quickly and the information I’ve presented in this article will be outdated just as quickly. For now, I hope I have answered some basic questions regarding the use of your speech recognition products.
Related Reading
- Rest in Peas: The Unrecognized Death of Speech Recognition
- How Speech Recognition Works
- Keyword Spotting (Wikipedia)
- Key-Word Spotting – The Base Technology for Speech Analytics (white paper)
- Siri Meets Eliza