There are four methods that a speech recognition developer can use to plan for (and react to) potential ambiguities in a caller's response to a speech-enabled IVR system. These methods can be employed during the design phase of the speech recognition application, and they can also be added to the logic of the speech solution once data from post-deployment tuning has been collected.
Let’s start off by defining disambiguation as it relates to speech. Disambiguation is the process by which the speech recognition application decodes an uncertain utterance by a caller into a meaning that best matches the caller's spoken intent. In other words, often what a person says and what they mean to communicate are different, and it is the responsibility of the speech application to solve this “double meaning.”
To illustrate disambiguation methods, let’s look at an IVR that was created for a car rental agency, Express Rentals. Express Rentals has a speech-enabled IVR that prompts the caller for the city and state where they would like to rent a car. The system then transfers the call to the appropriate local agency for further service.
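In this scenario, an expected and good response from a caller would be, “Eureka, California.” While 75% of callers give a good response, the system also faces three types of challenging responses: a caller who says only a state, such as “California”; a caller who says only a city, such as “Eureka”; and a response that could match more than one similar-sounding city, such as “Eureka, CA” and “Yreka, CA.”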
As we develop the application, we can predict from experience that a caller who says “California” means the Golden State, and not the three small cities of California, MD (pop. 9,307), California, MO (pop. 4,005) and California, PA (pop. 5,274). In this case, instead of asking the caller to repeat the answer, we can build into the application logic the ability to prompt for just the city in California they would like to rent from.
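The application logic described above can be sketched in a few lines of Python. This is only an illustration: the function name `next_prompt`, the `KNOWN_STATES` set, and the prompt wording are assumptions, and a real IVR platform would supply the recognized city and state through its own API.

```python
# Hypothetical set of state names the grammar can return on their own.
KNOWN_STATES = {"California", "Texas", "Massachusetts"}


def next_prompt(city, state):
    """Choose the follow-up prompt from the slots the recognizer filled.

    city and state are strings, or None when the caller did not say them.
    Returns None when no follow-up prompt is needed.
    """
    if city and state:
        return None  # complete answer, ready to transfer
    if state in KNOWN_STATES and not city:
        # Treat a bare "California" as the state, not as the small cities
        # named California, and ask only for the missing city.
        return f"Which city in {state} would you like to rent from?"
    # Anything else falls back to the original question.
    return "Please say the city and state you would like to rent from."
```

Calling `next_prompt(None, "California")` asks only for the city, so the caller does not have to repeat the state.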
If the caller just says a city, such as “Eureka,” we can use Semantic Interpretation (the process of converting the raw spoken text into a more clearly meaningful, or unambiguous, form) to create a list of possible meanings. We could use this list to play back to the caller the states in the US that have a city called “Eureka.” In this case there are seven states: California, Nevada, Montana, Missouri, Illinois, South Dakota and Kansas. Since that is quite a long list, we can instead simply prompt the caller to speak the state, get the correct clarifying response, and then transfer them to the Express Rentals office of the “Eureka” that they chose.
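As a rough sketch of this city-only case, the interpretation step can be modeled as a lookup from a city name to its candidate states, followed by a decision about which prompt to play. The table `CITY_TO_STATES`, the threshold `MAX_STATES_TO_READ`, and the function `interpret_city` are hypothetical names chosen for illustration; the table is not a complete gazetteer, and only the “Eureka” entry is taken from the text above.

```python
# Illustrative lookup table; the states for "Eureka" are those listed above.
CITY_TO_STATES = {
    "Eureka": ["California", "Nevada", "Montana", "Missouri",
               "Illinois", "South Dakota", "Kansas"],
    "Yreka": ["California"],
}

# Assumed threshold: lists longer than this are not read back to the caller.
MAX_STATES_TO_READ = 3


def interpret_city(city):
    """Turn a city-only answer into the next action for the application."""
    states = CITY_TO_STATES.get(city, [])
    if not states:
        return {"action": "reprompt",
                "prompt": "Please say the city and state you would like to rent from."}
    if len(states) == 1:
        # Only one possible meaning, so no clarification is needed.
        return {"action": "transfer", "city": city, "state": states[0]}
    if len(states) <= MAX_STATES_TO_READ:
        # Short list: read the options back to the caller.
        options = ", ".join(states[:-1]) + " or " + states[-1]
        prompt = f"Did you mean {city} in {options}?"
    else:
        # Long list: just ask for the state.
        prompt = f"Which state is {city} in?"
    return {"action": "ask_state", "prompt": prompt, "candidates": states}
```

With seven candidate states, `interpret_city("Eureka")` skips the read-back and asks the caller for the state directly.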
A complementary way of resolving a challenging response, such as whether the caller said “Eureka, CA” or “Yreka, CA,” is the use of NBest. NBest, which can run in the background of your application, provides a list of possibilities ordered from most likely to least likely. In other words, NBest finds the utterances that sound similar and groups them together in an array of values. In our case, Eureka, California is the most likely response according to the NBest results, and we can confidently transfer the caller to the Eureka, CA Express Rentals office.

Another common city pairing that makes a good example of a “challenging response” is Boston and Austin. If the NBest results list Boston and Austin with similarly high probabilities, then to disambiguate you would ask the caller, “Did you say Boston, Massachusetts or Austin, Texas?”
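The two NBest outcomes described above (accept the clear winner, or ask the caller to choose between two close candidates) can be sketched as follows. The function `decide`, the thresholds `accept` and `margin`, and the confidence values in the usage note are all assumptions made for illustration; real engines return NBest results in their own format and on their own confidence scale.

```python
def decide(nbest, accept=0.6, margin=0.15):
    """Decide what to do with an NBest list.

    nbest is a list of (hypothesis, confidence) pairs, most likely first.
    Returns a tuple of (action, hypotheses to use in the next step).
    """
    if not nbest:
        return ("reprompt", [])
    top_text, top_conf = nbest[0]
    if top_conf < accept:
        # Even the best guess is weak, so ask the question again.
        return ("reprompt", [])
    if len(nbest) > 1 and top_conf - nbest[1][1] < margin:
        # The top two are too close to call: "Did you say A or B?"
        return ("confirm", [top_text, nbest[1][0]])
    # One hypothesis clearly wins.
    return ("accept", [top_text])
```

For example, `decide([("Eureka, California", 0.82), ("Yreka, California", 0.41)])` accepts Eureka, while `decide([("Boston, Massachusetts", 0.71), ("Austin, Texas", 0.69)])` returns a confirm action listing both cities.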
Grammar Weighting means including probability statistics in the grammar. This technique both helps guide the Speech Engine during processing and affects the probability scores (generally called confidence scores) returned by the Speech Engine. For example, after the application has been deployed, the developer can gather call data on the most requested cities and apply “weights” to those cities. If the Speech Engine recognizes an utterance as being a close match between two or more words, it will favor the word with the higher weight. Back to our example: we found that Eureka, CA was uttered 100 times more often than Yreka, CA, so we were able to put a higher weight, or probability statistic, on Eureka, CA.
You can also apply weights before an application is deployed. In the Express Rentals example, we could use the population of cities to apply weights to similar-sounding phrases. The population of Eureka, CA is 26,128, while that of Yreka, CA is only 7,290. In this speech recognition application, city population is an appropriate basis for weight values.
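One simple way to turn either source of statistics (post-deployment call counts or pre-deployment populations) into weights is to normalize the raw numbers so they sum to 1. The sketch below assumes this proportional scheme and a hypothetical helper named `normalized_weights`; how the resulting weights are written into a grammar, and what range they must fall in, depends on the grammar format and Speech Engine in use.

```python
def normalized_weights(counts):
    """Scale raw counts (call volumes or populations) so they sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return {name: value / total for name, value in counts.items()}


# Pre-deployment: population figures quoted in the text above.
populations = {"Eureka, CA": 26128, "Yreka, CA": 7290}
population_weights = normalized_weights(populations)

# Post-deployment: Eureka requested 100 times for every Yreka request.
call_counts = {"Eureka, CA": 100, "Yreka, CA": 1}
call_weights = normalized_weights(call_counts)
```

Both sets of weights favor Eureka, CA, but the call data favors it far more strongly than population alone, which is one reason to revisit weights once tuning data is available.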
As you can see, Application Logic, Semantic Interpretation, NBest, and Grammar Weights can all contribute to an overall strategy for disambiguating large grammars. As a developer, it is important to choose the mix of these techniques that suits your type of speech recognition application and will bring your speech solution to success.