Tuning Strategies for Speech Recognition Applications

Tuning a speech recognition application results in a more accurate solution that can improve caller experiences. The process involves using prompt, grammar, call flow, and caller data to help solution developers make the needed changes. While application tuning often consumes 40 to 50% of total development and deployment time, there are techniques for creating a repeatable process that reduces that time and still delivers the desired results.

Throughout, we will show how the application log can inform you about overall performance, while the speech engine log helps in finding and fixing specific speech recognition problems. Ultimately, determining the success and performance of a speech recognition application hinges on the use of all available logs, coupled with an effective tuning procedure.

The Tuning Tradition: Benefits and Challenges

After a speech recognition application is deployed and enough data has been collected, speech developers traditionally transcribe and analyze every call in the speech engine call log.

Figure 1: Speech Engine Log
++SIL++ is silence, ++Background++ is background noise

Common problems they look for include out-of-grammar utterances, prompt issues, and background noise. This process is very thorough and is worth doing on occasion because it gives the developer a large set of data to work with, including a complete set of out-of-grammar and confidence score data.

This technique, however, makes it hard to measure application performance, because the speech engine logs are designed to measure how the speech engine is doing, not how the application as a whole is performing. And the performance of an application has to be measured properly in order to evaluate its effectiveness.

Furthermore, there are additional challenges with the traditional tuning method:

Overcoming the Challenges

Let’s look at a case study, a speech-enabled gift code redemption line, to illustrate how we can successfully approach these roadblocks. An electronics retailer wanted to create a program to reward its frequent buyer card customers for their loyalty. The retailer sent out a mailer with a unique gift code, an 800 number to call, and the choice of three gifts: a wireless keyboard, a food processor, or a cordless phone.

Since there were thousands of frequent buyers, the retailer did not want to overburden their customer service department, yet it still wanted to provide this valuable marketing program. Therefore, the goal of the speech-enabled IVR system was quite simple: Callers call in, say their gift code number and the name of the gift they want, then hang up. Once the information was captured, the gift would be mailed out to them on the next business day.


The program was a success, with high call volumes. When they first started tuning the application, the electronics retailer transcribed 1000 calls per day, which was every single call in the speech engine log. As you can imagine, this task became overwhelming, yet they still wanted to tune on a daily basis. So they started using the application log to focus only on the calls that failed, which instantly reduced the number of calls they needed to tune.

The application log provides data such as call hang-ups and call routing information: essentially, all of the non-speech-related functions of an IVR system. Figure 2 shows how they reduced the number of calls to analyze from 1000 to 130.

Figure 2: Using application log data

This example illustrates that the retailer does not need to focus on calls from customers who try to get two gifts (which is not allowed) or who haven’t paid their bills (they don’t get to redeem a gift), because these situations do not point to a failure of speech recognition.

Therefore, they found they only needed to look at 13% of the calls in the speech engine log (130 of 1000). What matters most to them, as voice user interface designers, is finding out why this small percentage of callers became frustrated with the system and transferred to Customer Service or simply hung up without completing the call. These 130 calls are the ones they need to cross-compare with the speech engine log.
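The filtering step described above can be sketched in a few lines of Python. This is only an illustration: the record layout and the outcome labels (`transferred_to_agent`, `hung_up_incomplete`, and so on) are hypothetical, and a real application log will use its own field names and values.

```python
# Hypothetical outcome labels that may point to a recognition problem.
# Business-rule outcomes (second gift denied, unpaid account) are excluded.
SPEECH_RELATED_FAILURES = {"transferred_to_agent", "hung_up_incomplete"}


def calls_to_review(app_log):
    """Return the IDs of calls worth cross-comparing with the speech engine log."""
    return [
        record["call_id"]
        for record in app_log
        if record["outcome"] in SPEECH_RELATED_FAILURES
    ]


def review_share(app_log):
    """Fraction of all calls that need a speech engine log review."""
    if not app_log:
        return 0.0
    return len(calls_to_review(app_log)) / len(app_log)


# Toy records for illustration only (not the retailer's data).
app_log = [
    {"call_id": "c1", "outcome": "completed"},
    {"call_id": "c2", "outcome": "second_gift_denied"},
    {"call_id": "c3", "outcome": "transferred_to_agent"},
    {"call_id": "c4", "outcome": "account_unpaid"},
    {"call_id": "c5", "outcome": "hung_up_incomplete"},
]

print(calls_to_review(app_log))
print(f"{review_share(app_log):.0%} of calls need review")
```

Applied to the retailer's numbers, the same calculation gives 130 / 1000 = 13%.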

The Tuning Process

Having reduced the number of calls to analyze from 1000 to 130, the retailer went about transcribing these calls with the LumenVox Speech Tuner. Using a filter to find the calls that potentially had background noise or out-of-grammar issues, they pinpointed an out-of-grammar issue and were able to fix it immediately.

For example, one of the gifts in the grammar of the application was “wireless keyboard”, but many of the callers were calling it a “cordless keyboard.”

They added “cordless” to the grammar and used the built-in grammar tester to immediately test against the speech engine to see whether the change was effective.

Figure 4: The Speech Tuner allowed them to add "Cordless" to the grammar, test, and then redeploy.
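As a rough illustration of the same find, fix, and retest loop, the Python sketch below checks transcribed utterances against a toy grammar. It does not use the Speech Tuner or any speech engine API; the grammar is modeled as a plain dictionary, and the phrases and transcripts are hypothetical.

```python
from collections import Counter

# Toy grammar: each accepted phrase maps to the gift it selects.
# Illustrative only, not the retailer's actual grammar.
grammar = {
    "wireless keyboard": "keyboard",
    "food processor": "food_processor",
    "cordless phone": "phone",
}


def normalize(utterance):
    """Lowercase the utterance and collapse repeated whitespace."""
    return " ".join(utterance.lower().split())


def out_of_grammar(transcripts, grammar):
    """Count transcribed utterances that match no phrase in the grammar."""
    return Counter(
        normalize(t) for t in transcripts if normalize(t) not in grammar
    )


# Hypothetical transcriptions taken from the calls under review.
transcripts = [
    "cordless keyboard",
    "Food processor",
    "cordless keyboard",
    "wireless keyboard",
]

before = out_of_grammar(transcripts, grammar)

# Add the phrase callers actually say, then retest on the same transcripts.
grammar["cordless keyboard"] = "keyboard"
after = out_of_grammar(transcripts, grammar)

print("Out of grammar before:", before.most_common())
print("Out of grammar after:", after.most_common())
```

Counting the misses, rather than just listing them, makes it easier to decide which phrases are common enough to justify a grammar change.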

After making all of the required changes and testing the results with the Speech Tuner, they were able to redeploy the application with greater accuracy.

Each time they repeat this process, they narrow the percentage of calls to analyze, from 13% to 12% to 11%, and refine the tuning process so it doesn’t have to be a burden. Most importantly, they are getting application performance metrics that they couldn’t get if they looked only at the speech engine log.

The Final Hurdle: Logging Standards

Lastly, it can be difficult to learn and understand all the different logging systems of the speech recognition products provided by different vendors. It is essential that the speech development community have uniform logging components that make the tuning process easier. The VXML Tools Committee, Data Logging Group, has proposed a generic standard with the goal of platform and vendor independence. The benefit of a well-defined format is clear: less time spent learning disparate logging systems (so everyone can “read the data”) leads to an easier tuning process.

Conclusion

Speech recognition application tuning is an essential part of deploying a speech recognition solution, but it does not have to be a painful one. If you follow the techniques outlined in this paper, you can reduce the time you spend tuning while producing a dynamic and well-performing speech solution. Remember:


Key Takeaways