Capturing Difficult Utterances

(APR 2009) — Many speech recognition developers have a strength they may not even realize: access to a lot of data that the Speech Engine does not have. Making use of this data is a complex task, but for developers seeking very high accuracy, the gains in recognition quality can make it a worthwhile one.

In past tech bulletins, we have discussed some of these concepts, such as using wildcards and database lookups to capture digit strings, and we have a whitepaper on a similar topic.

In some ways, this is a continuation of last month's discussion about SRGS grammar weights, as that also represents a method of giving extra information to the Speech Engine. This time we will explore even more advanced ideas. This tech bulletin is more a document on how to approach difficult tasks than a simple "how to" guide, but we hope advanced developers will find something here that spurs them to further development and thought.

Learning From Our Mistakes

One of the most difficult tasks for a speech application is open-ended alphanumeric string capture. If you wanted to capture a randomly generated string of mixed letters and numbers, it would be quite difficult, and the difficulty increases with each character you add to the string ("2Q" is much easier to recognize than "2QNJFECDBV").

It is always hard to recognize spoken letters because they sound similar; this is why militaries, police forces, pilots, and anyone else who reads a lot of letters into radios and telephones use special alphabets (think "alpha bravo" instead of "AB"). Also, since there is always some chance of misrecognizing any word in an utterance, the longer a string gets, the greater the chance that the Speech Engine will make a mistake somewhere.

But even when the Speech Engine misrecognizes a word in a string, it returns valuable information. If a 10-character string has only one misrecognition, that means 9 of the 10 were correct. If your application asks callers to repeat themselves after an initial incorrect attempt, you can compare the second result to the first.

You may very well find that only the third character has changed — this is a pretty good indication that there is a problem recognizing the third character. You might then ask the caller to just speak the third letter, and even provide some disambiguation advice. "If your letter is A, say 'Alpha' and if it is K, say 'Kilo.'"
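
As a rough illustration, here is a short Python sketch of that comparison step. The strings, the phonetic-alphabet words, and the prompt wording are all hypothetical; the idea is simply to diff two attempts and re-prompt only for the positions that changed.

    from typing import List

    # Hypothetical phonetic-alphabet hints for disambiguation; extend as needed.
    NATO = {"A": "Alpha", "B": "Bravo", "K": "Kilo"}

    def differing_positions(first: str, second: str) -> List[int]:
        """Return the 0-based positions where two equal-length results differ."""
        return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

    first_attempt = "2QNJFECDBV"   # what the Engine returned the first time
    second_attempt = "2QAJFECDBV"  # what it returned after the caller repeated

    for pos in differing_positions(first_attempt, second_attempt):
        options = sorted({first_attempt[pos], second_attempt[pos]})
        hints = " and ".join(f"if it is {c}, say '{NATO.get(c, c)}'" for c in options)
        print(f"Please say just character {pos + 1} again: {hints}.")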

When you combine this sort of logic with our n-best feature, you can infer quite a lot about what a user said. N-best causes the Engine to return not just a single result, but several, ordered by confidence score. So you could actually get the top 5 most likely strings each time you ask the caller the question, and after 2 or 3 attempts you might have 10 to 15 possible strings. You then have a matrix you can mine for a lot of information: which positions in the string have characters that change frequently? Which are predominantly the same? With this sort of information you can really start to zero in on what the caller is actually saying.
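
One way to mine that matrix, sketched below in Python, is to pool every n-best candidate from every attempt and vote position by position; the candidate strings shown are invented for illustration.

    from collections import Counter

    # Invented n-best candidates: the top 3 results from each of two attempts.
    candidates = [
        "2QNJFECDBV", "2QMJFECDBV", "2QNJFECDBV",
        "2QNJFECDBV", "2QNJFEZDBV", "2QMJFECDBV",
    ]

    length = len(candidates[0])
    for pos in range(length):
        # Count how often each character appears at this position across all candidates.
        counts = Counter(s[pos] for s in candidates if len(s) == length)
        best, votes = counts.most_common(1)[0]
        agreement = votes / sum(counts.values())
        print(f"position {pos + 1}: best guess '{best}' ({agreement:.0%} agreement)")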

Moving Away from Randomness

The examples so far have focused on the idea that your callers are speaking a random string, which is the hardest problem you could tackle. In the real world, however, you are almost never asking for truly random data. As we discussed with weighting, certain phrases are often far more likely to occur than others, and you should take advantage of that fact.

For instance, if you are trying to capture strings of alphanumeric characters, it's probably because people are reading some sort of ID number, be it a part number, a prescription number, or a flight number. You can leverage information you have about the caller to try to reduce errors.

If your application is a flight lookup, you might be able to match a caller's caller ID against the phone numbers in your passenger database to narrow the set of potential flight numbers. If you knew that a caller from that caller ID was booked on flight ABCD and the Speech Engine returned the caller saying "ABCE," your application might do a little bit of reasoning and ask if the caller was looking for flight ABCD. If the caller says no, then you can ask about ABCE.
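
A minimal sketch of that reasoning, assuming a hypothetical passenger lookup keyed by caller ID, might look like the following; the data, names, and one-character tolerance are all assumptions made for illustration.

    # Hypothetical lookup: caller ID -> flight numbers that caller is booked on.
    PASSENGERS = {"+16195551234": ["ABCD", "QR12"]}

    def char_differences(a: str, b: str) -> int:
        """Count the positions where two strings differ (length mismatch counts fully)."""
        if len(a) != len(b):
            return max(len(a), len(b))
        return sum(1 for x, y in zip(a, b) if x != y)

    def confirmation_order(caller_id: str, recognized: str) -> list:
        """Confirm the caller's booked flights first when they are within one
        character of the recognized string, then fall back to the raw result."""
        booked = PASSENGERS.get(caller_id, [])
        close = [f for f in booked if char_differences(f, recognized) <= 1]
        return close + ([] if recognized in close else [recognized])

    # The Engine heard "ABCE", but this caller is booked on "ABCD":
    print(confirmation_order("+16195551234", "ABCE"))   # ['ABCD', 'ABCE']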

The idea here is to use extra information you have access to in order to guide your application when it asks for confirmations, not to shoehorn the caller into a certain path. In that flight number example, we can guess that the letters D and E are easily confused over the phone, so when we have a compelling reason to favor D over E, our application can give it preferential treatment.

This is very similar to the idea in our previous technical bulletin about putting wildcards in an alphanumeric grammar and then comparing the Engine's results against a database. Anytime you can compare the Engine's results against a predefined set of data, you can expect to improve overall application accuracy.
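
As a sketch of that comparison, suppose the grammar marks its least reliable positions with a wildcard character and the application then filters a predefined set of valid IDs; the ID format, the '?' convention, and the data below are made up for illustration.

    import re

    # Made-up set of valid IDs the application already knows about.
    VALID_IDS = ["RX48821A", "RX48831A", "RX77821B"]

    def matching_ids(result_with_wildcards: str) -> list:
        """Return every known ID consistent with a result that uses '?' as a wildcard."""
        pattern = re.compile(
            "".join("." if ch == "?" else re.escape(ch) for ch in result_with_wildcards) + r"\Z"
        )
        return [valid_id for valid_id in VALID_IDS if pattern.match(valid_id)]

    print(matching_ids("RX488?1A"))   # ['RX48821A', 'RX48831A'] -> confirm with the caller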

To help you with those sorts of tasks, here is a list of the most commonly misrecognized numbers and letters based on internal LumenVox testing with our Speech Engine and alphanumeric recognition. The Spoken row indicates what a caller said; the Result row indicates what was recognized. You could consider replacing some of these with wildcards in your grammar if you are able to do these sorts of database comparisons.


Custom Acoustic Models

One final idea that really presses this notion of adding extra data to speech applications is to build custom acoustic models that reflect the users of a specific application. An acoustic model is the data that the LumenVox Speech Engine uses to understand how languages sound. The American English acoustic model, for instance, contains hundreds of hours of transcribed audio from American callers.

A custom acoustic model can be useful if there is something different about the way callers to your application sound. This may be a specific regional accent, but it could also be some other change in the acoustic qualities of the calls: maybe an overwhelming number of your callers are using speakerphones, or a specific sort of handset, or something similar.

If you are able to collect and transcribe audio, LumenVox can work with you to build custom acoustic models. By taking audio from your callers and building a model with it, the Speech Engine can be trained to better recognize the callers of a specific system.

As this sort of custom training can be fairly complex and time-consuming, we generally recommend it only for applications that need very high accuracy or that have a particularly challenging acoustic component.

© 2016 LumenVox, LLC. All rights reserved.