
Types of Speech Recognition

There are two types of speech recognition: speaker–dependent and speaker–independent. Speaker–dependent software is most often used for dictation, while speaker–independent software is more commonly found in telephone applications.

Speaker–dependent software works by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyze how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software.

Speaker–independent software is designed to recognize anyone's voice, so no training is involved. This makes it the only real option for applications such as interactive voice response (IVR) systems, where businesses can't ask callers to read pages of text before using the system. The downside is that speaker–independent software is generally less accurate than speaker–dependent software.

Speaker–independent speech recognition engines generally compensate for this lower accuracy by limiting the grammars they use, that is, the lists of words the engine will listen for. With a smaller list of recognized words, the speech engine is more likely to correctly recognize what a speaker said.
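
To see why a smaller grammar helps, consider the following minimal Python sketch. It is an illustration only, not a description of how any particular speech engine works internally: it treats words as strings of letters, uses plain edit distance as a stand-in for acoustic similarity, and the grammars and the misheard input "salls" are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidates(heard, grammar, max_distance=1):
    """Return the grammar entries within max_distance edits of what was heard."""
    return [word for word in grammar if edit_distance(heard, word) <= max_distance]


# Hypothetical grammars: a short menu, and the same menu plus look-alike words.
small_grammar = ["sales", "support", "billing"]
large_grammar = small_grammar + ["sails", "tales", "filling", "building"]

heard = "salls"  # stand-in for a slightly misheard "sales"

print(candidates(heard, small_grammar))  # ['sales']
print(candidates(heard, large_grammar))  # ['sales', 'sails']
```

With the three-word grammar only one entry is close to the input, so the choice is clear. With the larger grammar two entries are equally close, and a recognizer in that position would have to guess.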

This makes speaker–independent software ideal for most IVR systems and for any application where a large number of people will use the same system. Speaker–dependent software is used more widely for dictation, where only one person will use the system and a large grammar is needed.

The LumenVox Speech Engine, which powers all of our speech software, is speaker–independent. It is not dictation software, it is not the same as voice recognition, and it is not capable of recognizing an unlimited number of words at once. It is designed to recognize specific information, primarily from callers to a telephone IVR. It works well as a call router, auto–attendant, or any other application where designers have an idea of what sort of words a speaker is likely to say.

To create the Engine, we take hundreds of hours of transcribed audio and use it to build an acoustic model. This is a database that tells our Speech Engine what sounds look like mathematically, since math is the only language computers really recognize.

Because the audio we use to build the models comes from hundreds of speakers, the Engine can recognize a wide variety of voices. This is what makes it speaker–independent.

When the Engine receives input from a speech application, it converts the speaker's audio into a mathematical representation and compares it to its internal models. This gives the Engine an idea of which sounds make up the audio, and it compares those sounds to the words specified in the speech application's grammar.
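
The first part of that process, turning audio into numbers and comparing them to stored models, can be pictured with a toy Python sketch. Everything in it is an assumption made for illustration: the two features (loudness and zero-crossing rate), the reference values in MODELS, the synthetic tone helper, and the nearest-match rule. Production engines use far richer features and statistical models, so read this as a picture of the idea rather than as the Engine's implementation.

```python
import math


def features(samples):
    """Reduce raw audio samples to a small vector: (RMS loudness, zero-crossing rate)."""
    energy = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(samples) - 1))


def closest_sound(vector, models):
    """Return the label of the stored model nearest to the input vector."""
    return min(models, key=lambda label: math.dist(vector, models[label]))


# Hypothetical reference vectors: a hissy, quiet "s" and a loud, smooth "a".
MODELS = {"s": (0.2, 0.8), "a": (0.7, 0.1)}


def tone(cycles, amplitude, n=200):
    """Synthetic test signal: a sine wave with the given number of cycles over n samples."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]


loud_low = tone(cycles=5, amplitude=1.0)     # few zero crossings, high energy
quiet_high = tone(cycles=80, amplitude=0.3)  # many zero crossings, low energy

print(closest_sound(features(loud_low), MODELS))
print(closest_sound(features(quiet_high), MODELS))
```

The loud, slowly varying signal lands nearest the "a" reference and the quiet, rapidly varying one lands nearest the "s" reference. A sequence of such sound labels is what would then be compared against the words in the grammar.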

This is not an exact process. Because words can be pronounced with many subtle variations, the Speech Engine can never be certain what the speaker said. Even humans cannot be sure what somebody said to them if the audio is not clear. Consider how difficult it is to distinguish between the letters "t" and "b" when a person is spelling a word.

Our speech recognition software handles this uncertainty by using a method based on probabilities. In the same way that a public opinion poll reports a margin of error at a given confidence level, the Speech Engine returns a confidence score for any audio it attempts to recognize. This score represents how likely it is that the Engine's recognition result matches what the speaker said.
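
As a rough illustration of how an application might act on such a score, the Python sketch below normalizes raw match scores into values between 0 and 1 and then picks an action. The softmax normalization, the threshold values, and the names confidences and decide are assumptions for this example. They are not the Engine's actual scoring method or API.

```python
import math


def confidences(scores):
    """Normalize raw match scores (higher is better) so they sum to 1 (softmax)."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {word: math.exp(score - peak) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def decide(scores, accept=0.8, reject=0.4):
    """Pick the top hypothesis and choose an action from its confidence.

    The accept and reject thresholds are hypothetical defaults.
    """
    conf = confidences(scores)
    word = max(conf, key=conf.get)
    if conf[word] >= accept:
        action = "accept"    # confident enough to act on the result
    elif conf[word] >= reject:
        action = "confirm"   # ask the caller, e.g. "Did you say sales?"
    else:
        action = "reject"    # ask the caller to repeat
    return word, conf[word], action


# Hypothetical raw scores for two utterances.
print(decide({"sales": 5.0, "support": 1.0, "billing": 0.5}))
print(decide({"sales": 2.0, "sails": 1.8}))
```

In the first call one word clearly outscores the rest, so the result is accepted. In the second, two similar-sounding words score almost the same, so the confidence is lower and the application asks the caller to confirm.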