Speech Recognition Glossary

Don't know the difference between speaker-dependent and speaker-independent speech recognition? Confused by the distinction between ABNF and grXML?

The following glossary defines many common terms used by the speech recognition industry:

Augmented Backus-Naur Form (ABNF): A way of writing grammars that complies with the SRGS standard. It is a fairly simple and very human-readable format.
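As a brief illustration, a minimal ABNF grammar accepting a one-word answer might look like the sketch below. The rule name $answer and its alternatives are hypothetical, chosen only for the example:

```abnf
#ABNF 1.0;
// A hypothetical yes/no grammar; the rule name $answer is illustrative.
language en-US;
root $answer;
$answer = yes | no | maybe;
```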

automatic (or automated) speech recognition (ASR): See speech recognition.

confidence: The probability that the result returned by the speech engine matches what a speaker said. Speech engines generally return confidence scores that reflect the probability; the higher the score, the more likely the engine's result is correct.
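To make the idea concrete, here is a minimal sketch of how an application might act on a confidence score. The function name, thresholds, and messages are illustrative assumptions, not part of any particular engine's API:

```python
# Minimal sketch: acting on a confidence score returned by a speech engine.
# The 0.60 accept threshold and 0.40 confirm floor are illustrative
# assumptions, not values from any particular engine.

def handle_result(text, confidence, threshold=0.60):
    """Accept, confirm, or reject a recognition result by confidence."""
    if confidence >= threshold:
        return f"ACCEPT: {text}"
    if confidence >= threshold - 0.20:
        return f"CONFIRM: Did you say '{text}'?"
    return "REJECT: Please repeat that."
```

Many speech applications use a middle band like this to confirm uncertain results with the caller rather than rejecting them outright.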

dictation software: A type of computer application that allows users to speak freely as the application transcribes each word the users say. This kind of software is almost always speaker-dependent.

dictionary: A large data set used by a speech engine during recognition that maps the words of a language or dialect to the phonemes that make them up.

directed dialogue: An approach to speech application design that prompts users to say specific phrases. Contrast with natural language.

dual-tone multi-frequency (DTMF): The tones produced by pressing keys on a telephone. DTMF, also called Touch-Tone, is often used as a way of sending data to IVRs.

grammar: A file that contains a list of words and phrases to be recognized by a speech application. Grammars may also contain bits of programming logic to aid the application. The words in all active grammars make up the vocabulary. See also ABNF, SRGS, and grXML.

grXML: A way of writing SRGS grammars in XML. It is a standard format that is less human-readable than ABNF, but is widely supported by grammar-editing tools.
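As a brief illustration, the hypothetical yes/no grammar below is written in grXML; an equivalent ABNF grammar would express the same rule in a single line. The rule name "answer" is an assumption for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A hypothetical yes/no grammar; the rule name "answer" is illustrative. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="answer">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>maybe</item>
    </one-of>
  </rule>
</grammar>
```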

interactive voice response (IVR): An automated system that allows callers to interact with a computer using a telephone or VoIP connection. An IVR may use speech recognition, DTMF, or a combination of the two.

natural language: An approach to speech application design that encourages users to speak naturally to the system. Contrast with directed dialogue.

phoneme: The basic unit of sound. In the same way that written words are composed of letters, a spoken word is composed of various phonemes, though they may not line up precisely. For instance, the English word "food" has three phonemes (the "f" sound, the "oo" sound, and the "d" sound) but four letters. A speech engine uses its dictionary to break up vocabulary words and utterances into phonemes, and compares them to one another to perform speech recognition.
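As a small sketch, a pronunciation dictionary can be modeled as a mapping from words to phoneme sequences. The ARPAbet-style symbols and the tiny dictionary below are illustrative; a real engine dictionary is far larger and engine-specific:

```python
# Illustrative pronunciation dictionary: words mapped to phoneme sequences.
# ARPAbet-style symbols are used here; real dictionaries are engine-specific.

PRONUNCIATIONS = {
    "food": ["F", "UW", "D"],  # three phonemes, four letters
    "cat": ["K", "AE", "T"],
}

def phonemes_for(word):
    """Return the phoneme sequence for a word, or None if unknown."""
    return PRONUNCIATIONS.get(word.lower())
```

Note that "food" maps to three phonemes even though it has four letters, matching the point above that letters and phonemes may not line up precisely.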

speaker-dependent: Speech recognition software that can only recognize the speech of users it has been trained to understand. Speaker-dependent software allows for very large vocabularies, but is limited to understanding only select speakers.

speaker-independent: Speech recognition software that can recognize a variety of speakers, without any training. Speaker-independent software generally limits the number of words in a vocabulary, but is the only realistic option for applications such as IVRs that must accept input from a large number of users.

speech application: An application that users interact with by speaking. Speech application is a broad term; typically the application runs separately from the speech platform and the speech engine.

speech engine: The software program that recognizes speech. A speech engine takes a spoken utterance, compares it to the vocabulary, and matches the utterance to vocabulary words.

speech platform: A piece of software that runs the speech application. It follows the logic of the application, collects spoken audio, passes the audio to the speech engine, and passes the recognition results back to the application.

speech recognition: The process by which a computer, using a speech engine, identifies the words in human speech.

speech recognition engine (SRE): See speech engine.

Speech Recognition Grammar Specification (SRGS): A W3C standard for writing grammars. SRGS grammars can be written in two equivalent formats, ABNF and grXML, and a grammar may be readily translated from one format to the other.

Touch-Tone: See DTMF.

training: The process of teaching a speaker-dependent system how a specific user speaks. Training is often a lengthy process that requires the user to read pre-written text into the system and then to correct the recognition results over time.

utterance: Spoken input from the user of a speech application. An utterance may be a single word, an entire phrase, a sentence, or even several sentences.

vocabulary: The total list of words against which the speech engine compares an utterance. The vocabulary is made up of all the words in all active grammars.

voice recognition: A form of biometrics that identifies users by recognizing their unique voices. Though it is often used interchangeably with speech recognition, the two are different. Voice recognition is concerned with recognizing voices, while speech recognition is concerned with recognizing the content of speech.