The following glossary defines many common terms used by the speech recognition industry:
automatic (or automated) speech recognition (ASR): See speech recognition.
confidence: The probability that the result returned by the speech engine matches what a speaker said. Speech engines generally return confidence scores that reflect this probability; the higher the score, the more likely the engine's result is correct.
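As a sketch of how an application might act on a confidence score (the function, the threshold, and the 0.0–1.0 scale are illustrative; actual score scales vary by engine):

```python
# Hypothetical handling of a recognition result. Engines commonly
# report confidence on some numeric scale; here we assume 0.0-1.0.
def handle_result(transcript, confidence, threshold=0.5):
    """Accept the transcript only when the engine is confident enough."""
    if confidence >= threshold:
        return transcript   # confident enough: treat as recognized
    return None             # too uncertain: the app should reprompt

print(handle_result("transfer to sales", 0.92))  # accepted
print(handle_result("transfer to sales", 0.31))  # rejected -> None
```

Applications typically reprompt the caller, or fall back to DTMF input, when the score falls below the chosen threshold.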
dictation software: A type of computer application that allows users to speak freely as the application transcribes each word they say. This kind of software is almost always speaker-dependent.
dual-tone multi-frequency (DTMF): The tones produced by pressing keys on a telephone. DTMF, also called Touch-Tone, is often used as a way of sending data to IVRs.
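Each DTMF key sounds two tones at once: one from a low-frequency group (the keypad row) and one from a high-frequency group (the column). The standard frequencies below are fixed by the telephone keypad layout; the lookup helper itself is illustrative:

```python
# Standard DTMF keypad frequencies (Hz): each key pair is one
# low-group (row) tone plus one high-group (column) tone.
LOW  = [697, 770, 852, 941]    # row frequencies
HIGH = [1209, 1336, 1477]      # column frequencies
KEYS = ["123", "456", "789", "*0#"]

def dtmf_tones(key):
    """Return the (low, high) frequency pair for a keypad key."""
    for row, row_keys in enumerate(KEYS):
        col = row_keys.find(key)
        if col != -1:
            return (LOW[row], HIGH[col])
    raise ValueError(f"not a DTMF key: {key!r}")

print(dtmf_tones("5"))  # (770, 1336)
```

An IVR's platform detects these frequency pairs in the incoming audio and maps them back to digits.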
grammar: A file that contains a list of words and phrases to be recognized by a speech application. Grammars may also contain bits of programming logic to aid the application. All of the active grammar words make up the vocabulary. See also ABNF, SRGS, and grXML.
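As an illustration, a minimal grammar in grXML form might look like the following (the rule name and word list are invented for this example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yesno">
  <rule id="yesno">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>maybe</item>
    </one-of>
  </rule>
</grammar>
```

While this grammar is active, the words yes, no, and maybe make up the application's vocabulary.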
interactive voice response (IVR): An automated system that allows callers to interact with a computer, using a telephone (or VOIP). An IVR may use speech recognition, DTMF, or a combination of the two.
phoneme: The basic unit of sound. In the same way that written words are composed of letters, a spoken word is composed of various phonemes, though they may not line up precisely. For instance, the English word "food" has three phonemes (the "f" sound, the "oo" sound, and the "d" sound) but four letters. A speech engine uses its dictionary to break up vocabulary words and utterances into phonemes, and compares them to one another to perform speech recognition.
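The dictionary lookup described above can be pictured as a mapping from words to phoneme sequences, written here in ARPAbet notation (the tiny lexicon and the helper function are illustrative; real engine dictionaries contain many thousands of entries):

```python
# Toy pronunciation lexicon using ARPAbet phoneme symbols.
LEXICON = {
    "food": ["F", "UW", "D"],   # 4 letters, but only 3 phonemes
    "fool": ["F", "UW", "L"],
    "mood": ["M", "UW", "D"],
}

def shared_phonemes(word_a, word_b):
    """Count position-by-position phoneme matches between two words."""
    return sum(a == b for a, b in zip(LEXICON[word_a], LEXICON[word_b]))

print(shared_phonemes("food", "mood"))  # 2 of 3 phonemes match
```

A real engine performs a far more sophisticated acoustic comparison, but the principle is the same: utterances and vocabulary words are compared phoneme by phoneme, not letter by letter.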
speaker-dependent: Speech recognition software that can only recognize the speech of users it has been trained to understand. Speaker-dependent software allows for very large vocabularies, but is limited to understanding only select speakers.
speaker-independent: Speech recognition software that can recognize a variety of speakers, without any training. Speaker-independent software generally limits the number of words in a vocabulary, but is the only realistic option for applications such as IVRs that must accept input from a large number of users.
speech application: An application in which users interact with the program by speaking. Speech application is a broad term; the application itself usually runs separately from the speech platform and the speech engine.
speech platform: A piece of software that runs the speech application. It follows the logic of the application, collects spoken audio, passes the audio to the speech engine, and passes the recognition results back to the application.
speech recognition: The process by which a computer speech engine recognizes human speech.
speech recognition engine (SRE): See speech engine.
Speech Recognition Grammar Specification (SRGS): A W3C standard for writing grammars. SRGS grammars can be written in two equivalent formats, ABNF and grXML, and a grammar can be readily translated from one format to the other.
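To illustrate the equivalence, here is the same one-rule grammar in both formats (the rule name and word list are invented for this example):

```
ABNF form:

    #ABNF 1.0;
    language en-US;
    root $yesno;
    $yesno = yes | no;

grXML form:

    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" xml:lang="en-US" root="yesno">
      <rule id="yesno">
        <one-of>
          <item>yes</item>
          <item>no</item>
        </one-of>
      </rule>
    </grammar>
```

Both declare a root rule that matches either yes or no; the ABNF form is more compact, while the grXML form is easier to process with standard XML tools.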
Touch-Tone: See DTMF.
training: The process of teaching a speaker-dependent system how a specific user speaks. Training is often a lengthy process that requires the user to read pre-written text into the system and then to correct the recognition results repeatedly.
utterance: Spoken input from the user of a speech application. An utterance may be a single word, an entire phrase, a sentence, or even several sentences.
voice recognition: A form of biometrics that identifies users by recognizing their unique voices. Though it is often used interchangeably with speech recognition, the two are different. Voice recognition is concerned with recognizing voices, while speech recognition is concerned with recognizing the content of speech.