From Speech Recognition to Voice Biometrics to Chatbots, voice technology has its own lexicon.
Check out our glossary to learn about the industry terms and what they mean.
Active Voice Biometrics: A text-dependent voice biometrics system which requires a speaker to say a specific set of words or phrases to authenticate.
Answering Machine Detection (AMD): Software that determines whether a machine (voicemail or answering machine) or a human has answered a call. AMD can be applied to all automated outbound calling use cases (outbound IVR, Text-to-speech, pre-recorded automated calls, or click-to-call).
Artificial intelligence (AI): AI is the simulation of human cognition, such as visual perception, speech and face recognition, translation, decision-making, and problem-solving, by computer systems. Through learning and natural language processing (NLP) technologies, computers are trained or “taught” to perform designated tasks by processing substantial amounts of data and finding structures and regularities in that data.
Augmented Backus-Naur Form (ABNF): A way of writing grammars that complies with the SRGS standard. It is a simple and very human-readable format.
Authentication: When a form of identification, i.e., something you are (biometrics), something you know (knowledge-based), or something you have (possession-based), is determined to be valid.
Automated Dialer: See predictive dialer and message delivery system.
Automated Password Reset: Enables customers to simply and securely reset passwords and PINs just by using their voice.
Automatic Speech Recognition (ASR): The process by which a computer speech engine recognizes human speech.
Call Progress Analysis: The overall concept of having software monitor the call’s progress as it plays out to determine whether a human or machine is on the other end, then providing notifications to other software (predictive dialers or outbound message systems for example) to let them decide what to do in various situations.
Caller Authentication: Validating the identity of a caller using their voice by either actively authenticating the caller with a vocal password or by passively listening to the caller while they are talking with a contact center agent.
Chatbot: A chatbot is a software application used to conduct an online chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent.
Claimant: An individual who submits biometric data for identity verification. A claimant may be genuine (a true claimant) or an impostor (a false claimant).
Confidence: The probability that the result returned by the speech engine matches what a speaker said. Speech engines return confidence scores that reflect the probability; the higher the score, the more likely the engine’s result is correct.
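A common pattern, sketched below without reference to any particular engine, is to accept a recognition result only when its confidence score clears a threshold and to reprompt the caller otherwise. The result tuples and threshold value are hypothetical:

```python
# Hypothetical n-best results from a speech engine: (recognized text, confidence).
results = [("transfer funds", 0.92), ("transfer phones", 0.41)]

CONFIDENCE_THRESHOLD = 0.60  # tuned per application; not a standard value

def best_accepted(results, threshold=CONFIDENCE_THRESHOLD):
    """Return the highest-confidence result above the threshold, or None to reprompt."""
    accepted = [r for r in results if r[1] >= threshold]
    return max(accepted, key=lambda r: r[1]) if accepted else None

print(best_accepted(results))  # ('transfer funds', 0.92)
```

If no hypothesis clears the threshold, the application would typically reprompt the caller rather than act on a low-confidence guess.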
Conversational AI: Conversational AI is the set of technologies behind automated messaging and speech-enabled applications that offer human-like interactions between computers and humans. Conversational AI can communicate like a human by recognizing speech and text, understanding intent, deciphering different languages, and responding in a way that mimics human conversation.
Conversational UI: A conversational user interface is a user interface for computers that emulates a conversation with a real human. Historically, computers have relied on text-based user interfaces and graphical user interfaces to translate the user’s desired action into commands the computer understands.
Deep Neural Networks (DNNs): A series of algorithms whose job is to identify relationships within a set of data, a process loosely modeled on the way neurons in the human brain identify underlying connections, as observed by researchers using high-quality brain scans.
Dictation Software: A type of computer application that allows users to speak freely as the application transcribes each word the users say. This kind of software is always speaker-dependent.
Dictionary: A large set of data used by a speech engine while doing speech recognition that defines the phonemes in a language or dialect.
Directed Dialogue: An approach to speech application design that prompts users to say specific phrases. Contrast with natural language.
Dual-Tone Multi-Frequency (DTMF): The tones produced by pressing keys on a telephone. DTMF, also called Touch-Tone, is often used as a way of sending data to Interactive Voice Response (IVR) systems.
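Each DTMF key is encoded as a pair of simultaneous tones, one from a low-frequency row group and one from a high-frequency column group. A minimal lookup using the standard DTMF frequency grid:

```python
# Standard DTMF keypad: rows map to low frequencies (Hz), columns to high frequencies (Hz).
ROW_FREQS = [697, 770, 852, 941]
COL_FREQS = [1209, 1336, 1477, 1633]
KEYPAD = ["123A", "456B", "789C", "*0#D"]

def dtmf_freqs(key):
    """Return the (low, high) tone pair in Hz for a telephone keypad key."""
    for row, keys in enumerate(KEYPAD):
        if key in keys:
            return ROW_FREQS[row], COL_FREQS[keys.index(key)]
    raise ValueError(f"not a DTMF key: {key!r}")

print(dtmf_freqs("5"))  # (770, 1336)
```

A receiver decodes a keypress by detecting which one frequency from each group is present at the same time.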
Equal Error Rate (EER): The operating point at which the false rejection rate equals the false acceptance rate. A lower EER indicates a more accurate biometric system.
False Rejection Rate (FRR): The measure of the likelihood that the biometric security system will incorrectly reject an access attempt by an authorized user. A system’s FRR typically is stated as the ratio of the number of false rejections divided by the number of identification attempts.
False Acceptance Rate (FAR): The measure of the likelihood that the biometric security system will incorrectly accept an access attempt by an unauthorized user. A system’s FAR typically is stated as the ratio of the number of false acceptances divided by the number of identification attempts.
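The relationship between FRR, FAR, and the equal error rate can be sketched by sweeping a decision threshold over match scores. The score lists below are hypothetical:

```python
def rates(genuine, impostor, threshold):
    """FAR: impostor scores accepted; FRR: genuine scores rejected, at a threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

# Hypothetical match scores (higher = stronger match to the claimed identity).
genuine = [0.9, 0.8, 0.7, 0.2]    # attempts by authorized users
impostor = [0.1, 0.3, 0.4, 0.85]  # attempts by unauthorized users

# The EER is (approximately) the threshold where FAR and FRR cross.
thresholds = sorted(set(genuine + impostor))
eer_threshold = min(thresholds,
                    key=lambda t: abs(rates(genuine, impostor, t)[0]
                                      - rates(genuine, impostor, t)[1]))
print(rates(genuine, impostor, eer_threshold))  # (0.25, 0.25)
```

Raising the threshold lowers FAR but raises FRR, and vice versa, which is why the two rates trade off against each other.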
Fraudster Watchlist: A list comprised of known fraudster voiceprints, against which suspected fraudster voiceprints are compared.
Grammar: A file that contains a list of words and phrases to be recognized by a speech application. Grammars may also contain bits of programming logic to aid the application. All of the active grammar words make up the vocabulary. See also ABNF, SRGS, and grXML.
grXML: A way of writing SRGS grammars in XML. It is a standard format that is less readable by humans than ABNF grammars but is used widely by grammar editing tools.
Identification/Claimed Identity: The identity a claimant asserts, which the system verifies against a biometric sample of an enrolled user.
Interactive Voice Response (IVR): An automated system that allows callers to interact with a computer using a telephone or VoIP (Voice over IP). An IVR may use speech recognition, DTMF, or a combination of the two.
Machine Learning (ML): ML is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. ML focuses on the development of computer programs that can access data and use it to learn for themselves.
Message Delivery System / Outbound Messaging System: An automated dialer used to dial specified numbers and deliver some message “payload,” which can vary: appointment reminders, political campaign messages, emergency notifications, and so on. These calls are usually queued up, and the system works through the list, delivering the messages when it can.
Natural Language: An approach to speech application design that encourages users to speak naturally to the system. Contrast with directed dialogue.
Natural Language Understanding (NLU): Natural language understanding is a branch of artificial intelligence that uses computer software to understand input in the form of sentences using text or speech.
Natural Language Processing (NLP): Natural language processing refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.
Passive Voice Biometrics: A text-independent voice biometrics system, which does not require specific words to be spoken; instead, the speaker can converse naturally to authenticate.
Phoneme: The basic unit of sound. In the same way that written words are composed of letters, a spoken word is composed of various phonemes, though they may not line up precisely. For instance, the English word “food” has three phonemes (the “f” sound, the “oo” sound, and the “d” sound) but four letters. A speech engine uses its dictionary to break up vocabulary words and utterances into phonemes and compares them to one another to perform speech recognition.
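A speech engine's dictionary can be pictured as a word-to-phoneme lookup. The toy entries below use ARPAbet-style phoneme symbols and cover only a few words; real dictionaries contain many thousands of entries and alternative pronunciations:

```python
# Toy pronunciation dictionary (ARPAbet-style phoneme symbols; illustrative only).
DICTIONARY = {
    "food": ["F", "UW", "D"],
    "mood": ["M", "UW", "D"],
    "feed": ["F", "IY", "D"],
}

def phonemes(word):
    """Look up a word's phoneme sequence, or None if it is out of vocabulary."""
    return DICTIONARY.get(word.lower())

print(phonemes("food"))  # ['F', 'UW', 'D'] -- three phonemes, four letters
```

Note how "food" and "mood" differ in a single phoneme, which is exactly the kind of distinction a speech engine must resolve when matching an utterance against its vocabulary.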
Predictive Dialer: An automated dialer that uses statistics to predict when a call center agent is going to be freed up for the next call. This is based on monitoring calls from all agents and taking several factors and timings into consideration.
Registration: The process of gathering biometric data of a user, processing this data, and storing it in a database.
Security Threshold: A setting to ensure a biometric system is highly secure, highly convenient, or somewhere in between. The threshold/acceptance level may be tightened or widened according to business requirements.
Speaker-Dependent: Speech recognition software that can only recognize the speech of users it is trained to understand. Speaker-dependent software allows for exceptionally large vocabularies but is limited to understanding only select speakers.
Speaker-Independent: Speech recognition software that can recognize a variety of speakers, without any training. Speaker-independent software limits the number of words in a vocabulary but is the only realistic option for applications such as IVRs that must accept input from many users.
Speech Application: An application in which users interact with the program by speaking. Speech application is a broad term, but usually it runs separately from the speech platform and the speech engine.
Speech Engine/ Speech Recognition Engine: The software program that recognizes speech. A speech engine takes a spoken utterance, compares it to the vocabulary, and matches the utterance to vocabulary words.
Speech Platform: A piece of software that runs the speech application. It follows the logic of the application, collects spoken audio, passes the audio to the speech engine, and passes the recognition results back to the application.
Speech Recognition: The process by which a computer speech engine recognizes human speech. See Automatic Speech Recognition (ASR).
Speech Recognition Grammar Specification (SRGS): A W3C standard for writing grammars. SRGS grammars can be written in two main formats: ABNF and grXML, both of which are equivalent to one another. Grammars may be easily translated between the two formats.
Speech-to-text: The ability of a computer program to recognize a person’s speech output and translate it into text that can be leveraged as either actual text or a command.
Text to Speech (TTS): A technology that creates an engaging and personalized user experience by converting text into a spoken voice output.
Touch-Tone: see DTMF.
Training: The process of teaching a speaker-dependent system how a specific user speaks. This is often a lengthy process that requires a user to read some pre-written text into the system, and then to continually adjust the recognition results.
Transcription engine: A technology that automatically transcribes multi-speaker audio data with high accuracy and turns pre-recorded or live audio streams into actionable data that can be used for a host of new applications.
Utterance: Spoken input from the user of a speech application. An utterance may be a single word, an entire phrase, a sentence, or even several sentences.
Virtual Assistant: Commonly known as a virtual customer assistant (VCA), virtual digital assistant (VDA), or virtual personal assistant (VPA). A virtual assistant is a conversational self-service tool powered by artificial intelligence (AI) that can have an automated conversation with a customer in any digital channel, with the capability to utilize learnings from past conversations to continuously improve engagement.
Vocal Password: Users can simply and securely use a spoken passphrase to validate their identity by matching their voiceprint on file.
Voice Authentication: A security process in which a user can use their unique voiceprint to authenticate who they are by matching the voiceprint on file.
Voice Biometrics: The use of behavioral and physical voice patterns to identify individuals.
Voice Identification: Similar to voice authentication, voice identification is a security process in which a user can actively use their voice to authenticate by using a vocal password, or passively, by listening in the background while the user talks to a contact center agent.
Voice Recognition: The ability of a computer program to recognize a person’s unique voice signature and assist with identification.
Voicebot: A voicebot enables its user to interact with a device or a service simply by speaking. Powered by artificial intelligence and Natural Language Processing (NLP), a voicebot can understand a spoken question or request and structure a fitting audio response.
Voiceprint: A mathematical representation of a speaker’s voice that can be compared to a secondary voiceprint to facilitate authentication.
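In many modern systems (an implementation assumption, not something the definition above prescribes), a voiceprint is stored as an embedding vector, and two voiceprints are compared with a similarity score such as cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical low-dimensional voiceprints (real embeddings have hundreds of dimensions).
enrolled = [0.20, 0.90, 0.40]  # voiceprint on file
probe = [0.25, 0.85, 0.38]     # voiceprint from the current call

score = cosine_similarity(enrolled, probe)
print(score >= 0.85)  # accept the claimed identity only above a tuned threshold
```

The acceptance threshold here corresponds to the security threshold defined earlier in this glossary: tightening it trades convenience for security.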
Word Error Rate (WER): A measure of the accuracy of a speech recognition engine, obtained by comparing the engine’s output against a reference transcript; a lower WER indicates better recognition.
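WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the engine's hypothesis and a reference transcript, divided by the number of words in the reference. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          sub)               # substitution or exact match
    return d[len(r)][len(h)] / len(r)

# One word dropped out of six reference words -> WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than a percentage capped at 100.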