Part 2 Speech Recognition Basics Video



  • Delve into speech recognition further with the second part of our series on Speech Recognition Basics. In this video we cover speaker dependent vs. independent recognition, and grammars and vocabularies.
  • RUNTIME 10:08


Video Transcription

Part 2 Speech Recognition Basics

In this segment we will be looking at:

  • The process a speech engine uses to recognize a speaker's speech
  • Types of speech recognition software
  • Two concepts that are important in speech recognition: the grammar and the vocabulary. They are similar, but it is important to know what each is and how they work


The heart of the process is the Speech Engine. This is the software that does the actual speech recognition; such software is also commonly referred to as an automatic speech recognition (ASR) system.

  1. The first thing it does is load a list of the words that you want to have recognized. Only words that have been defined in advance can be recognized. This list of recognizable words is the grammar.
  2. It then loads audio. Once the grammar has been established, the speech engine also requires audio. The audio may come from someone speaking into a microphone, from a telephone, or from another computer application. There are many different ways to get audio to the speech engine.
  3. It analyzes the audio for distinct sounds and characteristics. Once the audio is loaded, the speech engine breaks it down into a mathematical representation of sound called a waveform and examines key mathematical characteristics of that sound. It separates speech from background noise, which is important. What the speech engine is really looking for are the sounds that make up a language, the sounds that mark the transitions between words, and other technical aspects of sound that help the engine figure out how the audio translates to spoken language.
  4. It compares the sound to internal acoustic models, using the grammar. The acoustic models inside the speech engine are large sets of transcribed speech data from many speakers, which the engine uses to determine how words sound. The engine compares the audio to these internal acoustic models, searching for matches with the grammar as a guide. Just as a search engine on the Internet looks for Web pages, our speech engine searches the acoustic model and returns results.
  5. It returns probable matches. Like a search engine, it does not return just one result: the top response is the most likely match, but other possibilities are provided as well. The engine returns a list of the most likely results that match.
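The five steps above can be sketched in miniature. The following is a toy Python sketch, not a real speech engine or the LumenVox API: the decoded audio is stood in for by a text string, and the acoustic comparison of step 4 is faked with a simple string-similarity score. All names here are hypothetical.

```python
# Toy sketch of the recognition loop -- NOT a real speech engine.
from difflib import SequenceMatcher

def recognize(audio, grammar, n_best=3):
    """Score each grammar word against the 'audio' and return an
    N-best list, most likely match first (steps 4 and 5)."""
    scored = [
        (word, SequenceMatcher(None, audio.lower(), word.lower()).ratio())
        for word in grammar  # step 4: search only the words the grammar allows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_best]  # step 5: several ranked hypotheses, not just one

grammar = ["sales", "support", "billing", "operator"]  # step 1: words defined in advance
audio = "suport"              # steps 2-3: pretend this is the decoded audio
results = recognize(audio, grammar)
print(results[0][0])          # top hypothesis
```

Note that nothing outside the grammar can ever be returned: the engine only searches among the words it was told to expect, which is exactly why defining the grammar is step one.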

Types of Speech Recognition

Speaker Dependent Software

  • This is software that works with a specific voice. This type has to be trained, meaning you will have to read 10 to 15 minutes of predefined text into the engine. While doing so, the engine is carefully studying how you speak. Since every voice has its own characteristics, the engine learns how you pronounce words and phrases and sounds. It uses those to build acoustic models tailored to you.
  • Speaker dependent software allows a very large vocabulary, a large number of words that can be recognized. This is important because when using speaker dependent software you will want to be able to speak naturally, so you'll want a vocabulary of 100,000 words or more. This is why speaker dependent software is usually used for dictation software.
  • The downside of this type of software is that you'll have to train it and correct it until it slowly adapts to your specific speaking pattern. This is not good if you have a telephone application, IVR, call center or call router, where it is not feasible to have customers call in and spend time training the software before they use it.

Speaker Independent Software

  • Speaker independent software requires no training. Instead we have acoustic models and they are more generalized. The software is independent of the speaker so anyone can use it.
  • The downside is that it does not have as large a vocabulary. The software does not recognize the 100,000-plus words needed for dictation software. Speaker independent software is not generally used for dictation applications or for transcribing natural speech.
  • What it is good for are telephone applications. A vocabulary of thousands or even tens of thousands of words can be used, but you cannot do natural speech transcription.

At LumenVox we make speaker independent software, which is great for uses such as IVRs, call centers, call routers, etc.

With speaker independent software come two very important concepts, grammars and vocabularies:


  • Because we have a limited number of words we can have recognized at once, we have to specify in advance which words those will be. We do that using what we call grammars. A grammar is a file on a computer with a structure that tells the engine which words need to be recognized. So if we have a call router, the grammar may consist of all the names and departments that we would like to have recognized. Grammars also contain other things, such as programming logic called semantic interpretation, which allows the system to carry out instructions like "when this word or phrase is heard, perform a specific task." So grammars consist of words, phrases and sentences, and the many different ways they can be ordered. Another important thing to know is that multiple grammars can be loaded at once, and grammars can be loaded and unloaded dynamically as needed. For example, one prompt may use one grammar and the next prompt a second grammar.
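As a rough illustration of the call-router idea, here is a toy Python structure standing in for a grammar file. This is a hypothetical sketch, not an actual grammar file format: each entry maps the phrases a caller might say to a semantic interpretation, the value the application acts on.

```python
# Hypothetical call-router "grammar": phrases a caller may say,
# grouped under the semantic interpretation each one should yield.
call_router_grammar = {
    "sales":   ["sales", "the sales department"],
    "support": ["support", "tech support", "technical support"],
    "billing": ["billing", "accounts"],
}

def interpret(utterance):
    """Return the semantic interpretation for a recognized phrase,
    or None if the phrase is not in the grammar at all."""
    for department, phrases in call_router_grammar.items():
        if utterance.lower() in phrases:
            return department
    return None  # out-of-grammar: the engine has nothing to match

print(interpret("tech support"))  # -> support
```

The point of the semantic interpretation layer is that several different phrasings ("support", "tech support", "technical support") all collapse to one value the application logic can route on.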


  • At any given time, the speech engine can only recognize the words in the loaded grammars. That set of words is called the vocabulary: the sum of all of the loaded grammars. So if you have 10 grammars with 50 words each, you have a total vocabulary of 500 words. The vocabulary determines which words can be recognized at any one time. This will influence your prompt design, and your prompt design will in turn influence your grammars and thus your vocabulary.
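The vocabulary arithmetic above can be made concrete with a small sketch. The word lists here are synthetic placeholders; the point is only that the active vocabulary is the union of whatever grammars happen to be loaded, and shrinks when a grammar is unloaded.

```python
# Sketch of the 10-grammars-of-50-words example from the text,
# using synthetic word lists.
grammars = [
    {f"word_{g}_{i}" for i in range(50)}  # grammar g holds 50 distinct words
    for g in range(10)                    # 10 grammars loaded at once
]

# The active vocabulary is the union of every loaded grammar.
vocabulary = set().union(*grammars)
print(len(vocabulary))  # 500 -- assuming no word appears in two grammars

# Unloading one grammar shrinks the vocabulary dynamically.
vocabulary = set().union(*grammars[:-1])
print(len(vocabulary))  # 450
```

If the same word did appear in two grammars, the union would count it once, so the vocabulary can be smaller than the raw word counts suggest.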

An important thing to understand once you start getting involved with speech development is that, generally speaking, smaller grammars mean higher recognition accuracy, because smaller grammars narrow the search space. When using larger grammars, you increase the chances of two words or phrases sounding very similar, and when that happens it becomes more difficult for the speech engine to differentiate between them. Larger applications with large vocabularies can be built; however, they will require more time to build and test, and may require more troubleshooting.

© 2018 LumenVox, LLC. All rights reserved.