Video Transcription

Part 2 Speech Recognition Basics

In this segment we will be looking at:

Process

The heart of the process is the Speech Engine. This is the software that does the actual speech recognition. Such software is also commonly referred to as an automatic speech recognition (ASR) system.

  1. The first thing that it does is load up a list of words that you want to have recognized. Only words that have been defined in advance can be recognized. This list of recognizable words is the grammar.

  2. It then loads audio. Once the grammar has been established, the speech engine also requires audio. The audio may be someone speaking into a microphone or a telephone, or it may come from some type of computer application. There are many different ways to get audio to the speech engine.

  3. It analyzes the audio for distinct sounds and characteristics. Once loaded, the speech engine will break the audio down into a mathematical representation of sound called a waveform. The engine will look at specific key mathematical characteristics of sound. It will separate speech from background noise, which is important. What the speech engine is really looking for are sounds that make up a language, sounds that mark the transition between words, and other technical aspects of sound that help the engine figure out how the audio translates to spoken language.

  4. It compares the sound to internal acoustic models, using the grammar. The acoustic models are built into the speech engine and are huge sets of data drawn from transcribed speech by many speakers. The engine uses them to determine how words sound. The engine compares the audio to the internal acoustic models, searching for matches using the grammar as a guide. Just as a search engine on the Internet looks for Web pages, our speech engine searches the acoustic models and returns results.

  5. It returns probable matches. Like a search engine, we don't just return one result, although the top response is the most likely match. We will also provide other possibilities. The engine returns a list of the most likely results that match.
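The five steps above can be sketched as a toy program. This is only an illustration under heavy assumptions, not how the LumenVox Speech Engine or any real recognizer is implemented: the "acoustic model" here is a hypothetical hand-written table of rough phoneme strings, the "audio" is assumed to have already been turned into a phoneme sequence, and the names TOY_ACOUSTIC_MODEL and recognize are made up for this example. Matching is done with Python's standard difflib rather than real statistical models.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for an acoustic model: each word maps to a rough
# phoneme string. A real engine uses models trained on transcribed speech.
TOY_ACOUSTIC_MODEL = {
    "sales": "S EY L Z",
    "support": "S AH P AO R T",
    "billing": "B IH L IH NG",
    "operator": "AA P ER EY T ER",
}


def recognize(observed_phonemes, grammar, n_best=3):
    """Return up to n_best (word, score) pairs, most likely match first.

    Only words listed in the grammar are considered, mirroring step 1.
    """
    observed = observed_phonemes.split()
    results = []
    for word in grammar:
        expected = TOY_ACOUSTIC_MODEL[word].split()
        # Similarity between what was "heard" and how the word should sound.
        score = SequenceMatcher(None, observed, expected).ratio()
        results.append((word, round(score, 2)))
    # Step 5: rank the candidates, best first, and return a short list.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:n_best]


# Step 1: the grammar. "operator" is in the toy model but not in the
# grammar, so it can never be returned.
grammar = ["sales", "support", "billing"]

# Steps 2-3 are assumed done: the audio is already a phoneme sequence,
# here a slightly clipped "sales".
matches = recognize("S EY L", grammar)

for word, score in matches:
    print(word, score)
```

The top entry is the most likely match, and the remaining entries are the other possibilities that the engine would also report.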

Types of Speech Recognition

Speaker Dependent Software

Speaker Independent Software

At LumenVox we make speaker independent software, which is great for uses such as IVRs, call centers, call routers, etc.

With speaker independent software come two very important concepts, grammars and vocabularies:

An important thing to understand once you start getting involved with speech development is that, generally speaking, smaller grammars equal higher accuracy for recognition. Smaller grammars narrow the search space. When using larger grammars, you increase the chances of having two words or phrases that sound very similar. When that happens, it becomes more difficult for the speech engine to differentiate between them. Larger applications with large vocabularies can be built; however, they will require more time to build and test, and may require more troubleshooting.
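The effect of grammar size can be illustrated with a small sketch. It assumes, purely for demonstration, that spelling similarity is a rough proxy for how alike two words sound. A real engine compares acoustic models, not spellings, and the function name confusable_pairs, the 0.75 threshold, and the example word lists are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(grammar, threshold=0.75):
    """List pairs of grammar entries whose spellings are very similar.

    Spelling similarity is only a crude stand-in for acoustic similarity.
    """
    pairs = []
    for a, b in combinations(grammar, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


small = ["sales", "support", "billing"]
# Adding more entries to the same grammar creates more chances of overlap.
large = small + ["sails", "report", "building", "filling"]

print("small grammar:", confusable_pairs(small))
print("large grammar:", confusable_pairs(large))
```

In this toy setup the small grammar has no similar-looking pairs, while the larger one introduces pairs such as "sales" and "sails" that a recognizer would have to work harder to tell apart.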

Description

Delve into speech recognition further with the second part of our series on Speech Recognition Basics. In this video we cover speaker dependent vs. independent recognition, and grammars and vocabularies.

Runtime

Video playtime
10:08

Chapters In This Section

Part 1 Speech Recognition Basics, Part 1
Part 2 Speech Recognition Basics, Part 2
Part 3 Converting DTMF to Speech, Part 1
Part 4 Converting DTMF to Speech, Part 2
Part 5 Distributed Architecture
Part 6 MRCP vs. API, Part 1
Part 7 MRCP vs. API, Part 2
Part 8 Speech Recognition Don'ts
Part 9 Localizing Speech Applications, Part 1
Part 10 Localizing Speech Applications, Part 2