Introduction to ASR

Reference Number: AA-00602

The LumenVox Automatic Speech Recognition Engine is core technology that lets you load audio and grammars (lists of words and phrases for the Engine to recognize) and obtain the decoded utterance. Obtaining decoded text from audio can be broken into three general phases:

Setup: Your application initializes a speech port and then loads and activates grammars.
Interacting with Audio: Audio is loaded into the Engine, which determines where speech begins and ends. The Engine then decodes the audio and makes results available.
Getting Results: The Engine makes results available as a parse tree from a grammar or as a semantic interpretation entity, which your application can obtain and process. You may then close the speech port.

Overview of Application Design

The Engine accepts audio in one of two ways. The simplest is batch (offline) mode, where you load pre-recorded audio into the Engine. In this case, your application will need to follow a process like this:

Open a speech port.
Add grammars into the port.
Add the available audio.
Ask the Engine to decode the added audio based on the loaded grammars.
Get results.
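
The batch steps above can be sketched in a few lines of Python. This is only an illustration of the call order, not the LumenVox API: ToySpeechPort, FakeAudio, and run_batch are hypothetical names, and the "decode" simply checks a pre-supplied transcript against the loaded phrase lists instead of analyzing real audio.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAudio:
    """Placeholder for a recording (hypothetical). `heard` stands in for
    what a real acoustic model would extract from the samples."""
    heard: str


@dataclass
class ToySpeechPort:
    """Hypothetical stand-in for a speech port; NOT the LumenVox API."""
    grammars: dict = field(default_factory=dict)  # grammar name -> set of phrases
    audio: list = field(default_factory=list)     # clips waiting to be decoded
    is_open: bool = False

    def open(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False

    def _require_open(self) -> None:
        if not self.is_open:
            raise RuntimeError("speech port is not open")

    def load_grammar(self, name: str, phrases: list) -> None:
        self._require_open()
        self.grammars[name] = {p.lower() for p in phrases}

    def add_audio(self, clip: FakeAudio) -> None:
        self._require_open()
        self.audio.append(clip)

    def decode(self) -> list:
        """Return one entry per clip: (grammar name, phrase), or None if
        no loaded grammar contains what was heard."""
        self._require_open()
        results = []
        for clip in self.audio:
            heard = clip.heard.lower()
            match = None
            for name, phrases in self.grammars.items():
                if heard in phrases:
                    match = (name, heard)
                    break
            results.append(match)
        return results


def run_batch(clips: list, grammars: dict) -> list:
    port = ToySpeechPort()
    port.open()                                   # 1. open a speech port
    try:
        for name, phrases in grammars.items():    # 2. add grammars
            port.load_grammar(name, phrases)
        for clip in clips:                        # 3. add the available audio
            port.add_audio(clip)
        return port.decode()                      # 4-5. decode and get results
    finally:
        port.close()                              # release the port even on error
```

Calling run_batch([FakeAudio("Yes"), FakeAudio("maybe")], {"yesno": ["yes", "no"]}) yields a match for the first clip and None for the second, which mirrors how a grammar constrains what the Engine can return.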

Most applications, however, will be more complex because they need a streaming (online) interface that loads audio directly from a live source, such as a telephone caller. In this case, your application will have two separate threads.

The first thread will be a main processing thread, similar to the simple application process spelled out above:

Open the speech port.
Add grammars into the port.
Set up a streaming interface for the Engine to acquire audio.
Start the streaming interface and wait for the Engine to return a decode.
Get results.

At the same time, you will have a streaming thread that loops, continuously feeding contiguous audio data buffers to the Engine. While this feeding of audio happens, the following events occur in the Engine:

The Engine will evaluate the audio data, looking for speech. If it detects speech, the Engine will announce it in the callback.
The Engine will continue to evaluate audio until it detects the end of speech.
When the end of speech is detected, the Engine will perform a decode if AutoDecode has been set; otherwise, it will simply report that it is ready to decode.
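
The two-thread design and the event sequence above can be sketched as follows. Everything here is an assumption made for illustration: ToyStreamingEngine, its event names, and the simple loudness threshold are hypothetical and do not reflect how the LumenVox Engine detects speech or names its callbacks. The toy also does not decode anything; it only reports the events in the order described.

```python
import threading


class ToyStreamingEngine:
    """Hypothetical endpointing engine; NOT the LumenVox API."""

    def __init__(self, on_event, auto_decode=True, threshold=500, trailing_silence=2):
        self.on_event = on_event                  # callback invoked with an event name
        self.auto_decode = auto_decode            # mirrors the AutoDecode setting
        self.threshold = threshold                # mean |sample| that counts as speech
        self.trailing_silence = trailing_silence  # quiet buffers that end the utterance
        self.state = "WAITING_FOR_SPEECH"
        self.speech = []                          # buffers captured during speech
        self._quiet = 0

    def feed(self, buf):
        """Accept the next contiguous buffer. Returns False once the end of
        speech has been detected and no more audio is needed."""
        if self.state == "DONE":
            return False
        loud = sum(abs(s) for s in buf) / max(len(buf), 1) > self.threshold
        if self.state == "WAITING_FOR_SPEECH":
            if loud:
                self.state = "IN_SPEECH"
                self.speech.append(buf)
                self.on_event("START_OF_SPEECH")
            return True
        # IN_SPEECH: keep collecting audio and count consecutive quiet buffers.
        self.speech.append(buf)
        self._quiet = 0 if loud else self._quiet + 1
        if self._quiet >= self.trailing_silence:
            self.state = "DONE"
            self.on_event("END_OF_SPEECH")
            self.on_event("DECODE_COMPLETE" if self.auto_decode else "READY_TO_DECODE")
            return False
        return True


def recognize_stream(buffers, auto_decode=True):
    """Main processing thread: set up, start streaming, wait, collect events."""
    events = []
    finished = threading.Event()

    def on_event(name):
        events.append(name)
        if name in ("DECODE_COMPLETE", "READY_TO_DECODE"):
            finished.set()

    engine = ToyStreamingEngine(on_event, auto_decode=auto_decode)

    def stream():
        # Streaming thread: feed contiguous buffers until the engine is done.
        for buf in buffers:
            if not engine.feed(buf):
                break
        finished.set()  # also wake the main thread if the source ran dry

    feeder = threading.Thread(target=stream, daemon=True)
    feeder.start()
    finished.wait(timeout=5)
    feeder.join(timeout=5)
    return events
```

Feeding one quiet buffer, two loud buffers, and then two quiet buffers produces the start-of-speech event, the end-of-speech event, and then either a completed decode or a ready-to-decode notification, depending on the auto_decode flag.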

You will want to keep this overall design in mind as you get started learning about the Engine. This guide will walk you through the basic steps, providing example code along the way. To get started, continue on to Initializing a Speech Port.

If this is the first time you are attempting automatic speech recognition, you may wish to browse through the core programming guide to get an idea of our API, and then immediately read our tutorial on writing SRGS grammars. Much of your application's success will depend on the quality of your grammars, so understanding how to write quality grammars is important.
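
For readers who have not seen an SRGS grammar before, the sketch below embeds a minimal grammar in the W3C SRGS XML form and lists the phrases its root rule accepts. The grammar content and the helper phrases_in_root_rule are illustrative assumptions, not material from the LumenVox tutorial, and the helper only handles a flat list of items rather than the full SRGS syntax.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"  # namespace defined by W3C SRGS 1.0

# A minimal yes/no grammar (illustrative content only).
YES_NO_GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="answer">
  <rule id="answer" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>I am not sure</item>
    </one-of>
  </rule>
</grammar>
"""


def phrases_in_root_rule(srgs_xml: str) -> list:
    """Return the literal phrases listed in the grammar's root rule.

    Hypothetical helper: it ignores rule references, repeats, weights,
    and nested items, so it is only suitable for flat word lists."""
    root = ET.fromstring(srgs_xml.encode("utf-8"))
    root_rule_id = root.get("root")
    for rule in root.findall(f"{{{SRGS_NS}}}rule"):
        if rule.get("id") == root_rule_id:
            return [
                " ".join(item.text.split())        # normalize whitespace
                for item in rule.iter(f"{{{SRGS_NS}}}item")
                if item.text and item.text.strip()
            ]
    return []
```

A helper like this can be handy for checking during development that a grammar lists the phrases you expect before loading it into a speech port.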