Recommended Engine Settings

Reference Number: AA-01485

To get the highest accuracy from the Speech Engine, it helps to adjust a number of parameters that control how the Engine distinguishes speech from background noise.

These parameters are called voice activity detection (VAD) parameters. Adjusting them gives developers the ability to fine-tune their speech application (see LV_SRE_StreamSetParameter for adjusting these parameters in the C API, and LVSpeechPort::StreamSetParameter for the C++ API).

Many recognition problems occur when the Engine mistakes background noise for speech, or when it hears prompts playing and triggers barge-in prematurely. Changing the VAD parameters allows developers to combat these problems.

We recommend the following settings for most applications:

Parameter                            Recommended Value
STREAM_PARM_VAD_EOS_DELAY            1250
STREAM_PARM_VAD_WIND_BACK            750
STREAM_PARM_VAD_VOLUME_SENSITIVITY   50
STREAM_PARM_VAD_SNR_SENSITIVITY      50
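
One way to keep the recommended values in a single place in application code is a small lookup table. The following is a minimal sketch, not part of the Engine's API: the struct and function names are hypothetical, and the parameter names are stored as plain strings for illustration. A real application would still pass each value to the Engine through LV_SRE_StreamSetParameter or LVSpeechPort::StreamSetParameter, which is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* One VAD setting: the parameter name and its recommended value.
 * Hypothetical type, used only for this illustration. */
struct vad_setting {
    const char *name;
    int value;
};

/* Recommended values copied from the table above. */
static const struct vad_setting RECOMMENDED_VAD[] = {
    { "STREAM_PARM_VAD_EOS_DELAY",          1250 },
    { "STREAM_PARM_VAD_WIND_BACK",           750 },
    { "STREAM_PARM_VAD_VOLUME_SENSITIVITY",   50 },
    { "STREAM_PARM_VAD_SNR_SENSITIVITY",      50 },
};

/* Returns the recommended value for the named parameter,
 * or -1 if the name is not in the table. */
int recommended_vad_value(const char *name)
{
    size_t count = sizeof RECOMMENDED_VAD / sizeof RECOMMENDED_VAD[0];
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(RECOMMENDED_VAD[i].name, name) == 0)
            return RECOMMENDED_VAD[i].value;
    }
    return -1;
}
```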

One of the most useful VAD parameters to adjust is the end-of-speech delay (STREAM_PARM_VAD_EOS_DELAY). This is the amount of time, specified in milliseconds, that the Engine must detect silence after speech before it begins processing the utterance.

Matching this value to the kind of input you expect from the caller makes the system feel more responsive. If the speech application asks a caller a yes/no question, this value can be very short, because the answer will only be a single word.

Because users tend to pause frequently while speaking long strings, such as account numbers, the end-of-speech delay should be set higher when collecting that kind of information. A good starting value for this sort of collection is 2000 milliseconds. You should almost never set this value below 600 milliseconds.
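
The guidance on end-of-speech delay could be captured in a small helper that picks a value per prompt. This is a sketch under stated assumptions: the enum and function names are hypothetical, and the 800 ms value for yes/no questions is only an illustrative choice, since the text says "very short" without giving a number. The 1250, 2000, and 600 ms figures come from the recommendations in this article.

```c
/* Kinds of caller input an application might expect.
 * Hypothetical names, for illustration only. */
enum input_kind {
    INPUT_YES_NO,      /* single-word answers */
    INPUT_GENERAL,     /* typical prompts */
    INPUT_LONG_DIGITS  /* account numbers and similar long strings */
};

#define EOS_DELAY_FLOOR_MS   600u   /* almost never go below this */
#define EOS_DELAY_YES_NO_MS  800u   /* assumption: illustrative short value */
#define EOS_DELAY_DEFAULT_MS 1250u  /* general recommendation */
#define EOS_DELAY_LONG_MS    2000u  /* starting value for long strings */

/* Suggests an end-of-speech delay in milliseconds for the given input. */
unsigned eos_delay_for(enum input_kind kind)
{
    switch (kind) {
    case INPUT_YES_NO:
        return EOS_DELAY_YES_NO_MS;
    case INPUT_LONG_DIGITS:
        return EOS_DELAY_LONG_MS;
    case INPUT_GENERAL:
    default:
        return EOS_DELAY_DEFAULT_MS;
    }
}

/* Raises a requested delay to the 600 ms floor if it is lower. */
unsigned clamp_eos_delay(unsigned requested_ms)
{
    return requested_ms < EOS_DELAY_FLOOR_MS ? EOS_DELAY_FLOOR_MS
                                             : requested_ms;
}
```

The value returned by such a helper would then be set as STREAM_PARM_VAD_EOS_DELAY before each prompt; tune the yes/no figure against real caller audio rather than treating it as a recommendation.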

STREAM_PARM_VAD_WIND_BACK is also useful. When barge-in occurs, it makes the Engine "wind back" the specified number of milliseconds, capturing a little audio from before the point where the Engine detected the start of speech. If you find the start of your utterances is regularly being cut off, set this value higher.

The STREAM_PARM_VAD_VOLUME_SENSITIVITY and STREAM_PARM_VAD_SNR_SENSITIVITY parameters control how easily your callers can trigger barge-in. For most applications, you will not need to adjust these. If you find that barge-in is being triggered by things other than the speech of your callers, you may want to set them higher.

In particular, STREAM_PARM_VAD_VOLUME_SENSITIVITY is designed to help deal with equipment that has poor echo cancellation. Setting it higher makes barge-in less sensitive, so echoed prompts are less likely to falsely trigger barge-in.

For more information about the SNR and volume sensitivity settings, see Sensitivity Settings.

Confidence Scores

The confidence score is a rough measure of how closely the speech matched the phrases in the grammar. The score ranges from 0 to 1000. The higher the score, the higher the estimated probability that the result is correct. A score of 500 indicates the Engine is 50 percent sure the result is correct. Typically, an application designer will use the confidence score to decide whether to reject, confirm, or accept a recognition result.

For most applications, the rejection confidence score should be set to 450, so that any utterances with scores below 450 are rejected.

The confirmation confidence score, the level below which you ask for confirmation, should usually be set to 700. Anything above 700 can usually be safely accepted by an application.
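
The two thresholds split the 0 to 1000 range into three bands, which can be sketched as a simple classifier. The enum and function names below are hypothetical. Treating a score of exactly 700 as accepted is an assumption, based on the confirmation score being described as the level below which you ask for confirmation.

```c
/* What the application should do with a recognition result.
 * Hypothetical names, for illustration only. */
enum result_action {
    RESULT_REJECT,   /* discard the result, usually re-prompt */
    RESULT_CONFIRM,  /* ask the caller to confirm the result */
    RESULT_ACCEPT    /* use the result as-is */
};

#define REJECTION_THRESHOLD    450  /* scores below this are rejected */
#define CONFIRMATION_THRESHOLD 700  /* scores below this need confirmation */

/* Maps a confidence score (0 to 1000) to an action.
 * Assumption: a score equal to a threshold falls in the higher band. */
enum result_action classify_confidence(int score)
{
    if (score < REJECTION_THRESHOLD)
        return RESULT_REJECT;
    if (score < CONFIRMATION_THRESHOLD)
        return RESULT_CONFIRM;
    return RESULT_ACCEPT;
}
```

For example, a score of 449 would be rejected, 450 through 699 would prompt for confirmation, and 700 or higher would be accepted.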