Redmond VAD

Employing Voice Activity Detection

The next area of attention was the LumenVox Speech Engine's voice activity detection (VAD) capabilities. VAD is software that distinguishes between human speech and background sounds in audio.

Voice Activity Detection (VAD)

Voice activity detection is the part of the software that detects speech in an audio signal. It is used in speech recognition to distinguish the speaker's voice from background noise. In the LumenVox Speech Engine, four settings configure VAD in an application:

Volume Sensitivity

[Figure: Volume Sensitivity chart]

Volume Sensitivity: The first thing the Engine has to look at is how loud the speech audio is. This setting establishes thresholds that the audio must exceed in order to be recognized as speech. These thresholds represent an absolute volume level. Think of the signs at a roller coaster that say "You must be this tall to ride..."
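As a rough illustration of an absolute volume threshold, the sketch below measures the root-mean-square level of a frame of audio samples and compares it with a fixed cutoff. The function names, the sample values, and the threshold of 1000 are all hypothetical and are not LumenVox settings or API calls.

```python
import math

def frame_rms(samples):
    """Root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def exceeds_volume_threshold(samples, threshold):
    """True when the frame's absolute level clears the threshold."""
    return frame_rms(samples) >= threshold

# Hypothetical cutoff in 16-bit sample units; not a real Engine default.
VOLUME_THRESHOLD = 1000.0

quiet_frame = [100, -120, 90, -80]
loud_frame = [4000, -4200, 3900, -4100]

quiet_passes = exceeds_volume_threshold(quiet_frame, VOLUME_THRESHOLD)
loud_passes = exceeds_volume_threshold(loud_frame, VOLUME_THRESHOLD)
```

Here the quiet frame stays under the bar and the loud frame clears it, no matter what the background noise is doing. That is what makes the threshold absolute.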

SNR Sensitivity

[Figure: SNR Sensitivity chart]

Signal-to-Noise Ratio (SNR): Next, the Engine needs to know how much louder the speech audio must be than the background noise in order to trigger barge-in. Where volume sensitivity is a threshold based on an absolute value, SNR is based on a relative value: the ratio of the volume of the speech audio to the volume of the background noise.
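To make the relative nature of this check concrete, here is a small sketch that expresses the speech-to-noise ratio in decibels and compares it with a minimum. Expressing the ratio in decibels is a common convention assumed here, not something the Engine is documented to do, and the names and the 15 dB minimum are hypothetical.

```python
import math

def snr_db(speech_rms, noise_rms):
    """Ratio of speech level to noise level, in decibels (amplitude ratio)."""
    if noise_rms <= 0:
        return math.inf
    if speech_rms <= 0:
        return -math.inf
    return 20.0 * math.log10(speech_rms / noise_rms)

def exceeds_snr_threshold(speech_rms, noise_rms, min_snr_db):
    """True when speech is at least min_snr_db louder than the noise."""
    return snr_db(speech_rms, noise_rms) >= min_snr_db

MIN_SNR_DB = 15.0  # hypothetical value, not a real Engine default

# The same speech level measured against two different noise floors.
quiet_room = exceeds_snr_threshold(2000.0, 100.0, MIN_SNR_DB)
noisy_car = exceeds_snr_threshold(2000.0, 1000.0, MIN_SNR_DB)
```

The speech level is identical in both calls. Only the noise floor changes, so the quiet-room case passes (about 26 dB) while the noisy case does not (about 6 dB).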

Wind Back

[Figure: Wind Back chart]

Windback: When audio is being passed to the Engine, it records that audio and listens for speech. There's a chance that the Engine may recognize speech just a bit too late, so the beginning of the first word may be cut off. This setting tells the Engine how far back to rewind the recorded audio to capture the beginning of the word. With too little windback, the beginnings of words can be cut off. With too much, the Engine may pick up background noise and try to decode that instead of speech.
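The trade-off can be sketched with a buffer of labeled frames. The example assumes 10 ms frames and a detector that fires two frames late. The function name, frame size, and millisecond values are hypothetical and chosen only for illustration.

```python
def apply_windback(frames, detected_index, windback_ms, frame_ms=10):
    """Return the frames to decode, rewound windback_ms before detection."""
    windback_frames = windback_ms // frame_ms
    start = max(0, detected_index - windback_frames)
    return frames[start:]

# Five frames of noise, then four frames of speech.
frames = ["noise"] * 5 + ["speech"] * 4
detected_index = 7  # the detector fires two frames into the speech

too_little = apply_windback(frames, detected_index, windback_ms=0)
about_right = apply_windback(frames, detected_index, windback_ms=20)
too_much = apply_windback(frames, detected_index, windback_ms=60)
```

With no windback, the first two speech frames are lost. With 20 ms, all four speech frames are recovered. With 60 ms, four noise frames come along as well.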

End of Speech Delay

[Figure: End of Speech Delay chart]

End of Speech Delay: Finally, this setting determines how much silence must follow speech before the Engine begins processing the audio. Care must be taken with this setting to make sure that the Engine does not cut off the speaker during a pause between words. Conversely, you don't want too much time at the end, which forces the caller to wait for the recognition result.
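A minimal sketch of this behavior, assuming the audio has already been classified frame by frame as speech or silence: the function below reports the frame at which enough trailing silence has accumulated. The function name, the 10 ms frame size, and the delay values are hypothetical.

```python
def find_end_of_speech(is_speech_flags, eos_delay_ms, frame_ms=10):
    """Index of the frame where end of speech is declared, or None."""
    needed = max(1, eos_delay_ms // frame_ms)
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(is_speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

# Two frames of speech, a two-frame pause, two more frames of speech,
# then five frames of trailing silence.
flags = [True, True, False, False, True, True,
         False, False, False, False, False]

short_delay = find_end_of_speech(flags, eos_delay_ms=20)
longer_delay = find_end_of_speech(flags, eos_delay_ms=40)
```

With a 20 ms delay, the pause between words is mistaken for the end of the utterance at frame 3. With 40 ms, the Engine-style check waits through the pause and ends at frame 9, at the cost of a longer wait after the speaker finishes.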

Speech has distinct qualities, such as volume, pitch and duration, which the LumenVox software can listen for. This allows it to tell when someone is speaking and when the audio is just noise, which is invaluable in a noisy car interior.

The first VAD setting that LumenVox developers looked at was volume sensitivity. It sets how loud the signal must be to register as speech rather than noise. In a telephony environment, the Engine's default settings usually work fine. This was not the case in a running vehicle. The sensitivity had to be turned down so that it was not operating with a hair trigger.

Next, the developers looked at the windback setting. When the Engine is listening for speech, it picks up some of the background noise present just before the speech. However, it often cannot tell that someone has started speaking until a fraction of a second into the first word, and only then does it start processing.

When this happens, the word "tres" can often be recorded as "es." This harks back to the "seis" versus "tres" problem. The solution is to tell the Engine to rewind to just half a second before it detected speech, recapturing the beginning of that first word.

By the time the Redmond and LumenVox teams had completed the application tuning with the audio they'd gathered, they came to an intriguing conclusion: they'd had to tune more aggressively for ChevyStar than either company would previously have thought appropriate. It illustrated to them just how challenging it is to work with the acoustics of a car interior.

© 2016 LumenVox, LLC. All rights reserved.