An important part of speech recognition is correctly performing voice activity detection (VAD): determining when a user has begun speaking. The VAD settings can be controlled via LV_SRE_StreamSetParameter in the C API and LVSpeechPort::StreamSetParameter in the C++ API.
We have Recommended Parameters that should act as good defaults for most systems, but two important settings may require more fine-tuning. These are volume sensitivity and SNR sensitivity, which control how easily the Engine triggers barge-in when a caller starts speaking.
The volume sensitivity sets the minimum volume that can trigger barge-in; i.e., audio must be louder than this threshold before the VAD will attempt to recognize it as speech. The parameter takes a value from 1 to 100, but this is not a linear scale.
In the Engine, audio volume is represented on a scale of 0-32,000 (these numbers are internal values only and do not correspond with real world units). The vast majority of audio has a volume level that falls somewhere between 0-16,000 on this scale, and thus a volume sensitivity setting of 100 represents an internal audio value of 16,384.
Because it is more important to have fine control at the lower end of the volume spectrum, where most audio falls, incrementing the volume sensitivity within the 1-50 range changes the volume threshold far less than incrementing it within the 51-100 range. For example, a volume sensitivity setting of 25 lets barge-in begin at a volume of 252, a setting of 50 corresponds to an audio volume of 699, and a setting of 75 corresponds to 3,385. Thus one should take great care when raising the volume sensitivity above 50.
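To make the non-linearity concrete, the short Python sketch below tabulates only the threshold values quoted in this section and prints how much the internal threshold rises between them. The names DOCUMENTED_THRESHOLDS and threshold_steps are illustrative and are not part of the Engine API; settings between the quoted points are deliberately not interpolated, since the exact curve is not given here.

```python
# Setting -> internal volume threshold, taken from the values quoted in the text.
# Intermediate settings are not documented here, so none are estimated.
DOCUMENTED_THRESHOLDS = {25: 252, 50: 699, 75: 3385, 100: 16384}


def threshold_steps(points):
    """Return (from_setting, to_setting, increase) for each consecutive pair."""
    settings = sorted(points)
    return [
        (low, high, points[high] - points[low])
        for low, high in zip(settings, settings[1:])
    ]


for low, high, increase in threshold_steps(DOCUMENTED_THRESHOLDS):
    print(f"{low:>3} -> {high:>3}: threshold rises by {increase}")
```

Each 25-point step raises the threshold by several times more than the step before it, which is why changes above 50 have such a large effect.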
The following graph should give you an idea of how the relationship between volume sensitivity and absolute volume scales:
The SNR (signal-to-noise ratio) sensitivity controls how much louder speech must be than the background noise in order for barge-in to trigger. When the Engine first detects an audio stream, it examines the very start of the audio to measure the volume of the background noise. Any signal (i.e. any sound such as speech) that can trigger barge-in must be significantly louder than this noise.
The signal-to-noise ratio is determined by the following equation:
SNR = 10 * log10(Volume of Sound / Volume of Background Noise)
The volume of sound and the background noise are on the same 0-32,000 scale as used by volume sensitivity.
So if speech is at 8,000 and the background noise is at 2,000, the SNR is 10 * log10(4), or about 6.02.
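The calculation above can be reproduced with a few lines of Python. This is only a sketch of the formula as stated in this section; the function name snr is illustrative and is not an Engine API call, and the rejection of non-positive volumes is an added safeguard because the logarithm is undefined for them.

```python
import math


def snr(sound_volume, noise_volume):
    """Signal-to-noise ratio using the formula given in the text.

    Both volumes are on the Engine's internal 0-32,000 scale.
    """
    if sound_volume <= 0 or noise_volume <= 0:
        # log10 is undefined for zero or negative ratios.
        raise ValueError("volumes must be positive")
    return 10 * math.log10(sound_volume / noise_volume)


# Worked example from the text: speech at 8,000, background noise at 2,000.
print(round(snr(8000, 2000), 2))  # 6.02
```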
Like volume sensitivity and its relationship to volume, the SNR sensitivity setting does not have a linear relationship with the actual SNR. It is also set on a scale of 0-100, with 50 being the default setting, which corresponds to an SNR of 5. A setting of 25 corresponds to an SNR of 4.25, and a setting of 75 corresponds to an SNR of 12.5.
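As a rough illustration of what these settings mean in practice, the Python sketch below checks the earlier example (speech at 8,000 over noise at 2,000) against the three SNR values quoted above. The names are illustrative and not part of the Engine API, and the strict "greater than" comparison is an assumption; the text only says that speech must be louder than the noise by the required ratio.

```python
import math

# Setting -> required SNR, taken from the values quoted in the text.
DOCUMENTED_SNR_THRESHOLDS = {25: 4.25, 50: 5.0, 75: 12.5}


def exceeds_snr_threshold(sound_volume, noise_volume, setting):
    """True if the measured SNR is above the SNR quoted for this setting.

    Assumption: a strict comparison is used; the Engine's exact rule is not
    stated in the text.
    """
    measured = 10 * math.log10(sound_volume / noise_volume)
    return measured > DOCUMENTED_SNR_THRESHOLDS[setting]


def required_volume_ratio(snr_threshold):
    """Invert the SNR formula: how many times louder than the noise a sound must be."""
    return 10 ** (snr_threshold / 10)


for setting in sorted(DOCUMENTED_SNR_THRESHOLDS):
    passes = exceeds_snr_threshold(8000, 2000, setting)
    print(f"setting {setting}: speech at 8,000 over noise at 2,000 passes = {passes}")
```

With these figures, the example speech clears the SNR requirement at settings 25 and 50 but not at 75, where the required SNR of 12.5 is well above the measured 6.02.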
The curve is illustrated in the graph below: