Humans are exceptionally good at speech processing. We handle a wide variety of accents, speaking styles, pitch differences, and background noise with a high degree of accuracy. No two speakers say the same word in exactly the same way, which makes automated speech recognition a considerable challenge. Accurate transcription and careful tuning are therefore essential.
Speech recognition software uses a statistical model of speech to recognize what was said. These models are built during training, in which speech audio and matching text transcriptions are fed to algorithms that 'learn' how speech sounds. The models estimate what an 'average' speaker sounds like when saying particular words, and apply that knowledge to new incoming speech to determine which words were spoken.
Words and speaking styles differ across application domains (e.g. the vocabulary of a travel system is quite different from that of a financial or banking application), so speech recognition applications benefit from acoustic models trained on data from their specific domains.
Transcriptions must be exact, word for word, and include noise tags so the system can learn the differences between noise and speech. The training data should also include as many speakers (both male and female) as possible, so that the new acoustic models reflect the average speaker rather than just one or two particular individuals.
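A transcript that marks noise explicitly might be processed as below. The bracketed tag format (e.g. "[cough]", "[noise]") is hypothetical; actual tag conventions vary by toolkit. The point is that every spoken word and every noise event is recorded, so training can tell the two apart.

```python
import re

# Match either a bracketed noise tag, e.g. "[cough]", or a plain word token.
TOKEN = re.compile(r"\[(\w+)\]|(\S+)")

def split_transcript(line):
    """Separate spoken words from bracketed noise tags."""
    words, noises = [], []
    for noise, word in TOKEN.findall(line):
        if noise:
            noises.append(noise)
        else:
            words.append(word)
    return words, noises

words, noises = split_transcript("i want to [cough] check my balance [noise]")
print(words)   # the spoken words only
print(noises)  # the noise events
```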
The more transcribed audio data available for training, the better the new models will perform. New acoustic models will likely require a new round of speech tuning, particularly of confirmation thresholds.
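Why thresholds need retuning can be seen in a minimal sketch of threshold-based confirmation (the names and values below are illustrative, not an actual LumenVox API): a new acoustic model shifts the distribution of confidence scores, so thresholds tuned for the old model may over- or under-confirm with the new one.

```python
def handle_result(word, confidence, confirm_threshold=0.5, accept_threshold=0.8):
    """Decide how to treat a recognition result based on its confidence score."""
    if confidence >= accept_threshold:
        return f"accept {word}"   # confident: use the result directly
    if confidence >= confirm_threshold:
        return f"confirm {word}"  # middling: ask "Did you say ...?"
    return "reject"               # low confidence: reprompt the caller

print(handle_result("transfer", 0.92))
print(handle_result("transfer", 0.64))
print(handle_result("transfer", 0.31))
```

If the new model systematically scores correct results lower (or higher) than the old one did, the same thresholds will reject good results or skip needed confirmations, which is why tuning is repeated after retraining.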
© 2017 LumenVox, LLC. All rights reserved.