Why is Good Speech Recognition so Hard to Find?

If your organization interacts with customers through speech applications, the quality of your speech recognition technology can make or break your CX.

In an ideal world, communicating with technology via speech would be as easy and natural as conversing with a human, making it simple to access information and services remotely. It would also offer more independence to those who have no option but voice user interfaces, such as young children who aren’t yet literate and people living with visual, motor, or mobility impairments.

While some speech recognition technologies have made great strides toward these ideals, others still fall far below expectations. This raises the question: why do some speech recognition technologies work well while others fail?

The reality is that human speech is complex and constantly changing.

The challenges faced by modern speech recognition tools

An Automatic Speech Recognition (ASR) engine’s job is to take speech and identify it as something meaningful. Some ASRs have transcription capabilities, which allow them to turn that meaning into something useful, like text.  

Getting this right is an incredibly challenging process. First, ASRs must keep pace with the fact that language is constantly changing. In 2021, for instance, Merriam-Webster added 520 new words and definitions to its American English dictionary.

Second, they must navigate the huge amount of variation that occurs within each language itself, including the diversity of accents and dialects among speakers of the same mother tongue. This is a major stumbling block for many speech applications: one study found that 66% of people cite accent or dialect recognition issues as a barrier to voice technology adoption.

Third, ASRs must be able to separate speech from background and environmental noise. This could be the sound of traffic, a busy shopping mall, or even interference caused by a low-quality microphone.

Unfortunately, many ASRs are simply not capable of handling these variables efficiently. 

How to solve these problems

All this considered, companies need to choose their ASR engines carefully when building or modernizing speech-enabled customer experiences. 

There are many different types of ASR engines on the market. Ideally, you want one that:

  • Supports all dialects within a given language
  • Offers advanced artificial intelligence and machine learning capabilities for maximum accuracy
  • Is able to continually learn from real-world usage and expand the language model to serve a more diverse base of users

LumenVox ASR with Transcription: Next-generation speech recognition 

Status-quo speech recognition engines don’t have the machine learning capabilities to manage all the variation in natural human speech—certainly not with the accuracy users expect. This is where LumenVox’s new ASR engine changes the game.

The technology that sets the LumenVox ASR engine apart is its end-to-end Deep Neural Network (DNN) architecture and state-of-the-art natural language processing and understanding capabilities. This creates an ASR engine that serves a much more diverse base of users. 

Whereas other ASR engines treat different dialects as separate languages, LumenVox’s new ASR Engine with Transcription supports multiple dialects with one language model. The model accounts for many different pronunciations within a single language, rather than having to be trained for each individual user. The end-to-end recognizer matches audio to the written word—regardless of accent or other factors that affect pronunciation.

Additionally, no matter where the call or audio is coming from, the LumenVox Speech Recognizer separates speech from background noise using Voice Activity Detection (VAD). This takes a range of qualities into consideration, including energy level (volume), frequency (pitch) and changes in duration, to accurately detect the actual speech.
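To make the energy cue above concrete, here is a minimal, energy-only VAD sketch in Python. It is illustrative only, not LumenVox’s implementation: the frame length and threshold are assumed values, and a production recognizer also weighs frequency (pitch) and duration cues.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in samples) / len(samples)

def detect_speech(audio, frame_len=160, threshold=0.01):
    """Return (start, end) sample ranges whose frames exceed the
    energy threshold, i.e. regions judged to contain speech."""
    regions, start = [], None
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        if frame_energy(audio[i:i + frame_len]) >= threshold:
            if start is None:          # speech begins
                start = i
        elif start is not None:        # speech just ended
            regions.append((start, i))
            start = None
    if start is not None:              # speech ran to end of audio
        regions.append((start, len(audio)))
    return regions
```

For example, feeding in silence followed by a loud burst and more silence returns just the range covering the burst, which a recognizer could then decode while ignoring the rest.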

All this means that your speech solution can serve a more diverse user base, in a broader range of scenarios, with market-leading accuracy.

Improve your speech application success rate with tuning

To get maximum value from your speech applications, LumenVox also offers an advanced tuning tool that does all the heavy lifting for you, making it far easier to manage tuning in-house (and avoid expensive professional service fees).

LumenVox’s Speech Tuner performs transcriptions, instant parameter and grammar-tuning, and version upgrade-testing of any speech application, in less time and with less effort. This way, you can continually enhance speech recognition accuracy and build competitive advantage. 
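Tuning efforts like these are typically measured with word error rate (WER), the standard ASR accuracy metric. As a rough illustration (a generic sketch, not LumenVox’s tooling), WER is the word-level edit distance between a reference transcript and the recognizer’s hypothesis, divided by the reference length:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Re-running a metric like this after each grammar or parameter change shows whether a tuning pass actually improved recognition.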

Looking ahead

While there is room for improvement in the speech recognition technology landscape, the demand for voice-enabled solutions continues to grow. A study by National Public Media found that 52% of voice-assistant users say they use voice tech several times a day or nearly every day, compared to 46% before the pandemic. 

If your company gets speech recognition right, you will be in a strong position to capitalize on this market growth. 

Learn how LumenVox can help you seize the future of speech recognition: request a demo today.
