Video Asterisk Speech Recognition 101 - Part 1



  • In Asterisk Speech Recognition 101, you'll get an introduction to using speech recognition software on the Asterisk PBX. In this first part of the training class, we'll take a look at some of the basic concepts of speech recognition as used in IVRs and other telephony applications. This video describes how speech recognition works, covers some common misconceptions about speech software, and addresses the strengths and weaknesses of the technology.
  • RUNTIME 10:50


Video Transcription

Asterisk Speech Recognition 101 - Part 1

Welcome to Asterisk Speech Recognition 101. This is our video series, where we will guide you through designing and developing your first Asterisk speech application.

So we're going to assume you know how to use Asterisk but almost nothing about speech recognition. So we will go ahead and talk about the basics of speech, how it all works on Asterisk and then we will go through step by step exactly how to build a speech enabled call router where you can call up and talk to people and ask for them by name and get routed to their proper extension.

So we're going to get started by covering the basics of speech recognition before we get into the Asterisk specific stuff. It's important to have a good grounding, because we're going to find that building speech applications — even if you're an expert IVR designer — can be a little tricky since it is not quite the same as the traditional DTMF or TouchTone systems, so it's good to cover some basic ground to start with. The first thing that's important whenever you start using a new technology like speech recognition probably is to consider why you want to use it.

So what are the strengths of speech? The first, and probably the most universally useful, is that we can add more options to our IVR menus. With DTMF you're always going to be limited to 9 options or so. You have the keys on the keypad and maybe the star, the pound, or the zero, but you don't want to have 12 DTMF options at once, right?

It's going to be hard if I have to say 1 is this, 2 is this, 6 is this, star is this. So usually what happens is you limit it and give people 3 or 4 options at a time, and then you make them navigate. So I press 1, then I get a sub menu, then I press 3 and I get to another menu, then I press 2, then I get to another menu, and then I press 1 and finally I get to where I want to be. Well, speech eliminates all of that nested menu, limited option type of stuff, because I can have 25, 50, however many options available at my main menu.

The nice thing is it's easy to remember what I want because I just say what I want because it's natural and it's also nice because I can go right to somewhere deep within the application from the main menu. I don't have to navigate that infinite nested tree right?

So if we have a banking application normally I might have to press 1 to hear account balance, then 2 to pick checking, then 3 to pick wherever. With speech I can just say, Hear checking account balance right from the main menu and then I'm there. It makes it nice for your callers and it keeps the call short which is good for you and your callers.
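To make the contrast concrete, here is a minimal Python sketch of the banking example. The menu layout, phrases, and destination names are all hypothetical and are not taken from any real application or from Asterisk itself. The DTMF version walks a nested tree one key press at a time, while the speech version goes from a single phrase straight to the destination.

```python
# Hypothetical banking menus, for illustration only.
DTMF_MENU = {
    "1": {                      # 1 = account balance
        "1": "savings_balance",
        "2": "checking_balance",
    },
    "2": {                      # 2 = transfers
        "1": "transfer_funds",
    },
}

# With speech, every destination is reachable from the main menu.
SPEECH_MENU = {
    "hear checking account balance": "checking_balance",
    "hear savings account balance": "savings_balance",
    "transfer funds": "transfer_funds",
}


def route_dtmf(keys):
    """Walk the nested menu one key press at a time."""
    node = DTMF_MENU
    for key in keys:
        node = node[key]
    return node


def route_speech(phrase):
    """Look up the destination directly from the spoken phrase."""
    return SPEECH_MENU[phrase.lower()]
```

Here `route_dtmf(["1", "2"])` and `route_speech("Hear checking account balance")` reach the same destination, but the caller gets there with one utterance instead of a series of key presses.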

There are also some sorts of applications you can build with speech which you can't build with DTMF alone. We're going to be building a call router, which is a great example. You can do a call router with DTMF, right? There is one built into Asterisk, but you have to dial a couple of letters of the person's last name, and if it's a near match you have to pick from a list. It's not ideal if you don't know how a person's name is spelled. How's that work for you? It's a pain, right?

Well if I just say I want to talk to whoever, the application routes me there, that's something not easily done on DTMF. Or what if I have an application that's looking for a location and I want a store located in this place, well how can I do that on DTMF? I want to find one in San Diego, California. I'm supposed to enter in the ZIP code, well how many ZIP codes are in San Diego? Lots, right?

Well, if I'm driving around and calling from my cell phone, I don't know what ZIP code I am in. I can just say San Diego, California into a speech application and there we go.

So that's nice as well. These new applications are really nice to build with speech recognition. Another thing companies really like is that speech is more personal. It's engaging to talk to the system instead of just dialing keys. It feels a lot more natural, so you can do some fun stuff with voice talent and build a little more of a personal application. That's one great thing about speech as well.

Now there are a lot of misconceptions that people have when they first go into speech recognition, and the first one is that it is artificial intelligence. It is not. Just because your computer can recognize the words you speak doesn't mean it can have a conversation with you. This is speech recognition. It doesn't mean that it understands English, even though that would be kind of cool. It's just not there yet. So you can't have a conversation with a computer, and you should expect that you're not going to be able to carry on full natural sentences with the speech recognition.

The other thing is a lot of people have used dictation software in the past, so they tend to think that all speech recognition works more or less the same way, and that's not the case. The kind of speech recognition we use here at LumenVox, which is designed for telephony applications and IVRs, is what we call speaker independent software. This means you don't have to train it.

If you've used dictation software, you know you usually have to read a bunch of text first so it can learn how you speak. We don't have to do that with our software. Anyone can call up and use it, which is great because you can't have your callers train the system every time they want to use it. So anyone can use it and it works, but the downside of being speaker independent is that we can't accept full natural speech the way dictation does. Instead we're going to have to constrain what words can be recognized at a given time. We can't just recognize any of the hundreds of thousands of words in a given language, so we're going to have to tell the computer what words we want to recognize, and that's going to limit it.

Also, the technology has improved significantly in the last couple of years. A lot of people have used early speech apps, and they weren't good. Just flat out, they were poorly designed. It took us a while to figure out how to build a good speech application and how to work within the limitations of speech.

So we understand that better now. We know how we can overcome some of these weaknesses that we're going to discuss in a second. The other thing is the pure core technology is much better now. Within the last five or ten years, but especially in the last five, it has become much more accurate and much more robust, so if you've worked with speech in the past and you just didn't like it, give it a try again.

Like all technology it improves and this is still a technology that has really matured over the last couple of years. It's been around for a long time but for a long time it wasn't that good. Now computing power has caught up to the point where it is quite good.

There are still some weaknesses with speech, and it's good to know them because when you start developing, you need to be aware of them so you can account for them. We're going to talk all about how we can deal with that.

First and foremost, recognition is not one hundred percent certain. When you get a DTMF response from a caller, you know for a fact that the caller pressed two. We're never going to know with one hundred percent certainty that the caller said "checking account balance."

What we're going to say is probably the caller said this and we're going to give you some information to let you know how certain we are when we hear it, when we recognize it, but that's just something you have to know going in. We can't guarantee that we heard the caller correctly, that we understood the caller correctly, so that is going to require you to develop the application differently because of that.
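As a rough illustration of developing around uncertain results, here is a minimal Python sketch. The 0 to 1000 score scale, the threshold values, and the function name are assumptions made for this example, not values taken from the video or from any particular engine. The idea is only that the application reacts differently depending on how certain the recognizer says it is.

```python
# Assumed thresholds on a hypothetical 0-1000 confidence scale.
ACCEPT_THRESHOLD = 700   # at or above this, act on the result
CONFIRM_THRESHOLD = 400  # at or above this, ask the caller to confirm


def decide_action(score):
    """Pick the next dialog step based on the recognizer's confidence."""
    if score >= ACCEPT_THRESHOLD:
        return "accept"      # go straight to the requested destination
    if score >= CONFIRM_THRESHOLD:
        return "confirm"     # "I think you said X. Is that right?"
    return "reprompt"        # too uncertain, ask the caller again
```

With DTMF none of this branching is needed, because a key press is either received or not. With speech, some form of confirm and reprompt handling is part of the basic call flow.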

The other big weakness is that speech apps are just harder to build than DTMF applications. They take more time. There are some limitations, like I said you want to make sure you get the recognition right, which means we have to do more error handling, we have to spend more time thinking about the design before we actually begin programming.

So we're again going to talk about how you work with all this, but just know that once you start developing with speech you get benefits and you pay a cost: you have to spend a little more time working on it. One thing in particular you will have to do is what we call tuning. Once you've deployed an application that uses speech recognition, you'll need to figure out how the callers are using it so you can make changes based on that.

When you present callers with these sorts of open-ended prompts that let them say whatever they feel like, you will find very quickly that they start saying all sorts of crazy things and you have to start trying to account for it, they're unpredictable. Because of that, it takes more time after deployment making changes.
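One simple way to picture tuning is counting what callers actually said that the grammar did not cover. The Python sketch below assumes you already have text transcripts of caller utterances (for example, typed up by a person listening to call recordings). The function name and the sample data are hypothetical.

```python
from collections import Counter


def out_of_grammar_counts(transcripts, grammar):
    """Count caller utterances that are not phrases in the grammar.

    Returns (utterance, count) pairs, most frequent first.
    """
    known = {phrase.lower() for phrase in grammar}
    misses = Counter(
        text.strip().lower()
        for text in transcripts
        if text.strip().lower() not in known
    )
    return misses.most_common()
```

If "operator" shows up over and over in the result, that is a hint the grammar and the prompts should probably be changed to handle it.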

OK, so speech recognition sort of in a nutshell works like this: you give us audio files and you give us what we call a grammar, which is a list of words and phrases that we can recognize. The grammar constrains us, we can only recognize the words specified in the active grammars. Now we take the audio and we take the grammar and we compare it to what we call our acoustic model and our acoustic model is this huge collection of data.

We get thousands and thousands of speakers and thousands of hours of audio samples of users saying all sorts of words in a language and we have the computer start learning how sounds are made in that language and what kind of meaning they have based upon all that data. We build that data and it's called the acoustic model which is like a giant database of how a language sounds.

So what's going to happen is we're going to take your audio and your grammar and search our acoustic models to try and match what's in your audio with the words in your grammar. We output what we call recognized text or raw text, and basically it is just what the user said based on our best guess from our acoustic models and your grammars.

So remember that grammars are always going to constrain what we can recognize. If it's not in the grammar, it will never be returned as a possible response.
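The grammar constraint can be shown with a toy Python sketch. This is not how a real recognizer works: a real engine compares audio against an acoustic model, while this example uses plain text similarity from the standard library's `difflib` as a stand-in. The function name and the sample grammar are assumptions for illustration. What it does show is that the output is always one of the grammar phrases plus a score, never anything outside the grammar.

```python
from difflib import SequenceMatcher


def recognize(heard, grammar):
    """Return the closest grammar phrase and a similarity score (0.0-1.0).

    `heard` is a text stand-in for the caller's audio. The result is
    always drawn from `grammar`, whatever the caller actually said.
    """
    def similarity(phrase):
        return SequenceMatcher(None, heard.lower(), phrase.lower()).ratio()

    best = max(grammar, key=similarity)
    return best, similarity(best)
```

Calling `recognize("pizza", ["sales", "support", "billing"])` still returns one of the three grammar phrases, just with a low score, which is why the application has to look at how certain the match is before acting on it.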

With all this said, there are a couple of things to keep in mind. First, don't mimic DTMF call flows. This is probably the biggest thing I can tell you: don't design nested menus. You're using speech, and that's one of your big strengths, right? You don't have to have these endless levels of nested menus, so stay away from that.

Design with the focus that speech can help your customers, make calls faster, and help callers do new kinds of things. That's why you're spending this money on speech recognition, and that's why you're putting in this development time. Don't hamstring yourself by just building a DTMF application that can accept speech input.

You're also going to need to test and tune your application after you put it into deployment, so when you start planning out your project development cycle, keep this in mind. You will have to make changes after deploying it, more so than with DTMF.

Finally, start small if you can. Don't try and take a gigantic application and add speech to it and hope to do it in a short period of time. Instead, what you should do is take a small app and speech enable it, or take a small portion of a big app and speech enable it, so you can learn how speech works before you try and tackle these big tasks.

© 2018 LumenVox, LLC. All rights reserved.