Video Transcription

Speech Application Development - Design Considerations

Hi, I'm Kyle, and in this section we're going to talk about natural language and directed dialog in voice recognition applications.

Natural language is the idea of allowing the caller to speak in a more natural fashion. It is not exactly the opposite of directed dialog, but it typically carries some preconceptions about what you can do with the technology. Directed dialog is simply the idea that you ask the caller a series of questions, gathering one piece of information after another, until you have enough for the caller to reach their goal.

With natural language, you'll find that the complexity depends greatly upon the scope of your natural language interaction. To give you two extremes, on one side you have the idea of "How may I help you?" On the other side, you have a currency interaction where you're going to go ahead and ask the person for a dollar amount.

Both are examples of natural language. In one case the person tells you how they need to be helped, and in the other case the person says "four thousand seven hundred and forty five." Both allow the caller to respond in a natural fashion, but one is much more complex than the other.

What ends up happening is that the more predictable the answers to a question are, the less complex that question is to handle, and in fact that's the key idea here. For natural language to work with traditional speech technologies, you have to predict everything the caller may say: not just what they might talk about, but the exact words they're actually going to say.
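To make that contrast concrete, here is a minimal Python sketch. It is purely illustrative and not tied to LumenVox or any particular speech platform: the currency question is predictable enough that its valid responses can be enumerated ahead of time, while the open question cannot be.

```python
# Illustrative only: a highly predictable "natural language" question, such as
# asking for a dollar amount, has responses that can be enumerated as a grammar.
UNITS = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def sample_dollar_phrases():
    """Enumerate a small slice of the phrases a currency grammar must cover."""
    for thousands in UNITS:
        for hundreds in UNITS:
            yield f"{thousands} thousand {hundreds} hundred dollars"

# An open question like "How may I help you?" has no comparable enumeration:
# to cover it, you would have to predict every phrasing a caller might use.
```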

This ends up making natural language a cost-prohibitive process: it can run into hundreds of thousands of dollars, or multiples of that, and require a long development time, and the result is often a very fragile application in the end anyway.

Now let's talk a little bit about directed dialogs. Directed dialog is going to be your standard tool: even if you were to utilize natural language, for most of your transactions you're still in a more traditional directed dialog state. This is the idea that each question gathers one piece of information, then moves on and progresses the call to the next question, and in this way you achieve your goals.

We do this with prompts and grammars, and the prompts and grammars set up the transaction. The prompt tells the caller what we expect them to say or provide to us. The caller provides that, and the possible responses are represented in the grammar. After that we have our application logic, because speech recognition isn't artificial intelligence.

For every response the caller might give us, we need to have programmed the appropriate action, so that the caller feels we're taking care of them in an appropriate fashion. Those are, in effect, our supporting transactions.
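As a rough illustration of what a single directed dialog transaction looks like, here is a minimal Python sketch. The recognize() function is hypothetical, not a specific vendor API: assume it plays a prompt, listens against a grammar, and returns the matched phrase or None.

```python
# Minimal sketch of one directed dialog transaction. recognize() is a
# hypothetical helper that plays the prompt, listens against the grammar,
# and returns the matched phrase or None.

PROMPT = "Would you like your checking or savings balance?"
GRAMMAR = {"checking", "savings"}   # every phrase we expect the caller to say

def account_menu(recognize):
    result = recognize(PROMPT, GRAMMAR)
    # Application logic: every result the engine can return needs a
    # programmed action, because the recognizer itself makes no decisions.
    if result == "checking":
        return "checking_balance_dialog"
    if result == "savings":
        return "savings_balance_dialog"
    return "error_handling_dialog"   # no-match or no-input falls through here
```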

Now there are some situations we have to take care of that are similar to touch-tone IVR applications, such as when we didn't understand what the caller said. If the person pushed the wrong button in a touch-tone application, we would have to say that it was an invalid response. With speech we handle it with the same kind of idea, but in a little more natural fashion: since we're asking the caller questions, the caller may say something we weren't expecting, or we may not understand them, and then we come back with "I'm sorry, we didn't quite get that. Can you please repeat yourself?"
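A sketch of that re-prompting behavior, under the same assumption of a hypothetical recognize() helper, might look like this:

```python
# Sketch of no-match handling: apologize, re-prompt a limited number of
# times, and give up gracefully if the caller still isn't understood.

MAX_RETRIES = 2

def ask_with_retries(recognize, prompt, grammar):
    """Return the recognized phrase, or None after repeated failures."""
    current_prompt = prompt
    for _ in range(MAX_RETRIES + 1):
        result = recognize(current_prompt, grammar)
        if result is not None:
            return result
        current_prompt = "I'm sorry, we didn't quite get that. " + prompt
    return None   # the calling dialog decides what to do next, e.g. transfer out
```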

Then there are confirmations. This is the idea that we weren't so sure of what the caller said. We're pretty sure they said "main menu," but to be certain you might want to confirm it, and that's what the ASR engine will tell you. As the application developer, at that point you confirm that the caller actually wanted to go to the main menu. Again, for every response we decide the caller made, you have to have application logic to support it.
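Here is one way that confirmation logic could be sketched, assuming the engine returns a confidence score alongside the phrase; the 0-to-1 scale and the threshold value are illustrative assumptions, not engine-specific values.

```python
# Sketch of confidence-based confirmation. The 0-to-1 scale and the 0.80
# threshold are assumptions for illustration only.

CONFIRM_THRESHOLD = 0.80

def maybe_confirm(recognize, phrase, confidence):
    if confidence >= CONFIRM_THRESHOLD:
        return phrase                          # confident enough: accept as-is
    answer = recognize(f"Did you say {phrase}?", {"yes", "no"})
    return phrase if answer == "yes" else None  # None means ask the question again
```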

So let's go ahead and talk a little bit about dialog complexity, and we'll take our city/state example. I may have an application that says, "Please tell me the city and state you will be traveling to," and the caller can respond with "San Diego, California."

Now, this is not too bad of an application dialog. It's pretty simple: I'm going to have one grammar with all my cities and states in it, and a prompt that simply addresses the caller. Now I'm done, right? Well, what happens if the caller says just a state? In fact, I've had cases where the stakeholder says, "Well, I want you to be able to handle it if they just say a state; that should be a valid response."

So I say, OK, if we allow them to say just the state, that means I need to make, or be able to generate, fifty different grammars: if they say the state California, I need a grammar with just the cities in California. Then I need to make fifty different prompts that say, "Which city do you want in California?" so the caller can say the city.

So now I've just added a layer of complexity. Additionally, the stakeholder can come back and say, "Well, if they just say the city, then that should work too." A lot of the time a city name appears in just one state, and if it appears in more than one state, such as Springfield, then you just go ahead and ask them which state they want. At this point I've taken an application dialog that initially took me less than a day to create and turned it into a week-long or two-week-long project, and in this way we can see how, by allowing for some natural language, a dialog's complexity can all of a sudden increase greatly.
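To show where that extra week or two goes, here is a hypothetical sketch of the additional work the state-only path implies: one follow-up prompt and grammar per state. The cities_by_state argument is an assumed mapping of state names to their cities, not data from any real application.

```python
# Sketch of the added complexity: allowing a state-only answer means building
# a follow-up prompt and grammar for each state. cities_by_state is assumed
# to map state names to lists of city names.

def build_state_followups(cities_by_state):
    followups = {}
    for state, cities in cities_by_state.items():
        followups[state] = {
            "prompt": f"Which city do you want in {state}?",
            "grammar": set(cities),
        }
    return followups   # fifty prompt/grammar pairs instead of the original one
```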

And this is why I try to caution people to take care of the 90 to 95% of people who will actually respond in the expected fashion, as opposed to devoting, say, two weeks' worth of work to the 5% who may or may not say only the state. In fact, that raises another question: how many people would actually say only a city or only a state? You don't know until the application is actually released, and you may find that only 3% of people do. Now all of a sudden you've increased your development time by 400 or 500% to service 3% of the people, whom you could have simply re-prompted to get them to say the right thing.

So those are some guidelines for designing your application, to try to help you keep your focus on the amount and scope of work that you want to do. In the next presentation I'll be talking about prompts, and the one after that will be about grammars. Thank you very much.


  • Even more so than in traditional applications, speech requires that you think carefully about the design of an application before you begin your development. This video is about the importance of understanding the scope and complexity of a speech application and how it relates to two ways of getting input from a user: natural language (where a user responds to an open question) and directed dialogue (where a user responds to a series of narrow prompts). The video will arm you with the tools to realistically evaluate how these decisions will affect the size of your project.


  • Video Playtime: 7:10


