Speech Recognition Don'ts
Until now we've focused on the things that should be done: the best practices for building
speech recognition applications. Now we're going to move to the dark side and focus on speech
recognition don'ts. These are the worst practices, if you will, the things you should avoid
whenever you can. We'll look at things you should not do when creating:
- Grammars
- Prompts
- Call Flow
- Application Design
Grammars
Grammar is one of the most challenging aspects of a speech application. It requires some
programming skills, some human psychology, and some statistical evaluation, and if you're new to
speech it's easy to get hung up on grammar design.
- Make grammars too large: Probably the biggest mistake new developers make is building
grammars that are too large. Make your grammar as big as necessary but no bigger. Provide just
enough coverage for the majority of your callers, so that callers who are using the system in a
reasonable manner get recognized and callers who are abusing the system do not. As an example,
there was a company that wanted to include swear words in its grammar, so that if a caller
told the application to "go to hell," it was recognized. We're not entirely sure what
the application was supposed to do with such a request, but there are people out there who call
and curse at the machine for whatever reason. You don't want to try to understand why this goes
on or try to deal with it in your application. What happens if a caller is using the
application correctly, but the machine thinks the caller said to go to hell?
Remember, as you increase the size of your grammar, the chance of misrecognition also increases.
The point is to leave the things that most people do not say out of the grammar. Make
it just big enough to cover everything that a reasonable caller says, and leave the
other junk out.
- Make a grammar too complicated: You do not want to make grammars too
complicated. SRGS (the language used for writing grammars) is very powerful, as is SISR (the
language for adding semantic interpretation), which effectively turns grammars into small
applications. You could put all of your application logic into your grammar
if you really wanted to. However, application logic should generally go inside your application,
not inside the grammar. You're better off keeping the logic within the grammar related to
grammar matters, such as interpretation. You don't want to perform other types of operations
there, because it slows things down when the grammar is handling functions that are better
suited to the application. It is also more complicated to maintain. So keep your grammars as
simple as you can, and only complicate them if you have a very compelling reason to do so.
- Write grammars with infinite loops: This may sound like obvious advice,
but it happens more often than you may think. You can write recursive grammars with rules that
reference themselves, and at times there are reasons to do this. But you don't want these rules
to recurse infinitely. If a caller says something that sends the parser into an infinite loop,
it will take an infinite number of seconds for the speech engine to return anything, and that
is a very long time for a caller to wait. So be very careful when using recursion within a
grammar, and test your grammars in your grammar editor beforehand so you know when you have
these types of problems.
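As a rough illustration of the kind of check a grammar editor might run for the infinite-loop case above, here is a minimal Python sketch. It assumes a toy in-memory grammar (a dictionary of rule names to alternatives, with rule references written as `$name`); this representation, the function name `left_recursive_rules`, and the sample rules are all hypothetical and not part of SRGS or any speech engine's API. It flags rules that can reach themselves before consuming any word, and for simplicity it ignores optional or empty expansions.

```python
def left_recursive_rules(grammar):
    """Return the names of rules that can reach themselves in the leftmost
    position, i.e. before any spoken word has been consumed."""
    # For each rule, collect the rules referenced as the FIRST symbol
    # of any alternative. Rule references start with "$" (assumed notation).
    first_refs = {
        name: {alt[0][1:] for alt in alts if alt and alt[0].startswith("$")}
        for name, alts in grammar.items()
    }

    def reachable(start):
        # Depth-first walk over "first symbol" references.
        seen, stack = set(), list(first_refs.get(start, ()))
        while stack:
            rule = stack.pop()
            if rule not in seen:
                seen.add(rule)
                stack.extend(first_refs.get(rule, ()))
        return seen

    return {name for name in grammar if name in reachable(name)}


# Hypothetical toy grammar: each rule maps to a list of alternatives,
# and each alternative is a list of words or "$rule" references.
grammar = {
    "digits": [["$digit"], ["$digit", "$digits"]],  # consumes a digit before recursing: fine
    "digit": [["one"], ["two"], ["three"]],
    "bad": [["$bad", "please"], ["help"]],          # refers to itself before consuming anything
    "a": [["$b", "now"]],                           # indirect loop: a -> b -> a
    "b": [["$a"], ["stop"]],
}

print(sorted(left_recursive_rules(grammar)))
```

The `digits` rule shows that recursion by itself is not the problem: it always consumes a word before referring back to itself, so it is not reported, while `bad`, `a`, and `b` are.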
Prompts
Prompts are closely related to grammars, because they determine the types of responses you will
receive from the user. Some bad prompts are seen frequently, the most common of which is a
holdover from the days of DTMF.
- Use "Press or Say" prompts. Let's say you have a DTMF application
and you add some speech recognition to it. Because the callers are used to it, you decide to
have the prompts state "Press or say 1 for..." to mimic DTMF. This is
actually a terrible prompt for speech. It's commonly used, so a lot of people are comfortable
with it, but it's simply not good. Saying "one" or "two" instead of pressing
the key is little to no improvement from a caller's perspective. Callers would rather say
"Please give me my balance" than try to remember that "one" is the
equivalent of a balance inquiry. So try to avoid "press or say" prompts. They tend to
become long and unwieldy, and they don't really help anyone. It doesn't make much sense to spend
money on speech just to mimic a DTMF application.
- Write really long prompts. One of the problems with the "press or say"
prompt is that it gets very long very quickly. You don't need to spell out everything to the
caller. Consider a prompt that simply says, "Please tell me what you'd like to do. For instance,
for your checking account balance, say 'Checking account balance.'" You don't necessarily have
to tell the caller that they have to say savings account balance to hear the balance for savings.
You may just tell the caller, "Say checking account balance to hear your checking account
balance, or you can hear your savings account balance, or you can make a withdrawal." In this
prompt, notice you're not continuously saying "say this for this..." The prompt told the caller
how the system works and moved on. Don't make prompts long, because people don't like to listen
to very long prompts. Callers are smart enough to figure out how the IVR works if they're
provided with one or two examples, and it probably works better. This keeps calls from getting
long and callers from getting frustrated and hanging up, which drives the success rate down and
leaves no one happy. Make prompts short, sweet and to the point.
- Force your users to sit through the entirety of every prompt, every time.
We have barge-in for a reason, and we have good start and end of speech detection in our speech
engine software. Don't force your callers to listen to a prompt. This is not generally useful or
helpful. There may be occasions where you have an important prompt that should not be
interrupted, but most of your prompts should allow barge-in. Also, while you're at it, leave
short pauses in the prompts, which let callers know it's all right to speak and interrupt. For
example, "You can access your checking account balance [pause], or you can hear your savings
account balance [pause]." This helps callers feel free to speak, as some callers are hesitant to
interrupt when someone is speaking.
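To make the advice on short prompts and pauses concrete, here is a small Python sketch that assembles a prompt from a handful of options and places a pause marker after each one. The function name `build_prompt`, the `MAX_OPTIONS` cap of three, and the use of the literal text `[pause]` as a marker are assumptions made for illustration; a real platform would have its own way of expressing pauses and enabling barge-in.

```python
MAX_OPTIONS = 3  # assumed cap, chosen only to keep prompts short


def build_prompt(options, pause="[pause]"):
    """Join a few spoken options into one short prompt, with a pause
    marker after each option so callers have a natural place to speak."""
    if not options:
        raise ValueError("at least one option is required")
    if len(options) > MAX_OPTIONS:
        raise ValueError(f"keep prompts short: at most {MAX_OPTIONS} options")
    spoken = [f"{option} {pause}" for option in options]
    return "You can " + ", or you can ".join(spoken) + "."


prompt = build_prompt([
    "access your checking account balance",
    "hear your savings account balance",
])
print(prompt)
```

With the two options shown, this reproduces the example prompt from the text. The cap is not a rule from the article, only one way to keep a prompt from growing into a long menu.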
Call Flow and Application Design
Once prompts and grammars are taken care of, take a look at the overall call flow and design.
- Mimic DTMF call flows. Avoid the nested menu hell generally associated with
DTMF. With speech you can have flat menus with numerous options that are natural and obvious,
and callers don't have to try to remember what each number represents. You're not using DTMF,
so don't make callers sit through giant menus, and don't just shoehorn speech into a DTMF
application. Instead, build nice, natural menus. Rethink your whole call flow and really take
advantage of the powers of speech recognition.
- Forget about error handling. There may not be as many errors with DTMF as
with speech, because it's easier to tell what a caller keyed in. In speech
applications, however, you will need to put in confirmations, such as, "Did you want this, or this?"
Ask the question and the caller will respond with yes or no. Don't overdo it, though: don't make
the caller confirm every response. It becomes very frustrating for the caller to have to
confirm everything. Check confidence scores to help decide when to confirm. If the confidence
score is high, there is no need to confirm the caller's response. Use confirmations where
appropriate and your users won't mind them.
- Pretend the computer is a person. A major problem is that some people don't
realize that speech recognition is not the same as having a conversation. Designers will sometimes
try to build applications that mimic conversations. Don't do this, and especially do not try to
trick your caller into believing they are speaking to a live person. You don't necessarily have to
state that they are speaking to a machine, but don't try to make the application appear human.
These applications don't work well. If a caller is fooled, their responses become unpredictable
and contribute to more errors. Then, once the caller realizes they've been tricked, they will most
likely become angry. It's OK to build an application with personality, but go easy on it. If
a caller knows that they are talking to a computer, they will use short commands and phrases.
Your application will recognize more of what is being said, and your call success will increase.
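The confidence-score check described under error handling above can be sketched as follows. This is a minimal Python illustration, not any engine's API: the threshold values, the 0.0 to 1.0 score range, and the name `next_action` are assumptions, and the third "reprompt" tier for very low scores is an addition beyond what the text describes. Real thresholds would need tuning against your own call data.

```python
HIGH_CONFIDENCE = 0.80  # assumed threshold: accept without confirming
LOW_CONFIDENCE = 0.45   # assumed threshold: below this, ask again


def next_action(confidence):
    """Decide what to do with a recognition result, given a confidence
    score assumed to lie between 0.0 and 1.0."""
    if confidence >= HIGH_CONFIDENCE:
        return "accept"    # high confidence: no need to confirm
    if confidence >= LOW_CONFIDENCE:
        return "confirm"   # uncertain: ask a yes/no confirmation
    return "reprompt"      # very unsure: ask the caller to repeat


for score in (0.93, 0.60, 0.20):
    print(score, next_action(score))
```

Only the middle band triggers a confirmation, which keeps callers from having to confirm every response.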