Localizing Speech Applications - Part 1 Video



  • Moving a speech recognition application from one location to another presents an interesting set of challenges. More so than other applications, speech applications are very sensitive to shifts in location, because speech patterns and languages change with geography. In the first part of our series on localizing speech applications, we talk about some of the basic ideas to consider when localizing speech applications, how speech recognition technology deals with languages, and some considerations for recording new prompts when localizing applications.
  • RUNTIME 11:39


Video Transcription

Localizing Speech Applications - Part 1

Hi, I'm Stephen Keller. We're going to talk about localizing speech applications. This is basically the process where you take an existing speech application and move it from one locale to another where the language has changed. Sometimes these language changes will be dramatic, like moving from a place where people speak English to a place where people predominantly speak Spanish. The move might also be more subtle, like moving within the country to a place where regional dialects are going to change the way people interact with the system. We're going to cover all that, and go through some of the pitfalls that people have when trying to convert an application from one location to another, and talk about some tips and other things to keep in mind.

Localizing Voice Applications

When localizing a speech recognition application, you have to understand that it's going to be a little more complex than a DTMF-only application. With a DTMF-only app, the only thing you're going to be doing is translating prompts and re-recording them, because input doesn't really change that much. "Press 1" still means pressing the 1 key, even if you now call 1 "uno". It's still basically the same method of interacting with the system, so your call flows and menu options remain relatively unchanged. A speech application is more complex, because the input itself is different, and the way that users interact with it is now significantly different. So it's important to realize that when localizing a speech application, you really want to understand how people in this new location are going to interact with the system. They might speak differently, they might use different words to mean the same thing, and they might use different sentence structure. All these things boil down to understanding, "How are my users going to interact with the system?"

Challenges with Speech

The VUI (voice user interface) has its own set of problems:

  • VUIs or speech apps are more sensitive to changes in prompts than DTMF apps. There are pitfalls, like when you ask somebody to confirm something. Let's say a user calls into a call router and asks to talk to "Bob". We say, "Did you say you wanted to talk to Robert Johnson?" Bob and Robert Johnson may be the same person, but the user says "No" because they're being very literal-minded: they said "Bob", not "Robert Johnson". If we ask "Did you mean Robert Johnson?" instead, they will say yes. These little shifts in language can be very subtle, but they can really change how people interact with the system, so it's really important to keep this in mind when you write your new prompts. We'll talk about this when we talk about tuning a little later on.
  • You'll have to rewrite your grammars (lists of words or phrases to be recognized). You'll have to change these because users in the new location will respond differently.
  • You may have to change the acoustic model (the data set that describes to the speech recognition engine how a language or dialect sounds). We have one big data set that represents how American English speakers sound, and a different data set that represents how Australian English speakers sound. So you have to tell us that even though the language is still English, you want us to decode with the Australian English model.
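
The three changes in the list above (prompts, grammars, acoustic model) can be kept together in one per-locale configuration. What follows is a minimal sketch in Python, not any vendor's API: the `LocaleConfig` class, its field names, and the example phrases are all hypothetical. It renders a yes/no grammar in the W3C SRGS XML format, where the `xml:lang` attribute carries the language tag. Whether a particular engine uses that tag to pick its acoustic model is an assumption to check against that engine's documentation.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape, quoteattr


@dataclass
class LocaleConfig:
    """Hypothetical bundle of everything that changes when an app is localized."""
    language_tag: str        # e.g. "en-US", "en-AU", "es-MX"
    confirm_prompt: str      # prompt text to be re-recorded for this locale
    yes_phrases: List[str]   # what callers in this locale say to agree
    no_phrases: List[str]    # what callers in this locale say to disagree


def build_yes_no_grammar(config: LocaleConfig) -> str:
    """Render the locale's yes/no phrases as an SRGS XML grammar."""
    def items(phrases):
        return "\n".join(f"      <item>{escape(p)}</item>" for p in phrases)

    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"\n'
        f'         xml:lang={quoteattr(config.language_tag)} root="yesno" mode="voice">\n'
        '  <rule id="yesno" scope="public">\n'
        '    <one-of>\n'
        f'{items(config.yes_phrases + config.no_phrases)}\n'
        '    </one-of>\n'
        '  </rule>\n'
        '</grammar>\n'
    )


# Illustrative values only; real phrase lists come from studying local callers.
CONFIGS = {
    "en-US": LocaleConfig("en-US", "Did you mean {name}?",
                          ["yes", "yeah", "correct"], ["no", "nope"]),
    "en-AU": LocaleConfig("en-AU", "Did you mean {name}?",
                          ["yes", "yeah", "yep"], ["no", "nah"]),
    "es-MX": LocaleConfig("es-MX", "¿Quiso decir {name}?",
                          ["sí", "correcto"], ["no"]),
}

if __name__ == "__main__":
    print(build_yes_no_grammar(CONFIGS["en-AU"]))
```

Keeping the prompt text, the grammar phrases, and the language tag in one record makes it harder to localize one of them and forget the others.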

How Speech Recognition Adapts to Different Languages

I mentioned we have these things called acoustic models. Basically, the way speech recognition works is that we're going to take audio from your caller, and we're going to take the grammar that you gave us (a list of words and phrases that you want to have recognized at this prompt), as well as some meaning or logic behind how you want to use them. What we're going to do with those two things is pull up an acoustic model, a big set of data that describes the sounds in the language. It tells us "these waveforms correspond to these sounds in the language". That's how we know how to take your written grammar and turn it into the sounds we expect to hear. So we use that along with your audio to decode what's being said.
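
The flow described above (audio plus grammar, compared through a model of the language's sounds) can be illustrated with a toy sketch in Python. This is not how a real engine works internally: a real engine scores audio against statistical models, whereas this sketch assumes the audio has already been turned into a list of sound symbols and uses `difflib.SequenceMatcher` from the standard library as a stand-in for scoring. The lexicon entries, sound symbols, and function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical pronunciation dictionary: word -> sequence of sound symbols.
LEXICON = {
    "sales": ["S", "EY", "L", "Z"],
    "support": ["S", "AH", "P", "AO", "R", "T"],
    "billing": ["B", "IH", "L", "IH", "NG"],
}


def phrase_to_sounds(phrase):
    """Turn a written grammar phrase into the sounds we expect to hear."""
    sounds = []
    for word in phrase.lower().split():
        sounds.extend(LEXICON[word])
    return sounds


def recognize(heard_sounds, grammar):
    """Return (best_phrase, score) for the grammar phrase closest to the input."""
    best_phrase, best_score = None, 0.0
    for phrase in grammar:
        expected = phrase_to_sounds(phrase)
        score = SequenceMatcher(None, heard_sounds, expected).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase, best_score


if __name__ == "__main__":
    grammar = ["sales", "support", "billing"]
    # A caller who drops the "R" in "support", as some dialects do.
    heard = ["S", "AH", "P", "AO", "T"]
    print(recognize(heard, grammar))
```

The point of the sketch is the dependency it exposes: the same grammar can match well or badly depending on whether the expected sounds reflect how local callers actually speak.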

With acoustic models, we have these big dictionaries where we say, here are 100,000 English words, and this is how they're pronounced, because obviously the spelling of words in a language doesn't always reflect how they're pronounced. This is especially true in English. Sometimes you'll give us a word that isn't in our dictionary. A good example would be a name that's from another language. Maybe you have a Vietnamese name that is pronounced in a way we don't know about. We won't be able to find it in the dictionary, so we'll have to rely on some basic spelling rules to figure out how it will be pronounced. You can actually add in custom pronunciations to tell us exactly how this word should be pronounced.
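
The lookup order described above (custom pronunciation first, then the built-in dictionary, then a guess from spelling) might be sketched as follows in Python. Everything here is an assumption made for illustration: the dictionary contents, the sound symbols, the deliberately crude one-letter-one-sound rules, and the `pronounce` function are hypothetical and do not reflect any engine's actual data or API.

```python
# Hypothetical built-in dictionary: word -> sound symbols.
BUILT_IN_DICTIONARY = {
    "keller": ["K", "EH", "L", "ER"],
    "robert": ["R", "AA", "B", "ER", "T"],
}

# Deliberately crude spelling rules: one sound per letter.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K", "y": "Y", "z": "Z",
}


def guess_from_spelling(word):
    """Fallback: guess sounds letter by letter, which is often wrong."""
    return [LETTER_SOUNDS[ch] for ch in word.lower() if ch in LETTER_SOUNDS]


def pronounce(word, custom=None):
    """Return (sounds, source), preferring custom entries over the dictionary."""
    key = word.lower()
    if custom and key in custom:
        return custom[key], "custom"
    if key in BUILT_IN_DICTIONARY:
        return BUILT_IN_DICTIONARY[key], "dictionary"
    return guess_from_spelling(key), "spelling rules"


if __name__ == "__main__":
    # Without help, the name is guessed letter by letter.
    print(pronounce("Nguyen"))
    # A custom entry (one common English approximation) overrides the guess.
    print(pronounce("Nguyen", custom={"nguyen": ["W", "IH", "N"]}))
```

The letter-by-letter guess for a name like this is far from what callers will say, which is exactly the case where a custom pronunciation earns its keep.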


Like I said before, prompts will often have very subtle but important influences on how users will react and respond to the system. A good speech application prompt, at its most basic, is designed to elicit a predictable set of responses from callers. We say that we're going to give out these prompts, and we should have a good idea of what you're going to say back to us. We don't want to give a vague prompt that could elicit a million different responses. You've probably heard us talk about directed dialogue, directing users to give us a good set of responses. That's a good speech prompt. You'll want to create specific prompts to get specific responses.

One thing you'll want to be careful about in prompts is accents. A lot of countries have a sort of standard accent. By standard, I don't necessarily mean the accent that everybody speaks with, but the one that is generally accepted. For instance, in the US, if you listen to the news you'll notice that people tend to speak with a kind of West Coast, maybe Midwest kind of accent. To most American speakers, this is a neutral-sounding accent. I happen to live in Southern California, so my accent matches that of most newscasters. For a lot of countries, the standard or default accent is what newscasters use. That is a good standard for what to use in prompts. Now if I have an application that is in California and I move it to Texas, I don't necessarily want to start giving all my prompts a Texan accent. Users can find this patronizing or weird, because we expect "official" applications to have these standard, newscaster-like accents. I recommend using the accent that sounds most neutral to speakers of that language.

Be careful when you record the prompts, because subtle things in the manner of speech can affect how people will respond. A good example is the tone or inflection. You don't want to sound too threatening or demanding, because users can get nervous and become flustered, or give incorrect responses, or get frustrated. You want your application to be calm and reassuring, and a little friendly. You want your customers to have a good experience, so make sure the prompts aren't just monotone, computer-like speech.

You also want to be careful about the rate of speech. If you speak too quickly, people can become nervous and flustered. What can be tricky is that different users will find different rates of speech acceptable. Generally, users in cities or more urban areas speak more quickly than users in more rural settings. This is true not only in America, but generally throughout the world. So if you're going to move an application from, say, New York City to someplace rural, you might want to consider slowing down the rate of speech in your prompts because your users will find that much more comfortable.

In the next part, we're going to talk more about some of these ideas, and about tuning an application, and the differences between dialects within a language.

© 2016 LumenVox, LLC. All rights reserved.