How to Effectively Capture Digit Strings

(JAN 2008) — One of the most common types of input to capture using speech recognition is a string of digits: this can be a credit card number, a PIN, a Social Security number, a serial number, or any other sort of long sequence of digits.

Capturing digits using speech recognition presents some interesting problems. The LumenVox Speech Engine has high accuracy when capturing a single digit — our tests show that the Speech Engine will correctly recognize a single spoken digit with an average accuracy of more than 97 percent. But when capturing long strings of digits, the accuracy goes down.

This difficulty becomes clear if you think of recognizing each digit as an independent event. If each event has even a slight chance of failing, then when you put several of these events in a series, the chance that at least one of them fails grows with the length of the series. Most speech application developers consider a string with even a single incorrect digit to be a failure, and thus run into problems when trying to recognize long strings.
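The effect can be estimated with a short calculation. The sketch below assumes every digit is recognized independently with the same 97 percent accuracy. That is a simplification for illustration, not a measured property of the Engine, and the function name is hypothetical.

```python
def string_accuracy(per_digit_accuracy: float, length: int) -> float:
    """Probability that every digit in a string is correct, assuming each
    digit is an independent event with the same accuracy."""
    return per_digit_accuracy ** length


# Print the estimated whole-string accuracy for a few string lengths.
for n in (1, 4, 9, 16):
    print(f"{n:2d} digits: {string_accuracy(0.97, n):.1%}")
```

Under these assumptions, a 16-digit string comes back entirely correct only about 61 percent of the time, even though each individual digit is right 97 percent of the time.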

Fortunately, there are several ways to address this difficulty. With good grammar and application design, speech developers can mitigate the problems and increase their success rates when capturing long strings of digits.

Use Confidence Scores Wisely

Most speech application developers are familiar with confidence scores, the numerical representation of how likely it is that a result returned by the Speech Engine is correct. Normally, the score that gets checked is the one for the entire utterance. For a digit string, this means that you get a single confidence score for the entire interaction.

A handy feature of SISR is the ability to check the confidence score for every single word in an utterance. This allows you to check the confidence score for every single digit in a string. Often it is the case that in a long string, all of the digits but one will have high confidence scores — maybe there was some loud background noise when one digit was spoken, for instance, causing recognition difficulty.

By checking the confidence score for every single digit, you can flag digits that have low confidence scores. This is particularly useful if you are doing a database lookup on the string. For instance, if you were to replace every low–scoring digit with a wildcard character, you could query the database for any strings that match the specified pattern. The application could then ask the user to confirm which of the results was correct.
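One way to sketch the lookup step is with SQL's LIKE operator, which treats an underscore as a wildcard for exactly one character. The table name, column name, and sample account numbers below are hypothetical, and an in-memory SQLite database stands in for whatever database the application actually uses.

```python
import sqlite3


def find_candidates(conn, pattern):
    """Return stored numbers matching a pattern in which '_' stands for
    any single character (the SQL LIKE one-character wildcard)."""
    rows = conn.execute(
        "SELECT number FROM accounts WHERE number LIKE ? ORDER BY number",
        (pattern,),
    )
    return [row[0] for row in rows]


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (number TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("4159",), ("4169",), ("4259",), ("7159",)],
)

# The third digit had a low confidence score, so it was masked.
candidates = find_candidates(conn, "41_9")
print(candidates)
```

Here the pattern matches two stored numbers, so the application would ask the caller which of the two is correct.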

What follows is an example digits grammar that uses the meta.score object in SISR to check the confidence score for each digit. If the confidence score is lower than 450, that digit will be replaced by the underscore wildcard character. If this replacement happens more than the number of times specified in the max_wild variable (by default this is 5), then the entire string will be nulled. A detailed explanation of how this grammar works is available in our Advanced SI Script training video.

root $Final;

$Final = {out = ''} $Digits {out = rules.latest()};

$Digit = (one {out = '1'} | two {out = '2'} | three {out = '3'} | 
four {out = '4'} | five {out = '5'} | six {out = '6'} | seven {out = '7'} | 
eight {out = '8'} | nine {out = '9'} | (zero | oh) {out = '0'});

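$Digits = {!{
    // Reset the output and the wildcard counter for this rule
    out = '';
    var curr_wild = 0;
    var max_wild = 5;
}!}
( $Digit
    {!{
    // Keep the digit if its confidence is high enough; otherwise mask it
    if (meta.Digit.score > 450) {
        out += rules.Digit;
    } else {
        out += '_';
        curr_wild++;
    }
    }!}
) <1->
{!{
    // Too many masked digits: return an empty string
    if (curr_wild > max_wild) out = '';
}!};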
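To see what this grammar is described as returning, the tag logic can be mirrored outside the Engine. The following is a minimal sketch in Python, not LumenVox code: the function name and the sample scores are hypothetical, while the threshold of 450 and the wildcard limit of 5 are taken from the description above.

```python
def mask_low_confidence(digits, scores, threshold=450, max_wild=5):
    """Replace each digit whose score is not above the threshold with '_'.
    Return an empty string if more than max_wild digits were masked."""
    out = ""
    wild = 0
    for digit, score in zip(digits, scores):
        if score > threshold:
            out += digit
        else:
            out += "_"
            wild += 1
    return "" if wild > max_wild else out


# One noisy digit out of four: only that digit is masked.
print(mask_low_confidence("4159", [900, 880, 300, 910]))

# Six low-scoring digits exceed the limit of five, so the result is empty.
print(repr(mask_low_confidence("123456", [100] * 6)))
```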

A more detailed explanation of this sort of application, one that is able to process digit strings using partial matches, is available in our whitepaper titled Leveraging Information to Increase Call Completion in Speech Recognition Applications.

Constrain Your Grammars

An important way to increase accuracy is to constrain your grammars so that they disallow invalid or very unlikely utterances. The above grammar will accept a digit string of any length. If you want to collect strings of a fixed length, like 16-digit credit card numbers, you would benefit from constraining the grammar to only accept 16-digit numbers.

To do this, you would just change the <1-> operator, which allows one or more matches with no upper limit, to <16>, which restricts it to exactly 16 matches of the $Digit rule.

Restricting grammars in this manner will improve accuracy, as it tells the Engine exactly how many digits to listen for. The one potential downside is that if your users say too many or too few digits, the Engine will not be able to return this information to your application. Often, however, the increased accuracy for the majority of cases is worth the tradeoff.
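As a rough analogy outside the grammar itself, the difference between an open-ended repeat and a fixed repeat count resembles the difference between the two regular expressions below. This is only an illustration in Python of which strings each form admits. It is not how the Engine applies the constraint, and the names are hypothetical.

```python
import re

ANY_LENGTH = re.compile(r"[0-9]+")     # analogous to the <1-> repeat
EXACTLY_16 = re.compile(r"[0-9]{16}")  # analogous to the <16> repeat


def accepts(pattern, digits):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(digits) is not None


short = "4" * 15
full = "4" * 16

# The open-ended form admits both strings; the fixed form admits only one.
print(accepts(ANY_LENGTH, short), accepts(ANY_LENGTH, full))
print(accepts(EXACTLY_16, short), accepts(EXACTLY_16, full))
```

The fixed form rules out every 15-digit or 17-digit hypothesis, which is also why a caller who actually says 15 digits cannot be reported as such.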

Tune Your Application

You should also ensure that your grammars reflect how your users will speak. The above digits grammar will only capture single digits spoken in a string, e.g. "One four nine five." It will not properly handle a user saying "Fourteen ninety–five."

It is important that you tune your application, making changes to it based on how people actually use it. If you find that users are regularly speaking natural numbers instead of pure digits, you will need to change your grammars to reflect that.
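To get a feel for what a grammar must account for once callers speak natural numbers, here is a small Python sketch that maps both speaking styles to the same digit string. The function name is hypothetical, and so is the rule that a tens word followed by a nonzero digit forms one two-digit group. A real application would have to decide, for example, whether "twenty one" means "21" or "201".

```python
ONES = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
        "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
        "nine": "9"}
TEENS = {"ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
         "fourteen": "14", "fifteen": "15", "sixteen": "16",
         "seventeen": "17", "eighteen": "18", "nineteen": "19"}
TENS = {"twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
        "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9"}


def spoken_to_digits(utterance):
    """Convert spoken digits and two-digit natural numbers to a digit string."""
    words = utterance.lower().replace("-", " ").split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in ONES:
            out.append(ONES[word])
        elif word in TEENS:
            out.append(TEENS[word])
        elif word in TENS:
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in ONES and ONES[nxt] != "0":
                # Assumed rule: "ninety five" is one group, "95".
                out.append(TENS[word] + ONES[nxt])
                i += 1
            else:
                out.append(TENS[word] + "0")
        else:
            raise ValueError(f"unrecognized word: {word!r}")
        i += 1
    return "".join(out)


print(spoken_to_digits("one four nine five"))
print(spoken_to_digits("fourteen ninety-five"))
```

Both utterances yield the same string, which is the behavior a tuned grammar would need to reproduce.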

You may also be able to use prompts to train users to only say pure digits. This is especially true if you have a specialized application that will be used by a dedicated set of regular users. In this case you can even try using our digits–only acoustic models. These models are only capable of recognizing pure digits, but they do so with greater accuracy than our generic acoustic models for a given language. Our Engine help document includes information on working with languages and acoustic models.

Avoid Ambiguity

Any input to a grammar should only have one valid parse (the Grammar Editor tool, included with the LumenVox Speech Tuner, can show how many parses a grammar returns for a given input).

The more parses a grammar has, the longer it takes the Engine to decode an utterance against it. It also decreases accuracy. As the number of valid parses increases, decode time can increase dramatically.

Consider a grammar for two-digit numbers whose root rule ends with two optional $OnesDigit rule matches. It correctly handles inputs such as "two one" or "twenty one." But if a caller says just "one," there are two valid parses, because either of the optional $OnesDigit matches can consume the word. Each parse has a different interpretation: the first $OnesDigit match multiplies the interpretation by 10, returning a result of 10, while the second one returns a result of 1.

This sort of ambiguity not only increases decode time while decreasing accuracy, it also makes it harder for your application to correctly handle results. You would probably not expect a caller saying "one" to return a result of "10," but that is precisely one thing this grammar allows for.
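The ambiguity can be reproduced with a toy parse enumerator. The Python sketch below assumes a hypothetical root rule with two optional slots: the first takes a tens word or a ones digit multiplied by 10, and the second takes a ones digit. It is a reconstruction for illustration only, not the actual grammar or the Engine's parser.

```python
ONES = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"ten": 10, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def parses(utterance):
    """List the interpretation of every valid parse of the utterance."""
    words = utterance.split()
    results = []
    # Each optional slot consumes either zero words or one word.
    for first_len in (0, 1):
        for second_len in (0, 1):
            if first_len + second_len != len(words):
                continue
            value = 0
            rest = words
            if first_len:
                word = rest[0]
                if word in TENS:
                    value += TENS[word]
                elif word in ONES:
                    value += ONES[word] * 10  # first slot multiplies by 10
                else:
                    continue
                rest = rest[1:]
            if second_len:
                word = rest[0]
                if word not in ONES:
                    continue
                value += ONES[word]
            results.append(value)
    return results


print(parses("twenty one"))  # a single parse
print(parses("one"))         # two parses with different interpretations
```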

Keep Rules Compact

The larger and more complex rules are, the longer it takes to compile a grammar or decode against it. One good trick to keeping rules short is to combine rules with common words, where possible. For instance, the following rule:

$name = James Anderson | Jim Anderson | Jimmy Anderson | James | Jim | Jimmy;

Can be combined into:

$name = (James | Jim | Jimmy) [Anderson];

While it is a relatively small savings for one rule, across large grammars this sort of compactness can add up, decreasing load and decode times.
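A quick way to convince yourself that the two forms accept the same phrases is to expand the compact form and compare. This Python sketch is only an illustration, and the helper name is hypothetical.

```python
from itertools import product

# The six alternatives spelled out in the long form of the rule.
long_form = {
    "James Anderson", "Jim Anderson", "Jimmy Anderson",
    "James", "Jim", "Jimmy",
}


def expand(first_names, optional_surname):
    """Expand (A | B | C) [surname] into the set of phrases it accepts."""
    return {
        " ".join(part for part in (first, last) if part)
        for first, last in product(first_names, ("", optional_surname))
    }


compact_form = expand(["James", "Jim", "Jimmy"], "Anderson")
print(compact_form == long_form)
```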

Prune Unwanted Parses

You would obviously not want to allow the example above, where "one" has a valid parse that returns as "ten." But even allowing "one" to be a valid parse is quite possibly a bad idea if all you want to capture are two-digit strings.

The grammar allows for other parses such as "twenty zero" (it returns with an interpretation of "20"), or "ten two" (returning with an interpretation of "12"). Even a null input is a valid parse ("" returns with an interpretation of "0").

Unwanted parses slow down decodes and reduce accuracy. It's pretty unlikely that a caller would ever say "twenty zero" or that a developer would want to allow for that sort of input. Accounting for these sorts of unlikely cases increases the probability that a caller behaving appropriately will be misrecognized — e.g. somebody who says "twenty two" might get mistaken for the unreasonable "twenty zero."

Be Careful with Recursion

SRGS allows for recursive rules, that is, rules with references to themselves. Any time you work with recursion, you must be careful to avoid infinite loops. Since the LumenVox grammar parser parses from left to right, you should always avoid left recursion.

For instance, the following rule will match the word "foo" any number of times:

$rule = foo ($rule | $NULL);

If the input is "foo foo," the Engine parses the rule, expanding the reference $rule each time until it matches $NULL and terminates. On the other hand, if your rule is written:

$rule = ($rule | $NULL) foo;

The parser will get caught in an infinite loop. The first thing it will attempt to do is to expand the $rule reference, only to expand it again, and again, ad infinitum.
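The same behavior shows up in a naive recursive-descent matcher. The Python sketch below is an analogy rather than the LumenVox parser: the function names are hypothetical, and Python halts the runaway recursion with a RecursionError where another parser might loop or exhaust its stack.

```python
def match_right(tokens, i=0):
    """$rule = foo ($rule | $NULL): consume a token, then recurse."""
    if i >= len(tokens) or tokens[i] != "foo":
        return False
    if i + 1 == len(tokens):  # $NULL branch: nothing left to match
        return True
    return match_right(tokens, i + 1)


def match_left(tokens, i=0):
    """$rule = ($rule | $NULL) foo: recurse before consuming anything."""
    # The position i never advances before the recursive call,
    # so every call immediately makes an identical call.
    return match_left(tokens, i) or (i < len(tokens) and tokens[i] == "foo")


def terminates(matcher, tokens):
    """Report whether the matcher finishes without runaway recursion."""
    try:
        matcher(tokens)
        return True
    except RecursionError:
        return False


print(match_right(["foo", "foo"]))
print(terminates(match_right, ["foo", "foo"]))
print(terminates(match_left, ["foo", "foo"]))
```

The right-recursive form consumes "foo" before each recursive step, so the remaining input shrinks and the recursion ends. The left-recursive form never consumes anything before recursing.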

© 2018 LumenVox, LLC. All rights reserved.