Asterisk Speech Recognition 101, part 3

VIDEO DESCRIPTION

  • In the conclusion to our Asterisk Speech Recognition 101 training class, we go over building SRGS grammars, the heart of any Asterisk speech application. This video will cover writing a simple grammar to recognize the words "yes" or "no." It also describes using SISR, the standard method for adding semantic interpretation and simple logic to SRGS grammars.
  • RUNTIME 14:30

Video Transcription

Asterisk Speech Recognition 101, part 3

Welcome to Part 3 of Asterisk Speech Recognition 101. In our last video, we talked about the Asterisk speech recognition interface. In order to use that interface, we first have to design grammars.

Grammar Basics

Grammars are files that contain the lists of words and phrases to be recognized; they provide the rules and constraints for the speech engine. Smaller grammars tend to give better accuracy and shorter decode times, because they create a smaller "search space," which makes it more likely the engine finds the correct answer. Provide just enough coverage for the majority of your callers; don't include obscure phrases that people have little chance of saying. A classic example is swearing: if you put curse words into your grammar, callers may be misrecognized as swearing when they really aren't, and they may be offended if the system accuses them of it.

Writing Grammars

The grammar spec (called SRGS) describes two formats:

  • Augmented Backus-Naur Form (ABNF): a plain-text form that tends to be easier for humans to read and write, and shorter, but less structured
  • GrXML: an XML form that is more verbose, but friendlier to machines

Grammars consist of a list of rules, and each rule contains the tokens that the rule matches. A token is essentially anything the engine can match and turn into a phonetic equivalent: a word, a series of words, or raw sounds (phonemes).

Every grammar declares a special rule called the root rule, which must be matched for the grammar as a whole to be matched.

Example ABNF Grammar

#ABNF 1.0 UTF-8;
language en-US;
mode voice;
root $yesno;
$yesno = $yes | $no;
$yes = yes [please] | yeah;
$no = no [thanks] | nope;
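For comparison, the same yes/no grammar in the XML form (GrXML) might look like the following. This version is a sketch for illustration and was not shown in the video:

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="yesno"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="yesno">
    <one-of>
      <item><ruleref uri="#yes"/></item>
      <item><ruleref uri="#no"/></item>
    </one-of>
  </rule>
  <rule id="yes">
    <one-of>
      <item>yes <item repeat="0-1">please</item></item>
      <item>yeah</item>
    </one-of>
  </rule>
  <rule id="no">
    <one-of>
      <item>no <item repeat="0-1">thanks</item></item>
      <item>nope</item>
    </one-of>
  </rule>
</grammar>

Note how each ABNF rule becomes a <rule> element, each alternative becomes an <item> inside a <one-of>, and optional words use repeat="0-1".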

Semantic Interpretation

By default, a speech engine returns the raw text (or a parse tree) that matched the grammar. That's fine for simple grammars, but it doesn't tell us very much. For instance, there are a lot of different ways to say "yes." If I say "yup," the engine returns "yup"; if I say "yes please," it returns "yes please," when really both mean the same thing. What I would prefer is what the caller meant, not what they actually said, so the meaning has to be derived from the caller's input. This is called semantic interpretation. One place you can do it is in your application, but since semantic interpretation must be done somewhere for most applications, it makes sense to keep it in the grammars themselves (encapsulation).
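The application-side approach mentioned above might look like the following sketch. This is a hypothetical Python example, not from the video; the phrase lists mirror the yes/no grammar shown earlier:

```python
# Hypothetical sketch: deriving meaning from the raw recognition text in
# the application layer, instead of inside the grammar itself.
YES_PHRASES = {"yes", "yes please", "yeah"}
NO_PHRASES = {"no", "no thanks", "nope"}

def interpret(raw_text):
    """Map what the caller said to what the caller meant."""
    normalized = raw_text.strip().lower()
    if normalized in YES_PHRASES:
        return "yes"
    if normalized in NO_PHRASES:
        return "no"
    return None  # out-of-grammar input: treat as a no-match

print(interpret("Yes please"))  # -> yes
print(interpret("nope"))        # -> no
```

The drawback is that every application using the grammar has to duplicate these phrase lists and keep them in sync with the grammar file, which is exactly why keeping the interpretation inside the grammar is preferable.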

SISR

Semantic Interpretation for Speech Recognition (SISR) is the standard method of adding semantic interpretation to grammars. SI tags are placed inside grammar rules; the tags contain snippets of ECMAScript (JavaScript) that are executed when the rule is matched, effectively turning each rule into a function.

Example ABNF Grammar with SISR

#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <semantics/1.0>;
root $yesno;
$yesno = ($yes | $no) {out=rules.latest()};
$yes = (yes [please] | yeah) {out="yes"};
$no = (no [thanks] | nope) {out="no"};

In the next part, we'll talk about writing speech applications on Asterisk, as well as dial plan functions and their uses.

© 2016 LumenVox, LLC. All rights reserved.