
Customizing CPA and Tone Detection

Reference Number: AA-02276

There are a few customizations you can make to both the CPA and tone detection grammars in order to change the behavior of the recognition. For most use cases, little to no customization should be needed. 


Enabling or Disabling Individual Tone Detection

Within tone detection, there are three different classes of tones that can be detected: answering machine tones, fax tones, or Special Information Tones (SIT). A tone detection grammar can enable or disable the detection of any of these three types of tones: 

<meta name="FAX_CUSTOM_ENABLE" content="true"/>
<meta name="SIT_CUSTOM_ENABLE" content="true"/> 

Simply set AMD_CUSTOM_ENABLE, FAX_CUSTOM_ENABLE, or SIT_CUSTOM_ENABLE to "true" or "false" to turn that type of tone detection on or off. 
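
For example, to use the grammar purely for answering machine detection, you could leave AMD enabled and switch off fax and SIT detection (an illustrative configuration, not a recommendation):

<meta name="AMD_CUSTOM_ENABLE" content="true"/>
<meta name="FAX_CUSTOM_ENABLE" content="false"/>
<meta name="SIT_CUSTOM_ENABLE" content="false"/>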


Changing CPA Timeouts

CPA has four possible results: human residence, human business, unknown speech (likely a machine), or unknown silence. CPA returns one of these based on how long it detects human speech (for the first three) or how long it detects silence. Three timeout values, specified as meta tags in the CPA grammar, control this behavior: 

<meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="1800"/>
<meta name="STREAM|CPA_HUMAN_BUSINESS_TIME" content="3000"/>
<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="5000"/> 

The timeout values are all in milliseconds.  

If speech lasts less than the value of CPA_HUMAN_RESIDENCE_TIME (default 1800 milliseconds), then human residence is returned. If speech lasts longer than that but less than CPA_HUMAN_BUSINESS_TIME, then human business is returned. If the speech is longer than CPA_HUMAN_BUSINESS_TIME, then unknown speech is returned. There is no separate timeout for unknown speech, since it is anything longer than human business. 

CPA_UNKNOWN_SILENCE_TIMEOUT controls how long CPA will wait for speech before returning unknown silence.

All of these may also be set in the LumenVox client_property.conf configuration file. Values in the grammar will override the defaults specified in the configuration file.
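
As an illustration, a grammar that needs to classify calls more quickly might shorten these thresholds (the values below are purely illustrative and should be tuned for your own traffic):

<meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="1500"/>
<meta name="STREAM|CPA_HUMAN_BUSINESS_TIME" content="2500"/>
<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="4000"/>

With these settings, a greeting of up to 1.5 seconds would be reported as human residence, one lasting between 1.5 and 2.5 seconds as human business, and anything longer as unknown speech; 4 seconds without speech would produce unknown silence. Because the tags appear in the grammar, they would override any defaults from client_property.conf.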


Modifying the End-Of-Speech Delay

The end-of-speech setting is used with CPA to determine how much silence is needed after speech to trigger processing of the audio. A larger value essentially allows longer pauses between words. This value defaults to 1200 ms and, with CPA, can only be set in the grammar file. To change the value, add the following line to the CPA grammar, placing the desired setting in milliseconds in the content attribute (3000 ms in the example):

<meta name="STREAM|VAD_EOS_DELAY" content="3000"/> 

This value should only be modified with an understanding of the trade-offs: setting it too low will cause more false positive "HUMAN" responses, while setting it too high will cause CPA responses to take longer to return.


In versions 16.0.200 and earlier, there is a bug that requires two meta tags in the CPA grammar in order to set VAD_EOS_DELAY:

<meta name="STREAM|VAD_EOS_DELAY" content="3000"/>
<meta name="STREAM|PARM_VAD_EOS_DELAY" content="3000"/> 


Customizing the Return

As mentioned above, the default return from a tone detection or CPA grammar is one of the strings listed. However, a speech recognition result (which CPA/tone detection mirrors) actually has two discrete components: 

  • The input sentence (also known as the "raw text"). This is the actual word or phrase spoken by the user.
  • The semantic interpretation. This is usually a formatted result that contains the meaning of the input sentence. 

For example, imagine a user says "One two three" with a digits grammar. The ASR will perform a recognition, using the grammar, in order to produce the raw text "ONE TWO THREE." It will then parse the grammar's semantic tags, using the input "ONE TWO THREE" to get a semantic interpretation. In this case, it will transform the phrase "ONE TWO THREE" into a string of digits: "123." This is useful for a number of reasons with speech recognition. Generally speaking, the interpretation is what the voice application will make use of.
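
As a rough sketch of how the semantic tags drive that transformation (this trivial rule is illustrative, not the actual digits grammar), the raw text matched by an item is mapped to an interpretation by its tag:

<rule id="root" scope="public">
    <item>ONE TWO THREE<tag>out="123"</tag></item>
</rule>

Here the raw text would be "ONE TWO THREE" and the semantic interpretation would be "123".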

CPA/tone detection goes through a similar process because LumenVox mimics ASR behavior in order to be more compatible with voice platforms that expect both raw text and interpretation. There is no actual raw text in a tone detection, but LumenVox acts as though there were. The grammars allow you to customize both the raw text and the semantic interpretation that is returned for a CPA or tone detection. 

The exact mechanism makes a little more sense if you are familiar with the syntax of SRGS grammars, but let's revisit part of the CPA grammar: 

    <meta name="HUMAN_RESIDENCE_CUSTOM_INPUT_TEXT" content="HUMAN RESIDENCE"/>
    <meta name="HUMAN_BUSINESS_CUSTOM_INPUT_TEXT"  content="HUMAN BUSINESS"/>
    <meta name="UNKNOWN_SPEECH_CUSTOM_INPUT_TEXT"  content="UNKNOWN SPEECH"/>
    <meta name="UNKNOWN_SILENCE_CUSTOM_INPUT_TEXT" content="UNKNOWN SILENCE"/>

The meta tags whose name include "CUSTOM_INPUT_TEXT" control what the input text gets set to if that event is returned. For instance, if the CPA detects human business, then the input text will be set to the value of HUMAN_BUSINESS_CUSTOM_INPUT_TEXT (by default this is HUMAN BUSINESS). 

   <rule id="root" scope="public">
       <one-of>
        <item>HUMAN RESIDENCE<tag>out="HUMAN RESIDENCE"</tag></item>
        <item>HUMAN BUSINESS<tag>out="HUMAN BUSINESS"</tag></item>
        <item>UNKNOWN SPEECH<tag>out="UNKNOWN SPEECH"</tag></item>
        <item>UNKNOWN SILENCE<tag>out="UNKNOWN SILENCE"</tag></item>
     </one-of>
    </rule> 

The semantic interpretation is then defined by the rule shown above. If the input text is HUMAN BUSINESS, then that item's semantic tag is executed, setting the semantic interpretation to HUMAN BUSINESS. Thus, if you ever alter the custom input text, you must also alter the grammar rule, or you will not get appropriate results. 
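
For example, if you wanted the raw text for a business answer to be BUSINESS instead (purely an illustrative change), the meta tag and the matching rule item would have to be edited together:

<meta name="HUMAN_BUSINESS_CUSTOM_INPUT_TEXT" content="BUSINESS"/>

<item>BUSINESS<tag>out="BUSINESS"</tag></item>

If only the meta tag were changed, the input text would no longer line up with the rule and, as noted above, you would not get appropriate results.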

For CPA, there is little reason to change the input text or the interpretation. For the tone detection (AMD) grammar, it may make a little more sense. Let's revisit that grammar:

<meta name="AMD_CUSTOM_INPUT_TEXT" content="AMD"/>
<meta name="FAX_CUSTOM_INPUT_TEXT" content="FAX"/> 

<meta name="SIT_REORDER_LOCAL_CUSTOM_INPUT_TEXT" content="SIT REORDER LOCAL"/>
<meta name="SIT_VACANT_CODE_CUSTOM_INPUT_TEXT" content="SIT VACANT CODE"/>
<meta name="SIT_NO_CIRCUIT_LOCAL_CUSTOM_INPUT_TEXT" content="SIT NO CIRCUIT LOCAL"/>
<meta name="SIT_INTERCEPT_CUSTOM_INPUT_TEXT" content="SIT INTERCEPT"/>
<meta name="SIT_REORDER_DISTANT_CUSTOM_INPUT_TEXT" content="SIT REORDER DISTANT"/>
<meta name="SIT_NO_CIRCUIT_DISTANT_CUSTOM_INPUT_TEXT" content="SIT NO CIRCUIT DISTANT"/>
<meta name="SIT_OTHER_CUSTOM_INPUT_TEXT" content="SIT OTHER"/> 

<rule id="root" scope="public">
    <one-of>
<item>AMD<tag>out="AMD"</tag>
</item>
<item>FAX<tag>out="FAX"</tag>
</item>
<item>SIT REORDER LOCAL<tag>out="SIT"</tag>
</item>
<item>SIT VACANT CODE<tag>out="SIT"</tag>
</item>
<item>SIT NO CIRCUIT LOCAL<tag>out="SIT"</tag>
</item>
<item>SIT INTERCEPT<tag>out="SIT"</tag>
</item>
<item>SIT REORDER DISTANT<tag>out="SIT"</tag>
</item>
<item>SIT NO CIRCUIT DISTANT<tag>out="SIT"</tag>
</item> 
<item>SIT OTHER<tag>out="SIT"</tag>
</item>
    </one-of>
</rule> 

Notice here that all of the SIT detection events return the same interpretation, SIT. If an application wanted to treat SIT intercept differently from the other SIT events, then the interpretation for that item would need to be edited.
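
For instance, to report intercept tones separately (an illustrative edit), only that item's tag would need to change:

<item>SIT INTERCEPT<tag>out="SIT INTERCEPT"</tag></item>

The remaining SIT items would still return SIT, so application logic that only checks for SIT would continue to work for them while the intercept case could be handled on its own.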