
Customizing CPA and Tone Detection

Reference Number: AA-02276

A few custom settings can be applied to the Call Progress Analysis (CPA) and Answering Machine Tone Detection (AMD) algorithms to change their behavior. These are controlled using special "CPA" and "AMD" grammars, which signal that these features are needed in the current session or request.

We provide built-in grammars for both CPA and AMD. In most cases you can use these directly, or you can copy them and apply your own custom changes to the copy. These grammars are loaded and activated much like other ASR grammars, which allows platforms with no knowledge of CPA or AMD to use these advanced features.


For most use cases, little or no customization should be needed for CPA and AMD to perform optimally. We have used advanced statistical analysis of many calls in order to determine the most appropriate settings for general use cases.

Enabling or Disabling Individual Tone Detection

Within tone detection, there are three different classes of tones that can be detected:

  • answering machine tones (AMD),
  • fax tones (FAX),
  • or Special Information Tones (SIT).

A tone detection grammar can enable or disable the detection of any of these three types of tones: 

<meta name="AMD_CUSTOM_ENABLE" content="true"/>
<meta name="FAX_CUSTOM_ENABLE" content="true"/>
<meta name="SIT_CUSTOM_ENABLE" content="true"/> 

Simply set your AMD_CUSTOM_ENABLE, FAX_CUSTOM_ENABLE, or SIT_CUSTOM_ENABLE entries in the grammar to "true" or "false" to turn that type of tone detection on or off. 

Changing CPA Timeouts

CPA has four possible results:

  • human residence,
  • human business,
  • unknown speech (likely machine),
  • or unknown silence (also likely a machine).

CPA returns one of these based on how long the detected human speech lasts (for the first three) or on whether no speech at all (silence) is detected. Three timeout settings, specified as meta tags in the CPA grammar, control this:

<meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="1800"/>
<meta name="STREAM|CPA_HUMAN_BUSINESS_TIME" content="3000"/>
<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="5000"/> 

These values are all in milliseconds (e.g. 1800 ms, or 1.8 seconds, is the default value for CPA_HUMAN_RESIDENCE_TIME).

If the detected speech lasts less than the value of CPA_HUMAN_RESIDENCE_TIME (default 1800 milliseconds), then human residence is returned. This covers the short utterances typical of residential or mobile phones answered by a human in a non-business setting (e.g. "Hello, this is John").

If the detected speech lasts longer than CPA_HUMAN_RESIDENCE_TIME but less than CPA_HUMAN_BUSINESS_TIME, then human business is returned. This represents the typically longer utterances of humans answering a business line (e.g. "This is LumenVox, how may I direct your call today?").

If the speech lasts longer than CPA_HUMAN_BUSINESS_TIME, then unknown speech is returned. This may be a recorded message or some other automated playback, so it is designated UNKNOWN SPEECH, and call flows often handle it as though it were a machine. There is no timeout for unknown speech, since it is anything longer than human business. Note that once the detected speech exceeds CPA_HUMAN_BUSINESS_TIME, the result is returned without waiting for the speech to end, allowing your call flow to decide how it would like to proceed.

CPA_UNKNOWN_SILENCE_TIMEOUT controls how long CPA waits for the start of speech before returning unknown silence. Once the start of human speech has been detected (before CPA_UNKNOWN_SILENCE_TIMEOUT expires), one of the other results will be returned. UNKNOWN SILENCE is returned only if no human speech is detected before this timeout is reached.
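
The decision rule described above can be summarized in a short Python sketch. This illustrates the documented thresholds only and is not LumenVox's implementation: the function name classify_cpa and its arguments are hypothetical. This article does not say how a duration exactly equal to a threshold is treated, so the boundary handling below is an assumption.

```python
from typing import Optional

# Values taken from the example meta tags above, in milliseconds.
CPA_HUMAN_RESIDENCE_TIME = 1800
CPA_HUMAN_BUSINESS_TIME = 3000


def classify_cpa(speech_ms: Optional[int],
                 residence_ms: int = CPA_HUMAN_RESIDENCE_TIME,
                 business_ms: int = CPA_HUMAN_BUSINESS_TIME) -> str:
    """Map a detected speech duration to a CPA result (hypothetical helper).

    speech_ms is None when no speech started before
    CPA_UNKNOWN_SILENCE_TIMEOUT expired.
    """
    if speech_ms is None:
        return "UNKNOWN SILENCE"
    if speech_ms < residence_ms:
        return "HUMAN RESIDENCE"
    # Assumption: a duration exactly equal to a threshold
    # falls into the longer class.
    if speech_ms < business_ms:
        return "HUMAN BUSINESS"
    return "UNKNOWN SPEECH"


print(classify_cpa(900))
print(classify_cpa(2500))
print(classify_cpa(4000))
print(classify_cpa(None))
```

Raising CPA_HUMAN_BUSINESS_TIME in this sketch widens the band classified as human business, at the cost of waiting longer before unknown speech can be reported.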

The default values for these settings are set in the LumenVox client_property.conf configuration file. Any values specified within the grammar (such as those shown above) override the defaults from the configuration file.
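
The override behavior can be pictured as a simple merge. It is sketched here in Python using only the standard library. The dictionary stands in for values that would come from client_property.conf, the helper name effective_settings is hypothetical, and the real engine's parsing of meta names (including the "STREAM|" prefix) may differ from this simplification.

```python
import xml.etree.ElementTree as ET

# Stand-in for defaults read from client_property.conf (assumption:
# these are simply the example values shown in this article).
CONFIG_DEFAULTS = {
    "CPA_HUMAN_RESIDENCE_TIME": 1800,
    "CPA_HUMAN_BUSINESS_TIME": 3000,
    "CPA_UNKNOWN_SILENCE_TIMEOUT": 5000,
}


def effective_settings(grammar_xml: str, defaults: dict) -> dict:
    """Return a copy of defaults, overridden by matching grammar meta tags."""
    settings = dict(defaults)
    root = ET.fromstring(grammar_xml)
    for element in root.iter():
        # Compare the local tag name so a namespaced grammar also works.
        if element.tag.rsplit("}", 1)[-1] != "meta":
            continue
        name = element.get("name", "")
        # Drop the "STREAM|" prefix seen in the meta names above.
        key = name.split("|", 1)[-1]
        if key in settings:
            settings[key] = int(element.get("content"))
    return settings


# Minimal illustrative grammar fragment, not a complete CPA grammar.
grammar = """<grammar>
  <meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="1500"/>
</grammar>"""

print(effective_settings(grammar, CONFIG_DEFAULTS))
```

Only the setting named in the grammar changes. The other two keep their configured defaults.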


Note that when performing CPA, it is important to make sure that BARGE_IN_TIMEOUT is not reached, since this would return a No-Input-Timeout event, which may be unexpected in the context of CPA. If no human speech is detected in the call, you want CPA_UNKNOWN_SILENCE_TIMEOUT to be reached first, so that the more understandable UNKNOWN SILENCE result is returned and can be handled predictably within your call flow.
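
A deployment script could guard against this misconfiguration with a one-line check, sketched below in Python. The function name is hypothetical, the timeout values are illustrative only, and the sketch assumes both timers start counting at the same moment, which this article does not confirm.

```python
def silence_timeout_fires_first(barge_in_timeout_ms: int,
                                unknown_silence_timeout_ms: int) -> bool:
    """True when CPA_UNKNOWN_SILENCE_TIMEOUT expires before BARGE_IN_TIMEOUT."""
    return unknown_silence_timeout_ms < barge_in_timeout_ms


# Illustrative values only; check your own configuration.
print(silence_timeout_fires_first(barge_in_timeout_ms=8000,
                                  unknown_silence_timeout_ms=5000))
```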

Modifying the End-Of-Speech Delay

The Voice Activity Detection End-Of-Speech setting (VAD_EOS_DELAY) is used with CPA to determine how much silence must follow detected human speech before the audio is processed and a response is generated. A larger value allows longer pauses between words, while a smaller value may trigger end of speech too soon if someone is merely pausing between words.

This value defaults to 1200 ms, as defined in the client_property.conf settings file. For CPA, it can only be changed in the grammar file.

To change the value, add the following line to your CPA grammar, placing the desired setting in milliseconds in the content attribute (900 ms in this example):

<meta name="STREAM|VAD_EOS_DELAY" content="900"/> 

Modify this value only with an understanding of the trade-offs: setting it too low will cause more false-positive "HUMAN" responses, while setting it too high will cause CPA responses to take longer to return.
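
The trade-off can be seen in a small Python sketch of an end-of-speech rule. This is a simplified model, not the actual VAD algorithm: the helper end_of_speech_ms is hypothetical, and the segment list is made-up data representing speech, a one-second pause, more speech, then a long silence.

```python
from typing import List, Optional, Tuple

# (is_speech, duration_ms) segments, beginning at the start of speech.
SEGMENTS: List[Tuple[bool, int]] = [
    (True, 700),    # first words
    (False, 1000),  # pause between words
    (True, 900),    # more words
    (False, 2000),  # trailing silence
]


def end_of_speech_ms(segments: List[Tuple[bool, int]],
                     eos_delay_ms: int) -> Optional[int]:
    """Return elapsed ms at which end of speech is declared, or None.

    End of speech is declared once a single silence segment has lasted
    eos_delay_ms (a simplification of real voice activity detection).
    """
    elapsed = 0
    for is_speech, duration in segments:
        if not is_speech and duration >= eos_delay_ms:
            return elapsed + eos_delay_ms
        elapsed += duration
    return None


print(end_of_speech_ms(SEGMENTS, 900))   # pause is treated as end of speech
print(end_of_speech_ms(SEGMENTS, 1200))  # pause is tolerated
```

With the shorter delay, the mid-utterance pause ends the measurement early, so the utterance looks shorter than it really was. With the longer delay, the full utterance is captured but the result arrives later.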

In versions 16.0.200 and earlier, a bug requires two meta tags in the CPA grammar in order to set VAD_EOS_DELAY:

<meta name="STREAM|VAD_EOS_DELAY" content="900"/>
<meta name="STREAM|PARM_VAD_EOS_DELAY" content="900"/> 

Customizing the Return

As mentioned above, the default return from a tone detection or CPA grammar is one of the strings listed. However, a speech recognition result (which CPA / tone detection mirrors) actually has two discrete components:

  • The input sentence (also known as the "raw text"). This is usually the actual word or phrase spoken by the user.
  • The semantic interpretation. This is usually a formatted result that contains the meaning of the input sentence. 

For example, imagine a user says "One two three" with a digits grammar. The ASR performs a recognition using the grammar to produce the Raw Text (otherwise known as Input Text) "ONE TWO THREE". It then parses the grammar's semantic tags against that input to get a Semantic Interpretation; for example, it may transform the phrase "ONE TWO THREE" into the string of digits "123". Generally speaking, the Semantic Interpretation is what the voice application will make use of.

CPA / tone detection goes through a similar process because LumenVox mimics ASR behavior in order to be more compatible with voice platforms that expect both Raw Text and Semantic Interpretation to be provided in results. There is no actual raw text in a tone detection, but LumenVox acts as though there were. The grammars allow you to customize both the Raw Text and the Semantic Interpretation that is returned for a CPA or tone detection. 

The exact mechanism makes a little more sense if you are familiar with the syntax of SRGS grammars, but let's revisit part of the CPA grammar: 


Meta tags whose names include "CUSTOM_INPUT_TEXT" control what the Raw Text (sometimes called Input Text) gets set to if that event is returned. For instance, if the CPA detects human business, then the input text will be set to the value of HUMAN_BUSINESS_CUSTOM_INPUT_TEXT (by default this is HUMAN BUSINESS). 

   <rule id="root" scope="public">
        <item>HUMAN RESIDENCE<tag>out="HUMAN RESIDENCE"</tag></item>
        <item>HUMAN BUSINESS<tag>out="HUMAN BUSINESS"</tag></item>
        <item>UNKNOWN SPEECH<tag>out="UNKNOWN SPEECH"</tag></item>
        <item>UNKNOWN SILENCE<tag>out="UNKNOWN SILENCE"</tag></item>

The semantic interpretation is then defined within the grammar, as in the section shown above. If the Input Text is HUMAN BUSINESS, the corresponding semantic tag is executed, setting the Semantic Interpretation to HUMAN BUSINESS, and both are returned in the result. Thus, if you ever alter the custom input text, you must also alter the matching item in the grammar rule, or you will not get appropriate results.
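
To see why the two must stay in sync, the rule can be modeled as a lookup from input text to interpretation. The Python sketch below is an analogy only: the dictionary mirrors the items shown above, while the interpret helper and the "LIVE BUSINESS" text are hypothetical.

```python
from typing import Optional

# Mirrors the rule items shown above: input text -> value set by the tag.
RULE_ITEMS = {
    "HUMAN RESIDENCE": "HUMAN RESIDENCE",
    "HUMAN BUSINESS": "HUMAN BUSINESS",
    "UNKNOWN SPEECH": "UNKNOWN SPEECH",
    "UNKNOWN SILENCE": "UNKNOWN SILENCE",
}


def interpret(input_text: str, rule_items: dict) -> Optional[str]:
    """Return the interpretation for input_text, or None if no item matches."""
    return rule_items.get(input_text)


# The default input text matches an item, so an interpretation is produced.
print(interpret("HUMAN BUSINESS", RULE_ITEMS))

# A customized input text with no matching item yields no interpretation.
print(interpret("LIVE BUSINESS", RULE_ITEMS))
```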

For CPA, there is little reason to change the input text or the interpretation. For AMD it may make a little more sense. Let's revisit that grammar:

<meta name="AMD_CUSTOM_INPUT_TEXT" content="AMD"/>
<meta name="FAX_CUSTOM_INPUT_TEXT" content="FAX"/> 

<meta name="SIT_OTHER_CUSTOM_INPUT_TEXT" content="SIT OTHER"/> 

<rule id="root" scope="public">
    <item>SIT REORDER LOCAL<tag>out="SIT"</tag></item>
    <item>SIT VACANT CODE<tag>out="SIT"</tag></item>
    <item>SIT NO CIRCUIT LOCAL<tag>out="SIT"</tag></item>
    <item>SIT INTERCEPT<tag>out="SIT"</tag></item>
    <item>SIT REORDER DISTANT<tag>out="SIT"</tag></item>
    <item>SIT NO CIRCUIT DISTANT<tag>out="SIT"</tag></item>
    <item>SIT OTHER<tag>out="SIT"</tag></item>

Notice that all of the SIT (Special Information Tone) detection events return the same interpretation, SIT. If an application needs to treat the SIT intercept differently from the other SIT events, the interpretation for that item would need to be edited, as shown in this example:

    <item>SIT INTERCEPT<tag>out="INTERCEPT"</tag></item>

The other tones can be modified in the same way, should your application require the distinction.