
Customizing CPA and Tone Detection

Reference Number: AA-02276

There are a few custom settings that can be applied to both the Call Progress Analysis (CPA) and Answering Machine Detection (AMD) algorithms in order to change their behavior. These are controlled using special "CPA" and "AMD" grammars, which signal that these features are needed in the current session or request.

Prior to version 19.1, we provided builtin grammars for both CPA and AMD. Moving forward, we recommend using variants of the updated grammars we provide and describe in our Grammars in CPA and AMD article. In the majority of cases, you can use these sample grammars directly, or you can make your own copy and apply your own custom changes. These grammars can be loaded and activated much like other ASR grammars, allowing platforms without any knowledge of either CPA or AMD to use these advanced features seamlessly.


CPA / AMD Builtin Grammars Deprecated

As mentioned above, we have deprecated support for the builtin CPA and AMD grammars in favor of updated grammars that add more functionality and are clearer and easier to use. The special builtin grammars for CPA and AMD will continue to work, but for backward compatibility only. All users are encouraged to use the new grammars described in the Grammars in CPA and AMD article.



Enabling or Disabling Individual Tone Detection


Within tone detection, there are four different classes of tones that can be detected:

  • answering machine tones (AMD)
  • fax tones (FAX)
  • Special Information Tones (SIT)
  • or busy tones (BUSY)

A tone detection grammar can enable or disable the detection of any of these four types of tones:

<meta name="AMD_CUSTOM_ENABLE" content="true"/>
<meta name="FAX_CUSTOM_ENABLE" content="true"/>
<meta name="SIT_CUSTOM_ENABLE" content="true"/> 
<meta name="BUSY_CUSTOM_ENABLE" content="false"/> 

Simply set the AMD_CUSTOM_ENABLE, FAX_CUSTOM_ENABLE, SIT_CUSTOM_ENABLE, or BUSY_CUSTOM_ENABLE entry in the grammar to "true" or "false" to turn that type of tone detection on or off.
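For example, an outbound notification application that only needs to distinguish answering machines might keep AMD detection enabled and turn the other tone classes off. The snippet below is a sketch of that configuration using the same meta tags shown above; the values are illustrative, not defaults.

<meta name="AMD_CUSTOM_ENABLE" content="true"/>
<meta name="FAX_CUSTOM_ENABLE" content="false"/>
<meta name="SIT_CUSTOM_ENABLE" content="false"/>
<meta name="BUSY_CUSTOM_ENABLE" content="false"/>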

BUSY Tone processing added

Note that the busy tone detection algorithm was added in version 19.1, so it is only available in that release or newer. Please upgrade your product if you need this functionality. By default, BUSY_CUSTOM_ENABLE is set to "false" (disabled) to preserve backward compatibility with existing applications.



Changing CPA Timeouts


CPA has four possible results:

  • human residence,
  • human business,
  • unknown speech (likely machine),
  • or unknown silence (also likely a machine).

CPA returns one of these based on how long it detects human speech (for the first three) or whether no speech (silence) is detected at all. Three timeout settings, specified as meta tags in the CPA grammar, control this behavior:

<meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="1800"/>
<meta name="STREAM|CPA_HUMAN_BUSINESS_TIME" content="3000"/>
<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="5000"/> 

These setting values are all expressed in milliseconds (e.g. 1800 ms, or 1.8 seconds, is the default value for CPA_HUMAN_RESIDENCE_TIME).

If speech is detected that lasts for less than the value of CPA_HUMAN_RESIDENCE_TIME (default 1800 milliseconds), then the HUMAN RESIDENCE prediction result is returned. This covers short utterances typical of residential or mobile phones answered by a human in a non-business setting (e.g. "Hello, this is John").

If the length of speech detected is greater than CPA_HUMAN_RESIDENCE_TIME but less than CPA_HUMAN_BUSINESS_TIME, then the HUMAN BUSINESS prediction result is returned. This represents the typically longer utterances of humans answering a business line (e.g. "This is LumenVox, how may I direct your call today?").

If the speech is longer than CPA_HUMAN_BUSINESS_TIME, then the UNKNOWN SPEECH prediction result is returned. This may be a recorded message or some other automated playback, which is why it is designated as UNKNOWN SPEECH; it is often handled as though it were a machine in call flows. There is no timeout for unknown speech, since it is anything longer than human business. Note that as soon as the detected speech exceeds CPA_HUMAN_BUSINESS_TIME, the UNKNOWN SPEECH result is returned (it does NOT wait for all of the speech to end), allowing your call flow to decide how it would like to proceed.

Note that if CPA_HUMAN_RESIDENCE_TIME and CPA_HUMAN_BUSINESS_TIME are set to the same value, then HUMAN BUSINESS will never be returned: the duration of speech will either be below both settings, in which case HUMAN RESIDENCE is returned, or above them, in which case UNKNOWN SPEECH is returned. This configuration may be useful for applications that do not wish to distinguish between Residence and Business classifications, and instead only wish to process human versus machine responses. Note that similar behavior can be achieved by setting the appropriate Input Text and Semantic Interpretation values for both of these classifications in the grammar, providing developers maximum flexibility.
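As a sketch of that first approach, using the same meta tags shown earlier, a grammar could set both human timings to the same value so that only HUMAN RESIDENCE, UNKNOWN SPEECH, or UNKNOWN SILENCE can be returned (the 2500 ms value here is purely illustrative):

<meta name="STREAM|CPA_HUMAN_RESIDENCE_TIME" content="2500"/>
<meta name="STREAM|CPA_HUMAN_BUSINESS_TIME" content="2500"/>
<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="5000"/>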

CPA_UNKNOWN_SILENCE_TIMEOUT controls how long CPA will wait for the start of speech to be detected before returning UNKNOWN SILENCE. Once the start of human speech has been detected (prior to CPA_UNKNOWN_SILENCE_TIMEOUT being reached), one of the other results will be returned; UNKNOWN SILENCE is returned only if no human speech is detected before this timeout is reached.
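For example, an application that expects called parties to take longer to start speaking could extend the silence window using the same meta tag shown earlier (the 7000 ms value is illustrative, not a recommendation):

<meta name="STREAM|CPA_UNKNOWN_SILENCE_TIMEOUT" content="7000"/>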


UNKNOWN_SILENCE_TIMEOUT vs BARGE_IN_TIMEOUT or END_OF_SPEECH_TIMEOUT

Note that when performing CPA, it is important to make sure that BARGE_IN_TIMEOUT is not reached, since this would return a No-Input-Timeout event, which may be unexpected in the context of CPA. If there is no human speech detected in the call, you instead want the UNKNOWN_SILENCE_TIMEOUT to be reached first, thus returning the more understandable UNKNOWN SILENCE result, which can be handled more predictably within your call flow.

In version 19.1 and later, an internal change forces the BARGE_IN_TIMEOUT setting to be greater than the longest CPA time setting, ensuring it is never reached. You therefore only need to make this change in earlier versions, if you are unable to upgrade.


Also worth noting is the END_OF_SPEECH_TIMEOUT setting (the maximum duration of speech after human speech is first detected), which should not be confused with the VAD_EOS_DELAY setting (the amount of silence following speech) described below. END_OF_SPEECH_TIMEOUT should also never be reached when working with CPA interactions. Another change introduced in version 19.1 forces the value of this setting to be 10,000 ms (10 seconds) greater than the largest CPA time specified, ensuring it is never reached, so only users working with earlier versions need to apply this setting manually. All CPA users are encouraged to upgrade to the latest release in order to benefit from some significant performance improvements.
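For users who remain on a pre-19.1 release, this means raising both timeouts by hand. The sketch below assumes these two settings can also be supplied as STREAM| meta tags in the CPA grammar, following the same convention as the CPA timings above; verify the exact names and mechanism against your release (or set the equivalents in your configuration file instead). The values simply follow the rules described above, relative to the 5000 ms CPA_UNKNOWN_SILENCE_TIMEOUT default.

<!-- Assumed meta tag names; confirm against your release before relying on them -->
<meta name="STREAM|BARGE_IN_TIMEOUT" content="10000"/>
<meta name="STREAM|END_OF_SPEECH_TIMEOUT" content="15000"/>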

The default values for these CPA timing settings are applied in the client_property.conf configuration file. Any values (such as those shown above) detected within the grammar will override any defaults specified in the configuration file.


CPA_MAX_TIME_FROM_CONNECT


One of the most significant changes to CPA introduced in version 19.1 is the CPA_MAX_TIME_FROM_CONNECT setting. Be sure to read a more complete description of this setting and how it can be used in our dedicated CPA_MAX_TIME_FROM_CONNECT article.

In short, this setting overrides and automatically scales the above CPA timing values based on the maximum amount of time your application is willing to wait for a response. It can be used in most applications and has the benefit of being a single setting to apply, which makes configuration much less complex.
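As a sketch of what that single setting might look like inside a CPA grammar, assuming it follows the same STREAM| meta-tag convention as the other CPA timings in this article (see the dedicated article above for authoritative usage), an application willing to wait at most six seconds for a CPA decision could specify:

<meta name="STREAM|CPA_MAX_TIME_FROM_CONNECT" content="6000"/>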


Modifying the End-Of-Speech Delay


The Voice Activity Detection End-Of-Speech setting (also known as VAD_EOS_DELAY) is used with CPA to determine how much silence is allowed after the start of human speech has been detected before processing of the audio is triggered (and a response generated). A larger value essentially allows longer pauses between words, while a shorter value may trigger end of speech too soon if someone is merely pausing between words.

This value defaults to 1200 ms and should be configured using the CPA grammar file. 

To change the value, add the following line to your CPA grammar, placing the desired setting in milliseconds in the content attribute (900 ms in the example):

<meta name="STREAM|VAD_EOS_DELAY" content="900"/> 

This value should only be modified with an understanding of the trade-offs: setting it too low will cause more false positive "HUMAN" responses (possibly misclassifying the prediction), while setting it too high will cause CPA responses to take longer to return.

As a simple guide, we have found the default value of 1200 ms works best in most situations. However, for implementations that are more sensitive to machine-to-human misclassifications, as may be the case in many call centers, setting a slightly higher value of perhaps 1500 ms would improve accuracy, at the cost of a small amount of additional delay between when the person answering the call stops speaking and when the system receives a CPA response. In this example, the difference between 1200 ms and 1500 ms means an additional 300 ms of delay (or observed silence) for the person answering the call. Application developers should consider what amount of delay they are willing for humans to encounter and weigh this trade-off against CPA accuracy.
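For instance, a call-center deployment following that guidance would simply raise the value in the meta tag shown earlier:

<meta name="STREAM|VAD_EOS_DELAY" content="1500"/>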


In versions 16.0.200 and earlier, a bug required two meta tags in the CPA grammar in order to set the VAD_EOS_DELAY setting, as shown here. Newer versions do not require the second entry:

<meta name="STREAM|VAD_EOS_DELAY" content="900"/>
<meta name="STREAM|PARM_VAD_EOS_DELAY" content="900"/> 


It is important to note that when using the CPA_MAX_TIME_FROM_CONNECT option (described above), the VAD_EOS_DELAY setting is also automatically set, so specifying it within the grammar when using that setting will have no effect.


Customizing the Return


As mentioned above, the default return from a tone detection or CPA grammar is one of the strings listed. However, a speech recognition result (which CPA / tone detection mirrors) actually has two distinct components:

  • The input sentence (also known as the "Input Text" or "Raw Text"). This is usually the actual word or phrase spoken by the user.
  • The Semantic Interpretation (or "SI"). This is usually a formatted result that contains the meaning or intent of the input sentence. 

For example, imagine a user says "One two three" with a digits grammar. The ASR will perform a recognition, using the grammar, to produce the Raw Text (otherwise known as Input Text) "ONE TWO THREE." It will then parse the grammar's semantic tags, using the input "ONE TWO THREE", to transform the phrase into a string of digits: "123". This is the Semantic Interpretation, which is useful for a number of reasons within speech processing. Generally speaking, the Semantic Interpretation is what the voice application will make use of ("ONE TWO THREE" likely requires further processing, whereas "123" might be useful as-is within an application).
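A simplified sketch of how such a pairing might look in an SRGS grammar rule (this is just an illustration of Input Text versus Semantic Interpretation, not one of the CPA or AMD grammars):

<rule id="root" scope="public">
    <one-of>
        <item>ONE TWO THREE<tag>out="123"</tag></item>
    </one-of>
</rule>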

CPA / tone detection goes through a similar process because LumenVox mimics ASR behavior in order to be more compatible with voice platforms that expect both Raw Text and Semantic Interpretation to be provided in results. There is no actual raw text in a tone detection, but LumenVox acts as though there were. The grammars allow you to customize both the Raw Text and the Semantic Interpretation that is returned for a CPA or tone detection response. 

The exact mechanism makes a little more sense if you are familiar with the syntax of SRGS grammars, but let's revisit part of the CPA grammar: 

    <meta name="HUMAN_RESIDENCE_CUSTOM_INPUT_TEXT" content="HUMAN RESIDENCE"/>
    <meta name="HUMAN_BUSINESS_CUSTOM_INPUT_TEXT"  content="HUMAN BUSINESS"/>
    <meta name="UNKNOWN_SPEECH_CUSTOM_INPUT_TEXT"  content="UNKNOWN SPEECH"/>
    <meta name="UNKNOWN_SILENCE_CUSTOM_INPUT_TEXT" content="UNKNOWN SILENCE"/>

Meta tags whose names include "CUSTOM_INPUT_TEXT" control what the Raw Text (sometimes called Input Text) gets set to if that event is returned. For instance, if CPA detects human business, then the input text will be set to the value associated with HUMAN_BUSINESS_CUSTOM_INPUT_TEXT (by default this is HUMAN BUSINESS). The grammar's root rule then maps that Input Text to a Semantic Interpretation:

   <rule id="root" scope="public">
       <one-of>
        <item>HUMAN RESIDENCE<tag>out="HUMAN RESIDENCE"</tag></item>
        <item>HUMAN BUSINESS<tag>out="HUMAN BUSINESS"</tag></item>
        <item>UNKNOWN SPEECH<tag>out="UNKNOWN SPEECH"</tag></item>
        <item>UNKNOWN SILENCE<tag>out="UNKNOWN SILENCE"</tag></item>
     </one-of>
    </rule> 

The Semantic Interpretation is then defined within the grammar, as in the rule shown above. If the Input Text is HUMAN BUSINESS, then the corresponding semantic tag gets executed, setting the Semantic Interpretation to HUMAN BUSINESS; both are returned in the result. Thus, if you ever alter the custom input text, you must also alter the grammar rule, or else you will not get appropriate results.
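For example, if you changed the human business Input Text, the matching rule item would need to change with it. A sketch, using a purely illustrative string:

<meta name="HUMAN_BUSINESS_CUSTOM_INPUT_TEXT" content="LIVE BUSINESS"/>

<!-- ...and the corresponding item in the root rule: -->
<item>LIVE BUSINESS<tag>out="LIVE BUSINESS"</tag></item>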

For CPA, there is little reason to change the input text or the interpretation. For AMD it may make a little more sense. Let's revisit that grammar:

<meta name="AMD_CUSTOM_INPUT_TEXT" content="AMD"/>
<meta name="FAX_CUSTOM_INPUT_TEXT" content="FAX"/>
<meta name="BUSY_CUSTOM_INPUT_TEXT" content="BUSY"/>

<meta name="SIT_REORDER_LOCAL_CUSTOM_INPUT_TEXT" content="SIT REORDER LOCAL"/>
<meta name="SIT_VACANT_CODE_CUSTOM_INPUT_TEXT" content="SIT VACANT CODE"/>
<meta name="SIT_NO_CIRCUIT_LOCAL_CUSTOM_INPUT_TEXT" content="SIT NO CIRCUIT LOCAL"/>
<meta name="SIT_INTERCEPT_CUSTOM_INPUT_TEXT" content="SIT INTERCEPT"/>
<meta name="SIT_REORDER_DISTANT_CUSTOM_INPUT_TEXT" content="SIT REORDER DISTANT"/>
<meta name="SIT_NO_CIRCUIT_DISTANT_CUSTOM_INPUT_TEXT" content="SIT NO CIRCUIT DISTANT"/>
<meta name="SIT_OTHER_CUSTOM_INPUT_TEXT" content="SIT OTHER"/> 

<rule id="root" scope="public">
    <one-of>
<item>AMD<tag>out="AMD"</tag></item>
<item>FAX<tag>out="FAX"</tag></item>
<item>BUSY<tag>out="BUSY"</tag></item>
<item>SIT REORDER LOCAL<tag>out="SIT"</tag></item>
<item>SIT VACANT CODE<tag>out="SIT"</tag></item>
<item>SIT NO CIRCUIT LOCAL<tag>out="SIT"</tag></item>
<item>SIT INTERCEPT<tag>out="SIT"</tag></item>
<item>SIT REORDER DISTANT<tag>out="SIT"</tag></item>
<item>SIT NO CIRCUIT DISTANT<tag>out="SIT"</tag></item> 
<item>SIT OTHER<tag>out="SIT"</tag></item>
    </one-of>
</rule> 

Here notice that all of the SIT (Special Information Tone) detection events will return the same interpretation of SIT. If an application wanted to treat the SIT intercept differently from the other SIT events, then the interpretation for that item would need to be edited, as shown in this example:

    <item>SIT INTERCEPT<tag>out="INTERCEPT"</tag></item>

The other tones can also be modified as needed, should your application require the distinction.

For more details relating to working with grammars when using CPA and Tone Detection, see our Grammars in CPA and AMD article.