Video The Speech Ecosystem



  • This video explains the basics of computer-based telephone systems and how speech applications fit into them. It gives an overview of how incoming calls (from the traditional telephone network or over voice over IP) are handled by software telephony platforms, and how speech recognition and text-to-speech engines are integrated with the applications those platforms run.
  • RUNTIME 5:45


The Speech Ecosystem

This presentation will go over the following topics:

  • Computer Telephony Basics
  • Platforms
  • Adding Speech

Computer Telephony Basics

Incoming calls generally come through in one of two ways:

  • POTS (Plain Old Telephone Service): The line may be digital or analog, and it requires hardware to convert the telephone signal into data that the computer can understand.
  • VoIP (Voice over Internet Protocol): The call arrives over the Internet and requires some additional resources, either dedicated hardware or general-purpose CPU time, to decode.

Inbound call handling is called call control. This includes actions such as picking up the call, transferring it, or hanging up.
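The call control actions named above (picking up, transferring, hanging up) can be pictured as a small state machine. The following is only a minimal sketch: the `Call` class, the state names, and the action strings are hypothetical and do not correspond to any particular platform's API.

```python
from enum import Enum, auto


class CallState(Enum):
    RINGING = auto()
    CONNECTED = auto()
    TRANSFERRED = auto()
    ENDED = auto()


class Call:
    """Hypothetical model of one inbound call; not a real platform API."""

    # Which action is legal in which state, and the state it leads to.
    _TRANSITIONS = {
        (CallState.RINGING, "answer"): CallState.CONNECTED,
        (CallState.RINGING, "hang_up"): CallState.ENDED,
        (CallState.CONNECTED, "transfer"): CallState.TRANSFERRED,
        (CallState.CONNECTED, "hang_up"): CallState.ENDED,
        (CallState.TRANSFERRED, "hang_up"): CallState.ENDED,
    }

    def __init__(self, source: str) -> None:
        self.source = source  # for example "POTS" or "VoIP"
        self.state = CallState.RINGING

    def apply(self, action: str) -> CallState:
        """Perform a call control action, rejecting ones that make no sense."""
        key = (self.state, action)
        if key not in self._TRANSITIONS:
            raise ValueError(f"cannot {action} while {self.state.name}")
        self.state = self._TRANSITIONS[key]
        return self.state
```

For example, `Call("VoIP").apply("answer")` moves the call from RINGING to CONNECTED, while trying to transfer a call that has not been answered raises an error.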

Telephony Platforms

A platform is a software system that does two things:

  • Call Control, as described above
  • Running an application. An IVR application such as a call router or a banking application, for example, is run by the platform. The platform also handles PBX functionality such as conference bridging and voicemail.

The following graphic shows how this all works together. On the left of the illustration you'll see SIP (or other VoIP traffic); at the bottom right is POTS (plain old telephone service).

Above that is the hardware layer, and above that is the platform. Within the platform are call control and the applications, which communicate with each other when necessary.

Writing Telephony Applications

In the past, applications and call control were written directly to the platform's proprietary API.

Modern platforms generally use open standards:

  • For call control, the open standard is Call Control XML (CCXML)
  • For applications, the open standard is VoiceXML (VXML)
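To give a feel for what a VXML application looks like, here is a minimal sketch that builds a one-prompt VoiceXML 2.0 document using Python's standard library. The function name `build_greeting` and the prompt text are illustrative assumptions; a real application would normally be a hand-written or server-generated .vxml file that the platform's VXML browser fetches.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C VoiceXML 2.0 specification.
VXML_NS = "http://www.w3.org/2001/vxml"


def build_greeting(prompt_text: str) -> str:
    """Return a VoiceXML document with one form that plays one prompt."""
    # Serialize the VoiceXML namespace as the default (unprefixed) namespace.
    ET.register_namespace("", VXML_NS)
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, f"{{{VXML_NS}}}form")
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
    prompt.text = prompt_text  # played to the caller, typically via TTS
    return ET.tostring(vxml, encoding="unicode")
```

Calling `build_greeting("Thank you for calling.")` returns a `vxml` root element containing a `form`, a `block`, and a `prompt`. CCXML documents are XML in the same spirit, but they describe call control events rather than dialog.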

The illustration for this is almost the same as before, except that instead of call control you'll see a CCXML browser, and instead of an application you'll see a VXML browser. Otherwise the graphic is unchanged.

Adding Speech

Most speech applications make use of ASR (automatic speech recognition, also referred to as a speech engine), TTS (text-to-speech), or both. These are controlled at the application level. The application can be written directly to the API of the ASR or TTS engine, or it can use MRCP (Media Resource Control Protocol).
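One way to picture the two integration routes is as two interchangeable back ends behind a single interface. The sketch below is an assumption-laden illustration: `Recognizer`, `NativeApiRecognizer`, `MrcpRecognizer`, and `route_call` are hypothetical names, neither class talks to a real engine or speaks real MRCP, and the "audio" is simply text encoded as bytes so the example can run on its own.

```python
from typing import Protocol


class Recognizer(Protocol):
    """What the application needs from any ASR resource."""

    def recognize(self, audio: bytes) -> str: ...


class NativeApiRecognizer:
    """Stand-in for an engine driven through its own API (hypothetical)."""

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio is already text


class MrcpRecognizer:
    """Stand-in for an engine reached over MRCP (hypothetical, no network I/O)."""

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # a real client would send an MRCP request


def route_call(recognizer: Recognizer, audio: bytes) -> str:
    """Application logic: the same whichever recognizer is plugged in."""
    utterance = recognizer.recognize(audio).strip().lower()
    extensions = {"billing": "ext-100", "support": "ext-200"}
    return extensions.get(utterance, "operator")
```

Here `route_call(NativeApiRecognizer(), b"Billing")` and `route_call(MrcpRecognizer(), b"Billing")` both return `"ext-100"`: the call-routing logic does not change with the way the engine is reached.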

VXML in Speech

Because speech resources are controlled by the platform through an API or MRCP, they are independent of the underlying application code. This means that speech applications work in much the same way whether they are written in VXML or written directly to the platform's API.

The following graphics illustrate this. The graphic on the left shows an application written directly to the API of the platform. The graphic on the right shows the use of VXML and CCXML. Note that ASR and TTS form a layer on top; it doesn't matter whether they are reached via API or MRCP.

Our graphics show that the call comes in and reaches the computer, then goes through call control or the CCXML browser. The application or VXML browser comes next, and the speech engine or TTS engine sits on top.

© 2018 LumenVox, LLC. All rights reserved.