API or MRCP Integration

Choosing a Development Path for Speech Recognition Solutions

With the introduction of the Media Resource Control Protocol (MRCP), speech solution and platform developers now have a choice in how they integrate the LumenVox Speech Engine and other speech engines into their applications: they can use MRCP, or write directly to the Application Programming Interface (API).

This paper discusses the pros and cons of both development methods, so that you can choose the proper path for your organization.

About MRCP

MRCP was proposed to the Internet Engineering Task Force (IETF) in April 2006 and is now in version 2. It controls media resources such as speech synthesizers, speech recognizers, signal generators, signal detectors, fax servers, and voice biometrics servers over a network. Until the protocol was defined, these components had to be provided by a single vendor with a proprietary interface. In essence, MRCP gives the developer a common language for managing all of these diverse media resources. Version 2 of the protocol is designed to work with the Session Initiation Protocol (SIP), which helps establish control connections to external media streaming devices, and with media delivery mechanisms such as the Real-time Transport Protocol (RTP).
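
To make the idea of a common language more concrete, the following Python sketch frames a single MRCP version 2 request. The start-line layout (protocol version, total message length, method name, request ID) follows the published MRCPv2 specification. The helper name build_mrcp_request, the channel identifier, the request ID, and the grammar URI are illustrative assumptions, and a real client would first negotiate the control channel through SIP before sending anything.

```python
def build_mrcp_request(method, request_id, headers, body=b""):
    """Frame one MRCPv2 request as bytes (illustrative helper, not a vendor API)."""
    header_block = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    if body:
        header_block += f"Content-Length: {len(body)}\r\n"
    # An empty line ends the header block; the optional body follows it.
    tail = (header_block + "\r\n").encode("utf-8") + body

    # The message-length field counts every byte of the message, including
    # its own digits, so grow the guess until it matches the framed size.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("utf-8")
        total = len(start_line) + len(tail)
        if total == length:
            return start_line + tail
        length = total


request = build_mrcp_request(
    "RECOGNIZE",
    543257,  # assumed request ID
    {
        "Channel-Identifier": "32AECB23433801@speechrecog",  # assumed value
        "Content-Type": "text/uri-list",
    },
    body=b"http://example.com/grammars/yes-no.grxml\r\n",  # hypothetical grammar URI
)
print(request.decode("utf-8"))
```

The same framing would apply whichever vendor's recognizer sits behind the channel, which is the portability MRCP is meant to provide.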

MRCP's strength is that it enables your application or platform to integrate with the speech recognition and text-to-speech engines of your choice. MRCP addresses the need for client control of media processing resources. There are some caveats, however, when implementing your speech application using MRCP: there is no backwards compatibility, and control of core engine features is limited to what the MRCP standard defines.

In addition, there may be subtle differences in each speech vendor's implementation of the standard. This is usually because some vendors adopt the standard at different stages of the draft revisions, before the draft is complete. Also, vendors sometimes make special adaptations of the standard to suit their needs.

Despite these issues, if you want to have the ability to support multiple speech products through a single interface, MRCP would be the correct choice for your speech project.

MRCP Pros and Cons

Pros

  • Little transition time when implementing multiple core speech technologies.
  • Generic interface.
  • The engine client (your software making use of the speech technology) does not need to link directly to a vendor's DLLs or shared libraries.
  • Any communication between the speech server and customer software is done strictly through the network.
  • Since all communication between client and server is done over the network, there is no inherent limitation on which OS or hardware the client can be implemented on. For example, the client could be implemented on a custom-built embedded device and communicate with a Windows server.

Cons

  • Generally a more time-consuming implementation, due to standards compliance work and irregularities between vendors' implementations.
  • Generic interface.
  • No backwards compatibility.
  • The client must handle the details of network communication with the servers. If done from scratch, this can be a very large project. There are, however, open source projects available for handling the client side of the MRCP specification.
  • Some vendors may add vendor-specific options to the MRCP specification, somewhat limiting the advantage MRCP has as a "vendor agnostic" choice.
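
As a rough illustration of the client-side work described in the cons above, the Python sketch below parses an MRCP version 2 response and separates out any vendor-specific parameters. The response start-line layout (version, message length, request ID, status code, request state) and the Vendor-Specific-Parameters header come from the MRCPv2 specification. The function name parse_mrcp_response and the shape of the returned dictionary are assumptions made for this example, and a production client would also need connection handling, events, and error recovery.

```python
def parse_mrcp_response(raw: bytes) -> dict:
    """Parse one complete MRCPv2 response (illustrative helper, not a vendor API)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")

    # Response start-line: version, message length, request ID, status, state.
    version, length, request_id, status, state = lines[0].split(" ")

    # Header names are matched case-insensitively, so store them lowercased.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Vendor extensions travel as semicolon-separated name=value pairs.
    vendor_params = {}
    for pair in headers.get("vendor-specific-parameters", "").split(";"):
        if "=" in pair:
            key, _, value = pair.partition("=")
            vendor_params[key.strip()] = value.strip()

    return {
        "version": version,
        "length_ok": int(length) == len(raw),  # declared size vs. bytes received
        "request_id": int(request_id),
        "status_code": int(status),
        "request_state": state,
        "headers": headers,
        "vendor_params": vendor_params,
        "body": body,
    }
```

A client that wants to stay vendor agnostic could log or reject any response whose vendor_params entry is not empty, which makes a dependency on one engine's extensions visible early.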

About the API

The other option for developers is to write their application directly to the API. The API is the optimized interface to the product and its features. This option usually takes less time and provides the developer with greater control. Most importantly, it gives access to all of the features specific to a product, rather than only the subset of features that is common to all speech products, as provided through MRCP.

Generally, this option gives the developer more flexibility, with less reliance on the speech vendor. Additionally, the speech vendor can respond to feature requests more quickly and expose new features to the developer sooner (often by weeks) through the API than through MRCP.

API Pros and Cons

Pros

  • Backwards compatibility.
  • Exposure to greater functionality through the most optimized interface, for example a Global Grammar for all resources and larger grammars, as well as greater control of pre-compiled grammars and of client-side and server-side grammar caching.
  • Specific requests for changes to how an engine works can be accommodated much more easily by the vendor.
  • The API shares some of the workload between client and server, so many low-CPU tasks can be done by the client without using network resources. For example, API barge-in detection occurs in the client's memory space, which is more efficient and faster than streaming audio over the network.
  • The vendor may allow a group of servers to be treated as a resource cluster. This can offer load balancing and automatic failover.
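
The resource cluster idea in the last point can be pictured with a small Python sketch that rotates requests across servers and skips any server that fails. This shows only the general technique. The ServerCluster class, its call method, and the send callback are hypothetical names, and a vendor's own client library may balance load and detect failures quite differently.

```python
class ServerCluster:
    """Round-robin over a list of servers, skipping any that fail (hypothetical)."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._next = 0  # index of the server to try first on the next call

    def call(self, request, send):
        """Send request via send(server, request), failing over on ConnectionError."""
        count = len(self._servers)
        errors = []
        for offset in range(count):
            index = (self._next + offset) % count
            server = self._servers[index]
            try:
                result = send(server, request)
            except ConnectionError as exc:
                errors.append((server, exc))  # remember the failure, try the next server
                continue
            # Start the next call at the following server to spread the load.
            self._next = (index + 1) % count
            return result
        raise ConnectionError(f"all {count} servers failed: {errors}")
```

Here send stands in for whatever function actually delivers a request to one speech server. The cluster object only decides which server to try and in what order.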

Cons

  • A complete rewrite of the interface every time you want to integrate with a new speech technology.
  • The vendor's client-side library shares process space with the client-side process. While this can greatly speed up certain tasks (like speech barge-in detection), it both limits the client to the OS and hardware supported by the vendor and increases the memory footprint of the client-side application.
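
To show why in-process barge-in detection can be cheap, here is a minimal Python sketch of an energy-based detector that runs entirely on audio frames already in the client's memory. It is a simple stand-in rather than the method any particular engine uses. It assumes 16-bit little-endian mono PCM frames, and the function names, the threshold of 500, and the three-frame minimum are illustrative assumptions.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def detect_barge_in(frames, threshold=500.0, min_frames=3):
    """Return the index of the first frame in a run of min_frames loud frames.

    Returns None when the caller never speaks over the prompt.
    """
    run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            run += 1
            if run >= min_frames:
                return index - min_frames + 1
        else:
            run = 0  # an isolated click or pop does not count as speech
    return None
```

Because the check touches only local buffers, the client can stop prompt playback as soon as the function returns an index, without waiting for audio to travel to a server and a result to come back.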

Conclusion

Choosing an integration path for your speech application project comes down to what is right for your organization. If you want the flexibility to mix and match speech technologies, then using MRCP may be the best choice for you. If you need greater functionality and an easier integration process, then writing to the API might be the best solution for your company.

If you need further information on this topic, or would like to discuss your particular speech project, please contact LumenVox at 1–877–977–0707 or email our support team.

In addition, the MRCP specification is available from the Internet Engineering Task Force (IETF) site: www.ietf.org.

For more information on LumenVox products, such as the Speech Engine, check out our Product Overview.