There are two settings in MRCPv1 that are used to specify the references for recognizer and synthesizer resources. If these strings do not match between the client and the Media Server, the receiving side may not be able to tell which resource is being requested. These resource definitions (resource URLs) can generally be changed at either the client end or the Media Server end so that they match; it does not matter which end you change. For details on changing these resource URLs, see the media_server.conf article.
Session Initiation Protocol - SIP
SIP is an IETF-defined signaling protocol that has been widely adopted for controlling communication sessions, such as VoIP calls. Version 28 of this draft is described here. This protocol was originally designed in 1996 by Henning Schulzrinne and Mark Handley, and has since become popular enough that many speech application developers have adopted it. Most VXML implementations use SIP connectivity, which makes it easy to connect telephony and speech systems in a controlled Interactive Voice Response (IVR) environment.
From a networking perspective, SIP differs from RTSP in that it may use either UDP or TCP connections to communicate, depending on the client requirements. SIP over TCP support was added in LumenVox version 11.1, so both UDP and TCP connections are now fully supported for SIP. Prior to version 11.1, only UDP connections were supported for SIP.
A SIP session differs further from RTSP in that, once it is established, a second communication channel (TCP-based this time) carries the MRCP traffic. The port numbers used for MRCP are negotiated during session initialization (via SDP). This means that all of the session control information is sent using SIP over UDP or TCP, while all of the MRCP information is sent over its own dedicated TCP connection.
This separation can be useful for network engineers who need to control how traffic is routed. For example, using SIP, it is possible to configure proxy servers and routers to send the SIP (session control) traffic via one network path, and the MRCP (resource control) traffic via a different path. This may be beneficial when configuring large systems that connect to Session Border Controllers (SBCs) and proxy servers. However, this topic is beyond the scope of this overview.
LumenVox supports the IETF RFC 3261 memo describing the Session Initiation Protocol.
Media Resource Control Protocol - MRCP
MRCP is the protocol used once a session has been established using either SIP or RTSP. It differs from those protocols, which control the overall state of the session or connection. Instead, MRCP is used to control the various speech resources that are used within the session. For example, if the client application wants to request some TTS audio, or if it wants to request speech recognition, MRCP is used to communicate these requests.
As was mentioned above, RTSP uses Version 1 of MRCP, while SIP uses Version 2 of MRCP. Both versions perform very similar functions. However, there are subtle differences between them which may need to be considered if you intend to write the MRCP protocol handler yourself (this is not a small task).
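One of the subtle differences is the layout of the first line of a request. As a rough illustration (not the LumenVox implementation), the sketch below builds the same request in both styles: MRCPv1 places the version last, while MRCPv2 places the version first and follows it with the total message length in bytes, which includes the digits of the length itself. The function names, the request ID and the Channel-Identifier value are made up for this example, and the caller is assumed to supply any Content-Length header when a body is present.

```python
CRLF = "\r\n"


def _header_block(headers):
    """Render 'Name: value' header lines, each terminated by CRLF."""
    return "".join(f"{name}: {value}{CRLF}" for name, value in headers.items())


def mrcp_v1_request(method, request_id, headers, body=""):
    """MRCPv1 start line: '<method> <request-id> MRCP/1.0'."""
    start_line = f"{method} {request_id} MRCP/1.0{CRLF}"
    return (start_line + _header_block(headers) + CRLF + body).encode("utf-8")


def mrcp_v2_request(method, request_id, headers, body=""):
    """MRCPv2 start line: 'MRCP/2.0 <message-length> <method> <request-id>'."""
    prefix = b"MRCP/2.0 "
    rest = (f" {method} {request_id}{CRLF}" + _header_block(headers)
            + CRLF + body).encode("utf-8")
    fixed = len(prefix) + len(rest)
    # The length field counts its own digits, so iterate until it is stable.
    length = fixed + 1
    while fixed + len(str(length)) != length:
        length = fixed + len(str(length))
    return prefix + str(length).encode("ascii") + rest


# Hypothetical values, for illustration only.
v1_message = mrcp_v1_request("STOP", 543258, {})
v2_message = mrcp_v2_request(
    "STOP", 543258, {"Channel-Identifier": "32AECB23433801@speechsynth"}
)
print(v1_message.decode().splitlines()[0])
print(v2_message.decode().splitlines()[0])
```

The length calculation is the part most easily overlooked when converting a Version 1 handler to Version 2, because adding or removing a single header changes the value that must appear on the first line.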
As well as communicating requests, responses and events between the client and Media Server, these messages also control audio media which may be streamed between the two. These types of audio/media streams are transported by another protocol, called RTP, which is described below.
LumenVox supports the IETF RFC 4463 memo describing MRCP (v1) and also the IETF RFC 6787 memo describing MRCP Version 2.
Real-time Transport Protocol - RTP
This protocol is used to transport audio streams over a network. These audio streams may carry TTS or ASR audio. Typically, TTS traffic flows away from the Media Server and ASR traffic flows towards the Media Server.
There are several types of audio that can be transported using this protocol. LumenVox supports PCMU (ulaw) and PCMA (alaw) encoded at 8 kHz. No other formats are supported.
Typically audio is split into small packets of data representing around 20 ms of time. These small packets are streamed one after the other from one end to the other. The receiver puts these small packets together and uses the audio stream for whatever it needs - playing out audio to a speaker, or sending audio into the speech recognizer.
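To make the packet sizes concrete, here is a small sketch of the arithmetic. It assumes 8 kHz PCMU or PCMA audio, which uses one byte per sample, and the typical 20 ms packet duration. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # PCMU / PCMA sample rate
BYTES_PER_SAMPLE = 1    # both encodings use 8 bits per sample


def payload_size(packet_ms=20):
    """Number of audio bytes carried by one packet of the given duration."""
    return SAMPLE_RATE_HZ * packet_ms // 1000 * BYTES_PER_SAMPLE


def packetize(audio, packet_ms=20):
    """Split encoded audio into consecutive packet-sized chunks."""
    size = payload_size(packet_ms)
    return [audio[i:i + size] for i in range(0, len(audio), size)]


one_second = bytes(SAMPLE_RATE_HZ)      # one second of placeholder audio
packets = packetize(one_second)
print(len(packets), len(packets[0]))    # 50 160
```

In other words, each 20 ms packet carries 160 bytes of audio, and one second of audio is sent as 50 packets.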
Each packet has various attributes describing its format and its position (sequence number) within the stream. This is important because UDP datagrams are used to transport RTP audio. UDP is very efficient for this task, but packets can be lost or arrive out of sequence in certain situations. The receiver reviews the sequence information associated with each packet and reassembles the stream as best it can to maintain audio quality.
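The sketch below shows, under simplifying assumptions, how a receiver might read the sequence number from the fixed 12-byte RTP header and put packets back in order. It ignores header extensions, padding and sequence-number wraparound at 65535, all of which a real receiver has to handle. The names RtpPacket, build_rtp, parse_rtp and reorder are made up for this example.

```python
import struct
from typing import NamedTuple


class RtpPacket(NamedTuple):
    payload_type: int   # 0 = PCMU, 8 = PCMA (static RTP payload types)
    sequence: int
    timestamp: int
    payload: bytes


def build_rtp(payload_type, sequence, timestamp, payload, ssrc=0x1234):
    """Build a minimal RTP packet (used here only to create test data)."""
    first_byte = 2 << 6  # version 2, no padding, no extension, no CSRCs
    header = struct.pack("!BBHII", first_byte, payload_type & 0x7F,
                         sequence, timestamp, ssrc)
    return header + payload


def parse_rtp(datagram):
    """Parse the fixed RTP header and return the fields used for reordering."""
    if len(datagram) < 12:
        raise ValueError("too short for an RTP header")
    first_byte, second_byte, sequence, timestamp, _ssrc = struct.unpack(
        "!BBHII", datagram[:12])
    if first_byte >> 6 != 2:
        raise ValueError("not RTP version 2")
    header_len = 12 + 4 * (first_byte & 0x0F)   # skip any CSRC entries
    return RtpPacket(second_byte & 0x7F, sequence, timestamp,
                     datagram[header_len:])


def reorder(packets):
    """Sort packets by sequence number, drop duplicates, report gaps."""
    by_seq = {p.sequence: p for p in packets}
    ordered = [by_seq[s] for s in sorted(by_seq)]
    if not ordered:
        return [], []
    missing = [s for s in range(ordered[0].sequence, ordered[-1].sequence + 1)
               if s not in by_seq]
    return ordered, missing


# Packets 3 and 2 arrive swapped and packet 4 never arrives.
arrived = [build_rtp(0, seq, seq * 160, bytes([seq]) * 160)
           for seq in (1, 3, 2, 5)]
ordered, missing = reorder([parse_rtp(d) for d in arrived])
print([p.sequence for p in ordered], missing)   # [1, 2, 3, 5] [4]
```

What the receiver does about a missing packet (for example, substituting silence) is a design decision that is outside the scope of this sketch.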
In addition to audio, Dual-Tone Multi-Frequency (DTMF, or touch-tone) key presses are sent over the RTP stream. It is important to understand that the beep itself is not sent as audio, since such "in-band" beeps would interfere with speech recognition. Instead, DTMF tones are sent as RTP Events, which are special packets indicating which key was pressed.
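As an illustration of what such an event packet contains, the following sketch decodes the 4-byte event payload described in RFC 2833: an event code, an end-of-event flag, a volume field and a duration measured in RTP timestamp units. The function name and the sample values are made up for this example, and an 8 kHz clock is assumed. The RTP payload type that marks a packet as an event is dynamic and is agreed when the session is negotiated.

```python
import struct

DTMF_KEYS = "0123456789*#ABCD"   # event codes 0 through 15


def parse_dtmf_event(payload, clock_rate=8000):
    """Decode an RFC 2833 telephone-event payload into a small dict."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_KEYS):
        raise ValueError("event code is not a DTMF key")
    return {
        "key": DTMF_KEYS[event],
        "end": bool(flags & 0x80),       # E bit: last packet for this key
        "volume": flags & 0x3F,          # low 6 bits of the flags byte
        "duration_ms": duration * 1000 // clock_rate,
    }


# Hypothetical packet: key "5", end of event, volume 10, 800 timestamp units.
sample = struct.pack("!BBH", 5, 0x80 | 10, 800)
print(parse_dtmf_event(sample))
# {'key': '5', 'end': True, 'volume': 10, 'duration_ms': 100}
```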
The RTP ports used within a session are negotiated when the session is established. This associates the RTP stream with a specific resource (recognizer or synthesizer), which also determines the stream's direction.
LumenVox supports the IETF RFC 3550 memo describing the Real-time Transport Protocol and also the IETF RFC 2833 memo describing the use of DTMF over RTP.
Session Description Protocol - SDP
This protocol is used in conjunction with SIP and RTSP to establish multimedia sessions. This essentially means that using SDP you can describe the various audio streams and MRCP streams that are needed within a session.
SDP is used when negotiating a session and is used to describe the streaming media initialization parameters, including which audio format to use and which ports for RTP and MRCP (in the case of SIP sessions) should be used.
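The following sketch shows how these parameters appear in an SDP body and how a client might read them. The sample is modeled on the style of SDP used for MRCPv2 over SIP. The address, port numbers and channel identifier are made up for illustration and are not taken from a real LumenVox session. The parse_media helper is hypothetical and only handles the "m=" (media) and "a=" (attribute) lines.

```python
SAMPLE_SDP = """\
v=0
o=- 2890844526 2890842808 IN IP4 192.0.2.11
s=-
c=IN IP4 192.0.2.11
t=0 0
m=application 32416 TCP/MRCPv2 1
a=setup:passive
a=connection:new
a=channel:32AECB234338@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=sendonly
a=mid:1
"""


def parse_media(sdp):
    """Return one dict per 'm=' line, with the attributes that follow it."""
    media = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            kind, port, proto, *formats = line[2:].split()
            media.append({"kind": kind, "port": int(port), "proto": proto,
                          "formats": formats, "attributes": []})
        elif line.startswith("a=") and media:
            media[-1]["attributes"].append(line[2:])
    return media


for section in parse_media(SAMPLE_SDP):
    print(section["kind"], section["port"], section["proto"])
```

In this sample, the "application" section carries the TCP port for the MRCP channel and the "audio" section carries the RTP port and audio format. The "sendonly" attribute indicates the direction of the audio stream.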
LumenVox supports the IETF RFC 4566 memo describing the Session Description Protocol.