
Synthesizer SPEAK

Reference Number: AA-01663

The SPEAK method, sent from the client to the server, provides the synthesizer resource with the speech text and initiates speech synthesis and streaming. The SPEAK method can carry voice and prosody header fields that define the behavior of the voice being synthesized, as well as the actual marked-up text to be spoken. If specific voice and prosody parameters are specified as part of the speech markup text, they take precedence over the values specified in the header fields and those set using a previous SET-PARAMS request.

Voice parameters are applied at three levels of scope. The highest precedence goes to parameters specified within the speech markup text. Next are those specified in the header fields of the SPEAK request, which apply to that SPEAK request only. Lowest are the session default values, which are set using the SET-PARAMS request and apply to the rest of the session.
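
The three-level lookup can be sketched in a few lines of Python. This is only an illustration of the precedence order, not how any MRCP server implements it. The function name effective_voice_params, the dictionary layout, and the sample values are all assumptions made for the example, and real markup attributes do not map one-to-one onto header names.

```python
from collections import ChainMap


def effective_voice_params(markup, speak_headers, session_defaults):
    """Merge the three scopes; earlier mappings shadow later ones."""
    # ChainMap searches its mappings left to right, so the argument order
    # encodes the precedence: markup > SPEAK headers > SET-PARAMS defaults.
    return dict(ChainMap(markup, speak_headers, session_defaults))


# Hypothetical values, for illustration only.
session_defaults = {"Voice-gender": "male", "Prosody-volume": "soft"}  # set via SET-PARAMS
speak_headers = {"Voice-gender": "female"}                             # on this SPEAK only
markup = {"Prosody-rate": "0.8"}                                       # taken from the markup

params = effective_voice_params(markup, speak_headers, session_defaults)
print(params["Voice-gender"])    # comes from the SPEAK header
print(params["Prosody-volume"])  # falls through to the session default
```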

If the resource is idle and the SPEAK request is being actively processed, the resource will respond with a success status code and a request-state of IN-PROGRESS.

If the resource is in the speaking or paused state (i.e., it is in the middle of processing a previous SPEAK request), it responds with a success status code and a request-state of PENDING. This means that the SPEAK request is queued and will be processed after the currently active SPEAK request completes.

For the synthesizer resource, this is the only request that can return a request-state of IN-PROGRESS or PENDING. When synthesis of the text is complete, the resource issues a SPEAK-COMPLETE event with the request-id of the SPEAK message and a request-state of COMPLETE.
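
The request-state behavior described above can be modeled as a small queue. The following is a toy sketch, not a server implementation: the class SynthesizerModel and its methods are hypothetical names, and the speaking and paused states are folded together as "a request is active".

```python
from collections import deque


class SynthesizerModel:
    """Toy model of how SPEAK requests are answered and queued."""

    def __init__(self):
        self.active = None      # request-id currently being spoken (or paused)
        self.pending = deque()  # request-ids waiting their turn

    def speak(self, request_id):
        """Return the (status-code, request-state) pair for a new SPEAK."""
        if self.active is None:
            # Idle resource: the request is processed immediately.
            self.active = request_id
            return (200, "IN-PROGRESS")
        # Speaking or paused: the request is accepted but queued.
        self.pending.append(request_id)
        return (200, "PENDING")

    def finish_active(self):
        """Complete the active request and promote the next queued one."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        done = self.active
        self.active = self.pending.popleft() if self.pending else None
        return ("SPEAK-COMPLETE", done, "COMPLETE")


model = SynthesizerModel()
print(model.speak(543257))    # idle resource
print(model.speak(543258))    # resource busy with 543257
print(model.finish_active())  # event for 543257; 543258 becomes active
```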


MRCPV1 SPEAK Example:

C->S: SPEAK 543257 MRCP/1.0
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:application/synthesis+ssml
      Content-Length:104

      <?xml version="1.0"?>
      <speak>
      <paragraph>
        <sentence>You have 4 new messages.</sentence>
        <sentence>The first is from <say-as
        type="name">Stephanie Williams</say-as>
        and arrived at <break/>
        <say-as type="time">3:45pm</say-as>.</sentence>

        <sentence>The subject is <prosody
        rate="0.8">ski trip</prosody></sentence>
      </paragraph>
      </speak>

S->C: MRCP/1.0 543257 200 IN-PROGRESS

S->C: SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0
      Completion-Cause:000 normal


MRCPV2 SPEAK Example:

C->S: MRCP/2.0 ... SPEAK 543257
      Channel-Identifier:32AECB23433802@speechsynth
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:application/ssml+xml
      Content-Length:...

      <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
         <p>
          <s>You have 4 new messages.</s>
          <s>The first is from Stephanie Williams and arrived at
             <break/>
             <say-as interpret-as="vxml:time">0342p</say-as>.
             </s>
          <s>The subject is
                 <prosody rate="0.8">ski trip</prosody>
          </s>
         </p>
        </speak>

S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=857206027059

S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
      Channel-Identifier:32AECB23433802@speechsynth
      Completion-Cause:000 normal
      Speech-Marker:timestamp=857206027059

 

Language Selection when using Plain Text Synthesis

When working with TTS synthesis over MRCP, it is preferable to use the SSML Content-Type as described above, because it gives you greater flexibility and control over how words and phrases are pronounced, as well as fine control over the voices used.

Sometimes it is necessary to use plain text during synthesis instead of SSML. This may be a preference, or the platform you are using may not support SSML, in which case you have little choice. In these instances, it is important to understand how to control the TTS language used during synthesis. Prior to LumenVox version 14.1, the language was fixed by the SYNTHESIS_LANGUAGE setting configured in your client_property.conf file. As of version 14.1, LumenVox supports the Speech-Language header when processing plain text syntheses (with Content-Type: text/plain), as shown in the examples below:

MRCPV1 SPEAK Example using plain text with British English language:

C->S: SPEAK 543257 MRCP/1.0
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:text/plain
      Speech-Language:en-GB
      Content-Length:31

      LumenVox TTS is the Bee's Knees

S->C: MRCP/1.0 543257 200 IN-PROGRESS

S->C: SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0
      Completion-Cause:000 normal
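
A request like the MRCPv1 plain-text SPEAK in this example could be assembled along the following lines. This is a minimal sketch under stated assumptions: build_plain_text_speak_v1 is a hypothetical helper, only the MRCP message itself is produced (the transport that carries MRCPv1 messages is out of scope), and the Voice-gender and Prosody-volume headers are omitted for brevity. It shows two points: all header fields come before the blank line that separates headers from the body, and Content-Length is the byte length of the body.

```python
def build_plain_text_speak_v1(request_id: int, text: str, language: str) -> bytes:
    """Build an MRCPv1-style SPEAK message carrying plain text."""
    body = text.encode("utf-8")
    lines = [
        f"SPEAK {request_id} MRCP/1.0",
        "Content-Type:text/plain",
        f"Speech-Language:{language}",
        # Content-Length counts the bytes of the body, not its characters.
        f"Content-Length:{len(body)}",
    ]
    # A single empty line ends the header section; the body follows it.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


message = build_plain_text_speak_v1(543257, "LumenVox TTS is the Bee's Knees", "en-GB")
print(message.decode("utf-8"))
```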


MRCPV2 SPEAK Example using plain text with British English language:

C->S: MRCP/2.0 ... SPEAK 543257
      Channel-Identifier:32AECB23433802@speechsynth
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:text/plain
      Speech-Language:en-GB
      Content-Length:31

      LumenVox TTS is the Bee's Knees

S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=857206027059

S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
      Channel-Identifier:32AECB23433802@speechsynth
      Completion-Cause:000 normal
      Speech-Marker:timestamp=857206027059


Please note that a Speech-Language header sent with SSML content (Content-Type: application/ssml+xml) is ignored, since the SSML must contain its own language specifier, which is used instead. The Speech-Language header is therefore only used with Content-Type: text/plain requests.
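
The selection rule can be summarized in a short Python sketch. It is an illustration of the rule as described here, not LumenVox code. The function effective_language is hypothetical, the configured_default argument stands in for the SYNTHESIS_LANGUAGE setting, falling back to it when no Speech-Language header is present is an assumption, and treating the MRCPv1 SSML content type the same way as application/ssml+xml is also an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
SSML_TYPES = {"application/ssml+xml", "application/synthesis+ssml"}


def effective_language(content_type, body, speech_language=None,
                       configured_default="en-US"):
    """Return the language a synthesis request would be spoken in."""
    if content_type in SSML_TYPES:
        # SSML carries its own language; any Speech-Language header is ignored.
        return ET.fromstring(body).get(XML_LANG)
    if content_type == "text/plain":
        # Plain text: use the header if present, else the configured default.
        return speech_language or configured_default
    raise ValueError(f"unsupported Content-Type: {content_type}")


ssml = ('<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">Hello</speak>')
print(effective_language("application/ssml+xml", ssml, speech_language="en-GB"))
print(effective_language("text/plain", "Hello", speech_language="en-GB"))
```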