Introduction to TTS

Reference Number: AA-01548

The LumenVox Text To Speech (TTS) Engine is a core technology that lets you synthesize spoken audio from text, using a selection of available voices and languages. This audio can then be used in a variety of applications to provide prompts and dynamic content where needed, and can optionally be sent out as an audio stream to other applications or equipment.

The task of obtaining synthesized audio from a specified piece of text can be broken up into four general phases:

Setup: Your application creates a TTS client object and then requests some specified text to be synthesized.
Getting Results: The application can periodically check whether the TTS engine has completed the synthesis request, then determine the amount of audio produced.
Interacting with Audio: Once the synthesis request has been successfully completed and the audio length determined, the application can request the entire audio in a single request or a portion at a time by specifying a smaller amount for each request, as needed (for example when streaming the audio).
Cleaning up: Once the application is finished with each TTS client, it should release the TTS object to free the license and other resources that were assigned to it.

Overview of Application Design

The TTS client can obtain synthesized audio in one of two ways. The simplest way is in a batch mode, where you obtain all of the synthesized audio in one request. In this case, your application will need to follow a process like this:

Create a TTS client object.
Request some text to be synthesized.
Wait for the synthesis request to be completed.
Request the length of audio produced.
Ask for all of the synthesized audio, based on the length of audio produced.
Release the TTS object (unless you have another request to be made).
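
The batch-mode steps above can be sketched in C++ as follows. This is only an illustration of the control flow: MockTtsClient and its methods (Synthesize, IsDone, AudioLength, ReadAudio) are hypothetical stand-ins invented for this example and are not the LumenVox API. The mock pretends that every input character yields 100 bytes of silent audio.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// HYPOTHETICAL stand-in for a TTS client; NOT the LumenVox API.
// It "synthesizes" 100 bytes of silence per input character.
class MockTtsClient {
public:
    // Submit text for synthesis (a real engine would work asynchronously).
    void Synthesize(const std::string& text) {
        audio_.assign(text.size() * 100, 0);
        polls_until_done_ = 3;  // simulate a short delay before completion
    }
    // Returns true once the simulated synthesis has finished.
    bool IsDone() {
        if (polls_until_done_ > 0) --polls_until_done_;
        return polls_until_done_ == 0;
    }
    // Total number of audio bytes produced.
    std::size_t AudioLength() const { return audio_.size(); }
    // Copies up to max_bytes starting at offset; returns the bytes copied.
    std::size_t ReadAudio(std::size_t offset, std::uint8_t* out,
                          std::size_t max_bytes) const {
        if (offset >= audio_.size()) return 0;
        const std::size_t n = std::min(max_bytes, audio_.size() - offset);
        std::copy_n(audio_.begin() + static_cast<std::ptrdiff_t>(offset), n, out);
        return n;
    }
private:
    std::vector<std::uint8_t> audio_;
    int polls_until_done_ = 0;
};

// Batch mode: fetch all synthesized audio in a single request.
std::vector<std::uint8_t> SynthesizeBatch(const std::string& text) {
    MockTtsClient client;                                   // 1. create the client
    client.Synthesize(text);                                // 2. request synthesis
    while (!client.IsDone()) { /* real app: sleep briefly between polls */ }  // 3. wait
    std::vector<std::uint8_t> audio(client.AudioLength());  // 4. get the length
    const std::size_t got =
        client.ReadAudio(0, audio.data(), audio.size());    // 5. fetch everything
    assert(got == audio.size());
    return audio;                                           // 6. client released on scope exit
}
```

In this sketch the client is released automatically when it goes out of scope. With a real client object you would call whatever release function the API provides, so that the license is returned promptly.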

Many applications, however, are more complex because they must stream the audio in small chunks to a consumer such as a telephone caller. In this case, your application will perform a small loop to handle the chunking of audio.

Create a TTS client object.
Request some text to be synthesized.
Wait for the synthesis request to be completed.
Request the length of audio produced.
Ask for and send each chunk, based on the desired chunk size. Wait and repeat this step as needed.
Release the TTS object (unless you have another request to be made).
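
The chunking loop described above might look like the following C++ sketch. As before, MockTtsClient and its methods are hypothetical names made up for illustration, not LumenVox functions, and the mock produces 100 bytes of silence per input character. The ChunkSink callback is likewise an assumption, standing in for whatever sends audio to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// HYPOTHETICAL stand-in for a TTS client; NOT the LumenVox API.
class MockTtsClient {
public:
    void Synthesize(const std::string& text) {
        audio_.assign(text.size() * 100, 0);  // 100 bytes of silence per character
        polls_until_done_ = 3;
    }
    bool IsDone() {
        if (polls_until_done_ > 0) --polls_until_done_;
        return polls_until_done_ == 0;
    }
    std::size_t AudioLength() const { return audio_.size(); }
    // Copies up to max_bytes starting at offset; returns the bytes copied.
    std::size_t ReadAudio(std::size_t offset, std::uint8_t* out,
                          std::size_t max_bytes) const {
        if (offset >= audio_.size()) return 0;
        const std::size_t n = std::min(max_bytes, audio_.size() - offset);
        std::copy_n(audio_.begin() + static_cast<std::ptrdiff_t>(offset), n, out);
        return n;
    }
private:
    std::vector<std::uint8_t> audio_;
    int polls_until_done_ = 0;
};

// Receives one chunk of audio, for example to write it to a telephony channel.
using ChunkSink = std::function<void(const std::uint8_t*, std::size_t)>;

// Streaming mode: fetch and forward the audio chunk_size bytes at a time.
// Returns the number of chunks sent.
std::size_t StreamSynthesis(const std::string& text, std::size_t chunk_size,
                            const ChunkSink& send) {
    if (chunk_size == 0) return 0;             // nothing sensible to do
    MockTtsClient client;                      // 1. create the client
    client.Synthesize(text);                   // 2. request synthesis
    while (!client.IsDone()) { /* real app: sleep briefly between polls */ }  // 3. wait
    const std::size_t total = client.AudioLength();  // 4. get the length
    std::vector<std::uint8_t> chunk(chunk_size);
    std::size_t chunks_sent = 0;
    for (std::size_t offset = 0; offset < total;) {  // 5. chunk loop
        const std::size_t got = client.ReadAudio(offset, chunk.data(), chunk.size());
        assert(got > 0);                       // guards against an endless loop
        send(chunk.data(), got);               // the last chunk may be shorter
        offset += got;
        ++chunks_sent;
    }
    return chunks_sent;                        // 6. client released on scope exit
}
```

A real streaming application would usually pace the loop, for example by waiting one chunk duration between sends, so that audio reaches the listener at playback speed.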

You will want to keep this overall design in mind as you get started learning about TTS synthesis. This guide will walk you through the basic steps, providing example code along the way. To get started, continue on to Creating a TTS client object.

Plain Text vs. SSML

The LumenVox TTS Server is designed to work with requests that are either plain text, such as "Hello World", or written in Speech Synthesis Markup Language (SSML), an XML-based markup language designed specifically for speech synthesis. LumenVox supports the SSML 1.0 draft specification with some minor exclusions. In general, plain text is easier to use, but SSML gives you finer control over the synthesized audio and the synthesis process.
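
To make the difference concrete, here is a small C++ sketch that turns plain text into a minimal SSML 1.0 document. The helper names (XmlEscape, WrapSpeak, ToSsml, SayWithPause) are invented for this example. The speak element, its namespace, and the break element come from the W3C SSML 1.0 specification; whether a particular element is among the exclusions mentioned above is something to confirm against the LumenVox documentation.

```cpp
#include <string>

// Escapes the five XML special characters so arbitrary text is safe in markup.
std::string XmlEscape(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Wraps already-escaped SSML content in a <speak> root element.
std::string WrapSpeak(const std::string& body, const std::string& lang) {
    return "<?xml version=\"1.0\"?>\n"
           "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\""
           + XmlEscape(lang) + "\">" + body + "</speak>";
}

// Plain text as SSML: same words, now carrying an explicit language tag.
std::string ToSsml(const std::string& text, const std::string& lang = "en-US") {
    return WrapSpeak(XmlEscape(text), lang);
}

// Something plain text cannot express: an explicit pause between two phrases.
std::string SayWithPause(const std::string& first, const std::string& second,
                         unsigned pause_ms, const std::string& lang = "en-US") {
    return WrapSpeak(XmlEscape(first) + "<break time=\"" +
                         std::to_string(pause_ms) + "ms\"/>" + XmlEscape(second),
                     lang);
}
```

For example, ToSsml("Hello World") yields a speak element containing the same text, while SayWithPause("Hello", "World", 500) inserts a half-second break between the two words. Escaping matters whenever the text comes from user data, since a stray ampersand or angle bracket would otherwise make the request invalid XML.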

If this is the first time you are attempting speech synthesis, you may wish to browse through the core programming guide to get an idea of our available C and C++ APIs, and then read the SSML 1.0 specification. SSML, which we support, allows more complex and precise control over how speech synthesis is processed. Much of your application's success will depend on the quality of your synthesis requests, so understanding how to write good synthesis requests using SSML is important.