
Introduction to TTS

Reference Number: AA-01548

The LumenVox Text To Speech (TTS) Engine is a core technology that lets you synthesize audio phrases using a selection of available voices and languages. This audio can then be used in a variety of applications to provide prompts and dynamic content where needed, and can optionally be sent as an audio stream to other applications or equipment.

The task of obtaining synthesized audio from a specified piece of text can be broken up into four general phases:   

Overview of Application Design

The TTS client can obtain synthesized audio in one of two ways. The simpler way is batch mode, where you obtain all of the synthesized audio in one request. In this case, your application will need to follow a process like this:

  1. Create a TTS client object.
  2. Request some text to be synthesized.
  3. Wait for the synthesis request to be completed.
  4. Request the length of audio produced.
  5. Ask for all of the synthesized audio, based on the length of audio produced.
  6. Release the TTS object (unless you have another request to be made).
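
The batch flow above can be sketched in C++ as follows. This is only an illustration under stated assumptions: `MockTtsClient` and all of its methods are hypothetical stand-ins invented for this example and are not the LumenVox API. The mock "synthesizes" by producing a fixed number of silent bytes per input character, so the control flow can be followed without a running TTS server.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a TTS client (NOT the LumenVox API).
// It emits kBytesPerChar bytes of silence for every input character.
class MockTtsClient {
public:
    static constexpr std::size_t kBytesPerChar = 160;

    // Step 2: submit text for synthesis.
    void request_synthesis(const std::string& text) {
        audio_.assign(text.size() * kBytesPerChar, 0);
        done_ = true;  // a real engine would complete asynchronously
    }

    // Step 3: wait until synthesis has finished (immediate in this mock).
    bool wait_for_completion() const { return done_; }

    // Step 4: total number of audio bytes produced.
    std::size_t audio_length() const { return audio_.size(); }

    // Step 5: copy up to `count` bytes starting at `offset`.
    std::vector<std::uint8_t> read_audio(std::size_t offset, std::size_t count) const {
        if (offset >= audio_.size()) return {};
        const std::size_t n = std::min(count, audio_.size() - offset);
        const auto first = audio_.begin() + static_cast<std::ptrdiff_t>(offset);
        return std::vector<std::uint8_t>(first, first + static_cast<std::ptrdiff_t>(n));
    }

private:
    std::vector<std::uint8_t> audio_;
    bool done_ = false;
};

// Batch mode: fetch all of the synthesized audio with a single read.
std::vector<std::uint8_t> synthesize_batch(const std::string& text) {
    MockTtsClient client;                              // step 1: create the client
    client.request_synthesis(text);                    // step 2: request synthesis
    if (!client.wait_for_completion()) return {};      // step 3: wait for completion
    const std::size_t length = client.audio_length();  // step 4: ask for the length
    return client.read_audio(0, length);               // step 5: read everything
}   // step 6: the client is released when it goes out of scope
```

Calling `synthesize_batch("Hello")` on this mock returns a buffer of 5 × 160 bytes. With a real client, the same six steps would map onto the corresponding calls described in the programming guide.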

Many applications, however, are more complex because they must stream audio in small chunks to a consumer such as a telephone caller. In this case, your application will run a small loop to handle the chunking of audio:

  1. Create a TTS client object.
  2. Request some text to be synthesized.
  3. Wait for the synthesis request to be completed.
  4. Request the length of audio produced.
  5. Ask for and send each chunk, based on the desired chunk size. Wait and repeat this step as needed.
  6. Release the TTS object (unless you have another request to be made).
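
The chunking loop in step 5 might look roughly like the sketch below. The names `stream_audio`, `FetchFn`, and `SendFn` are assumptions made for this example: `fetch` stands in for whatever call reads a range of synthesized audio from the TTS client, and `send` stands in for the streaming endpoint (for instance, a telephony channel). Neither is part of the LumenVox API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using AudioChunk = std::vector<std::uint8_t>;

// Hypothetical callbacks: `fetch` reads `count` bytes at `offset` from the
// TTS client, and `send` delivers one chunk to the streaming consumer.
using FetchFn = std::function<AudioChunk(std::size_t offset, std::size_t count)>;
using SendFn = std::function<void(const AudioChunk&)>;

// Walks the synthesized audio in chunk_size pieces; the last chunk may be
// shorter. Returns the number of chunks that were sent.
std::size_t stream_audio(std::size_t total_length, std::size_t chunk_size,
                         const FetchFn& fetch, const SendFn& send) {
    assert(chunk_size > 0 && "chunk size must be positive");
    std::size_t chunks_sent = 0;
    for (std::size_t offset = 0; offset < total_length; offset += chunk_size) {
        // Never ask for more bytes than remain in the buffer.
        const std::size_t count = std::min(chunk_size, total_length - offset);
        send(fetch(offset, count));
        ++chunks_sent;
        // A real-time sender would wait here (for example, for the playback
        // duration of the chunk) before requesting the next one.
    }
    return chunks_sent;
}
```

The only subtle point is the final chunk: when the total length is not a multiple of the chunk size, the last request has to be clamped to the bytes that remain, which is what the `std::min` call does.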

You will want to keep this overall design in mind as you start learning about speech synthesis. This guide will walk you through the basic steps, providing example code along the way. To get started, continue on to Creating a TTS client object.

Plain Text vs. SSML

The LumenVox TTS Server is designed to work with requests that are either plain text, such as "Hello World", or written in Speech Synthesis Markup Language (SSML), a more complex XML-based markup language designed specifically for speech synthesis. LumenVox supports the SSML 1.0 draft specification with some minor exclusions. In general, plain text is easier to use; however, SSML gives you finer control over the synthesized audio and the synthesis process.
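
To make the difference concrete, the sketch below wraps a plain-text request in a minimal SSML 1.0 document. The helper names `escape_xml` and `to_ssml` are invented for this example. The `speak` root element, its `version` and `xml:lang` attributes, and the namespace URI come from the W3C SSML 1.0 specification; whether a given server needs the XML prolog or accepts a particular language tag is an assumption to check against the product documentation.

```cpp
#include <string>

// Escape the characters that are special in XML text content so that
// arbitrary plain text can be embedded safely inside an SSML document.
std::string escape_xml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}

// Wrap plain text in a minimal SSML 1.0 document.
// `lang` is an assumed default; use the language of the voice you request.
std::string to_ssml(const std::string& text, const std::string& lang = "en-US") {
    return "<?xml version=\"1.0\"?>"
           "<speak version=\"1.0\""
           " xmlns=\"http://www.w3.org/2001/10/synthesis\""
           " xml:lang=\"" + lang + "\">" +
           escape_xml(text) +
           "</speak>";
}
```

Once a request is in this form, SSML elements such as `<break>`, `<emphasis>`, or `<prosody>` can be added inside `<speak>` to control pauses, stress, and speaking rate, which is not possible with plain text.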

If this is your first attempt at speech synthesis, you may wish to browse through the core programming guide to get an idea of our available C and C++ APIs, and then read the SSML 1.0 specification. SSML, which we support, allows more complex and precise control over how speech synthesis is processed. Much of your application's success will depend on the quality of your synthesis requests, so understanding how to write good synthesis requests using SSML is important.