Character Encodings

Reference Number: AA-01751

When working with grammars and SSML documents, it is good practice to specify a character encoding for the document. The character encoding describes how the letters and other characters in a file are represented as bytes, and it becomes important when working with software that supports multiple languages. Because LumenVox supports many acoustic models for ASR and many voices and languages for TTS, using the proper character encoding helps ensure that your text is interpreted correctly.

LumenVox supports two character encodings: UTF-8, which is the most prominent character encoding on the web today, and ISO-8859-1. UTF-8 allows for a much larger range of possible characters and is thus preferred, but ISO-8859-1 provides good coverage for most Western European languages. If you do not know which character encoding to use, use UTF-8. Users who work only in American English and do not need any other characters (e.g. letters with accent marks) likely do not need to specify an encoding, but anyone using other languages should do so.

LumenVox will attempt to determine the encoding if it is not explicit, but providing an explicit declaration is always good practice since this automatic determination is not foolproof.

Declaring a Character Encoding

The best place to put an encoding declaration is in the header of the document itself. For documents served over HTTP, the character set can also be specified in the HTTP Content-Type header, but LumenVox only honors the declaration inside the actual document. The same applies to documents passed inline over MRCP: LumenVox ignores the MRCP Content-Type headers and either uses what is declared in the document or attempts to auto-detect the character set.

SSML

The proper place for the encoding declaration in an SSML document is in the XML prolog that begins any SSML document:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis">
<voice xml:lang="es-MX">Para Español, oprima número dos</voice>
</speak>
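When SSML is generated by a program, one way to keep the prolog and the actual bytes in agreement is to derive both from the same value. The following Python sketch illustrates the idea; `build_ssml` is a hypothetical helper written for this article, not part of any LumenVox API.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, lang: str, encoding: str = "UTF-8") -> bytes:
    """Return an SSML document whose bytes match its declared encoding.

    Hypothetical helper: the same `encoding` value is written into the
    XML prolog and used to encode the document, so the two cannot drift.
    """
    document = (
        f'<?xml version="1.0" encoding="{encoding}"?>\n'
        f"<speak version=\"1.0\" xml:lang={quoteattr(lang)} "
        'xmlns="http://www.w3.org/2001/10/synthesis">\n'
        f"{escape(text)}\n"
        "</speak>\n"
    )
    # Raises UnicodeEncodeError if the text contains a character that the
    # chosen encoding cannot represent (for example, Chinese in ISO-8859-1).
    return document.encode(encoding)


utf8_doc = build_ssml("Para español, oprima el número dos", "es-MX")
latin1_doc = build_ssml("Para español, oprima el número dos", "es-MX", "ISO-8859-1")
```

Both results are byte strings that can be written to disk in binary mode or passed inline. The UTF-8 version stores each accented letter as two bytes, while the ISO-8859-1 version stores it as one.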

GrXML Grammar

Since a GrXML grammar is an XML document, the encoding declaration looks the same as it does for SSML:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="es-MX" version="1.0" root="language" mode="voice" tag-format="semantics/1.0">

<rule id="language">
<one-of>
<item>Español</item>
<item>Inglés</item>
</one-of>
</rule>

</grammar>
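Because SSML and GrXML share the same XML prolog, a simple pre-flight check can read the declared encoding and confirm that the file's bytes decode under it. The Python sketch below is one way to do this under a few assumptions: the function names are hypothetical, the document is in an ASCII-compatible encoding, and a missing declaration is treated as UTF-8 (the XML default). It is not how LumenVox itself parses documents.

```python
import codecs
import re

# Matches an XML declaration that carries an encoding pseudo-attribute.
XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_xml_encoding(data: bytes):
    """Return the encoding named in the XML prolog, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def matches_declaration(data: bytes) -> bool:
    """Return True if the bytes decode cleanly under the declared encoding."""
    encoding = declared_xml_encoding(data) or "UTF-8"
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence, or an encoding name Python does not know.
        return False
    return True


source = '<?xml version="1.0" encoding="UTF-8"?>\n<grammar>Español</grammar>'
good = source.encode("utf-8")        # bytes agree with the declaration
bad = source.encode("iso-8859-1")    # declared UTF-8, saved as ISO-8859-1
```

Note that this check only catches one direction of the mistake. Every byte sequence is valid ISO-8859-1, so a UTF-8 file that is wrongly declared as ISO-8859-1 will still decode without error and must be caught by inspecting the resulting text.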

ABNF

ABNF encoding declarations immediately follow the #ABNF 1.0 statement:

#ABNF 1.0 UTF-8;
language es-MX;
mode voice;
tag-format <semantics/1.0>;
root $language;

$language = Español | Inglés;
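The ABNF header can be inspected in much the same way as an XML prolog. The Python sketch below extracts the optional encoding from the `#ABNF 1.0` line; `abnf_declared_encoding` is a hypothetical helper, and it assumes the header is the very first thing in the file with no byte order mark in front of it.

```python
import re

# "#ABNF", a version number, an optional encoding name, then ";".
ABNF_HEADER = re.compile(rb"^#ABNF (\d+\.\d+)(?: ([A-Za-z][A-Za-z0-9._-]*))?;")


def abnf_declared_encoding(data: bytes):
    """Return the encoding named in the ABNF header, or None if omitted."""
    match = ABNF_HEADER.match(data)
    if match is None:
        raise ValueError("missing or malformed #ABNF header")
    encoding = match.group(2)
    return encoding.decode("ascii") if encoding else None


grammar = (
    "#ABNF 1.0 UTF-8;\n"
    "language es-MX;\n"
    "root $language;\n"
    "$language = Español | Inglés;\n"
).encode("utf-8")

declared = abnf_declared_encoding(grammar)
decoded = grammar.decode(declared or "UTF-8")
```

If the grammar had been saved in a different encoding than the one named in the header, the final `decode` call would either raise an error or produce garbled text.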

Ensure Actual Encodings Match Declarations

If you declare an encoding, make sure the document is actually saved in that encoding. If a UTF-8 encoded document is declared as ISO-8859-1, or the reverse, the LumenVox parsers can mangle the characters, and accented characters are especially likely to be affected.
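The effect of a mismatch is easy to reproduce outside of LumenVox. The short Python sketch below uses only the standard codecs and illustrates the general behavior of the two encodings, not the LumenVox parsers specifically.

```python
text = "Español"

# Case 1: the file is UTF-8 but is read as ISO-8859-1.
# UTF-8 stores "ñ" as the two bytes C3 B1. ISO-8859-1 treats every byte
# as one character, so decoding never fails but yields two wrong characters.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

# Case 2: the file is ISO-8859-1 but is read as UTF-8.
# ISO-8859-1 stores "ñ" as the single byte F1, which is not a valid
# UTF-8 sequence here, so a strict decoder rejects the document.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(misread)
print(strict_decode_ok)
```

The first case is the more dangerous one because nothing fails: the text is silently changed, and the recognizer or synthesizer receives words that do not exist in the target language.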

More Information

Please note that while UTF-8 can represent a very large number of characters, LumenVox does not support every language, so many of those characters have no meaning to the engine. While you could hypothetically provide Chinese characters in a UTF-8-encoded grammar, LumenVox does not (currently) have any Chinese-language acoustic models, so these characters would be meaningless to the recognizer. It is a good idea to use only characters that belong to the language you specify in the document, and to make sure that language is supported by LumenVox.

You also need to ensure that you have the correct locale installed on your machine. See Dealing with Locales for more information about problems with installed system locales.

Because character encoding is a complex topic, it may be a good idea to review the Wikipedia entry on Character encoding for more information.