The LumenVox TTS1 Text-To-Speech synthesizer works internally by synthesizing words. However, input text documents contain not only words, such as milk and sugar, but also various other written elements, such as numbers (15), dates (3/4/2003), acronyms (USA), abbreviations (i.e.), symbols ($), etc. All such elements must first be converted to actual words, and only then synthesized. This conversion, called text normalization, takes place internally within the synthesizer.
The American and British English TTS1 Text-To-Speech voices correctly normalize and synthesize the majority of English texts. This document describes how LumenVox accomplishes the task of text normalization.
The user may extend LumenVox's text normalization by using PLS lexicons (as defined in the W3C Pronunciation Lexicon Specification Recommendation).
Please note that this article does not apply to our TTS2 voices.
This section describes how unannotated input text is split into paragraphs, sentences and words.
Paragraphs are separated by empty lines.
Paragraphs may be explicitly marked with SSML elements <p>.
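The empty-line rule for unannotated text can be illustrated with a minimal Python sketch. The function name split_paragraphs is hypothetical, and treating a whitespace-only line as empty is an assumption; this is not the synthesizer's actual implementation.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs at one or more empty lines."""
    # Assumption: a line holding only spaces or tabs counts as empty.
    parts = re.split(r"\n[ \t]*(?:\n[ \t]*)+", text.strip())
    return [part for part in parts if part]


sample = "First line.\nStill the first paragraph.\n\nSecond paragraph."
for paragraph in split_paragraphs(sample):
    print(repr(paragraph))
```

A single line break does not end a paragraph in this sketch; only an empty line does.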
By default, a sentence contains fewer than 1000 characters. Longer sentences will be broken into multiple smaller sentences.
Sentences may be explicitly marked with SSML elements <s>.
By default, a word contains fewer than 100 characters. Longer words will be broken into multiple smaller words.
Words without any vowels will be spelled out.
LumenVox will properly handle words with apostrophes, such as the standard contractions 'll, 've, 'd, n't, the genitive 's, as well as common phrases such as rock'n'roll or c'mon.
LumenVox accepts all Unicode characters. LumenVox handles most characters found in texts based on the Latin script.
Punctuation plays a key role in the way texts are interpreted by the TTS system. LumenVox supports the majority of punctuation marks found in English texts. However, in the end all punctuation marks that have an effect on pauses or intonation are mapped to the following marks.
rising or falling
Default normalization rules
This section describes in general how LumenVox normalizes input text, excluding text fragments marked with the SSML say-as element.
This section is not exhaustive. LumenVox normalizes many kinds of text elements, but only the most common are described here.
A cardinal number is either any single digit (0, 1, …, 9) or a sequence of digits not starting with 0.
Longer cardinal numbers may make use of comma as a thousands separator.
- 10,000 will be pronounced ten thousand.
- 256 will be pronounced two hundred (and) fifty six.
- 4358 will be pronounced four thousand three hundred (and) fifty eight.
- 1,000 will be pronounced one thousand.
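The cardinal rule and the examples above can be sketched in Python as follows. The names CARDINAL and cardinal_to_words are hypothetical, the sketch always omits the optional "and", and it stops at trillions; it illustrates the rule and is not LumenVox's own code.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = ["", "thousand", "million", "billion", "trillion"]

# A single digit, or digits not starting with 0, optionally comma-grouped.
CARDINAL = re.compile(r"^(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)$")


def _below_thousand(n: int) -> list[str]:
    """Words for 1..999 (returns an empty list for 0)."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if 0 < n < 20:
        words.append(ONES[n])
    return words


def cardinal_to_words(token: str) -> str:
    """Convert a cardinal token such as '10,000' to words."""
    if not CARDINAL.match(token):
        raise ValueError(f"not a cardinal number: {token!r}")
    n = int(token.replace(",", ""))
    if n == 0:
        return "zero"
    groups = []  # three-digit groups, least significant first
    while n:
        groups.append(n % 1000)
        n //= 1000
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for i in reversed(range(len(groups))):
        if groups[i]:
            words += _below_thousand(groups[i])
            if SCALES[i]:
                words.append(SCALES[i])
    return " ".join(words)


for token in ["10,000", "256", "4358", "1,000"]:
    print(token, "->", cardinal_to_words(token))
```

A token such as 0123 is rejected here because it starts with 0, which matches the definition of a cardinal number given above.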
A signed integer consists of a sign character followed immediately by a cardinal number. Valid sign characters are the plus sign (+), the minus sign (−, U+2212) and the plus-minus sign (±). The common hyphen-minus character (-), as well as other dash-like characters, is also supported as a sign character, but these are ambiguous and best avoided.
- +5 will be pronounced plus five.
- -3,000 will be pronounced minus three thousand.
A cardinal or signed integer followed immediately by the dot and a sequence of digits will be recognized as a real number.
- 4.5 will be pronounced four point five.
- -3.1 will be pronounced minus three point one.
- 1,000.12 will be pronounced one thousand point one two.
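As a rough illustration of how the three numeric forms relate, the following Python sketch classifies a token with regular expressions. The names SIGN, CARDINAL and classify_number are hypothetical, and only the sign characters named above are accepted; other dash-like characters are left out.

```python
import re

# Cardinal: a single digit, or digits not starting with 0, optionally comma-grouped.
CARDINAL = r"(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)"
# Plus sign, minus sign (U+2212), plus-minus sign (U+00B1), hyphen-minus.
SIGN = "[+\u2212\u00B1-]"

PATTERNS = [
    ("real number", re.compile(rf"^{SIGN}?{CARDINAL}\.\d+$")),
    ("signed integer", re.compile(rf"^{SIGN}{CARDINAL}$")),
    ("cardinal", re.compile(rf"^{CARDINAL}$")),
]


def classify_number(token: str):
    """Return the name of the first matching numeric form, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return None


for token in ["4.5", "-3.1", "1,000.12", "+5", "-3,000", "256", "0123"]:
    print(token, "->", classify_number(token))
```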
A cardinal number with suffixed st, nd, rd or th is interpreted as an ordinal number. The suffixes st, nd and rd may only be applied to numbers for which the ordinal ends in these letters. The suffix th may be applied to any cardinal number.
- 21st will be pronounced twenty first.
- 42nd will be pronounced forty second.
- 6th will be pronounced sixth.
- 1,000,000th will be pronounced one millionth.
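The suffix rule can be checked mechanically. The sketch below, with the hypothetical names natural_suffix and is_ordinal_token, accepts st, nd and rd only where the English ordinal really ends in those letters, and accepts th after any cardinal, as described above.

```python
import re

ORDINAL = re.compile(r"^(\d|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)(st|nd|rd|th)$")


def natural_suffix(n: int) -> str:
    """Suffix of the English ordinal for n (11, 12 and 13 take 'th')."""
    if n % 100 in (11, 12, 13):
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def is_ordinal_token(token: str) -> bool:
    """True if token is a cardinal followed by an acceptable ordinal suffix."""
    match = ORDINAL.match(token)
    if not match:
        return False
    number = int(match.group(1).replace(",", ""))
    suffix = match.group(2)
    # 'th' is allowed after any cardinal; the others must fit the number.
    return suffix == "th" or suffix == natural_suffix(number)


for token in ["21st", "42nd", "6th", "1,000,000th", "11st", "2st"]:
    print(token, "->", is_ordinal_token(token))
```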
Cardinal followed by s
Cardinal followed by s or 's will follow the same pattern as regular plural English words, examples below.
- 60s will be pronounced sixties.
- 100s will be pronounced one hundreds.
- bc 52's will be pronounced b c fifty two’s.
LumenVox supports various Roman numerals.
All uppercase Roman numerals with an appropriate lowercase ordinal suffix are pronounced as ordinal numbers.
- LIst will be pronounced fifty first.
- MMXIth will be pronounced two thousand eleventh.
Uppercase Roman numerals in names of monarchs will be read as ordinal numbers preceded with the word the.
- Queen Elizabeth II will be pronounced queen elizabeth the second.
- Henry III of England will be pronounced henry the third of england.
Small uppercase and lowercase Roman numerals in other contexts will be pronounced as cardinal numbers.
- Chapter XIX will be pronounced chapter nineteen.
- World War II will be pronounced world war two.
- xxii will be pronounced twenty two.
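Whether a Roman numeral is then read as a cardinal or an ordinal, its numeric value has to be computed first. A minimal Python sketch of the standard subtractive-notation conversion follows; the name roman_to_int is hypothetical, and the function does not validate that the numeral is well formed.

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert an uppercase or lowercase Roman numeral to an integer.

    Raises KeyError on characters that are not Roman digits.
    """
    digits = [VALUES[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(digits):
        # A smaller digit before a larger one is subtracted (IV, IX, XL, ...).
        if i + 1 < len(digits) and value < digits[i + 1]:
            total -= value
        else:
            total += value
    return total


for numeral in ["LI", "MMXI", "II", "III", "XIX"]:
    print(numeral, "->", roman_to_int(numeral))
```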
A fraction consists of the following elements in order:
- An optional sign character.
- An optional whole number (cardinal) followed by the space character.
- The numerator (a cardinal number).
- The slash (/ U+002F) or the solidus character (⁄ U+2044, named FRACTION SLASH in Unicode).
- The denominator (a cardinal number)
Fractions with the slash character are recognized only for the most common denominators. Fractions with the solidus character are always correctly recognized.
- 3/4 will be pronounced three fourths (American) or three quarters (British).
- 2 1/2 will be pronounced two and one half.
- -7 2/3 will be pronounced minus seven and two thirds.
- 15⁄5678 (solidus only) will be pronounced fifteen five thousand six hundred seventy eighths.
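The element order listed above maps naturally onto a regular expression. The following Python sketch uses hypothetical names (ParsedFraction, parse_fraction), accepts both U+002F and U+2044 as the separator, and does not model the restriction to common denominators for the plain slash.

```python
import re
from typing import NamedTuple, Optional


class ParsedFraction(NamedTuple):
    sign: str             # "" when no sign character is present
    whole: Optional[int]  # None when there is no whole-number part
    numerator: int
    denominator: int


FRACTION = re.compile(
    r"^(?P<sign>[+\u2212\u00B1-])?"          # optional sign character
    r"(?:(?P<whole>\d+) )?"                  # optional whole number and a space
    r"(?P<num>\d+)[/\u2044](?P<den>\d+)$"    # numerator, separator, denominator
)


def parse_fraction(token: str) -> Optional[ParsedFraction]:
    """Split a fraction token into its parts, or return None if it does not match."""
    match = FRACTION.match(token)
    if not match:
        return None
    whole = match.group("whole")
    return ParsedFraction(
        sign=match.group("sign") or "",
        whole=int(whole) if whole is not None else None,
        numerator=int(match.group("num")),
        denominator=int(match.group("den")),
    )


for token in ["3/4", "2 1/2", "-7 2/3", "15\u20445678"]:
    print(repr(token), "->", parse_fraction(token))
```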
Sequence of digits
Sequences of more than one digit starting with 0 are always read as a sequence of digits.
Digits in fixed formats, such as telephone numbers or social security numbers, are handled similarly.
- 0123 will be pronounced zero one two three.
- 924-51-0387 will be pronounced nine two four five one zero three eight seven.
- 236-555-1234 will be pronounced two three six five five five one two three four.
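Reading a token digit by digit is simple to sketch. In the Python example below the name read_digits is hypothetical, and separators such as hyphens are silently skipped, which matches the examples above but is otherwise an assumption.

```python
DIGIT_NAMES = "zero one two three four five six seven eight nine".split()


def read_digits(token: str) -> str:
    """Read every ASCII digit in the token individually, skipping separators."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in token if ch in "0123456789")


for token in ["0123", "924-51-0387", "236-555-1234"]:
    print(token, "->", read_digits(token))
```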
Unit and measurement
LumenVox handles a wide variety of commonly as well as rarely used units, including metric and imperial systems. Some unit symbols are always recognized, others need a preceding number.
- fl oz will be pronounced fluid ounces.
- 14'5" will be pronounced fourteen feet five inches.
- 1h2m30s will be pronounced one hour two minutes thirty seconds.
- 5 tsp will be pronounced five teaspoons.
- 1 tbsp will be pronounced one tablespoon.
- 2.6 GHz will be pronounced two point six gigahertz.
- 25 MPH will be pronounced twenty five miles per hour.
- 8 nmi will be pronounced eight nautical miles.
- -0.01% will be pronounced minus zero point zero one percent.
- 90° will be pronounced ninety degrees.
- 50c will be pronounced fifty cents.
- 40 km/h will be pronounced forty kilometers per hour.
LumenVox supports a wide range of currencies in multiple formats. Valid currency symbols include commonly used symbols such as £, $, €, ¥, ?, $AU, SG$, as well as many of the ISO 4217 currency codes (uppercase only).
The number may be followed by any of the words million, billion, trillion, or their various abbreviations. In that case the currency will be pronounced afterwards.
The value may have a thousands separator which may be either a comma or a space.
- $10 will be pronounced ten dollars.
- USD5.27 will be pronounced five u s dollars and twenty seven cents.
- £5.27 will be pronounced five pounds and twenty seven pence.
- GBP 1,000 will be pronounced one thousand pounds sterling.
- ¥1 million will be pronounced one million yen.
- ¥5.27 will be pronounced five yen and twenty seven sen.
- CHF6M will be pronounced six million swiss francs.
- € 20 000 will be pronounced twenty thousand euros.
- C$ 2.3 mn will be pronounced two point three million canadian dollars.
LumenVox supports time specified in both the 12-hour and the 24-hour clock.
- 1:59 will be pronounced one fifty nine.
- 2:00 will be pronounced two o’clock.
- 01:59am will be pronounced one fifty nine _a m.
- 2 AM will be pronounced two _a m.
- 13:00 will be pronounced thirteen hundred hours.
- 10:25:30 will be pronounced ten twenty five and thirty seconds.
- 07:53:10 A.M. will be pronounced seven fifty three and ten seconds _a m.
LumenVox also handles duration specified in multiple formats.
- 5'30" (only for seconds greater than 11) will be pronounced five minutes and thirty seconds.
- 5m30s will be pronounced five minutes and thirty seconds.
- 3h10m will be pronounced three hours and ten minutes.
One-digit numbers for the day and for the month may have an optional leading zero.
Supported formats for month expressions: numbers (4, 04), name (April), abbreviation (Apr).
The year may have either 2 or 4 digits.
Standard US format (M/D/Y, M-D-Y, M.D.Y), default for American English voices:
- 12/31/1999 will be pronounced december thirty first nineteen ninety nine.
- 10-25-99 will be pronounced october twenty fifth nineteen ninety nine.
- Dec/31/1999 will be pronounced december thirty first nineteen ninety nine.
- April-25-1999 will be pronounced april twenty fifth nineteen ninety nine.
European format (D/M/Y, D-M-Y, D.M.Y), default for British English voices:
- 12/may/1995 will be pronounced may twelfth nineteen ninety five.
- 12-Apr-2007 will be pronounced april twelfth two thousand seven.
- 20.3.2011 will be pronounced march twentieth twenty eleven.
ISO 8601 standard (Y-M-D, Y/M/D, Y.M.D), only 4-digit year:
- 2007/01/01 will be pronounced january first two thousand seven.
- 2007-Jan-01 will be pronounced january first two thousand seven.
- 2007-January-01 will be pronounced january first two thousand seven.
Other common formats:
- June 2 will be pronounced june second.
- Aug. 5, 1921 will be pronounced august fifth nineteen twenty one.
- arrive on 3/4 will be pronounced arrive on march fourth.
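The difference between the month-first and day-first defaults matters for purely numeric dates. The Python sketch below uses the standard datetime module to show how one string yields two different dates; the function name parse_numeric_date and the day_first flag are hypothetical and only stand in for the choice between an American and a British voice.

```python
from datetime import date, datetime


def parse_numeric_date(text: str, day_first: bool) -> date:
    """Parse a slash-separated date with a 4-digit year in M/D/Y or D/M/Y order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


us = parse_numeric_date("3/4/2003", day_first=False)
eu = parse_numeric_date("3/4/2003", day_first=True)
print(us.isoformat())  # month-first reading: March 4
print(eu.isoformat())  # day-first reading: April 3
```

When the order must not depend on the voice, spelling out the month name (for example 4 March 2003) removes the ambiguity in the input text itself.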
A number will be read as a year if it is followed by BC or if it is preceded or followed by AD:
- 1063 A.D. will be pronounced ten sixty three _a d.
LumenVox interprets ranges of numbers, measurements, time and date.
- ages 3–5 will be pronounced ages three to five.
- 40Hz–20kHz will be pronounced forty hertz to twenty kilohertz.
- June 15-20 will be pronounced june fifteenth to twentieth.
- 1939-1945 will be pronounced nineteen thirty nine to nineteen forty five.
Most abbreviations will be pronounced as the words for which the abbreviations stand. There will be no sentence break on the period (full stop) following a recognized abbreviation. In order to force a sentence break please use two periods: one for the abbreviation and one for the sentence ending.
- i.e. Mr T vs bros Inc. will be interpreted as that is mister t versus brothers incorporated.
Initialisms with a period (dot) following each letter (e.g. U.S., F.B.I.) will be pronounced by spelling out each letter.
Common initialisms without periods (e.g. US, FBI) will be recognized and properly pronounced. However, if an initialism contains a vowel, it may be treated as an uppercased ordinary word.
LumenVox will pronounce all vowelless words as initialisms.
- N.Y.P.D. will be pronounced n y p d.
- In the US will be pronounced in the u s.
- an IT report will be pronounced an i t report.
- BBC will be pronounced b b c.
- pwq will be pronounced p w q.
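The vowelless-word rule can be sketched as below. The name spell_if_vowelless is hypothetical, and counting only a, e, i, o and u as vowels (so that y is a consonant) is an assumption. Known initialisms that do contain vowels, such as US or IT, would need a lookup list that this sketch does not include.

```python
VOWELS = set("aeiou")  # assumption: y is not counted as a vowel


def spell_if_vowelless(word: str) -> str:
    """Spell out a purely alphabetic word letter by letter if it has no vowels."""
    if word.isalpha() and not (set(word.lower()) & VOWELS):
        return " ".join(word.lower())
    return word


for word in ["BBC", "pwq", "milk"]:
    print(word, "->", spell_if_vowelless(word))
```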
In most cases LumenVox properly recognizes and normalizes street addresses in the United States and Canada.
- 159 W. Poplar Av., Ste. 5, St. George, CA 12345 will be pronounced one fifty nine west poplar avenue, suite five, saint george california one two three four five.
LumenVox recognizes most American telephone number formats and reads them as series of digits.
- (978) 555-2345 will be pronounced nine seven eight five five five two three four five.
- 1-800-555-1234 ex. 10 will be pronounced one eight hundred five five five one two three four extension one zero.
Non-words not described elsewhere will be treated as identifiers. This group includes mixes of letters and digits, such as r121, as well as URLs, e-mail addresses, or fancy proper names unknown to the synthesizer.
Numbers within identifiers such as r121, x01, b987654 will be read in groups of two if they consist of up to 4 digits, and will be read as a series of digits otherwise.
Punctuation characters within identifiers will be pronounced.
- er125lp will be pronounced er one twenty five l p.
- http://www.lumenvox.com will be pronounced h t t p colon slash slash w w w dot lumenvox dot com.
- B!0 will be pronounced b exclamation mark zero.
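The grouping of digits inside identifiers can be sketched as follows. The name group_identifier_digits is hypothetical. Forming the pairs from the right is an assumption inferred from the er125lp example, where 125 is read as one, twenty five.

```python
def group_identifier_digits(digits: str) -> list[str]:
    """Split a digit run from an identifier into the chunks that are read aloud."""
    if len(digits) > 4:
        # Longer runs are read as a series of single digits.
        return list(digits)
    groups = []
    end = len(digits)
    while end > 0:
        start = max(0, end - 2)  # take pairs, starting from the right
        groups.insert(0, digits[start:end])
        end = start
    return groups


for run in ["125", "121", "1234", "987654"]:
    print(run, "->", group_identifier_digits(run))
```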
SSML say-as attribute values
The SSML element say-as lets users annotate fragments of text in order to force a particular interpretation.
Marking a fragment with say-as disables most default normalization rules that would otherwise have been applied. Therefore, it is advised to use say-as sparingly, only when the default normalization rules fail and render different speech than the user expects.
The W3C has published a Working Group Note describing SSML 1.0 say-as attribute values, which LumenVox mostly follows.
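When SSML is generated by a program, the say-as markup is usually built from strings. The Python sketch below shows one way to do so with correct XML escaping, using the standard xml.sax.saxutils module; the helper name say_as and its parameters are hypothetical and are not part of any LumenVox API.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str,
           fmt: Optional[str] = None, detail: Optional[str] = None) -> str:
    """Wrap text in an SSML say-as element, escaping XML special characters."""
    attrs = [f"interpret-as={quoteattr(interpret_as)}"]
    if fmt is not None:
        attrs.append(f"format={quoteattr(fmt)}")
    if detail is not None:
        attrs.append(f"detail={quoteattr(detail)}")
    return f"<say-as {' '.join(attrs)}>{escape(text)}</say-as>"


print(say_as("01/02/03", "date", fmt="ymd"))
print(say_as("AT&T", "characters"))
```

Escaping matters because characters such as & and < inside the annotated text would otherwise produce an ill-formed SSML document.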
LumenVox will interpret a value as a date, when used within say-as with interpret-as="date". This works just as defined in the W3C note. The format attribute may be set to any of the following: mdy, dmy, ymd, md, dm, ym, my, d, m, y.
- <say-as interpret-as="date" format="ymd">01/02/03</say-as> will be pronounced february third two thousand one (American) or the third of february two thousand and one (British).
- <say-as interpret-as="date">1234</say-as> will be pronounced twelve thirty four.
A token like 7'10" would by default be recognized as a length in feet and inches. However, it may be forced to be recognized as a duration in minutes and seconds by wrapping it in say-as with interpret-as="time".
- <say-as interpret-as="time">2'10"</say-as> will be pronounced two minutes and ten seconds.
Telephone numbers may be marked with the say-as element having interpret-as="telephone". In a telephone number LumenVox will read most digits and letters individually, as well as properly read the extension number and the characters * and #.
- <say-as interpret-as="telephone">1-800-555-234 ex. 23</say-as> will be pronounced one eight hundred five five five two three four extension two three.
- <say-as interpret-as="telephone">*53#</say-as> will be pronounced star five three pound (American) or star five three hash (British).
LumenVox will read individual characters for text within the say-as element having interpret-as="characters". The format attribute is ignored. The detail attribute may be used to force pauses, as described in the W3C Note.
- <say-as interpret-as="characters">speed</say-as> will be pronounced s p e e d.
- <say-as interpret-as="characters" detail="3 1 2">1a3BZ7</say-as> will be pronounced one _a three, b, z seven.
LumenVox will attempt to read values within say-as having interpret-as="cardinal" as cardinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
- <say-as interpret-as="cardinal">1999</say-as> will be pronounced one thousand nine hundred (and) ninety nine.
- <say-as interpret-as="cardinal">CLI</say-as> will be pronounced one hundred (and) fifty one.
LumenVox will attempt to read values within say-as having interpret-as="ordinal" as ordinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
- <say-as interpret-as="ordinal">21</say-as> will be pronounced twenty first.
- <say-as interpret-as="ordinal">VI</say-as> will be pronounced sixth.
LumenVox will interpret values within say-as having interpret-as="fraction" as common fractions. The syntax for fractions is any of the following:
["+" | "-" | "±"] cardinal "/" cardinal.
Non-negative mixed number
["+" | "±"] cardinal "+" cardinal "/" cardinal.
Negative mixed number
"-" cardinal "-" cardinal "/" cardinal.
where cardinal is a number as defined in Cardinal numbers above.
- <say-as interpret-as="fraction">2/9</say-as> will be pronounced two ninths.
- <say-as interpret-as="fraction">3+1/2</say-as> will be pronounced three and one half.
- <say-as interpret-as="fraction">-2-3/8</say-as> will be pronounced minus two and three eighths.
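The three syntaxes for interpret-as="fraction" can be checked with regular expressions before the text is sent for synthesis. The Python sketch below uses hypothetical names (CARD, FORMS, classify_say_as_fraction) and takes the "-" in the grammar to be the hyphen-minus character, which is an assumption.

```python
import re

# Cardinal: a single digit, or digits not starting with 0, optionally comma-grouped.
CARD = r"(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)"
PM = "\u00B1"  # plus-minus sign

FORMS = [
    ("simple fraction", re.compile(rf"^[+{PM}-]?{CARD}/{CARD}$")),
    ("non-negative mixed number", re.compile(rf"^[+{PM}]?{CARD}\+{CARD}/{CARD}$")),
    ("negative mixed number", re.compile(rf"^-{CARD}-{CARD}/{CARD}$")),
]


def classify_say_as_fraction(value: str):
    """Return which fraction syntax the value follows, or None if it follows none."""
    for name, pattern in FORMS:
        if pattern.match(value):
            return name
    return None


for value in ["2/9", "3+1/2", "-2-3/8", "-2+3/8"]:
    print(value, "->", classify_say_as_fraction(value))
```

Note that a negative mixed number repeats the minus sign before the fractional part, so a value such as -2+3/8 matches none of the three forms.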
Measurements may be marked with say-as having interpret-as="unit" (or interpret-as="measure"). The valid syntax is the following:
symbol [ "2" | "3" | "4" | "²" | "³" ] [ "/" unit ]
number "-" unit
A unit symbol may be almost any of the standard metric, imperial or other unit symbols, e.g. N (newtons), kJ (kilojoules), mi (miles), sqft (square feet), MiB (mebibytes), ly (light years), tbsp (tablespoons), °F (degrees Fahrenheit), psi (pounds per square inch), etc. The unit name does not contain periods (dots). In general the unit symbols are case sensitive, so B is bytes and b is bits, but unambiguous symbols are matched case-insensitively, so that the proper Hz and the improper hz, HZ and hZ will all be treated as the frequency unit hertz.
The SI prefixes as well as binary prefixes may be prepended to unit symbols, if appropriate.
In unambiguous cases, the letter s may be appended to a symbol to force plural even though the number would need a singular qualifier, e.g. 1mph is one mile per hour, but 1mphs will be one miles per hour.
A unit symbol may be suffixed with a power like 2 or 3, so that m2 is square meters and s3 is seconds cubed.
The adjective measurement form (a number, a hyphen, then the unit) forces the singular unit form, so that whereas 2in is two inches, 2-in is two inch.
A number may be either a cardinal, a signed integer, a real number, or a fraction, as described above.
- <say-as interpret-as="unit">2nmi</say-as> will be pronounced two nautical miles.
- <say-as interpret-as="unit">1+1/2tsp</say-as> will be pronounced one and one half teaspoons.
- <say-as interpret-as="unit">5m/s2</say-as> will be pronounced five meters per second squared.
- <say-as interpret-as="unit">2,100rpm</say-as> will be pronounced two thousand one hundred revolutions per minute.
- <say-as interpret-as="unit">2.7µF</say-as> will be pronounced two point seven microfarads.
Street addresses or parts of an address may be marked with say-as having interpret-as="address". This will force special pronunciation of numbers and expansion of abbreviations.
The two-letter US state abbreviation will be expanded only when followed by a ZIP code. However, one may force expansion elsewhere by specifying the attribute format="us-state".
- <say-as interpret-as="address">320 W Mt Willson Ct</say-as> will be pronounced three twenty west mount willson court.
- <say-as interpret-as="address">rm. 103</say-as> will be pronounced room one oh three.
- <say-as interpret-as="address">Ft Worth, TX 12345</say-as> will be pronounced fort worth texas one two three four five.
- <say-as interpret-as="address" format="us-state">CO</say-as> will be pronounced colorado.
LumenVox special characters
As mentioned at the very beginning of this text, it is sometimes necessary to modify texts to be synthesized in order to make them compatible with the system constraints and achieve the expected output. LumenVox provides a set of special characters that work only in certain contexts, changing the way texts are being synthesized in terms of pronunciation or intonation. The characters are language-specific and do not apply to other languages unless specified otherwise in the language-specific documentation.
Force the letter a
_a will be pronounced ey. This is to disambiguate the letter a in contexts in which the synthesizer would recognize input a as the indefinite article a.
Force rising intonation
A question mark followed by a caret, also known as a circumflex (?^), forces the intonation of a question to be rising. Wh-questions (questions starting with an interrogative pronoun) have falling intonation by default; appending a caret to the question mark changes this.
- How are you?^ will result in a rising intonation.
Force falling intonation
A question mark followed by an underscore (?_) can be used to force the intonation of a question to be falling. Yes/No questions by default have a rising intonation. This can be changed by appending the underscore character to the question mark.
- Are you all right?_ will result in a falling intonation.
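Applications that assemble prompts programmatically may want to add these intonation markers in one place. The Python sketch below does that; the helper name force_intonation is hypothetical, and replacing a marker that is already present is a design choice of the sketch rather than documented behavior.

```python
def force_intonation(question: str, rising: bool) -> str:
    """Append ^ (rising) or _ (falling) to the question mark ending the text."""
    base = question.rstrip()
    # Drop an existing marker so the function can be applied repeatedly.
    if base.endswith(("?^", "?_")):
        base = base[:-1]
    if not base.endswith("?"):
        raise ValueError("expected text ending in a question mark")
    return base + ("^" if rising else "_")


print(force_intonation("How are you?", rising=True))
print(force_intonation("Are you all right?", rising=False))
```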