Last update : April 2, 2015
Utterance Definition
In linguistics, an utterance is a unit of speech without a precise definition. It’s a bit of spoken language. It could be anything from “Baf!” to a full sentence or a long speech. The corresponding unit in written language is text.
Phonetically, an utterance is a unit of speech bounded (preceded and followed) by silence. Phonemes, phones, morphemes, words, etc. are all considered items of an utterance.
In orthography, an utterance begins with a capital letter and ends in a period, question mark, or exclamation point.
In Speech Synthesis (TTS) the text that you wish to be spoken is contained within an utterance object (example : SpeechSynthesisUtterance). The Festival TTS system uses the utterance as the basic object for synthesis. Speech synthesis is the process that applies a set of programs to an utterance.
The main stages to convert textual input to speech output are :
- Conversion of the input text to tokens
- Conversion of tokens to words
- Conversion of words to strings of phonemes
- Addition of prosodic information
- Generation of a waveform
In Festival each stage is executed in several steps. The number of steps and what actually happens may vary and is dependent on the particular language and voice selected. Each of the steps is achieved by a Festival module which will typically add new information to the utterance structure. Swapping of modules is possible.
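The idea of modules that each add information to a shared utterance structure can be sketched in Python. This is only an illustrative model and not Festival code: the function names, the dictionary layout and the small expansion table are assumptions made for the example.

```python
# Illustrative model only: each "module" adds a relation to a shared utterance.
# Function names and the dictionary layout are assumptions, not Festival's API.

def text_module(utt):
    # Split the input text on whitespace to create the Token relation.
    utt["relations"]["Token"] = utt["text"].split()
    return utt

def token_module(utt):
    # Expand tokens to words; the lookup table is a hypothetical stand-in
    # for Festival's token-to-word rules.
    expansions = {"253": ["two", "hundred", "fifty", "three"]}
    words = []
    for token in utt["relations"]["Token"]:
        words.extend(expansions.get(token, [token]))
    utt["relations"]["Word"] = words
    return utt

def upper_token_module(utt):
    # A replacement module with the same interface, to show swapping.
    utt = token_module(utt)
    utt["relations"]["Word"] = [w.upper() for w in utt["relations"]["Word"]]
    return utt

def synthesize(text, modules):
    # Apply the modules in order to a fresh utterance structure.
    utt = {"text": text, "relations": {}}
    for module in modules:
        utt = module(utt)
    return utt

default_utt = synthesize("253", [text_module, token_module])
swapped_utt = synthesize("253", [text_module, upper_token_module])
print(list(default_utt["relations"]))
print(default_utt["relations"]["Word"])
print(swapped_utt["relations"]["Word"])
```

Because every module takes and returns the same structure, one module can be exchanged for another without touching the rest of the chain.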
Festival provides six synthesizer modules :
- 2 diphone engines : MBROLA and diphone
- 2 unit selection engines : clunits and multisyn
- 2 HMM engines : clustergen and HTS
Festival Utterance Architecture
A very simple utterance architecture is the string model where the high level items are replaced sequentially by lower level items, from tokens to phones. The disadvantage of this architecture is the loss of information about higher levels.
Another architecture is the multi-level table model with one hierarchy. The problem is that there are no explicit connections between levels.
Festival uses a Heterogeneous Relation Graph (HRG). This model is defined as follows :
- Utterances consist of a set of items, representing things like tokens, words and phones.
- Each item is related by one or more relations to other items.
- Each item contains a set of features, each with a name and a value.
- Relations define lists, trees or lattices of items.
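The HRG definition can be illustrated with a small Python model. It is a sketch under my own assumptions: the class name `Item` and its methods are invented for the example and do not mirror the Festival or Edinburgh Speech Tools API. The point it shows is that one item (a word) can take part in several relations at the same time.

```python
# Minimal sketch of a Heterogeneous Relation Graph (HRG).
# Class and method names are illustrative assumptions, not Festival's API.

class Item:
    def __init__(self, **features):
        self.features = dict(features)  # feature name -> value
        self.links = {}                 # relation name -> parent and daughters

    def _link(self, relation):
        return self.links.setdefault(relation, {"parent": None, "daughters": []})

    def add_daughter(self, relation, child):
        # Connect child below self, but only inside the named relation.
        self._link(relation)["daughters"].append(child)
        child._link(relation)["parent"] = self

    def daughters(self, relation):
        return self.links.get(relation, {"daughters": []})["daughters"]

    def parent(self, relation):
        return self.links.get(relation, {"parent": None})["parent"]

token = Item(name="253", token_pos="cardinal")
phrase = Item(name="B")
words = [Item(name=n) for n in ("two", "hundred", "fifty", "three")]
for word in words:
    token.add_daughter("Token", word)    # tree: token -> words
    phrase.add_daughter("Phrase", word)  # tree: phrase -> words

first = words[0]
print(first.parent("Token").features["name"])
print(first.parent("Phrase").features["name"])
print([w.features["name"] for w in token.daughters("Token")])
```

The same word objects hang below the token in one relation and below the phrase in another, which is what the string model and the single-hierarchy table model cannot express.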
The stages and steps to build an utterance in Festival, described in the following chapters, relate to US English and to the clustergen voice cmu_us_slt_cg.
To explore the architecture (structure) of an utterance in Festival, I will analyse the relation-trees created by the synthesis of the text string “253”.
festival> (voice_cmu_us_slt_cg)
cmu_us_slt_cg
festival> (set! utter (SayText "253"))
#<Utterance 0x104c20720>
festival> (utt.relationnames utter)
(Token
Word
Phrase
Syllable
Segment
SylStructure
IntEvent
Intonation
Target
HMMstate
segstate
mcep
mcep_link
Wave)
festival> (utt.relation_tree utter 'Token)
((("253"
((id "_1")
(name "253")
(whitespace "")
(prepunctuation "")
(token_pos "cardinal")))
(("two"
((id "_2")
(name "two")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.69302821)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB"))))
(("hundred"
((id "_3")
(name "hundred")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.692711)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB"))))
(("fifty"
((id "_4")
(name "fifty")
(pos_index 8)
(pos_index_score 0)
(pos "nn")
(phr_pos "n")
(phrase_score -0.69282991)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB"))))
(("three"
((id "_5")
(name "three")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(pbreak_index 0)
(pbreak_index_score 0)
(pbreak "B")
(blevel 3))))))
festival> (utt.relation_tree utter 'Word)
((("two"
((id "_2")
(name "two")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.69302821)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB"))))
(("hundred"
...
...
(blevel 3)))))
festival> (utt.relation_tree utter 'Phrase)
((("B" ((id "_6") (name "B")))
(("two"
((id "_2")
(name "two")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.69302821)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB"))))
(("hundred"
...
...
(blevel 3))))))
festival> (utt.relation_tree utter 'Syllable)
((("syl" ((id "_7") (name "syl") (stress 1))))
(("syl" ((id "_10") (name "syl") (stress 1))))
(("syl" ((id "_14") (name "syl") (stress 0))))
(("syl" ((id "_19") (name "syl") (stress 1))))
(("syl" ((id "_23") (name "syl") (stress 0))))
(("syl" ((id "_26") (name "syl") (stress 1)))))
festival> (utt.relation_tree utter 'Segment)
((("pau" ((id "_30") (name "pau") (end 0.15000001))))
(("t" ((id "_8") (name "t") (end 0.25016451))))
(("uw" ((id "_9") (name "uw") (end 0.32980475))))
(("hh" ((id "_11") (name "hh") (end 0.39506164))))
(("ah" ((id "_12") (name "ah") (end 0.48999402))))
(("n" ((id "_13") (name "n") (end 0.56175226))))
(("d" ((id "_15") (name "d") (end 0.59711802))))
(("r" ((id "_16") (name "r") (end 0.65382934))))
(("ax" ((id "_17") (name "ax") (end 0.67743915))))
(("d" ((id "_18") (name "d") (end 0.75765681))))
(("f" ((id "_20") (name "f") (end 0.86216313))))
(("ih" ((id "_21") (name "ih") (end 0.93317086))))
(("f" ((id "_22") (name "f") (end 1.0023116))))
(("t" ((id "_24") (name "t") (end 1.0642071))))
(("iy" ((id "_25") (name "iy") (end 1.1534019))))
(("th" ((id "_27") (name "th") (end 1.2816957))))
(("r" ((id "_28") (name "r") (end 1.3449684))))
(("iy" ((id "_29") (name "iy") (end 1.5254952))))
(("pau" ((id "_31") (name "pau") (end 1.6754951)))))
festival> (utt.relation_tree utter 'SylStructure)
((("two"
((id "_2")
(name "two")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.69302821)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB")))
(("syl" ((id "_7") (name "syl") (stress 1)))
(("t" ((id "_8") (name "t") (end 0.25016451))))
(("uw" ((id "_9") (name "uw") (end 0.32980475))))))
(("hundred"
((id "_3")
(name "hundred")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(phrase_score -0.692711)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB")))
(("syl" ((id "_10") (name "syl") (stress 1)))
(("hh" ((id "_11") (name "hh") (end 0.39506164))))
(("ah" ((id "_12") (name "ah") (end 0.48999402))))
(("n" ((id "_13") (name "n") (end 0.56175226)))))
(("syl" ((id "_14") (name "syl") (stress 0)))
(("d" ((id "_15") (name "d") (end 0.59711802))))
(("r" ((id "_16") (name "r") (end 0.65382934))))
(("ax" ((id "_17") (name "ax") (end 0.67743915))))
(("d" ((id "_18") (name "d") (end 0.75765681))))))
(("fifty"
((id "_4")
(name "fifty")
(pos_index 8)
(pos_index_score 0)
(pos "nn")
(phr_pos "n")
(phrase_score -0.69282991)
(pbreak_index 1)
(pbreak_index_score 0)
(pbreak "NB")))
(("syl" ((id "_19") (name "syl") (stress 1)))
(("f" ((id "_20") (name "f") (end 0.86216313))))
(("ih" ((id "_21") (name "ih") (end 0.93317086))))
(("f" ((id "_22") (name "f") (end 1.0023116)))))
(("syl" ((id "_23") (name "syl") (stress 0)))
(("t" ((id "_24") (name "t") (end 1.0642071))))
(("iy" ((id "_25") (name "iy") (end 1.1534019))))))
(("three"
((id "_5")
(name "three")
(pos_index 1)
(pos_index_score 0)
(pos "cd")
(phr_pos "cd")
(pbreak_index 0)
(pbreak_index_score 0)
(pbreak "B")
(blevel 3)))
(("syl" ((id "_26") (name "syl") (stress 1)))
(("th" ((id "_27") (name "th") (end 1.2816957))))
(("r" ((id "_28") (name "r") (end 1.3449684))))
(("iy" ((id "_29") (name "iy") (end 1.5254952)))))))
festival> (utt.relation_tree utter 'IntEvent)
((("L-L%" ((id "_32") (name "L-L%"))))
(("H*" ((id "_33") (name "H*"))))
(("H*" ((id "_34") (name "H*"))))
(("H*" ((id "_35") (name "H*")))))
festival> (utt.relation_tree utter 'Intonation)
((("syl" ((id "_26") (name "syl") (stress 1)))
(("L-L%" ((id "_32") (name "L-L%")))))
(("syl" ((id "_7") (name "syl") (stress 1)))
(("H*" ((id "_33") (name "H*")))))
(("syl" ((id "_10") (name "syl") (stress 1)))
(("H*" ((id "_34") (name "H*")))))
(("syl" ((id "_19") (name "syl") (stress 1)))
(("H*" ((id "_35") (name "H*"))))))
festival> (utt.relation_tree utter 'Target)
((("t" ((id "_8") (name "t") (end 0.25016451)))
(("0" ((id "_36") (f0 101.42016) (pos 0.1)))))
(("uw" ((id "_9") (name "uw") (end 0.32980475)))
(("0" ((id "_37") (f0 121.11904) (pos 0.25)))))
(("hh" ((id "_11") (name "hh") (end 0.39506164)))
(("0" ((id "_38") (f0 119.19957) (pos 0.30000001)))))
(("ah" ((id "_12") (name "ah") (end 0.48999402)))
(("0" ((id "_39") (f0 123.81679) (pos 0.44999999)))))
(("d" ((id "_15") (name "d") (end 0.59711802)))
(("0" ((id "_40") (f0 117.02986) (pos 0.60000002)))))
(("ax" ((id "_17") (name "ax") (end 0.67743915)))
(("0" ((id "_41") (f0 110.17942) (pos 0.85000008)))))
(("f" ((id "_20") (name "f") (end 0.86216313)))
(("0" ((id "_42") (f0 108.59299) (pos 1.0000001)))))
(("ih" ((id "_21") (name "ih") (end 0.93317086)))
(("0" ((id "_43") (f0 115.24371) (pos 1.1500001)))))
(("t" ((id "_24") (name "t") (end 1.0642071)))
(("0" ((id "_44") (f0 108.76601) (pos 1.3000002)))))
(("iy" ((id "_25") (name "iy") (end 1.1534019)))
(("0" ((id "_45") (f0 102.23844) (pos 1.4500003)))))
(("th" ((id "_27") (name "th") (end 1.2816957)))
(("0" ((id "_46") (f0 99.160072) (pos 1.5000002)))))
(("iy" ((id "_29") (name "iy") (end 1.5254952)))
(("0" ((id "_47") (f0 90.843689) (pos 1.7500002))))
(("0" ((id "_48") (f0 88.125809) (pos 1.8000003))))))
festival> (utt.relation_tree utter 'HMMstate)
((("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001)*
(("pau_2" ((id "_50") (name "pau_2") (statepos 2) (end 0.1))))
(("pau_3" ((id "_51") (name "pau_3") (statepos 3) (end 0.15000001)*
(("t_1" ((id "_52") (name "t_1") (statepos 1) (end 0.16712391))))
(("t_2" ((id "_53") (name "t_2") (statepos 2) (end 0.23217295))))
(("t_3" ((id "_54") (name "t_3") (statepos 3) (end 0.25016451))))
(("uw_1" ((id "_55") (name "uw_1") (statepos 1) (end 0.2764155))))
(("uw_2" ((id "_56") (name "uw_2") (statepos 2) (end 0.3001706))))
(("uw_3" ((id "_57") (name "uw_3") (statepos 3) (end 0.32980475))))
(("hh_1" ((id "_58") (name "hh_1") (statepos 1) (end 0.3502973))))
...
...
(("iy_1" ((id "_100") (name "iy_1") (statepos 1) (end 1.3995106))))
(("iy_2" ((id "_101") (name "iy_2") (statepos 2) (end 1.4488922))))
(("iy_3" ((id "_102") (name "iy_3") (statepos 3) (end 1.5254952))))
(("pau_1" ((id "_103") (name "pau_1") (statepos 1) (end 1.5754951)*
(("pau_2" ((id "_104") (name "pau_2") (statepos 2) (end 1.6254952)*
(("pau_3" ((id "_105") (name "pau_3") (statepos 3) (end 1.6754951)*
festival> (utt.relation_tree utter 'segstate)
((("pau" ((id "_30") (name "pau") (end 0.15000001)))
(("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001)*
(("pau_2" ((id "_50") (name "pau_2") (statepos 2) (end 0.1))))
(("pau_3" ((id "_51") (name "pau_3") (statepos 3) (end 0.15000001)*
(("t" ((id "_8") (name "t") (end 0.25016451)))
(("t_1" ((id "_52") (name "t_1") (statepos 1) (end 0.16712391))))
(("t_2" ((id "_53") (name "t_2") (statepos 2) (end 0.23217295))))
(("t_3" ((id "_54") (name "t_3") (statepos 3) (end 0.25016451)))))
(("uw" ((id "_9") (name "uw") (end 0.32980475)))
(("uw_1" ((id "_55") (name "uw_1") (statepos 1) (end 0.2764155))))
(("uw_2" ((id "_56") (name "uw_2") (statepos 2) (end 0.3001706))))
(("uw_3" ((id "_57") (name "uw_3") (statepos 3) (end 0.32980475))))
...
...
(("iy" ((id "_29") (name "iy") (end 1.5254952)))
(("iy_1" ((id "_100") (name "iy_1") (statepos 1) (end 1.3995106))))
(("iy_2" ((id "_101") (name "iy_2") (statepos 2) (end 1.4488922))))
(("iy_3" ((id "_102") (name "iy_3") (statepos 3) (end 1.5254952))*
(("pau" ((id "_31") (name "pau") (end 1.6754951)))
(("pau_1" ((id "_103") (name "pau_1") (statepos 1) (end 1.5754951)*
(("pau_2" ((id "_104") (name "pau_2") (statepos 2) (end 1.6254952)*
(("pau_3" ((id "_105") (name "pau_3") (statepos 3) (end 1.6754951)*
festival> (utt.relation_tree utter 'mcep)
((("pau_1"
((id "_106")
(frame_number 0)
(name "pau_1")
(clustergen_param_frame 19315))))
(("pau_1"
((id "_107")
(frame_number 1)
(name "pau_1")
(clustergen_param_frame 19315))))
(("pau_1"
((id "_108")
(frame_number 2)
(name "pau_1")
(clustergen_param_frame 19315))))
(("pau_1"
((id "_109")
(frame_number 3)
(name "pau_1")
(clustergen_param_frame 19315))))
...
...
(("t_1"
((id "_137")
(frame_number 31)
(name "t_1")
(clustergen_param_frame 26089))))
(("t_1"
((id "_138")
(frame_number 32)
(name "t_1")
(clustergen_param_frame 26085))))
(("t_1"
((id "_139")
(frame_number 33)
(name "t_1")
(clustergen_param_frame 26085))))
(("t_2"
((id "_140")
(frame_number 34)
(name "t_2")
(clustergen_param_frame 26642))))
...
...
(("uw_1"
((id "_157")
(frame_number 51)
(name "uw_1")
(clustergen_param_frame 27595))))
...
(("pau_3"
((id "_438")
(frame_number 332)
(name "pau_3")
(clustergen_param_frame 22148))))
(("pau_3"
((id "_439")
(frame_number 333)
(name "pau_3")
(clustergen_param_frame 22148))))
(("pau_3"
((id "_440")
(frame_number 334)
(name "pau_3")
(clustergen_param_frame 22148))))
(("pau_3"
((id "_441")
(frame_number 335)
(name "pau_3")
(clustergen_param_frame 22365)))))
festival> (utt.relation_tree utter 'mcep_link)
((("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001)*
(("pau_1"
((id "_106")
(frame_number 0)
(name "pau_1")
(clustergen_param_frame 19315))))
(("pau_1"
((id "_107")
(frame_number 1)
(name "pau_1")
(clustergen_param_frame 19315))))
(("pau_1"
((id "_108")
(frame_number 2)
(name "pau_1")
(clustergen_param_frame 19315))))
...
...
(("pau_3"
((id "_439")
(frame_number 333)
(name "pau_3")
(clustergen_param_frame 22148))))
(("pau_3"
((id "_440")
(frame_number 334)
(name "pau_3")
(clustergen_param_frame 22148))))
(("pau_3"
((id "_441")
(frame_number 335)
(name "pau_3")
(clustergen_param_frame 22365))))))
festival> (utt.relation_tree utter 'Wave)
((("0" ((id "_442") (wave "[Val wave]")))))
festival>
Notes :
* some parentheses have been deleted in the display for formatting reasons
… some content has been deleted to reduce the size of the analyzed code
Results of the code analysis
The number of items created for the string “253” is shown in the following table :
number | item | ids
1 | token | 1
4 | word | 2-5
1 | phrase | 6
6 | syllable | 7, 10, 14, 19, 23, 26
19 | segment | 8-9, 11-13, 15-18, 20-22, 24-25, 27-31
4 | intevent | 32-35
13 | target | 36-48
57 | hmmstate | 49-105
336 | mcep | 106-441
1 | wave | 442
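The counts in this table can be cross-checked from the id ranges alone. The following Python sketch recomputes them; the dictionary simply restates the id column of the table, and the variable names are my own.

```python
# Recompute the item counts from the id ranges listed in the table.
id_ranges = {
    "token":    [(1, 1)],
    "word":     [(2, 5)],
    "phrase":   [(6, 6)],
    "syllable": [(7, 7), (10, 10), (14, 14), (19, 19), (23, 23), (26, 26)],
    "segment":  [(8, 9), (11, 13), (15, 18), (20, 22), (24, 25), (27, 31)],
    "intevent": [(32, 35)],
    "target":   [(36, 48)],
    "hmmstate": [(49, 105)],
    "mcep":     [(106, 441)],
    "wave":     [(442, 442)],
}

# Each (low, high) pair is an inclusive range of item ids.
counts = {item: sum(high - low + 1 for low, high in ranges)
          for item, ranges in id_ranges.items()}
total = sum(counts.values())

for item, count in counts.items():
    print(f"{item:9s} {count}")
print("total", total)
```

The total equals the highest id (442), so every item of the utterance is accounted for, and the number of HMM states is three times the number of segments.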
The features associated with the different items are presented in the next table :
item | features
token | name, whitespace, prepunctuation, token_pos
word | name, pos_index, pos_index_score, pos, phr_pos, phrase_score, pbreak_index, pbreak_index_score, pbreak, blevel
phrase | name
syllable | name, stress
segment | name, end
intevent | name
target | f0, pos
hmmstate | name, statepos, end
mcep | name, frame_number, clustergen_param_frame
wave | Val
The last table shows the relations between the different items in the HRG :
item | daughter | leaf | relation
token | word | x | Token
word | syllable | – | SylStructure
phrase | word | x | Phrase
syllable | segment | x (except silence) | SylStructure
syllable | intevent | x | Intonation
segment | target | x | Target
segment | hmmstate | x | segstate
hmmstate | mcep | x | mcep_link
Relations between utterance items
To better understand the relations between utterance items, I use a second example :
festival>
(set! utter (SayText "253 and 36"))
(utt.relation.print utter 'Token)
Festival SayText
There are 3 tokens. The Token relation is a list of trees where each root is a whitespace-separated token from the input character string and where the daughters are the words associated with that token. Most often this is a one-to-one relationship, but in the case of digits a token is associated with several words. The following command shows the Token tree with the daughters :
(utt.relation_tree utter 'Token)
Festival Token_tree
We can check that the word list corresponds to the Token tree list :
(utt.relation.print utter 'Word)
Festival Word List
To access the second word of the first token we can use two methods :
(item.name (item.daughter2 (utt.relation.first utter 'Token)))
or
(item.name (item.next (utt.relation.first utter 'Word)))
Festival access methods to word item
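Why both methods reach the same item can be shown with a small Python model. The data layout is an assumption for illustration, and the word lists are what I would expect for this text (the “253” expansion is taken from the first example, the others are assumed).

```python
# Sketch of the two navigation paths; the data layout is an assumption.
tokens = [
    {"name": "253", "daughters": ["two", "hundred", "fifty", "three"]},
    {"name": "and", "daughters": ["and"]},
    {"name": "36", "daughters": ["thirty", "six"]},
]

# The Word relation is the flat list of all token daughters, in order.
words = [word for token in tokens for word in token["daughters"]]

via_token = tokens[0]["daughters"][1]  # second daughter of the first token
via_word = words[1]                    # item following the first word
print(via_token, via_word)
```

The first path walks down the Token tree, the second walks along the Word list. They meet because the words are daughters in one relation and list members in the other.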
TTS stages and steps
In the next chapters the different stages and steps executed to synthesize a text string are described in more detail. In the first step a simple and a complex utterance of type Text are created :
(set! simple_utt (Utterance Text
"The quick brown fox jumps over the lazy dog"))
(set! complex_utt (Utterance Text
"Mr. James Brown Jr. attended flight No AA4101 to Boston on
Friday 02/13/2014."))
The complex utterance named complex_utt is used in the following examples.
1. Text-to-Token Conversion
Text
Text is a string of characters in ASCII or ISO-8859 format. Written (raw) text usually also contains numbers, names, abbreviations, symbols, punctuation, etc., which must be translated into spoken text. This process is called Tokenization. Other terms used are (lexical) Pre-Processing, Text Normalization or Canonicalization. To access the items and features of the defined utterance named complex_utt in Festival we use the following modules :
festival> (Initialize complex_utt) ; Initialize utterance
festival> (utt.relationnames complex_utt) ; show created relations
Festival Utterance Initialization
The result nil indicates that the text utterance does not contain any relation yet.
Tokens
The second step is the Tokenization, which consists of the conversion of the input text to tokens. A token is a sequence of characters where whitespace and punctuation are eliminated. The following Festival commands are used to convert raw text to tokens and to show them :
festival> (Text complex_utt) ; convert text to tokens
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Token) ; display tokens
Festival Text Module to convert raw text to tokens
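As a rough illustration of what tokenization does, here is a Python sketch that splits text on whitespace and separates leading and trailing punctuation from each token. It only approximates the behaviour: the punctuation sets and the feature name `punc` are assumptions, while `name`, `whitespace` and `prepunctuation` follow the Token features shown in the first example.

```python
import re

# Rough approximation of whitespace tokenization with punctuation split off.
# The punctuation sets and the "punc" feature name are assumptions.
PRE = "\"'`(["
POST = "\"'`.,:;!?)]"

def tokenize(text):
    tokens = []
    for match in re.finditer(r"(\s*)(\S+)", text):
        whitespace, chunk = match.groups()
        name = chunk.lstrip(PRE)                      # drop leading punctuation
        prepunctuation = chunk[:len(chunk) - len(name)]
        stripped = name.rstrip(POST)                  # drop trailing punctuation
        punc = name[len(stripped):]
        tokens.append({"name": stripped, "whitespace": whitespace,
                       "prepunctuation": prepunctuation, "punc": punc})
    return tokens

text = ("Mr. James Brown Jr. attended flight No AA4101 to Boston on "
        "Friday 02/13/2014.")
tokens = tokenize(text)
print(len(tokens))
print([t["name"] for t in tokens])
```

The slashes inside the date stay part of the token name because only punctuation at the edges is removed. Turning such a token into words is the job of the next stage.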
There are several methods to access individual tokens :
festival> (utt.relation.first complex_utt 'Token) ; returns 1st token
festival> (utt.relation.last complex_utt 'Token) ; returns last token
festival> (utt.relation_tree complex_utt 'Token) ; returns token tree
Festival Token Access
The utt.relation_tree method can also be applied to relations other than ‘Token.
2. Token-to-Word Conversion
Words
In linguistics, a word is the smallest element that may be uttered in isolation with semantic or pragmatic content. To convert the isolated tokens to words, we use the Festival commands :
festival> (Token complex_utt) ; token to word conversion
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Word) ; display words
Festival Token Module to convert tokens to words
The rules to perform the token to word conversion are specified in the Festival script token.scm.
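To give an idea of what such a rule does, the following Python sketch expands a cardinal number into words, in the same way the token “253” became “two hundred fifty three” in the first example. It is a hypothetical re-implementation limited to 0..999 and not a translation of the rules in token.scm.

```python
# Hypothetical cardinal-number expansion for 0..999.
# This is not a translation of Festival's token.scm rules.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens])
        if ones:
            words.append(ONES[ones])
    elif rest or not words:
        # 1..19, or a plain zero when nothing else was emitted
        words.append(ONES[rest])
    return words

print(number_to_words(253))
print(number_to_words(36))
```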
POS
Part-of-Speech (POS) Tagging is related to the Token-to-Word conversion. POS tagging is also called grammatical tagging or word-category disambiguation. It’s the process of marking up a word in a text as corresponding to a particular part of speech (noun, verb, adjective, adverb, etc.), based on both its definition and its context.
To do the POS tagging, we use the commands :
festival> (POS complex_utt) ; Part of Speech tagging
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Word) ; display words
The relation check shows that no new relation was created with the POS method. There are however new features which have been added to the ‘Word relation.
Festival POS Module to tag the words
The new features are :
- pos_index n
- pos_index_score m
- pos xx
Phrase
The last step of the Token-to-Word conversion is the phrasing. This process determines the places where phrase boundaries should be inserted. Prosodic phrasing in TTS makes the whole speech more understandable. The phrasing is launched with the following commands :
festival> (Phrasify complex_utt) ; insert phrase breaks
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Phrase) ; display breaks
Festival Phrasify Module to insert boundaries
The result can be seen in new attributes in the Word relation:
festival> (utt.relation.print complex_utt 'Word)
- phr_pos xx
- phrase_score nn
- pbreak_index n
- pbreak_index_score m
- pbreak yy (B for small breaks, BB for big breaks, NB for no break)
- blevel p
Festival Word list after phrasing
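How the pbreak feature turns a flat word list into phrases can be sketched in Python. The first word list restates the pbreak values from the “253” example; the second list is hypothetical and only serves to show a split into two phrases. The function name is my own.

```python
# Group words into phrases: a phrase ends after a word marked B or BB.
def group_phrases(words):
    phrases, current = [], []
    for name, pbreak in words:
        current.append(name)
        if pbreak in ("B", "BB"):
            phrases.append(current)
            current = []
    if current:
        # Trailing words without a final break still form a phrase.
        phrases.append(current)
    return phrases

# pbreak values as shown in the Word relation of the "253" example
words_253 = [("two", "NB"), ("hundred", "NB"), ("fifty", "NB"), ("three", "B")]
# hypothetical values, for illustration only
words_demo = [("yes", "B"), ("of", "NB"), ("course", "BB")]

print(group_phrases(words_253))
print(group_phrases(words_demo))
```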
3. Word-to-Phoneme Conversion
The command
festival> (Word complex_utt)
generates 3 new relations : Syllable, Segment and SylStructure.
Festival relations generated by the Word method
Segment
Segments and phones are synonyms.
festival> (utt.relation.print complex_utt 'Segment)
Festival segments = phones
Syllable
Consonants and vowels combine to make syllables. They are often considered the phonological building blocks of words, but there is no universally accepted definition for a syllable. An approximate definition is : a syllable is a vowel sound together with some of the surrounding consonants closely associated with it. The general structure of a syllable consists of three segments :
- Onset : a consonant or consonant cluster
- Nucleus : a sequence of vowels or a syllabic consonant
- Coda : a sequence of consonants
Nucleus and coda are grouped together as a Rime. Prominent syllables are called accented; they are louder, longer and have a different pitch.
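The onset, nucleus and coda structure can be sketched in Python for the syllables of the “253” example. This is a simplified illustration: the vowel set only contains the vowels that occur in that example, and syllabic consonants are not handled.

```python
# Vowels occurring in the "253" example; a real phone set has many more.
VOWELS = {"uw", "ah", "ax", "ih", "iy"}

def split_syllable(phones):
    # Onset: everything before the first vowel.
    start = next((i for i, p in enumerate(phones) if p in VOWELS), len(phones))
    # Nucleus: the run of vowels starting there.
    end = start
    while end < len(phones) and phones[end] in VOWELS:
        end += 1
    parts = {"onset": phones[:start],
             "nucleus": phones[start:end],
             "coda": phones[end:]}
    # Rime: nucleus and coda together.
    parts["rime"] = parts["nucleus"] + parts["coda"]
    return parts

# Syllables of "two hundred fifty three" as shown in the SylStructure relation
syllables = [["t", "uw"], ["hh", "ah", "n"], ["d", "r", "ax", "d"],
             ["f", "ih", "f"], ["t", "iy"], ["th", "r", "iy"]]
for syl in syllables:
    print(syl, split_syllable(syl))
```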
The following Festival command shows the syllables of the defined utterance.
festival> (utt.relation.print complex_utt 'Syllable)
Festival syllables
SylStructure
Words, segments and syllables are related in the HRG through the SylStructure. The command
festival> (utt.relation.print complex_utt 'SylStructure)
prints these related items.
Festival SylStructure
4. Prosodic Information Addition
Besides the phrasing with break indices, additional prosodic components can be added to speech synthesis to improve the voice quality. Some of these elements are :
- pitch accents (stress)
- final boundary tones
- phrasal tones
- F0 contour
- tilt
- duration
Festival supports ToBI, a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety.
The process
festival> (Intonation complex_utt)
generates two additional relations : IntEvent and Intonation
Festival prosodic relations
IntEvent
The command
festival> (utt.relation.print complex_utt 'IntEvent)
prints the IntEvent items.
Festival IntEvent items
The following types are listed :
- L-L% : low boundary tone
- H* : peak accent
- !H* : downstep high
- L+H* : bitonal accent, rising peak
Intonation
The command
festival> (utt.relation.print complex_utt 'Intonation)
prints the Intonation items.
Festival Intonation items
Only the syllables with stress are displayed.
Duration
The process
festival> (Duration complex_utt)
creates no new relations and I have not seen any new items or features in other relations.
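In a fully synthesized utterance the timing is visible in the end feature of the segments. As a sketch, the following Python code derives the duration of each segment from the end times of the “253” example shown earlier, assuming that every segment starts where the previous one ends and that the first one starts at 0.

```python
# (name, end time in seconds) from the Segment relation of the "253" example
segments = [
    ("pau", 0.15000001), ("t", 0.25016451), ("uw", 0.32980475),
    ("hh", 0.39506164), ("ah", 0.48999402), ("n", 0.56175226),
    ("d", 0.59711802), ("r", 0.65382934), ("ax", 0.67743915),
    ("d", 0.75765681), ("f", 0.86216313), ("ih", 0.93317086),
    ("f", 1.0023116), ("t", 1.0642071), ("iy", 1.1534019),
    ("th", 1.2816957), ("r", 1.3449684), ("iy", 1.5254952),
    ("pau", 1.6754951),
]

# Assumption: each segment starts at the end time of its predecessor.
durations = []
previous_end = 0.0
for name, end in segments:
    durations.append((name, end - previous_end))
    previous_end = end

for name, duration in durations:
    print(f"{name:4s} {duration:.3f} s")
```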
Target
The last process in the prosodic stage
festival> (Int_Targets complex_utt)
generates the additional relation Target.
Festival relations after the Int_Targets process
The command
festival> (utt.relation.print complex_utt 'Target)
prints the target items.
Festival clustergen targets
The only features shown for the target items are the segment name and the segment end time.
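The daughters in the Target tree of the “253” example carry f0 and pos features, i.e. a pitch value at a point in time. One common way to turn such points into a contour is linear interpolation. The Python sketch below does this with the values from that example; the interpolation method and the function name are assumptions for illustration, and I do not claim that the clustergen voice computes its pitch this way.

```python
from bisect import bisect_right

# (pos in seconds, f0 in Hz) from the Target relation of the "253" example
targets = [
    (0.1, 101.42016), (0.25, 121.11904), (0.30000001, 119.19957),
    (0.44999999, 123.81679), (0.60000002, 117.02986), (0.85000008, 110.17942),
    (1.0000001, 108.59299), (1.1500001, 115.24371), (1.3000002, 108.76601),
    (1.4500003, 102.23844), (1.5000002, 99.160072), (1.7500002, 90.843689),
    (1.8000003, 88.125809),
]

def f0_at(t, points):
    # Linear interpolation between neighbouring targets (an assumption);
    # outside the covered range the nearest target value is held.
    positions = [pos for pos, _ in points]
    if t <= positions[0]:
        return points[0][1]
    if t >= positions[-1]:
        return points[-1][1]
    i = bisect_right(positions, t)
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

for t in (0.0, 0.175, 1.0, 2.0):
    print(t, round(f0_at(t, targets), 2))
```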
5. Waveform Generation
Wave
The process
festival> (Wave_Synth complex_utt)
festival> (utt.relation.print complex_utt 'Wave)
generates five new relations :
- HMMstate
- segstate
- mcep
- mcep_link
- Wave
Festival Wave relations for clustergen voice
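The link between the Segment, segstate and HMMstate relations can be sketched in Python. In the “253” example every segment has three states named after the segment (pau_1, pau_2, pau_3, ...). The code below rebuilds that naming from the segment list; taking three states per segment as a constant is an observation from that output, not a documented property of every voice.

```python
# Segment names from the "253" example, in order.
segment_names = ["pau", "t", "uw", "hh", "ah", "n", "d", "r", "ax", "d",
                 "f", "ih", "f", "t", "iy", "th", "r", "iy", "pau"]

STATES_PER_SEGMENT = 3  # as observed in the HMMstate output of the example

# segstate: each segment is the parent of its own states.
segstate = [(name, [f"{name}_{i}" for i in range(1, STATES_PER_SEGMENT + 1)])
            for name in segment_names]

# HMMstate: the flat list of all states, in order.
hmm_states = [state for _, states in segstate for state in states]

print(len(hmm_states))
print(hmm_states[:6])
```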
In the next chapters we use the method
(utt.relation.print complex_utt 'Relation)
to display the relations and features specific to the clustergen voice.
Relation ‘HMMstate
HMMstates for Festival clustergen voice
Relation ‘segstate
segstates for Festival clustergen voice
Relation ‘mcep
mcep features for Festival clustergen voice
Relation ‘mcep_link
mcep_link relation for Festival clustergen voice
Relation ‘Wave
Wave relation for Festival clustergen voice
Diphone Voice Utterance
If we use a diphone voice (e.g. the default kal_diphone voice) instead of the clustergen voice, the last step of the prosodic stage (No 4) and the complete wave-synthesis stage (No 5) provide different relations and features.
We use the Festival method “SayText”, a combination of the above presented processes
- Initialize utt
- Text utt
- Token utt
- POS utt
- Phrasify utt
- Word utt
- Intonation utt
- Duration utt
- Int_Targets utt
- Wave_Synth utt
to create the same complex utterance as in the first example :
festival>
(set! complex_utt (SayText "Mr. James Brown Jr. attended flight
No AA4101 to Boston on Friday 02/13/2014."))
(utt.relationnames complex_utt)
Here are the results :
Utterance relations for a Festival diphone voice
In the next chapters we use the method
(utt.relation.print complex_utt 'Relation)
to display the relations and features specific to the diphone voice.
Relation ‘Target
Relation Target for diphone voice
Relation ‘Unit
Relation Unit for diphone voice
Relation ‘SourceCoef
Relation SourceCoef for diphone voice
Relation ‘f0
Relation f0 for diphone voice
Relation ‘TargetCoef
Relation TargetCoef for diphone voice
Relation ‘US_map
Relation US_map for diphone voice
Relation ‘Wave
Relation Wave for diphone voice
Playing and saving the diphone voice utterance :
Playing and saving a synthesized Festival utterance
Diphone voice utterance shown in Audacity :
Display of a synthesized Festival utterance
Clunits Voice Utterance
What is true for the diphone voice is also true for a clunits voice. The last step of the prosodic stage (No 4) and the complete wave-synthesis stage (No 5) generate different relations and features. As an example we use a Swedish clunits voice :
Relations for Festival clunits voice
Relation ‘Target
Relation Target for Festival clunits voice
Relation ‘Unit
Relation Unit for Festival clunits voice
Relation ‘SourceSegments
Relation SourceSegments for Festival clunits voice
That’s all.