Referring to my recent post about Festival, I am glad to announce that I was successful in building and testing a new uniphone voice (english) with my own prompt recordings. The goal is to set up my system to create a luxembourgish synthetic voice for the Festival package.
The list of the different steps is shown below :
• (creation of the voice directory mbarnig_en_marco)
• $FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone
• (define a phoneset)
• (define a lexicon)
• festival -b festvox/build_clunits.scm '(build_prompts_waves
"etc/uniphone.data")'
• (uncomment the line USE_SOX=1 in the script prompt_them)
• ./bin/prompt_them etc/uniphone.data
• ./bin/make_labs prompt-wav/*.wav
• festival -b festvox/build_clunits.scm '(build_utts
"etc/uniphone.data")'
• (copy etc/uniphone.data into etc/txt.done.data)
• ./bin/make_pm_wave wav/*.wav
• ./bin/make_mcep wav/*.wav
• festival -b festvox/build_clunits.scm '(build_clunits
"etc/uniphone.data")'
• (copy data in Festival voice directory)
• festival> (voice_mbarnig_en_marco_clunits)
• festival> (SayText "Hello Marco, how are you?")
1. Voice Folder
First I created a new voice folder mbarnig_en_marco inside the Festival_TTS/festvox/ directory and opened a terminal window inside this new folder.
2. Clunits Setup
I launched the script
$FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone
to construct the voice folder structure and copy there the template files for voice building.
The arguments of the setup script setup_clunits are :
- institution : mbarnig
- language : en
- speaker : marco
- standard prompt list : uniphone
The following folders are created inside the voice directory mbarnig_en_marco :
- bin
- cep
- emu
- etc
- f0
- festival
- festvox
- group
- lab
- lar
- lpc
- mcep
- phr
- pm
- pm_lab
- prompt_cep
- prompt_lab
- prompt_utt
- prompt_wav
- recording
- scratch
- syl
- versions
- wav
- wrd
The following programs are copied into the 1sr folder (mbarnig_en_marco/bin) :
- add_noise
- contour_powernormalize
- do_build
- find_db_duration
- find_num_available_cpu
- find_powercontours
- find_poerfactors
- get_lars
- get_wavs
- make_cmm
- make_dist
- make_f0
- make_labs
- make_lpc
- make_mcep
- make_pm
- make_pm_fix
- make_pm_pmlab
- make_pm_wave
- make_pmlab_pm
- make_samples
- prompt_them
- prune_middle_silence
- prune_silence
- reduce_prompts
- simple_powernormalize
- sphinx_lab
- sphinxtrain
- synthfile
- traintest
- ws
The following files are created inside the 4th folder (mbarnig_en_marco/etc) :
- emu_f0.tpl
- emu_hier.tpl
- emu_lab.tpl
- emu_pm.tpl
- uniphone.data
- voice.defs
- ws_festvox.conf
The following sub-folders (most empty) are created inside the 6th folder (mbarnig_en_marco/festival) :
- clunits, including a file all.desc
- coeffs
- disttabs
- dur
- f0
- feats
- phrbrk
- trees
- utts
The following scripts are created inside the 7th folder (mbarnig_en_marco/festvox) :
- build_clunits.scm
- build_st.scm
- mbarnig_en_marco_clunits.scm
- mbarnig_en_marco_duration.scm
- mbarnig_en_marco_durdata.scm
- mbarnig_en_marco_f0model.scm
- mbarnig_en_marco_intonation.scm
- mbarnig_en_marco_lexicon.scm
- mbarnig_en_marco_other.scm
- mbarnig_en_marco_phoneset.scm
- mbarnig_en_marco_phrasing.scm
- mbarnig_en_marco_tagger.scm
- mbarnig_en_marco_tokenizer.scm
The other listed folders are empty.
The file uniphone.data in the mbarnig_en_marco/etc folder contains the following minimal prompt-set :
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
These 3 sentences contain each of the english phonemes once. The prompt list is coded in the standard Festival data-format. The spaces after the left parantheses are required.
The file voice.defs in the mbarnig_en_marco/etc folder contains the following parameters :
FV_INST=mbarnig
FV_LANG=en
FV_NAME=marco
FV_TYPE=clunits
FV_VOICENAME=$FV_INST"_"$FV_LANG"_"$FV_NAME
FV_FULLVOICENAME=$FV_VOICENAME"_"FV_TYPE
The file ws_festvox.conf in the mbarnig_en_marco/etc folder is automatically generated by WaveSurfer.
Phoneset Definition
The phoneset for the new voice is defined in the script mbarnig_en_marco_phoneset.scm. Referring to the english Festival radio phoneset, I modified the phoneset-script as follows :
;;; Phoneset for mbarnig_en_marco
;;;
(defPhoneSet
mbarnig_en_marco
;;; Phone Features
(;; vowel or consonant
(vc + -)
;; vowel length: short long dipthong schwa
(vlng s l d a 0)
;; vowel height: high mid low
(vheight 1 2 3 0)
;; vowel frontness: front mid back
(vfront 1 2 3 0)
;; lip rounding
(vrnd + - 0)
;; consonant type: stop fricative affricate nasal lateral approximant
(ctype s f a n l r 0)
;; place of articulation: labial alveolar palatal labio-dental
;; dental velar glottal
(cplace l a p b d v g 0)
;; consonant voicing
(cvox + - 0)
)
;; Phone set members
(
;; Note these features were set by awb so they are wrong !!!
(aa + l 3 3 - 0 0 0) ;; father
(ae + s 3 1 - 0 0 0) ;; fat
(ah + s 2 2 - 0 0 0) ;; but
(ao + l 3 3 + 0 0 0) ;; lawn
(aw + d 3 2 - 0 0 0) ;; how
(ax + a 2 2 - 0 0 0) ;; about
(axr + a 2 2 - r a +)
(ay + d 3 2 - 0 0 0) ;; hide
(b - 0 0 0 0 s l +)
(ch - 0 0 0 0 a p -)
(d - 0 0 0 0 s a +)
(dh - 0 0 0 0 f d +)
(dx - a 0 0 0 s a +) ;; ??
(eh + s 2 1 - 0 0 0) ;; get
(el + s 0 0 0 l a +)
(em + s 0 0 0 n l +)
(en + s 0 0 0 n a +)
(er + a 2 2 - r 0 0) ;; always followed by r (er-r == axr)
(ey + d 2 1 - 0 0 0) ;; gate
(f - 0 0 0 0 f b -)
(g - 0 0 0 0 s v +)
(hh - 0 0 0 0 f g -)
(hv - 0 0 0 0 f g +)
(ih + s 1 1 - 0 0 0) ;; bit
(iy + l 1 1 - 0 0 0) ;; beet
(jh - 0 0 0 0 a p +)
(k - 0 0 0 0 s v -)
(l - 0 0 0 0 l a +)
(m - 0 0 0 0 n l +)
(n - 0 0 0 0 n a +)
(nx - 0 0 0 0 n d +) ;; ???
(ng - 0 0 0 0 n v +)
(ow + d 2 3 + 0 0 0) ;; lone
(oy + d 2 3 + 0 0 0) ;; toy
(p - 0 0 0 0 s l -)
(r - 0 0 0 0 r a +)
(s - 0 0 0 0 f a -)
(sh - 0 0 0 0 f p -)
(t - 0 0 0 0 s a -)
(th - 0 0 0 0 f d -)
(uh + s 1 3 + 0 0 0) ;; full
(uw + l 1 3 + 0 0 0) ;; fool
(v - 0 0 0 0 f b +)
(w - 0 0 0 0 r l +)
(y - 0 0 0 0 r p +)
(z - 0 0 0 0 f a +)
(zh - 0 0 0 0 f p +)
(pau - 0 0 0 0 0 0 -)
(h# - 0 0 0 0 0 0 -)
(brth - 0 0 0 0 0 0 -)
)
)
(PhoneSet.silences '(pau))
(define (mbarnig_en_marco::select_phoneset)
"(mbarnig_en_marco::select_phoneset)
Set up phone set for mbarnig_en_marco."
(Parameter.set 'PhoneSet 'mbarnig_en_marco)
(PhoneSet.select 'mbarnig_en_marco)
)
(define (mbarnig_en_marco::reset_phoneset)
"(mbarnig_en_marco::reset_phoneset)
Reset phone set for mbarnig_en_marco."
t
)
(provide 'mbarnig_en_marco_phoneset)
Lexicon Creation
Without a lexicon, the result of a building command generates unknown words messages
The lexicon for the new voice is defined in the script mbarnig_en_marco_lexicon.scm. Referring the the english Festival cmu lexicon, I modified the lexicon-script as follows :
;;; Lexicon, LTS and Postlexical rules for mbarnig_en_marco
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;; CMU lexicon for US English
;;;
;;; Load any necessary files here
(require 'postlex)
(setup_cmu_lex)
(define (mbarnig_en_marco::select_lexicon)
"(mbarnig_lx_marco::select_lexicon)
Set up the lexicon for mbarnig_lx_marco."
(lex.select "cmu")
;; Post lexical rules
(set! postlex_rules_hooks (list postlex_apos_s_check))
(set! postlex_vowel_reduce_cart_tree nil) ; no reduction
)
(define (mbarnig_lx_marco::reset_lexicon)
"(mbarnig_lx_marco::reset_lexicon)
Reset lexicon information."
t
)
(provide 'mbarnig_lx_marco_lexicon)
Building Prompts
The second command
festival -b festvox/build_clunits.scm '(build_prompts_waves
"etc/uniphone.data")'
generates synthesized waveforms to act as prompts and timing cues. The nearest available voice (in this case kal_diphone) is used for synthesizing. The generated files are also used in aligning the spoken data. The -b option (–batch) avoids switching in the interactive Festival mode.
The following files are created :
- folder prompt-lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
- folder prompt-utt : files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt
- folder prompt-wav : files uniph_0001.wav, uniph_0002.wav and uniph_0003.wav
The uniph_xxxx.lab files have the following type of content :
#
0.1100 100 pau
0.2200 100 ax
0.3300 100 hh
0.4400 100 ow
0.5500 100 l
0.6600 100 jh
0.7700 100 oy
0.8800 100 w
0.9900 100 aa
1.1000 100 z
1.2100 100 r
1.3200 100 iy
1.4850 100 p
1.6500 100 ih
1.8150 100 ng
1.9250 100 pau
The uniph_xxxx.utt files have the following type of content :
EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 77 ; type Text ;
iform "\"a whole joy was reaping.\"" ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
4 id _4 ; name was ; whitespace " " ; prepunctuation "" ;
5 id _5 ; name reaping ; punc . ; whitespace " " ;
prepunctuation "" ;
6 id _10 ; name reaping ; pbreak B ; pos nil ;
7 id _11 ; name . ; pbreak B ; pos punc ;
8 id _9 ; name was ; pbreak NB ; pos nil ;
9 id _8 ; name joy ; pbreak NB ; pos nil ;
10 id _7 ; name whole ; pbreak NB ; pos nil ;
11 id _6 ; name a ; pbreak NB ; pos dt ;
12 id _12 ; name B ;
13 id _13 ; name syl ; stress 0 ;
14 id _15 ; name syl ; stress 1 ;
15 id _19 ; name syl ; stress 1 ;
16 id _22 ; name syl ; stress 1 ;
17 id _26 ; name syl ; stress 1 ;
18 id _29 ; name syl ; stress 0 ;
19 id _33 ; name pau ; dur_factor 1 ; end 0.11 ;source_end 0.101815 ;
20 id _14 ; name ax ; dur_factor 1 ; end 0.22 ;source_end 0.235802 ;
21 id _16 ; name hh ; dur_factor 1 ; end 0.33 ;source_end 0.322177 ;
22 id _17 ; name ow ; dur_factor 1 ; end 0.44 ;source_end 0.493926 ;
23 id _18 ; name l ; dur_factor 1 ; end 0.55 ;source_end 0.626926 ;
24 id _20 ; name jh ; dur_factor 1 ; end 0.66 ;source_end 0.732624 ;
25 id _21 ; name oy ; dur_factor 1 ; end 0.77 ;source_end 0.900228 ;
26 id _23 ; name w ; dur_factor 1 ; end 0.88 ;source_end 1.06616 ;
27 id _24 ; name aa ; dur_factor 1 ; end 0.99 ;source_end 1.20716 ;
28 id _25 ; name z ; dur_factor 1 ; end 1.1 ;source_end 1.33726 ;
29 id _27 ; name r ; dur_factor 1 ; end 1.21 ;source_end 1.46326 ;
30 id _28 ; name iy ; dur_factor 1 ; end 1.32 ;source_end 1.58507 ;
31 id _30 ; name p ; dur_factor 1.5 ; end 1.485 ;source_end 1.71307 ;
32 id _31 ; name ih ; dur_factor 1.5 ; end 1.65 ;source_end 1.83979 ;
33 id _32 ; name ng ; dur_factor 1.5 ; end 1.815 source_end 2.02441 ;
34 id _34 ; name pau ; dur_factor 1 ; end 1.925 ;source_end 2.36643 ;
35 id _35 ; name Accented ;
36 id _36 ; name Accented ;
37 id _37 ; name Accented ;
38 id _38 ; name Accented ;
39 id _56 ; f0 110 ; pos 1.815 ;
40 id _54 ; f0 126.571 ; pos 1.31 ;
41 id _55 ; f0 116.286 ; pos 1.32 ;
42 id _53 ; f0 128.571 ; pos 1.11 ;
43 id _50 ; f0 130 ; pos 1.09 ;
44 id _51 ; f0 118.571 ; pos 1.1 ;
45 id _52 ; f0 118.571 ; pos 1.101 ;
46 id _49 ; f0 132 ; pos 0.78 ;
47 id _46 ; f0 132.286 ; pos 0.76 ;
48 id _47 ; f0 122 ; pos 0.77 ;
49 id _48 ; f0 122 ; pos 0.771 ;
50 id _45 ; f0 134.286 ; pos 0.56 ;
51 id _42 ; f0 135.714 ; pos 0.54 ;
52 id _43 ; f0 124.286 ; pos 0.55 ;
53 id _44 ; f0 124.286 ; pos 0.551 ;
54 id _41 ; f0 137.714 ; pos 0.23 ;
55 id _40 ; f0 127.714 ; pos 0.22 ;
56 id _39 ; f0 130 ; pos 0.11 ;
57 id _57 ; name pau-ax ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 9 ; end 0.172053 ; num_frames 17 ;
58 id _58 ; name ax-hh ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 5 ; end 0.288115 ; num_frames 11 ;
59 id _59 ; name hh-ow ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 2 ; end 0.406552 ; num_frames 11 ;
60 id _60 ; name ow-l ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 7 ; end 0.559239 ; num_frames 14 ;
61 id _61 ; name l-jh ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 5 ; end 0.673416 ; num_frames 10 ;
62 id _62 ; name jh-oy ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 4 ; end 0.82104 ; num_frames 13 ;
63 id _63 ; name oy-w ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 6 ; end 1.01148 ; num_frames 17 ;
64 id _64 ; name w-aa ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 4 ; end 1.1311 ; num_frames 11 ;
65 id _65 ; name aa-z ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 6 ; end 1.25207 ; num_frames 11 ;
66 id _66 ; name z-r ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 7 ; end 1.39807 ; num_frames 14 ;
67 id _67 ; name r-iy ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 5 ; end 1.52951 ; num_frames 12 ;
68 id _68 ; name iy-p ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 4 ; end 1.64963 ; num_frames 11 ;
69 id _69 ; name p-ih ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 5 ; end 1.78517 ; num_frames 13 ;
70 id _70 ; name ih-ng ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 4 ; end 1.91617 ; num_frames 12 ;
71 id _71 ; name ng-pau ; sig "[Val wave]" ; coefs "[Val track]" ;
middle_frame 9 ; end 2.19542 ; num_frames 27 ;
72 id _72 ; name coef ; coefs "[Val track]" ;
frame "[Val wavevector]" ;
73 id _73 ; name f0 ; f0 "[Val track]" ;
74 id _74 ; coefs "[Val track]" ; residual "[Val wave]" ;
75 id _75 ;
76 id _76 ; map "[Val ivector]" ;
77 id _77 ; wave "[Val wave]" ;
End_of_Stream_Items
Relations
Relation Token ; ()
6 11 1 0 0 0
1 1 0 6 2 0
7 10 2 0 0 0
2 2 0 7 3 1
8 9 3 0 0 0
3 3 0 8 4 2
9 8 4 0 0 0
4 4 0 9 5 3
10 6 5 0 11 0
11 7 0 0 0 10
5 5 0 10 0 4
End_of_Relation
Relation Word ; ()
1 11 0 0 2 0
2 10 0 0 3 1
3 9 0 0 4 2
4 8 0 0 5 3
5 6 0 0 0 4
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 12 0 2 0 0
End_of_Relation
Relation Syllable ; ()
1 13 0 0 2 0
2 14 0 0 3 1
3 15 0 0 4 2
4 16 0 0 5 3
5 17 0 0 6 4
6 18 0 0 0 5
End_of_Relation
Relation Segment ; ()
1 19 0 0 2 0
2 20 0 0 3 1
3 21 0 0 4 2
4 22 0 0 5 3
5 23 0 0 6 4
6 24 0 0 7 5
7 25 0 0 8 6
8 26 0 0 9 7
9 27 0 0 10 8
10 28 0 0 11 9
11 29 0 0 12 10
12 30 0 0 13 11
13 31 0 0 14 12
14 32 0 0 15 13
15 33 0 0 16 14
16 34 0 0 0 15
End_of_Relation
Relation SylStructure ; ()
8 20 7 0 0 0
7 13 1 8 0 0
1 11 0 7 2 0
10 21 9 0 11 0
11 22 0 0 12 10
12 23 0 0 0 11
9 14 2 10 0 0
2 10 0 9 3 1
14 24 13 0 15 0
15 25 0 0 0 14
13 15 3 14 0 0
3 9 0 13 4 2
17 26 16 0 18 0
18 27 0 0 19 17
19 28 0 0 0 18
16 16 4 17 0 0
4 8 0 16 5 3
22 29 20 0 23 0
23 30 0 0 0 22
20 17 5 22 21 0
24 31 21 0 25 0
25 32 0 0 26 24
26 33 0 0 0 25
21 18 0 24 0 20
5 6 0 20 6 4
6 7 0 0 0 5
End_of_Relation
Relation IntEvent ; ()
1 35 0 0 2 0
2 36 0 0 3 1
3 37 0 0 4 2
4 38 0 0 0 3
End_of_Relation
Relation Intonation ; ()
5 35 1 0 0 0
1 14 0 5 2 0
6 36 2 0 0 0
2 15 0 6 3 1
7 37 3 0 0 0
3 16 0 7 4 2
8 38 4 0 0 0
4 17 0 8 0 3
End_of_Relation
Relation Target ; ()
12 56 1 0 0 0
1 19 0 12 2 0
13 55 2 0 0 0
2 20 0 13 3 1
14 54 3 0 0 0
3 21 0 14 4 2
15 51 4 0 16 0
16 52 0 0 17 15
17 53 0 0 0 16
4 23 0 15 5 3
18 50 5 0 0 0
5 24 0 18 6 4
19 47 6 0 20 0
20 48 0 0 21 19
21 49 0 0 0 20
6 25 0 19 7 5
22 46 7 0 0 0
7 26 0 22 8 6
23 43 8 0 24 0
24 44 0 0 25 23
25 45 0 0 0 24
8 28 0 23 9 7
26 42 9 0 0 0
9 29 0 26 10 8
27 40 10 0 28 0
28 41 0 0 0 27
10 30 0 27 11 9
29 39 11 0 0 0
11 33 0 29 0 10
End_of_Relation
Relation Unit ; grouped 1 ;
1 57 0 0 2 0
2 58 0 0 3 1
3 59 0 0 4 2
4 60 0 0 5 3
5 61 0 0 6 4
6 62 0 0 7 5
7 63 0 0 8 6
8 64 0 0 9 7
9 65 0 0 10 8
10 66 0 0 11 9
11 67 0 0 12 10
12 68 0 0 13 11
13 69 0 0 14 12
14 70 0 0 15 13
15 71 0 0 0 14
End_of_Relation
Relation SourceCoef ; ()
1 72 0 0 0 0
End_of_Relation
Relation f0 ; ()
1 73 0 0 0 0
End_of_Relation
Relation TargetCoef ; ()
1 74 0 0 2 0
2 75 0 0 0 1
End_of_Relation
Relation US_map ; ()
1 76 0 0 0 0
End_of_Relation
Relation Wave ; ()
1 77 0 0 0 0
End_of_Relation
End_of_Relations
End_of_Utterance
Recording Prompts
Before launching the next command
./bin/prompt_them etc/uniphone.data
to start the automatic recording of the prompts, I uncommented the line USE_SOX=1 in the prompt_them script to use the SOX package on the Mac instead of the na_play / na_record programs.
Festival plays the synthesized prompt before each record and calculates the recording duration, based on the synthesis. The recorded audio files are saved into the wav folder.
As the recording in the required format 16.000 Hz, mono 16 bits was not possible, I did a manual recording with the Audacity app and replaced the audio files in the wav folder.
Labeling
The labeling of the spoken prompts is done by matching the synthesized prompts with the spoken ones.
./bin/make_labs prompt_wav/*.wav
The following files are created :
- folder cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep
- folder lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
- folder prompt-cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep
The uniph_xxxx.cep files have the following type of content :
EST_File Track
DataType binary
ByteOrder 01
NumFrames 425
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;
Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
...
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
..........
The uniph_xxxx.cep files have the following content :
separator ;
nfields 1
#
0.01500 26 pau
0.12500 26 ax
0.23500 26 hh
0.37000 26 ow
0.53000 26 l
0.65000 26 jh
0.87000 26 oy
1.08500 26 w
1.19500 26 aa
1.39500 26 z
1.45000 26 r
1.55500 26 iy
1.74500 26 p
1.91000 26 ih
2.07500 26 ng
2.12500 26 pau
The correct labeling can be checked with the WaveSurfer app.
The uniph_xxxx.cep files have the following type of content :
EST_File Track
DataType binary
ByteOrder 01
NumFrames 385
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;
Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
...
Channel_19 melcep_d_8
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
....
Creating Utterances
After labeling the utterance structure is created with the command
festival -b festvox/build_clunits.scm '(build_utts
"etc/uniphone.data")'
The 3 files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt are saved in the folder festival/utts. They have the following type of content :
EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 94 ; type Text ;
iform "\"a whole joy was reaping.\"" ;
filename prompt-utt/uniph_0001.utt ;
fileid uniph_0001 ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
...
1 76 0 0 0 0
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 77 0 2 0 0
End_of_Relation
End_of_Relations
End_of_Utterance
txt.done.data
The next scripts are looking for the txt.done.data file instead of the uniphone.data file. Copying the uniphone.data file and renaming it to txt.done.data solves this problem.
Extracting pitchmarks
The simplest way to extract the pitchmarks from the records is to use the command
./bin/make_pm_wave wav/*.wav
without tuning any parameters.
The 3 files uniph_0001.pm, uniph_0002.pm and uniph_0003.pm are saved in the folder pm. They have the following type of content :
EST_File Track
DataType ascii
NumFrames 271
NumChannels 0
NumAuxChannels 0
EqualSpace 0
BreaksPresent true
EST_Header_End
0.016750 1
0.023312 1
0.030125 1
0.037250 1
0.044625 1
.....
2.070750 1
2.081512 1
2.092275 1
2.103038 1
2.113800 1
2.124563 1
Find Mel Frequency Cepstral Coefficients
In the next stage the Mel Frequency Cepstral Coefficients are defined synchronously with the pitch periods
./bin/make_mcep wav/*.wav
The 3 files uniph_0001.mcep, uniph_0002.mcep and uniph_0003.mcep are saved in the folder mcep. They have the following type of content :
EST_File Track
DataType binary
ByteOrder 01
NumFrames 271
NumChannels 12
EqualSpace 0
BreaksPresent true
CommentChar ;
Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
Channel_6 melcep_7
Channel_7 melcep_8
Channel_8 melcep_9
Channel_9 melcep_10
Channel_10 melcep_11
Channel_11 melcep_N
EST_Header_End
........
Building Synthesizer
Building the cluster unit selection synthesizer is the main part of the voice creation. It’s done with the command
festival -b festvox/build_clunits.scm '(build_clunits
"etc/uniphone.data")'
The following files are created :
- folder festival/clunits : file mbarnig_en_marco.catalogue
- folder festival/feats : 41 files [phoneme-name].feats
- folder festival/trees : 41 files [phoneme-name].tree
- folder festival/trees : file mbarnig_en_marco.tree
The file mbarnig_en_marco.catalogue has the following type of content :
EST_File index
DataType ascii
NumEntries 46
IndexName mbarnig_en_marco
EST_Header_End
pau_5 uniph_0001 0.000000 0.007500 0.015000
ax_0 uniph_0001 0.015000 0.070000 0.125000
hh_0 uniph_0001 0.125000 0.180000 0.235000
ow_0 uniph_0001 0.235000 0.302500 0.370000
l_0 uniph_0001 0.370000 0.450000 0.530000
jh_0 uniph_0001 0.530000 0.590000 0.650000
oy_0 uniph_0001 0.650000 0.760000 0.870000
.....
er_0 uniph_0003 1.150000 1.205000 1.260000
m_0 uniph_0003 1.260000 1.320000 1.380000
ay_0 uniph_0003 1.380000 1.480000 1.580000
k_0 uniph_0003 1.580000 1.615000 1.650000
pau_0 uniph_0003 1.650000 1.650000 1.650000
A xx.feats file has the following type of content :
0 w - r 0 0 0 0 l + z - f 0 0 0 0 a + 0.11000001 128.271 130.7258
126.721 1 coda coda onset 1 1 0 0 1 1 single oy + 0 2 d 3 + 0 0 0 0
content content content
A xx.tree file has the following type of content :
((((0 0)) 0))
;; Right cluster 0 (0%) mean ranking 2 mean distance 0
The mbarnig_en_marco.tree file has the following type of content :
;; Autogenerated list of selection trees
;; db_dir "./"
;; db_dir "."
;; name mbarnig_en_marco
;; index_name mbarnig_en_marco
;; f0_join_weight 0
;; join_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; trees_dir "festival/trees/"
;; catalogue_dir "festival/clunits/"
;; coeffs_dir "mcep/"
;; coeffs_ext ".mcep"
;; clunit_name_feat lisp_mbarnig_en_marco::clunit_name
;; join_method windowed
;; continuity_weight 5
;; optimal_coupling 1
;; extend_selections 2
;; pm_coeffs_dir "mcep/"
;; pm_coeffs_ext ".mcep"
;; sig_dir "wav/"
;; sig_ext ".wav"
;; disttabs_dir "festival/disttabs/"
;; utts_dir "festival/utts/"
;; utts_ext ".utt"
;; dur_pen_weight 0
;; f0_pen_weight 0
;; get_stds_per_unit t
;; ac_left_context 0.8
;; ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; feats_dir "festival/feats/"
;; feats (occurid p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng
p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox n.name n.ph_vc
n.ph_ctype n.ph_vheight n.ph_vlng n.ph_vfront n.ph_vrnd n.ph_cplace
n.ph_cvox segment_duration seg_pitch p.seg_pitch n.seg_pitch
R:SylStructure.parent.stress seg_onsetcoda n.seg_onsetcoda
p.seg_onsetcoda R:SylStructure.parent.accented pos_in_syl
syl_initial syl_final R:SylStructure.parent.lisp_cg_break
R:SylStructure.parent.R:Syllable.p.lisp_cg_break
R:SylStructure.parent.position_type pp.name pp.ph_vc pp.ph_ctype
pp.ph_vheight pp.ph_vlng pp.ph_vfront pp.ph_vrnd pp.ph_cplace
pp.ph_cvox n.lisp_is_pau p.lisp_is_pau
R:SylStructure.parent.parent.gpos
R:SylStructure.parent.parent.R:Word.p.gpos
R:SylStructure.parent.parent.R:Word.n.gpos)
;; wagon_field_desc "festival/clunits/all.desc"
;; wagon_progname "$ESTDIR/bin/wagon"
;; wagon_cluster_size 20
;; prune_reduce 0
;; cluster_prune_limit 40
;; files (uniph_0001 uniph_0002 uniph_0003)
(set! clunits_selection_trees '(
("k" ((((0 0)) 0)))
("ay" ((((0 0)) 0)))
("m" ((((0 0)) 0)))
("er" ((((0 0)) 0)))
("zh" ((((0 0)) 0)))
("ae" ((((0 0)) 0)))
("ch" ((((0 0)) 0)))
("eh" ((((0 0)) 0)))
....
("jh" ((((0 0)) 0)))
("l" ((((0 0)) 0)))
("ow" ((((0 0)) 0)))
("hh" ((((0 0)) 0)))
("ax" ((((0 0)) 0)))
("pau"
((((0 67.455) (1 100) (2 66.73) (3 100) (4 67.43) (5 100)) 83.6025)))
))
Install new voice
In the last step I created the folder
Festival-TTS/festival/lib/voices/english/mbarnig_en_marco_clunits/
and copied the following folders from the voice directory
Festival_TTS/festvox/mbarnig_en_marco/
into this folder :
- festival
- festvox
- mcep
- wav
We can now check if the voice is recognized
festival> (voice.list)
load the new voice
festival> (voice_mbarnig_en_marco_clunits)
and test the synthesis
festival> (SayText "Hello Marco, how are you?")
It works as expected.
The following folders in the voice folder Festival-TTS/festvox/mbarnig_en_marco remained empty :
- emu, including 2 empty subfolders lab_hlb and pm_hlb
- f0
- group
- lar
- lpc
- phr
- pm_lab
- recording
- scratch, including 2 empty subfolders lab and wav
- syl
- versions
- wrd