Festival uniphone voice creation

Posted on February 15, 2015 by Marco Barnig

Referring to my recent post about Festival, I am glad to announce that I was successful in building and testing a new uniphone voice (english) with my own prompt recordings. The goal is to set up my system to create a luxembourgish synthetic voice for the Festival package.

The list of the different steps is shown below :

• (creation of the voice directory mbarnig_en_marco)
• $FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone
• (define a phoneset)
• (define a lexicon)
• festival -b festvox/build_clunits.scm '(build_prompts_waves 
"etc/uniphone.data")'
• (uncomment the line USE_SOX=1 in the script prompt_them)
• ./bin/prompt_them etc/uniphone.data 
• ./bin/make_labs prompt-wav/*.wav 
• festival -b festvox/build_clunits.scm '(build_utts 
"etc/uniphone.data")' 
• (copy etc/uniphone.data into etc/txt.done.data)
• ./bin/make_pm_wave wav/*.wav
• ./bin/make_mcep wav/*.wav
• festival -b festvox/build_clunits.scm '(build_clunits 
"etc/uniphone.data")'
• (copy data in Festival voice directory)
• festival> (voice_mbarnig_en_marco_clunits)
• festival> (SayText "Hello Marco, how are you?")

1. Voice Folder

First I created a new voice folder mbarnig_en_marco inside the Festival_TTS/festvox/ directory and opened a terminal window inside this new folder.

2. Clunits Setup

I launched the script

$FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone

to construct the voice folder structure and copy there the template files for voice building.
The arguments of the setup script setup_clunits are :

institution : mbarnig
language : en
speaker : marco
standard prompt list : uniphone

Festival : setup_clunits

The following folders are created inside the voice directory mbarnig_en_marco :

bin
cep
emu
etc
f0
festival
festvox
group
lab
lar
lpc
mcep
phr
pm
pm_lab
prompt_cep
prompt_lab
prompt_utt
prompt_wav
recording
scratch
syl
versions
wav
wrd

The following programs are copied into the 1sr folder (mbarnig_en_marco/bin) :

add_noise
contour_powernormalize
do_build
find_db_duration
find_num_available_cpu
find_powercontours
find_poerfactors
get_lars
get_wavs
make_cmm
make_dist
make_f0
make_labs
make_lpc
make_mcep
make_pm
make_pm_fix
make_pm_pmlab
make_pm_wave
make_pmlab_pm
make_samples
prompt_them
prune_middle_silence
prune_silence
reduce_prompts
simple_powernormalize
sphinx_lab
sphinxtrain
synthfile
traintest
ws

The following files are created inside the 4th folder (mbarnig_en_marco/etc) :

emu_f0.tpl
emu_hier.tpl
emu_lab.tpl
emu_pm.tpl
uniphone.data
voice.defs
ws_festvox.conf

The following sub-folders (most empty) are created inside the 6th folder (mbarnig_en_marco/festival) :

clunits, including a file all.desc
coeffs
disttabs
dur
f0
feats
phrbrk
trees
utts

The following scripts are created inside the 7th folder (mbarnig_en_marco/festvox) :

build_clunits.scm
build_st.scm
mbarnig_en_marco_clunits.scm
mbarnig_en_marco_duration.scm
mbarnig_en_marco_durdata.scm
mbarnig_en_marco_f0model.scm
mbarnig_en_marco_intonation.scm
mbarnig_en_marco_lexicon.scm
mbarnig_en_marco_other.scm
mbarnig_en_marco_phoneset.scm
mbarnig_en_marco_phrasing.scm
mbarnig_en_marco_tagger.scm
mbarnig_en_marco_tokenizer.scm

The other listed folders are empty.

The file uniphone.data in the mbarnig_en_marco/etc folder contains the following minimal prompt-set :

( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )

These 3 sentences contain each of the english phonemes once. The prompt list is coded in the standard Festival data-format. The spaces after the left parantheses are required.

The file voice.defs in the mbarnig_en_marco/etc folder contains the following parameters :

FV_INST=mbarnig
FV_LANG=en
FV_NAME=marco
FV_TYPE=clunits
FV_VOICENAME=$FV_INST"_"$FV_LANG"_"$FV_NAME
FV_FULLVOICENAME=$FV_VOICENAME"_"FV_TYPE

The file ws_festvox.conf in the mbarnig_en_marco/etc folder is automatically generated by WaveSurfer.

Phoneset Definition

The phoneset for the new voice is defined in the script mbarnig_en_marco_phoneset.scm. Referring to the english Festival radio phoneset, I modified the phoneset-script as follows :

;;; Phoneset for mbarnig_en_marco
;;;
(defPhoneSet
mbarnig_en_marco
;;; Phone Features
(;; vowel or consonant
(vc + -)
;; vowel length: short long dipthong schwa
(vlng s l d a 0)
;; vowel height: high mid low
(vheight 1 2 3 0)
;; vowel frontness: front mid back
(vfront 1 2 3 0)
;; lip rounding
(vrnd + - 0)
;; consonant type: stop fricative affricate nasal lateral approximant
(ctype s f a n l r 0)
;; place of articulation: labial alveolar palatal labio-dental
;; dental velar glottal
(cplace l a p b d v g 0)
;; consonant voicing
(cvox + - 0)
)
;; Phone set members
(
;; Note these features were set by awb so they are wrong !!!
(aa + l 3 3 - 0 0 0) ;; father
(ae + s 3 1 - 0 0 0) ;; fat
(ah + s 2 2 - 0 0 0) ;; but
(ao + l 3 3 + 0 0 0) ;; lawn
(aw + d 3 2 - 0 0 0) ;; how
(ax + a 2 2 - 0 0 0) ;; about
(axr + a 2 2 - r a +)
(ay + d 3 2 - 0 0 0) ;; hide
(b - 0 0 0 0 s l +)
(ch - 0 0 0 0 a p -)
(d - 0 0 0 0 s a +)
(dh - 0 0 0 0 f d +)
(dx - a 0 0 0 s a +) ;; ??
(eh + s 2 1 - 0 0 0) ;; get
(el + s 0 0 0 l a +)
(em + s 0 0 0 n l +)
(en + s 0 0 0 n a +)
(er + a 2 2 - r 0 0) ;; always followed by r (er-r == axr)
(ey + d 2 1 - 0 0 0) ;; gate
(f - 0 0 0 0 f b -)
(g - 0 0 0 0 s v +)
(hh - 0 0 0 0 f g -)
(hv - 0 0 0 0 f g +)
(ih + s 1 1 - 0 0 0) ;; bit
(iy + l 1 1 - 0 0 0) ;; beet
(jh - 0 0 0 0 a p +)
(k - 0 0 0 0 s v -)
(l - 0 0 0 0 l a +)
(m - 0 0 0 0 n l +)
(n - 0 0 0 0 n a +)
(nx - 0 0 0 0 n d +) ;; ???
(ng - 0 0 0 0 n v +)
(ow + d 2 3 + 0 0 0) ;; lone
(oy + d 2 3 + 0 0 0) ;; toy
(p - 0 0 0 0 s l -)
(r - 0 0 0 0 r a +)
(s - 0 0 0 0 f a -)
(sh - 0 0 0 0 f p -)
(t - 0 0 0 0 s a -)
(th - 0 0 0 0 f d -)
(uh + s 1 3 + 0 0 0) ;; full
(uw + l 1 3 + 0 0 0) ;; fool
(v - 0 0 0 0 f b +)
(w - 0 0 0 0 r l +)
(y - 0 0 0 0 r p +)
(z - 0 0 0 0 f a +)
(zh - 0 0 0 0 f p +)
(pau - 0 0 0 0 0 0 -)
(h# - 0 0 0 0 0 0 -)
(brth - 0 0 0 0 0 0 -)
)
)
(PhoneSet.silences '(pau))

(define (mbarnig_en_marco::select_phoneset)
 "(mbarnig_en_marco::select_phoneset)
Set up phone set for mbarnig_en_marco."
 (Parameter.set 'PhoneSet 'mbarnig_en_marco)
 (PhoneSet.select 'mbarnig_en_marco)
)

(define (mbarnig_en_marco::reset_phoneset)
 "(mbarnig_en_marco::reset_phoneset)
Reset phone set for mbarnig_en_marco."
 t
)

(provide 'mbarnig_en_marco_phoneset)

Lexicon Creation

Without a lexicon, the result of a building command generates unknown words messages

Festival : unknown words without lexicon

The lexicon for the new voice is defined in the script mbarnig_en_marco_lexicon.scm. Referring the the english Festival cmu lexicon, I modified the lexicon-script as follows :

;;; Lexicon, LTS and Postlexical rules for mbarnig_en_marco
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;; CMU lexicon for US English
;;;

;;; Load any necessary files here
(require 'postlex)
(setup_cmu_lex)
(define (mbarnig_en_marco::select_lexicon)
 "(mbarnig_lx_marco::select_lexicon)
Set up the lexicon for mbarnig_lx_marco."
(lex.select "cmu")

;; Post lexical rules
(set! postlex_rules_hooks (list postlex_apos_s_check))
(set! postlex_vowel_reduce_cart_tree nil) ; no reduction
)


(define (mbarnig_lx_marco::reset_lexicon)
 "(mbarnig_lx_marco::reset_lexicon)
Reset lexicon information."
 t
)

(provide 'mbarnig_lx_marco_lexicon)

Building Prompts

The second command

festival -b festvox/build_clunits.scm '(build_prompts_waves 
"etc/uniphone.data")'

generates synthesized waveforms to act as prompts and timing cues. The nearest available voice (in this case kal_diphone) is used for synthesizing. The generated files are also used in aligning the spoken data. The -b option (–batch) avoids switching in the interactive Festival mode.

Festival : build_prompts_waves

The following files are created :

folder prompt-lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
folder prompt-utt : files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt
folder prompt-wav : files uniph_0001.wav, uniph_0002.wav and uniph_0003.wav

The uniph_xxxx.lab files have the following type of content :

#
0.1100 100 pau
0.2200 100 ax
0.3300 100 hh
0.4400 100 ow
0.5500 100 l
0.6600 100 jh
0.7700 100 oy
0.8800 100 w
0.9900 100 aa
1.1000 100 z
1.2100 100 r
1.3200 100 iy
1.4850 100 p
1.6500 100 ih
1.8150 100 ng
1.9250 100 pau

The uniph_xxxx.utt files have the following type of content :

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 77 ; type Text ; 
iform "\"a whole joy was reaping.\"" ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
4 id _4 ; name was ; whitespace " " ; prepunctuation "" ;
5 id _5 ; name reaping ; punc . ; whitespace " " ; 
prepunctuation "" ;
6 id _10 ; name reaping ; pbreak B ; pos nil ;
7 id _11 ; name . ; pbreak B ; pos punc ;
8 id _9 ; name was ; pbreak NB ; pos nil ;
9 id _8 ; name joy ; pbreak NB ; pos nil ;
10 id _7 ; name whole ; pbreak NB ; pos nil ;
11 id _6 ; name a ; pbreak NB ; pos dt ;
12 id _12 ; name B ;
13 id _13 ; name syl ; stress 0 ;
14 id _15 ; name syl ; stress 1 ;
15 id _19 ; name syl ; stress 1 ;
16 id _22 ; name syl ; stress 1 ;
17 id _26 ; name syl ; stress 1 ;
18 id _29 ; name syl ; stress 0 ;
19 id _33 ; name pau ; dur_factor 1 ; end 0.11 ;source_end 0.101815 ;
20 id _14 ; name ax ; dur_factor 1 ; end 0.22 ;source_end 0.235802 ;
21 id _16 ; name hh ; dur_factor 1 ; end 0.33 ;source_end 0.322177 ;
22 id _17 ; name ow ; dur_factor 1 ; end 0.44 ;source_end 0.493926 ;
23 id _18 ; name l ; dur_factor 1 ; end 0.55 ;source_end 0.626926 ;
24 id _20 ; name jh ; dur_factor 1 ; end 0.66 ;source_end 0.732624 ;
25 id _21 ; name oy ; dur_factor 1 ; end 0.77 ;source_end 0.900228 ;
26 id _23 ; name w ; dur_factor 1 ; end 0.88 ;source_end 1.06616 ;
27 id _24 ; name aa ; dur_factor 1 ; end 0.99 ;source_end 1.20716 ;
28 id _25 ; name z ; dur_factor 1 ; end 1.1 ;source_end 1.33726 ;
29 id _27 ; name r ; dur_factor 1 ; end 1.21 ;source_end 1.46326 ;
30 id _28 ; name iy ; dur_factor 1 ; end 1.32 ;source_end 1.58507 ;
31 id _30 ; name p ; dur_factor 1.5 ; end 1.485 ;source_end 1.71307 ;
32 id _31 ; name ih ; dur_factor 1.5 ; end 1.65 ;source_end 1.83979 ;
33 id _32 ; name ng ; dur_factor 1.5 ; end 1.815 source_end 2.02441 ;
34 id _34 ; name pau ; dur_factor 1 ; end 1.925 ;source_end 2.36643 ;
35 id _35 ; name Accented ;
36 id _36 ; name Accented ;
37 id _37 ; name Accented ;
38 id _38 ; name Accented ;
39 id _56 ; f0 110 ; pos 1.815 ;
40 id _54 ; f0 126.571 ; pos 1.31 ;
41 id _55 ; f0 116.286 ; pos 1.32 ;
42 id _53 ; f0 128.571 ; pos 1.11 ;
43 id _50 ; f0 130 ; pos 1.09 ;
44 id _51 ; f0 118.571 ; pos 1.1 ;
45 id _52 ; f0 118.571 ; pos 1.101 ;
46 id _49 ; f0 132 ; pos 0.78 ;
47 id _46 ; f0 132.286 ; pos 0.76 ;
48 id _47 ; f0 122 ; pos 0.77 ;
49 id _48 ; f0 122 ; pos 0.771 ;
50 id _45 ; f0 134.286 ; pos 0.56 ;
51 id _42 ; f0 135.714 ; pos 0.54 ;
52 id _43 ; f0 124.286 ; pos 0.55 ;
53 id _44 ; f0 124.286 ; pos 0.551 ;
54 id _41 ; f0 137.714 ; pos 0.23 ;
55 id _40 ; f0 127.714 ; pos 0.22 ;
56 id _39 ; f0 130 ; pos 0.11 ;
57 id _57 ; name pau-ax ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 9 ; end 0.172053 ; num_frames 17 ;
58 id _58 ; name ax-hh ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 0.288115 ; num_frames 11 ;
59 id _59 ; name hh-ow ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 2 ; end 0.406552 ; num_frames 11 ;
60 id _60 ; name ow-l ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 7 ; end 0.559239 ; num_frames 14 ;
61 id _61 ; name l-jh ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 0.673416 ; num_frames 10 ;
62 id _62 ; name jh-oy ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 0.82104 ; num_frames 13 ;
63 id _63 ; name oy-w ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 6 ; end 1.01148 ; num_frames 17 ;
64 id _64 ; name w-aa ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.1311 ; num_frames 11 ;
65 id _65 ; name aa-z ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 6 ; end 1.25207 ; num_frames 11 ;
66 id _66 ; name z-r ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 7 ; end 1.39807 ; num_frames 14 ;
67 id _67 ; name r-iy ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 1.52951 ; num_frames 12 ;
68 id _68 ; name iy-p ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.64963 ; num_frames 11 ;
69 id _69 ; name p-ih ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 1.78517 ; num_frames 13 ;
70 id _70 ; name ih-ng ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.91617 ; num_frames 12 ;
71 id _71 ; name ng-pau ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 9 ; end 2.19542 ; num_frames 27 ;
72 id _72 ; name coef ; coefs "[Val track]" ; 
frame "[Val wavevector]" ;
73 id _73 ; name f0 ; f0 "[Val track]" ;
74 id _74 ; coefs "[Val track]" ; residual "[Val wave]" ;
75 id _75 ;
76 id _76 ; map "[Val ivector]" ;
77 id _77 ; wave "[Val wave]" ;
End_of_Stream_Items
Relations
Relation Token ; ()
6 11 1 0 0 0
1 1 0 6 2 0
7 10 2 0 0 0
2 2 0 7 3 1
8 9 3 0 0 0
3 3 0 8 4 2
9 8 4 0 0 0
4 4 0 9 5 3
10 6 5 0 11 0
11 7 0 0 0 10
5 5 0 10 0 4
End_of_Relation
Relation Word ; ()
1 11 0 0 2 0
2 10 0 0 3 1
3 9 0 0 4 2
4 8 0 0 5 3
5 6 0 0 0 4
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 12 0 2 0 0
End_of_Relation
Relation Syllable ; ()
1 13 0 0 2 0
2 14 0 0 3 1
3 15 0 0 4 2
4 16 0 0 5 3
5 17 0 0 6 4
6 18 0 0 0 5
End_of_Relation
Relation Segment ; ()
1 19 0 0 2 0
2 20 0 0 3 1
3 21 0 0 4 2
4 22 0 0 5 3
5 23 0 0 6 4
6 24 0 0 7 5
7 25 0 0 8 6
8 26 0 0 9 7
9 27 0 0 10 8
10 28 0 0 11 9
11 29 0 0 12 10
12 30 0 0 13 11
13 31 0 0 14 12
14 32 0 0 15 13
15 33 0 0 16 14
16 34 0 0 0 15
End_of_Relation
Relation SylStructure ; ()
8 20 7 0 0 0
7 13 1 8 0 0
1 11 0 7 2 0
10 21 9 0 11 0
11 22 0 0 12 10
12 23 0 0 0 11
9 14 2 10 0 0
2 10 0 9 3 1
14 24 13 0 15 0
15 25 0 0 0 14
13 15 3 14 0 0
3 9 0 13 4 2
17 26 16 0 18 0
18 27 0 0 19 17
19 28 0 0 0 18
16 16 4 17 0 0
4 8 0 16 5 3
22 29 20 0 23 0
23 30 0 0 0 22
20 17 5 22 21 0
24 31 21 0 25 0
25 32 0 0 26 24
26 33 0 0 0 25
21 18 0 24 0 20
5 6 0 20 6 4
6 7 0 0 0 5
End_of_Relation
Relation IntEvent ; ()
1 35 0 0 2 0
2 36 0 0 3 1
3 37 0 0 4 2
4 38 0 0 0 3
End_of_Relation
Relation Intonation ; ()
5 35 1 0 0 0
1 14 0 5 2 0
6 36 2 0 0 0
2 15 0 6 3 1
7 37 3 0 0 0
3 16 0 7 4 2
8 38 4 0 0 0
4 17 0 8 0 3
End_of_Relation
Relation Target ; ()
12 56 1 0 0 0
1 19 0 12 2 0
13 55 2 0 0 0
2 20 0 13 3 1
14 54 3 0 0 0
3 21 0 14 4 2
15 51 4 0 16 0
16 52 0 0 17 15
17 53 0 0 0 16
4 23 0 15 5 3
18 50 5 0 0 0
5 24 0 18 6 4
19 47 6 0 20 0
20 48 0 0 21 19
21 49 0 0 0 20
6 25 0 19 7 5
22 46 7 0 0 0
7 26 0 22 8 6
23 43 8 0 24 0
24 44 0 0 25 23
25 45 0 0 0 24
8 28 0 23 9 7
26 42 9 0 0 0
9 29 0 26 10 8
27 40 10 0 28 0
28 41 0 0 0 27
10 30 0 27 11 9
29 39 11 0 0 0
11 33 0 29 0 10
End_of_Relation
Relation Unit ; grouped 1 ;
1 57 0 0 2 0
2 58 0 0 3 1
3 59 0 0 4 2
4 60 0 0 5 3
5 61 0 0 6 4
6 62 0 0 7 5
7 63 0 0 8 6
8 64 0 0 9 7
9 65 0 0 10 8
10 66 0 0 11 9
11 67 0 0 12 10
12 68 0 0 13 11
13 69 0 0 14 12
14 70 0 0 15 13
15 71 0 0 0 14
End_of_Relation
Relation SourceCoef ; ()
1 72 0 0 0 0
End_of_Relation
Relation f0 ; ()
1 73 0 0 0 0
End_of_Relation
Relation TargetCoef ; ()
1 74 0 0 2 0
2 75 0 0 0 1
End_of_Relation
Relation US_map ; ()
1 76 0 0 0 0
End_of_Relation
Relation Wave ; ()
1 77 0 0 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

Recording Prompts

Before launching the next command

./bin/prompt_them etc/uniphone.data

to start the automatic recording of the prompts, I uncommented the line USE_SOX=1 in the prompt_them script to use the SOX package on the Mac instead of the na_play / na_record programs.

Festival : recording prompts

Festival plays the synthesized prompt before each record and calculates the recording duration, based on the synthesis. The recorded audio files are saved into the wav folder.

As the recording in the required format 16.000 Hz, mono 16 bits was not possible, I did a manual recording with the Audacity app and replaced the audio files in the wav folder.

Audacity app to record prompts

Labeling

The labeling of the spoken prompts is done by matching the synthesized prompts with the spoken ones.

./bin/make_labs prompt_wav/*.wav

Festival : make_labs

The following files are created :

folder cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep
folder lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
folder prompt-cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep

The uniph_xxxx.cep files have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 425
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
...
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
..........

The uniph_xxxx.cep files have the following content :

separator ;
nfields 1
#
0.01500 26 pau
0.12500 26 ax
0.23500 26 hh
0.37000 26 ow
0.53000 26 l
0.65000 26 jh
0.87000 26 oy
1.08500 26 w
1.19500 26 aa
1.39500 26 z
1.45000 26 r
1.55500 26 iy
1.74500 26 p
1.91000 26 ih
2.07500 26 ng
2.12500 26 pau

The correct labeling can be checked with the WaveSurfer app.

WafeSurfer . checking labels

The uniph_xxxx.cep files have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 385
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
...
Channel_19 melcep_d_8
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
....

Creating Utterances

After labeling the utterance structure is created with the command

festival -b festvox/build_clunits.scm '(build_utts
"etc/uniphone.data")'

Festival : build_utts

The 3 files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt are saved in the folder festival/utts. They have the following type of content :

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 94 ; type Text ; 
iform "\"a whole joy was reaping.\"" ; 
filename prompt-utt/uniph_0001.utt ; 
fileid uniph_0001 ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
...
1 76 0 0 0 0
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 77 0 2 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

txt.done.data

The next scripts are looking for the txt.done.data file instead of the uniphone.data file. Copying the uniphone.data file and renaming it to txt.done.data solves this problem.

Extracting pitchmarks

The simplest way to extract the pitchmarks from the records is to use the command

./bin/make_pm_wave wav/*.wav

without tuning any parameters.

Festival : make_pm

The 3 files uniph_0001.pm, uniph_0002.pm and uniph_0003.pm are saved in the folder pm. They have the following type of content :

EST_File Track
DataType ascii
NumFrames 271
NumChannels 0
NumAuxChannels 0
EqualSpace 0
BreaksPresent true
EST_Header_End
0.016750 1
0.023312 1
0.030125 1
0.037250 1
0.044625 1
.....
2.070750 1
2.081512 1
2.092275 1
2.103038 1
2.113800 1
2.124563 1

Find Mel Frequency Cepstral Coefficients

In the next stage the Mel Frequency Cepstral Coefficients are defined synchronously with the pitch periods

./bin/make_mcep wav/*.wav

Festival : make_mcep

The 3 files uniph_0001.mcep, uniph_0002.mcep and uniph_0003.mcep are saved in the folder mcep. They have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 271
NumChannels 12
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
Channel_6 melcep_7
Channel_7 melcep_8
Channel_8 melcep_9
Channel_9 melcep_10
Channel_10 melcep_11
Channel_11 melcep_N
EST_Header_End
........

Building Synthesizer

Building the cluster unit selection synthesizer is the main part of the voice creation. It’s done with the command

festival -b festvox/build_clunits.scm '(build_clunits 
"etc/uniphone.data")'

Festival : build synthesizer (click to enlarge)

The following files are created :

folder festival/clunits : file mbarnig_en_marco.catalogue
folder festival/feats : 41 files [phoneme-name].feats
folder festival/trees : 41 files [phoneme-name].tree
folder festival/trees : file mbarnig_en_marco.tree

The file mbarnig_en_marco.catalogue has the following type of content :

EST_File index
DataType ascii
NumEntries 46
IndexName mbarnig_en_marco
EST_Header_End
pau_5 uniph_0001 0.000000 0.007500 0.015000
ax_0 uniph_0001 0.015000 0.070000 0.125000
hh_0 uniph_0001 0.125000 0.180000 0.235000
ow_0 uniph_0001 0.235000 0.302500 0.370000
l_0 uniph_0001 0.370000 0.450000 0.530000
jh_0 uniph_0001 0.530000 0.590000 0.650000
oy_0 uniph_0001 0.650000 0.760000 0.870000
.....
er_0 uniph_0003 1.150000 1.205000 1.260000
m_0 uniph_0003 1.260000 1.320000 1.380000
ay_0 uniph_0003 1.380000 1.480000 1.580000
k_0 uniph_0003 1.580000 1.615000 1.650000
pau_0 uniph_0003 1.650000 1.650000 1.650000

A xx.feats file has the following type of content :

0 w - r 0 0 0 0 l + z - f 0 0 0 0 a + 0.11000001 128.271 130.7258 
126.721 1 coda coda onset 1 1 0 0 1 1 single oy + 0 2 d 3 + 0 0 0 0 
content content content

A xx.tree file has the following type of content :

((((0 0)) 0))
;; Right cluster 0 (0%) mean ranking 2 mean distance 0

The mbarnig_en_marco.tree file has the following type of content :

;; Autogenerated list of selection trees
;; db_dir "./"
;; db_dir "."
;; name mbarnig_en_marco
;; index_name mbarnig_en_marco
;; f0_join_weight 0
;; join_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; trees_dir "festival/trees/"
;; catalogue_dir "festival/clunits/"
;; coeffs_dir "mcep/"
;; coeffs_ext ".mcep"
;; clunit_name_feat lisp_mbarnig_en_marco::clunit_name
;; join_method windowed
;; continuity_weight 5
;; optimal_coupling 1
;; extend_selections 2
;; pm_coeffs_dir "mcep/"
;; pm_coeffs_ext ".mcep"
;; sig_dir "wav/"
;; sig_ext ".wav"
;; disttabs_dir "festival/disttabs/"
;; utts_dir "festival/utts/"
;; utts_ext ".utt"
;; dur_pen_weight 0
;; f0_pen_weight 0
;; get_stds_per_unit t
;; ac_left_context 0.8
;; ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; feats_dir "festival/feats/"
;; feats (occurid p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng 
p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox n.name n.ph_vc 
n.ph_ctype n.ph_vheight n.ph_vlng n.ph_vfront n.ph_vrnd n.ph_cplace 
n.ph_cvox segment_duration seg_pitch p.seg_pitch n.seg_pitch 
R:SylStructure.parent.stress seg_onsetcoda n.seg_onsetcoda 
p.seg_onsetcoda R:SylStructure.parent.accented pos_in_syl 
syl_initial syl_final R:SylStructure.parent.lisp_cg_break 
R:SylStructure.parent.R:Syllable.p.lisp_cg_break 
R:SylStructure.parent.position_type pp.name pp.ph_vc pp.ph_ctype 
pp.ph_vheight pp.ph_vlng pp.ph_vfront pp.ph_vrnd pp.ph_cplace 
pp.ph_cvox n.lisp_is_pau p.lisp_is_pau 
R:SylStructure.parent.parent.gpos 
R:SylStructure.parent.parent.R:Word.p.gpos 
R:SylStructure.parent.parent.R:Word.n.gpos)
;; wagon_field_desc "festival/clunits/all.desc"
;; wagon_progname "$ESTDIR/bin/wagon"
;; wagon_cluster_size 20
;; prune_reduce 0
;; cluster_prune_limit 40
;; files (uniph_0001 uniph_0002 uniph_0003)
(set! clunits_selection_trees '(
("k" ((((0 0)) 0)))
("ay" ((((0 0)) 0)))
("m" ((((0 0)) 0)))
("er" ((((0 0)) 0)))
("zh" ((((0 0)) 0)))
("ae" ((((0 0)) 0)))
("ch" ((((0 0)) 0)))
("eh" ((((0 0)) 0)))
....
("jh" ((((0 0)) 0)))
("l" ((((0 0)) 0)))
("ow" ((((0 0)) 0)))
("hh" ((((0 0)) 0)))
("ax" ((((0 0)) 0)))
("pau"
((((0 67.455) (1 100) (2 66.73) (3 100) (4 67.43) (5 100)) 83.6025)))
))

Install new voice

In the last step I created the folder

Festival-TTS/festival/lib/voices/english/mbarnig_en_marco_clunits/

and copied the following folders from the voice directory

Festival_TTS/festvox/mbarnig_en_marco/

into this folder :

festival
festvox
mcep
wav

We can now check if the voice is recognized

festival> (voice.list)

Festival : voice.list

load the new voice

festival> (voice_mbarnig_en_marco_clunits)

and test the synthesis

festival> (SayText "Hello Marco, how are you?")

Festival : SayText

It works as expected.

The following folders in the voice folder Festival-TTS/festvox/mbarnig_en_marco remained empty :

emu, including 2 empty subfolders lab_hlb and pm_hlb
f0
group
lar
lpc
phr
pm_lab
recording
scratch, including 2 empty subfolders lab and wav
syl
versions
wrd

Speech Corpora for TTS

Posted on February 2, 2015 by Marco Barnig

Speech Corpora

A speech corpus is a database of speech audio files and text transcriptions. In Speech technology, speech corpora are used to create voices for TTS (Text-to Speech) and to create acoustic models for speech recognition.

For a speech database to serve as the basis for constructing a synthetic voice, the recordings should be of studio quality and free of noise. Noise includes not just external sounds, but also unwanted breaths and clicks. The recorded utterances need to be phonetically balanced and the prosody of speech needs to be controlled so that the synthetic voice’s style of delivery is both consistent and appropriate. To satisfy these requirements it’s not sufficient to collect speech records, but you have to design a speech corpus for synthesis. The basic idea is to take a very large amount of text (millions of words) and automatically find nice utterances that match the following criteria :

phonetically and prosodically balanced
targeted toward an intended domain
easy to say by a speaker without mistakes
short enough for a speaker to be willing to say it

Some historic speech database projects are :

1986 : TIDIGITS
1986 : TIMIT
1992 : SWITCHBOARD
1994 : Boston University’s FM Radio News Corpus
1997 : CALLHOME
1998 : BABEL
2000 : AURORA

If the use of a designed speech corpus should be unrestricted, we need to start from a source of written material that does not impose any copyright. In the past one such source was the Gutenberg Project (PG), a volunteer effort to digitize and archive cultural works. It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. As most of these texts are at least 70 years old, we face the issue of language drift. Languages changed considerable over the last century and the related texts are often archaic. Today, dumps of Wikipedia are usually preferred as free sources to design a speech corpus for TTS.

In the next chapters some recent speech corpora projects are presented.

CMU-ARCTIC US Voice Databases

The CMU_ARCTIC databases were constructed in 2003 at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US English single speaker databases, designed for unit selection speech synthesis research. The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male and female speakers as well as other accented speakers. The distributions include 16KHz waveform and simultaneous EGG signals. Full phonetically labelling was performed by the CMU Sphinx using the FestVox based labelling scripts. No hand correction has been made. Runnable Festival Voices are included with the database distributions.

The following corpora are available :

US male (bdl) – 593 a files, 539 b files
US male (rms) – 593 a files, 539 b files
US Canadian male (jmk) – 593 a files, 539 b files
US Scottish male (awb) – 597 a files, 541 b files
US Indian male (ksp) – 593 a files, 539 b files
US female (slt) – 593 a files, 539 b files
US female (clb) – 593 a files, 539 b files

The typical structure of a CMU_ARCTIC database is shown below :

cmu_arctic database structure

The main folders are :

etc/ : list of prompts; one file txt.done.data included with all utterances
wav/ : recorded audio data (XXXX.wav); one file per utterance
lab/ : labelled text files (XXXX.lab); one file per utterance

ENST and UMPC French Speech Corpora

French speech corpora have been designed for synthesis in 2013 at Télécom ParisTech (ENST : École nationale supérieure des télécommunications) and by the Institut des Systèmes Intelligents et de Robotique (ISIR), Université Pierre et Marie Curie (UPMC).

The audio data is provided in the losslessly compressed FLAC format. The speaker were recorded at a 44.1 kHz or 48 kHz sampling rate, 16 bits per sample, in mono. No filters of any sort have been applied to the raw data. Phonetic labels, automatically obtained using forced alignment using the eHMM tool from Festvox 2.1, are provided as Xwaves .lab file with the fields ENDTIME (in seconds), NUMBER (no significance) and LABEL (variant of the SAMPA phonetic alphabet.

The following corpora are available :

ENST Camille : recorded by Camille Dianoux, a female native speaker of French
UPMC Pierre : recorded by Pierre Chauvin, a male native speaker of French
UPMC Jessica : recorded by Jessica Durand, a female native speaker of French

The GitHub repository of a typical ENST or UPMC corpus is shown below :

upmc speech corpus repository

The main folders are :

prompts/ : one text_xxxx.txt file per utterance (1.000 files)
labels/ : one text_xxxx.lab file per utterance (1.000 files)
audio.tar archive with one text_xxxx.flac file per utterance (1.000 files)

PAVOQUE German Corpus of Expressive Speech

A single speaker, multi-style corpus of German speech, with a large neutral subset, and subsets acting out four different expressive speaking styles, has been designed for synthesis in 2013 in the context of the SEMAINE and IDEAS4GAMES projects. PAVOQUE is the abbreviation for PArametrisation of prosody and VOice QUality for concatenative speech synthesis in view of Emotion expression. The speaker is Stefan Röttig, a male native speaker of German, trained as a professional actor and baritone opera singer.

The audio data is provided in the losslessly compressed FLAC format. The speaker was recorded at a 44.1 kHz sampling rate, 24 bits per sample, in mono. No filters of any sort have been applied to this raw data, but low-pass filtering at 50 Hz is recommended. The manually corrected phonetic labels are provided as Xwaves .lab files with the fields ENDTIME (in seconds), NUMBER (no significance) and LABEL (variant of the SAMPA phonetic alphabet.

The following corpora are available :

Neutral
Obadiah ist von Natur aus niedergeschlagen und blickt pessimistisch in die Zukunft.
Poker ist ein ausgekochter Pokerspieler. Er ist cool, ihn bringt nichts aus der Ruhe.
Poppy ist fröhlich, optimistisch und sieht das Gute in allen Dingen!
Spike ist aggressiv und geht keinem Streit aus dem Weg!

The GitHub repository of the PAVOQUE corpus is shown below :

pavoque speech corpus repository

The main folders are :

Text/ one axxx.txt file per utterance (total 4.242 files)
ManualLabels/Neutral/ one Xxxxx.lab file per utterance (total 3.126 files : 1.591 a-files, 1.423 e-files, 112 prudence-files)
ManualLabels/Obadia/ one Xxxxx.lab file per utterance (total 556 files : 124 obadia-files, 400-m files, 32 poker_d-files)
ManualLabels/Poker/ one Xxxxx.lab file per utterance (total 682 files : 282 poker_n-files, 400 m-files)
ManualLabels/Poppy/ one Xxxxx.lab file per utterance (total 584 files : 159 poppy-files, 400 m-files, 25 poker__f-files)
ManualLabels/Spike/ one Xxxxx.lab file per utterance (total 601 files : 151 spike-files, 400 m-files, 50 poker_a-files)
Recordings/Neutral.tar archive one Xxxx.wav file per utterance (3.126 files)
Recordings/Obadia.tar archive one Xxxx.wav file per utterance (556 files)
Recordings/Poker.tar archive one Xxxx.wav file per utterance (682 files)
Recordings/Poppy.tar archive one Xxxx.wav file per utterance (584 files)
Recordings//Spike.tar archive one Xxxx.wav file per utterance (601 files)

Praat

The label files (abcd.lab) can be opened in Praat using the command

{Praat Object} 
Open > Read from special tier file > Read IntervalTier from Xwaves ..

Festival Text-to-Speech Package

Posted on January 7, 2015 by Marco Barnig

Last update : April 22, 2015

Festival

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Alan W. Black is now professor in the Language Technology Institute at Carnegie Mellon University where substantial contributions have been provided to Festival. The program is written in C++.

To set-up a complete Festival Environment on OS X (Yosemite 10.10.2), four packages are required :

Festival-2.4 (file festival-2.4-release.tar)
Edinburgh Speech-Tools (file speech_tools-2.4-release.tar)
Festvox (file festvox-2.7.0-release.tar.gz)
Languages (example file : english festvox_kallpc16k.tar.gz)

To compile and install the packages, I got some guidance from a Linguistic Mystic (alias Will Styler). After unzipping, the files have been moved into a common folder Festival-TTS on the desktop with the following names :

festival
speech-tools
festvox

The language files are installed in the festival folder in the sub-folders lib/voices/english.

The packages have been compiled in the following sequence :

mbarnig$ cd Desktop/Festival-TTS/speech_tools
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make test
mbarnig$ make install

mbarnig$ cd Desktop/Festival-TTS/festival
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make install

mbarnig$ cd Desktop/Festival-TTS/festvox
mbarnig$ ./configure
mbarnig$ make

At the end the voice folder with the language files was moved to the festival/lib directory.

After updating Xcode to version 6.1.1 and installing the audiotools for Xcode 6.1, I checked that afplay is working :

afplay check

I checked also that the festival/lib/siteinit.scm file contains the following statements :

(Parameter.set ‘Audio_Required_Format ‘riff)
(Parameter.set ‘Audio_Method ‘Audio_Command)
(Parameter.set ‘Audio_Command “afplay $FILE”)

The following files have been downloaded from the festvox website, unzipped and moved to the festival/lib/dicts folder :

festlex_CMU.tar.gz
festlex_OALD.tar.gz
festlex_POSLEX.tar.gz

I added finally the following statements to the .bash_profile file located in the homefolder (/Users/mbarnig) :

export FESTIVALDIR=”/Users/mbarnig/Desktop/Festival-TTS/festival”
export PATH=”$FESTIVALDIR/bin:$PATH”
export ESTDIR=”/Users/mbarnig/Desktop/Festival-TTS/speech_tools”
export PATH=”$ESTDIR/bin:$PATH”
export FESTVOXDIR=”/Users/mbarnig/Desktop/Festival-TTS/festvox”

The festival tool can now be started in the terminal window with the command

mbarnig$ $FESTIVALDIR/bin/festival

Festival version 2.4

All seems to be working great!

Festival embeds a basic small Scheme (Lisp) interpreter (SIOD : Scheme In One Defun 3.0) written by George Carrett.

Festival works in two fundamental modes, command mode and text-to-speech (tts) mode. If Festival is started without arguments (or with the option –command), it enters the default command mode (prompt = festival>). Information included in paranthesis is treated as commands and is interpreted by the Scheme interpreter. The following commands are accepted:

festival> 
> (intro)   :  short spoken introduction
> (voice.list)   : list of available voices
> (set! utt1 (Utterance Text "Hello world"))   : 
           create an utterance and save it in a variable
> (utt.synth utt1)    : synthesize utterance to get a waveform
> (utt.play utt1) : send the synthesized waveform to the audio device
> (SayText "Good morning, welcome to Festival")   : 
           speak text (combination of the 3 preceding commands)
> (tts "myfile" nil)    : speak file instead of text
> (manual nil)  : show the content of the manual
> (manual "Accessing an utterance")  : show the section "utterance"
> (PhoneSet.list)   : show the currently defined phonesets
> (tts "doremi.xml" 'singing)  : an XML based mode for specifying 
           songs, both notes and duration
> (quit)   : exit

If Festival is started with the –tts option, it enters tts-mode. Information (in files or through standard input) is treated as text to be rendered as speech.

Other options available at the start of Festival are :

--language LANG   : set the default language to LANG.
--server   : enter server mode where Festival waits for clients on a 
    known port (default port : 1314); connected clients may send 
    commands (or text) to the server and expect waveforms back.
--script scriptfile  : run scriptfile as a Festival script file.
--heap NUMBER   : to increase the scheme heap.
--batch  : after processing file arguments do not become interactive.
--interactive  : after processing file arguments become interactive.

Script mode :

festival mbarnig$  examples/saytime
festival mbarnig$  text2wave myfile.txt -o myfile.wav

An updated Festival System Documentation with 34 chapters, edited in December 2014, is available at the festvox website.

The following Festival voices are available :

festvox_cmu_us_ahw_cg
festvox_cmu_us_aup_cg
festvox_cmu_us_awb_cg
festvox_cmu_us_axb_cg
festvox_cmu_us_bdl_cg
festvox_cmu_us_clb_cg
festvox_cmu_us_fem_cg
festvox_cmu_us_gka_cg
festvox_cmu_us_jmk_cg
festvox_cmu_us_ksp_cg
festvox_cmu_us_rms_cg
festvox_cmu_us_rxr_cg
festvox_cmu_us_slt_cg
festvox_kallpc16k
festvox_rablpc16k
Leopold : AustrianGerman
IMS German Festival
OGIgerman by CSLU
Swedish by SOL
Hindi

Hindi and German are examples of Festival languages/voices with different phone-features in the phone-set as in the standard us and english phone-sets.

Edinburgh Speech Tools

The Edinburgh Speech Tools Library is a collection of C++ class, functions and related programs for manipulating objects used in speech processing. It includes support for reading and writing waveforms, parameter files (LPC, Ceptra, F0) in various formats and converting between them. It also includes support for linguistic type objects and support for various label files and ngrams (with smoothing). In addition to the library a number of programs are included. An intonation library which includes a pitch tracker, smoother and labelling system (using the Tilt Labelling system), a classification and regression tree (CART) building program called wagon. Also there is growing support for various speech recognition classes such as decoders and HMMs.

An introduction to the Edinburgh Speech Tools is provided by Festvox.

Festvox

The Festvox project aims to make the building of new synthetic voices for Festival more systemic and better documented, by offering the following resources :

Documentation
Aids to building synthetic voices for limited domains
Specific scripts to build new voices
Example Speech Databases
Demos
Discussion list

Festival Variables

Festival provides a list of variables available for general use. This list is automatically generated from the documentation strings of the variables defined in the source code. A variable can be displayed with the print command at the festival prompt. Some examples are shown hereafter :

festival>
> (print festival_version) ; current version of the system
> (print *ostype*) ; operation system that Festival is running on
> (print lexdir) ; default directory of the lexicons
> (print SynthTypes) ; list of synthesis types and functions
> (print token.letter_pos) ; POS tag for individual letters
> (print token.punctuation) ; characters treated as punctuation
> (print voice-path) ; list of folders to look for voices
> (print voice_default) ; function to load the default voice

Festival Variables

Festival Functions

Festival provides a list of functions available for general use. This list is automatically generated from the documentation strings of the functions defined in the source code. A function is called at the Festival prompt. Some examples are shown hereafter :

festival>
> (pwd) ; return current directory
> (lex.list) ; list names of all currently defined lexicons
> (voice.list) ; list all potential voices in the system
> (lex.lookup WORD FEATURES) ; lookup word in current lexicon
> (lex.compile ENTRYFILE COMPILEFILE) ; compile lexical entries
> (PhoneSet.list) ; list all currently defined PhoneSets
> (quit) ; exit from Festival

Festival Functions

Utterance Access Methods

Festival provides a number of standard functions that allow to access parts of an utterance, to traverse through it and to extract features.

Three utterances access methods are of particular interest :

(utt.feat UTT FEATNAME)
returns the value of feature FEATNAME in UTT
(item.feat ITEM FEATNAME)
returns the value of feature FEATNAME in ITEM
(utt.features UTT RELATIONNAME FUNCLIST)
returns vectors of feature values for each item, listed in FUNCLIST and related
in RELATIONNAME in UTT

FEATNAME may be a

feature name ; example : (item.feat sylb ‘stress)
feature function name ; example : (item.feat sylb ‘pos_in_word)
pathname ; examples : (item.feat sylb ‘nn.stress)
(item.feat sylb ‘R:SylStructure.parent.word)

Notes :
sylb is a syllable item
R: is a relation operator

RELATIONNAME may be ‘Token, ‘Word, ‘Phrase, ‘Segment, ‘Syllable, etc

FUNCLIST is a list of items ; example : ‘(name pos)

Some examples are shown hereafter :

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(utt.feat utter 'type)
(item.feat tok 'nn.name)
(item.feat tok 'R:Token.daughter1.name)
(utt.features utter 'Word '(name pos p.pos n.pos))

Utterance access methods

More informations about feature functions as FEATNAME are provided in the next chapter.

Festival Feature Functions

Festival provides a list of basic feature functions available as FEATNAME in utterances. Most are only available for specific items. Some examples are shown hereafter, related to the corresponding items :

Token item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(item.name tok) ; first token
(item.feat tok 'name) ; first token
(item.feat tok 'n.name) ; second token
(item.feat tok 'nn.name) ; third token
(item.feat tok 'whitespace)
(item.feat tok 'prepunctuation)

Utterance Token

Word item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! wrd (item.next (utt.relation.first utter 'Word)))
(item.name wrd) ; second word
(item.feat wrd 'p.name) ; first word
(item.feat wrd 'cap)
(item.feat wrd 'word_duration)

Utterance Word

Segment item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! seg (item.prev (item.prev (utt.relation.last utter 'Segment))))
(item.name seg) ; third last segment
(item.feat seg 'n.name) ; second last segment
(item.feat seg 'seg_pitch)
(item.feat seg 'segment_end)
(item.feat seg 'R:SylStructure.parent.parent.name)

Utterance Segment

Syllable item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylb (utt.relation.first utter 'Syllable))
(item.features sylb) ; first syllable
(item.feat sylb 'asyl_out)
(item.feat sylb 'syl_midpitch)
(utt.features utter 'Syllable '(stress))
(item.feat sylb 'nn.stress) ; stress of third syllable
(item.feat sylb 'R:SylStructure.parent.name)
(item.feat sylb 'R:SylStructure.daughter1.name)
(item.feat sylb 'R:SylStructure.daughter2.name)

Utterance Syllable

SylStructure item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylst (item.prev (utt.relation.last utter 'SylStructure)))
(item.features sylst)
(item.feat sylst 'pos_index)
(item.feat sylst 'phrase_score)

Utterance SylStructure

Intonation item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! inton (utt.relation.first utter 'Intonation))
(item.features inton)
(item.feat inton 'id)

Utterance Intonation

Dumping features

Extracting basic features from a set of utterances is useful for most of the training techniques for TTS voice building. Festival provides a script dumpfeats in the festival/examples folder which does this task. The results can be saved in a single feature file or in separate files for each utterance. An example is shown below, the dumpfeats script was copied in the festival folder of my test voice mbarnig_lb_voxcg :

mbarnig$ ./dumpfeats -feats "(name p.name n.name)"
-relation Segment -output myfeats.txt utts/*.utt

Festival dumpfeats

Spectrograms and speech processing

Posted on October 10, 2014 by Marco Barnig

Last update : July 24, 2022

Spectrograms are visual representations of the spectrum of frequencies in a sound or other signal as they vary with time (or with some other variable). Spectrograms can be used to identify spoken words phonetically. The instrument that generates a spectrogram is called a spectrograph.

Spectrograms are approximated as a filterbank that results from a series of bandpass filters or calculated from the time signal using the Fast Fourier Transform (FFT).

FFT is an algorithm to compute the Discrete Fourier Transform (DFT) and its inverse. A significative parameter of the DFT is the choice of the Window Function. In signal processing, a window function is a mathematical function that is zero-valued outside of some chosen interval. The following window functions are common for spectrograms :

Hann window function : good all-round frequency-resolution and dynamic-range properties
Hamming window function : higher frequency resolution
Kaiser window function : higher dynamic range
Bartlett window function
Rectangular window function

I recorded a sound example.wav file with my name spoken three times, to use as test file for different spectrogram software programs.

Real-Time Spectrogram Software

There are some great software programs to perform a spectrogram for speech analysis in realtime or with recorded sound files :

Javascript Spectrogram
Wavesurfer
Spectrogram16
SFS / RTGRAM
Audacity
RTS
STRAIGHT
iSound

Javascript Spectrogram

Jan Schnupp, sensory neuroscientist, former Professor at the Department of Physiology, Anatomy and Genetics within the Division of Medical Sciences at the University of Oxford, developed an outstanding javascript program to calculate and display a real-time spectrogram in a webpage, from the input to the computer’s microphone. It requires a browser which supports HTML5 and web audio and it requires also WebRTC, which is supported in recent versions of Chrome, Firefox and Opera browsers. WebRTC is a free, open project that enables web browsers with Real-Time Communications (RTC) capabilities via simple JavaScript APIs.

Javascript realtime spectrogram with 3x voice input “Marco Barnig” by microphone

Jan Schnupp is currently Professor of Neuroscience at the City University of Hong Kong. He is also the author of the website howyourbrainworks.net offering free, accessible introductory online lecture courses to neuroscience.
[HTML1]

Wafesurfer

WaveSurfer is an open source multiplatform tool for sound visualization and manipulation. Typical applications are speech/sound analysis and sound annotation/transcription. WaveSurfer may be extended by plug-ins as well as embedded in other applications. A comprehensive user manual and numerous tutorials for Wavesurfer are available on the net.

WaveSurfer was developed at the Centre for Speech Technology (CCT) at the KTH Royal Institute of Technology in Sweden. The latest stable Windows release (1.8.8p6, May 7, 2020) and the source code of WaveSurfer can be downloaded from Sourceforge. The authors of Wavesurfer are Jonas Beskow and Kåre Sjölander.

Wavesurfer auto-calculated, auto-sized spectrogram

By right-clicking in the Wafesurfer pane, a pop-up window opens with menus to add more panes, to customize the configuration and to change the parameters for analysis. In the following rendering, the panes Waveform, Pitch Contour, Formant Plot and Transcription have been added to the spectrogram pane and to the Time Axis pane. The spectrogram frequency range was cut at 5 KHz.
[HTML1]

Wafesurfer customized

Two other panes can be selected: Power Plot and Data Plot. Additional specific panes can be created with plugins.

Wavesurfer uses the Snack Sound Toolkit created by Kåre Sjölander. There exist other software programs with the name Wavesurfer, for example wavesurfer.js, a customizable waveform audio visualization tool, built on top of Web Audio API and HTML5 Canvas by katspaugh.

Spectrogram16

Spectrogram16 is a calibrated, dual channel audio spectrum analyzer for Windows that can provide either a scrolling time-frequency display or a spectrum analyzer scope display in real time for any sound source connected to the sound card. A detailed user guide (51 pages) is joined to the program.

Spectrogram16 customized

The tool was created by Richard Horne, the founder of Visualization Software LLC. The company closed a few years ago. The WayBackMachine shows that Richard Horne announced in 2008 that version 16 of Spectrogram is now freeware (see also local copy). The software is still available from most free software download websites. Richard Horne, MS, who retired as a Civilian Electrical Engineer for the Navy, was member of the Management Team of Vocal Innovations.

The Spectrogram program was (and is still) appreciated by amateur radio operators for aligning ham receivers.

SFS / RTGRAM

RTGRAM is a free Windows program for displaying a real-time scrolling spectrographic display of an audio signal. With RTGRAM you can monitor the spectro-temporal characteristics of sounds being played into the computer’s microphone or line input ports. RTGRAM is optimised for speech signals and has options for different sampling rates, analysis bandwidths (wideband = 300 Hz, narrowband = 45 Hz), temporal resolution (time per pixel = 1 – 10 ms), dynamic range (30 – 70 dB) and colour maps.

RTGRAM realtime spectrogram with 3x voice input “Marco Barnig” by microphone

The current version of RTGRAM is 1.3, released in April 2010. It is part of the Speech Filing System (SFS) tools for speech research.

RTGRAM is free, but not public domain software, its intellectual property is owned by Mark Huckvale, University College London.

Audacity

Audacity is a free, open source, cross-platform software for recording and editing sounds. Audacity was started in May 2000 by Dominic Mazzoni and Roger Dannenberg at Carnegie Mellon University. The current version is 3.0.3, released on July 26, 2021.
[HTML1]

Audacity auto-calculated, auto-sized spectrogram

A huge documentation about Audacity with manuals, tutorials, tips, wikis, FAQ’s is available in several languages.

RTS ^tm

RTS (Real-Time Spectrogram) is a product of Engineering Design, founded in 1980 to address problems in instrumentation and measurement, physical acoustics, and digital signal analysis. Since 1984, Engineering Design has been the developer of the SIGNAL family of sound analysis software. RTS is highly integrated with SIGNAL.

STRAIGHT

STRAIGHT (Speech Transformation and Representation by Adaptive Interpolation of weiGHTed spectrogram) was originally designed to investigate human speech perception in terms of auditorily meaningful parametric domains. STRAIGHT is a tool for manipulating voice quality, timbre, pitch, speed and other attributes flexibly. The tool was invented by Hideki Kawahara when he was in the Advanced Telecommunications Research Institute International (ATR) in Japan. Hideki Kawahara is now Emeritus Professor from the Wakayama University, Japan.

iSound

Irman Abdić created an audio tool (iSound) for displaying spectrograms in real time using Sphinx-4 as part of his thesis at the Faculty of Mathematics, Natural Sciences and Information Technologies (FAMNIT) from Koper, Slovenia.

No Real-Time Spectrogram Software

Other great software programs to create no-realtime spectrograms of recorded voice samples are :

Praat
SoX
SFS / WASP
Sonogram Visible Speech

Praat

Praat (= talk in dutch) is a free scientific computer software package for the analysis of speech in phonetics. It was designed, and continues to be developed, by Paul Boersma and David Weenink of the Institute of Phonetics Sciences at the University of Amsterdam. Praat runs on a wide range of operating systems. The program also supports speech synthesis, including articulatory synthesis.
[HTML1]
Praat displays two windows : Praat Objects and Praat Picture.

Praat Objects Window

Praat Picture Window

The spectrogram can also be rendered in a customized window.

Praat customized window

The current version 6.1.51 of Praat was released on August 25, 2021. The source code for this release is available at Github. A huge documentation with FAQ’s, tutorials, publications, user guides is available for Praat. The plugins are located in the directory C:/Users/name/Praat/.

An outstanding plugin for Praat is EasyAlign. It is a user-friendly automatic phonetic alignment tool for continuous speech. It is possible to align speech from an orthographic or phonetic transcription. It requires a few minor manual steps and the result is a multi-level annotation within a TextGrid composed of phonetic, syllabic, lexical and utterance tiers. EasyAlign was developed by Jean-Philippe Goldman at the Department of Linguistics, University of Geneva.

SoX

SoX (Sound EXchange) is a free cross-platform command line utility that can convert various formats of computer audio files in to other formats. It can also apply various effects to these sound files and play and record audio files on most platforms. SoX is called the Swiss Army knife of sound processing programs.

SoX is written in standard C and was created in July 1991 by Lance Norskog. In May 1996, Chris Bagwell started to maintain and release updated versions of SoX. Throughout its history, SoX has had many contributing authors. Today Chris Bagwell is still the main developer.

The current Windows distribution is 14.4.2 released in February 22, 2015. The source code is available at Sourceforge.

SoX provides a very powerful spectrogram effect. The spectrogram is rendered in a png image-file and shows time in the x-axis, frequency in the y-axis and audio signal amplitude in the z-axis. The z-axis values are represented by the colour of the pixels in the x-y plane. The command

sox example.wav -n spectrogram

creates the following auto-calculated, auto-sized spectrogram :

SoX auto-calculated, auto-sized spectrogram

The main options to customize a spectrogram created with SoX are :


-x num : change the width of the spectrogram from its default value of 800px
-Y num : sets the total height of the spectrogram; the default value is 550px
-z num : sets the dynamic range from 20 to 180 dB; the default value is 120 dB
-q num : sets the z-axis quantisation (number of different colours)
-w name : select the window function; the default function is Hann
-l : creates a printer-friendly spectrogram with a light background
-a : suppress the display of the axis lines
-t text : set an image title
-c text : set an image comment (below and to the left of the image)
-o text : set the name of the output file; the default name is spectrogram.png
rate num k : analyse a small portion of the frequency domain (up to 1/2 num kHz)

[HTML1]
A customized rendering follows :

Customized SoX spectrogram

The customized SoX spectrogram was created with the following command :

sox example.wav -n rate 10k spectrogram -x 480 -y 240 -q 4 -c "www.web3.lu" 
-t "SoX Spectrogram of the triple speech sound Marco Barnig"

WASP

WASP is a free Windows program for the recording, display and analysis of speech. With WASP you can record and replay speech signals, save them and reload them from disk, edit annotations, and display spectrograms and a fundamental frequency track. WASP is a simple application that is complete in itself, but which is also designed to be compatible with the Speech Filing System (SFS) tools for speech research. The current version 1.80 was released in June 2020.
[HTML1]
The following figure shows a customized WASP window with a speech waveform pane, a wideband spectrogram, a pitch track and annotations.

WASP customized spectrogram with pitch and annotation tracks

WASP is free, but not public domain software, its intellectual property is owned by Mark Huckvale, University College London.
[HTML1]

Sonogram Visible Speech

Sonogram Visible Speech is a very advanced program for sound, music and speech analysis. It provides multiple tools to perform various transformations and spectral studies on audio signals and to display the results in numerous panels : perspectogram, pitch, wavelet, cepstrum, 3D plots, auto-correlation charts etc.

In short terms, Sonogram is a powerful and complex audio spectrum analyzer with a comprehensive GUI layout.

Sonogram Visible Speech Main Window

Sonogram is programmed in Java and needs Java Runtime in version 16 at least. It runs in Windows, MacOS and Unix/Linux. The current version 5 has been released in August 18, 2021. The source code is available at Github. The next figure shows the start of the program in a Linux terminal.

Program Start with Sonogram.sh

The following files show the help-, settings- and info-panels:

Sonogram Online Help

Sonogram Settings

Sonogram Detailed Info

Sonogram includes a 3D-chart to present processed sound signals in three dimensions and a convenient audio recorder.

3D Perspectogram

Sonogram Audio Recorder

Sonogram Visible Speech was programed from 2000 to 2021 by Christoph Lauer. When he started the project he worked at the DFKI (Deutsches Forschungsinstitut für künstliche Intelligenz) in Saarbrücken. In December 2007 he joined the Saarland University as a scientific assistant, two years later the IDTM (Fraunhofer Institute for Digital Media Technology) as an Audio DSP Researcher. Since 2014 Christoph Lauer works as a Machine Learning Researcher for the BMW Group.

Specific Spectrogram Software

Spectrograms can also be used for teaching, artistic or other curious purposes :

FaroSon
SpectroTyper
ImageSpectrogram

FaroSon

FaroSon (The Auditory Lighthouse), is a Windows program for the real-time conversion of sound into a coloured pattern representing loudness, pitch and timbre. The loudness of the sound is reflected in the brightness and saturation of the colours. The timbre of the sound is reflected in the colours themselves: sounds with predominantly bass character have a red colour, while sounds with a predominantly treble character have a blue colour. The pitch of the sound is reflected in the horizontal banding patterns: when the pitch of the sound is low, then the bands are large and far apart, and when it is high, the bands are narrow and close together. If the pitch of the sound is falling you see the bands diverge; when it is rising, you see the bands converge.

Faroson

FaroSon is free, but not public domain software, its intellectual property is owned by Mark Huckvale, University College London.

SpectroTyper

AudioCheck offers the Internet’s largest collection of online sound tests, test tones, and tone generators. Audiocheck provides a unique online tool called SpectroTyper to insert plain text into a .wav sound file. The downloaded file plays as cool-sounding computer-like tones and is secretly readable from a spectrogram view (linear frequency scale best). It can be used for fun, to hide easter eggs in a music production or to tag copyrighted audio material with own identifiers or source informations.

Here is the barnig_txt.wav sound file with my integrated name as an example, the result is shown below in the SoX spectrogram, created with the command :

sox barnig_txt.wav -n rate 10k spectrogram -x 480 -y 120

SoX Spectrogram of a sound with inserted text, synthesized with SpectroTyper

SpectroTyper and other audio tools and tone generators have been created by Stéphane Pigeon, a research engineer & sound designer from Belgium. He received the degree of electrical engineering from the Université Catholique de Louvain (UCL) in June 1994, with a specialization in signal processing. He finalized a PhD thesis in applied science in 1999. Then, Stéphane Pigeon joined the Royal Military Academy as a part-time researcher. In parallel, he worked as a consultant, exclusively for Roland Corporation in the area of the musical instrument market. He designed various audio-related websites, like AudioCeck.net started in 2007. He also released some iOS apps. His most succesful project is myNoise.net, started in 2013, which offers a unique collection of online noise generators.

ImageSpectrogram

Richard David James, best known by his stage name Aphex Twin, is an British electronic musician and composer. In 1999, he released Windowlicker as a single on Warp Records. In this record he synthesized his face as a sound, only viewable in a spectrogram.

Gavin Black (alias plurSKI) created a perl script to do the same : take a digital picture and convert it into a wave file. Creating a spectrogram of that file then reproduces the original picture.

[HTML1]
Here is the barnig_portrait.wav sound file with my integrated portrait as an example, the result is shown below in the SoX spectrogram, created with the command :

sox barnig_portrait.wav -n spectrogram -x 480 -y 480

SoX Spectrogram of a sound with inserted picture, synthesized with imageSpectrogram

On July 24, 2022, Scott Duplichan published an Audio SpectrumViewer for Windows on Sourceforge. During the development he used the wav-sample with my embedded portrait to test his realtime spectrum viewer. Scott found a converter to create a better image with a smaller wav-file.

The spectrum viewer app contains a demo folder with an audioFileImage subfolder where you can start batch-files to compare the original with the improved spectrum. The result with the new converter is shown in the following screen-shot:

Mary TTS (Text To Speech)

Posted on September 27, 2014 by Marco Barnig

Last update : January 5, 2017

MaryTTS is an open-source, multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University. It is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz GmbH).

Mary stands for Modular Architecture for Research in sYynthesis. The earliest version of MaryTTS was developed around 2000 by Marc Schröder. The current stable version is 5.2, released on September 15, 2016.

I installed Mary TTS on my Windows, Linux and Mac computers. On the Mac (OSX 10.10 Yosemite), version 5.1.2 of Mary TTS was placed on the desktop in the folders marytts-5.1.2 and marytts-builder-5.1.2. The Mary TTS Server is started first by opening a terminal window in the folder marytts-5.1 with the following command :

marytss-5.1.2 mbarnig$ bin/marytts-server.sh

To start the Mary TTS client with the related GUI, a second terminal window is opened in the same folder with the command :

marytss-5.1.2 mbarnig$ bin/marytts-client.sh

On Windows , the related scripts are marytts-server.bat and marytts-client.bat.

As the development version 5.2 of Mary TTS supports more languages and comes with toolkits for quickly adding support for new languages and for building unit selection and HMM-based synthesis voices, I downloaded a snapshot-zip-file from Github with the most recent source code. After unzipping, the source code was placed in the folder marytts-master on the desktop.

To compile Mary TTS from source on the Mac, the latest JAVA development version (jdk-8u31-macosx-x64.dmg) and Apache Maven (apache-maven-3.2.5-bin.tar.gz), a software project management and comprehension tool, are required.

On Mac, Java is installed in

/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/

and Maven is installed in

/usr/local/apache-maven/apache-maven-3.2.5

It is important to set the environment variables $JAVA_HOME, $M2_HOME and the $PATH to the correct values (export in /Users/mbarnig/.bash-profile).

The succesful installation of Java and Maven can be verified with the commands :

mbarnig$ java -version
mbarnig$ mvn --version

Mary TTS : Maven and Java versions

This looks good!

In the next step I compiled the Mary TTS source code by running the command

marytts-master mbarnig$ mvn install

in the top-level marytts-master folder. This build the system, run unit and integration tests, packaged the code and installed it in the following folders :

marytts-master/target/marytts-5.2.SNAPSHOT
marytss-master/target/marytss-builder-5.2-SNAPSHOT

The build took 2:55 minutes and was succesful, without errors or warnings.

Results of building MARYTTS 5.2 SNAPSHOT

The following modules have been compiled :

MaryTTS
marytts-common
marytts-signalproc
marytts-runtime
marytts-lang-de, en, te, tr, ru, it, fr, sv, lx (lx is a pseudo locale for a test language)
marytts-languages
marytts-client
marytts-builder
marytts-redstart
marytts-transcription
marytts-assembly with the sub-modules assembly-builder and assembly-runtime
voice_cmu_slt_hsmm

After checking the whole file structure, I started the Mary TTS 5.2 server

Mary TTS snapshot 5.2 Server

and the Mary TTS 5.2 client

Mary TTS Snapshot 5.2 client

did some trials with text to audio conversion in the GUI window

Mary TTS Client GUI

launched the Mary TTS 5.2 component installer

Mary TTS Component Installer

and finally installed some french, german and english available voices.

Mary TTS Voice Installer GUI

In the next step I will try to create my own voices and develop a voice for the luxembourgish language.

In January 2017, I updated my systems with the stable MaryTTS version 5.2 which supports the luxembourgish language.

eSpeak Formant Synthesizer

Posted on September 24, 2014 by Marco Barnig

Last update : November 2, 2014

eSpeak

eSpeak is a compact multi-platform multi-language open source speech synthesizer using a formant synthesis method.

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn Risc OS computers, developed by Jonathan Duddington in 1995. He is still the author of the current eSpeak version 1.48.12 released on November 1, 2014. The sources are available on Sourceforge.

eSpeak provides two methods of formant synthesis : the original eSpeak synthesizer and a Klatt synthesizer. It can also be used as a front end for MBROLA diphone voices. eSpeak can be used as a command-line program or as a shared library. On Windows, a SAPI5 version is also installed. eSpeak supports SSML (Speech Synthesis Marking Language) and uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

In formant synthesis, voiced speech (vowels and sonorant consonants) is created by using formants. Unvoiced consonants are created by using pre-recorded sounds. Voiced consonants are created as a mixture of a formant-based voiced sound in combination with a pre-recorded unvoiced sound. The eSpeakEditor allows to generate formant files for individual vowels and voiced consonants, based on a sequence of keyframes which define how the formant peaks (peaks in the frequency spectrum) vary during the sound. A sequence of formant frames can be created with a modified version of Praat, a free scientific computer software package for the analysis of speech in phonetics. The Praat formant frames, saved in a spectrum.dat file, can be converted to formant keyframes with eSpeakEdit.

To use eSpeak on the command line, type

espeak "Hello world"

There are plenty of command line options available, for instance to load from file, to adjust the volume, the pitch, the speed or the gaps between words, to select a voice or a language, etc.

To use the MBROLA voices in the Windows SAPI5 GUI or at the command line, they have to be installed during the setup of the program. It’s possible to rerun the setup to add additional voices. To list the available voices type

espeak --voices

eSpeak uses a master phoneme file containing the utility phonemes, the consonants and a schwa. The file is named phonemes (without extension) and located in the espeak/phsource program folder. The vowels are defined in the language specific phoneme files in text format. These files can also redefine consonants if you wish. The language specific phoneme text-files are located in the same espeak/phsource folder and must be referenced in the phonemes master file (see example for luxembourgish).

....
phonemetable lb base
include ph_luxembourgish

In addition to the specific phoneme file ph_luxembourgish (without extension), the following files are requested to add a new language, e.g. luxembourgish :

lb file (without extension) in the folder espeak/espeak-data/voices : a text file which in its simplest form contains only 2 lines :

name luxembourgish
language lb

lb_rules file (without extension) in the folder espeak/dictsource : a text file which contains the spelling-to-phoneme translation rules.

lb_list file (without extension) in the folder espeak/dictsource : a text file which contains pronunciations for special words (numbers, symbols, names, …).

The eSpeakEditor (espeakedit.exe) allows to compile the lb_ files into an lb_dict file (without extension) in the folder espeak/espeak-data and to add the new phonemes into the files phontab, phonindex and phondata in the same folder. These compiled files are used by eSpeak for the speech synthesis. The file phondata-manifest lists the type of data that has been compiled into the phondata file. The files dict_log and dict_phonemes provide informations about the phonemes used in the lb_rules and lb_dict files.

eSpeak applies tunes to model intonations depending on punctuation (questions, statements, attitudes, interaction). The tunes (s.. = full-stop, c.. = comma, q.. = question, e.. = exclamation) used for a language can be specified by using a tunes statement in the voice file.

tunes s1  c1  q1a  e1

The named tunes are defined in the text file espeak/phsource/intonation (without extension) and must be compiled for use by eSpeak with the espeakedit.exe program (menu : Compile intonation data).

meSpeak.js

Three years ago, Matthew Temple ported the eSpeak program from C++ to JavaScript using Emscripten : speak.js. Based on this Javascript project, Norbert Landsteiner from Austria created the meSpeak.js text-to-speech web library. The latest version is 1.9.6 released in February 2014.

meSpeak.js is supported by most browsers. It introduces loadable voice modules. The typical usage of the meSpeak.js library is shown below :

<!DOCTYPE html>
<html lang="en">
<head>
 <title>Bonjour le monde</title>
 <script type="text/javascript" src="mespeak.js"></script>
 <script type="text/javascript">
 meSpeak.loadConfig("mespeak_config.json");
 meSpeak.loadVoice("voices/fr.json");
 function speakIt() {
 meSpeak.speak("Bonjour le monde");
 }
 </script>
</head>
<body>
<h1>Try meSpeak.js</h1>
<button onclick="speakIt();">Speak It</button>
</body>
</html>

Click here to test this example.

The mespeak_config.json file contains the data of the phontab, phonindex, phondata and intonations files and the default configuration values (amplitude, pitch, …). This data is encoded as base64 octed stream. The voice.json file includes the id of the voice, the dictionary used and the corresponding binary data (base64 encoded) of these two files. There are various desktop or online Base64 Decoders and Encoders available on the net to create the required .json files (base64decode.org, motobit.com, activexdev.com, …).

meSpeak cam mix multiple parts (diiferent languages or voices) in a single utterance.meSpeak supports the Web Audio API (AudioContext) with internal wav files, Flash is used as a fallback.

Phonemes, phones, graphemes and visemes

Posted on September 20, 2014 by Marco Barnig

Phonemes

A phoneme is the smallest structural unit that distinguishes meaning in a language, studied in phonology (a branch of linguistics concerned with the systematic organization of sounds in languages). Linguistics is the scientific study of language. Phonemes are not the physical segments themselves, but are cognitive abstractions or categorizations of them. They are abstract, idealised sounds that are never pronounced and never heard. Phonemes are combined with other phonemes to form meaningful units such as words or morphemes.

A morpheme is the smallest meaningful (grammatical) unit in a language. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. The field of study dedicated to morphemes is called morphology.

Phones

Concrete speech sounds can be regarded as the realisation of phonemes by individual speakers, and are referred to as phones. A phone is a unit of speech sound in phonetics (another branch of linguistics that comprises the study of the sounds of human speech). Phones are represented with phonetic symbols. The IPA (International Phonetic Alphabet) is an alphabetic system of phonetic notation based primarily on the Latin alphabet. It was created by the International Phonetic Association as a standardized representation of the sounds of oral language.

In IPA transcription phones are conventionally placed between square brackets and phonemes are placed between slashes.

English Word : make
Phonetics : [meik]
Phonology : /me:k/   /maik/   /mei?/

A set of multiple possible phones, used to pronounce a single phoneme, is called an allophone in phonology.

Graphemes

Analogous to the phonemes of spoken languages, the smallest semantically distinguishing unit in a written language is called a grapheme. Graphemes include alphabetic letters, typographic ligatures, chinese characters, numerical digits, punctuation marks, and other individual symbols of any of the world’s writing systems.

Grapheme examples

In transcription graphemes are usually notated within angle brackets.

<a>  <W>  <5>  <i>  <解> <未>  <ق>

A grapheme is an abstract concept, it is represented by a specific shape in a specific typeface called a glyph. Different glyphs representing the same grapheme are called allographs.

In an ideal phonemic orthography, there would be a complete one-to-one correspondence between the graphemes and the phonemes of the language. English is highly non-phonemic, whereas Finnish come much closer to being consistent phonemic.

Visemes

A viseme is a generic facial shape that can be used to describe a particular sound. Visemes are for lipreaders, what phonemes are for listeners: the smallest standardized building blocks of words. However visemes and phonemes do not share a one-to-one correspondence.

Visemes

Google text to speech (TTS) with processing

Posted on April 17, 2013 by Marco Barnig

Referring to the post about Google STT, this post is related to Google speech synthesis with processing. Amnon Owed presented in November 2011 processing code snippets to make use of Google’s text-to-speech webservice. The idea was born in the processing forum.

The sketch makes use of the Minim library that comes with Processing. Minim is an audio library that uses the JavaSound API, a bit of Tritonus, and Javazoom’s MP3SPI to provide an easy to use audio library for people developing in the Processing environment. The author of Minim is Damien Di Fede (ddf), a creative coder and composer interested in interactive audio art and music games. In November 2009, Damien was joined by Anderson Mills who proposed and co-developed the UGen Framework for the library.

I use the Minim 2.1.0 beta version with this new UGen Framework. I installed the Minim library in the libraries folder in my sketchbook and deleted the integrated 2.0.2 version in the processing (2.0b8) folder modes/java/libraries.

Today I run succesful trials with the english, french and german Google TTS engine. I am impressed by the results.

Google speech to text (STT) with processing

Posted on April 17, 2013 by Marco Barnig

Processing is an open source programming language and environment for people who want to create images, animations, and interactions.

Florian Schulz, Interaction Design Student at FH Potsdam, presented a year ago in the processing forum a speech to text (STT) library, based on the Google API. The source code is available at GitHub, a project page provides additional informations. The library is based on an article of Mike Pultz, named Accessing Google Speech API / Chrome 11, published in March 2011.

I installed the library in my processing environment (version 2.0b8) and run the test examples with success. I did some trials with the french and german Google speech recognition engines. I am impressed by the results.

Additional informations about this topic are provided in the following link list :

Voice driven web applications

Posted on April 10, 2013 by Marco Barnig

Last update : July 17, 2013

The new JavaScript Web Speech API specified by W3C makes it easy to add speech recognition to a web page and to create voice driven web applications. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

The Web Speech specification was published by the Speech API Community Group, chaired by Glen Shires, software engineer at Google. The specification is not a W3C Standard nor is it on the W3C Standards Track.

A demo working in the Chrome browser 25 and later is available at the HTML5 rocks website.

There are two processes : Text-to-Speech (speech synthesis : TTS) and Speech-to-Text (speech recognition : ASR). There are at least three different approaches to synthesize text :

integrated : a TTS module is built into the OS, or a separately installed TTS engine can plug-in to the OS’s TTS module.
packaged : instead of requiring a separate install, a synthesizer and voices can be packaged and shipped with the application.
in the cloud : a web-service is used to synthesize text. The advantage of this is a more predictable and consistent voice quality, independent from the hardware and operation system used on the mobile client.

Concerning ASR, Wolf Paulus, an internationally experienced technologist and innovator, compared the performance (speed and accuracy) of the speech recognition systems developed by Google, Nuance, iSpeech and AT&T.

A HTML Speech XG Speech API Proposal, introduced by Microsoft to the HTML Speech Incubator Group, is available as unofficial draft at the W3C website.

A list of speech recognition software is available at Wikipedia. The main hosted speech applications are presented below :

iSpeech

iSpeech provides speech solutions for individuals and business, in different fields as mobiles, connected homes, automotive, publishing (audio books), e-learning and more. The solutions include Text-to-speech (TTS) and speech recognition (ASR).

iSpeech offers API’s and SDK for developers for different devices and programming languages (iPhone, Android, Blackberry, PHP, JAVA, Python, .NET, Flash, Ruby, Perl) and comprehensive documentations, integration guides, web samples and FAQ’s. iSpeech povides development keys to use the three servers :

Mobile Development
Mobile Production
Web/General/Desktop/Other Production

The applications must be configured to use the correct servers.To make the web/general key work, you need to buy credits. The low usage price is $0.02 per word (TTS) or per transaction (ASR).

An free iSpeech app for iOS devices (version 1.3.5 updated May 13, 2013) to convert text to speech with the best sounding voices is available at the iTune store. This app is powered by the iSpeech.org Text to Speech (TTS) software as a service (SaaS) API. Other apps for iOS and Android devices are listed at the iSpeech website. A Text-to-Speech demo is also available.

Nuance

Nuance Communications is a multinational computer software technology corporation, headquartered in Burlington, Massachusetts, that provides speech and imaging applications.

In August 2012, Nuance announced Nina, a collection of personal assistant technologies that will bring Siri-like functionality to customer service mobile apps.

Nuance provides the Dragon Mobile SDK to developers that joined the NDEV Dragon Mobile developer program. This creates a unique opportunity in the mobile developer ecosystem to power any application with Nuance’s proven, best-in-class Dragon Naturally Speaking voice recognition technology.

In joining NDEV Mobile, developers have free access to wrappers and widgets for simple application customization, all through a self-service website. Developers also have access to an on-line community forum for support, a variety of code samples and full documentation. Once an NDEV Mobile developer has integrated the SDK into their application, Nuance provides 90 days of free access to the cloud-based speech services to validate the power of speech recognition on their application. To put an application in production, a licence fee of 3.000 $ has to be prepaid.The low usage price is 0,009 $ per transaction.

The following platforms are supported :

Apple iOS
Android
Windows Phone
HTTP web services interface

A mobile assistant & voice app for iOS and Android is available in the iTunes at GooglePlay stores.

AT&T Watson Speech engine

AT&T offers a free speech development program to access the tools needed to build, test, onboard and certify applications across a range of devices, OSes and platforms.

There are three classes of functionality in the AT&T speech API family :

Speech to Text : 9 contexts are optimized to return the text of what the end users say. The text can be returned in multiple formats, including, JSON and XML.
Text to Speech : Male and female ‘characters’ are available for both English and Spanish.
Speech to Text Custom : the speech service is customized by sending a list of words or phrases commonly spoken by the end users to improve recognition of those unique words. The Grammar List supports 19 languages, the Generic with Hints supports English and Spanish.

The Call Management (Beta) API that is powered by Tropo™ exposes SMS and Voice Calling RESTful APIs, which enable app developers to create voice-enabled apps that send or receive calls, provide Interactive Voice Response (IVR) logic, Automatic Speech Recognition (ASR), Voice to Text (VTT), Text (SMS) integration, and more. SDK’s are available for HTML5 (Sencha Touch), Android, iOS and Microsoft. Tools are provided for key platforms, including Android, Brew MP, HTML5, RIM BlackBerry and Windows Phone.

The Speech API provides two methods for transcribing audio into text and one method for rendering text into audio. An AT&T Natural Voices Text-to-Speech Demo is availbale at the AT&T research website.

API access to the AT&T sandbox and production environments costs 99$ a year. The sandbox and production environments allow you to develop, test, and deploy applications using AT&T APIs, including 1 million points (one transaction = one point) each month to spend on any APIs they like. A US based credit card is required to charge 20$ for each additional group of 2,000 points exceeding one million. See the AT&T pricelist.

AT&T Application Resource Optimizer (ARO) is a free diagnostic tool for analyzing the performance of your mobile applications. It can help your app run faster and smarter by providing recommendations to help optimize your mobile application’s performance, speed, network impact and battery utilization.

Speech API FAQ’s as well as code samples, documents, tutorials, guides, SDK’s, tools, blogs, forums and more are available at the AT&T speech development website.

Google Speech API

The Google Speech API can be accessed safely through a Chrome browser using x-webkit-speech. Some people have reverse engineered the Google speech API for other uses on the web. The interface is free, but it is not an official public API.

On February 23, 2013, Google announced at the Chrome Blog that the new stable Chrome release includes support for the Web Speech API, which developers can use to integrate speech recognition capabilities into their web apps in more than 30 languages. A web speech API demo is available at the Google website. In the Peanut Gallery, you can add intertitles to old black-and-white movies simply by talking to Chrome.

The following list provides links to more informations about the Google speech API’s :

Google Text-to-Speech TTS support
Chrome TTS, Google
speech2text, by Todd Fisher
Voicecolor, by filosophy
Introducing the Google Translate app for iPhone, Google Mobile Blog

More speech applications from other suppliers are listed hereafter :

Apple Siri (rumored to be powered by Nuance)
eSpeak
Free TTS
Infovox
Ivona (Amazon)
Julius
Mozilla HTML5 Speech API
OpenEars
speechapi
Speechutil (free TTS conversion)
Sphinx-4 and CMU Sphinx
Voiceware
Voxeo

The Eclipse Voice Tools Project (VTP) allows you to build and run speech recognition application using industry standards such as VoiceXML and Speech Recognition Grammar Specification (SRGS).

1. Voice Folder

2. Clunits Setup

Phoneset Definition

Lexicon Creation

Building Prompts

Recording Prompts

Labeling

Creating Utterances

txt.done.data

Extracting pitchmarks

Find Mel Frequency Cepstral Coefficients

Building Synthesizer

Install new voice

Speech Corpora

CMU-ARCTIC US Voice Databases

ENST and UMPC French Speech Corpora

PAVOQUE German Corpus of Expressive Speech

Praat

Links :

Festival

Edinburgh Speech Tools

Festvox

Festival Variables

Festival Functions

Utterance Access Methods

Festival Feature Functions

Token item

Word item

Segment item

Syllable item

SylStructure item

Intonation item

Dumping features

Links

Real-Time Spectrogram Software

Javascript Spectrogram

Wafesurfer

Spectrogram16

SFS / RTGRAM

Audacity

RTS tm

STRAIGHT

iSound

No Real-Time Spectrogram Software

Praat

SoX

WASP

Sonogram Visible Speech

Specific Spectrogram Software

FaroSon

SpectroTyper

ImageSpectrogram

Links

eSpeak

meSpeak.js

Links

Phonemes

Phones

Graphemes

Visemes

Links

RTS ^tm