Google speech to text (STT) with Processing

Processing is an open source programming language and environment for people who want to create images, animations, and interactions.

Florian Schulz, an Interaction Design student at FH Potsdam, presented a speech to text (STT) library based on the Google API in the Processing forum a year ago. The source code is available on GitHub, and a project page provides additional information. The library is based on an article by Mike Pultz, Accessing Google Speech API / Chrome 11, published in March 2011.

I installed the library in my Processing environment (version 2.0b8) and ran the test examples successfully. I also ran some trials with the French and German Google speech recognition engines and was impressed by the results.

Additional information about this topic is provided in the following link list :

 

Voice driven web applications

Last update : July 17, 2013

The new JavaScript Web Speech API specified by W3C makes it easy to add speech recognition to a web page and to create voice driven web applications. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

The Web Speech specification was published by the Speech API Community Group, chaired by Glen Shires, software engineer at Google. The specification is not a W3C Standard nor is it on the W3C Standards Track.

A demo that works in Chrome 25 and later is available on the HTML5 Rocks website.

There are two processes : Text-to-Speech (speech synthesis : TTS) and Speech-to-Text (speech recognition : ASR). There are at least three different approaches to synthesize text :

  • integrated : a TTS module is built into the OS, or a separately installed TTS engine can plug into the OS’s TTS module.
  • packaged : instead of requiring a separate install, a synthesizer and voices can be packaged and shipped with the application.
  • in the cloud : a web service is used to synthesize text. The advantage is a more predictable and consistent voice quality, independent of the hardware and operating system used on the mobile client.

Concerning ASR, Wolf Paulus, an internationally experienced technologist and innovator, compared the performance (speed and accuracy) of the speech recognition systems developed by Google, Nuance, iSpeech and AT&T.

An HTML Speech XG Speech API Proposal, introduced by Microsoft to the HTML Speech Incubator Group, is available as an unofficial draft on the W3C website.

A list of speech recognition software is available at Wikipedia. The main hosted speech applications are presented below :

iSpeech

iSpeech provides speech solutions for individuals and businesses in fields such as mobile, connected homes, automotive, publishing (audio books), e-learning and more. The solutions include text-to-speech (TTS) and speech recognition (ASR).

iSpeech offers APIs and SDKs for developers for different devices and programming languages (iPhone, Android, Blackberry, PHP, JAVA, Python, .NET, Flash, Ruby, Perl) as well as comprehensive documentation, integration guides, web samples and FAQs. iSpeech provides development keys to use the three servers :

  • Mobile Development
  • Mobile Production
  • Web/General/Desktop/Other Production

The applications must be configured to use the correct servers. To make the web/general key work, you need to buy credits. The low-usage price is $0.02 per word (TTS) or per transaction (ASR).

A free iSpeech app for iOS devices (version 1.3.5, updated May 13, 2013) that converts text to speech is available in the iTunes Store. This app is powered by the iSpeech.org Text to Speech (TTS) software as a service (SaaS) API. Other apps for iOS and Android devices are listed on the iSpeech website. A Text-to-Speech demo is also available.

Nuance

Nuance Communications is a multinational computer software technology corporation, headquartered in Burlington, Massachusetts, that provides speech and imaging applications.

In August 2012, Nuance announced Nina, a collection of personal assistant technologies that will bring Siri-like functionality to customer service mobile apps.

Nuance provides the Dragon Mobile SDK to developers who join the NDEV Dragon Mobile developer program. According to Nuance, this lets developers power any mobile application with its Dragon NaturallySpeaking voice recognition technology.

On joining NDEV Mobile, developers get free access to wrappers and widgets for simple application customization, all through a self-service website. Developers also have access to an online community forum for support, a variety of code samples and full documentation. Once an NDEV Mobile developer has integrated the SDK into their application, Nuance provides 90 days of free access to the cloud-based speech services to validate speech recognition in that application. To put an application into production, a license fee of $3,000 has to be prepaid. The low-usage price is $0.009 per transaction.

The following platforms are supported :

  • Apple  iOS
  • Android
  • Windows Phone
  • HTTP web services interface

A mobile assistant & voice app for iOS and Android is available in the iTunes and Google Play stores.

AT&T Watson Speech engine

AT&T offers a free speech development program to access the tools needed to build, test, onboard and certify applications across a range of devices, OSes and platforms.

There are three classes of functionality in the AT&T speech API family :

  • Speech to Text : 9 contexts are optimized to return the text of what the end users say. The text can be returned in multiple formats, including JSON and XML.
  • Text to Speech : Male and female ‘characters’ are available for both English and Spanish.
  • Speech to Text Custom :  the speech service is customized by sending a list of words or phrases commonly spoken by the end users to improve recognition of those unique words. The Grammar List supports 19 languages, the Generic with Hints supports English and Spanish.

The Call Management (Beta) API that is powered by Tropo™ exposes SMS and Voice Calling RESTful APIs, which enable app developers to create voice-enabled apps that send or receive calls, provide Interactive Voice Response (IVR) logic, Automatic Speech Recognition (ASR), Voice to Text (VTT), Text (SMS) integration, and more. SDK’s are available for HTML5 (Sencha Touch), Android, iOS and Microsoft. Tools are provided for key platforms, including Android, Brew MP, HTML5, RIM BlackBerry and Windows Phone.

The Speech API provides two methods for transcribing audio into text and one method for rendering text into audio. An AT&T Natural Voices Text-to-Speech Demo is available on the AT&T research website.

API access to the AT&T sandbox and production environments costs $99 a year. The sandbox and production environments allow you to develop, test, and deploy applications using AT&T APIs, and include 1 million points (one transaction = one point) each month to spend on any APIs you like. A US-based credit card is required, and $20 is charged for each additional group of 2,000 points beyond the one million. See the AT&T pricelist.
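As a rough illustration of this pricing, the following Python sketch estimates a yearly bill. The function and constant names are hypothetical, and the sketch assumes that a partly used group of 2,000 points is billed as a whole group, which the price description above does not state.

```python
# Figures taken from the pricing paragraph above.
ANNUAL_FEE_USD = 99                 # yearly access fee
FREE_POINTS_PER_MONTH = 1_000_000   # points included each month
OVERAGE_GROUP_POINTS = 2_000        # extra points are sold in groups
OVERAGE_GROUP_USD = 20              # price of one extra group


def monthly_overage_usd(points_used):
    """Overage charge for one month (assumption: partial groups round up)."""
    extra = max(0, points_used - FREE_POINTS_PER_MONTH)
    groups = -(-extra // OVERAGE_GROUP_POINTS)  # integer ceiling division
    return groups * OVERAGE_GROUP_USD


def yearly_cost_usd(monthly_points):
    """Annual fee plus the overage of each month in `monthly_points`."""
    return ANNUAL_FEE_USD + sum(monthly_overage_usd(p) for p in monthly_points)
```

Under these assumptions, a month with 1,004,000 transactions exceeds the allowance by two groups and adds $40 to the bill.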

AT&T Application Resource Optimizer (ARO) is a free diagnostic tool for analyzing the performance of your mobile applications. It can help your app run faster and smarter by providing recommendations to help optimize your mobile application’s performance, speed, network impact and battery utilization.

Speech API FAQ’s as well as code samples, documents, tutorials, guides, SDK’s, tools, blogs, forums and more are available at the AT&T speech development website.

Google Speech API

The Google Speech API can be accessed safely through a Chrome browser using x-webkit-speech. Some people have reverse engineered the Google speech API for other uses on the web. The interface is free, but it is not an official public API.

On February 23, 2013, Google announced at the Chrome Blog that the new stable Chrome release includes support for the Web Speech API, which developers can use to integrate speech recognition capabilities into their web apps in more than 30 languages. A web speech API demo is available at the Google website. In the Peanut Gallery, you can add intertitles to old black-and-white movies simply by talking to Chrome.

The following list provides links to more information about the Google speech APIs :

More speech applications from other suppliers are listed hereafter :

The Eclipse Voice Tools Project (VTP) allows you to build and run speech recognition applications using industry standards such as VoiceXML and the Speech Recognition Grammar Specification (SRGS).

5.1 Surround Sound and FLAC

Suggested configuration for 5.1 music listening (Wikipedia)

Five point one (5.1) is the name for six-channel surround sound audio systems, most commonly used in commercial cinemas and home theaters. It uses 5 full-bandwidth channels (the “five”) and one low-frequency effects channel (the “point one”). The 5.1 system is used by Dolby Digital (AC3 codec), Sony Dynamic Digital Sound (SDDS), Digital Theater Systems (DTS), and Dolby Pro Logic II.

All 5.1 systems use the same speaker channels and configuration, having a front left (L) and right (R), a center channel (C), two surround channels (SL and SR) and a subwoofer (LFE).
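The naming scheme can be sketched in a few lines of Python. The `Channel` enum and `layout_name` function are hypothetical names chosen for illustration; the sketch only models the channel labels listed above and says nothing about the order in which a given codec stores them.

```python
from enum import Enum


class Channel(Enum):
    """The six speaker channels of a 5.1 system, as listed above."""
    L = "front left"
    R = "front right"
    C = "center"
    SL = "surround left"
    SR = "surround right"
    LFE = "low-frequency effects (subwoofer)"


def layout_name(channels):
    """Return the 'full.lfe' label, e.g. '5.1', for a list of channels."""
    lfe = sum(1 for c in channels if c is Channel.LFE)
    full = len(channels) - lfe
    return f"{full}.{lfe}"


print(layout_name(list(Channel)))  # 5 full-bandwidth channels + 1 LFE
```

A plain stereo pair, `[Channel.L, Channel.R]`, would be labelled 2.0 by the same rule.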

Audio files for 5.1 systems are often encoded with the lossless FLAC codec. FLAC is an open format with royalty-free licensing and a reference implementation which is free software. FLAC has support for metadata tagging, album cover art, and fast seeking.

Lossy compression and encoding schemes for digital audio are MP3 and its successor AAC (Advanced Audio Coding). AAC has been standardized by ISO and IEC, as part of the MPEG-2 and MPEG-4 specifications. AAC is the standard audio format for YouTube, Apple (iPhone, iPod, iPad, …) and Sony devices (Playstation, Walkman, …). AAC is more advanced than the Dolby Digital AC3 codec.

Online music : Last.fm, Deezer and Spotify

A renowned online music service is iTunes, based on SoundJam MP and launched by Apple in 2001. Jeff Robbin and Bill Kincaid developed SoundJam MP in 1998 with assistance from Dave Heller. They chose Casady & Greene to publish SoundJam MP. Jeff Robbin is now the vice president of consumer applications at Apple Inc and he remains the lead software designer for iTunes.

Other online music services are less known, among them Last.fm, Deezer and Spotify.

Last.fm is a music website, founded in the United Kingdom in 2002, acquired by CBS Interactive in May 2007. Using a music recommender system called Audioscrobbler, Last.fm builds a detailed profile of each user’s musical taste by recording details of the songs the user listens to. Audioscrobbler began as a computer science project of Richard Jones. Last.fm was founded in 2002 by Felix Miller, Martin Stiksel, Michael Breidenbruecker and Thomas Willomitzer as an internet radio station and music community site. Last.fm won the Europrix 2002 and was nominated for the Prix Ars Electronica in 2003. Last.fm and Audioscrobbler were merged in 2005 and are still active today. A new desktop player was released on January 15, 2013.

Deezer is a French web-based music streaming service. It allows users to listen to music on various devices. It currently has more than 20 million licensed tracks and over 30,000 radio channels. The first version of Deezer, called Blogmusik, was developed by Daniel Marhely in Paris in 2006. The company became successful in 2010 when it entered a partnership with Orange. Deezer has three account types : discovery (free), premium and premium-plus. Deezer was launched in Luxembourg in March 2012 in partnership with Tango.

Spotify is a commercial music streaming service providing DRM-protected content from a range of major and independent record labels, including Sony, EMI, Warner Music Group and Universal. The service was launched in October 2008 by Swedish startup Spotify AB. The company was founded by Daniel Ek and Martin Lorentzon. Since November 2012 the service has also been available in Luxembourg.

The system is currently accessible using Microsoft Windows, Mac OS X, Linux, iOS, Android, BlackBerry, Windows Mobile, Windows Phone, S60 (Symbian), Sonos, and other devices. Music can be browsed by artist, album, record label, genre, playlist, radio channels, as well as by direct searches. About 20 million songs have been available since December 2012. Some artists are missing because of licensing restrictions imposed by the record labels or by the artists. The Beatles, for example, are not available because of a digital distribution agreement that is exclusive to iTunes.

Three subscriptions, with trials, are available : open, unlimited, premium. A free service is only available upon invitation. Spotify operates under a so-called ‘Freemium’ model, which offers simple and basic services free for the user to try and more advanced or additional features at a premium price, based on the Open Music Model (OMM). The incorporation of DRM, however, diverges from the OMM.

In 2011 Spotify was named a Technology Pioneer by the World Economic Forum (WEF).

Vocaloids

Vocaloid is a singing synthesizer application, with its signal processing part (concatenative synthesis) developed through a joint research project between the Pompeu Fabra University in Spain and Japan’s Yamaha Corporation, who developed the software into a commercial product. Vocaloid enables users to synthesize singing by typing in lyrics and melody. The main parts of the Vocaloid  system are the Score Editor, the Singer Library and the Synthesis Engine. The project started in 2000, the first commercial Vocaloid version was presented by Yamaha at the Musikmesse in Germany in 2003 and the Vocaloid version 3 was launched in October 2011.

Each Vocaloid is sold as “a singer in a box” designed to act as a replacement for an actual singer. Today seven studios are involved in the production and distribution of Vocaloids : three of them create English Vocaloids, while the other four create only Japanese Vocaloids.

  • Zero-G (English virtual vocalists) : Zero-G Limited was founded in 1990, trading under the name Time+Space, by Ed Stratton and Julie Stratton. Zero-G rapidly became the largest distributor of soundware in the UK and one of the most critically acclaimed sound developers in the world.
  • PowerFX (English virtual vocalists) : PowerFX is a small recording company based in Stockholm, Sweden. The company has been producing music samples, loops and sound effects since 1995.
  • Crypton Future Music (Japanese and English virtual vocalists) : Crypton is a media company based in Sapporo, Japan, created in 1995. It develops, imports, and sells products for music, such as sound generator software, sampling CDs and DVDs, sound effect and background music libraries.
  • Internet Co. Ltd. (Japanese virtual vocalists) : Internet Co. is a software company based in Osaka, Japan. It is best known for the music sequencer Singer Song Writer and Niconico Movie Maker for the video sharing website Nico Nico Douga.
  • AH Software (Japanese virtual vocalists) : AH-Software is the software brand of AHS Co., Ltd., an importer of digital audio workstations and encoders in Tokyo, Japan. It is also known as the developer of Voiceroid, a speech synthesizer application only available in the Japanese language.
  • Bplats (Japanese virtual vocalists) : Bplats, Inc. is an application service provider (ASP) based in Tokyo, Japan. The company offers Software as a Service (SaaS) and Platform as a Service (PaaS) solutions, such as the Vocaloid series VY1 and a Vocaloid online shop.
  • Ki/oon Records (Japanese virtual vocalists) : Ki/oon Records is a Japanese record label, a subsidiary of Sony Music Japan.

Vocaloid mascots pictured : Hatsune, Kagamine, Leon, Sonika, Big AL and Nekomura.

A complete list of the Vocaloid products is available at the Wiki website. The marketing of the Vocaloids is done by the studios.

Just like any music synthesizer, the software is treated as a musical instrument and the vocals as sound, belonging to the software user. The mascots for the software can be used to create vocals for commercial or non-commercial use as long as the vocals do not offend public policy. On the other hand, copyrights to the mascot image and name belong to their respective studios and cannot be used without the consent of the studio that owns them.

There are a number of derivative products, for example Vocaloid-Flex, Vocal Listener, Miku Miku Dance, Project Diva and MMDAgent. An online Vocaloid service (NetVocaloid) in English and Japanese is available at the Y2 Project website.

The following virtual vocalists are the most famous :

A number of figurines and plush dolls were released for some of these singers, and some have their own Twitter, Facebook and MySpace accounts.

In Japan, Vocaloids have a great cultural impact and have led to a lot of legal implications. Vocaloid music is available on CDs, iTunes, AmazonMP3 etc. Open air concerts with virtual vocalists have been organized recently with great success :

  • 1st live concert (Animelo Summer Live) : August 22, 2009, Saitama Super Arena, Saitama, Japan
  • 2nd live concert (Mikufes 09) : August 31, 2009
  • 1st overseas concert (Anime Festival Asia) : November 21, 2009, Singapore
  • 3rd live concert (Miku no Hi Kanshasai 39’s Giving Day) : March 9, 2010, Odaiba, Tokyo, Japan
  • 1st American live concert : September 18, 2010, San Francisco, USA
  • Vocarock Festival : January 11, 2011
  • Vocaloid Festa : February 12, 2011
  • 4th live concert : March 9, 2011, Tokyo, Japan
  • 2nd American live concert : October 11, 2010, Viz Cinema, San Francisco, USA; screening in the New York Anime Festival
  • 3rd American live concert (Mikunopolis) : July 2, 2011, Nokia Theater, Anime Expo, Los Angeles, USA

During the concerts, 3D animations of the Vocaloid mascots are projected on a transparent screen, giving the effect of a pseudo-hologram. Videos of different Vocaloid concerts are available in the following YouTube playlist.

A similar software to Vocaloid, developed by Ameya/Ayame, is called UTAU and has been released as freeware. Cracked copies of Vocaloids are called Pocaloids.

Microsoft Tellme

Microsoft Tellme simplifies everyday tasks with the natural power of your voice. You can talk to your PC, tablet, phone, TV or car.

The results of the Microsoft Tellme technologies “Say it. Get it” are speech recognition and synthesis capabilities in products ranging from Xbox Kinect for fun to Microsoft Tellme IVR for customer care to Windows Phone 7 for life and work.

In Windows 7 you can use voice recognition to control your computer and to dictate and edit text. A guide on how to set up your computer for this task is available on the Microsoft website.

The technologies provided for business applications are Microsoft Tellme IVR and embedded speech features in Office, Lync and Exchange. Different platforms are available : cloud, server, desktop, phone.

To extend the built-in speech recognition functionality included in Windows on desktop, you can use Windows Speech Recognition Macros or, for more advanced uses, the Microsoft Speech API (SAPI).

SAPI has been an integral component of all Microsoft Windows versions since Windows 98. Microsoft Windows XP and Windows Server 2003 include SAPI version 5.1. Windows Vista and Windows Server 2008 include SAPI version 5.3, while Windows 7 includes SAPI version 5.4. Code written for SAPI 5.3 (Vista) will run on SAPI 5.4 (Windows 7) without recompiling.

Google Text-to-Speech (TTS) support

Last update : 30 April 2011

On November 16, 2009, Google announced on its official blog that English text-to-speech had been added to the translation tools. Google used eSpeak, an open source software speech synthesizer, for this service.

In May 2010, Google Translate added more audio languages, including Afrikaans, Albanian, Catalan, Chinese (Mandarin), Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Haitian Creole, Hindi, Hungarian, Icelandic, Indonesian, Italian, Latvian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swahili, Swedish, Turkish, Vietnamese and Welsh.

The speech audio is in MP3 format and is requested via a simple HTTP GET (REST) request. For English, an example URL is:

http://translate.google.com/translate_tts?tl=en&q=how are you?

The TTS web service restricts the text to 100 characters and returns 404 (Not Found) if the request includes a Referer header.
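The two constraints above (URL-encoded query text, at most 100 characters per request) can be sketched in Python as follows. The helper names `split_text` and `tts_urls` are hypothetical, and the sketch assumes the 100-character limit applies to the text before URL encoding. It only builds the request URLs for the endpoint quoted in this post; it does not send anything, and whether the service still answers is not checked here.

```python
from urllib.parse import urlencode

TTS_ENDPOINT = "http://translate.google.com/translate_tts"  # from the post
MAX_CHARS = 100  # per-request text limit mentioned above


def split_text(text, limit=MAX_CHARS):
    """Split text on whitespace into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split():
        if len(word) > limit:
            raise ValueError(f"single word longer than {limit} characters")
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def tts_urls(text, lang="en"):
    """Build one GET URL per chunk, with the query string URL-encoded."""
    return [
        f"{TTS_ENDPOINT}?{urlencode({'tl': lang, 'q': chunk})}"
        for chunk in split_text(text)
    ]


for url in tts_urls("how are you?"):
    print(url)
```

A client that fetches these URLs would also have to avoid sending a Referer header, since the service rejects such requests as described above.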

On December 3, 2010, Google acquired Phonetic Arts, a company specialised in speech synthesis. Phonetic Arts Limited delivers technology that generates natural expressive speech. The products include Phonetic Morpher, Phonetic LipSync and Phonetic Synthesizer. Phonetic Arts, formerly known as Tayvin 356 Limited, was founded in 2006 and is based in Cambridge, UK. The Phonetic Arts technology generates natural computer speech from small samples of recorded voice and should improve the voice output quality of Google’s text-to-speech applications.

Google does not only provide speech output tools, but also speech input tools (Voice Search, Voice Input, Voice Actions), mainly in relation with the mobile phone OS Android.

Version 11 of the Google Chrome browser includes the HTML5 Speech Input API.

An amusing application of the Google TTS system is the Google Translate Beatbox.

Dewplayer : a Flash MP3 player

Alsacréations, a web agency in Strasbourg, Alsace, specialized in designing websites that comply with the international W3C standards, has for several years offered a Flash MP3 audio player by Dew that is simple to install and to use.

Called Dewplayer, this player is distributed under a Creative Commons license, and its use is free of charge, even in a professional or commercial context.

An XHTML code generator is available on the website; it produces code to copy and paste according to the user’s needs. Using swfobject is recommended for embedding the player.

The player can be controlled with JavaScript, and many options are available. I have been using the player successfully for years. The most recent version is 1.9.6.

SoundFonts (.sf2)

SoundFont, a registered trademark of E-mu Systems, Inc., is a name that collectively refers to a file format and associated technology to synthesize audio in the context of computer music composition. The exclusive license for re-formatting and managing historical SoundFont content has been acquired by Digital Sound Factory.

A SoundFont file, or SoundFont bank, contains one or more sampled audio waveforms (or samples), which can be re-synthesized at different pitches and dynamic levels. SoundFont banks are related to MIDI devices and can be seamlessly used in place of General MIDI (GM) patches in many computer music sequencers.

The original SoundFont file format was developed in the early 1990s by E-mu Systems and Creative Labs (used in Sound Blaster AWE32). Files in this format conventionally have the file extension sbk. The SoundFont 2.0 version was released in 1996 and was fully disclosed as a public specification to make it an industry standard. New versions up to 2.4 have been released in recent years, and the new SoundFont files conventionally have the file extension sf2.
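Since the file extension is only a convention, a tool may prefer to look at the file content. The Python sketch below is a minimal check under one assumption that goes beyond this post: the public SoundFont 2 specification stores a bank as a RIFF container whose form type is sfbk. The function names are hypothetical, and the check reads only the 12-byte header, so it does not validate the rest of the bank.

```python
import struct


def read_riff_header(data):
    """Return (declared_size, form_type) from the first 12 bytes of `data`."""
    if len(data) < 12:
        raise ValueError("too short for a RIFF header")
    # 4-byte magic, little-endian 32-bit size, 4-byte form type
    magic, size, form = struct.unpack("<4sI4s", data[:12])
    if magic != b"RIFF":
        raise ValueError("not a RIFF container")
    return size, form


def looks_like_sf2(data):
    """True if the header is RIFF with form type 'sfbk' (header check only)."""
    try:
        _, form = read_riff_header(data)
    except ValueError:
        return False
    return form == b"sfbk"


# Synthetic 12-byte header used purely for demonstration.
sample = b"RIFF" + struct.pack("<I", 4) + b"sfbk"
print(looks_like_sf2(sample))
```

For a real bank, the same function could be applied to the first 12 bytes read from the file in binary mode.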

There are other sound formats available, e.g. the DownLoadable Sounds (DLS) format standardized by the MIDI Manufacturers Association (MMA), the DLS Level 2 and the Structured Audio Sample Bank Format (SASBF) standardized by the MPEG standards body in collaboration with MMA and MIT, and proprietary formats developed by Yamaha and other music companies. Nevertheless the sf2 SoundFonts became a de facto standard and are widely used today.

There are a lot of websites available that offer free and commercial sf2 soundfonts :

The following tools are best suited to use SoundFonts :

  • SynthFont : a free MIDI file player using SoundFonts
  • Viena : a free SoundFont editor
  • FluidSynth : an open source real-time software synthesizer used in several music applications
  • Gervill : a software sound synthesizer for use with the Java Sound API
  • SFPack and SFArk : archivers for SoundFont banks which use different compression techniques