My view of Text-to-Speech (TTS) technology evolution in 3 fundamental steps

Written by Roberto Valente

The Author co-founded Interactive Media in 1996 and is the CEO of the company. Interactive Media is a global developer and vendor of speech applications.

Interactive Media has a long history of developing progressively more sophisticated speech applications, with more than 25 years of experience in text-to-speech. I started working in this field even earlier, though, and I have been involved in CTI (computer-telephony integration) since its beginning. Here I want to give a brief account of my first-hand experience.

In 1993 and the years that followed, I had the privilege of collaborating with CSELT, the pre-eminent Italian telecommunications lab in Torino. By then CSELT had already been working on Text-to-Speech technologies for decades.

Back in the 1970s, CSELT and AT&T were the only two companies developing TTS for commercial use. CSELT's first publicly demonstrated system was called MUSA. You can hear it speak in this video (in Italian):

In 1993 CSELT released Eloquens, also based on diphone concatenation (a diphone is the stretch of sound that runs from the middle of one phoneme to the middle of the next when we speak a word). Eloquens’ quality was much better than MUSA’s, and even now it can be considered a good-quality product. It is still in use for several applications. See for instance

Record with the nursery rhyme Fra Martino campanaro (Brother John), as sung by MUSA in 1978
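To make the diphone idea mentioned above more concrete, here is a minimal Python sketch. It is not the method used in MUSA or Eloquens; it only lists which diphone units a concatenative synthesizer would need to look up for a given phoneme sequence. The function name `to_diphones`, the `_` silence marker and the simplified phoneme labels are assumptions made for illustration.

```python
def to_diphones(phonemes):
    """Return the diphone units that span a phoneme sequence.

    Each unit runs from the middle of one phoneme to the middle of the next.
    A silence marker "_" (an assumption of this sketch) is padded on both
    sides so that the first and last phonemes also get a transition unit.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    # Pair every phoneme with its successor: n phonemes give n + 1 diphones.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Illustrative, simplified phoneme labels for the Italian word "ciao".
diphones = to_diphones(["tS", "a", "o"])
print(diphones)  # ['_-tS', 'tS-a', 'a-o', 'o-_']
```

A real concatenative system would then fetch a recorded waveform for each of these units and join them, which is where the audible seams of older TTS come from.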

The Eloquens software had been developed to run on a stand-alone PC. But CSELT, which was owned by the national telephone company, naturally wanted to use it on the telephone network. This is where I came in. At that time, I was a consultant for an Italian company that held the exclusive sales rights for Italy for the computer boards made by Natural Microsystems, an American company. These were among the first CTI boards that allowed a PC to communicate with the telephone network.

My role was to adapt the Eloquens software to run on the board’s DSPs, so that it could be used in IVR-type applications. I remember those days as an extraordinary period. Aside from the project, which was very interesting, I was a young engineer just out of university, spending a long period away from home for the first time. Torino was at that time a heavily industrial city: at 8:30 pm all the restaurants were empty, and no one was in the streets. The following day the factory sirens would go off before dawn to mark the start of a new working day. This was quite different from my hometown, Roma. I was working with Marcello Balestri and Luciano Nebbia’s group: they were excellent engineers, like most of the staff at CSELT. Together we were able to develop and release the first Italian version, and one of the first in the world, of a commercial TTS that could be used in an IVR system.

Even today, after 30 years, that software is still deployed in some companies. One reason is that only in the past few years has there been a substantial technological leap with noticeably better performance, thanks to the use of neural networks and in particular deep learning techniques. A neural network trained to perform TTS does not rely on diphone concatenation, and so it avoids the “pixelation” that is still present in older systems. With deep learning the prosody is practically perfect, and people sometimes cannot tell a synthetic voice apart from the original human speaker.

One interesting capability of this technology is the ability to create one’s own synthetic voice by recording a few hours of audio, for instance by reading a text aloud. Among the most otherworldly applications is the use of a synthetic voice to create a digital persona for a person, even after that person has passed away.

To speak of more worldly affairs, Interactive Media recently won a contract to produce all the audio responses in TIM Brazil’s customer service systems, using a Neural TTS from Microsoft. The resulting quality is amazing, and the caller has the feeling that the speaker is a person: polite, sympathetic and helpful, while still professional-sounding. We at Interactive Media are ready to build on this experience, and on the know-how that we have accumulated over 25 years, in all other markets. Please contact us if the voice that you use to talk with your customers is important to you.

