Speech-to-Text results optimization with Interactive Media’s solutions

February 10, 2022

A historical perspective

Interactive Media has offered Conversational AI solutions for many years, focusing on voice-enabled Virtual Agents. We deployed our first conversational Virtual Agents well before Conversational AI became a buzzword and self-service conversational deployments took off.

Having focused on voice since the beginning, we are keenly aware of the challenges that come with converting the spoken utterances coming from users into text that conversational systems can use.

This is because conversational AI Virtual Agents can hold a spoken conversation, for instance on the phone, but their AI brain works on text. So, they need to convert the sentences spoken by humans into their text counterpart, and the text that the system uses to answer back into speech.

Ten years ago, the options available on the market to interpret speech and convert it into text (ASR, Automatic Speech Recognition, or Speech-to-Text) were limited. One company, Nuance, dominated the field, having either developed its own technology or acquired smaller competitors in different countries to offer Speech-to-Text in different languages. So, initially Interactive Media relied on Nuance’s technology for all its voice-enabled Virtual Agent deployments.

Today’s landscape

The state of the technology is vastly different now. The wide adoption of AI has substantially changed the way machines interpret human speech, making Speech-to-Text systems much easier to develop and much better performing: transcription accuracy has improved significantly. Speech-to-Text offerings have exploded in number, and dozens of companies now provide the service, either directly from the public Cloud or integrated more tightly with speech applications.

However, speech is not the same for all people and applications. The variations are staggering. People speak in different ways depending on what they want, what is being asked of them, where they are in a conversation, and of course in dozens of different languages. Providing a Speech-to-Text service that effectively covers all the variations and parts of a conversation is exceedingly hard. So, inevitably, some services are better than others for specific tasks and languages.

Interactive Media’s approach to Speech-to-Text

Since Speech-to-Text is still integral to Interactive Media’s offer, we are constantly monitoring its advances and testing different services on a day-to-day basis. We have developed metrics and standardized test suites to decide which service to use for the benefit of our customers, depending on the use case, which dictates the task at hand, the settings, and the language.

What’s the benefit? We have found that the main general-purpose Speech-to-Text services have some weak points, for instance when the task is to fill in a form with numbers or alphanumeric strings. In this case the range of possible results is limited, but some services don’t seem to use this to their advantage and recognize digits no more accurately than general speech. A 95% recognition accuracy is usually enough to find out an intent, for instance. But when you need to take in a string of 10 digits at 95% accuracy per digit, you’ll get at least one digit wrong roughly 40% of the time (assuming independent errors, 0.95^10 ≈ 0.60).

However, other Speech-to-Text engines are optimized for recognizing digits or allow the user to define tight grammars that can help with the task. Using these engines, you can get a per-digit accuracy of up to 99%, which over 10 digits results in a 90% probability of getting the whole string right.
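To make the arithmetic behind these figures explicit, here is a minimal sketch. It assumes that recognition errors on individual digits are independent, which is a simplification, and the function name `string_accuracy` is ours, chosen only for illustration.

```python
def string_accuracy(per_symbol_accuracy: float, length: int) -> float:
    """Probability that every symbol in a string is recognized correctly,
    assuming errors on individual symbols are independent (a simplification)."""
    return per_symbol_accuracy ** length


# Compare a general-purpose engine (95%) with a digit-optimized one (99%)
# on a 10-digit string.
for acc in (0.95, 0.99):
    p = string_accuracy(acc, 10)
    print(f"per-digit {acc:.0%}: whole string right {p:.1%}, wrong {1 - p:.1%}")
```

Under this assumption, 95% per digit gives a whole-string success rate of about 60%, while 99% per digit gives about 90%.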

Similarly, there are other common tasks that need optimization for the Virtual Agent to be effective. Perhaps the most challenging one is transcribing an email address. Even human agents have a hard time with it, and the error rate is exceedingly high. Again, some Speech-to-Text services do better than others, and even a 5% difference makes it worthwhile to switch to a better-performing service mid-call if the volume of traffic is high enough.

So, we engineered our platform to use several of the best Speech-to-Text services, constantly testing the connected services and adding new ones as they become available. It’s a big task, but (we think) we are being fairly smart about it: we model conversations by defining categories of tasks that Virtual Agents must accomplish, and continuously test each of the services we integrate with using sample atomic interactions belonging to each category. This way, we derive scores for the various services for each task, in several languages.
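As a rough illustration of how per-task scores can be derived, the sketch below computes the fraction of exact transcriptions for each combination of service, task category, and language. It is not our production test suite: the engine names, the categories, the sample data, and the exact-match metric are all assumptions made up for the example.

```python
from collections import defaultdict


def score_services(results):
    """Return {(service, category, language): fraction of exact matches}.

    `results` is an iterable of tuples:
    (service, category, language, expected_text, transcript).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for service, category, language, expected, transcript in results:
        key = (service, category, language)
        totals[key] += 1
        # Exact match after trimming whitespace and ignoring case.
        if transcript.strip().lower() == expected.strip().lower():
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical test outcomes, invented for illustration only.
sample_results = [
    ("engine_a", "digits", "en", "4 1 5 9", "4 1 5 9"),
    ("engine_a", "digits", "en", "7 7 0 2", "7 7 0 2"),
    ("engine_b", "digits", "en", "4 1 5 9", "4 1 5 9"),
    ("engine_b", "digits", "en", "7 7 0 2", "7 7 0 too"),
]

scores = score_services(sample_results)
for key, value in sorted(scores.items()):
    print(key, value)
```

Exact match is a strict metric that suits form-filling tasks such as digit strings; for free speech, a softer measure such as word error rate would likely be a better fit.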

This would be academic without a way for the Virtual Agent application to tell us what to expect. So, we added this feature to all our services, provided by the PhoneMyBot and OMNIA platforms. The API lets the application specify the expected category of the user’s next utterance, based on the question being asked. So, for instance, if the system prompts the user to provide a numerical code, the service knows that the next utterance is most likely composed of numbers, and will use the Speech-to-Text engine that performs best at recognizing them.
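The selection step can be pictured roughly as follows. This is a hypothetical sketch, not the actual PhoneMyBot or OMNIA API: the `pick_engine` function, the engine names, the category labels, and the scores are all invented for illustration.

```python
# Hypothetical per-task scores: (engine, category, language) -> accuracy.
SCORES = {
    ("engine_a", "digits", "en"): 0.99,
    ("engine_b", "digits", "en"): 0.95,
    ("engine_a", "free_speech", "en"): 0.93,
    ("engine_b", "free_speech", "en"): 0.96,
}


def pick_engine(expected_category, language, scores=SCORES, default="engine_b"):
    """Pick the best-scoring engine for the expected utterance category.

    Falls back to `default` when no engine has been scored for the
    requested category and language.
    """
    candidates = {
        engine: score
        for (engine, category, lang), score in scores.items()
        if category == expected_category and lang == language
    }
    if not candidates:
        return default
    return max(candidates, key=candidates.get)


# The application has just prompted for a numerical code, so it declares
# that the next utterance is expected to be digits.
print(pick_engine("digits", "en"))       # engine_a
print(pick_engine("free_speech", "en"))  # engine_b
print(pick_engine("email", "en"))        # engine_b (no scores, default used)
```

The point of the design is that the choice is made per utterance rather than per call, so a single conversation can move between engines as the expected input changes.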

The difference in performance is substantial: if even 10% fewer calls have to be forwarded to human agents, especially when the task is simply collecting data from the customer, the customer experience is better and the ROI for our customers soars. That is the promise of Virtual Agents, delivered.
