Digital speech gets neural overhaul

Published on the 12/02/2021 | Written by Heather Wright

Voiceover actors and radio jocks need not apply…

We’ve seen a proliferation of ‘digital humans’ over the last two years. Now here come the eerily human ‘synthetic voices for your brand’.

This month Microsoft moved Custom Neural Voice, an Azure Cognitive Services offering that lets developers create synthetic voices with neural text-to-speech technology, to general availability. That general availability is, however, on an approval-only basis, so you’ll need to talk nicely to Microsoft to get access to the service.

“This new technology allows companies to spend a tenth of the effort traditionally needed to prepare training data,” says Microsoft.

Traditional text-to-speech (TTS) requires reams of voice data, in the range of 10,000 lines or more according to Microsoft, to produce a fluent voice model. TTS models with fewer recorded lines tend to sound noticeably robotic.

Neural TTS, harnessed by Custom Neural Voice, changes things, ‘learning’ the way phonetics are combined in natural human speech rather than using classical programming or statistical methods.

The result is highly realistic voices from just a small number of training audio samples. (You can listen to the examples here.)

Microsoft says companies can use the technology for customer service chatbots, voice assistants in appliances, cars and homes, online learning, audiobooks, assistive technology, real-time translation and public service announcements.

But the offering comes with a warning about the potential for abuse, which is why ‘general availability’ is limited access, with companies required to submit an ‘intake form’ before they can use the service.

“We are designing and releasing Custom Neural Voice with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteract the proliferation of harmful deepfakes and misleading content,” Microsoft says.

It has even served up a Code of Conduct listing prohibited use cases for Custom Neural Voice, among them that “This service must not be used to simulate the voice of politicians or government officials, even with their consent.”

Also included is a requirement that voice actors used by companies understand the technology and know that the customer is having a voice model created from their voice.

Custom Neural Voice consists of three components: Text Analyzer, Neural Acoustic Model and Neural Vocoder. Text is fed into the Text Analyzer and converted into a sequence of phonemes (the basic units of sound). That sequence is passed through the Neural Acoustic Model, which predicts acoustic features such as speaking style, speed and intonation, and the Neural Vocoder then converts those features into audible speech.
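
For readers who think in code, the three-stage flow described above can be pictured roughly as follows. This is a toy sketch of the data flow only, not Microsoft's implementation: the real components are trained neural networks, and every name here (TOY_LEXICON, AcousticFrame, analyze_text, predict_acoustics, vocode, synthesize) is hypothetical and invented for illustration. The 'acoustic model' simply assigns a fixed duration and a falling pitch, and the 'vocoder' emits plain sine waves.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical two-word pronunciation table; a real text analyzer uses a
# full lexicon plus learned rules.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class AcousticFrame:
    """Acoustic features predicted for one phoneme (toy version)."""
    phoneme: str
    duration_ms: int
    pitch_hz: float


def analyze_text(text: str) -> List[str]:
    """Stage 1: turn text into a phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Unknown words fall back to one symbol per letter.
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes


def predict_acoustics(phonemes: List[str], speed: float = 1.0,
                      base_pitch_hz: float = 120.0) -> List[AcousticFrame]:
    """Stage 2: stand-in for the acoustic model (speed and intonation)."""
    frames = []
    last = max(len(phonemes) - 1, 1)
    for i, phoneme in enumerate(phonemes):
        # Pitch falls by 20% across the utterance to mimic a statement.
        pitch = base_pitch_hz * (1.0 - 0.2 * i / last)
        frames.append(AcousticFrame(phoneme, int(80 / speed), pitch))
    return frames


def vocode(frames: List[AcousticFrame], sample_rate: int = 16000) -> List[float]:
    """Stage 3: stand-in for the vocoder; renders features as audio samples."""
    samples: List[float] = []
    for frame in frames:
        count = sample_rate * frame.duration_ms // 1000
        for k in range(count):
            samples.append(math.sin(2 * math.pi * frame.pitch_hz * k / sample_rate))
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the three stages: text -> phonemes -> features -> waveform."""
    return vocode(predict_acoustics(analyze_text(text)))


if __name__ == "__main__":
    audio = synthesize("Hello world")
    print(f"{len(audio)} samples generated")
```

The point of the sketch is the hand-off between stages: each one consumes the previous stage's output, which is why a neural system can swap in a learned model at each step without changing the overall shape of the pipeline.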

The offering has been in preview since September 2019 and Microsoft says it’s been used by the likes of AT&T, Duolingo, US insurance company Progressive and Swisscom.

AT&T used Custom Neural Voice to create a Bugs Bunny soundalike at its AT&T Experience Store in Dallas, where customers can interact with Bugs.

Microsoft isn’t alone in the synthetic speech field. Back in 2019 Google debuted its AI-synthesised WaveNet voices. In December 2020 it launched an initiative converting ebooks to audiobooks for publishers, using a combination of text-to-speech and WaveNet voices.

Amazon, meanwhile, has BrandVoice, which allows customers to engage the Amazon Polly team to build neural text-to-speech voices representing the customer’s brand persona.
