Computer text-to-speech voices vs. voice-over actors

Not long ago, almost a hundred percent of the voice-over market was occupied by real voice actors. Computer text-to-speech conversion was used mainly for technical purposes in specific fields, such as robotics, computers, or GPS.

David Ciccarelli, co-founder and CEO of Voices.com, said a couple of years ago: “While there’s a time and place for synthetic voices to provide navigational prompts or brief instructions, communicating important messages with the intent to inform, educate, and inspire audiences should be left strictly to professional voice actors.”

But a lot has changed in recent years.

How the voice acting market has changed over the past 20 years

In 2018, Google presented an audio recording in which a very lively, natural voice calls a hairdresser and makes an appointment. In a second recording, the voice reserves a table at a restaurant. Both dialogs sounded completely natural, with pauses and interjections. The staff at the salon and the restaurant did not even guess that they were dealing with a synthesized voice and artificial intelligence.

“Human brains may not be able to distinguish a speaker’s voice from its morphed version,” said Nitesh Saxena, lead researcher on the study and the director of the University of Alabama at Birmingham's Security and Privacy in Emerging Computing and Networking Systems (SPIES) Lab.

The new synthesized voices have become much more realistic and now compete directly with voice actors.

Today many businesses use synthesized voices to convert text to speech. The fear of using computer TTS voices is fading, giving way to advantages such as the ability to get a high-quality voice-over quickly and inexpensively.

Computer voices have become more realistic, and the choice among them is growing. If you come across a robotic voice among them, such as the one used by Stephen Hawking, it is now the exception.

You might ask: what is the best app for text to speech? The highest-quality online text-to-speech software, predictably, comes from giants like Google, Microsoft, IBM and Amazon. The problem is that these platforms aren't designed for end users; they're meant as B2B solutions. So if you ask ‘how do I use Google text to speech’, the answer is that it won't be easy, even if you are tech-savvy. To start converting text to speech you'll need to create an account on the platform, adjust lots of settings, and even write code. There is no need for that, since you can access the best neural voices from Google, Microsoft, IBM and Amazon on the Kukarella website.

If you're interested, enjoy this short excursion into the history of synthetic voices:

Today you can choose not only the type of voice, but also the accent and intonation, and sometimes the temperament of the narrator. Computer voices are so natural and realistic that we often don't notice them in YouTube or TikTok videos or when we hear announcements at airports and train stations.

Top reasons why computer voices are in demand

Of course, the main reason computer voices are becoming popular is the ease with which they can be created and adjusted, and their low cost.

A professional actor can charge up to US$750 to create a voice-over for one sentence, and you also need time to schedule the recording session.

The voiceover rate guide offers more information about cost, but below is just one sample that shows how much on average it costs to create a voice-over for a commercial video.

[Image: sample of average rates for creating a voice-over for a commercial video]

On the Kukarella website, in a few seconds, you can convert the same phrase to speech in 100 different voices and it will cost pennies.

In addition to the price, the human factor should not be overlooked. If an actor is ill, the respiratory tract, and therefore the voice, suffers first. In general, any illness, even just weakness, invariably affects the voice. So you have to either wait for the actor's voice to recover or look for a replacement.

The computer voice is always fresh and tireless. Programmers, filmmakers, game developers, online educators and advertisers quickly calculated the benefits of new tools and started using synthesized voices.

Of course, if you need a voice-over for a complex emotional text or dialogue, it is not yet possible to replace a professional actor. In this case, the actor does not just read the text but creates a character, an image. So if you have a challenging dramatic task and a sufficient budget, a professional actor will handle it better than a computer voice.

“If I gave a computer a news article, it would do a reasonable job of rendering the words in the article,” says Andrew Breen, senior manager of Amazon's TTS research team in Cambridge, UK. “But it’s missing something. What is missing is the understanding of what is in the article, whether it’s good news or bad, and what is the focal point. It lacks that intuition.”

However, that is changing. Amazon researchers are working toward teaching computers to understand the meaning of a set of words and to speak those words with the appropriate affect. Already, computers can be taught to say the same sentence with varying kinds of inflection.

In the future, it’s possible they’ll recognize how they should be saying those words based simply on the context of the words, or the words themselves. “We want computers to be sensitive to the environment and to the listener, and adapt accordingly,” says Breen.

This future is really close. In the meantime, you can manually make the computer voice more realistic. You can control the emotional tone of a computer voice by adding effects such as sighs and pauses, stretching out some words, and emphasizing others. Many basic effects can be selected and added to your text in seconds.

Effects

Effects not only add realism to a voice, but also help make your message more emotionally charged. Read this comment from the “born salesman” Jordan Belfort, immortalized by Leonardo DiCaprio in The Wolf of Wall Street:

“How many times in your life have you lowered your voice to just above a whisper to tell someone a secret? Again, we’ve all done this a thousand times, because we intuitively know that a whisper intrigues people and draws them in, compelling them to listen more closely. The key here is modulation. You want to lower your voice, and then raise your voice; you want to speed up, and then slow down; you want to make a declarative statement, and then turn it into a question; you want to group certain words together, and then say others in staccato-like beats. It’s like you’re saying, ‘Listen, pal, this particular point is really important, and it’s something I really, really believe in, so you need to pay very close attention to it.’”

Just as you control your voice, you can control synthesized voices.

In some cases, this requires SSML (Speech Synthesis Markup Language), markup that tells the program where to speak faster, where to add a pause, and where to change pitch.
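To make this concrete, here is a minimal Python sketch of what such markup can look like. It uses three elements from the W3C SSML standard (speak, prosody and break). The helper names prosody, pause and speak are our own, chosen for illustration, and every text-to-speech service supports its own subset of SSML, so treat the attribute values as assumptions to check against your provider's documentation.

```python
from xml.sax.saxutils import escape


def prosody(text, rate="medium", pitch="medium"):
    """Wrap text in an SSML <prosody> element that sets speed and pitch."""
    return f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'


def pause(milliseconds):
    """Return an SSML <break> element of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def speak(*parts):
    """Join SSML fragments inside the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


# Hypothetical script: go faster through the background,
# pause, then slow down and lower the pitch for the key point.
ssml = speak(
    prosody("Here is the background.", rate="fast"),
    pause(400),
    prosody("And this is the part that matters.", rate="slow", pitch="low"),
)
print(ssml)
```

The resulting string is what you would paste or send to a service that accepts SSML input instead of plain text.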

In other cases, you just need to click the button for the effect you want and it will be applied to the text. On the Kukarella website, in addition to whispering, you will find many other effects, such as pitch changes, breathing, softness of voice, and emphasis.

As Jordan Belfort advises: Remember never to stay in any one tonality for too long, or else the prospect will become bored — or in scientific terms, habituate — and ultimately tune out.

How do I apply the effects?

By changing the rate of speech, you control the attention of your listener or audience. Excessive regularity makes speech boring, and people stop listening; too fast a tempo also distorts perception, because the listener does not have time to process the information. It is therefore best to choose a moderate, comfortable speech rate and vary it depending on what is being said.

Emphasis allows you to make semantic accents and draw the audience's attention to certain phrases.
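For readers who work with markup directly, emphasis can be sketched with the standard SSML emphasis element. The function name emphasize and the sample sentence are hypothetical, and not every voice honors the level attribute, so this is an illustration under those assumptions and not a recipe for any specific service.

```python
from xml.sax.saxutils import escape


def emphasize(sentence, phrase, level="strong"):
    """Wrap the first occurrence of `phrase` in an SSML <emphasis> element.

    If the phrase does not occur, the sentence is returned without emphasis.
    """
    before, found, after = sentence.partition(phrase)
    if not found:
        return f"<speak>{escape(sentence)}</speak>"
    return (
        f"<speak>{escape(before)}"
        f'<emphasis level="{level}">{escape(found)}</emphasis>'
        f"{escape(after)}</speak>"
    )


# Hypothetical sentence: stress the single word that carries the message.
print(emphasize("This offer ends today.", "today"))
```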

The breathing effect can create realism. We are all human, and we are used to hearing people breathe when they talk; it’s just part of natural conversation. If it's not there, we miss it. There are exceptions to this rule. In commercial advertising, breaths must be removed, because the advertiser does not want to lose the listener’s attention. For more details, read the article ‘Debreath Your Voiceovers the Human Way’.

Rhythm refers to the tempo and duration of spoken phrases. It helps you place meaningful pauses and break sentences correctly. For example, you can open a paragraph and build gradually toward the most important sentences. Read the following phrases aloud and feel the rhythm change:

Two / twenty - two / two hundred - twenty - two / two thousand - two hundred - twenty - two / twenty - two - thousand - two hundred - twenty - two

Pauses are not directly related to intonation, but they also help us influence the audience. A moment of silence is necessary both for the listener, who needs to process the information received, and for the speaker, who can use the pause to concentrate and rest.

Give your audience time to digest. A long-time voice actor shared this piece of advice he received at the beginning of his career. “I'd be doing a joke, and I wouldn't be getting a laugh, and they'd say, ‘Hey, kid. You're not timing it right. You're not giving the audience a chance to digest what you just said. You ought to say the first part of that line and then actually count to two, then say the rest of the line, and I bet you get a laugh.’ So I would take the two-count, and I would read the line, and I would get the laugh.”
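The same two-count can be approximated for a synthesized voice with the SSML break element. In this short Python sketch the function name with_beat is hypothetical, and treating "count to two" as a two-second pause is our assumption; the right length depends on the voice and the material.

```python
from xml.sax.saxutils import escape


def with_beat(first_part, rest, seconds=2):
    """Return SSML that pauses between the two halves of a line."""
    return (
        f"<speak>{escape(first_part)}"
        f'<break time="{seconds}s"/>'
        f"{escape(rest)}</speak>"
    )


# Say the first part, wait for the beat, then deliver the rest.
print(with_beat("Say the first part of the line,", "then say the rest."))
```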

Effects are what make voice acting realistic. Experiments show that we can distinguish an artificial voice if it speaks for more than one minute. The reason is that the voice always behaves the same: one speed, one timbre, one tonality. An actor will not behave like that. An actor changes intonation, speed, and pitch, and uses pauses of different lengths. Remember how differently various actors have delivered the phrase ‘To be or not to be’. For now, a computer voice cannot add that many nuances. But just give it time.
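One rough way to break up that sameness is to change the delivery from sentence to sentence. The Python sketch below cycles through a few rate, pitch, and pause presets. The presets, the function name vary_delivery, and the simple sentence splitter are all assumptions made for illustration; a real script deserves settings chosen by ear, line by line.

```python
import re
from itertools import cycle
from xml.sax.saxutils import escape

# Hypothetical presets: (rate, pitch, pause after the sentence in ms).
VARIATIONS = [
    ("medium", "medium", 300),
    ("slow", "low", 600),
    ("fast", "high", 200),
]


def vary_delivery(text):
    """Give each sentence a different rate, pitch, and trailing pause."""
    # Naive split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    parts = []
    for sentence, (rate, pitch, pause_ms) in zip(sentences, cycle(VARIATIONS)):
        parts.append(
            f'<prosody rate="{rate}" pitch="{pitch}">{escape(sentence)}</prosody>'
        )
        parts.append(f'<break time="{pause_ms}ms"/>')
    return "<speak>" + "".join(parts) + "</speak>"


print(vary_delivery("To be or not to be. That is the question. Or is it?"))
```

Cycling through fixed presets is still mechanical, of course; it only shows where the variation goes in the markup.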

Therefore, if you want the synthesized speech to sound believable, you will have to learn how to work with effects and experiment a lot.

It's hard to say who will be the first to create a truly realistic synthesized voice. But we can be sure of one thing: in a few more years we will not be able to tell who is talking to us on the phone, a real person or a computer. And the realism of the voices will be achieved primarily through the organic use of voice effects.