
A team of researchers at Google has found a way to dramatically improve the cadence and intonation of computer-generated speech. It's a step toward the kind of sophisticated speech synthesis that has, to date, existed entirely within the realm of science fiction.

Computers, even when they speak, do not sound human. Even in science fiction, where such constraints need not apply, computers, androids, and robots ordinarily use stilted grammar, inaccurate pronunciation, or speak in harsh, mechanical tones. In TV shows and movies where artificial lifeforms speak naturally (the advanced Cylon models in the 2004 Battlestar Galactica reboot, for example), this capability is often used to play up why the artificial life forms represent a threat. The ability to speak naturally is often treated as a vital component of humanity. Mechanical life forms in Star Trek: The Next Generation and its various spin-offs almost always speak with mannerisms intended to convey their artificiality, even when their intentions are perfectly benign.

In the real world, programs like Dr. Sbaitso were often the first introduction computer users had to text-to-speech technology. You can hear what Creative Labs' text-to-speech technology sounded like below, circa 1990.

Modern technology has dramatically improved on this, but voice assistants like Alexa, Cortana, Google Assistant, and Siri would never be mistaken for a human save in very specific cases. A significant part of the reason we can tell when a computer is speaking rather than a person is the (mis)use of prosody: the pattern of intonation, tone, rhythm, and stress within a language.

There's an old joke about the importance of commas that compares two simple sentences to make its point: "It's time to eat Grandma" conveys a rather different meaning than "It's time to eat, Grandma." In this case, the comma conveys information about how the sentence should be pronounced and interpreted. Not all prosodic information is encoded via grammar, however, and teaching computers how to interpret and use this data has been a major stumbling block. Now, researchers across multiple Google teams have found a way to encode prosody information into the Tacotron text-to-speech (TTS) system.

Tacotron

We can't embed Google's speech samples directly, unfortunately, but it's worth visiting the page to hear how the new information impacts pronunciation and diction. Here's how Google describes the work:

We augment the Tacotron architecture with an additional prosody encoder that computes a low-dimensional embedding from a clip of human speech (the reference audio). This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits; these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference. The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.
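Google's post doesn't include code, but the description above maps onto a fairly standard encoder design: a network that compresses a reference spectrogram into a fixed-length vector, which the synthesizer then conditions on alongside the text. Here's a minimal, hypothetical PyTorch sketch of that idea; the class name, layer sizes, and the 128-dimensional embedding are illustrative assumptions, not details from the article.

```python
# A hedged sketch of a prosody/reference encoder in the spirit of the
# approach Google describes. All hyperparameters here are assumptions.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Compresses a reference mel spectrogram into a fixed-length
    prosody embedding, independent of the text being synthesized."""
    def __init__(self, n_mels: int = 80, embedding_dim: int = 128):
        super().__init__()
        # A 2D conv stack downsamples the spectrogram in time and frequency.
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1],
                          kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(len(channels) - 1)
        ])
        # Six stride-2 convs shrink the mel axis by 2**6 (ceiling division).
        reduced_mels = (n_mels + 2 ** 6 - 1) // 2 ** 6
        self.gru = nn.GRU(input_size=128 * reduced_mels,
                          hidden_size=embedding_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels); add a channel axis for the convs.
        x = self.convs(mel.unsqueeze(1))           # (B, 128, T', M')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, h = self.gru(x)                         # final state summarizes clip
        return torch.tanh(h.squeeze(0))            # (B, embedding_dim)

ref_mel = torch.randn(2, 400, 80)                  # two fake reference clips
print(ProsodyEncoder()(ref_mel).shape)             # torch.Size([2, 128])
```

In a full system, this embedding would be broadcast and concatenated onto the text encoder's outputs, so the decoder conditions on both phonetic content and the reference clip's prosody.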

There are samples and clips you can play to see how Tacotron handles various tasks. The researchers note they can transfer prosody even when the reference audio uses an accent not in Tacotron's training data. Even more importantly, they've found a way to model what they call latent "factors" of speech, allowing the prosody within any speech clip to be represented without requiring a reference audio clip. This expanded model can force Tacotron to apply specific speaking styles to make various statements sound happy, angry, or sad.
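One plausible reading of those latent "factors" is a small bank of learned style embeddings whose weighted combination stands in for a clip's prosody, so that at inference time the weights can be set directly instead of extracted from reference audio. The sketch below illustrates that mechanism; the article doesn't confirm this exact design, and the token count and dimensions are assumptions.

```python
# A hedged sketch of learned style "factors": a bank of embeddings whose
# softmax-weighted mixture replaces the reference-audio embedding.
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    def __init__(self, num_tokens: int = 10, token_dim: int = 128):
        super().__init__()
        # Learned style factors, trained end to end with the synthesizer.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (batch, num_tokens). During training they would come
        # from attending over a reference clip; at inference they can be
        # set by hand to dial a line toward, say, happy, angry, or sad.
        return torch.softmax(weights, dim=-1) @ torch.tanh(self.tokens)

style = StyleTokenLayer()
manual = torch.zeros(1, 10)
manual[0, 3] = 5.0                      # emphasize one hypothetical factor
print(style(manual).shape)              # torch.Size([1, 128])
```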

None of the clips sound completely human (there's still a degree of artificiality to the underlying presentation), but they're a substantial improvement on what's come before. Maybe the next Elder Scrolls game won't have to feature the same eight voice actors in approximately 40,000 different roles.