DeepMind’s WaveNet produces ‘significantly better’ text-to-speech




Google’s artificial-intelligence subsidiary DeepMind has made a big leap in computer-generated speech. Its text-to-speech engine WaveNet can speak English and Mandarin in a way that sounds almost human.

The self-learning neural network WaveNet produces raw audio and learned its ‘voice’ from training data containing tens of thousands of audio samples per second. A single WaveNet can capture the natural speaking style of different speakers and switch between them. Besides training on speech, the researchers also had WaveNet analyze music clips, and the model was able to create new, realistic-sounding piano tracks. The model also recognizes differences between phonemes, the smallest sound units that convey a difference in meaning.

WaveNet Sample

Example of the construction of one second of speech: up to 16,000 samples. Source: DeepMind

The researchers achieved these results by training WaveNet directly on raw waveforms, a method that is often avoided, they write on their blog. Audio consists of 16,000 samples per second or more, and to obtain good-sounding speech each sample has to be influenced in the right way by all the preceding ones. Because the samples are generated step by step in this way, a lot of computing power is needed. The chance that this technology will appear in consumer products any time soon is therefore small.
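To illustrate why this is so expensive: in an autoregressive model of this kind, every sample is predicted from the full history of samples generated so far, so producing one second of audio means running the model 16,000 times in sequence. The sketch below is a toy illustration of that loop, not DeepMind’s implementation; `predict_next` is a hypothetical stand-in for the trained network (here it just emits a sine tone).

```python
import math

SAMPLE_RATE = 16000  # samples per second, as mentioned in the article

def predict_next(history):
    """Hypothetical stand-in for the trained network.

    A real WaveNet would condition on the history of previous samples;
    here we just return a 440 Hz sine value for illustration.
    """
    t = len(history)
    return 0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)

def generate(n_samples):
    samples = []
    for _ in range(n_samples):
        # Each new sample is conditioned on everything generated so far:
        # one model evaluation per sample, strictly in sequence. This
        # serial dependency is what makes generation so costly.
        samples.append(predict_next(samples))
    return samples

audio = generate(SAMPLE_RATE)  # one second of audio = 16,000 steps
print(len(audio))
```

The key point is the serial loop: unlike models that emit a whole utterance at once, each step must wait for the previous one, which is why real-time synthesis was out of reach for consumer hardware at the time.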

