Baidu develops a text-to-speech system capable of imitating human voices accurately and instantaneously

China’s Baidu has announced its second-generation Deep Voice text-to-speech system, just three months after launching the first generation. Deep Voice 2 brings significant improvements that promise digital assistants able to interact with users as if they were real people.

In February, the Chinese search giant launched Deep Voice 1, a system that generates synthetic human speech using deep neural networks.

Baidu said that, unlike alternative text-to-speech systems, Deep Voice 1 worked in real time, synthesizing audio fast enough to be usable in interactive applications such as media and conversational interfaces, including digital assistants.

The company added that, by training deep neural networks that learn from large amounts of data and simple features, it had created an extremely flexible system for synthesizing high-quality speech in real time.

Although Deep Voice 1 could produce speech that was almost indistinguishable from an actual human voice on first hearing, the system was limited to learning one voice at a time and needed many hours of recordings to build each voice.

With the new Deep Voice 2, Baidu said, it took just three months to scale the system from 20 hours of speech and a single voice to hundreds of hours and hundreds of voices, with the ability to imitate each of them. The system can also learn hundreds of unique voices from less than half an hour of data per speaker while maintaining high audio quality.

The company explained that Deep Voice 2 learns to generate speech by finding the characteristics that different voices have in common. Unlike all previous text-to-speech systems, Deep Voice 2 learns these qualities from scratch, without any guidance on what makes voices distinguishable.

Baidu posted a set of samples from the Deep Voice 2 system, trained on nearly 100 speakers, in the research section of its website. Each speaker had a distinct rhythm of speech, tone, and pronunciation, and the system was able to imitate almost all of these traits.

Baidu believes that this technology will be useful for digital assistant services that are controlled by voice commands and respond by talking to their users. It also sees potential in text-to-speech applications such as e-books.


In: A Technology & Gadgets Asked By: [22655 Red Star Level]