Speech recognition systems are easy to deceive through machine learning




Researchers at Salesforce presented their work on attacking Apple’s and Microsoft’s speech recognition systems at Black Hat. They used open source tools and machine learning to imitate someone’s voice with as little effort as possible.

The research focused on Siri and Microsoft’s speech recognition API. Its aim was to find the simplest possible method to fool these systems. The backdrop is that more and more services offer authentication based on a sentence that users have to speak. The researchers point out that there is a difference between recognition and authentication, but say that their approach can be extended to other systems that rely on previously known passphrases. Although they have not tested this, they say their research should serve as a warning that attacks will become simpler in the future.

During their presentation, the researchers, Azeem Aqil and John Seymour, first showed that the Microsoft API can be fooled by the Lyrebird service. This service lets users generate a digital version of their voice by speaking 30 phrases that are the same for everyone. The disadvantage is that exactly these sentences are required, so an attacker would need the target to say them. The researchers showed a clip from the film Sneakers, in which someone plays a recording of specific words obtained via social engineering to fool voice recognition. That served as an example of an approach that works but takes a lot of effort, which is why the researchers started looking for a simpler method.

For their purposes, they looked at two systems for generating voices on the basis of a dataset. On one side there was WaveNet from DeepMind and on the other there was Tacotron, which also comes from within Google parent company Alphabet. They chose the second option because it is much easier to work with than WaveNet, which would require a lot of tuning. The first version of Tacotron came out in April of last year, followed by a second version in December, which produces better results. However, the researchers still had to collect speech samples from their target.

In their example they assume that clips of the target can be found on YouTube. By selecting audio based on quality and transcribing it manually, they obtained about five to ten minutes of usable audio. They split that material into fragments of about ten seconds with ffmpeg. However, Tacotron requires at least 24 hours of data in total to imitate a voice, so the researchers looked for a way to artificially supplement their data. They did this by raising and lowering the pitch of the YouTube audio clips by a factor of between 0.8 and 1.2, which increased their dataset by a factor of 30.
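The segmentation and pitch augmentation described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' actual pipeline: the article only says that ffmpeg was used for splitting, so using ffmpeg for the pitch shift as well, the evenly spaced 30 factors, the 22050 Hz sample rate, and all file and function names are assumptions. The functions only build the command lines and do not run them.

```python
def pitch_factors(low=0.8, high=1.2, count=30):
    """Return `count` evenly spaced pitch factors from low to high (assumed spacing)."""
    step = (high - low) / (count - 1)
    return [round(low + i * step, 4) for i in range(count)]


def segment_command(src, out_pattern, seconds=10):
    """Build an ffmpeg call that cuts src into clips of roughly `seconds` each."""
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(seconds), "-c", "copy", out_pattern]


def pitch_shift_command(src, dst, factor, sample_rate=22050):
    """Build an ffmpeg call that shifts pitch by `factor` and keeps the duration.

    asetrate changes pitch and speed together; atempo undoes the speed change.
    The sample rate is an assumed value and must match the input file.
    """
    filters = (f"asetrate={round(sample_rate * factor)},"
               f"aresample={sample_rate},"
               f"atempo={1 / factor:.4f}")
    return ["ffmpeg", "-i", src, "-af", filters, dst]


def augmented_hours(source_minutes, multiplier):
    """Total amount of audio after augmentation, in hours."""
    return source_minutes * multiplier / 60


if __name__ == "__main__":
    factors = pitch_factors()
    # Hypothetical file names, for illustration only.
    print(segment_command("target.wav", "clip_%03d.wav"))
    print(pitch_shift_command("clip_000.wav", "clip_000_shifted.wav", factors[0]))
    print(augmented_hours(10, 30))
```

Note that ten minutes of audio multiplied by 30 comes to about five hours, which is still well short of the 24 hours mentioned, so augmentation alone does not close the gap.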

The final step in making their ten minutes of audio usable for an attack was transfer learning. This means that they first trained their Tacotron model for two days on a public dataset from the Blizzard Challenge. Then they replaced that public dataset with the audio clips of their target and trained for another day. That was enough to fool the aforementioned systems, even if it did not work every time. That problem could be solved with further tuning, one of the researchers explained to Tweakers. With the research they want to show that it is relatively easy to imitate someone’s voice and that doing so does not necessarily require large quantities of source material.
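The idea behind transfer learning can be shown with a deliberately tiny stand-in. The sketch below is not Tacotron and uses no speech data: it fits a one-variable linear model with plain gradient descent, first on a larger "public" dataset and then on a handful of "target" points, and compares that with training on the target points from scratch for the same number of steps. All datasets, step counts, and names are invented for illustration.

```python
def train(weights, data, steps, lr=0.1):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def loss(weights, data):
    """Mean squared error of the model on `data`."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Large "public" dataset: stands in for the Blizzard Challenge data.
public_data = [(i / 50, 2.0 * (i / 50) + 1.0) for i in range(-50, 51)]

# Small "target" dataset: similar to the public data, but not identical.
target_data = [(x, 2.2 * x + 1.3) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Step 1: long pre-training on the public data.
pretrained = train((0.0, 0.0), public_data, steps=500)

# Step 2: short fine-tuning on the target data, starting from the
# pre-trained weights instead of from zero.
fine_tuned = train(pretrained, target_data, steps=20)

# Baseline: the same short training on the target data from scratch.
from_scratch = train((0.0, 0.0), target_data, steps=20)

print("fine-tuned loss:  ", loss(fine_tuned, target_data))
print("from-scratch loss:", loss(from_scratch, target_data))
```

With the same short training budget on the target data, the model that starts from pre-trained weights ends up with a lower error than the one that starts from zero, which is the effect the researchers relied on to get by with very little audio of their target.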

