A team at the University of Waterloo says it has produced a system that is 99% effective against synthetic speech detection countermeasures used to protect Voice Biometric authentication schemes.
When I read the headline in The Register this morning, I almost choked on my coffee, so I spent some time trying to understand the real-world implications of this research. They published their full paper here (Requires payment to download).
With the increasing threat posed by Deepfakes to Voice Biometrics authentication services (see our community video here), one of several mitigation strategies is using synthetic speech detection algorithms. These systems are trained to pick up on the unique characteristics of the text-to-speech (TTS) generation process and raise an alarm if a sample is sufficiently different from normal human speech. Many of these characteristics are well known, but as they are imperceptible to human ears, commercial synthetic speech services such as ElevenLabs, Resemble, Microsoft, Google and Amazon et al. have little incentive to change. It was, however, inevitable that someone would try to eliminate them to circumvent these systems.
What they did
The team took a subset of genuine and synthetic speech samples used in the ASV Spoof 2019 synthetic speech detection challenge. They ran these through a tool they developed that removed the key TTS artefacts. They then evaluated these new samples using various voice biometric authentication and synthetic speech detection algorithms. Finally, they asked a panel of humans to consider each sample for realism.
The Synthetic Speech Masking Tool
The synthetic speech detection avoidance or masking tool focused on six key attributes of text-to-speech generation:
- Leading and Trailing Silence – Adding realistic noise at the beginning and end of an utterance to simulate real-world conditions
- Inter-word silence – Adding realistic noise to silence between words.
- Low centre spectrum energy – Boosting this area that is not prioritised by deepfake audio systems
- Local Echo – Adding echo to simulate the use of a microphone for recording
- Amplifying lower-frequency content – Removing some of the features added by synthetic voice generation to improve transmission.
- Noise Reduction – Removing characteristics most likely left by recording devices used during model training.
Finally, using an adversarial speaker verification model, the team developed a mechanism to “engrave” samples with elements of the genuine speaker’s voice print. However, the specific mechanism for this is not clear. This step didn’t add an awful lot to the success rate of the other approaches.
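To make the simpler items in the list above more concrete, here is a minimal sketch of two of them: overlaying low-level noise on the leading and trailing silence, and mixing in a single local echo. This is not the researchers' tool. The function names, the Gaussian noise model and every parameter value (edge length, noise level, echo delay and decay) are my own illustrative assumptions.

```python
import numpy as np


def add_noise_to_edges(audio, sample_rate, edge_seconds=0.2, noise_rms=0.005, seed=0):
    """Overlay low-level Gaussian noise on the first and last `edge_seconds`.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(audio, dtype=np.float64).copy()
    n = min(int(edge_seconds * sample_rate), len(out) // 2)
    if n > 0:
        out[:n] += rng.normal(0.0, noise_rms, n)   # leading edge
        out[-n:] += rng.normal(0.0, noise_rms, n)  # trailing edge
    return out


def add_echo(audio, sample_rate, delay_seconds=0.03, decay=0.3):
    """Mix in one delayed, attenuated copy of the signal (a crude echo)."""
    src = np.asarray(audio, dtype=np.float64)
    out = src.copy()
    d = int(delay_seconds * sample_rate)
    if 0 < d < len(out):
        out[d:] += decay * src[:-d]
    return out


# Toy input: a one-second 220 Hz tone at 16 kHz with digital silence either side,
# standing in for a TTS sample whose silences are "too clean".
sr = 16000
t = np.arange(sr) / sr
tone = 0.1 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([np.zeros(sr // 4), tone, np.zeros(sr // 4)])

masked = add_echo(add_noise_to_edges(signal, sr), sr)
print("leading-edge RMS before:", np.sqrt(np.mean(signal[:1000] ** 2)))
print("leading-edge RMS after: ", np.sqrt(np.mean(masked[:1000] ** 2)))
```

The point of the sketch is only that each transformation is cheap and simple signal processing; the hard part of the research is knowing which characteristics the detectors rely on.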
The attack simulated an in-app use case (using audio at 16 kHz), assuming they could inject raw audio into the application through a rooted device. There were only a handful of telephone channel (8 kHz) attacks simulated. They evaluated effectiveness based on the success rate against each available combination of Verification and Detection algorithms. To account for retries allowed in implemented systems, they evaluated results for three and six attempts on each combination.
There was a significant range of results from these end-to-end combinations, from an attack success rate of 4.9% against the best-performing combination to the reported 99% against the worst. They also present findings on how different synthetic speech detection models stand up to varying combinations of attribute targeting from their tool, with success rates ranging from 16% to 62%. In both cases, however, we need to be very careful with the definition of success as it may differ from how these systems are implemented in the real world.
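One reason the definition of success matters is that allowing retries inflates any per-attempt figure. As a rough sketch, and assuming each attempt succeeds independently with the same probability (my simplifying assumption, not the researchers' evaluation method), the chance of at least one success in n attempts is 1 - (1 - p)^n. The per-attempt rates below are hypothetical and are not taken from the paper.

```python
def success_within_attempts(per_attempt_rate: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Hypothetical per-attempt success rates, evaluated at one, three and six attempts.
for p in (0.05, 0.20, 0.50):
    rates = [round(success_within_attempts(p, n), 3) for n in (1, 3, 6)]
    print(f"per-attempt {p:.2f} -> 1/3/6 attempts: {rates}")
```

Under that independence assumption, a 20% per-attempt rate becomes roughly 49% over three attempts and 74% over six, which is why the retry policy of a deployed system has such a large effect on any quoted success rate.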
I’m not a speech scientist or academic, so I can’t comment on the experimental design or the more technical aspects of their work. However, I have implemented Voice Biometric systems in organisations covering millions of consumers. From this perspective, there are a few significant issues when you scratch beneath the surface, although to be fair to the researchers, I’m not sure they could have overcome many of them.
- Models/Systems Used – The researchers did try to access the latest commercial models but were, unsurprisingly, unable to. The only exception was Amazon’s Voice ID service which was only used for speaker verification of telephone channel data. We know from some of our testing that the best commercial models significantly outperform these public domain models in the speaker verification task and expect, given the current focus, the same is true for synthetic speech detection. If it’s not now, then it soon will be.
- System Tuning and Calibration – All of the Voice Biometrics Authentication and Synthetic Speech Detection systems were calibrated with a threshold set at the Equal Error Rate (EER), where the risk of false accept equals the risk of false reject. In practice, most systems in high-security applications are implemented with thresholds at far lower False Accept rates than the EER.
- Synthetic Speech Samples – The samples were generated for the 2019 ASV Spoofing challenge using systems that do not represent today’s state of the art, so they are probably poorer than the current best of breed. The ASV spoof evaluation plan does not specify on what basis the samples were generated. Still, it is almost certain that it was based on large datasets (certainly more than the 15 minutes of audio suggested by the researchers) from professionally recorded audiobooks. This is, therefore, unlikely to be representative of consumer attack scenarios.
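The calibration point in the list above is easier to see with numbers. The sketch below uses entirely synthetic Gaussian scores (an assumption for illustration only; they do not come from the paper or any vendor) to compare a threshold set at the Equal Error Rate with a stricter one chosen for a 0.1% False Accept rate. The function names and the 0.1% target are hypothetical.

```python
import numpy as np


def far_frr(genuine, impostor, threshold):
    """False accept and false reject rates when accepting scores >= threshold."""
    far = float(np.mean(impostor >= threshold))
    frr = float(np.mean(genuine < threshold))
    return far, frr


def eer_threshold(genuine, impostor, steps=1001):
    """Grid-search the threshold where FAR and FRR are closest to equal."""
    scores = np.concatenate([genuine, impostor])
    candidates = np.linspace(scores.min(), scores.max(), steps)
    gaps = [abs(far - frr) for far, frr in
            (far_frr(genuine, impostor, t) for t in candidates)]
    return float(candidates[int(np.argmin(gaps))])


def threshold_for_far(impostor, target_far):
    """Threshold that leaves roughly `target_far` of impostor scores accepted."""
    return float(np.quantile(impostor, 1.0 - target_far))


# Synthetic scores: genuine speakers score higher on average than impostors.
rng = np.random.default_rng(0)
genuine = rng.normal(4.0, 1.0, 5000)
impostor = rng.normal(0.0, 1.0, 5000)

eer_thr = eer_threshold(genuine, impostor)
strict_thr = threshold_for_far(impostor, target_far=0.001)

far_eer, frr_eer = far_frr(genuine, impostor, eer_thr)
far_strict, frr_strict = far_frr(genuine, impostor, strict_thr)
print(f"EER threshold    {eer_thr:.2f}: FAR={far_eer:.4f} FRR={frr_eer:.4f}")
print(f"Strict threshold {strict_thr:.2f}: FAR={far_strict:.4f} FRR={frr_strict:.4f}")
```

The stricter threshold accepts far fewer impostor scores at the cost of rejecting more genuine ones, which is the trade-off high-security deployments typically choose. An attack success rate measured at the EER threshold will therefore tend to overstate what the same attack would achieve against a system tuned this way.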
Implications for Organisations
So another genie is out of the bottle. Does that mean organisations implementing or considering Voice Biometrics for their call and contact centres should change direction?
Whilst it’s easy to go down a rabbit hole of fear and uncertainty, organisations must take a step back and look at the bigger picture.
- Knowledge-based authentication remains by far the biggest risk – PINs, Passwords and Knowledge-Based Questions are still the weakest links. It is trivially easy to socially engineer or compromise this data at scale.
- Voice Biometric systems are vulnerable to other attacks – Voice Biometric systems are not without their vulnerabilities, and any responsible organisation must understand and accept these risks when implementing systems. Not least of these is that the many mismatch processes allow fraudsters to bypass these controls entirely.
- Low incentives to attack Voice Biometrics at scale today – Given the above two factors, fraudsters interested in making money rather than headlines have very low incentives to go to the effort required even to start to compromise these systems at scale. That won’t always be the case, but it certainly is now.
- Deepfakes are not guaranteed to be successful – Synthetic speech and deepfakes are good but not yet good enough to guarantee a biometric match every time, especially in a well-tuned and optimised environment. Even if fraudsters could obtain sufficient target customer audio and create a voice that would fool a human ear, there is a range of simple mitigation measures available to organisations, which I covered in my session with Haydar Talib from Nuance a few weeks ago.
- Synthetic Speech Detection methods will improve – Synthetic Speech Detection is a last line of defence that will be increasingly important but hasn’t really been a focus for commercial vendors until recently as they focus on core authentication performance. I’ve heard from many commercial vendors in the last few weeks about their plans, and I’m excited by the energy they are showing for solving this and the early results. This research adds momentum to that work.
On the flip side, it does reinforce the need for a thorough and dynamic vulnerability assessment (which I can help with, of course), although the following generic advice remains extant:
- High-risk users – VIPs and those with significant media presence are far more vulnerable than the average consumer.
- Layered defence – Voice Biometrics should always be implemented as part of a layered authentication and fraud detection scheme. It is the most convenient authentication method for high-frequency and high-risk voice contacts but other measures, such as network authentication and behavioural analytics, should be used to add additional security and confidence as the call progresses. I’ve written a whole book, Unlock Your Call Centre, on this if you are interested.
- Predictability is the biggest risk – Unattended systems are far more predictable than agents, making it easier for bad actors to prepare their attack, so detection efforts should focus on these as they are likely to be the leading indicator of impending scale attacks.
- Maintain currency in the cloud – Maintaining currency with the latest voice biometric and synthetic speech detection algorithms is almost impossible if you use an on-premise system, reinforcing the priority of moving to cloud-based solutions.
The ASV Spoof challenge, referenced by the researchers, is about to start its 2023 challenge, focusing on spoofing countermeasures such as synthetic speech detection. The results will be published in late 2023, so I’ll update my perspective. Commercial vendors will also have a lot of news to share on their plans over the next few months.
This is an impressive, sophisticated, and rigorous academic work, but this approach won’t be as efficacious as the headline suggests against modern real-world voice biometric authentication systems. It will undoubtedly increase the probability of incorrect acceptance, but likely by only a few percentage points from an already low base (as demonstrated by some of the simulations).
Voice Biometrics remains, on aggregate, a significantly more secure, easier to use and efficient authentication method than traditional authentication methods such as PINs, Passwords and Knowledge-Based questions or even One Time Passcodes. However, it is just one layer of defence that needs to balance security with an organisation’s efficiency and usability requirements. There will continue to be a cat-and-mouse game, and the leading Voice Biometric providers will quickly respond with targeted counter-measures and updated synthetic speech detection methods.
More concerning are the implications of this research for society: it improves bad actors’ ability to create fake media for consumption by the general public that will evade detection by reputable media organisations.
I applaud the team for doing this in the open and publishing their source code and methods, which will surely form the baseline that commercial synthetic speech detection tools need to beat.