Is the latest attack on Synthetic Speech Detection really 99% effective?

A team at the University of Waterloo says it has produced a system that is 99% effective against the synthetic speech detection countermeasures used to protect Voice Biometric authentication schemes.

When I read the headline in The Register this morning, I almost choked on my coffee, so I spent some time trying to understand the real-world implications of this research. They published their full paper here (Requires payment to download).

Background

With the increasing threat posed by Deepfakes to Voice Biometrics authentication services (see our community video here), one of several mitigation strategies is using synthetic speech detection algorithms. These systems are trained to pick up on the unique characteristics of the text-to-speech (TTS) generation process and raise an alarm if a sample differs sufficiently from normal human speech. Many of these characteristics are well known, but as they are imperceptible to human ears, commercial synthetic speech services such as ElevenLabs, Resemble, Microsoft, Google and Amazon have little incentive to change. It was, however, inevitable that someone would try to eliminate them to circumvent these systems.

What they did

The team took a subset of the genuine and synthetic speech samples used in the ASVspoof 2019 synthetic speech detection challenge. They ran these through a tool they developed to remove the key TTS artefacts. They then evaluated these new samples using various voice biometric authentication and synthetic speech detection algorithms. Finally, they asked a panel of humans to rate each sample for realism.

The Synthetic Speech Masking Tool

The synthetic speech detection avoidance, or masking, tool focused on six key attributes of text-to-speech generation:

Leading and Trailing Silence - Adding realistic noise at the beginning and end of an utterance to simulate real-world conditions.
Inter-word silence - Adding realistic noise to silence between words.
Low centre spectrum energy - Boosting this area, which is not prioritised by deepfake audio systems.
Local Echo - Adding echo to simulate the use of a microphone for recording.
Amplifying lower-frequency content - Removing some of the features added by synthetic voice generation to improve transmission.
Noise Reduction - Removing characteristics most likely left by recording devices used during model training.

Finally, using an adversarial speaker verification model, the team developed a mechanism to "engrave" samples with elements of the genuine speaker's voice print. However, the specific mechanism for this is unclear, and it didn't add an awful lot to the success rate of the other approaches.

The Attack

The attack simulated an in-app use case (using audio at 16 kHz), assuming the attackers could inject raw audio into the application through a rooted device. Only a handful of telephone channel (8 kHz) attacks were simulated. They evaluated effectiveness based on the success rate against each available combination of Verification and Detection algorithms. To account for the retries allowed in implemented systems, they evaluated results for three and six attempts on each combination.
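
To see why the number of permitted retries matters so much, here is a minimal sketch of the arithmetic. It assumes every attempt is independent and has the same chance of success, which is my simplification and not necessarily how the researchers calculated their figures. The function name and the per-attempt rates are hypothetical and chosen only for illustration.

```python
def success_within_attempts(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumes every attempt is independent and has the same success rate,
    which is a simplification made for this illustration.
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Hypothetical per-attempt rates, chosen only to show the shape of the effect.
for rate in (0.05, 0.20):
    for attempts in (1, 3, 6):
        combined = success_within_attempts(rate, attempts)
        print(f"per-attempt {rate:.0%}, {attempts} attempts: {combined:.1%}")
```

Under these assumptions, a 5% per-attempt rate becomes roughly 26% over six attempts, so the retry policy is as much a part of the result as the detection model itself.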

The Results

Results from these end-to-end combinations varied widely, ranging from 4.9% for the best to the reported 99% for the worst. They also present findings on how different synthetic speech detection models stand up to varying combinations of attribute targeting from their tool, with success rates ranging from 16% to 62%. In both cases, however, we need to be very careful with the definition of success, as it may differ from how these systems are implemented in the real world.

Challenges

I'm not a speech scientist or academic, so I can't comment on the experimental design or the more technical aspects of their work. However, I have implemented Voice Biometric systems in organisations covering millions of consumers. From this perspective, there are a few significant issues when you scratch beneath the surface, although to be fair to the researchers, I'm not sure they could have overcome many of them.

Models/Systems Used - The researchers did try to access the latest commercial models but were, unsurprisingly, unable to. The only exception was Amazon's Voice ID service, which was used only for speaker verification of telephone channel data. We know from some of our testing that the best commercial models significantly outperform these public domain models in the speaker verification task and expect, given the current focus, the same is true for synthetic speech detection. If it's not now, then it soon will be.
System Tuning and Calibration - All of the Voice Biometrics Authentication and Synthetic Speech Detection systems were calibrated with a threshold set at the Equal Error Rate (EER), where the risk of false accept equals the risk of false reject. In practice, most systems in high-security applications are implemented with thresholds at far lower False Accept rates than the EER.
Synthetic Speech Samples - The samples were generated for the ASVspoof 2019 challenge using systems that do not represent today's state of the art, so they are probably poorer than some of the current best of breed. The ASVspoof evaluation plan does not specify on what basis the samples were generated. Still, it is almost certain that they were based on large datasets (certainly more than the 15 minutes of audio suggested by the researchers) from professionally recorded audiobooks. They are, therefore, unlikely to be representative of consumer attack scenarios.
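
The calibration point above is easier to see with numbers. The following is a minimal sketch using a handful of invented match scores; they do not come from the paper or from any vendor, and the function names are hypothetical. It finds the threshold closest to the EER and then a stricter threshold chosen for a low False Accept rate. Real evaluations use far larger datasets, so treat this only as an illustration of the trade-off.

```python
from typing import Optional, Sequence, Tuple

# Invented toy scores (higher means "more like the genuine speaker").
GENUINE_SCORES = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.50, 0.40]
IMPOSTOR_SCORES = [0.70, 0.55, 0.45, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false accept rate, false reject rate) when scores at or
    above `threshold` are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Observed score at which the two error rates are closest to equal."""
    candidates = sorted(set(genuine) | set(impostor))

    def gap(t: float) -> float:
        far, frr = error_rates(genuine, impostor, t)
        return abs(far - frr)

    return min(candidates, key=gap)


def threshold_for_far(genuine: Sequence[float], impostor: Sequence[float],
                      target_far: float) -> Optional[float]:
    """Lowest observed score whose false accept rate is within `target_far`,
    or None if no observed score is strict enough."""
    for t in sorted(set(genuine) | set(impostor)):
        if error_rates(genuine, impostor, t)[0] <= target_far:
            return t
    return None


eer_t = eer_threshold(GENUINE_SCORES, IMPOSTOR_SCORES)
strict_t = threshold_for_far(GENUINE_SCORES, IMPOSTOR_SCORES, target_far=0.05)
for label, t in (("EER threshold", eer_t), ("Low-FAR threshold", strict_t)):
    far, frr = error_rates(GENUINE_SCORES, IMPOSTOR_SCORES, t)
    print(f"{label}: {t:.2f} -> false accept {far:.0%}, false reject {frr:.0%}")
```

With these toy scores, moving from the EER threshold to the stricter one removes the false accepts at the cost of rejecting more genuine users. An attack success rate measured at the EER will therefore tend to overstate what the same attack would achieve against a system tuned for security.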

Implications for Organisations

So another genie is out of the bottle. Does that mean organisations implementing or considering Voice Biometrics for their call and contact centres should change direction?

Whilst it's easy to go down a rabbit hole of fear and uncertainty, organisations must take a step back and look at the bigger picture.

Knowledge-based authentication remains by far the biggest risk - PINs, Passwords and Knowledge-Based Questions are still the weakest links. It is trivially easy to socially engineer or compromise this data at scale.
Voice Biometric systems are vulnerable to other attacks - Voice Biometric systems are not without their vulnerabilities, and any responsible organisation must understand and accept these risks when implementing systems. Not least of these is that many mismatch processes allow fraudsters to bypass these controls entirely.
Low incentives to attack Voice Biometrics at scale today - Given the above two factors, fraudsters interested in making money rather than headlines have very low incentives to go to the effort required even to start to compromise these systems at scale. That won't always be the case, but it certainly is now.
Deepfakes are not guaranteed to be successful - Synthetic speech and deepfakes are good but not yet good enough to guarantee a biometric match every time, especially in a well-tuned and optimised environment. Even if fraudsters could obtain sufficient target customer audio and create a voice that would fool a human ear, there are a range of simple mitigation measures available to organisations, which I covered in my session with Haydar Talib from Nuance a few weeks ago.
Synthetic Speech Detection methods will improve - Synthetic Speech Detection is a last line of defence that will be increasingly important but hasn't really been a focus for commercial vendors until recently as they focus on core authentication performance. I've heard from many commercial vendors in the last few weeks about their plans, and I'm excited by the energy they are showing for solving this and the early results. This research adds momentum to that work.

On the flip side, it does reinforce the need for a thorough and dynamic vulnerability assessment (which I can help with, of course), although the following generic advice remains extant:

High-risk users - VIPs and those with significant media presence are far more vulnerable than the average consumer.
Layered defence - Voice Biometrics should always be implemented as part of a layered authentication and fraud detection scheme. It is the most convenient authentication method for high-frequency and high-risk voice contacts but other measures, such as network authentication and behavioural analytics, should be used to add additional security and confidence as the call progresses. I've written a whole book, Unlock Your Call Centre, on this if you are interested.
Predictability is the biggest risk - Unattended systems are far more predictable than agents, making it easier for bad actors to prepare their attack, so detection efforts should focus on these as they are likely to be the leading indicator of impending scale attacks.
Maintain currency in the cloud - Maintaining currency with the latest voice biometric and synthetic speech detection algorithms is almost impossible if you use an on-premise system, reinforcing the priority of moving to cloud-based solutions.

What's next

The ASVspoof challenge, referenced by the researchers, is about to start its 2023 edition, focusing on spoofing countermeasures such as synthetic speech detection. The results will be published in late 2023, so I'll update my perspective then. Commercial vendors will also have a lot of news to share on their plans over the next few months.

Conclusion

This is an impressive, sophisticated, and rigorous academic work, but this approach won't be as efficacious as the headline suggests against modern real-world voice biometric authentication systems. It will undoubtedly increase the probability of incorrect acceptance, but likely by only a few percentage points from an already low base (as demonstrated by some of the simulations).

Voice Biometrics remains, on aggregate, a significantly more secure, easier-to-use and more efficient authentication method than traditional authentication methods such as PINs, Passwords and Knowledge-Based Questions or even One Time Passcodes. However, it is just one layer of defence that needs to balance security with an organisation's efficiency and usability requirements. There will continue to be a cat-and-mouse game, and the leading Voice Biometric providers will quickly respond with targeted counter-measures and updated synthetic speech detection methods.

More concerning are the implications of this research for society: it improves bad actors' ability to create fake media for consumption by the general public that will evade detection by reputable media organisations.

I applaud the team for doing this in the open and publishing their source code and methods, which will surely form the baseline that commercial synthetic speech detection tools need to beat.
