You’ve heard of deepfakes—photos or videos that show a public figure or celebrity (like Tom Cruise or Will Smith) somewhere they never were, doing something they never did. But you may not know that an emerging class of machine learning tools makes that same kind of fakery possible for audio.
Speech synthesis technology has come a long way since the Voder, unveiled by Bell Labs in 1939. The robotic droning once controlled by an operator using keys and pedals has evolved into AI-powered digital voices that are nearly indistinguishable from the real thing. Today's tools are so realistic and accessible that audio engineers use them to duplicate the speech of podcast hosts or voice actors and add new information to content without recording a word.
This technology is also being used by cybercriminals and fraudsters, forcing organizations in every industry to adopt new cybersecurity models to minimize the unavoidable risks.
A Choir of Burglars on the Rise
In 2019, in the first known case of voice clone fraud, thieves recreated the voice of an executive at the parent company of an undisclosed UK-based energy firm. When the firm’s CEO received a call from the “executive,” he recognized his colleague’s German accent and speech cadence, and quickly made the urgent fund transfer as requested. The scammers made contact again a few hours later to attempt a second theft, but this time, the CEO noticed that the call was coming from an unknown location and became suspicious.
All the ingredients are in place for massive use of voice cloning technology for malicious purposes.
In early 2022, the FBI published a report alerting the public to a new swindling technique on virtual meeting platforms. After taking control of an executive’s login, the attackers invite employees to a meeting where they deploy a cloned voice, claim their video isn’t working, and ask for restricted information or an emergency transfer of funds.
The sudden appearance of voice clone frauds is raising alarms around the globe. According to Irakli Beridze, Head of the Centre on Artificial Intelligence and Robotics at the United Nations Interregional Crime and Justice Research Institute (UNICRI), all the ingredients are in place for a massive adaptation of this technology for malicious purposes. “Whether it’s for committing fraud, framing people, derailing political processes, or undermining political structures, that is all within the realm of possibility,” he tells Toptal.
Impersonating a top executive at an organization in order to commit fraud cost companies around the world more than $26 billion between 2016 and 2019, according to the FBI’s Internet Crime Complaint Center. And those are just the cases reported to law enforcement—most victims keep such attacks under wraps to protect their reputations.
Criminals are learning fast, too, so while the incidence of voice clone fraud is low now, that could change soon. “Five years ago, even the term ‘deepfake’ was not used at all,” Beridze says. “From that point on, we went from very inaccurate, very primitive automatically generated voice or visual content to extremely accurate deepfakes. If you analyze the trend from a historical point of view, this happened overnight. And that’s an extremely dangerous phenomenon. We have not yet seen its full potential.”
Making the Fakes
Audio deepfakes run on neural networks. Unlike traditional algorithms, in which a human programmer must predefine every step of a computational process, neural networks allow software to learn to perform a prescribed task by analyzing examples: Feed an object recognition network 10,000 images of giraffes, label the content “giraffe,” and the network will eventually learn to identify that particular mammal even in images it has never been fed before.
The problem with that model was that it required large, carefully curated and labeled datasets, and very narrow questions to answer, all of which took months of planning, correcting, and refining by human programmers. This changed quickly after the introduction of generative adversarial networks (GANs) in 2014. A GAN pairs two neural networks: a generator that produces candidate content and a discriminator that judges how realistic each candidate is. Because the two networks train by testing and giving feedback to each other, GANs can generate and assess millions of images quickly, gaining new information every step of the way with little need for human intervention.
GANs also work with audio waveforms: Feed a GAN enough hours of human speech and it will start to recognize patterns. Feed it enough speech from one particular person, and it will learn what makes that voice unique.
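That adversarial feedback loop can be sketched in miniature. The toy below is an illustrative sketch only, not real deepfake code: a one-dimensional "generator" learns to mimic a Gaussian distribution (standing in for a voice feature) by descending gradients supplied by a logistic-regression "discriminator." All names and hyperparameters here are hypothetical:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_toy_gan(steps=5000, batch=128, lr_d=0.05, lr_g=0.01,
                  real_mu=4.0, seed=0):
    """Train a minimal 1-D GAN to mimic samples from N(real_mu, 1).

    Generator:     x = w_g * z + b_g        (noise z ~ N(0, 1))
    Discriminator: p = sigmoid(w_d * x + b_d)

    Returns (initial_fake_mean, final_fake_mean).
    """
    rng = np.random.default_rng(seed)
    w_g, b_g = 1.0, 0.0          # generator parameters
    w_d, b_d = 0.0, 0.0          # discriminator parameters
    init_mean = b_g              # E[G(z)] = b_g, since E[z] = 0

    for _ in range(steps):
        # Sample a batch of real data and a batch of generated ("fake") data.
        real = rng.normal(real_mu, 1.0, batch)
        z = rng.normal(0.0, 1.0, batch)
        fake = w_g * z + b_g

        # Discriminator step: logistic-regression gradients on the
        # binary cross-entropy loss, labeling real=1 and fake=0.
        p_r = sigmoid(w_d * real + b_d)
        p_f = sigmoid(w_d * fake + b_d)
        grad_w = np.mean((p_r - 1.0) * real) + np.mean(p_f * fake)
        grad_b = np.mean(p_r - 1.0) + np.mean(p_f)
        w_d -= lr_d * grad_w
        b_d -= lr_d * grad_b

        # Generator step: descend the non-saturating loss -log D(G(z)),
        # i.e., push fakes toward outputs the discriminator calls "real."
        p_f = sigmoid(w_d * fake + b_d)
        dx = -(1.0 - p_f) * w_d          # dLoss/dx for each fake sample
        w_g -= lr_g * np.mean(dx * z)
        b_g -= lr_g * np.mean(dx)

    return init_mean, b_g
```

Over a few thousand alternating updates, the discriminator's feedback pulls the generator's output mean toward the real data's mean. Production speech GANs run the same loop over high-dimensional waveforms or spectrograms, with deep networks on both sides.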
White-hat Uses for Deepfake Speech Synthesis
Descript, an audio editing and transcription company founded by Groupon cofounder Andrew Mason with a seed investment from Andreessen Horowitz, can identify the equivalent of DNA in every voice with only a few minutes of sample audio. The software can then produce a copy of that voice, incorporating new words while maintaining the style of the speaker, says Jay LeBoeuf, the company’s Head of Business and Corporate Development.
Descript’s most popular feature, Overdub, not only clones voice, it also lets the user edit speech in the same way that they would edit a document. Cut a word or phrase and it disappears from the audio. Type additional text, and it’s added as spoken words. This technique, called text-informed speech inpainting, is a revolutionary deep-learning breakthrough that would have been unthinkable just five years ago. A user can make the AI say anything, in whichever voice they’ve programmed, just by typing.
“One of the things that almost seemed like science fiction to us was the ability to retype a mistake that you might have made in your voiceover work,” LeBoeuf tells Toptal. “You say the wrong product name, the wrong release date, and you would usually have to redo the entire presentation or at least a large section of it.”
Voice cloning and Overdub technology can save content creators hours of editing and recording time without sacrificing quality. Pushkin Industries, the company behind Malcolm Gladwell’s popular podcast Revisionist History, uses Descript to generate a digital version of the host’s voice to use as a stand-in voice actor while assembling an episode. Previously, this process required the real Gladwell to read and record content so the production team could check an episode’s timing and flow. It took many takes and several hours of work to produce the desired results. Using a digital voice also frees the team to make small editorial fixes later in the process.
This technology is also being used for companies’ internal communications, says LeBoeuf. One Descript client, for example, is cloning the voices of all the speakers in its training videos so the company can modify the content in post-production without returning to the studio. The cost to produce training videos ranges from $1,000 to $10,000 per minute, so voice cloning could yield enormous savings.
Protecting Your Business From Cloned-voice Crimes
Though voice cloning is a relatively new technology, the global market for it was worth $761.3 million in 2020 and is projected to reach $3.8 billion by 2027. Startups like Respeecher, Resemble AI, and Veritone offer services similar to Descript’s, and Big Tech companies like IBM, Google, and Microsoft have invested heavily in their own research and tools.
The continued evolution, growth, and availability of cloned voices is practically assured, and rapid advances in the technology will make related cyberattacks all but impossible to avoid.
“You cannot fight deepfakes,” says Ismael Peinado, a global cybersecurity expert with two decades of experience leading security and technology teams, and Toptal’s Chief Technology Officer. “The sooner you accept it, the better. It may not be today, but we will face the perfect voice or video deepfake. Not even a workforce fully trained in risk awareness may be able to spot a fake.”
Specialized software tools exist to detect deepfakes, using deep-learning techniques to catch evidence of forgery in all kinds of content. But every expert we consulted advised against such investments: The technology is evolving so fast that detection techniques quickly become outdated.
“It’s ultimately somewhat of a losing battle to pursue detection purely,” Andy Parsons, Senior Director of Adobe’s Content Authenticity Initiative (CAI), tells Toptal. “To put it bluntly, the bad guys would win because they don’t have to open-source their data sets or their trained models.”
So what’s the solution?
Move Away From Email
“First, stop using email for internal communication. Ninety percent of your security concerns will vanish,” says Peinado. Most phishing attacks, including ones aimed at gaining access to private company spaces like Zoom, originate with emails. “So use a different tool to communicate internally, like Slack; set aggressive security protocols for every email received; and change the cybersecurity culture to address the most critical vulnerabilities. ‘If you receive an email or an SMS, don’t trust it’; that’s our policy, and every member of the organization knows it. This single action is more powerful than the best antivirus on the market.”
Take to the Cloud
Peinado also says all communication and collaboration tools should be on the cloud and include multifactor authentication. This is the most effective way to reduce the danger of fake identities because it significantly reduces the points of entry to critical business data. Even if your CEO’s laptop is stolen, the risk that a malicious actor could use it to access the company’s information or stage a deepfake attack would be minimal.
Support Digital Provenance Efforts
“As things become more photo-realistic and audio-realistic, we need another foundation on the internet itself to depict truth or provide transparency to consumers and fact-checkers,” says Parsons. To that end, Adobe’s CAI, an alliance of creators, technologists, and journalists founded in 2019 in partnership with Twitter and the New York Times, has joined forces with Microsoft, Intel, and other major players to develop a standard framework for content attribution and digital provenance. The framework embeds tamper-evident information, such as the time, author, and type of device used, every time digital content is created or modified.
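By way of illustration only (the actual CAI framework relies on certificate-based digital signatures, not a shared secret), the toy Python sketch below shows the core idea of tamper-evident provenance: metadata is bound to a content hash and sealed with a keyed tag, so any later edit to the audio or its metadata is detectable. The key and field names are hypothetical:

```python
import hashlib
import hmac
import json

SECRET_KEY = b"device-or-service-signing-key"   # hypothetical signing key

def attach_provenance(content: bytes, author: str, device: str,
                      timestamp: str) -> dict:
    """Bundle content with provenance metadata and a tamper-evident tag."""
    meta = {
        "author": author,
        "device": device,
        "timestamp": timestamp,
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    # Canonical serialization so the same metadata always yields the same tag.
    payload = json.dumps(meta, sort_keys=True).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return {"meta": meta, "tag": tag}

def verify_provenance(content: bytes, record: dict) -> bool:
    """Recompute the tag; any edit to content or metadata invalidates it."""
    meta = dict(record["meta"])
    if meta["content_sha256"] != hashlib.sha256(content).hexdigest():
        return False                              # content was altered
    payload = json.dumps(meta, sort_keys=True).encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["tag"])
```

A verifier holding the key can confirm that a clip and its claimed origin have not been touched since capture; altering either the audio bytes or a metadata field breaks verification.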
This framework’s function is to foster a safe environment for creating content with AI. Even virtual meeting platforms could integrate this technology to prove that a caller is who they claim to be, no matter what voice attendees think they’re hearing. “Among the members of the standard’s body, we have Intel, Arm, and other manufacturers looking at potential hardware implementations, so that capture devices of all kinds—including streaming cameras, audio devices, and computer hardware itself—can benefit. We hope and expect to see that adoption,” Parsons says.
Invest in Threat Assessment and Education
With few technological tools at hand, limited strategic security options, and an enemy that grows bigger and wiser by the day, there are no silver bullets. But collaboration among governments, academia, and the private sector aims to protect businesses and society at large, says Beridze.
“Governments should adopt national cybersecurity programs and should do very thorough assessments of their needs and competitive advantages,” he says. “The same thing goes with the private sector: Whether they’re small, medium, or large enterprises, they need to invest in threat assessment and knowledge.”
Initiatives like the CAI’s standard framework require massive adoption to be successful, and that will take time. For now, leaders must prioritize reducing their organization’s attack surface and spreading the message that thieves armed with cloned voices are trolling for victims.