Deepfake voice clones
There’s a video that pops up periodically in my YouTube feed. It’s a conversation between rappers Snoop Dogg and 50 Cent, who lament the fact that compared to their generation, all contemporary hip-hop artists sound the same. “When a person decides to be themselves, they offer something that no one else can offer,” 50 Cent says. “Yeah, because once you become yourself – who can be you but you?” replies Snoop.
When the video was uploaded in October 2014, that may have been generally true. But just a few years later, it no longer is. In the world of deep audio spoofing, you can train an artificial intelligence to sound eerily similar to another person by feeding it an audio corpus consisting of several hours of their speech. The results can be frighteningly accurate.
Public figures like rapper Jay-Z and psychologist Jordan Peterson have already complained about people appropriating their voices by creating audio fakes and then making them say stupid things online. “Wake up,” Peterson wrote. “The sanctity of your voice and your image is in serious jeopardy.” Those are merely the mischievous cases. In others, the technology enables outright crime. In one 2019 incident, criminals used audio spoofing to impersonate the voice of an energy company CEO and persuade a subordinate over the phone to urgently transfer $243,000 into a bank account.
Veritone, a company that creates smart tools for tagging media files in the entertainment industry, is putting the power over audio fakes back in the hands (or, uh, throats) of those who rightfully own it. This month the company announced Marvel.ai, which company president Ryan Stilberg described to Digital Trends as a “complete voice-as-a-service solution.” For a fee, Veritone will create an A.I. model that sounds just like you (or, more likely, like a famous person with an easily recognizable voice), which can then be loaned out as a high-tech version of Ariel’s voice from “The Little Mermaid.”
Of course, the big question hanging over all this is how the public will react. That is the most difficult and unpredictable part. Celebrities today have complicated roles to play: They are both larger-than-life figures worthy of having their faces plastered across billboards, and easily relatable personalities who have relationship problems, who tweet about watching TV in their pajamas, and who make silly faces while eating hot sauce.
So what happens when a commercial features a celebrity reading lines that we know the performer never actually spoke, in a voice that has been programmed to bring us targeted advertising? Stilberg said it’s not much different than if a celebrity handed over control of their social media accounts to a third-party manager. When we see a tweet from Taylor Swift, we know it probably wasn’t written by Taylor herself, especially when it comes to endorsements or promotional content.
But the voice is, in fact, different, precisely because it’s more personal. That is especially true when it is accompanied by a degree of personalization, which is one of the uses that makes the most sense. The truth is that, to quote screenwriter William Goldman, no one knows what the audience’s reaction will be, precisely because no one has done anything like this before.
“The reactions are going to be all sorts, right?” said Stilberg. “Some people will say, ‘I’m going to use this tool to supplement my day a little bit and help me save time.’ Others will say, ‘I want my voice everywhere to extend my brand, and I’m going to license it.’”
He believes consent will be given on a case-by-case basis. “You have to be aware of your audience’s reaction and if you see something working or not working,” he said. “They might like it. They might say, ‘You know what? I like that you’re putting out 10 times as much content or more personal content for me, even though I know you used synthetic content to supplement it. Thank you. Thank you.’”
Think about the future
What about the future? Stilberg said: “We want to work with all the big talent agencies. We believe that anyone in the business of making money from a scarce brand should think about their voice strategy.”
And don’t expect it to remain exclusively audio. “We’ve always been attracted to the potential of using synthetic content to augment, supplement or perhaps completely replace some legacy forms of content production,” he continued. “Whether it’s audio or, eventually, in the future, video.”
That’s right: Having captured the audio spoof market, Veritone plans to take it one step further and enter the world of fully realized virtual avatars that both sound and look indistinguishable from their source.
Suddenly, the personalized ads from “Minority Report” seem much less like science fiction.