Audio deepfakes have recently received bad press after an artificial intelligence-generated robocall impersonating Joe Biden's voice urged New Hampshire residents not to vote. Meanwhile, spear-phishers (phishing campaigns that target specific individuals or groups using information known to be of interest to the target) go after money, and actors seek to preserve their audio likenesses.
What has received less media attention, however, are the ways audio deepfakes may actually benefit society. In this Q&A prepared for MIT News, postdoc Nauman Dawalatabad addresses both the concerns and the potential benefits of the emerging technology. A full version of this interview can be seen in the video below.
Q: What ethical considerations justify concealing the identity of the source speaker in audio deepfakes, especially when this technology is used to create innovative content?
A: Although generative audio models are used mainly in entertainment, the question of why research on obscuring the source speaker's identity matters raises real ethical considerations. Speech contains more than just "who you are" (identity) or "what you are saying" (content); it carries a wealth of sensitive information, including your age, gender, accent, current health, and even clues about your future health. For instance, a recent research paper on detecting dementia from long neuropsychological interviews shows that dementia can be detected from speech with reasonably high accuracy, and several models can likewise infer gender, accent, age, and other attributes from speech with very high accuracy. Technological advances are needed to keep such personal data from being inadvertently disclosed. Anonymizing the source speaker's identity is not simply a technical challenge but a moral imperative for protecting individual privacy in the digital age.
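As a deliberately simplified illustration of how much a voice reveals beyond its words: even a few lines of signal processing can estimate fundamental frequency (F0), one of many cues that separates typical adult male and female voices. Real attribute classifiers combine many such cues with learned models; the function below is a toy sketch, not any cited system.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin: int = 60, fmax: int = 400) -> float:
    """Autocorrelation-based fundamental-frequency estimate for one frame.

    Mean F0 is one of many features attribute classifiers exploit
    (typical adult male and female voices occupy different F0 ranges).
    """
    frame = frame - frame.mean()
    # Autocorrelation; keep non-negative lags only.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search for the strongest periodicity within the plausible pitch range.
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag

# A synthetic 200 Hz tone is recovered at roughly 200 Hz.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
print(estimate_f0(np.sin(2 * np.pi * 200 * t), sr))
```

Anonymization systems aim to scrub or remap exactly these kinds of correlates while keeping the spoken content intact.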
Q: How can the challenges posed by audio deepfakes in spear-phishing attacks be effectively addressed, considering the risks involved, the development of countermeasures, and advances in detection technologies?
A: Deploying audio deepfakes in spear-phishing attacks poses a range of risks, including the spread of misinformation and fake news, identity theft, privacy violations, and malicious alteration of content. The recent deceptive robocalls in New Hampshire highlight the harmful potential of this technology. I also spoke recently with the Boston Globe about this technology and how easy and inexpensive it is to create such deepfake audio.
Anyone, even without a significant technical background, can create such audio using a number of tools available online. Fake news from deepfake generators could disrupt financial markets and even elections. Voice theft used to access voice-operated bank accounts, and the unauthorized use of a person's vocal identity for financial gain, are reminders that a strong response is urgently needed. Further risks include privacy violations, where an attacker exploits a victim's audio without permission or consent, and content alteration, where an attacker changes what the original audio says, which can have serious repercussions.
Two basic and important directions have emerged in designing systems to detect fake audio: artifact detection and biometric detection. When audio is produced by a generative model, the model introduces artifacts into the generated signal, and researchers design algorithms and models to detect them. As audio deepfake generators grow more sophisticated, however, this approach runs into problems: future models may leave very few or almost no artifacts. Biometric detection, by contrast, leverages qualities of natural speech, such as breathing patterns, intonation, and rhythm, that are difficult for AI models to replicate accurately. Some companies, such as Pindrop, are developing solutions along these lines to detect audio fakes.
Strategies such as audio watermarking also serve as a proactive defense, inserting encrypted identifiers into the original audio to trace its origin and deter tampering. Although watermarks have their own potential vulnerabilities, such as the risk of replay attacks, ongoing research and development in this field offers promising ways to mitigate the threat posed by audio deepfakes.
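The watermarking idea can be sketched in a few lines. The least-significant-bit scheme below is a toy, chosen only to show the embed-and-extract round trip; production audio watermarks are spread-spectrum designs built to survive compression, resampling, and replay. Function names and the payload are illustrative assumptions.

```python
import numpy as np

def embed_watermark(samples: np.ndarray, payload: bytes) -> np.ndarray:
    """Embed payload bits into the least significant bit of 16-bit PCM samples."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    if len(bits) > len(samples):
        raise ValueError("audio too short for payload")
    marked = samples.copy()
    # Clear the LSB of the first len(bits) samples, then write the payload bits.
    marked[: len(bits)] = (marked[: len(bits)] & ~1) | bits
    return marked

def extract_watermark(samples: np.ndarray, n_bytes: int) -> bytes:
    """Read back n_bytes of payload from the sample LSBs."""
    bits = (samples[: n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

# Round trip on silent audio; the marked signal differs by at most 1 LSB.
audio = np.zeros(256, dtype=np.int16)
marked = embed_watermark(audio, b"org-42")
print(extract_watermark(marked, 6))
```

The fragility of this toy (any lossy re-encode destroys the LSBs) is precisely why real systems spread the identifier across perceptually robust features instead.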
Q: Despite the potential for misuse, what are the positive aspects and benefits of audio deepfake technology? How do you see the relationship between AI and our experience of audio evolving in the future?
A: Although attention has focused on the nefarious applications of audio deepfakes, the technology has enormous potential for positive impact across a variety of fields. Beyond creative uses, where voice conversion enables unprecedented flexibility in entertainment and media, audio deepfakes offer transformative possibilities in health care and education. For example, ongoing work on anonymizing patient and physician voices in cognitive health care interviews enables the global sharing of important medical data for research while preserving privacy, and sharing this data among researchers fosters advances in the field. Applying this technology to speech restoration offers hope of improved communication and quality of life for individuals with speech impairments, such as those caused by ALS or dysarthria.
I am very optimistic about the future impact of audio-generating AI models. Future interaction between AI and audio perception is poised for breakthroughs, especially through the lens of psychoacoustics, the study of how humans perceive sound. Innovations in augmented and virtual reality, exemplified by devices like the Apple Vision Pro, are pushing audio experiences toward unparalleled realism. We have recently seen an exponential increase in sophisticated models, with new ones released almost every month. The rapid pace of research and development promises not only to refine these technologies but also to expand their applications in ways that greatly benefit society. Despite the inherent risks, the potential of audio-generating AI models to revolutionize health care, entertainment, education, and beyond demonstrates the positive trajectory of this field of research.