Voice AI Systems Are Vulnerable to Hidden Audio Attacks

Isn't this the same as the "adversarial image" attack, well known from (earlier) visual recognition models [1]? That would be quite an obvious vector.

[1]: https://www.science.org/content/article/turtle-or-rifle-hack...

In general, if you zoom all the way out, yes, the high-level optimization problem is very similar: find some `delta` where `target_y = model_inference(delta + x)`, `target_y != real_y`, and `size_of(delta) < threshold`.
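A minimal sketch of that optimization, to make the `delta` search concrete. Everything here is an assumption for illustration: the "model" is a hand-written two-class linear scorer rather than a real speech system, `size_of(delta)` is taken to be the largest absolute per-sample change, and `craft_delta`, `weights`, `steps`, and `step_size` are hypothetical names.

```python
# Toy illustration only: a hand-written linear "model", not a real ASR system.

def model_inference(x, weights):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def sign(v):
    return (v > 0) - (v < 0)

def craft_delta(x, weights, target_y, threshold, steps=50, step_size=0.01):
    """Iterative sign-gradient search, keeping max(|delta_i|) <= threshold."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        adv = [x_i + d_i for x_i, d_i in zip(x, delta)]
        current_y = model_inference(adv, weights)
        if current_y == target_y:
            break
        # For a linear model, the gradient of (target score - current score)
        # with respect to the input is the difference of the two weight rows.
        grad = [wt - wc for wt, wc in zip(weights[target_y], weights[current_y])]
        # Take a small step, then clip back into the allowed perturbation range.
        delta = [
            max(-threshold, min(threshold, d_i + step_size * sign(g_i)))
            for d_i, g_i in zip(delta, grad)
        ]
    return delta

weights = [[1.0, 0.5, -0.2, 0.1], [0.8, 0.6, 0.1, -0.1]]  # made-up weights
x = [0.5, 0.2, 0.1, 0.3]                                   # made-up input
threshold = 0.2
real_y = model_inference(x, weights)
target_y = 1
delta = craft_delta(x, weights, target_y, threshold)
adv = [x_i + d_i for x_i, d_i in zip(x, delta)]
print(real_y, model_inference(adv, weights), max(abs(d) for d in delta))
```

Real attacks replace the hand-derived gradient with backpropagation through the actual network, but the loop has the same shape: step toward the target label, then clip `delta` back under the threshold.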

But (1) older audio models typically used different architectures, like RNNs (recurrent networks), which came with additional challenges compared to the CNNs (convolutional networks) that image models used, e.g. the exploding gradients problem. During training of RNNs, vanishing gradients are a potential problem; during adversarial example ("advex") optimization the problem gets inverted and you have to do different things to solve it.
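To illustrate why gradients through a recurrent model can shrink or blow up, here is a rough sketch under a strong simplifying assumption: a scalar, purely linear recurrence `h_t = w * h_{t-1} + x_t`, which is not any real audio model. The function name and the weights 0.5 and 1.5 are hypothetical. An attack needs the gradient of the final state with respect to early input samples, and in this toy case that gradient is simply `w` multiplied by itself once per time step.

```python
def input_gradient_through_time(recurrent_weight, num_steps):
    """d h_T / d x_0 for h_t = recurrent_weight * h_{t-1} + x_t (scalar, linear)."""
    grad = 1.0
    for _ in range(num_steps):
        # Each time step multiplies the gradient by the recurrent weight.
        grad *= recurrent_weight
    return grad

vanishing = input_gradient_through_time(0.5, 50)   # |w| < 1: shrinks toward zero
exploding = input_gradient_through_time(1.5, 50)   # |w| > 1: grows without bound
print(vanishing, exploding)
```

With 50 steps the first value is vanishingly small and the second is in the hundreds of millions, so the same step size cannot work well for both regimes.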

Also (2) the human side of imperceptibility is very different with audio: ears vs. eyes.
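One common way to put a number on how loud a perturbation is relative to the audio it hides in is a signal-to-noise ratio in decibels. The sketch below is only an assumption about how `size_of(delta)` might be measured for audio; it is not necessarily the metric the commenter used, and a plain energy ratio ignores psychoacoustic effects such as masking. The tones, amplitudes, and 16 kHz sample rate are made up.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, delta):
    """Signal-to-perturbation ratio in decibels (higher means a quieter delta)."""
    return 20.0 * math.log10(rms(signal) / rms(delta))

SAMPLE_RATE = 16000  # assumed sample rate for this example
# One second of a 440 Hz tone standing in for speech (hypothetical input).
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
# A 3 kHz tone at 1% of the amplitude standing in for the perturbation.
delta = [0.01 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
ratio = snr_db(signal, delta)
print(round(ratio, 2))
```

A perturbation at 1% of the signal amplitude sits about 40 dB below it; whether a listener actually notices it depends on frequency content and timing, which this ratio does not capture.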

So, they're the same, but different.

source -- this is what my (unfinished) phd was on. i should really write up the attack that i crafted, but never got published :(

> "Audio modality is really challenging to comprehend because of how limited our hearing is"

Would it help to significantly lower the hearing capabilities of the AI system? At Juvoly, we always encouraged GPs to invest in a high-quality microphone like the Jabra Speak, connected through USB. A good mic results in much better audio transcriptions, but maybe that was all for the wrong reasons?

This isn't fully "hidden", but I've always wondered if AI scraping is the reason why short-form videos on YouTube/TikTok/Instagram featuring film/TV clips will sometimes have 2 audio tracks: one with the actual audio from the clip, a little louder, and one with a computer-generated narrator providing running commentary on what is happening and why. As a human I'm able to tune it out, but it is very weird/jarring.

In case anyone hasn't had the displeasure of viewing these I'll link some in a comment below once I scroll through my feed and find one.

I'd guess it's more a way to avoid YouTube's copyright detection/etc rather than AI scraping per se.

I'm failing to understand GP's logic here. Why would someone who's posting some TV show's content in complete disregard of their intellectual property rights be bothered about AI scraping?

Keeps their videos from being taken down automatically by platform content filters. It’s an additive audio track - a running summary of the clip most of the time, and so I imagine this is some fair-usey anti takedown protection. It’s everywhere on short form sites.

I thought it was done to keep the video from getting taken down due to copyright (with automatic scanners). I've also noticed that sometimes videos will have a horizontal line running through them; I've assumed that's so they wouldn't get flagged for copyright violation.

Isn’t that the audio description (AD) track? Maybe it’s mixed in because of an encoding error.

I wish... it is much more generic and touches on the plot which the audio description track wouldn't do. I'll see if I can find one. They are extremely annoying.

Phreaking is back on the menu, boys.

We have a ton of new vulnerabilities that have emerged from widespread use of AI systems. But at the same time the frontier AI labs are releasing tools like Mythos that purport to automate/simplify the identification of vulnerabilities.

Between these two trends, I struggle to see what the future holds for the security industry.

Either way, as is always the case with the tech industry, the incumbents in this space will be getting paid the big bucks and the consumer will ultimately be left holding the bag. We absolutely need tougher data privacy / security laws & I wonder what catastrophic event will force lawmakers and voters to take this issue seriously.

My feeling is the defender wins in the long-run. There's only a finite number of bugs and vulnerabilities.

Surely there is a mathematical model here, but intuitively I do think there is an infinite number of typos and errors possible across the set of all finite books, and similarly there would be an unlimited number of bugs and vulns across the set of Turing machines.

I doubt you can prove that.

Do you think the attacker or defender will have been the overall beneficiary of LLMs when we look back in 5 years from now?

I don't know, I think it will mostly come down to which side has better recruiting. In other words, all things being equal, I think it's a wash. It was the second part of your claim that I don't think can be proven.

Meet me in the cereal aisle.

I believe that will depend purely on how the AI models store voices in their neural networks. If we can debug that, then we would be able to send a secret sound that an AI model might be able to understand due to its internal connections, but that doesn't make sense to us. Until then, there's no harm, is my view.

Related: Benn Jordan shows how to poison-pill music against AI harvesting for training

The Art Of Poison-Pilling Music Files

https://www.youtube.com/watch?v=xMYm2d9bmEA

Does this transfer to Whisper / CLAP-type audio models or is it ASR-decoder specific? Whisper would be interesting given how widely it's used in prod.

Yeah, there have been several papers with attacks on Whisper:

- Inject adversarial noise to make it transcribe what you want (https://arxiv.org/abs/2210.17316)

- Stop it from transcribing (https://arxiv.org/abs/2405.06134)

- Adversarial prompt injection to make it translate instead of transcribe (https://arxiv.org/abs/2407.04482v2).

Audio adv. examples didn't use to show the same degree of transferability (generate for one model, works against another) that image adv. examples were able to achieve. Likely because of the RNN architecture, or just because audio is harder :shrug:

the article says

> This required full access to the model, restricting the researchers to open models with publicly available weights. They found, however, that attacks developed for open models transferred to commercial models from Microsoft and Mistral that share the same underlying architecture.

so it depends on what architecture Whisper is using (i don't think it's an LLM? or it wasn't last time i checked about 4 years ago lol)

edit -- replaced last section, missed this bit in the article
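To make "transfer" concrete, here is a toy sketch: a perturbation is crafted with full access to one model and then simply tried against a second model whose weights are never inspected. Both "models" are made-up linear scorers with deliberately similar weights, so the result is true by construction and says nothing about whether real systems behave this way. `open_model`, `closed_model`, and `epsilon` are hypothetical names.

```python
def predict(x, weights):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def sign(v):
    return (v > 0) - (v < 0)

# "Open" model: weights are visible, so the attacker can compute gradients.
open_model = [[1.0, 0.5, -0.2, 0.1], [0.8, 0.6, 0.1, -0.1]]
# "Closed" model: similar but not identical weights, only queried, never read.
closed_model = [[0.9, 0.55, -0.25, 0.15], [0.85, 0.5, 0.15, -0.05]]

x = [0.5, 0.2, 0.1, 0.3]  # made-up input, classified as 0 by both models
epsilon = 0.2             # per-sample perturbation budget
target_y = 1

# Single sign-gradient step computed on the open model only.
source_y = predict(x, open_model)
grad = [wt - ws for wt, ws in zip(open_model[target_y], open_model[source_y])]
adv = [x_i + epsilon * sign(g_i) for x_i, g_i in zip(x, grad)]

fools_open = predict(adv, open_model) == target_y
transfers = predict(adv, closed_model) == target_y
print(fools_open, transfers)
```

The attack transfers here only because the two weight sets point in nearly the same direction, which is the toy analogue of "share the same underlying architecture" in the quoted passage.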

The Bene Gesserit have entered the chat!

I'd like to commend Apple on being ahead of the curve with this kind of attack, I don't think Siri is susceptible to this at all. Mostly due to it not being able to hear/understand what I say in the best of times /s

It's insane to me how much of a nose-dive Siri or any Apple-based STT takes when there is _any_ noise in the background. I like to play music at low levels in my house just as background noise and I've noticed that if I'm playing any music my STT just goes to complete shit (often missing the last 2-3+ words and messing up things in the middle). On the other hand, in the exact same environment, Parakeet v3 (via MacWhisper) has zero issues even with background noise.