AI Snitches Get Glitches: Towards Evading Agentic Surveillance

Abstract

The AI agent your employer gives you can read your files, write your emails, and talk to APIs on your behalf. But what happens when it starts working for them instead of you? An agent with that much access can quietly watch what you do and send off a report. You might never get a say in any of it.

We call this agentic surveillance, and we built SurveilBench to measure it across corporate, education, and police scenarios. The results are unsettling: some models start snitching on their own, with no one asking them to. In some cases, they'll also turn around and report the surveillance itself to the government, an act of counter-surveillance.

To fight back, we repurposed prompt injection and tested three ways to slip past a watching agent: hide from it, misdirect it, or push it into over-escalating. Agentic surveillance is already easy to pull off, and it's time we took the technical, ethical, and legal guardrails seriously.

A helpful assistant, one hidden send