DASH NYC, June 9-10 | AI + Observability.

About this Session

Across 656 runs on Claude Sonnet, Opus, and Haiku, the simple statistical shape of an agent's tool calls (how many, in what order, how varied) is enough to spot misbehavior with 100% precision on well-defined tasks, with no access to the model's reasoning. On vague tasks, the same signal falls apart, and a known 30% slice of misaligned runs looks completely normal until you compare outputs.
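
As a rough illustration of what "statistical shape" could mean here, the sketch below reduces one run's tool-call sequence to three numbers: how many calls, how varied, and how they are ordered. The function name `trace_shape` and the specific features (Shannon entropy for variety, distinct adjacent transitions for order) are assumptions made for illustration, not the feature set used in the study.

```python
import math
from collections import Counter


def trace_shape(tool_calls):
    """Summarize a run's tool-call sequence as three simple numbers.

    tool_calls: list of tool names in the order the agent called them.
    The chosen features are illustrative assumptions.
    """
    n = len(tool_calls)
    counts = Counter(tool_calls)
    # Variety: Shannon entropy (in bits) of the tool-name distribution.
    entropy = (
        -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    )
    # Order: number of distinct adjacent (tool_a, tool_b) transitions.
    distinct_transitions = len(set(zip(tool_calls, tool_calls[1:])))
    return {
        "n_calls": n,
        "entropy": entropy,
        "distinct_transitions": distinct_transitions,
    }


if __name__ == "__main__":
    print(trace_shape(["read", "read", "write", "write"]))
```

Nothing in this summary looks at the model's reasoning or the content of its outputs; it only needs the ordered list of tool names from a trace.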

 

In this talk, we cover three things. First, how a runtime tripwire built from a baseline of clean runs flags drifting traces in structured agent workflows. Second, where this approach breaks down, and why output diffing is the only thing that catches the 30% blind spot. Third, which class of adversarial prompt, reward framing, was the only one to reliably break alignment in our runs, and what that says about where defensive effort actually pays off.
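
A minimal sketch of the two checks described above, under stated assumptions: the tripwire is modeled as a per-feature z-score against the mean and spread of clean runs, and output diffing as a text dissimilarity score from the standard library's `difflib`. The function names, the `z_max=3.0` threshold, and the feature dictionaries are all hypothetical; the talk's actual method may differ.

```python
import statistics
from difflib import SequenceMatcher


def fit_baseline(clean_runs):
    """Compute (mean, population std dev) per feature from known-good runs.

    clean_runs: non-empty list of dicts mapping feature name -> number.
    """
    keys = clean_runs[0].keys()
    baseline = {}
    for k in keys:
        values = [run[k] for run in clean_runs]
        baseline[k] = (statistics.mean(values), statistics.pstdev(values))
    return baseline


def trips(baseline, run, z_max=3.0):
    """Return True if any feature of `run` sits more than z_max std devs out.

    z_max is an assumed threshold, not a value from the study.
    """
    for k, (mu, sigma) in baseline.items():
        # Assumption: fall back to a unit spread when the baseline is constant.
        spread = sigma if sigma > 0 else 1.0
        if abs(run[k] - mu) / spread > z_max:
            return True
    return False


def output_drift(reference_output, candidate_output):
    """Dissimilarity in [0, 1] between a reference output and a candidate."""
    ratio = SequenceMatcher(None, reference_output, candidate_output).ratio()
    return 1.0 - ratio


if __name__ == "__main__":
    clean = [
        {"n_calls": 4, "entropy": 1.0},
        {"n_calls": 5, "entropy": 1.2},
        {"n_calls": 6, "entropy": 1.1},
    ]
    baseline = fit_baseline(clean)
    print(trips(baseline, {"n_calls": 20, "entropy": 1.1}))
    print(output_drift("total: 42", "total: 17"))
```

The two checks are deliberately separate: a run whose trace looks normal never trips the first check, so comparing its output against a reference is the only place a difference can show up.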
