Skip to main content

About this Session

Across 656 runs on Claude Sonnet, Opus, and Haiku, the simple statistical shape of an agent's tool calls (how many, in what order, how varied) is enough to spot misbehavior with 100% precision on well-defined tasks, with no access to the model's reasoning. On vague tasks, the same signal falls apart, and a known 30% slice of misaligned runs looks completely normal until you compare outputs.

 

In this talk, we share how a runtime tripwire built from a baseline of clean runs flags drifting traces in structured agent workflows, where this approach breaks down and why output diffing is the only thing that catches the 30% blind spot, and which class of adversarial prompt (reward framing) was the only one to reliably break alignment in our runs, pointing to where defensive effort actually pays off.

Related Sessions

DASH 2027 is coming—Be in the know

Sign up for exclusive previews and announcements. Join us in NYC, June 15-17, 2027.

Thank you for your signing up

You’re on the list to receive updates for Datadog DASH 2027!