From Reactive to Proactive: How SREs Can Optimize Their Application Services Before Users Are Affected
Engineering teams are building AI agents capable of correlating signals across millions of data points, executing tool calls, and maintaining state through complex reasoning. However, the non-deterministic nature of agents makes it difficult to understand quality and explain variance across outcomes. Evals are how teams make agent performance measurable and repeatable, moving beyond the final answer to evaluate the entire chain of reasoning.
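To make trajectory-level evals concrete, here is a minimal sketch in Python. Every name in it (AgentStep, EvalCase, run_eval, and the scoring heuristics) is hypothetical and invented for illustration; this is not Datadog's eval system or the Bits AI SRE API, only one way to score the chain of reasoning alongside the final answer.

```python
from dataclasses import dataclass

# Hypothetical types for illustration only, not a Datadog API.

@dataclass
class AgentStep:
    tool: str          # tool the agent called at this step
    observation: str   # what the tool returned

@dataclass
class EvalCase:
    prompt: str                # incident scenario fed to the agent
    expected_tools: list[str]  # tool calls a correct trajectory uses
    expected_answer: str       # ground-truth root cause

def score_trajectory(steps: list[AgentStep], case: EvalCase) -> float:
    """Fraction of expected tool calls the agent actually made, in any order."""
    called = {s.tool for s in steps}
    return sum(1 for t in case.expected_tools if t in called) / len(case.expected_tools)

def score_answer(answer: str, case: EvalCase) -> float:
    """Crude substring grader; real systems use rubric or LLM graders."""
    return 1.0 if case.expected_answer.lower() in answer.lower() else 0.0

def run_eval(agent, cases: list[EvalCase]) -> dict[str, float]:
    """Run every case through the agent and average both scores.

    `agent` is any callable returning (steps, final_answer). Because agents
    are non-deterministic, repeating the run makes variance visible.
    """
    traj, ans = [], []
    for case in cases:
        steps, answer = agent(case.prompt)
        traj.append(score_trajectory(steps, case))
        ans.append(score_answer(answer, case))
    n = len(cases)
    return {"trajectory": sum(traj) / n, "answer": sum(ans) / n}

# Example with a stub agent that checks recent deploys, then metrics.
fake_agent = lambda prompt: (
    [AgentStep("list_recent_deploys", "deploy at 14:02"),
     AgentStep("query_metrics", "error rate spiked at 14:03")],
    "Root cause: bad deploy at 14:02",
)
cases = [EvalCase("checkout errors spiking",
                  ["list_recent_deploys", "query_metrics"], "bad deploy")]
print(run_eval(fake_agent, cases))  # {'trajectory': 1.0, 'answer': 1.0}
```

Scoring the trajectory separately from the answer is what distinguishes this from a simple pass/fail check: an agent that guesses the right root cause without consulting the right evidence scores well on the answer but poorly on the trajectory.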
In the high-stakes world of incident response, agent performance isn't abstract; it's whether tools can reliably accelerate triage, produce accurate RCAs, and guide remediation during an engineer's highest-pressure moments. In this session, we'll explore how Datadog uses evaluations as the core engineering loop to objectively measure Bits AI SRE, and what we learned building that system. Whether you're building your own AI agents or using Datadog Bits AI SRE, you'll leave understanding how rigorous evals drive better reasoning on the problems that matter most.