Evaluate and Optimize AI Agent Performance
About this Session
Large language model (LLM) applications are nondeterministic: the same input can produce different outputs from one run to the next. Traditional testing methods and operational metrics are not sufficient to measure output quality, accuracy, or safety in agentic workflows. When you change a prompt, model, or your application’s architecture, how do you know if the changes actually made things better?
In this workshop, you'll learn how to use Datadog LLM Observability's Experiments and Evaluations features to systematically measure and improve the quality of your agentic AI applications. Evaluations enable you to measure quality in production, and Experiments help you validate changes offline using real production traces. Together, they enable deliberate, data-driven improvement instead of relying on guesswork or trial and error.
Through a hands-on lab, you'll walk through a full development loop: identifying a quality issue, creating an evaluator that defines what "good" means for your use case, running an experiment to compare variations, and confirming improvements with runtime evaluations and monitors. By the end of the workshop, you'll have the skills to build a continuous feedback loop between production monitoring and pre-deployment testing, so you can ship improvements with confidence.
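To make the loop concrete, here is a minimal sketch in plain Python of an evaluator and an offline comparison between two variations. It deliberately does not use the Datadog LLM Observability SDK, so every name in it (`DATASET`, `keyword_coverage`, `baseline_agent`, `candidate_agent`, `run_experiment`) is a hypothetical stand-in, and the two "agents" are hard-coded stubs in place of real LLM calls. The scoring rule (keyword coverage) is only one assumed definition of "good"; a real evaluator would encode whatever matters for your use case.

```python
from statistics import mean
from typing import Callable

# Hypothetical dataset: each record pairs an input with the facts a good answer must mention.
# In practice these records would come from real production traces.
DATASET = [
    {"input": "How do I reset my password?", "expected_keywords": ["settings", "reset link"]},
    {"input": "How do I export my data?", "expected_keywords": ["export", "csv"]},
]


def keyword_coverage(output: str, expected_keywords: list[str]) -> float:
    """Evaluator: fraction of expected keywords found in the output (0.0 to 1.0)."""
    if not expected_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


def baseline_agent(question: str) -> str:
    """Stub for the current version of the application (no real LLM call)."""
    return "Go to Settings."


def candidate_agent(question: str) -> str:
    """Stub for the changed version, for example with a revised prompt (no real LLM call)."""
    canned_answers = {
        "How do I reset my password?": "Open Settings and click the reset link we email you.",
        "How do I export my data?": "Use the Export button to download a CSV file.",
    }
    return canned_answers.get(question, "Go to Settings.")


def run_experiment(
    task: Callable[[str], str],
    dataset: list[dict],
    evaluator: Callable[[str, list[str]], float],
) -> float:
    """Run the task on every record and return the mean evaluator score."""
    scores = [evaluator(task(record["input"]), record["expected_keywords"]) for record in dataset]
    return mean(scores)


if __name__ == "__main__":
    baseline_score = run_experiment(baseline_agent, DATASET, keyword_coverage)
    candidate_score = run_experiment(candidate_agent, DATASET, keyword_coverage)
    print(f"baseline:  {baseline_score:.2f}")
    print(f"candidate: {candidate_score:.2f}")
```

Because real LLM outputs vary between runs, a comparison like this is usually repeated over a larger dataset, and possibly several runs per record, before a difference in mean score is trusted.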
Related Sessions
From Reactive to Proactive: How SREs Can Optimize Their Application Services Before Users Are Affected
Build with LLM Observability: From Setup to Signal
Datadog Core Skills for Developers - Pre-Day
Datadog Core Skills for Site Reliability Engineers (SREs) - Pre-Day
Serverless Observability on AWS
From Ingestion to AI: Ensuring Data Reliability Across the Full Lifecycle
How AI Is Redefining the Datadog Experience—and How to Make the Most of It
The AI Engineering Playbook: How to Evaluate & Iterate at Every Phase of Development