From Reactive to Proactive: The Journey to Achieving Hyperscale Availability with AI-Driven Prediction
About this Session
At hyperscale, a regional cloud outage is more than a technical disruption: for Samsung Account, which serves 2.1 billion users across three global regions, it is an immediate global service crisis. Fragmented, region-siloed monitoring creates blind spots that make early detection nearly impossible, leaving SRE teams perpetually reactive rather than predictive. The path to proactive reliability requires both a philosophical shift and a foundational change in how observability data is collected, unified, and reasoned over.
In this session, Samsung’s Je Min Kim (Dev Lead) and Junhee Kim (DevOps Engineer) share how their team built an agentic AI platform whose AIOps Agent, using the Datadog MCP Server, predicted a major regional cloud failure before it happened. They also explain how that event catalyzed a rebuild of their telemetry strategy around a single source of truth.
They walk through the real outage case study, in which AI-assisted analysis surfaced subtle precursor signals across the service, infrastructure, managed database, and DNS layers that no single alert would have caught. They then describe how Observability Pipelines and CloudPrem now serve as the unified telemetry foundation that lets their AI systems reason with greater confidence at global scale.
You will leave with a practical framework for evolving from reactive firefighting to a predictive, AI-augmented reliability culture and a concrete understanding of how unified telemetry architecture unlocks the full potential of AI-driven observability tools.