Metrics, Mayhem, and Microservices: Taming the Cloud Observability Beast
Cloud applications scale their workloads across massively distributed software and hardware infrastructure to deliver fast performance and meet stringent service level objectives. Recent advances in AI and the promising results of flagship AI models have reinforced this trend, fueling considerable additional investment by all major cloud players to scale their infrastructure further. Yet fault tolerance and performance debugging remain among the relatively few levers for managing this exponential growth, and both demand ubiquitous system instrumentation and observability to anticipate failures or identify root causes post-mortem. Overall, observability has become mission-critical for operating cloud technology at scale. In this talk, we introduce the first-ever offloading of observability operations to data centers' Infrastructure Processing Unit (IPU) accelerators. This novel architecture can significantly reduce the cost and increase the operational efficiency of managing large-scale cloud systems that serve millions of users daily.
Overview
First, we examine the efficiency of observability for cloud-native microservice applications and quantify the impact of today's observability on application performance. We show that state-of-the-art observability frameworks fail to meet the demands of cloud-native environments: they either incur crippling complexity and high costs for collecting and storing huge data volumes, or sacrifice event coverage through coarse-grained sampling. Then, we present our framework, MicroView, which leverages the proximity between IPUs and the monitored services to tackle the observability bloat. Crucially, IPUs enable MicroView to run continuous, real-time, high-resolution analysis of observability data in a lightweight data plane without hurting application performance. The performance evaluation on representative benchmark applications demonstrates that MicroView's real-time analysis helps to (1) anticipate SLO violations, (2) narrow the focus to informative observability data, and (3) trigger useful signals about service performance.
Presenters
Alessandro Cornacchia, Postdoctoral Fellow
Brief Biography
Alessandro Cornacchia is a Postdoctoral Fellow in the SANDS lab of the Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division at KAUST. He received his PhD degree with merit in telecommunications engineering from Politecnico di Torino (Italy) in 2024. Prior to joining KAUST, he took part in the RESTART initiative, funded by the European Union, by deploying and operating the national PROGNOSE lab in Italy. In 2020, he visited Huawei's research center in Shenzhen. Alessandro's research addresses various aspects of data center networks and systems. His most recent work has focused on the lightweight acquisition and real-time analysis of monitoring signals using state-of-the-art programmable network devices. Other key research areas include traffic congestion control and flow scheduling to reduce communication costs within the data center network fabric. He is currently working on enhancing system visibility and optimizing performance analysis for distributed AI workloads.