Making RPCs first class datacenter citizens

Marios Kogias, Researcher, Microsoft Research


"Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives on the scale of microseconds. Existing solutions use streaming or datagram-based transport protocols for RPCs, which impose overheads and limit design flexibility. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network policy enforcement.
In this talk I’ll describe the R2P2 internals and the design choices that allow efficient and scalable RPC policy enforcement on software or hardware middleboxes in the network by separating RPC target selection from request and reply streaming. I’ll cover three different RPC policies implemented on top of R2P2. Specifically, we’ll see how R2P2 enables efficient in-network RPC load balancing based on a novel join-bounded-shortest-queue (JBSQ) policy. JBSQ lowers tail latency by centralizing pending RPCs in the middlebox and ensuring that requests are only routed to servers with a bounded number of outstanding requests. Then, I’ll talk about SVEN, an SLO-aware RPC admission control mechanism implemented as an R2P2 policy on P4 programmable switches. Finally, I’ll describe HovercRaft, a new approach to building fault-tolerant generic RPC services by integrating state-machine replication in the transport layer. HovercRaft manages to increase both the resilience and the performance of general-purpose state-machine replication by adding nodes. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks, load balances requests, and offloads the fan-out/fan-in management to in-network compute."
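As a rough illustration of the JBSQ idea described in the abstract, the following Python sketch models a middlebox that holds pending requests in one central queue and forwards a request only to a server whose outstanding count is below a fixed bound, choosing the least-loaded server. It is a toy model, not the R2P2 implementation: the class name `JBSQBalancer`, its methods, and the `bound` parameter are all assumptions made for this example.

```python
from collections import deque


class JBSQBalancer:
    """Toy join-bounded-shortest-queue balancer (hypothetical names)."""

    def __init__(self, servers, bound):
        self.bound = bound                          # max outstanding requests per server
        self.outstanding = {s: 0 for s in servers}  # per-server outstanding count
        self.pending = deque()                      # central queue held at the middlebox

    def _pick(self):
        # Shortest queue first; ties go to the earliest-listed server.
        server = min(self.outstanding, key=self.outstanding.get)
        return server if self.outstanding[server] < self.bound else None

    def _drain(self):
        # Forward queued requests while some server is below its bound.
        dispatched = []
        while self.pending:
            server = self._pick()
            if server is None:
                break
            self.outstanding[server] += 1
            dispatched.append((self.pending.popleft(), server))
        return dispatched

    def submit(self, request):
        # A new request always joins the central queue first.
        self.pending.append(request)
        return self._drain()

    def complete(self, server):
        # A reply frees one slot on the server, which may release a queued request.
        self.outstanding[server] -= 1
        return self._drain()


lb = JBSQBalancer(["s1", "s2"], bound=1)
first = lb.submit("a")     # forwarded to an idle server
second = lb.submit("b")    # forwarded to the other idle server
third = lb.submit("c")     # both servers are at the bound, so "c" waits centrally
fourth = lb.complete("s2") # the freed slot on s2 releases "c"
```

With a bound of 1, a request never waits behind another request at a server; it waits at the balancer until a server is free. Larger bounds trade some of that property for keeping servers busy while replies are in flight.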

Brief Biography

Marios Kogias is a researcher at MSR Cambridge. He graduated from EPFL in August 2020. His main research interests are at the intersection of operating systems and networking in the datacenter. He works on building and understanding systems with strict tail-latency SLOs that leverage emerging hardware. He was an IBM PhD Fellow and won the best student paper award at EuroSys 2020. He has interned at CERN, Google, and Microsoft Research.