Scaling deep learning to a large cluster of workers is challenging because of the high communication overhead that data parallelism entails. This talk describes our efforts to rein in distributed deep learning's communication bottlenecks. We describe SwitchML, the state-of-the-art in-network aggregation system for collective communication using programmable network switches. We introduce OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use. We touch on our work to develop compressed gradient communication algorithms that perform efficiently and adapt to network conditions. Lastly, we take a broad look at the challenges of accelerating decentralized training in the federated learning setting, where heterogeneity is an intrinsic property of the environment.
Marco does not know what the next big thing will be, but he is sure that our next-generation computing and networking infrastructure must be a viable platform for it and avoid stifling innovation. Marco's research spans a number of areas in computer systems, including distributed systems, large-scale/cloud computing, and computer networking, with an emphasis on programmable networks. His current focus is on designing better systems support for AI/ML and providing practical implementations deployable in the real world.
Marco is an associate professor in Computer Science at KAUST. He obtained his Ph.D. in computer science and engineering from the University of Genoa in 2009 after spending the last year as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior research scientist at Deutsche Telekom Innovation Labs & TU Berlin. Before joining KAUST, he was an assistant professor at UCLouvain. He has also held positions at Intel, Microsoft and Google.