Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads


Overview

Abstract

This presentation will focus on addressing the communication bottlenecks in distributed deep learning (DDL) training workloads. Deep neural networks (DNNs) are widely used in various domains, but training them can be time-consuming, especially with large models and datasets. Three innovative solutions are proposed and evaluated in the dissertation. The first, SwitchML, introduces an in-network aggregation primitive that reduces the volume of exchanged data, accelerating DDL workloads by up to 5.5 times. The second, OmniReduce, optimizes sparse collective communication, outperforming existing solutions by 3.5 to 16 times, even at 100 Gbps. The third, CoInNetFlow, tackles congestion in shared data centers, significantly reducing Job Completion Time Inflation. These solutions hold promise for enhancing the efficiency and speed of distributed deep learning systems, benefiting DNN training across various domains.
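To give a flavor of the idea behind sparse collective communication (the gist of OmniReduce), the minimal Python sketch below aggregates only the non-zero blocks of each worker's gradient rather than the full dense tensor. It is a conceptual illustration, not the dissertation's actual protocol or implementation; the block size and helper names are assumptions made here for clarity.

```python
# Conceptual sketch: sparse aggregation exchanges only non-zero gradient blocks.
import numpy as np

BLOCK = 4  # hypothetical block size, chosen only for illustration

def nonzero_blocks(grad):
    """Split a flat gradient into fixed-size blocks and keep the non-zero ones."""
    blocks = grad.reshape(-1, BLOCK)
    idx = np.flatnonzero(np.any(blocks != 0, axis=1))
    return {int(i): blocks[i] for i in idx}

def sparse_aggregate(worker_grads):
    """Sum per-worker sparse blocks; only the transmitted blocks are combined."""
    total = {}
    for grad in worker_grads:
        for i, blk in nonzero_blocks(grad).items():
            total[i] = total.get(i, np.zeros(BLOCK)) + blk
    return total  # the dense result can be rebuilt from the block indices

# Two workers with mostly-zero gradients: only two blocks need to be exchanged.
g1 = np.zeros(16); g1[0:2] = 1.0
g2 = np.zeros(16); g2[8] = 2.0
print(sparse_aggregate([g1, g2]))  # {0: [1, 1, 0, 0], 2: [2, 0, 0, 0]}
```

When gradients are highly sparse, as is common for large embedding layers, communicating only the populated blocks is what yields the large reductions in exchanged data that the abstract describes.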

Brief Biography

Chen-Yu Ho is a Ph.D. candidate in computer science at KAUST. He received a B.S. from National Taiwan University and an M.S. from KAUST. During his undergraduate studies, he worked on image segmentation and on techniques for converting handwriting and ancient Chinese calligraphy into digital fonts. His current research focuses on identifying bottlenecks in distributed machine learning training systems and developing systems that address them.

Presenters

Chen-Yu Ho, Ph.D. candidate, Computer Science, KAUST