Compressed Communication in Distributed Deep Learning and Generative Language Models

Overview

Abstract

Deep Learning and generative Artificial Intelligence have grown rapidly during the past few years due to advances in computing power and parallel distributed training algorithms. As a result, it has become common practice to use hundreds or thousands of machines to train very large Deep Neural Networks. When datasets and models are extremely large, multiple data centers need to be orchestrated to perform Federated Learning in order to further accelerate the training process. However, training such large models, which consist of billions of parameters, incurs frequent, massive data exchanges over the network, making communication a significant bottleneck in the training pipeline. In addition, there is also frequent data movement between host memory and accelerators (e.g., GPUs) when performing long text generation with language models, which increases the latency of text generation and therefore degrades the user experience. Compressed communication is one of the most effective solutions to these challenges. It removes redundancy in the transferred data by using low-precision, low-rank, or sparse representations while maintaining training or inference quality. In practice, however, compressed communication algorithms often require non-trivial adaptation and solid implementation in order to deliver the promised outcome, including accuracy guarantees and wall-clock speedup.
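
As a concrete illustration of the sparse-representation family mentioned above, the following minimal PyTorch sketch compresses a gradient to its top-k largest-magnitude entries before communication and reconstructs a dense tensor on the receiving side. The function names and the 1% ratio are illustrative assumptions, not the dissertation's actual framework.

    import torch

    def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
        # Keep only the largest-magnitude entries; communicating the
        # (values, indices) pair is far cheaper than the dense gradient.
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)
        return flat[indices], indices

    def topk_decompress(values, indices, shape):
        # Scatter the surviving entries back into a dense, mostly-zero tensor.
        flat = torch.zeros(shape.numel(), dtype=values.dtype, device=values.device)
        flat[indices] = values
        return flat.view(shape)

    # Example: a 1% top-k representation of a simulated gradient.
    g = torch.randn(1024, 1024)
    values, indices = topk_compress(g, ratio=0.01)
    g_hat = topk_decompress(values, indices, g.shape)

In practical systems, a residual (the entries dropped this round) is typically accumulated locally and added back to the next gradient, so that no update is lost permanently.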

This dissertation focuses on the practical challenges of deploying compressed communication in various settings: high-performance distributed training on local clusters, bandwidth-limited federated learning among geo-distributed data centers, and inference of generative large language models on commodity machines. First, we introduce a universal gradient compression framework that offers a standardized benchmark for practitioners to explore the trade-offs between training accuracy and communication efficiency in distributed training algorithms. Next, we present efficient sparse tensor compression algorithms designed specifically for communication-efficient federated deep learning. We then demonstrate a novel distributed deep learning optimizer that seamlessly integrates sparse communication with large-batch training algorithms, enabling scalability to thousands of GPUs. Finally, we introduce a novel approximate attention mechanism that minimizes the KV-cache transfer overhead between the host and the accelerator, improving overall inference efficiency without sacrificing the accuracy of large language models.
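
To make the last idea concrete: one way to cut KV-cache traffic is to score the cached keys against the current query on the host and copy only the most relevant entries to the accelerator. The sketch below is a simplified, hypothetical variant under that assumption (single head, unbatched, top-k selection); it is not the dissertation's actual mechanism, and the function name and default k are illustrative.

    import torch

    def approx_attention_fetch(query, keys_host, values_host, k=256):
        # keys_host / values_host: (seq_len, d) tensors resident in CPU memory.
        # Score cached keys against the query on the host, then transfer only
        # the k most relevant (key, value) rows instead of the whole cache.
        scores = keys_host @ query.cpu()                       # (seq_len,)
        _, idx = torch.topk(scores, min(k, scores.numel()))
        device = query.device
        k_sel = keys_host[idx].to(device, non_blocking=True)   # (k, d)
        v_sel = values_host[idx].to(device, non_blocking=True)
        # Ordinary scaled dot-product attention over the fetched subset.
        attn = torch.softmax(k_sel @ query / query.numel() ** 0.5, dim=0)
        return attn @ v_sel

    # Example usage with arbitrary sizes (single head, one decoding step).
    d, seq_len = 64, 4096
    q = torch.randn(d)                                        # current query
    K, V = torch.randn(seq_len, d), torch.randn(seq_len, d)   # host-side cache
    out = approx_attention_fetch(q, K, V, k=256)

Because softmax concentrates attention weight on the highest-scoring keys, fetching only those entries can approximate full attention while moving a small fraction of the cache across the host-accelerator link.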

Brief Biography

Hang Xu is a Ph.D. student in the Computer Science program in the InfoCloud Research Group, under the supervision of Professor Panagiotis Kalnis, at King Abdullah University of Science and Technology (KAUST). His current research interests include distributed systems, machine learning algorithms, and high-performance computing.
