Network communication is a major bottleneck in large-scale distributed deep learning. To mitigate this problem, many compressed-communication schemes, based on quantization or sparsification, have been proposed. We investigate them from a computer-systems perspective, under real-life deployments. We identify discrepancies between the theoretical proposals and the actual implementations, and analyze their impact on convergence. We also develop a general framework for compressed communication that supports both TensorFlow and PyTorch, and employ it to implement many existing schemes. We provide a thorough quantitative evaluation on real models and datasets, and demonstrate the effect of various overheads, such as the computational cost of compression/decompression, on throughput. Finally, we report preliminary results on stochastic sparse communication primitives.
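To make the idea of sparsified communication concrete, the following is a minimal sketch of top-k gradient sparsification, one common scheme of the kind the talk evaluates. The function names (`topk_sparsify`, `desparsify`) are illustrative assumptions, not part of the framework described above; real implementations operate on GPU tensors and must also handle error feedback and index encoding.

```python
import heapq

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Returns (indices, values); all other coordinates are treated as zero,
    so only 2*k numbers need to be communicated instead of len(grad).
    """
    # Select the k coordinates with the largest absolute value.
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]

def desparsify(indices, values, n):
    """Reconstruct a dense length-n vector from the sparse representation."""
    dense = [0.0] * n
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Example: a 4-element gradient compressed to its top 2 entries.
idx, vals = topk_sparsify([0.1, -2.0, 0.05, 3.0], k=2)
# idx == [1, 3], vals == [-2.0, 3.0]
recovered = desparsify(idx, vals, 4)  # [0.0, -2.0, 0.0, 3.0]
```

The gap this talk examines sits exactly in code like this: the selection and packing steps have a real computational cost on every iteration, which theoretical analyses of such schemes typically ignore.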
Panos Kalnis is a Professor at King Abdullah University of Science and Technology (KAUST, http://www.kaust.edu) and served as Chair of the Computer Science program from 2014 to 2018. In 2009 he was a visiting assistant professor at Stanford University. Before that, he was an assistant professor at the National University of Singapore (NUS). Earlier in his career he was involved in the design and testing of VLSI chips, and worked in several companies on database design, e-commerce projects, and web applications. He served as associate editor for the IEEE Transactions on Knowledge and Data Engineering (TKDE) from 2013 to 2015, and on the editorial board of the VLDB Journal from 2013 to 2017. He received his Diploma from the Department of Computer Engineering and Informatics, University of Patras, Greece, in 1998, and his PhD from the Department of Computer Science, Hong Kong University of Science and Technology (HKUST), in 2002. His research interests include Big Data, Parallel and Distributed Systems, Large Graphs, and Systems for Machine Learning.