Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Heterogeneous Large-Scale Systems
Overview
Abstract
Out-of-core simulations often produce more data than fits in the aggregate fast memory of the compute nodes, and they must read these data back for later computation. As a result, I/O data movement can become a bottleneck in large-scale simulations. Advances in memory architecture have made it feasible to integrate hierarchical storage media into large-scale systems, ranging from traditional Parallel File Systems (PFSs) through intermediate fast storage technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst Buffers) up to the CPU’s main memory and the GPU’s High Bandwidth Memory. However, while adding more and faster storage media increases I/O bandwidth, it also puts pressure on the CPU, which becomes responsible for managing and moving data between these storage layers. Simulations are thus vulnerable to being blocked by I/O operations. The Multilayer Buffer System (MLBS) proposed in this research demonstrates a general method for overlapping I/O with computation that relieves the strain on the processors through asynchronous access. The main idea is to decouple I/O operations from computational phases by using dedicated hardware resources to perform expensive context switches. By continually prefetching up and down across all hardware layers of the memory/storage subsystems, MLBS shifts the originally I/O-bound behavior of the evaluated applications closer to a memory-bound or compute-bound regime. The evaluation on the Cray XC40 Shaheen-2 supercomputer for a representative I/O-bound application, seismic inversion, shows that MLBS outperforms state-of-the-art PFSs, i.e., Lustre, Data Elevator, and DataWarp, by 6.06X, 2.23X, and 1.90X, respectively. On the IBM-built Summit supercomputer, using 2048 compute nodes equipped with a total of 12288 GPUs, MLBS achieves up to a 1.4X speedup over the reference PFS-based implementation.
MLBS is also demonstrated on applications from cosmology and combustion, as well as on classic out-of-core computational physics and linear algebra routines.
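The general idea of overlapping I/O with computation can be illustrated with a minimal sketch. This is not the MLBS implementation: it assumes a single producer thread standing in for the dedicated I/O resource, a bounded in-memory queue standing in for one buffer layer, and hypothetical `read_block` and `compute` functions that simulate a slow read and a compute phase.

```python
import queue
import threading
import time


def read_block(i):
    """Hypothetical stand-in for a slow read from a lower storage layer."""
    time.sleep(0.01)
    return [i] * 4


def compute(block):
    """Hypothetical stand-in for the computational phase."""
    time.sleep(0.01)
    return sum(block)


def prefetcher(num_blocks, buffer):
    """Dedicated I/O thread: fills the bounded buffer ahead of the consumer."""
    for i in range(num_blocks):
        buffer.put(read_block(i))  # blocks while the buffer is full
    buffer.put(None)               # sentinel: no more data


def run_overlapped(num_blocks, depth=2):
    """Compute on block i while the I/O thread is already reading later blocks."""
    buffer = queue.Queue(maxsize=depth)
    io_thread = threading.Thread(target=prefetcher, args=(num_blocks, buffer))
    io_thread.start()
    results = []
    while True:
        block = buffer.get()
        if block is None:
            break
        results.append(compute(block))
    io_thread.join()
    return results


def run_serial(num_blocks):
    """Reference version: each read completes before its compute phase starts."""
    return [compute(read_block(i)) for i in range(num_blocks)]
```

Both functions return the same results; the overlapped version simply hides part of the read latency behind computation. The `depth` parameter is an assumption of this sketch and bounds how far ahead the I/O thread may run.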
Brief Biography
Tariq Alturkestani is pursuing a Ph.D. in computer science at the King Abdullah University of Science and Technology (KAUST) in Thuwal. Under the supervision of his advisor, Professor David Keyes, Tariq’s dissertation research focuses on overlapping I/O and computation in large-scale scientific computing using multilayered buffering mechanisms. His work uses emerging storage technologies such as node-local and remote-shared solid-state drives (SSDs) and Non-Volatile Memory Express (NVMe) devices. He has represented KAUST in collaborative research, most notably in early 2018 at Saudi Aramco, in 2015 at the University of Pittsburgh in Pennsylvania, and a year earlier at IBM Research in Austin, Texas.
His work was recently accepted for publication at the 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC) and the 26th International European Conference on Parallel and Distributed Computing (Euro-Par).
Tariq’s journey with KAUST began immediately upon graduating from high school in Jeddah in June 2008. He is part of the first KAUST Gifted Students Program (KGSP) group, a group of 24 Saudi students selected by KAUST to pursue their bachelor's degrees in top universities in the United States and the United Kingdom. Tariq joined the Pennsylvania State University (Penn State) in 2009. In 2013, he was awarded a bachelor’s degree in Computer Science with a minor in Mathematics. He then joined KAUST and received a Master's degree in Computer Science in December 2014.