Designing High-Performance and Scalable Middleware for HPC, AI, and Data Science
Overview
Abstract
This talk will focus on challenges and opportunities in designing middleware for HPC, AI (Deep/Machine Learning), and Data Science. We will start with the challenges in designing runtime environments for MPI+X programming models by considering support for multi-core systems, high-performance networks (InfiniBand and RoCE), GPUs, and emerging BlueField-2 DPUs. Features and sample performance numbers of the MVAPICH2 libraries will be presented. For the Deep/Machine Learning domain, we will focus on MPI-driven solutions to extract performance and scalability for popular Deep Learning frameworks (TensorFlow and PyTorch), large out-of-core models, and BlueField-2 DPUs. MPI-driven solutions to accelerate data science applications like Dask will be highlighted. Challenges and experiences in deploying this middleware in HPC cloud environments on Azure, AWS, and Oracle Cloud will also be presented. The talk will conclude with an overview of the newly established NSF-AI Institute ICICLE (https://icicle.osu.edu/), which addresses challenges in designing future high-performance edge-to-HPC/cloud middleware for AI-driven data-intensive applications.
Brief Biography
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He has published over 500 papers. The MVAPICH2 MPI libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently used by more than 3,275 organizations in 90 countries. The software has been downloaded more than 1.61 million times from the project's site and powers many clusters in the TOP500 list. High-performance and scalable solutions for Deep Learning frameworks and Machine Learning applications are available from https://hidl.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow and the recipient of the 2022 IEEE Charles Babbage Award. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.