
Architecture and System Co-Design for High-Performance LLM Inference and Training
This talk explores the advantages and challenges of integrated training and inference architectures for AI, which are becoming increasingly necessary as AI models evolve and the demand for efficient computing power grows.
Overview
Artificial intelligence breakthroughs such as large language models (LLMs) have led to a sharp increase in the demand for computing power. Data centers now rely extensively on GPUs and NPUs to meet this demand, making these processors the main engines of computing infrastructure. At the same time, AI models can continuously learn and evolve while interacting with users, which causes them to switch frequently between training and inference tasks. As a result, integrated training and inference architectures will become increasingly popular. Compared with a separated training and inference architecture, an integrated one offers advantages in response speed, operational cost, and resource utilization, for example by reducing data transmission and storage redundancy. However, the integrated architecture also introduces new full-stack system challenges, including numerical data types, multi-operator fusion, memory management, and co-located deployment. In this talk, I will share our group's ideas, preliminary explorations, and practices in integrated training and inference system architecture, and propose future directions for fully exploiting the potential of this architecture.
Presenters
Jingwen Leng, Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Brief Biography
Jingwen Leng is a full professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University. His research focuses on intelligent computer system design for artificial intelligence, with an emphasis on performance, energy efficiency, and reliability. He has received multiple grants from the National Natural Science Foundation of China and leading industrial companies. He has published more than 80 papers at top-tier computer architecture conferences and holds more than 20 domestic and international patents. His work has received best paper awards or nominations at venues including IEEE Micro Top Picks, ISCA, DAC, and PACT. He received the Olympus Award from Huawei (2024), the DAMO Young Fellow award from Alibaba (2020), and a Microsoft Young Faculty Fellowship (2019).