Pretraining Large Language Models on Japanese Supercomputers

Location
Building 1, Level 3, Room 3119

Abstract

Large language models (LLMs) have become part of our daily lives and are now indispensable tools for conducting research as well. The performance of LLMs is known to increase as model size and data size are scaled up. The global race to pretrain the most capable LLM has led to supercomputers of unprecedented scale being utilized at maximum capacity for months of training. This talk will describe recent efforts to pretrain LLMs on Japanese supercomputers such as Fugaku, ABCI, and TSUBAME, as well as on cloud resources such as Google Cloud and Sakura Internet, a Japanese cloud computing provider. The talk will cover topics such as filtering techniques for the pretraining data, distributed parallel training, frameworks and hyperparameters, evaluation benchmarks, continual pretraining, and cross-lingual transfer. Filtering the pretraining data is a crucial part of training, since a large portion of the data on the internet is not useful for training. Distributed parallelism is also an essential technology: multiple techniques such as data, tensor, pipeline, sequence, and expert parallelism are combined to pretrain the largest models. The space of frameworks and hyperparameters is vast, and the search is currently done mostly through trial and error. Evaluation benchmarks play an important role in determining which capabilities we target, but we must always be cautious about overfitting to specific benchmarks. Finally, I will give some examples of continual pretraining and cross-lingual knowledge transfer in Japanese LLMs.
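As a rough illustration of the kind of heuristic data filtering involved, the Python sketch below applies a few length, repetition, and symbol-ratio rules to raw web documents. The rules and thresholds are illustrative assumptions for this page, not the actual pipeline used in the work described in the talk.

# Minimal sketch of heuristic pretraining-data filtering.
# The thresholds below are illustrative assumptions, not a production pipeline.

def keep_document(text: str) -> bool:
    """Return True if a raw web document passes simple quality heuristics."""
    words = text.split()

    # Drop documents that are too short or too long to be useful.
    if not (50 <= len(words) <= 100_000):
        return False

    # Drop documents dominated by very short "words" (menus, navigation, tables).
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len < 3:
        return False

    # Drop highly repetitive pages: few unique words relative to total length.
    if len(set(words)) / len(words) < 0.2:
        return False

    # Drop pages with an unusually high symbol-to-character ratio (spam, markup dumps).
    symbols = sum(text.count(c) for c in "#{}<>|")
    if symbols / max(len(text), 1) > 0.1:
        return False

    return True


if __name__ == "__main__":
    bad = "click here click here click here " * 20
    good = (
        "Large language models are pretrained on trillions of tokens drawn from "
        "web crawls, books, and code. Before training, the raw text is filtered "
        "with heuristics and classifiers so that boilerplate, spam, and duplicated "
        "pages do not dominate the corpus, which would otherwise waste compute and "
        "hurt downstream quality on evaluation benchmarks."
    )
    print(keep_document(bad), keep_document(good))  # False True

Real filtering pipelines combine many more signals (language identification, deduplication, model-based quality scores), but the structure is the same: cheap per-document rules applied to billions of pages before training begins.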
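To make the combination of parallelism techniques concrete, the sketch below shows how tensor-, pipeline-, and data-parallel degrees compose into the total GPU count and the global batch size. The numbers are hypothetical and not the configuration of any run discussed here; in many frameworks sequence parallelism reuses the tensor-parallel group and expert parallelism adds a separate factor only for mixture-of-experts models, so neither appears in the product below.

# Minimal sketch of how 3D-parallel degrees compose.
# All numbers are hypothetical examples, not a configuration from the talk.

from dataclasses import dataclass


@dataclass
class ParallelConfig:
    tensor_parallel: int      # shards each layer's weights across GPUs
    pipeline_parallel: int    # splits the stack of layers into stages
    data_parallel: int        # replicates the model and splits the batch
    micro_batch_size: int     # per-replica batch for one forward/backward pass
    grad_accum_steps: int     # micro-batches accumulated per optimizer step

    @property
    def world_size(self) -> int:
        # Each GPU belongs to exactly one (tensor, pipeline, data) group,
        # so the three degrees multiply.
        return self.tensor_parallel * self.pipeline_parallel * self.data_parallel

    @property
    def global_batch_size(self) -> int:
        # Samples consumed per optimizer step across the whole job.
        return self.micro_batch_size * self.grad_accum_steps * self.data_parallel


cfg = ParallelConfig(tensor_parallel=8, pipeline_parallel=4, data_parallel=32,
                     micro_batch_size=1, grad_accum_steps=32)
print(cfg.world_size)         # 1024 GPUs
print(cfg.global_batch_size)  # 1024 samples per optimizer step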

Brief Biography

Rio Yokota is a Professor at the Global Scientific Information and Computing Center, Tokyo Institute of Technology. His research interests lie at the intersection of high performance computing and machine learning. He is the developer of numerous libraries for fast multipole methods (ExaFMM) and hierarchical low-rank algorithms (Hatrix) that scale to the full system on the largest supercomputers today. He has also led efforts to train ImageNet in two minutes and, more recently, to pretrain large language models on thousands of GPUs. He has been optimizing algorithms on GPUs since 2006 and was part of a team that received the Gordon Bell Prize in 2009.
