Pretraining Large Language Models on Japanese Supercomputers

Location
Building 1, Level 3, Room 3119

Abstract

Large language models (LLMs) have become part of our daily lives and are now indispensable tools for conducting research as well. The performance of LLMs is known to increase as model size and data size are scaled up. The global race to pretrain the most capable LLM has led to supercomputers of unprecedented scale being utilized at maximum capacity for months of training. This talk will describe recent efforts to pretrain LLMs on Japanese supercomputers such as Fugaku, ABCI, and TSUBAME, as well as on cloud resources such as Google Cloud and Sakura Internet, a Japanese cloud computing provider. The talk will cover topics such as filtering techniques for the pretraining data, distributed parallel training, frameworks and hyperparameters, evaluation benchmarks, continual pretraining, and cross-lingual transfer. Filtering the pretraining data is a crucial part of training, since a large portion of the data on the internet is not useful for training. Distributed parallelism is also an essential technology: multiple techniques such as data, tensor, pipeline, sequence, and expert parallelism are combined to pretrain the largest models. The space of frameworks and hyperparameters is vast, and the search is currently done mostly through trial and error. Evaluation benchmarks play an important role in determining which capabilities we target, but we must always be cautious about overfitting to specific benchmarks. Finally, I will give some examples of continual pretraining and cross-lingual knowledge transfer in Japanese LLMs.
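As a rough illustration of the kind of heuristic data filtering involved, the Python sketch below applies a few length, repetition, and symbol-ratio rules to raw web documents. The rules and thresholds are illustrative assumptions for this page, not the actual pipeline used in the work described in the talk.

# Minimal sketch of heuristic pretraining-data filtering.
# The thresholds below are illustrative assumptions, not a production pipeline.

def keep_document(text: str) -> bool:
    """Return True if a raw web document passes simple quality heuristics."""
    words = text.split()

    # Drop documents that are too short or too long to be useful.
    if not (50 <= len(words) <= 100_000):
        return False

    # Drop documents dominated by very short "words" (menus, navigation, tables).
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len < 3:
        return False

    # Drop highly repetitive pages: few unique words relative to total length.
    if len(set(words)) / len(words) < 0.2:
        return False

    # Drop pages with an unusually high symbol-to-character ratio (spam, markup dumps).
    symbols = sum(text.count(c) for c in "#{}<>|")
    if symbols / max(len(text), 1) > 0.1:
        return False

    return True


if __name__ == "__main__":
    bad = "click here click here click here " * 20
    good = (
        "Large language models are pretrained on trillions of tokens drawn from "
        "web crawls, books, and code. Before training, the raw text is filtered "
        "with heuristics and classifiers so that boilerplate, spam, and duplicated "
        "pages do not dominate the corpus, which would otherwise waste compute and "
        "hurt downstream quality on evaluation benchmarks."
    )
    print(keep_document(bad), keep_document(good))  # False True

Real filtering pipelines combine many more signals (language identification, deduplication, model-based quality scores), but the structure is the same: cheap per-document rules applied to billions of pages before training begins.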
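To make the combination of parallelism techniques concrete, the sketch below shows how tensor-, pipeline-, and data-parallel degrees compose into the total GPU count and the global batch size. The numbers are hypothetical and not the configuration of any run discussed here; in many frameworks sequence parallelism reuses the tensor-parallel group and expert parallelism adds a separate factor only for mixture-of-experts models, so neither appears in the product below.

# Minimal sketch of how 3D-parallel degrees compose.
# All numbers are hypothetical examples, not a configuration from the talk.

from dataclasses import dataclass


@dataclass
class ParallelConfig:
    tensor_parallel: int      # shards each layer's weights across GPUs
    pipeline_parallel: int    # splits the stack of layers into stages
    data_parallel: int        # replicates the model and splits the batch
    micro_batch_size: int     # per-replica batch for one forward/backward pass
    grad_accum_steps: int     # micro-batches accumulated per optimizer step

    @property
    def world_size(self) -> int:
        # Each GPU belongs to exactly one (tensor, pipeline, data) group,
        # so the three degrees multiply.
        return self.tensor_parallel * self.pipeline_parallel * self.data_parallel

    @property
    def global_batch_size(self) -> int:
        # Samples consumed per optimizer step across the whole job.
        return self.micro_batch_size * self.grad_accum_steps * self.data_parallel


cfg = ParallelConfig(tensor_parallel=8, pipeline_parallel=4, data_parallel=32,
                     micro_batch_size=1, grad_accum_steps=32)
print(cfg.world_size)         # 1024 GPUs
print(cfg.global_batch_size)  # 1024 samples per optimizer step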

Brief Biography

Rio Yokota is a Professor at the Global Scientific Information and Computing Center, Tokyo Institute of Technology. His research interests lie at the intersection of high performance computing and machine learning. He is the developer of numerous libraries for fast multipole methods (ExaFMM) and hierarchical low-rank algorithms (Hatrix) that scale to the full system on the largest supercomputers today. He has also led efforts to train ImageNet in two minutes and, more recently, to pretrain large language models on thousands of GPUs. He has been optimizing algorithms on GPUs since 2006 and was part of a team that received the Gordon Bell Prize in 2009.
