Averaging, Momentum, and Schedulers in Optimization for Deep Learning

Overview

Abstract

In this talk, I will present some work in progress on practical optimization methods for deep learning. We will start with a discussion of several empirical techniques that enable the training of large-scale models on language and vision tasks, including weight decay, averaging, and schedulers. We will then look at a new approach that we call schedule-free because of its ability to work without a pre-defined time horizon. I will share some details of the theory behind these methods, explain why they might be useful in practice, and then shed some light on their limitations. This talk will be oriented towards people who already have some knowledge of optimization methods.
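
To give a flavor of what "schedule-free" means, the sketch below, in plain Python, follows the publicly described schedule-free SGD recipe: gradients are evaluated at an interpolation between a base SGD sequence and a running average of its iterates, so a constant step size suffices and no learning-rate schedule, and hence no pre-defined time horizon, is needed. The function name, hyperparameter values, and the equal-weight averaging rule are illustrative assumptions, not the speaker's reference implementation.

```python
# Minimal sketch of a schedule-free SGD-style update. Names (grad_fn,
# lr, beta) and defaults are illustrative assumptions.

def schedule_free_sgd(grad_fn, x0, lr=0.01, beta=0.9, num_steps=1000):
    """Run schedule-free SGD from x0; grad_fn maps a point to its gradient."""
    z = list(x0)  # base SGD iterate
    x = list(x0)  # running average of the z sequence; the point you return
    for t in range(num_steps):
        # Gradient is taken at an interpolation of the average x and the
        # base iterate z (this replaces classical momentum/averaging).
        y = [beta * xi + (1 - beta) * zi for xi, zi in zip(x, z)]
        g = grad_fn(y)
        # Plain SGD step on the base sequence with a constant step size:
        # no schedule, hence no pre-defined time horizon.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Equal-weight running average of z_0, ..., z_{t+1}.
        c = 1.0 / (t + 2)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x  # evaluate/deploy the averaged point

# Example: minimize f(v) = v[0]^2 + v[1]^2
print(schedule_free_sgd(lambda v: [2 * vi for vi in v], [3.0, -2.0]))
```

Because the averaged point x can be read off at any step, the method has no preferred stopping time, which is the property the abstract refers to as working without a pre-defined time horizon.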

Brief Biography

Konstantin Mishchenko is a Research Scientist at Samsung in Cambridge, UK. Before joining Samsung, he was a postdoc in Francis Bach's group at Inria Paris, and he did his PhD at KAUST under the supervision of Peter Richtárik. Konstantin's work on adaptive methods received an Outstanding Paper Award at ICML 2023.

Presenters

Konstantin Mishchenko