Optimization for Deep Learning

Abstract

The field of optimization for machine learning has undergone significant changes in recent years, with deep learning models growing in scale and fine-tuning taking a more prominent role. In this presentation, I will share a perspective on how the field is evolving and highlight interesting research directions. I will give real-world examples of what practitioners want from optimization methods for training deep networks at scale. I will then present my recent work on adaptive methods such as Adam and Adagrad, and explain how the learning rate for these methods can be estimated using theoretical tools from convex deterministic optimization, together with convergence guarantees. Finally, I will present an extensive numerical evaluation of these methods on the task of training deep networks, including ViT on ImageNet, RoBERTa on BookWiki, GPT Transformer on BookWiki, and others.
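As background for the adaptive methods mentioned above, the following is a minimal NumPy sketch of a plain Adagrad-style update, included only to recall how such methods rescale each coordinate by accumulated squared gradients. It is not the speaker's method; the function name `adagrad_step`, the toy quadratic objective, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=1.0, eps=1e-8):
    """One Adagrad update: per-coordinate step size lr / (sqrt(accum) + eps)."""
    accum += grad ** 2                        # accumulate squared gradients
    x -= lr * grad / (np.sqrt(accum) + eps)   # rescale each coordinate
    return x, accum

# Toy usage (assumed example): minimize f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
x, accum = np.zeros(5), np.zeros(5)
for _ in range(500):
    grad = A.T @ (A @ x - b)
    x, accum = adagrad_step(x, grad, accum, lr=1.0)
print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```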

Brief Biography

Konstantin Mishchenko is a Research Scientist at Samsung in Cambridge, UK. Before joining Samsung, he was a postdoc in Francis Bach's group at Inria Paris, and he completed his PhD at KAUST under the supervision of Peter Richtárik. Konstantin's work on adaptive methods received an Outstanding Paper Award at ICML 2023.

Contact Person