
From the Ball-Proximal (Broximal) Point Method to Efficient Training of LLMs
This talk introduces the Ball-Proximal Point Method, a new foundational algorithm for non-smooth optimization with surprisingly fast convergence, and Gluon, a new theoretical framework that closes the gap between theory and practice for modern LMO-based deep learning optimizers.
Overview
I will present selected results from two recent related papers [1, 2]. The abstracts of both are included below:
Non-smooth and non-convex global optimization poses significant challenges across various applications, where standard gradient-based methods often struggle. We propose the Ball-Proximal Point Method (also called the Broximal Point Method, or Ball Point Method, BPM for short), a novel algorithmic framework inspired by the classical Proximal Point Method (PPM) (Rockafellar, 1976), which, as we show, sheds new light on several foundational optimization paradigms and phenomena, including non-convex and non-smooth optimization, acceleration, smoothing, adaptive stepsize selection, and trust-region methods. At the core of BPM lies the ball-proximal ("broximal") operator, which arises from the classical proximal operator by replacing the quadratic distance penalty with a ball constraint. Surprisingly, and in sharp contrast with the sublinear rate of PPM in the non-smooth convex regime, we prove that BPM converges linearly, and in a finite number of steps, in the same regime. Furthermore, by introducing the concept of ball-convexity, we prove that BPM retains the same global convergence guarantees under weaker assumptions, making it a powerful tool for a broader class of potentially non-convex optimization problems. Just as PPM plays the role of a conceptual method inspiring the development of practically efficient algorithms and algorithmic elements, e.g., gradient descent, adaptive stepsizes, acceleration (Ahn & Sra, 2020), and the "W" in AdamW (Zhuang et al., 2022), we believe that BPM should be understood in the same manner: as a blueprint and inspiration for further development.
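For concreteness, the contrast drawn in the abstract can be written out as follows. This is only a paraphrase of the construction described above; the notation (penalty parameter γ, ball radius t_k) is mine and may differ from the paper's.

```latex
% Classical proximal operator: quadratic distance penalty with parameter gamma > 0
\mathrm{prox}_{\gamma f}(x) \;=\; \operatorname*{arg\,min}_{y} \Big\{ f(y) + \tfrac{1}{2\gamma}\,\|y - x\|^2 \Big\}

% Ball-proximal ("broximal") operator: the penalty is replaced by a ball constraint of radius t > 0
\mathrm{brox}_{t f}(x) \;=\; \operatorname*{arg\,min}_{y:\,\|y - x\| \le t} f(y)

% Conceptual BPM iteration: x_{k+1} is any minimizer of f over the ball of radius t_k around x_k
x_{k+1} \;\in\; \mathrm{brox}_{t_k f}(x_k)
```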
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon [3] and Scion [4]. After over a decade of Adam’s [5] dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called Gluon, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of Muon and Scion, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
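To illustrate what "layer-wise LMO application" means in practice, here is a minimal sketch of a Muon/Scion-style update in which each weight matrix is updated via a linear minimization oracle over a spectral-norm ball. The function names and the use of an exact SVD are my own simplifications: practical implementations such as Muon apply Newton-Schulz orthogonalization to a momentum buffer, and Gluon's per-layer stepsize rule is not reproduced here.

```python
import torch

def lmo_spectral_ball(grad: torch.Tensor, radius: float) -> torch.Tensor:
    """LMO over the spectral-norm ball: argmin_{||S||_2 <= radius} <grad, S>.

    For a matrix gradient with reduced SVD U diag(s) Vh, the minimizer is
    -radius * U @ Vh, i.e. the "orthogonalized" negative gradient direction
    used by Muon-style updates.
    """
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)

def layerwise_lmo_step(weights, radii):
    """One layer-wise step: each weight matrix gets its own LMO / trust region.

    `weights` is a list of 2D parameter tensors with populated .grad fields;
    `radii` holds one radius (a stepsize-like quantity) per layer.
    """
    with torch.no_grad():
        for W, r in zip(weights, radii):
            W.add_(lmo_spectral_ball(W.grad, r))
```

Here the per-layer radius plays the role of a stepsize; prescribing it from a layer-wise generalized smoothness model is, per the abstract above, the main theoretical contribution of [2].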
[1] Kaja Gruntkowska, Hanmin Li, Aadi Rane, and Peter Richtárik. The ball-proximal (="broximal") point method: a new algorithm, convergence theory, and applications. arXiv preprint arXiv:2502.02002, 2025.
[2] Artem Riabinin, Kaja Gruntkowska, Egor Shulgin, and Peter Richtárik. Gluon: Making Muon & Scion great again! (Bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025.
[3] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/
[4] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025.
[5] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Presenters
Peter Richtárik, Professor, Computer Science
Brief Biography
Before joining KAUST in 2017, he was an Associate Professor of Mathematics at the University of Edinburgh, and held postdoctoral and visiting positions at the Université Catholique de Louvain, Belgium, and the University of California, Berkeley, USA, respectively. Richtárik obtained a Mgr. in Mathematics ('01) at Comenius University in his native Slovakia, and in 2007 he received his Ph.D. in Operations Research from Cornell University, USA. Dr. Richtárik is a founding member and a Fellow of the Alan Turing Institute (the UK National Institute for Data Science and Artificial Intelligence), and an EPSRC Fellow in Mathematical Sciences.
A number of honors and awards have been conferred on Dr. Richtárik, including:
- the Best Paper Award at the NeurIPS 2020 Workshop on Scalability, Privacy, and Security in Federated Learning (joint with S. Horváth);
- the Charles Broyden Prize;
- a Distinguished Speaker Award at the 2019 International Conference on Continuous Optimization;
- the SIAM SIGEST Best Paper Award (joint with O. Fercoq);
- the IMA Leslie Fox Prize (second prize, three times, awarded to two of his students and a postdoc: M. Takáč 2013, O. Fercoq 2015, and R. M. Gower 2017);
- the INFORMS Computing Society Best Student Paper Award (sole runner-up: M. Takáč);
- the EUSA Award for Best Research or Dissertation Supervisor (second prize), 2016;
- and the Turing Fellow Award from the Alan Turing Institute, 2016.
Before joining KAUST, he was nominated for the Chancellor’s Rising Star Award from the University of Edinburgh in 2014, the Microsoft Research Faculty Fellowship in 2013, and the Innovative Teaching Award from the University of Edinburgh in 2011 and 2012.
Dr. Richtárik has given more than 150 research talks at conferences, workshops, and seminars worldwide. Several of his works are among the most-read papers published by the SIAM Journal on Optimization and the SIAM Journal on Matrix Analysis and Applications.
Dr. Richtárik regularly serves as an Area Chair for leading machine learning conferences, including NeurIPS, ICML, and ICLR, and is an Action Editor of the Journal of Machine Learning Research (JMLR), an Associate Editor of Optimization Methods and Software and of Numerische Mathematik, and a Handling Editor of the Journal of Nonsmooth Analysis and Optimization. In the past, he served as an Action Editor of Transactions on Machine Learning Research, an Area Editor of the Journal of Optimization Theory and Applications, an Area Chair for ICML 2019, and a Senior Program Committee Member for IJCAI 2019.