Learning to Identify and Exploit Neural Network Dynamics in Multi-Step Inference
B4 L5 R5220
Overview
As neural networks scale, many tasks require multiple inference steps rather than a single forward pass. Understanding how models behave across these steps is essential for improving efficiency and design. This dissertation studies the temporal dynamics of multi-step inference and reveals that contributions across steps are sparse and uneven. Building on this insight, it develops analytical tools and modeling strategies to identify and exploit such dynamics, including mechanisms for accelerating diffusion models and architectures that adapt to step-wise variation. These principles are further scaled to multimodal and video generation, demonstrating improved efficiency and capability. Collectively, this work provides systematic approaches for analyzing and optimizing neural network dynamics in multi-step inference.
As model capacity scales, neural networks are increasingly expected to solve complex tasks that cannot be completed within a single inference step. Instead, they operate in a manner similar to human decision-making, repeatedly interacting with the environment and leveraging accumulated historical information. Understanding model behavior in such settings is crucial for revealing underlying mechanisms and guiding effective design. However, analyzing inference trajectories remains challenging due to their causal structure, in which each step depends on all preceding ones, complicating step-wise attribution.
To address this challenge, this dissertation makes the following contributions:
- It introduces Deep State Identifier, an analytical tool for characterizing deep neural networks across sequential inference steps, showing that step-wise contributions are sparse and uneven across diverse tasks and environments.
- Building on this insight, the analysis is extended to diffusion-based generative models, where empirical results reveal heterogeneous contributions across inference steps. This finding motivates TGATE, a feature-caching mechanism that accelerates diffusion inference by reusing components with negligible contribution in later steps.
- Guided by these observations, this dissertation proposes MoS, a multimodal generation architecture that adaptively aligns with diffusion dynamics by learning conditional signals that vary across inference steps.
- Finally, these insights into diffusion dynamics are scaled to video generation, resulting in MarDini, a scalable autoregressive diffusion model trained on millions of videos that supports video interpolation, extension, and image-to-video generation.
Collectively, these studies offer systematic tools and modeling strategies for temporal dynamics in multi-step inference.
Presenter
Brief Biography
Haozhe Liu received his M.S. degree in Computer Science from Shenzhen University in 2022, where he was honored with the China National Scholarship and the Outstanding Graduate Award. In 2022, he interned at Jarvis Lab, Tencent, and was a visiting student at the Norwegian Biometrics Laboratory (NBL), Norwegian University of Science and Technology (NTNU). In 2024, he joined Meta AI as a research scientist intern, where he contributed to the development of large-scale foundation models, including autoregressive diffusion and LLM-driven multimodal systems. He is (co-)first author of over 10 papers in top-tier venues, with more than 1,000 citations. He has participated in large-scale pre-training efforts, scaling models across thousands of GPUs, with research outcomes integrated into commercial products. His work received a best paper award at a NeurIPS 2023 workshop. He serves as a reviewer for CVPR, ICCV, ECCV, ICML, AAAI, and MICCAI.