This thesis advances egocentric video understanding through multimodal learning, large-scale dataset development, and robust adaptation techniques; it introduces new models, benchmarks, and methods for building scalable, resilient perception systems that operate effectively in real-world, first-person environments.

Overview

Egocentric, or first-person, video offers a powerful medium for capturing and understanding natural human behavior in real-world environments. With the growing availability of wearable cameras and multimodal sensors, there is increasing interest in developing intelligent systems that can process, localize, and interpret activities from this unique perspective. However, egocentric video understanding presents fundamental challenges: videos are often long and unstructured, captured in diverse and dynamic scenes, and rich in sensory signals that may be noisy, incomplete, or constrained by real-world limitations such as privacy and power.

This thesis explores key problems in egocentric video understanding through the lens of multimodal learning, large-scale dataset curation, and robust modeling techniques for deployment under real-world constraints. It comprises four core contributions. First, we propose OWL, a method that integrates audiovisual temporal context to improve action localization in egocentric video, demonstrating significant performance gains on two large-scale datasets. Second, we describe our contributions to two landmark datasets, Ego4D and Ego-Exo4D, which enable scalable benchmarking of first-person perception tasks, including efforts in diverse data collection and energy-efficient activity recognition. Third, we address the practical issue of missing modalities in multimodal learning by introducing the Missing Modality Token (MMT), a transformer-compatible mechanism that allows robust inference even when inputs are partially absent. Finally, we present MiDl, a self-supervised, online test-time adaptation framework that adapts to modality incompleteness on the fly, without requiring retraining or labeled data.
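
To make the missing-modality idea concrete, the sketch below shows one plausible way a learned token could stand in for an absent modality before transformer fusion. It is a minimal illustration, not the thesis implementation: the module name MMTFusion, the architecture sizes, and the averaging-based readout are all assumptions made for the example.

```python
# Minimal sketch (illustrative only): a learned "missing modality token" that
# replaces absent audio features so a fusion transformer always receives both streams.
from typing import Optional

import torch
import torch.nn as nn


class MMTFusion(nn.Module):
    """Fuses video and audio token sequences; a learned token substitutes for missing audio."""

    def __init__(self, d_model: int = 256, num_classes: int = 10):
        super().__init__()
        # Learned embedding used in place of the missing modality's tokens (hypothetical design).
        self.missing_audio_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: Optional[torch.Tensor]):
        batch_size = video_tokens.size(0)
        if audio_tokens is None:
            # Substitute the learned token so inference proceeds with partially absent input.
            audio_tokens = self.missing_audio_token.expand(batch_size, 1, -1)
        fused = self.encoder(torch.cat([video_tokens, audio_tokens], dim=1))
        return self.head(fused.mean(dim=1))


# Usage: the same model runs whether or not the audio stream is available at test time.
model = MMTFusion()
video = torch.randn(2, 16, 256)       # 16 video tokens per clip
audio = torch.randn(2, 8, 256)        # 8 audio tokens per clip
logits_full = model(video, audio)     # both modalities present
logits_missing = model(video, None)   # audio missing at inference
```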

Together, these contributions advance the field of egocentric video understanding across both algorithmic and infrastructural dimensions. This work lays the foundation for building perception systems that are not only accurate and multimodally rich, but also scalable, resilient, and ready for deployment in the wild.

Presenter

Brief Biography

Merey Ramazanova is a Ph.D. candidate in computer vision at KAUST, supervised by Professor Bernard Ghanem, with whom she also completed her Master’s degree. Her research focuses on egocentric video understanding, including multimodal modeling, action localization, and test-time adaptation under missing modality conditions. She also completed a research internship at Adobe.