This dissertation advances fine-grained, content-aware video retrieval by developing novel models and frameworks for Video-Language Grounding, enabling accurate alignment between natural language queries and specific temporal segments in unstructured video content.

Overview

The explosive growth of both user-generated and professionally produced video content has established video as the dominant medium for communication, learning, and entertainment. With hundreds of hours of footage uploaded every minute and billions of users consuming it, today's video content spans an immense range of topics, formats, and durations, from short clips to multi-hour lectures and documentaries. As this volume continues to surge, users increasingly need tools to efficiently locate specific moments of interest, such as a critical step in a tutorial, a speaker's main argument, or a memorable event in a personal archive. Meeting this demand requires search systems that go beyond surface-level metadata. Yet current methods, which rely on titles, tags, engagement signals, or automatic captions, struggle to capture the rich, nuanced semantics embedded in videos, especially at the level of fine-grained temporal segments. As a result, users are often left to manually sift through hours of footage to find the precise content they seek.

This thesis addresses the growing demand for fine-grained, content-aware video retrieval by focusing on the task of Video-Language Grounding (VLG), which entails identifying the precise temporal segment within an untrimmed video that best corresponds to a natural language query. I examine the limitations of existing approaches and the key challenges of aligning language with vision across diverse, unstructured, and variable-length video content. To overcome these challenges, I introduce a set of novel models, datasets, and retrieval pipelines that collectively form a comprehensive framework for accurate and scalable semantic video search.

Presenter

Brief Biography

Mattia Soldan is a final-year Ph.D. candidate in Electrical and Computer Engineering at KAUST, advised by Prof. Bernard Ghanem. His research lies at the intersection of computer vision and natural language processing, with a focus on scalable and efficient algorithms for semantic video understanding and retrieval. His work spans task-specific deep learning architectures, dataset creation, and efficient visual encoding pipelines. Mattia is passionate about building intelligent systems that connect visual content with language, and about advancing research that bridges fundamental understanding with practical impact.