Query Localization in Long-form Videos
Overview
Abstract
The growth of digital cameras and data communication has led to an exponential increase in video production and dissemination. As a result, automatic video analysis and understanding has become a crucial research topic in the computer vision community. However, the localization problem, which involves identifying a specific event in a large volume of data, particularly in long-form videos, remains a significant challenge.
While video activity recognition has been extensively researched, localizing a specific query in a long, untrimmed video requires the AI system to maintain a long-term understanding of the video, spanning minutes or even hours. Therefore, we study the challenging problem of query localization in long-form videos from three perspectives: temporal modeling, localization features, and data sampling.
First, a graph-based method is proposed to model long-range dependencies, which constitute the semantic context of human activities. Second, the task gap between action recognition and action localization is identified, and two methods, one from a pre-training perspective and one from an end-to-end learning perspective, are proposed to improve the video representation while minimizing this gap. Last, a spatiotemporal localization problem is addressed on a more realistic dataset recorded from people with diverse occupations and geographic locations. Our study finds that data sampling matters more than model architecture design, and we propose several ways to reduce sampling bias.
In summary, this dissertation aims to advance machine intelligence in video understanding, and we hope the work has practical implications for AI assistants, recommendation systems, health care, security, and other applications.
Brief Biography
Mengmeng Xu (Frost) is a highly motivated Ph.D. candidate in Electrical and Computer Engineering at KAUST's Image and Video Understanding Lab (IVUL), advised by Professor Bernard Ghanem. Frost has published multiple first-author papers in top-tier AI/CV conferences such as CVPR, ICCV, and NeurIPS, demonstrating his expertise in these fields. Through internships at Samsung, Amazon, and Meta, he has also gained valuable industry experience in related areas. His research interests lie in computer vision and video understanding, positioning him as a promising researcher with great potential to make significant contributions.