Toward long-form video understanding
Overview
Abstract
Nowadays, videos are omnipresent in our daily lives. From TikTok clips to Bilibili videos, from surveillance footage to vlog recordings, the sheer volume of video content is staggering. Processing and analyzing this volume of video data demands immense human effort. While computer vision techniques have made remarkable progress in automating the understanding of short video clips, their effectiveness and efficiency on long-form videos still fall short of the mark. This talk explores the challenges in long-form video understanding, and corresponding solutions, from the perspectives of computing resources, algorithms, and data. The first part addresses how to ease the resource requirements of training a long-form video understanding network end-to-end. It identifies the huge memory consumption caused by the large number of video frames as the key challenge and demonstrates a simple ‘rewiring for reversibility’ strategy that significantly reduces memory cost. The second part focuses on effective algorithm design for precisely identifying and localizing interesting moments in a long-form video. It explores how to represent a video with graph convolutional networks (GCNs) to handle the long durations and diverse contexts of long videos. The third part delves into the datasets and benchmarks required to train and evaluate effective long-form video understanding algorithms. It introduces Ego4D, a massive-scale egocentric video dataset and benchmark suite, as well as MAD, a video-language dataset containing over 1,200 hours of movies with aligned text descriptions.
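For readers unfamiliar with reversible architectures, the sketch below illustrates the general reversible-coupling idea that such memory-saving strategies build on: because a block's inputs can be reconstructed exactly from its outputs, intermediate activations need not be cached for backpropagation across many video frames. This is a minimal, generic illustration in PyTorch, not the speaker's actual design; the names ReversibleBlock, f, and g are hypothetical.

    import torch
    import torch.nn as nn

    class ReversibleBlock(nn.Module):
        """Generic additive-coupling reversible block (RevNet-style sketch).

        The inverse() method recovers the block's inputs from its outputs,
        so activations can be recomputed during the backward pass instead
        of being stored for every video frame."""

        def __init__(self, f: nn.Module, g: nn.Module):
            super().__init__()
            self.f = f  # any sub-network, e.g. attention over frame features
            self.g = g  # any sub-network, e.g. an MLP

        def forward(self, x1, x2):
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
            return y1, y2

        def inverse(self, y1, y2):
            # Exact reconstruction: a little extra compute is traded
            # for a large saving in activation memory.
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
            return x1, x2

    if __name__ == "__main__":
        dim = 64
        block = ReversibleBlock(nn.Linear(dim, dim), nn.Linear(dim, dim))
        x1, x2 = torch.randn(8, dim), torch.randn(8, dim)
        with torch.no_grad():
            y1, y2 = block(x1, x2)
            r1, r2 = block.inverse(y1, y2)
        # Inputs are recovered up to floating-point rounding.
        print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))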
Brief Biography
Dr. Chen Zhao is currently a Research Scientist in the Artificial Intelligence Initiative at King Abdullah University of Science and Technology (KAUST), and Lead of the Video Understanding Theme in the Image and Video Understanding Lab (IVUL). She received her Ph.D. degree from Peking University (PKU), China, in 2016. She studied at the University of Washington (UW), USA, from 2012 to 2013, and at the National Institute of Informatics (NII), Japan, in 2016. Her research interests include computer vision and deep learning, with a focus on video understanding. She has published 40+ papers in leading journals and conferences such as T-PAMI, CVPR, ICCV, and ECCV, and has received over 2,500 citations according to Google Scholar. She has served as a reviewer for T-PAMI, T-IP, T-CSVT, CVPR, ICCV, ECCV, NeurIPS, ICLR, etc., and was recognized as an Outstanding Reviewer at CVPR 2021. She received the Best Paper Award at a CVPR 2023 workshop, a Best Paper Nomination at CVPR 2022 (top 0.4%), and the Best Paper Award at NCMT 2015. She was also awarded the Outstanding Talent Scholarship of Peking University, First Prize in the Qualcomm Innovation Fellowship (QInF) Contest (one of only two in China), the Goldman Sachs Global Leaders Award (one of 150 worldwide), and the National Scholarship, among others.