We are facing a video tsunami flooding our communication channels. The ubiquity of digital cameras and social networks has increased the amount of visual media content generated and shared by people, in particular videos. Cisco reported that 82% of internet traffic would be in the form of video by 2022. The computer vision community has embraced this challenge by offering the first building blocks to translate the visual data in segmented video clips into semantic tags. However, users usually need to go beyond tagging at the video level. For example, someone may want to retrieve important moments such as the “first steps of her child” from a large collection of untrimmed videos, or to retrieve all the instances of a home run from an unsegmented baseball video. In the face of this data deluge, it becomes crucial to develop efficient and scalable algorithms that can intelligently localize semantic visual content in untrimmed videos.
In this work, I address three different challenges in the localization of actions in videos. First, I develop deep action proposal and detection models that take a video and generate action-agnostic and class-specific temporal segments, respectively. These models have two appealing properties: (a) they retrieve highly accurate temporal locations, and (b) they generate detections quickly, faster than real time. Second, I propose the new task of retrieving and localizing temporal moments from a collection of videos given a natural language query. To tackle this challenging task, I introduce an efficient and effective model that aligns the text query to individual clips of fixed length while still retrieving moments that span multiple clips. This approach not only allows smooth interaction with users via natural language queries but also reduces the index size and search time for retrieving the moments. Lastly, I introduce the concept of actor supervision, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions. Actor supervision, in the form of a deep architecture, efficiently learns to localize actions without the need for action box annotations. These three components provide insights that bring us closer to the goal of general video understanding by means of efficient localization.
Victor graduated with a B.Sc. in Electronic Engineering and a minor in Electrical Engineering from Universidad del Norte in 2012. After graduation, he pursued an M.Sc. in Electronic Engineering at Universidad del Norte, working in computer vision for video understanding with Prof. Juan Carlos Niebles. He joined KAUST as a Ph.D. student in 2015, working under the supervision of Prof. Bernard Ghanem to develop efficient algorithms for localizing human actions in videos. His research interests lie in the fields of computer vision and machine learning, where he has contributed more than 10 publications. Victor has extensive experience in the field of video understanding, and his work has attracted the attention of multiple Fortune 500 companies such as Samsung, Adobe, and Qualcomm.