Towards Scalable and Structured Understanding in Visual LLMs

In this talk, we explore a suite of recent advances toward scalable, structured video comprehension using Large Vision Language Models (Video LLMs).

Overview

The rise of Vision Language Models (VLMs) has opened new frontiers in video understanding, yet scaling these systems to handle long, complex, and structured visual narratives remains a fundamental challenge. In this talk, we explore a suite of recent advances toward scalable, structured video comprehension using Large Vision Language Models (Video LLMs). We begin by examining instructable models such as MiniGPT-4 (image-based), MiniGPT-4-v2 (image-based), MiniGPT-3D, and their video extension, which leverage multimodal instruction tuning for diverse video tasks. We then introduce Long Video LLMs, including the Goldfish and LongVU models, that tackle the token explosion problem with retrieval-augmented generation and spatiotemporal compression. Further, we address structured understanding via StoryGPT-V and Vgent, which model character consistency and entity-based graph reasoning, respectively. To rigorously evaluate progress, we present InfiniBench and SpookyBench, two novel benchmarks designed to probe long-form comprehension and temporal perception in state-of-the-art models. Finally, we extend the discussion to multimodal capabilities in multilingual, emotional, and action-driven contexts, as well as exploratory work on bridging vision and brain signals.

Presenters

Brief Biography

Mohamed Elhoseiny is an associate professor in the Computer Science (CS) Program at KAUST, a Senior Member of AAAI and IEEE, and the principal investigator of the KAUST Vision-CAIR Research Group. Before joining KAUST, he was a visiting faculty member at the Stanford Computer Science Department, a visiting faculty member at Baidu Research, and a postdoctoral researcher at Facebook AI Research.

Elhoseiny earned his Ph.D. in 2016 from Rutgers University. He received a B.Sc. degree in 2006 and an M.Sc. degree in 2010, both in computer systems, from Ain Shams University.

His work has received numerous recognitions, including third place at the Data+AI Summit hackathon held in San Francisco in May 2024 (200 participants) with a multimodal LLM hack called HomeGPT. He was selected as an MIT 35 Under 35 semifinalist in 2020. He received the Best Paper Award at the 2018 European Conference on Computer Vision (ECCV) Workshop on Fashion, Art, and Design for his research "DesIGN: Design Inspiration from Generative Networks." He also received the Doctoral Consortium Award at the 2016 Conference on Computer Vision and Pattern Recognition (CVPR) and an NSF Fellowship for his "Write-a-Classifier Project" in 2014. His research on creative art generation has been featured in New Scientist Magazine and MIT Technology Review, which also highlighted his work on lifelong learning.

Professor Elhoseiny’s contributions include work on zero-shot learning, which was featured at the United Nations; his creative AI work has been featured in MIT Technology Review, New Scientist Magazine, Forbes Science, and HBO's Silicon Valley. He has served as an Area Chair at major CV/AI conferences, including CVPR’21, ICCV’21, IJCAI’22, ECCV’22, ICLR’23, CVPR’23, ICCV’23, NeurIPS’23, ICLR’24, CVPR’24, ECCV’24, and SG Asia’24, and has organized the Closing the Loop Between Vision and Language workshops at ICCV’15, ICCV’17, ICCV’19, ICCV’21, and ICCV’23.

He has been involved in several pioneering works in affective AI art creation and has authored or co-authored numerous award-winning papers.