
Computer Vision for Video Editing: Learning to Cut, Classify, Assemble, and Generate
This thesis advances video editing by developing a suite of computer vision models for understanding and generating editorial decisions, including a method for ranking video cuts, a dataset for classifying cut types, a language-guided timeline assembler, and a diffusion-based technique for creating match cuts.
Overview
The rise of democratized video creation has increased the demand for tools that assist, accelerate, or automate the editing process. While advances in Computer Vision have led to powerful models for video understanding, captioning, and content retrieval, one of the most creative and time-consuming aspects of video creation, timeline and narrative editing, remains largely underexplored. This thesis frames editing as the task of making and executing editorial decisions that shape the structure, pacing, and meaning of video narratives. We specifically focus on timeline editing: selecting and placing cuts, arranging clips, and designing transitions between visual elements.
We propose four contributions that collectively advance semantic understanding, execution, and generative modeling for editorial tasks. First, we introduce a contrastive learning method that ranks the plausibility of video cuts by learning from professionally edited videos, enabling models to evaluate cuts with human-like judgment. Second, we present MovieCuts, the first large-scale dataset for cut-type recognition, and benchmark multimodal models that classify ten professional cut categories. Third, we develop the Timeline Assembler, a multimodal generative model that edits visual timelines in response to natural-language instructions, combining content understanding with language-guided editing. Finally, we propose MatchDiffusion, a training-free method that exploits the noise properties of text-to-video diffusion models to generate match cuts, expanding editing from selection and classification to creative generation.
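To make the first contribution concrete, the sketch below shows one way a contrastive cut-ranking objective can be set up: real cuts harvested from professionally edited videos serve as positives, while artificially shifted cut points from the same material serve as negatives, and an InfoNCE-style loss pushes the scorer to rank real cuts above the negatives. This is a minimal illustration, not the thesis code; the CutScorer module, feature dimensions, and negative-sampling scheme are all hypothetical placeholders.

```python
# Minimal sketch of contrastive cut ranking (illustrative, not the thesis code).
# Assumes shot features around each candidate cut point are precomputed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CutScorer(nn.Module):
    """Scores the plausibility of a cut from features of the two shots it joins."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, left_feats, right_feats):
        # left_feats, right_feats: (batch, feat_dim) features of the
        # outgoing and incoming shots around each candidate cut point.
        return self.mlp(torch.cat([left_feats, right_feats], dim=-1)).squeeze(-1)

def contrastive_cut_loss(scorer, pos_left, pos_right, neg_left, neg_right):
    """InfoNCE-style loss: the real (professional) cut should outscore
    artificially shifted cut points sampled from the same scene."""
    pos = scorer(pos_left, pos_right)                   # (batch,)
    neg = scorer(neg_left, neg_right)                   # (batch * k,)
    neg = neg.view(pos.shape[0], -1)                    # (batch, k)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # positive in column 0
    labels = torch.zeros(pos.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random features: batch of 8 cuts, 4 negatives per positive.
scorer = CutScorer()
b, k, d = 8, 4, 512
loss = contrastive_cut_loss(
    scorer,
    torch.randn(b, d), torch.randn(b, d),
    torch.randn(b * k, d), torch.randn(b * k, d),
)
loss.backward()
```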
Together, these contributions establish a progressive roadmap toward Computer Vision systems capable of understanding, executing, and generating editorial decisions, bridging the gap between content analysis and creative media production. The models and benchmarks developed in this thesis lay the groundwork for future research in automated editing, AI-assisted creativity, and accessible storytelling tools.
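To illustrate the intuition behind the last contribution above: two text-to-video generations that start from the same initial noise tend to share coarse layout and motion, which is exactly what a match cut needs. The sketch below shows that shared-noise idea in simplified form; it is not the MatchDiffusion procedure itself, which jointly denoises both prompts for the early steps before letting them diverge. The model id and sampling parameters are illustrative, and any text-to-video diffusion pipeline would serve.

```python
# Simplified sketch of the shared-noise intuition behind match-cut generation.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

def generate(prompt, seed=0):
    # Re-seeding with the same value gives every prompt identical initial
    # noise, so the early denoising steps carve out similar structure.
    g = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, num_frames=16, num_inference_steps=25, generator=g).frames

# Two prompts, one shared noise sample: the clips tend to share composition
# and motion, making them candidates for a match cut.
clip_a = generate("a dancer spinning in a dim studio")
clip_b = generate("a galaxy spinning in deep space")
```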
Presenter
Brief Biography
Alejandro Pardo is a final-year Ph.D. student at KAUST, advised by Prof. Bernard Ghanem. His research explores the intersection of computer vision and creativity, with a focus on automating video editing using generative models. He previously earned his M.Sc. under Pablo Arbeláez and has interned at Intel's Embodied AI Lab and Adobe Research. Alejandro is passionate about bridging technology and storytelling.