
Computer Vision for Video Editing: Learning to Cut, Classify, Assemble, and Generate
This thesis advances video editing by developing a suite of computer vision models for understanding and generating editorial decisions, including a method for ranking video cuts, a dataset for classifying cut types, a language-guided timeline assembler, and a diffusion-based technique for creating match cuts.
Overview
The rise of democratized video creation has increased the demand for tools that assist, accelerate, or automate the editing process. While advances in Computer Vision have led to powerful models for video understanding, captioning, and content retrieval, one of the most creative and time-consuming aspects of video creation, timeline and narrative editing, remains largely underexplored. This thesis frames editing as the task of making and executing editorial decisions that shape the structure, pacing, and meaning of video narratives. We specifically focus on timeline editing: selecting and placing cuts, arranging clips, and designing transitions between visual elements.
We propose four contributions that collectively advance semantic understanding, execution, and generative modeling for editorial tasks. First, we introduce a contrastive learning method that ranks the plausibility of video cuts by learning from professionally edited videos, enabling models to evaluate cuts with human-like judgment. Second, we present MovieCuts, the first large-scale dataset for cut-type recognition, and benchmark multimodal models that classify ten professional cut categories. Third, we develop the Timeline Assembler, a multimodal generative model that edits visual timelines in response to natural-language instructions, combining content understanding with language-guided editing. Finally, we propose MatchDiffusion, a training-free method that exploits the noise properties of text-to-video diffusion models to generate match cuts, expanding editing from selection and classification to creative generation.
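To make the first contribution concrete, the sketch below shows one way a contrastive cut-ranking objective can be set up: real cuts harvested from professionally edited videos serve as positives, while artificially shifted cut points from the same material serve as negatives, and an InfoNCE-style loss pushes the scorer to rank real cuts above the negatives. This is a minimal illustration, not the thesis code; the CutScorer module, feature dimensions, and negative-sampling scheme are all hypothetical placeholders.

```python
# Minimal sketch of contrastive cut ranking (illustrative, not the thesis code).
# Assumes shot features around each candidate cut point are precomputed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CutScorer(nn.Module):
    """Scores the plausibility of a cut from features of the two shots it joins."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, left_feats, right_feats):
        # left_feats, right_feats: (batch, feat_dim) features of the
        # outgoing and incoming shots around each candidate cut point.
        return self.mlp(torch.cat([left_feats, right_feats], dim=-1)).squeeze(-1)

def contrastive_cut_loss(scorer, pos_left, pos_right, neg_left, neg_right):
    """InfoNCE-style loss: the real (professional) cut should outscore
    artificially shifted cut points sampled from the same scene."""
    pos = scorer(pos_left, pos_right)                   # (batch,)
    neg = scorer(neg_left, neg_right)                   # (batch * k,)
    neg = neg.view(pos.shape[0], -1)                    # (batch, k)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # positive in column 0
    labels = torch.zeros(pos.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random features: batch of 8 cuts, 4 negatives per positive.
scorer = CutScorer()
b, k, d = 8, 4, 512
loss = contrastive_cut_loss(
    scorer,
    torch.randn(b, d), torch.randn(b, d),
    torch.randn(b * k, d), torch.randn(b * k, d),
)
loss.backward()
```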
Together, these contributions establish a progressive roadmap toward Computer Vision systems capable of understanding, executing, and generating editorial decisions, bridging the gap between content analysis and creative media production. The models and benchmarks developed in this thesis lay the groundwork for future research in automated editing, AI-assisted creativity, and accessible storytelling tools.
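To illustrate the intuition behind the last contribution above: two text-to-video generations that start from the same initial noise tend to share coarse layout and motion, which is exactly what a match cut needs. The sketch below shows that shared-noise idea in simplified form; it is not the MatchDiffusion procedure itself, which jointly denoises both prompts for the early steps before letting them diverge. The model id and sampling parameters are illustrative, and any text-to-video diffusion pipeline would serve.

```python
# Simplified sketch of the shared-noise intuition behind match-cut generation.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

def generate(prompt, seed=0):
    # Re-seeding with the same value gives every prompt identical initial
    # noise, so the early denoising steps carve out similar structure.
    g = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, num_frames=16, num_inference_steps=25, generator=g).frames

# Two prompts, one shared noise sample: the clips tend to share composition
# and motion, making them candidates for a match cut.
clip_a = generate("a dancer spinning in a dim studio")
clip_b = generate("a galaxy spinning in deep space")
```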
Presenter
Brief Biography
Alejandro Pardo is a final-year Ph.D. student at KAUST, advised by Prof. Bernard Ghanem. His research explores the intersection of computer vision and creativity, with a focus on automating video editing using generative models. He previously earned his M.Sc. under Pablo Arbeláez and has interned at Intel's Embodied AI Lab and Adobe Research. Alejandro is passionate about bridging technology and storytelling.