This thesis presents practical methodologies for building scalable multimodal agents that move from narrow automation toward open-ended self-improvement.

Overview

Recent progress in foundation models is enabling a new generation of agents that can perceive, reason, and interact with environments. Yet moving from narrow automation to open-ended self-improvement remains challenging.

This thesis develops methods for scalable multimodal agents and outlines a practical path toward that goal through five contributions:

i) Societies of agents.

We motivate natural-language societies of mind, where specialized agents communicate through a shared language interface to integrate knowledge, tools, and multimodal observations.

ii) Standardized workflows.

We present MetaGPT, which distills human standardized operating procedures into role-based collaboration with explicit intermediate artifacts for software development.

iii) Graph-based optimization.

We introduce GPTSwarm, a graph representation of agent systems that supports node-level improvement and system-level optimization.

iv) Scalable evaluation.

We propose Agent-as-a-Judge and the DevAI benchmark to assess developer agents using intermediate evidence and task dependencies, reducing reliance on purely human scoring.

v) Interactive environment modeling.

We study neural computers: action-conditioned generative models that learn command-line and desktop dynamics, enabling more controllable interfaces for agents.

Together, these contributions establish practical methodologies and evaluation frameworks for building multimodal agent systems that move beyond one-off automation toward continual self-improvement.

Presenters

Brief Biography

Mingchen Zhuge is a Ph.D. candidate in the Computer Science program at King Abdullah University of Science and Technology (KAUST), working on AI agents, world models, and recursive self-improvement under the supervision of Professor Jürgen Schmidhuber. He has authored more than 20 top-tier publications with over 6,300 citations, more than 70% of which come from his first-author papers. His representative projects include MetaGPT, NLSOM, GPTSwarm, Agent-as-a-Judge, Kaleido-BERT, and NeuralComputer.

His work has been recognized through six oral presentations at leading AI conferences, including the International Conference on Learning Representations (ICLR 2024, 2025, 2026; acceptance rate <1.2%) and the International Conference on Machine Learning (ICML 2024; acceptance rate <1.5%). He also received the Best Paper Award at the NeurIPS Ro-FoMo Workshop and an Outstanding Paper Nomination at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2025). Beyond research, he was nominated for the WAIC Future Star Award and recognized as an Outstanding Reviewer at the Conference on Computer Vision and Pattern Recognition (CVPR 2023; 232 out of 7,000+ reviewers). He served as Lead Organizer of ICLR RSI 2026 and as an Area Chair for COLM 2026 and ACM CAIS 2026.