Humans perceive the world through multiple senses and integrate them to better understand their surroundings. For example, hearing helps us detect a racing car approaching from behind, while watching someone’s face enhances our understanding of their speech. To build truly intelligent systems—machines that interpret the physical world through both sight and sound—we must move beyond unimodal approaches that rely solely on vision or audition.

Yapeng Tian, University of Texas at Dallas
In this talk, I will present our research on audio-visual scene perception, which develops multimodal approaches that integrate computer vision and audition to achieve richer environmental understanding. These methods enable systems to recognize complex multisensory events and spatially segment sounding objects. Beyond general video content, we also explore multimodal modeling of social interactions. Furthermore, we extend our work into content creation, developing intelligent systems capable of generating realistic sounds and videos from diverse multimodal inputs. In particular, I will highlight our recent efforts in building universal and holistic audio generation models that effectively combine visual, auditory, and textual cues.
Yapeng Tian is an Assistant Professor in the Computer Science Department at the University of Texas at Dallas, where he leads the Computer Vision and Multimodal Computing (CVMC) Lab. Before joining UTD, he obtained his Ph.D. from the University of Rochester in 2022. He is interested in solving core computer vision, computer audition, and machine learning problems and applying the resulting learning approaches to a broad range of AI applications, such as multisensory perception, computational photography, AR/VR, accessibility, and healthcare. His work has been recognized with several awards, including an AAAI New Faculty Highlight, an Amazon Research Award, the Belonging & Inclusion Best Paper Award at UIST 2024, and a Best Paper Honorable Mention at ACCV 2024. He has served as an Area Chair for CVPR, ECCV, AAAI, ACL ARR, NeurIPS, and ICLR. His homepage: https://www.yapengtian.com/