Today’s machine perception systems rely heavily on supervision provided by humans, such as labels and natural language. I will talk about our efforts to build systems that instead learn from two ubiquitous sources of unlabeled data: visual motion and cross-modal associations. I will first discuss our work on creating unified motion analysis methods that can address both object tracking and optical flow. I will then discuss how, perhaps surprisingly, these same techniques can be applied to localizing sound sources from stereo audio, and how sound localization can be learned jointly with visual rotation estimation.
Finally, I will describe our work on learning from tactile sensing data collected “in the wild” by humans, and on capturing camera properties by learning the cross-modal correspondence between images and camera metadata.
Andrew Owens is an assistant professor in the Department of Electrical Engineering and Computer Science at the University of Michigan. Prior to that, he was a postdoctoral scholar at UC Berkeley. He received a Ph.D. in electrical engineering and computer science from MIT in 2016. He is a recipient of a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award, and a Microsoft Research Ph.D. Fellowship.