Opening up the Large-Scale Computational Study of Film

Movies are a massively popular and influential form of media, but their computational study at scale has largely been off-limits to researchers in the United States due to the Digital Millennium Copyright Act.

In this talk, I’ll discuss recent regulatory changes at the U.S. Copyright Office that allow for large-scale text and data mining of film, and describe our efforts to build a collection of 2,307 films representing the top 50 movies by U.S. box office over the period 1980 to 2022, along with award nominees.

David Bamman,
UC Berkeley

Building this collection allows us to carry out several large-scale computational studies of film. I’ll discuss our work measuring changing patterns in the representation of gender and race/ethnicity over the past 43 years (where we see an increase in diversity over the past decade), and our use of the collection to model variation in emotional performances over both narrative and historical time. This work illustrates a new frontier for the data-driven analysis of film at scale.

David Bamman is an associate professor in the School of Information at UC Berkeley, where he works in the areas of natural language processing and cultural analytics, applying NLP and machine learning to empirical questions in the humanities and social sciences. His research focuses on improving the performance of NLP for underserved domains like literature (including LitBank and BookNLP) and exploring the affordances of empirical methods for the study of literature and culture.

Before Berkeley, he received his Ph.D. from the School of Computer Science at Carnegie Mellon University and was a senior researcher at the Perseus Project of Tufts University. Dr. Bamman’s work is supported by the National Endowment for the Humanities, the National Science Foundation, an Amazon Research Award, and an NSF CAREER award.
