CSE Seminar Series: Heng Ji - Knowledge Base Population: From Conservative to Liberal


11/19/2015, 3:30PM to 5:00PM


138 DeBartolo


Ronald Metoyer

Email: rmetoyer@nd.edu
Phone: 574-631-5893
Website: http://www.nd.edu/~rmetoyer/
Office: 325C Cushing


College of Engineering Assistant Dean of Diversity and Special Initiatives
Dr. Metoyer's research interests are broadly in the areas of human-computer interaction with an emphasis on information visualization and applications in the areas of health and wellness, education, intelligence analysis, and software engineering.

Knowledge Base Population (KBP) aims to extract structured knowledge from unstructured data and use it to populate knowledge bases. The state-of-the-art KBP paradigm has three steps: (1) expert linguists define a schema of “what to link/fill in”, such as entity and slot types for a specific data collection, based on the needs of potential users and stakeholders, and write annotation guidelines for each type in the schema; (2) human annotators follow the guidelines to annotate a set of documents (a typical size is 500 documents); (3) researchers write heuristic rules, or design features and train supervised learning models, from these manually annotated data. This paradigm is not fully automatic because the first two steps keep humans in the loop. Both steps are very expensive, yet the resulting predefined schema can cover only a limited number of types. In addition, traditional KBP systems depend heavily on linguistic resources tuned to the predefined schema, so they scale and port poorly to a new language, domain, or genre.
We propose a brand new “Liberal” KBP paradigm that combines the merits of traditional KBP/Information Extraction (IE) (high quality and fine granularity) with those of Open IE (high scalability). A Liberal KBP system can simultaneously discover a domain-rich schema and extract information units with fine-grained types. It starts “cold” (or with minimal supervision from existing knowledge bases or schemas) and can be adapted to any domain, genre, or language without any human-annotated data. The only input to a Liberal KBP system is an arbitrary corpus, with no supervision, restriction, or prior knowledge of its size, topic, or domain. The output also includes a schema discovered from the input corpus itself. The schema contains a flexible hierarchy of information unit types at multiple levels of granularity.
Following the general principle of effectively leveraging both corpus statistics and linguistic knowledge, Liberal KBP combines symbolic semantics with knowledge discovered from distributional semantics using unsupervised learning. Experiments on multiple low-resource languages, domains, and genres demonstrate that Liberal KBP can discover new and more fine-grained schemas than both traditional KBP/IE and Open IE, and can construct high-quality knowledge graphs for a new language, domain, or genre overnight. Finally, I will present a detailed quantitative and qualitative analysis of the remaining challenges and sketch future research directions for KBP.

Seminar Speaker:

Heng Ji

Rensselaer Polytechnic Institute

Heng Ji is the Edward P. Hamilton Development Chair Associate Professor in the Computer Science Department at Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing and its connections with Data Mining, Network Science, Social Cognitive Science, Security, and Vision. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, Google Research Awards in 2009 and 2014, a Sloan Junior Faculty Award in 2012, IBM Watson Faculty Awards in 2012 and 2014, and "Best of SDM2013" and "Best of ICDM2013" paper awards. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She has served as the Information Extraction area chair for NAACL2012, ACL2013, EMNLP2013, NLPCC2014, EMNLP2015, WWW2015, NAACL2016, and ACL2016; as vice Program Committee Chair for IEEE/WIC/ACM WI2013 and CCL2015; as Financial Chair of IJCAI2016; and as Program Committee Chair of NLPCC2015. Her research is funded by the U.S. government (NSF, ARL, DARPA, AFRL, and DHS) and by industry (Google, Disney, IBM, and Bosch).