North East DB/IR Day

October 22, 2010
AT&T Shannon Labs
Building 103, 180 Park Avenue
Florham Park, NJ

Graham Cormode, AT&T Research, graham at
Srinivas Bangalore, AT&T Research, srini at
Sponsored by DIMACS and AT&T  


Juliana Freire, Scientific Computing and Imaging Institute (SCI) and School of Computing, University of Utah

Title: Provenance-Rich Science

Computing has been an enormous accelerator of science and has led to an information explosion in many different fields. The unprecedented volume of data acquired by sensors, derived by simulations and analysis processes, and shared on the Web opens up new opportunities, but it also creates many challenges when it comes to managing and making sense of these data. In this talk, I discuss the importance of maintaining detailed provenance (also referred to as lineage and pedigree) for digital data. Provenance provides important documentation that is key to preserving data, to determining the data's quality and authorship, and to understanding, reproducing, and validating results. I will review some of the state-of-the-art techniques, as well as research challenges and open problems involved in managing provenance throughout the data life cycle. I will also discuss benefits of provenance that go beyond reproducibility, and present, in a live demo, techniques and tools we have developed that leverage provenance information to support reflective reasoning and collaborative data exploration and visualization. I conclude with a discussion of new applications that are enabled by provenance. In particular, I will show how provenance can be used to aid in teaching, to create reproducible publications, and as the basis for social data analysis.

About the Speaker:

Juliana Freire is an Associate Professor at the School of Computing at the University of Utah. Previously, she was a member of the technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. An important theme in Professor Freire's work is the development of data management technology to address new problems introduced by emerging applications, including the Web and scientific applications. Her recent research has focused on two main topics: scientific data management and Web mining. Within scientific data management, she is best known for her work in provenance and scientific workflows, and for being a co-creator of VisTrails. Professor Freire is an active member of the database and Web research communities, having co-authored over 80 technical papers and holding 4 U.S. patents. She is a recipient of an NSF CAREER award and an IBM Faculty Award. She has chaired or co-chaired several workshops and conferences, and she has participated as a program committee member in over 50 events. Her research has been funded by grants from the National Science Foundation, the Department of Energy, and the University of Utah.

Renée Miller, University of Toronto

Title: On Schema Discovery

Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this talk, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We revisit a set of information-theoretic tools proposed in SIGMOD 04 (with Andritsos and Tsaparas) for finding structural summaries that are useful in characterizing the information content of the data, and consider new applications of these techniques.

About the Speaker:

Renée J. Miller received BS degrees in Mathematics and in Cognitive Science from the Massachusetts Institute of Technology. She received her MS and PhD degrees in Computer Science from the University of Wisconsin in Madison, WI. She received the Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received the National Science Foundation Early Career Award (formerly, the Presidential Young Investigator Award) for her work on data integration. She is a Fellow of the ACM, the President of the VLDB Endowment, and the Program Chair for ACM SIGMOD 2011 in Athens, Greece. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans data integration, data exchange, knowledge curation and data sharing. She is a Professor and the Bell Canada Chair of Information Systems at the University of Toronto.

Douglas W. Oard, University of Maryland, College Park

Title: Who 'Dat? Identity resolution in large email collections

Automated techniques that can support the human activities of search and sense-making in large email collections are of increasing importance for a broad range of uses, including historical scholarship, law enforcement and intelligence applications, and lawyers involved in "e-discovery" incident to civil litigation. In this talk, I'll briefly describe some of the work to date on searching large email collections, and then for most of the talk I will focus on the more challenging task of support for sense-making. Specifically, I'll describe joint work with Tamer Elsayed to automatically resolve the identity of people who are mentioned ambiguously (e.g., just by first name) in a collection of email from a failed corporation (Enron). Our results indicate that for people who are well represented in the collection we can use a generative model to guess the right identity about 80% of the time, and for others we are right about half the time. I'll conclude the talk with a few remarks on our next directions for techniques, evaluation, and additional types of collections to which similar ideas might be applied.

About the Speaker:

Douglas Oard is a Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval, searching conversational media, and support for sense-making in large digital archival collections. Additional information is available at

