DIMACS/CCICADA Workshop on Data Quality Metrics

February 3 - 4, 2011
DIMACS Center, CoRE Building, Rutgers University

Organizers:
Tamraparni Dasu, AT&T Research, tamr at research.att.com
Lukasz Golab, AT&T Research, lgolab at research.att.com
Presented under the auspices of The Homeland Security Center for Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA).

Abstracts:

Fei Chiang & Renee J. Miller

Title: A Unified Model for Data and Constraint Repair

Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data, by finding minimal or lowest cost changes to the data that make it consistent with the constraints. Such techniques are appropriate for the old world where data changes, but schemas and their constraints remain fixed.

In many modern applications, however, constraints may evolve over time as application or business rules change, as data is integrated with new data sources, or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear whether there is an error in the data (and the data should be repaired) or whether the constraints have evolved (and the constraints should be repaired). In this work, we present a new technique for comparing the cost of data and constraint repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs). FDs are the most common type of constraint, and are known to play an important role in maintaining data quality.
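
As a toy illustration of the setting rather than the authors' repair algorithm, the sketch below (with hypothetical attribute names) flags tuples that violate a functional dependency X -> Y by grouping on the left-hand-side attributes:

    from collections import defaultdict

    def fd_violations(rows, lhs, rhs):
        # Group tuples by their left-hand-side values; a group violates the
        # FD lhs -> rhs if its tuples disagree on the right-hand-side values.
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[a] for a in lhs)].append(row)
        return {key: grp for key, grp in groups.items()
                if len({tuple(r[a] for a in rhs) for r in grp}) > 1}

    # Hypothetical example: zip -> city should hold, but the data disagrees.
    data = [
        {"zip": "07974", "city": "Murray Hill"},
        {"zip": "07974", "city": "New Providence"},  # conflicting city
        {"zip": "08854", "city": "Piscataway"},
    ]
    print(fd_violations(data, lhs=["zip"], rhs=["city"]))

Deciding whether the data or the FD itself should change when such a group is found is exactly the trade-off the unified repair model addresses.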

We evaluate the quality and scalability of our repair algorithms over a set of synthetic databases and present a qualitative case study over a well-known real dataset. The results show that our repair algorithms not only scale well for large datasets, but are able to accurately capture and correct inconsistencies, and accurately decide when a data repair versus a constraint repair is best.


Lise Getoor, University of Maryland, College Park

Title: Collective Graph Identification

The importance of network analysis is growing across many domains, and is fundamental in understanding online social interactions, biological processes, communication, ecological, financial, and transportation networks, and many more. In most of these domains, the networks of interest are not directly observed, but must be inferred from noisy and incomplete data, data that was often generated for purposes other than scientific analysis. In this talk, I will describe graph identification, the process of inferring the hidden network from noisy observational data. In particular, I will describe a collective approach to graph identification, which interleaves the necessary steps in the accurate reconstruction of the network.

Joint work with Galileo Namata and Stanley Kok, University of Maryland.


Lukasz Golab, AT&T Labs

Title: Measuring data quality with Data Auditor

I will describe the Data Auditor tool, which implements a novel approach to understanding and measuring data quality. The idea behind Data Auditor is to "try out" various rules and constraints to see how strongly they are satisfied by a given data set. Many types of constraints are supported, including logical predicates, association rules, and constraints that specify regularity in a time series (e.g., consecutive elements must arrive between four and six minutes apart). The goal of Data Auditor is to compute a concise and meaningful summary of which subsets of the data tend to satisfy or fail a given rule. I will discuss the technical challenges in achieving this goal and demonstrate the utility of Data Auditor on a variety of large data sets.
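
Data Auditor's own interface is not reproduced here; the following hedged sketch, with hypothetical function and variable names, merely illustrates the kind of check described above: testing the inter-arrival rule (consecutive elements four to six minutes apart) and summarizing how strongly each subset of the data (here, each source) satisfies it.

    from collections import defaultdict

    def gap_rule_summary(events, lo=240, hi=360):
        # events: (subset_key, timestamp_in_seconds) pairs.
        # For each subset, report the fraction of consecutive gaps that fall
        # within [lo, hi] seconds (four to six minutes by default).
        by_key = defaultdict(list)
        for key, t in events:
            by_key[key].append(t)
        summary = {}
        for key, times in by_key.items():
            times.sort()
            gaps = [b - a for a, b in zip(times, times[1:])]
            if gaps:
                summary[key] = sum(lo <= g <= hi for g in gaps) / len(gaps)
        return summary

    # Hypothetical polling feed: source "A" satisfies the rule, "B" does not.
    events = [("A", 0), ("A", 300), ("A", 590),
              ("B", 0), ("B", 60), ("B", 900)]
    print(gap_rule_summary(events))  # {"A": 1.0, "B": 0.0}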


Alan F. Karr, National Institute of Statistical Sciences

Title: What Can Research on Data Confidentiality Teach Us about Data Quality?

Conceptually, data quality is the capability of data to inform sound decisions. Consequently, data quality is itself a decision problem: ensuring or improving data quality consumes human, monetary and other resources that could instead be devoted to other purposes. To measure data quality directly in terms of its effect on decisions seems impossible now; as a "step in the right direction", we will describe a path that quantifies data quality by means of effects on statistical inferences drawn from the data.

The basis of that path is data confidentiality, a setting in which official statistics agencies (or other organizations) deliberately decrease data quality in order to preserve the privacy of data subjects and the confidentiality of the dataset by lowering disclosure risk. This is possibly the only context in which inference-based measures of data quality have been studied scientifically. In effect, therefore, our approach to understanding uncontrollable data quality effects is to build on extensive knowledge about controllable effects.


Dennis Shasha, Courant Institute of Mathematical Sciences

Title: Data Quality is Bad? Deal With It

Data quality issues may come as a surprise to database researchers who imagine that every record in a database reflects facts that have been carefully curated. Bad data surprises nobody in the life sciences, physics, or adversarial settings, however. Seeing how people in those settings deal with such issues may suggest new tools and perspectives in the data setting.

This talk presents examples from four areas -- biology, physics, secure banking, and drug evaluation -- and tries to derive lessons for data quality.


Zhiqiang Tan, Rutgers University

Title: Understanding and improving propensity score methods

Consider estimating the mean of an outcome in the presence of missing data or estimating population average treatment effects in causal inference. The propensity score is the conditional probability of non-missingness given explanatory variables.

In this talk, we will discuss propensity score methods, including doubly robust estimators. The focus will be on understanding these methods in comparison with others and on presenting recent advances in these methods.
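
For reference (these are the standard textbook forms, not a summary of the talk's new results), with R_i the response indicator, \hat{\pi}(X_i) the estimated propensity score and \hat{m}(X_i) an outcome-regression prediction of E[Y | X_i], the inverse-probability-weighted and doubly robust (augmented IPW) estimators of the mean outcome are:

    \hat{\mu}_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \frac{R_i Y_i}{\hat{\pi}(X_i)},
    \qquad
    \hat{\mu}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n}
        \left[ \frac{R_i Y_i}{\hat{\pi}(X_i)}
             - \frac{R_i - \hat{\pi}(X_i)}{\hat{\pi}(X_i)} \, \hat{m}(X_i) \right].

The doubly robust estimator remains consistent if either the propensity model or the outcome-regression model is correctly specified.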


Mike Wish, AT&T Labs Research

Title: Views, Issues and Illustrative Results on Data Quality and Information Utility

I come from a hybrid background of psychology, mathematical models and statistics, and computer science, all of which influence my work. My views about data, measurement, mining and modeling were strongly influenced by Clyde Coombs (Theory of Data) and the distinguished statistician next door at Murray Hill, John Tukey.

My conception of data quality is broader than typical, including the value or utility of the resulting information as well as the conclusions that can be mined from it. While reasonable accuracy, reliability, stability, consistency, reproducibility, etc. are necessary, they are insufficient for financial, scientific or other value. What counts much more than individual trustworthy elements is capturing and integrating the right data and using the collective framework for question answering, explanation, diagnosis, decision support and action with a feedback loop for continuous improvement.

Many years ago I used a Monte Carlo approach with multidimensional scaling to show that extremely noisy proximity data can still yield very meaningful, stable results. Also, a relatively poor or subjective dependent variable may still be sufficient for creating a useful, valid metric that is substantially better than the "truth set." A minimal sketch of such a noise experiment appears below.
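
The sketch assumes NumPy, SciPy and scikit-learn, and uses an arbitrary noise level unrelated to the original study:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    rng = np.random.default_rng(0)
    points = rng.uniform(size=(30, 2))              # "true" configuration
    true_d = squareform(pdist(points))              # true pairwise distances

    # Add substantial noise to the proximities, then re-symmetrize.
    noisy = true_d + rng.normal(scale=0.3, size=true_d.shape)
    noisy = np.clip((noisy + noisy.T) / 2, 0, None)
    np.fill_diagonal(noisy, 0)

    embedding = MDS(n_components=2, dissimilarity="precomputed",
                    random_state=0).fit_transform(noisy)
    recovered_d = squareform(pdist(embedding))

    # Correlation between true and recovered distances as a crude recovery score.
    iu = np.triu_indices_from(true_d, k=1)
    print(np.corrcoef(true_d[iu], recovered_d[iu])[0, 1])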

For many problems I deal with, it is important that the data and measures, no matter how quantitative, objective and accurate, reflect customer perceptions, behavior and impact, and be timely enough to take action. I always ask what will or should be done differently depending on the results, and don't get involved if this can't be answered. I have seen innumerable instances where very precise, objective data measures the wrong things, or is processed so inappropriately that the value and investment suffer.

More attention is often given to refining a single data set rather than providing multiple methods on diverse data sets to understand generalizability, limitations and longer term stability. Using multiple types of data sources and methods has been critical for increasing reliability and validity (concurrent and predictive, convergent and discriminant) and identifying method variance. Understanding relationships and interactions among variables is essential to avoid confounding and artifactual results.

Our approach to data has yielded a long list of models, methods and techniques that have been widely used by AT&T marketing, customer care, network, operations and finance. I will discuss some aspects of these models, along with key issues and illustrations. Because more recent work is highly proprietary, I will not provide specific details, but will instead illustrate the steps entailed in going from initial data capture to understanding, validation and decisions. These include new types of customer data such as Mark the Spot, pre-processing and dealing with extreme events, resolving data discrepancies from multiple sources, targeting for inter-dependent services, and measuring and improving network performance.

