DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data

November 3 - 4, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Parni Dasu, AT&T Labs, tamr@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.

Dhammika Amaratunga, Javier Cabrera, and Nandini Raghavan
Johnson & Johnson, Rutgers, Johnson & Johnson

Title: Preprocessing Microarray Data

DNA Microarray technology is one of the most promising tools for obtaining gene expression data. Whereas previously existing technologies were capable of analyzing only a few genes at a time, with DNA microarrays one can analyze thousands of genes simultaneously. Since genes tend to act together in clusters or pathways, this new technology is a quantum leap in our ability to study genes. However, as is the case with other data-mining enterprises, one has to contend with various issues that arise in a technology that, although rapidly evolving, is still in its infancy.

In this talk we will briefly describe the data collection process for DNA microarray data, focusing on the issues and challenges that it presents for data analysis. We will also discuss our approach to the various types of problems that are standard in such data: first a quality assessment of the arrays, then intra-array issues such as adjusting spot intensities and applying transformations, and finally inter-array issues such as removing non-linear array effects, correcting extraneous effects, flagging outliers, and summarizing across replicates.
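
As a rough illustration of the intra-array and inter-array steps listed above, the following minimal sketch log-transforms spot intensities, centers each array on its median, summarizes each gene by the median across replicate arrays, and flags spots that sit far from that median. The function names, the toy intensities, and the cutoff of 1.0 are assumptions made for illustration. Median centering removes only a constant array offset; it is a simple stand-in for the non-linear array-effect correction mentioned in the abstract, not the speakers' method.

```python
import math
from statistics import median

def log2_transform(intensities):
    """Log-transform raw spot intensities (assumed strictly positive)."""
    return [math.log2(x) for x in intensities]

def median_center(array):
    """Subtract the array's own median so arrays share a common center."""
    m = median(array)
    return [x - m for x in array]

def summarize_replicates(arrays, outlier_cutoff=1.0):
    """Return per-gene medians across replicates and flagged (gene, array) pairs.

    outlier_cutoff is a hypothetical threshold on the log2 scale.
    """
    summaries, flags = [], []
    for gene, values in enumerate(zip(*arrays)):
        center = median(values)
        summaries.append(center)
        for rep, v in enumerate(values):
            if abs(v - center) > outlier_cutoff:
                flags.append((gene, rep))
    return summaries, flags

# Toy data: three replicate arrays, four genes each (made-up intensities).
raw = [
    [100.0, 200.0, 400.0, 800.0],
    [200.0, 400.0, 800.0, 1600.0],  # uniformly brighter array
    [100.0, 200.0, 6400.0, 800.0],  # one aberrant spot (gene 2)
]
processed = [median_center(log2_transform(a)) for a in raw]
summaries, flags = summarize_replicates(processed)
print(summaries)
print(flags)
```

In this toy run the uniformly brighter array lines up with the first after centering, and only the aberrant spot on the third array is flagged.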

T. Bonates, P. Hammer, A. Kogan, and I. Lozina, RUTCOR, Rutgers University

Title: Maximum Patterns and Outliers in the Logical Analysis of Data (LAD)

A maximum pattern covering a (say, positive) observation is an interval in R^n containing the given observation and the maximum number of other positive observations in a dataset, without containing any of its negative observations. An algorithm identifying maximum patterns is described, and used for defining outliers. The increase of classification accuracy after removing outliers is demonstrated on several publicly available datasets frequently used as benchmarks in data mining.
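
To make the definition above concrete, here is a minimal sketch that grows an axis-parallel box around a seed observation without ever admitting a negative observation. It is a greedy heuristic written for illustration, so it is not guaranteed to find a true maximum pattern and should not be read as the algorithm presented in the talk. The helper names and the toy points are assumptions. Treating observations whose pattern covers very few other observations as outlier candidates is likewise only one plausible reading of the abstract.

```python
def bounding_box(points):
    """Smallest axis-parallel box (lo, hi) containing all given points."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return lo, hi

def inside(box, p):
    """True if point p lies in the closed box."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))

def greedy_pattern(seed, positives, negatives):
    """Grow a box from seed; return (box, number of positives covered)."""
    box = (list(seed), list(seed))
    while True:
        best_box = None
        best_count = sum(inside(box, p) for p in positives)
        for p in positives:
            if inside(box, p):
                continue
            candidate = bounding_box([box[0], box[1], p])
            if any(inside(candidate, n) for n in negatives):
                continue  # expansion would swallow a negative observation
            count = sum(inside(candidate, q) for q in positives)
            if count > best_count:
                best_box, best_count = candidate, count
        if best_box is None:
            return box, best_count
        box = best_box

# Toy data in R^2 (made up): four clustered positives, one isolated positive.
positives = [(1, 1), (2, 1), (1, 2), (2, 2), (9, 9)]
negatives = [(5, 5)]
for seed in positives:
    print(seed, greedy_pattern(seed, positives, negatives))
```

Here the four clustered observations share the box [1, 2] x [1, 2], while the pattern of (9, 9) covers only itself, which is what would mark it as an outlier candidate under the assumption stated above.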

Jiawei Han, University of Illinois at Urbana-Champaign

Title: Data Mining: A Powerful Tool for Data Cleaning

It is highly desirable to have the data mining process performed on cleansed data in order to mine interesting data characteristics, models, correlations, outliers, etc. However, real-world data needs to be cleaned before mining, and cleaning is often more costly and time-consuming than knowledge mining itself. Fortunately, data mining can serve as a powerful tool for data cleaning based on statistics, correlation, classification and cluster analysis.

Dasu and Johnson's book "Exploratory Data Mining and Data Cleaning" and their SIGMOD'03 tutorial have presented a comprehensive overview of this theme. In this talk, instead of repeating major points addressed there, I will present some of our recent studies related to this theme, including (1) object identification and merging: identifying objects by data mining and statistical analysis, (2) new measures for correlation prediction based on association properties, (3) mining noisy data across multiple relations, and (4) effective document classification in the presence of a substantial amount of noise.

Jon Hill, British Telecommunications

Title: A $220 Million Success Story

Four years ago we initiated Information Management in BT and met a lack of interest in, and resistance to, our approach. We have since turned this around to a position where we now hold a significant budget to undertake Information Management initiatives for a demanding internal customer base. The presentation describes our approach to introducing Information Management, the problems we encountered and where we were successful.

Since starting Information Management we have built up a capability within BT to analyse information quality problems and deliver solutions. We use a number of information quality tools and have built up expertise in applying them. Last financial year we managed to deliver $220 million in business benefit. The main contribution to this success has been the way we manage projects and gain business buy-in. The presentation will indicate where we have achieved these benefits and give some examples of the projects delivered.

Theodore Johnson, AT&T Labs

Title: Bellman - A Data Quality Browser

The most difficult and time-consuming task in a data analysis is usually to understand the data and its many peculiarities. An enterprise database will contain hundreds to thousands of tables and very many exceptions to normal processing rules, so understanding a data set is not an easy task. After many experiences with unfortunate surprises which ruined our analyses, we decided to build a tool which would "map" a database and automate many data exploration tasks. In this talk we discuss the tool that we built, Bellman, and its technology, applications, and future.

Bing Liu, University of Illinois at Chicago

Title: Web page cleaning for Web data mining

The rapid expansion of the Internet has made the Web a popular place for disseminating and collecting information. Web data mining thus becomes an important technology for discovering useful knowledge or information on the Web. However, useful information on the Web is often accompanied by a large amount of noise such as banner advertisements, navigation bars, copyright and privacy notices, etc. Although such information items are functionally useful for human viewers and necessary for the Web site owners, they often hamper automated information gathering and Web data mining, e.g., Web page clustering, classification, information retrieval and information extraction. In this talk, we will show that Web page noise can seriously harm Web data mining. Cleaning Web pages before mining is very important for many Web mining tasks. We will also describe a few techniques to deal with the cleaning problem, and present some results to show that cleaning is able to improve the accuracy of data mining significantly.
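
The abstract does not say which cleaning techniques the talk covers. As one simple illustration of the idea, the sketch below treats a text block as noise when it recurs on more than half of a site's pages, which is typical of navigation bars and copyright notices. The function name, the 0.5 threshold, and the toy pages are assumptions, and pages are assumed to be already split into text blocks.

```python
from collections import Counter

def remove_site_wide_blocks(pages, max_share=0.5):
    """Drop text blocks that appear on more than max_share of the pages."""
    page_counts = Counter()
    for blocks in pages:
        page_counts.update(set(blocks))  # count each block once per page
    limit = max_share * len(pages)
    return [[b for b in blocks if page_counts[b] <= limit] for blocks in pages]

# Toy site: each page is a list of text blocks (made-up content).
pages = [
    ["Home | Products | About", "Review of camera A", "Copyright 2003"],
    ["Home | Products | About", "Review of camera B", "Copyright 2003"],
    ["Home | Products | About", "Shipping rates", "Copyright 2003"],
]
cleaned = remove_site_wide_blocks(pages)
print(cleaned)
```

Only the page-specific block survives on each page, which is the content a clustering or classification step would then see.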

Renee Miller, University of Toronto

Title: Managing Inconsistency in Data Exchange and Integration

Data exchange is the problem of taking data structured under a source schema and creating an instance of an independent target schema that reflects the source data as accurately as possible. Data exchange is important in many real world applications involving the translation or migration of data between database systems, applications, or enterprises. Data integration is the problem of providing an integrated, virtual view of a set of heterogeneous sources that can be used for query answering. In both problems, the data to be exchanged or integrated may contain errors or may be inconsistent with the target or integrated schema. Due to the autonomy of the data sources or the sheer size and complexity of the data, manual cleaning and reconciliation may not be possible. In this work, we consider techniques for managing and querying inconsistent data that has been exchanged or integrated.

S. Muthukrishnan, Rutgers University and AT&T Research

Title: Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases

Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is one of data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.), and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective.

In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data, where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach -- both in principle and in practice -- to face data quality problems in network traffic databases.

We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, which is the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.

Joint work with Flip Korn and Yunyue Zhu.
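
A minimal sketch of what a rule template with tolerance and likelihood parameters might look like, assuming the hypothetical rule "consecutive readings differ by at most `tolerance`, with probability at least `likelihood`". The rule, the quantile-based way of choosing the tolerance from history, and the toy readings are assumptions made for illustration. They are not taken from PAC-Man.

```python
import math

def fit_tolerance(history, likelihood):
    """Smallest tolerance that at least `likelihood` of historical jumps satisfy."""
    jumps = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    index = max(0, math.ceil(likelihood * len(jumps)) - 1)
    return jumps[index]

def violates(window, tolerance, likelihood):
    """True if the share of jumps within tolerance falls below `likelihood`.

    The window must hold at least two readings.
    """
    jumps = [abs(b - a) for a, b in zip(window, window[1:])]
    within = sum(j <= tolerance for j in jumps)
    return within / len(jumps) < likelihood

# Made-up traffic readings; the final jump to 150 is a historical glitch.
history = [100, 102, 101, 103, 104, 102, 103, 105, 104, 103, 150]
tolerance = fit_tolerance(history, likelihood=0.9)

quiet = [103, 104, 103, 105, 104]
erratic = [103, 140, 101, 102, 160]
print(tolerance, violates(quiet, tolerance, 0.9), violates(erratic, tolerance, 0.9))
```

Because the constraint is approximate and probabilistic, a single large jump in history does not inflate the tolerance, and a window is reported only when too many of its readings break the rule.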

Ron Pearson, Thomas Jefferson University

Title: The Data Cleaning Problem -- Some Key Issues and Practical Approaches

This talk presents a broad survey of some of the issues that arise in cleaning large datasets prior to detailed analysis, often by historically "standard" methods that exhibit poor performance in the presence of various data anomalies. The talk considers a variety of different types of data anomalies, including outliers, missing data, misalignments, and the presence of noninformative variables in the dataset. Emphasis is given to common working assumptions and their appropriateness, sources of these data anomalies, and practical methods for dealing with them.

R.K. Pearson and M. Gabbouj, Thomas Jefferson University and Tampere University of Technology

Title: Relational Nonlinear FIR Filters

The general problem of cleaning up relational databases is extremely important in practice. This paper introduces an extension of the class of nonlinear finite impulse response (FIR) filters intended for these applications.
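
For readers unfamiliar with the filter class, the moving-window median is a classical example of a nonlinear FIR filter: each output depends on a finite window of inputs, but not linearly. The sketch below shows only that standard filter on a made-up sequence. It does not attempt the relational extension the paper introduces, and the function name and window handling at the ends are assumptions.

```python
from statistics import median

def median_filter(values, half_width=1):
    """Replace each value by the median of its window (shrunk at the ends)."""
    n = len(values)
    return [
        median(values[max(0, i - half_width): min(n, i + half_width + 1)])
        for i in range(n)
    ]

signal = [1, 2, 50, 3, 4, 5]  # 50 is an isolated spike
smoothed = median_filter(signal)
print(smoothed)
```

The spike is removed while the underlying trend is kept, which a linear moving average of the same width would not achieve.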

Gregg Vesonder, Jon Wright, and Parni Dasu, AT&T Labs - Research

Title: Life Cycle Datamining

In our datamining projects most of the actual effort focuses not on the "traditional" aspects of datamining, acquiring information and knowledge from data, but on lifecycle issues such as acquisition, data quality, preparation, distribution, storage, and archiving. The talk will explore the systems-oriented challenges that are part of the datamining life cycle and how we are addressing them using artificial intelligence techniques.

Grace Zhang, Morgan Stanley

Title: Data Quality in Trading Surveillance

We report on the Data Quality efforts of the Trading Surveillance and Analysis project -- a project that supports regulatory surveillance of equity trading. The data challenges faced here include mismatched attribute values across multiple systems, mismatched timestamps for inter-related facts, inconsistency among multiple fields on the same record, and inconsistency among fields across records. Other specific challenges that we have encountered include dealing with outages in feeds from data vendors, and verifying fact data against referential data systems. We describe in our presentation the details of these challenges and steps that we take to identify, measure and correct for these problems.
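
Three of the challenges named above (mismatched attribute values across systems, mismatched timestamps for inter-related facts, and inconsistent fields on one record) can be illustrated with a few record-level checks. This is a minimal sketch under assumptions: the field names, the two-second clock-skew allowance, and the sample records are all hypothetical and do not describe the project's actual systems.

```python
from datetime import datetime, timedelta

def check_trade(order, execution, max_skew=timedelta(seconds=2)):
    """Return a list of data quality problems for one order/execution pair."""
    problems = []
    # Attribute values should agree across the two source systems.
    if order["symbol"] != execution["symbol"]:
        problems.append("symbol mismatch across systems")
    # An execution should not precede its order, allowing for clock skew.
    if execution["time"] + max_skew < order["time"]:
        problems.append("execution timestamp precedes order")
    # Fields on the same record should be mutually consistent.
    expected = execution["price"] * execution["quantity"]
    if abs(execution["notional"] - expected) > 0.01:
        problems.append("notional inconsistent with price * quantity")
    return problems

order = {"symbol": "XYZ", "time": datetime(2003, 11, 3, 10, 0, 10)}
good = {"symbol": "XYZ", "time": datetime(2003, 11, 3, 10, 0, 11),
        "price": 10.5, "quantity": 100, "notional": 1050.0}
bad = {"symbol": "XYZ.N", "time": datetime(2003, 11, 3, 10, 0, 0),
       "price": 10.5, "quantity": 100, "notional": 1500.0}
print(check_trade(order, good))
print(check_trade(order, bad))
```

Counting how often each problem string occurs per day would give a simple way to measure these problems over time.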

Document last modified on October 20, 2003.