DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data

November 3 - 4, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Tamraparni Dasu, AT&T Labs, tamr@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.

The word "data" has taken on a broad meaning in the last five years. It is no longer just a set of numbers, or even text. New data paradigms include data streams characterized by a high rate of accumulation, web-scraped documents and tables, web server logs, images, audio and video, to name a few. Well-known challenges of heterogeneity and scale continue to grow as data are integrated from disparate sources and become more complex in size and content.

While new paradigms have enriched data, the quality of data has declined considerably. In earlier times, data were collected as a part of pre-designed experiments where data collection could be monitored to enforce data quality standards. The data sets themselves were small enough that even if data collection was unsupervised, the data could be quickly scrubbed through highly manual methods. Today, neither monitoring of data collection nor manual scrubbing of data is feasible due to the sheer size and complexity of the data.

An additional challenge in addressing data quality is the domain dependence of problems and solutions. Metadata and domain expertise have to be discovered and incorporated into the solutions, entailing extensive interaction with widely scattered experts. This particular aspect of data quality makes it difficult to find general, one-size-fits-all solutions. However, the process of discovering metadata and domain expertise can be automated through the development of appropriate tools and techniques such as data browsing and exploration, knowledge representation and rule-based programming.
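To make the rule-based approach concrete, the following is a minimal sketch of domain-constraint validation. The field names and the rules themselves are hypothetical stand-ins for the kind of domain expertise that would be elicited from experts or discovered from the data.

```python
# Hypothetical domain rules expressed as predicates over a record;
# in practice these would be discovered metadata or expert knowledge.
def make_rules():
    return [
        ("age is a plausible human age",
         lambda r: 0 <= r.get("age", -1) <= 120),
        ("state code is two letters",
         lambda r: isinstance(r.get("state"), str) and len(r["state"]) == 2),
        ("end not before start",
         lambda r: r.get("start") is None or r.get("end") is None
                   or r["start"] <= r["end"]),
    ]

def validate(records, rules):
    """Return (record_index, violated_rule_description) pairs."""
    violations = []
    for i, rec in enumerate(records):
        for desc, pred in rules:
            if not pred(rec):
                violations.append((i, desc))
    return violations

records = [
    {"age": 34, "state": "NJ", "start": 1, "end": 5},
    {"age": 210, "state": "New Jersey", "start": 5, "end": 1},
]
print(validate(records, make_rules()))
# → [(1, 'age is a plausible human age'), (1, 'state code is two letters'),
#    (1, 'end not before start')]
```

Keeping the rules as data, separate from the validation loop, is what lets domain expertise be added or revised without reprogramming the cleaning pipeline.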

Many disciplines have taken piecemeal approaches to data quality. The areas of process management, statistics, data mining, database research and metadata coding have all developed their own ad hoc approaches to solve different pieces of the data quality puzzle. These include statistical techniques for process monitoring and for the treatment of incomplete data and outliers; techniques for monitoring and auditing data delivery processes; database research on integration and on the discovery of functional dependencies and join paths; and languages for data exchange and metadata representation.
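As one small instance of the statistical piece of the puzzle, outliers can be flagged with a robust (modified) z-score built on the median and the median absolute deviation, so that a few gross recording errors do not distort the baseline they are judged against. The threshold of 3.5 and the 0.6745 consistency constant are common conventions, not universal rules, and the sample readings are invented.

```python
import statistics

def robust_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to judge against
    # 0.6745 scales MAD to be comparable to a standard deviation
    # under approximate normality.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 97.0]  # one gross recording error
print(robust_outliers(readings))  # → [5]
```

Unlike a mean-and-standard-deviation rule, this check still works when the contaminating values are large enough to pull the mean toward themselves.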

We need an integrated end-to-end approach within a common framework, where the various disciplines can complement and leverage each other's strengths. In this workshop, our broad objective is to bring together experts from different research disciplines to initiate a comprehensive technical discussion on data quality, data cleaning and treatment of noisy data. Specifically,

* To provide an overview of the existing research in data quality

* To present data quality as a continuous, end-to-end concept

* To discuss and update the definition of data quality, and to develop metrics for measuring it

* To emphasize data exploration, data browsing and data profiling for validating schema-specific constraints and identifying aberrations

* To focus on disciplines such as knowledge representation and rule-based programming for capturing and validating domain-specific constraints

* To highlight applications and case studies

* To present research tools and techniques

* To identify research problems in data quality and data cleaning
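The data profiling emphasized above can be sketched in a few lines: per-column summaries (null rate, distinct count, observed value types) that surface aberrations before any schema constraints are formalized. The column names and sample rows here are illustrative assumptions.

```python
from collections import Counter

def profile(rows, columns):
    """Summarize each named column of a list of record dicts."""
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
            "distinct": len(set(non_null)),
            # Mixed types in one column are a classic sign of dirty data.
            "types": dict(Counter(type(v).__name__ for v in non_null)),
        }
    return report

rows = [
    {"zip": "08854", "calls": 3},
    {"zip": "08901", "calls": "3"},   # type drift: number stored as text
    {"zip": None,    "calls": 0},
]
report = profile(rows, ["zip", "calls"])
print(report["calls"]["types"])  # → {'int': 2, 'str': 1}
```

Even this crude profile reveals two common aberrations at a glance: missing values in one column and inconsistent typing in another, each a candidate for a cleaning rule.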

Workshop Format

The format of the workshop will be a combination of invited talks, contributed papers and posters. Papers accompanying invited and contributed talks will be published in the workshop proceedings.

Document last modified on May 16, 2003.