In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad hoc. Unfortunately, these approaches are too rigid or limited for the subtle data quality problems arising in network data, where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach -- both in principle and in practice -- to face data quality problems in network traffic databases.
We propose a continuous data quality monitoring approach
based on probabilistic, approximate constraints (PACs).
These are simple, user-specified rule templates with open
parameters for tolerance and likelihood.
We rely on statistical techniques to derive effective
parameter values from the data, and show how to apply them
for monitoring data quality.
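To make the idea concrete, here is a minimal sketch of what a PAC-style check might look like. All names, the quantile-based estimator, and the deviation-from-mean rule template are illustrative assumptions, not the paper's actual templates or statistical techniques: the constraint reads roughly as "a new observation deviates from the historical mean by more than `tolerance`, with that tolerance chosen so a `likelihood` fraction of past data satisfies the rule."

```python
# Hypothetical PAC-style check (illustrative only; the paper's actual
# rule templates and parameter-fitting techniques may differ).
import statistics

def fit_pac(history, likelihood=0.95):
    """Derive a tolerance from historical data so that roughly the given
    fraction (likelihood) of past observations satisfies the constraint."""
    mean = statistics.mean(history)
    deviations = sorted(abs(x - mean) for x in history)
    # Take the deviation at the `likelihood` quantile as the tolerance.
    idx = min(int(likelihood * len(deviations)), len(deviations) - 1)
    return mean, deviations[idx]

def violates_pac(value, mean, tolerance):
    """Flag a new observation falling outside the tolerance band."""
    return abs(value - mean) > tolerance

# Hypothetical hourly traffic volumes from a data feed.
history = [100, 102, 98, 101, 99, 103, 97, 100, 98, 102]
mean, tol = fit_pac(history, likelihood=0.9)
print(violates_pac(101, mean, tol))  # in-band value -> False
print(violates_pac(250, mean, tol))  # gross outlier -> True
```

A violation here is only probable evidence of a quality problem, not a hard error, which is what distinguishes such approximate constraints from traditional integrity constraints.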
In principle, our PAC-based approach can be applied to
data quality problems in any data feed.
We present PACMAN, a system that manages PACs for
the entire aggregate network traffic database in a large ISP,
and show that it is very effective in monitoring data quality.
Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2003/2003-19.ps.gz