Dhammika Amaratunga, Javier Cabrera, and Nandini Raghavan
Title: Preprocessing Microarray Data
DNA Microarray technology is one of the most promising tools for obtaining gene expression data.
Whereas previously existing technologies were capable of analyzing only a few genes at a time,
with DNA microarrays one can analyze thousands of genes simultaneously. Since genes tend to act
together in clusters or pathways, this new technology is a quantum leap in our ability
to study genes. However, as is the case
with other data-mining enterprises, one has to contend with various issues that arise in a
technology that, although rapidly evolving, is still in its infancy.
In this talk we will briefly describe the data collection process for DNA microarray data,
focusing on the issues and challenges that it presents for data analysis. We will also discuss
our approach to the various types of problems that typically arise in such data, starting with
quality assessment of the arrays, moving to intra-array issues such as adjusting
spot intensities and applying transformations, and finally to
inter-array issues such as removing non-linear array effects, correcting for extraneous effects,
flagging outliers, and summarizing across replicates.
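The abstract does not name specific algorithms; as a rough illustration of the kind of intra-array and inter-array adjustments involved, the sketch below background-corrects and log-transforms each array, then quantile-normalizes across arrays. The function name and the choice of quantile normalization are assumptions for illustration, not necessarily the authors' approach.

    import numpy as np

    def preprocess(intensities, background):
        """Illustrative sketch only: background-correct and log-transform each
        array (intra-array), then quantile-normalize across arrays (inter-array).
        Both inputs are genes x arrays matrices."""
        corrected = np.maximum(intensities - background, 1.0)   # keep values positive for the log
        logged = np.log2(corrected)

        # Quantile normalization: give every array the same empirical distribution.
        ranks = np.argsort(np.argsort(logged, axis=0), axis=0)
        mean_quantiles = np.sort(logged, axis=0).mean(axis=1)
        return mean_quantiles[ranks]

Summarizing across replicates could then be as simple as taking a median over replicate columns, though the talk's actual methods may differ.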
Title: Maximum Patterns and Outliers in the Logical Analysis of Data (LAD)
A maximum pattern covering a (say, positive) observation is
an interval in R^n that contains the given observation and the
maximum number of other positive observations in a dataset,
without containing any of its negative observations. An
algorithm for identifying maximum patterns is described and used
to define outliers. The increase in classification accuracy
obtained after removing outliers is demonstrated on several publicly
available datasets frequently used as benchmarks in data mining.
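The abstract does not describe the algorithm itself; as a rough illustration of the covering condition (an axis-aligned interval containing the seed observation and no negative observations), here is a naive greedy sketch. The function name and the greedy strategy are assumptions; the authors' algorithm finds a true maximum pattern rather than a heuristic one.

    import numpy as np

    def greedy_pattern(seed, positives, negatives):
        """Greedy sketch (not the talk's algorithm): grow an axis-aligned interval
        around 'seed' to cover as many positive observations as possible while
        keeping every negative observation outside.  'positives' and 'negatives'
        are NumPy arrays of shape (n, d)."""
        lo, hi = seed.copy(), seed.copy()           # start from the degenerate box {seed}
        covered = np.zeros(len(positives), dtype=bool)
        improved = True
        while improved:
            improved = False
            for i, p in enumerate(positives):       # try to absorb each uncovered positive
                if covered[i]:
                    continue
                new_lo, new_hi = np.minimum(lo, p), np.maximum(hi, p)
                inside = np.all((negatives >= new_lo) & (negatives <= new_hi), axis=1)
                if not inside.any():                # expansion keeps all negatives outside
                    lo, hi, covered[i], improved = new_lo, new_hi, True, True
        return lo, hi, int(covered.sum())

Under one plausible reading of the abstract, a positive observation whose pattern covers very few other positives would be a candidate outlier.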
Title: Data Mining: A Powerful Tool for Data Cleaning
It is highly desirable to have the data mining process performed on
cleansed data to mine interesting data characteristics, models,
correlations, outliers, etc. However, real-world data needs to be
cleaned before mining, which is often more costly and time
consuming than knowledge mining itself. Fortunately, data mining
can serve as a powerful tool for data cleaning based on
statistics, correlation, classification and cluster analysis.
Dasu and Johnson's book "Exploratory Data Mining and Data
Cleaning" and their SIGMOD'03 tutorial have presented a
comprehensive overview on this theme. In this talk, instead of
repeating major points addressed there, I will present some of our
recent studies related to this theme, including (1) object
identification and merging: identifying objects by data mining and
statistical analysis, (2) new measures for correlation prediction
based on association properties, (3) mining noisy data across
multiple relations, and (4) effective document classification in
the presence of a substantial amount of noise.
Title: A $220 Million Success Story
Four years ago we initiated Information Management in BT and encountered a lack of interest in, and resistance to, our approach. Subsequently, we have turned this around to a position where we now hold a significant budget to undertake Information Management initiatives for a demanding internal customer base. The presentation describes our approach to introducing Information Management, the problems we encountered, and where we were successful.
Since starting Information Management we have built up a capability within BT to analyse information quality problems and deliver solutions. We use a number of information quality tools and have developed expertise in applying them. In the last financial year we delivered $220 million in business benefit. The main contribution to this success has been the way we manage projects and gain business buy-in. The presentation will indicate where we have achieved these benefits and give some examples of the projects delivered.
Title: Bellman - A Data Quality Browser
The most difficult and time-consuming task in a data
analysis is usually to understand the data and its
many peculiarities. An enterprise database will contain
hundreds to thousands of tables and very many exceptions
to normal processing rules, so understanding a data set is
not an easy task.
After many experiences with unfortunate
surprises which ruined our analyses, we decided to build
a tool which would `map' a database and automate many
data exploration tasks. In this talk we discuss the tool
that we built, Bellman, and its technology, applications,
and future.
Title: Web page cleaning for Web data mining
The rapid expansion of the Internet has made the Web a popular place for
disseminating and collecting information. Web data mining thus becomes an
important technology for discovering useful knowledge or information on
the Web. However, useful information on the Web is often accompanied by a
large amount of noise such as banner advertisements, navigation bars,
copyright and privacy notices, etc. Although such information items are
functionally useful for human viewers and necessary for the Web site
owners, they often hamper automated information gathering and Web data
mining, e.g., Web page clustering, classification, information retrieval
and information extraction. In this talk, we will show that Web page noise
can seriously harm Web data mining. Cleaning Web pages before mining is
very important for many Web mining tasks. We will also describe a few
techniques to deal with the cleaning problem, and present some results to
show that cleaning is able to improve the accuracy of data mining
significantly.
Renee Miller, University of Toronto
Title: Managing Inconsistency in Data Exchange and Integration
Data exchange is the problem of taking data structured under a source
schema and creating an instance of an independent target schema that
reflects the source data as accurately as possible. Data exchange is
important in many real world applications involving the translation or
migration of data between database systems, applications, or
enterprises. Data integration is the problem of providing an
integrated, virtual view of a set of heterogeneous sources that can be
used for query answering. In both problems, the data to be exchanged
or integrated may contain errors or may be inconsistent with the
target or integrated schema. Due to the autonomy of the data sources
or the sheer size and complexity of the data, manual cleaning and
reconciliation may not be possible. In this work, we consider
techniques for managing and querying inconsistent data that has been
exchanged or integrated.
Title: Checks and Balances: Monitoring Data Quality Problems in Network
Traffic Databases
Internet Service Providers (ISPs) use real-time data feeds of
aggregated traffic in their network to support technical as well
as business decisions. A fundamental difficulty with building
decision support tools based on aggregated traffic data feeds is
one of data quality. Data quality problems stem from network-
specific issues (irregular polling caused by UDP packet drops and
delays, topological mislabelings, etc.), and make it difficult to
distinguish between artifacts and actual phenomena, rendering
data analysis based on such data feeds ineffective.
In principle, traditional integrity constraints and triggers may
be used to enforce data quality. In practice, data cleaning is
done outside the database and is ad hoc. Unfortunately, these
approaches are too rigid and limited for the subtle data quality
problems arising from network data where existing problems morph
with network dynamics, new problems emerge over time, and poor
quality data in a local region may itself indicate an important
phenomenon in the underlying network. We need a new approach --
both in principle and in practice -- to face data quality
problems in network traffic databases.
We propose a continuous data quality monitoring approach based on
probabilistic, approximate constraints (PACs). These are simple,
user-specified rule templates with open parameters for tolerance
and likelihood. We use statistical techniques to instantiate
suitable parameter values from the data, and show how to apply
them for monitoring data quality. In principle, our PAC-based
approach can be applied to data quality problems in any data
feed. We present PAC-Man, which is the system that manages PACs
for the entire aggregate network traffic database in a large ISP,
and show that it is very effective in monitoring data quality
problems.
Joint work with Flip Korn and Yunyue Zhu.
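The abstract does not give the PAC template syntax; the sketch below shows one plausible instance, a rule of the form "successive polls differ by at most delta, with likelihood at least p", with the tolerance instantiated from historical data via an empirical quantile. The function names and the specific rule are illustrative assumptions, not the PAC-Man implementation.

    import numpy as np

    def fit_tolerance(history, quantile=0.99):
        """Instantiate the open tolerance parameter of the template
        '|difference between successive polls| <= delta' from historical
        data, using an empirical quantile as one plausible choice."""
        return np.quantile(np.abs(np.diff(history)), quantile)

    def monitor(stream, delta, min_likelihood=0.95, window=100):
        """Flag windows in which the fraction of polls satisfying the
        constraint falls below the required likelihood."""
        ok = np.abs(np.diff(stream)) <= delta
        alerts = []
        for start in range(0, len(ok) - window + 1, window):
            frac = ok[start:start + window].mean()
            if frac < min_likelihood:
                alerts.append((start, frac))
        return alerts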
Title: The Data Cleaning Problem -- Some Key Issues and Practical Approaches
This talk presents a broad survey of some of the issues that arise in
cleaning large datasets prior to detailed analysis, often by historically
"standard" methods that exhibit poor performance in the presence of various
data anomalies. The talk considers a variety of different types of data
anomalies, including outliers, missing data, misalignments, and the
presence of noninformative variables in the dataset. Emphasis is given to
common working assumptions and their appropriateness, sources of these
data anomalies, and practical methods for dealing with them.
Title: Relational Nonlinear FIR Filters
The general problem of cleaning up relational databases is extremely
important in practice. This paper introduces an extension of the class
of nonlinear finite impulse response (FIR) filters intended for these
applications.
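The relational extension itself is not described in the abstract; for orientation, the sketch below shows the canonical scalar starting point, a sliding-window median filter, one of the best-known nonlinear finite-window filters used for cleaning numeric sequences.

    import numpy as np

    def median_filter(x, width=5):
        """Sliding-window median: a canonical nonlinear finite-window filter
        often used to remove outliers from a numeric sequence.  The relational
        extension discussed in the paper is not reproduced here."""
        half = width // 2
        x = np.asarray(x, dtype=float)
        padded = np.pad(x, half, mode='edge')     # extend the ends of the sequence
        return np.array([np.median(padded[i:i + width]) for i in range(len(x))])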
Title: Life Cycle Datamining
In our datamining projects most of the actual effort focuses not on the "traditional" aspects of datamining, acquiring information and knowledge from data, but on life cycle issues such as acquisition, data quality, preparation, distribution, storage, and archiving. The talk will explore the systems-oriented challenges that are part of the datamining life cycle and how we are addressing them using artificial intelligence techniques.
Title: Data Quality in Trading Surveillance
We report on the Data Quality efforts of the Trading Surveillance and Analysis
project -- a project that supports regulatory surveillance of equity trading.
The data challenges faced here include mismatched attribute values across multiple systems,
mismatched timestamps for inter-related facts, inconsistency among multiple fields on
the same record, and inconsistency among fields across records. Other specific challenges
that we have encountered include dealing with outages in feeds from data vendors, and
verifying fact data against referential data systems. In our presentation we describe
these challenges in detail and the steps we take to identify, measure, and correct for
them.
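The project's actual checks are not specified in the abstract; the following is a minimal sketch of one of the listed challenge types, flagging mismatched attribute values for the same key across two systems. The feed structure and field names are hypothetical.

    def mismatched_attributes(feed_a, feed_b, fields):
        """Flag records whose key appears in both feeds but whose attribute
        values disagree.  Each feed is a dict mapping a record key to a dict
        of field -> value; the structure and field names are hypothetical."""
        problems = []
        for k in feed_a.keys() & feed_b.keys():
            for f in fields:
                if feed_a[k].get(f) != feed_b[k].get(f):
                    problems.append((k, f, feed_a[k].get(f), feed_b[k].get(f)))
        return problems

    # Hypothetical usage: compare the symbol and timestamp reported by two
    # internal systems for the same order identifier.
    # issues = mismatched_attributes(system_a, system_b, fields=["symbol", "trade_time"])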
Dhammika Amaratunga, Javier Cabrera, and Nandini Raghavan, Johnson & Johnson, Rutgers University, and Johnson & Johnson
T. Bonates, P. Hammer, A. Kogan, and I. Lozina, RUTCOR, Rutgers University
Jiawei Han, University of Illinois at Urbana-Champaign
Jon Hill, British Telecommunications
Theodore Johnson, AT&T Labs
Bing Liu, University of Illinois at Chicago
S. Muthukrishnan, Rutgers University and AT&T Research
Ron Pearson, Thomas Jefferson University
R.K. Pearson and M. Gabbouj, Thomas Jefferson University and Tampere University of Technology
Gregg Vesonder, Jon Wright, and Parni Dasu, AT&T Labs - Research
Grace Zhang, Morgan Stanley