DIMACS Workshop on Data Mining in the Internet Age
May 1 - 2, 2000
IBM - Almaden, San Jose, California
- Organizers:
- Rakesh Agrawal, IBM, ragrawal@almaden.ibm.com
- Joan Feigenbaum, AT&T Labs - Research, jf@research.att.com
- Prabhakar Raghavan, Verity Systems, praghava@verity.com
- Jeff Ullman, Stanford University, ullman@db.stanford.edu
Presented under the auspices of the Special Focus on Next Generation Networks Technologies and Applications and the Special Year on Networks.
ABSTRACTS
1.
Google and the Importance of Search
Sergey Brin, Google.com
Google is a large-scale hypertextual web search engine. It uses data mining
technologies to produce far more relevant search results than conventional
search engines. Since its founding, Google has scaled to over 1 billion
documents and 30 million searches per day. This has created a number of
software and hardware challenges.
2.
Integration of Data Mining and Relational Databases
Surajit Chaudhuri, Microsoft Research
Abstract: In this talk, we discuss the key requirements and challenges in
integrating data mining technology into the traditional relational database
environment. We present the highlights of the OLE DB for Data Mining API,
which makes it possible to define data mining models and to perform
operations using those models within the relational framework. OLE DB for
Data Mining is available as part of the Microsoft SQL Server 2000 database
product.
3.
Predictive Data Mining with Multiple Additive Regression Trees
Jerome H. Friedman
Stanford University
ABSTRACT
Predicting future outcomes based on past observational data is a
common application in data mining. The primary goal is usually
predictive accuracy, with secondary goals being speed, ease of use,
and interpretability of the resulting predictive model. New automated
procedures for predictive data mining, based on "boosting" CART
regression trees, are described. The goal is a class of fast
"off-the-shelf" procedures for regression and classification that are
competitive in accuracy with more customized approaches, while being
fairly automatic to use (requiring little tuning), and highly robust,
especially when applied to less-than-clean data. Tools are presented for
interpreting and visualizing these multiple additive regression tree
(MART) models.
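For flavor, here is a minimal sketch of the stagewise boosting idea, greatly
simplified relative to MART (depth-1 "stumps", a single feature, least-squares
loss, and invented hyperparameters); it illustrates the additive structure
rather than Friedman's actual procedure:

    def fit_stump(xs, residuals):
        """Single split on a 1-D feature minimizing squared error."""
        best = None
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        for k in range(1, len(xs)):
            thr = (xs[order[k - 1]] + xs[order[k]]) / 2.0
            left = [residuals[i] for i in order[:k]]
            right = [residuals[i] for i in order[k:]]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, thr, lmean, rmean)
        _, thr, lmean, rmean = best
        return lambda x: lmean if x <= thr else rmean

    def fit_mart(xs, ys, n_trees=100, learning_rate=0.1):
        """Stagewise additive model: each stump fits the current residuals."""
        base = sum(ys) / len(ys)                 # start from the mean
        trees, preds = [], [base] * len(ys)
        for _ in range(n_trees):
            residuals = [y - p for y, p in zip(ys, preds)]
            stump = fit_stump(xs, residuals)
            trees.append(stump)
            preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        return lambda x: base + learning_rate * sum(t(x) for t in trees)

    # toy usage: learn y = x^2 on a small grid
    xs = [i / 10.0 for i in range(-20, 21)]
    model = fit_mart(xs, [x * x for x in xs])
    print(model(1.5))   # approaches 2.25 as stumps accumulate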
4.
Mining Frequent Patterns without Candidate Generation
Jiawei Han
Intelligent Database Systems Research Laboratory
School of Computing Science
Simon Fraser University
British Columbia, Canada
http://www.cs.sfu.ca/~han
Mining frequent patterns in transaction databases, time-series databases,
and many other kinds of databases has been studied extensively in data mining
research. Most of the previous studies adopt an Apriori-like candidate set
generation-and-test approach. However, candidate set generation-and-test
is still costly, especially when there are prolific patterns and/or long
patterns.
Recently, we have proposed a frequent pattern tree (FP-tree) structure,
which is an extended prefix-tree structure for storing compact, crucial
information about frequent patterns, and have developed an efficient FP-tree-based
mining method, FP-growth, for mining the complete set of frequent patterns
by pattern fragment growth. Efficiency of mining is achieved with three
techniques: (1) a large database is compressed into a condensed, smaller
data structure, which avoids costly, repeated database scans, (2) our
FP-tree-based mining adopts a pattern fragment growth method to avoid the
costly generation of a large number of candidate sets, and (3) a
partitioning-based, divide-and-conquer method is used to decompose the
mining task into a set of smaller tasks for mining confined patterns
in conditional databases, which dramatically reduces the search space.
Our performance study shows that the FP-growth method is efficient and
scalable for mining both long and short frequent patterns, and is about
an order of magnitude faster than the Apriori algorithm and also faster
than some recently reported new frequent pattern mining methods. If time
permits, we will also report our recent work on the extensions of this
method to mining frequent closed itemsets and mining sequential patterns.
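As a rough illustration of the divide-and-conquer strategy sketched above,
the following Python code mines all frequent itemsets without candidate
generation; for brevity it represents each conditional pattern base as a
plain list of transactions rather than as a compressed FP-tree, so it
conveys the recursion but not the compression:

    from collections import defaultdict

    def fp_growth(transactions, min_support):
        """Return {frozenset(itemset): support} for all frequent itemsets."""
        counts = defaultdict(int)
        for t in transactions:
            for item in set(t):
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_support}
        results = {}
        # order items by descending frequency, as in FP-tree construction
        order = sorted(frequent, key=lambda i: (-counts[i], i))
        for k, item in enumerate(reversed(order)):   # least frequent first
            results[frozenset([item])] = counts[item]
            # conditional pattern base: per-transaction prefixes of `item`
            cond = []
            for t in transactions:
                if item in t:
                    prefix = [i for i in order[: len(order) - 1 - k] if i in t]
                    if prefix:
                        cond.append(prefix)
            # recurse: anything frequent together with `item` is found here
            for sub, s in fp_growth(cond, min_support).items():
                results[sub | {item}] = s
        return results

    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    for itemset, s in fp_growth(db, 3).items():
        print(set(itemset), s)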
5.
Cubegrades: Generalizing Association Rules
Tomasz Imielinski (Rutgers Univ.)
Abstract
Cubegrades are a generalization of association rules; they represent how
a set of measures (aggregates) is affected by modifying a cube through
specialization (rolldown), generalization (rollup) and mutation (which
is a change in one of the cube's dimensions). Cubegrades are
significantly more expressive than association rules in capturing
trends and patterns in data because they use arbitrary aggregate
measures, not just COUNT, as association rules do. Cubegrades are
atoms which can support sophisticated "what if" analysis tasks
dealing with behavior of arbitrary aggregates over different database
segments. As such, cubegrades can be useful in marketing, sales
analysis, and other typical data mining applications in business.
In this talk we introduce the concept of cubegrades. We define them
and give examples of their usage. We then describe in detail an
important task in computing cubegrades: the generation of significant
cubes, which is analogous to the generation of frequent sets.
We also demonstrate how to evaluate simple cubegrade queries and
conclude with a number of open questions and possible extensions of
the work.
Joint work with Leonid Khachiyan and Amin Abdulghani.
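As a small, invented illustration of a mutation cubegrade (the table,
dimensions, and measure below are not from the talk), the code compares an
aggregate between a cube and a "mutated" cube differing in one dimension
value:

    from statistics import mean

    rows = [
        {"region": "east", "product": "tea",    "sales": 10},
        {"region": "east", "product": "coffee", "sales": 30},
        {"region": "west", "product": "tea",    "sales": 20},
        {"region": "west", "product": "coffee", "sales": 60},
    ]

    def cube_avg(rows, measure, **dims):
        """AVG(measure) over the cube obtained by fixing the given dimensions."""
        cell = [r[measure] for r in rows
                if all(r[d] == v for d, v in dims.items())]
        return mean(cell)

    # mutation cubegrade: [product=tea] -> [product=coffee] under AVG(sales)
    source = cube_avg(rows, "sales", product="tea")      # 15.0
    target = cube_avg(rows, "sales", product="coffee")   # 45.0
    print("AVG(sales) changes by a factor of", target / source)   # 3.0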
6.
Methodologies for Computational Gene Identification in Biological Sequences
Simon Kasif
Cambridge Research Laboratory (Compaq) and MIT Genome Center
The current mode of biological exploration is changing from
single experiments, in which an individual investigator
explores a single conjecture, to hypothesis-free generation of data
using revolutionary genomic technologies and new computational
representations and algorithms for an "in silico" exploration of
the biology of living organisms.
Computational genomic data analysis has already inspired
changes in theories of evolution in low-order organisms (prokaryotes);
identified and mapped thousands of genes;
provided order-of-magnitude improvements in the cost and speed of
experiments; helped progress towards identifying minimal living organisms;
and suggested new methodologies for accurate diagnosis of
subtle forms of cancer by clustering gene expression profiles.
A recurring theme in this research is the need to
build perhaps initially naive models of biological systems
that allow researchers to extract meaningful information from data.
Without exception, all work on computational gene finding makes use of
assumptions about the organization of the genome and postulates basic
gene models that are used as templates to extract putative genes.
This paradigm has generated thousands of gene predictions
that serve as potential targets for treating disease or
other important bio-medical applications.
In this talk we review three techniques for gene identification in
biological sequences, and discuss the application of learning systems
and computer science algorithms to this scientific challenge.
The three main techniques are based on gene identification by matching to
biological databases, comparative cross-genomic analysis,
and predictive data mining systems.
Each approach generates deep algorithmic questions as well as many
algorithm engineering challenges.
Finally, we review the design of Glimmer, a predictive gene-finding
system that identifies genes in microbial DNA with 98% accuracy.
http://www.tigr.org/softlab/glimmer/glimmer.html
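Glimmer itself scores genes with interpolated Markov models; as a greatly
simplified sketch of the underlying idea, the following trains a single
fixed-order Markov model on known coding sequence and scores a candidate.
All sequences here are toy data, and a real system would interpolate across
orders and handle unseen contexts more carefully:

    from collections import defaultdict
    from math import log

    def train_markov(sequences, order=2):
        """Per-context next-base probabilities from training sequences."""
        counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for i in range(order, len(seq)):
                counts[seq[i - order:i]][seq[i]] += 1
        return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                for ctx, nxt in counts.items()}

    def log_score(model, seq, order=2, floor=1e-6):
        """Log-likelihood of seq; unseen transitions get a small floor."""
        return sum(log(model.get(seq[i - order:i], {}).get(seq[i], floor))
                   for i in range(order, len(seq)))

    coding = ["ATGGCGGCGGCGTAA", "ATGGCGGTGGCGTGA"]   # toy "known genes"
    model = train_markov(coding)
    print(log_score(model, "ATGGCGGCGTAA"))   # higher = more gene-like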
7.
Classification with Pairwise Relationships:
Metric Labeling and Markov Random Fields
Jon Kleinberg, Cornell University
In a traditional classification problem, we wish to assign one of k
labels (or classes) to each of n objects, in a way that is consistent
with some observed data that we have about the problem. An active
line of research in this area is concerned with classification when
one has information about pairwise relationships among the objects
to be classified; this issue is one of the principal motivations for
the framework of Markov random fields, and it arises in fields such
as image processing, biometry, and hypertext document analysis.
In its most basic form, this style of analysis seeks to find a
classification that optimizes a combinatorial function consisting
of assignment costs -- based on the individual choice of label we
make for each object -- and separation costs -- based on the pair
of choices we make for two related objects.
We formulate a general classification problem of this type, the
"metric labeling problem"; we show that it contains as special
cases a number of standard classification frameworks, including
several arising from the theory of Markov random fields. From the
perspective of combinatorial optimization, our problem can be viewed
as a substantial generalization of the multiway cut problem, and as
equivalent to a type of uncapacitated quadratic assignment problem.
We develop approximation algorithms with provable performance
guarantees for the metric labeling problem, including an algorithm
that can be applied to an arbitrary metric on a set of k labels,
and an arbitrary set of pairwise relationships over the objects
to be classified. For the special case in which the labels are
endowed with the uniform metric -- all distances are the same --
our methods provide an approximation algorithm to within a factor of two.
This talk will be based on joint work with Eva Tardos.
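For concreteness, the objective described above can be written down directly.
The brute-force minimizer below handles only tiny instances (the talk's
contribution is approximation algorithms for the general problem), and all
costs here are invented:

    from itertools import product

    def labeling_cost(labeling, assign_cost, edges, metric):
        """Assignment costs plus weighted label distances on related pairs."""
        cost = sum(assign_cost[u][labeling[u]] for u in range(len(labeling)))
        cost += sum(w * metric[labeling[u]][labeling[v]] for u, v, w in edges)
        return cost

    def brute_force_label(n, k, assign_cost, edges, metric):
        return min(product(range(k), repeat=n),
                   key=lambda lab: labeling_cost(lab, assign_cost, edges, metric))

    # 3 objects, 2 labels, uniform metric (all distinct labels at distance 1)
    assign_cost = [[0.0, 2.0], [2.0, 0.0], [1.5, 0.1]]
    edges = [(0, 1, 1.0), (1, 2, 1.0)]   # (object u, object v, weight)
    uniform = [[0, 1], [1, 0]]
    print(brute_force_label(3, 2, assign_cost, edges, uniform))   # (0, 1, 1)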
8.
Mining E-commerce Data: Challenges and Stories from the Trenches
Ronny Kohavi
ronnyk@bluemartini.com
The e-commerce domain can provide all the right ingredients for
successful data mining: large amounts of data (many records), rich
data with many attributes (wide records), clean data, and the
ability to take action and measure its effectiveness quickly.
In order to satisfy these requirements, however, merchandising and
marketing sites need to design their collection and analysis
architectures carefully, and must help simplify the knowledge
discovery process.
Architectural decisions made at Blue Martini Software in the design of
its data mining software will be motivated and discussed,
raising interesting challenges for future research.
Stories and lessons from data mining customer data will be presented.
9.
Combining combinatorial and probabilistic methods in data mining
Heikki Mannila
Nokia Research Center and Helsinki University of Technology
Heikki.Mannila@hut.fi
Data mining methods typically use either purely combinatorial
information (e.g., association rules) or purely probabilistic
descriptions (e.g., belief networks). Combining these two
techniques might yield useful new methods. We discuss some
abstract problems arising from such a combined approach. We also
describe one particular combination, which uses a maximum entropy
approach on top of frequent-set information for query selectivity
estimation.
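A small sketch of that particular combination, with invented numbers: given
the supports of a few frequent sets over binary attributes, iterative
proportional fitting yields the maximum entropy joint distribution consistent
with them, from which the selectivity of a conjunctive query can be read off:

    from itertools import product

    attrs = ["A", "B", "C"]
    # frequent set -> observed support (fraction of rows where all items are 1)
    constraints = {("A",): 0.5, ("B",): 0.4, ("A", "B"): 0.3, ("C",): 0.2}

    # start from the uniform (maximum entropy) distribution over 2^3 outcomes
    dist = {bits: 1 / 8 for bits in product([0, 1], repeat=3)}
    idx = {a: i for i, a in enumerate(attrs)}

    for _ in range(200):   # IPF: rescale to match each constraint in turn
        for items, target in constraints.items():
            inside = {b for b in dist if all(b[idx[a]] == 1 for a in items)}
            mass = sum(dist[b] for b in inside)
            for b in dist:
                dist[b] *= target / mass if b in inside else (1 - target) / (1 - mass)

    # estimated selectivity of the query A=1 AND B=1 AND C=1
    # ~0.06 = 0.3 * 0.2, since C comes out independent under maximum entropy
    print(sum(p for b, p in dist.items() if all(b)))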
10.
Tom Mitchell, CMU and Whizbang! Labs
Combining Labelled and Unlabelled Data for Web Mining
Most models of supervised learning consider only labeled training examples,
and ignore the potential role of unlabeled data. This talk considers the
question "when is it possible to use unlabeled data to increase accuracy in
supervised learning?" This question was initially motivated by our research
on algorithms for learning to classify web pages (e.g., as student home
pages, faculty pages, etc.). The question of how to use unlabeled data is
especially interesting for web page classification, given the hundreds of
millions of easily accessible web pages. We present an algorithm and
experimental results demonstrating that unlabeled data can significantly
improve accuracy when learning to automatically classify web pages. We then
identify a precise problem structure that is sufficient to assure unlabeled
data will improve learning accuracy. Interestingly, the essential problem
structure is found in many natural learning problems faced by humans, such
as learning a semantic lexicon over noun phrases in natural language, and
learning to recognize objects from multiple sensor inputs. These results
suggest that our understanding of human and animal learning might also be
improved by considering the potential role of unlabeled data in learning.
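The abstract does not name its algorithm, but the problem structure it
alludes to (two redundant "views" of each example, e.g., a page's own words
versus the anchor text of links pointing to it) is what co-training-style
methods exploit. Below is a self-contained sketch with a toy nearest-centroid
classifier; the data, views, and promotion schedule are all invented:

    def fit_centroids(points, labels):
        """Per-class mean; a stand-in for a real per-view classifier."""
        cents = {}
        for y in set(labels):
            members = [p for p, l in zip(points, labels) if l == y]
            cents[y] = tuple(sum(c) / len(members) for c in zip(*members))
        return cents

    def predict(cents, p):
        """Return (label, margin); margin = how much closer the best class is."""
        d = sorted((sum((a - b) ** 2 for a, b in zip(p, c)), y)
                   for y, c in cents.items())
        return d[0][1], (d[1][0] - d[0][0] if len(d) > 1 else float("inf"))

    def cotrain(labeled, unlabeled, rounds=5, per_round=2):
        """Each view's classifier promotes its most confident unlabeled
        examples to labeled data, which then trains the other view."""
        labeled, unlabeled = list(labeled), list(unlabeled)
        for _ in range(rounds):
            for view in (0, 1):
                cents = fit_centroids([x[view] for x, _ in labeled],
                                      [y for _, y in labeled])
                scored = sorted(unlabeled,
                                key=lambda x: predict(cents, x[view])[1],
                                reverse=True)
                for x in scored[:per_round]:
                    labeled.append((x, predict(cents, x[view])[0]))
                    unlabeled.remove(x)
            if not unlabeled:
                break
        return labeled

    # two 1-D views per example; class 0 lives near 0.0, class 1 near 1.0
    L = [(((0.1,), (0.2,)), 0), (((0.9,), (1.0,)), 1)]
    U = [((0.0,), (0.1,)), ((0.2,), (0.0,)), ((1.1,), (0.8,)), ((0.8,), (1.2,))]
    print(cotrain(L, U))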
11.
Association Rules, Boosting, and Genome Mining
Shinichi Morishita (University of Tokyo)
moris@k.u-tokyo.ac.jp
http://www.gi.k.u-tokyo.ac.jp/~moris/
In this talk, we discuss how to efficiently compute significant
association rules according to common statistical measures
such as a chi-squared value or correlation coefficient.
For this purpose, one might consider using the Apriori
algorithm, but the algorithm needs major modification, because
none of these statistical metrics is anti-monotone,
and raising the support threshold to reduce the search space
cannot guarantee that all solutions remain in that space.
We present a method of estimating a tight upper bound on
the statistical metric associated with any superset of
an itemset, as well as a novel use of the resulting
upper bounds to prune unproductive supersets
while traversing itemset lattices. The method is called
AprioriSMP, and experimental tests demonstrate its efficiency.
AprioriSMP is useful for classification tasks, because it has
a strong connection with AdaBoost, which combines a number of
classifiers to make a highly accurate prediction. Finally,
we introduce several applications of AprioriSMP+AdaBoost
to mining multiple sets of genes that are highly correlated
with a phenotype of interest.
Joint work with Jun Sese.
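The pruning idea can be sketched as follows. For a fixed target class, the
chi-squared value of an itemset is a convex function of (x, y), where x is
its support and y is its support within the class; any superset's counts lie
in the triangle with corners (0, 0), (y, y), and (x - y, 0), so convexity
puts the maximum over all supersets at one of those corners. The code below
follows this description and uses invented counts, not the paper's notation:

    def chi_squared(x, y, n, m):
        """Chi-squared of a 2x2 table: itemset support x (y of them in the
        target class), n rows in total, m rows in the target class."""
        if x == 0 or x == n:
            return 0.0
        return (n * y - m * x) ** 2 * n / (x * (n - x) * m * (n - m))

    def upper_bound(x, y, n, m):
        """Bound on chi_squared over all supersets of an itemset with (x, y)."""
        return max(chi_squared(y, y, n, m), chi_squared(x - y, 0, n, m))

    # 100 rows, 40 in the target class; itemset support 30, 20 of them positive
    print(chi_squared(30, 20, 100, 40))   # current value, ~12.7
    print(upper_bound(30, 20, 100, 40))   # best any superset could reach, 37.5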
12.
Musings on the Extraction of Structure from the Web
Rajeev Motwani, Stanford University
One of the major challenges in dealing with the web is the unstructured or
semistructured nature of the data. There are major benefits in extracting
the structure implicit in the web or extracting a subset of the data that
can be structured easily. I will give a personal view of some attempts in
this direction. The focus will be on the problems themselves and on
proposals for attacking them.
URL:
http://theory.stanford.edu/~rajeev
BIO:
Rajeev Motwani is an associate professor of computer science at Stanford
University, where he also serves as the director of graduate studies. He
obtained his Ph.D. in computer science from the University of California,
Berkeley, in 1988, and his B.Tech. in computer science from the Indian
Institute of Technology, Kanpur, in 1983. His research interests include databases
and data mining, web search and information retrieval, robotics, and
theoretical computer science. He is a co-author of the book Randomized
Algorithms, published by Cambridge University Press in 1995. Motwani has
received the Alfred P. Sloan Research Fellowship, the National Young
Investigator Award from the National Science Foundation, the Bergmann
Memorial Award from the US-Israel Binational Science Foundation, and an
IBM Faculty Award.
13.
Privacy-Preserving Data Mining
Ramakrishnan Srikant (IBM Almaden Research Center)
http://www.almaden.ibm.com/cs/people/srikant/index.html
A fruitful direction for future data mining research will be the
development of techniques that incorporate privacy concerns.
Specifically, we address the following question. Since the primary task
in data mining is the development of models about aggregated data, can
we develop accurate models without access to precise information in
individual data records?
We consider the concrete case of building a decision-tree classifier
from training data in which the values of individual records have been
perturbed. The resulting data records look very different from the
original records and the distribution of data values is also very
different from the original distribution. While it is not possible to
accurately estimate original values in individual data records, we propose
a novel reconstruction procedure to accurately estimate the
distribution of original data values. By using these reconstructed
distributions, we are able to build classifiers whose accuracy is
comparable to the accuracy of classifiers built with the original data.
Joint work with Rakesh Agrawal.
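A discrete-domain sketch of the randomize-then-reconstruct idea (the talk's
procedure targets continuous values; the domain size, perturbation
probability, and data below are illustrative): each value is replaced with
probability p by a uniformly random one, and the original distribution is
recovered from the perturbed counts by an iterative Bayesian update:

    import random

    DOMAIN, p = 4, 0.6
    # T[a][w] = P(observe w | original value a) under the perturbation below
    T = [[p / DOMAIN + (1 - p) * (a == w) for w in range(DOMAIN)]
         for a in range(DOMAIN)]

    def perturb(v):
        return random.randrange(DOMAIN) if random.random() < p else v

    def reconstruct(obs_hist, iters=200):
        """Estimate the original value distribution from perturbed counts."""
        n = sum(obs_hist)
        f = [1.0 / DOMAIN] * DOMAIN              # start from uniform
        for _ in range(iters):
            g = [0.0] * DOMAIN
            for w in range(DOMAIN):
                post = [T[a][w] * f[a] for a in range(DOMAIN)]
                z = sum(post)
                for a in range(DOMAIN):          # Bayes: P(original a | saw w)
                    g[a] += obs_hist[w] * post[a] / z
            f = [x / n for x in g]
        return f

    original = [random.choice([0, 0, 0, 1, 2]) for _ in range(10000)]
    observed = [perturb(v) for v in original]
    hist = [observed.count(w) for w in range(DOMAIN)]
    print([round(x, 2) for x in reconstruct(hist)])   # near [0.6, 0.2, 0.2, 0.0]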
14.
Using Data Mining for XML Data Storage and Compression
Dan Suciu (AT&T Labs -- Research)
Abstract:
In this talk I will describe two applications of data mining. The
first is XML data storage: given a large XML data instance, derive a
relational schema for it, then store the XML data in a relational
database system. The advantage is that one can use high-performance
database engines to process the XML data. The challenge, however, is to
find the "best" relational schema. While this leads to an interesting
optimization problem, the associated decision problem is NP-complete
in the size of the XML data instance. As a practical substitute, we
applied data mining to the XML data instance to derive a "good"
relational schema, even if a suboptimal one.
The second application deals with XML data compression. In XMill we
achieve about twice the compression ratio of gzip, at about the same
speed, when compressing XML data. XMill gets the best compression
when the user provides it with "hints" about the data: which fields
are integers, or dates, or IP addresses, etc. A natural next step
would be to use data mining to derive these hints automatically.
While part of the problem lies in just recognizing a few simple
datatypes, we observed that the best compression improvements are
achieved when "interesting" facts about certain CDATA values are
exploited. This leads naturally to an interesting data mining problem
on strings, that I will describe.
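A toy version of the container idea (not XMill's actual implementation; the
parsing is deliberately naive and the document is invented): route character
data into separate containers keyed by the enclosing tag and compress each
container on its own, which often beats compressing the document as a single
stream because similar values end up adjacent:

    import re
    import zlib

    def split_containers(xml):
        """Leaf-element text grouped by tag, plus the tag 'skeleton'."""
        containers, structure = {}, []
        for tag, text in re.findall(r"<(\w+)>([^<]*)</\1>", xml):
            containers.setdefault(tag, []).append(text)
            structure.append(tag)
        return structure, containers

    doc = "".join(f"<rec><ip>192.168.0.{i}</ip><n>{i * i}</n></rec>"
                  for i in range(1000))

    structure, containers = split_containers(doc)
    grouped = sum(len(zlib.compress("\n".join(v).encode()))
                  for v in containers.values())
    grouped += len(zlib.compress(" ".join(structure).encode()))
    print(grouped, "vs", len(zlib.compress(doc.encode())))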
15.
What have I learned at SurroMed?
Shalom Tsur (SurroMed)
Bioinformatics is a modern term used to denote the computational aspects
of biology. Traditionally, the term is used to denote computations
involving genomic data: DNA sequence data and the concepts derived from
it, such as genes or gene types. With the completion of the sequencing
of the human genome, the field is rapidly moving beyond the genomic domain
into the expression products of genes, such as proteins, and, in general,
all of the derived information that eventually will lead to a complete
knowledge framework explaining human development and disease. The
pharmaceutical industry is among the prominent consumers of this
knowledge, which, it hopes, will be useful for modern drug development.
The presentation will focus on the question: "to what extent can we
apply the database methods that have traditionally been applied in a
range of other domains to this new and rapidly evolving one?" In
particular, it will consider the integration of diverse biological
information sources and data mining. The author will provide an account
of his own experiences in this respect, based on his involvement at
SurroMed, Inc.
16.
Massive Data Sets in Astronomy
Michael S. Vogeley
(Department of Physics, Drexel University, vogeley@drexel.edu)
Sky surveys now underway, from ultraviolet to radio wavelengths, promise
to yield terabyte-class data sets that will revolutionize the mode in
which astronomical research is conducted. These multiwavelength data,
including images, spectra, and measured properties of tens of millions
of galaxies, stars, and quasars, will be seamlessly archived in
object-oriented databases that may be queried by astronomers from
around the world. Development of sophisticated query mechanisms and
data mining tools will be important to allow complex searches in this
vast multi-parameter space. I will focus on results and prospects for
the Sloan Digital Sky Survey, a project to digitally map half of the
northern sky from ultraviolet to near-infrared wavelengths and to
obtain spectra of the one million brightest galaxies.