2001-2006 Special Focus on Data Analysis and Mining: Overview
Theoretical and algorithmic approaches to data analysis have played a
central role in the development of modern methods for handling data.
Now, however, the massive amounts of data gathered in important modern
applications ranging from the Internet to credit card fraud detection
to astronomy and medicine have dramatically changed the requirements
for algorithms and provide ample motivation for a great deal of new
theoretical development. We need methods for data analysis and mining
that scale to the huge volumes of data that we are getting and can
expect to get in such applications. DIMACS is planning a
special focus devoted to data analysis and data mining, with emphases
on the development of theoretical and algorithmic approaches to the
massive data-mining problems that we face today, on the increasingly
abstract formulations and models of data mining questions that are
being seen in current research, and on the connections between
theoretical approaches and practical applications.
The emphasis of this special focus will be on unifying promising
approaches to data analysis and data mining that come from many
distinct communities of researchers. The topics of interest include
methodologies and algorithms for data mining, including clustering,
discriminant analysis, enumerative methods, and multidimensional
scaling; the increasingly abstract formulations and models of data mining
questions, which draw on logical methods, conceptual clustering, and
learning and discovery, and which are critical in data mining, in
particular for automatic, intelligent decision making; and the
special problems that arise from applications to such important areas
as fraud and intrusion detection, web mining, medical and scientific
databases, marketing, and natural language data.
THEMES/MOTIVATION
The field of data analysis has a long history with roots in
traditional statistical analysis and the development of artificial
intelligence. Theoretical analysis of databases has allowed the
precise formulation of questions about them and in turn, coupled with
a strong effort in algorithms, has led to many powerful techniques for
collecting, storing, consolidating, processing, correcting, and
retrieving data, for learning, and for finding previously undiscovered
patterns.
The emergence of new and powerful data collection technologies has led
to the creation of massive amounts of data, often distributed,
shared, partially unknown, or having specialized structures.
Traditional data analysis tools are incapable of handling the
sheer size and complexity of these gigantic data sets. There is a
great need to develop new methods and algorithms that can handle these
data sets. We need to develop theoretical underpinnings for managing
and reasoning about data, and we need new tools for finding patterns
or displaying useful summaries of the data.
Among the topics we shall emphasize in this special focus are the following:
- Massive quantities of disparate data can be stored on a loosely
coupled infrastructure such as the World Wide Web, and so we need
search and retrieval techniques for distributed, semi- or unstructured
``databases.''
- Increasingly, the types of data of interest are very
different from the strings of text or numerical data toward which much
of the classical work on data structures and algorithms is oriented. Thus, we
need new theoretical approaches to represent and work with data items
of particular, complex, application-dependent form
and to integrate data from multiple, heterogeneous sources.
- There is now an increased need for systems that can make
intelligent, automatic decisions based on the analysis of data. It is
crucial that we understand theoretically the behavior of systems that
are permitted to act autonomously, in order to guarantee desired
results.
- Most algorithms currently used in data mining do not scale
well when applied to very large data sets, often because they rely on
random access to the data, which is practical only while the data sets
fit entirely in relatively small main memories.
- Data mining requires dramatic new methods of
pattern recognition, including clustering, classification,
association, sequence discovery, and visualization.
- Because of the sheer quantity of the data arising in various
applications or because of their urgency, it becomes infeasible to
store the data in a central database for future access and, therefore,
it becomes necessary to make computations involving the data, and
decisions about the data (like what to keep), during an initial scan
as the data ``stream'' by.
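To make the last theme concrete, here is a minimal sketch of one well-known one-pass technique, reservoir sampling, which keeps a fixed-size uniform random sample while the data stream by and never stores or revisits the full data set. It is offered only as an illustration of deciding "what to keep" during a single scan, not as a method prescribed by the special focus. The function name reservoir_sample and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    Hypothetical helper for illustration. The stream is read exactly once,
    and memory use is bounded by k regardless of the stream's length.
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i (0-based) replaces a stored item with
            # probability k / (i + 1), which keeps the sample uniform.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator stands in for data that cannot be stored or re-read.
    readings = (x * x for x in range(1_000_000))
    print(reservoir_sample(readings, k=5, rng=random.Random(42)))
```

The sketch needs only sequential access, so it also avoids the random-access pattern noted above as an obstacle to scaling.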
Opportunities to Participate: The Special Focus will include:
- Workshops: A variety of workshops and mini-workshops are
being planned.
- Working Groups: Interdisciplinary ``working groups'' will explore special focus research topics.
- Seminar Series: There will be a mix of research talks and
practitioner presentations.
- Visitor Programs: Applications for research and graduate
student visits to the center are invited. Some funds are available
for travel and local support.
- Postdoctoral Positions: We are hopeful that several
postdoctoral positions will be offered in this area.
- Publications: We anticipate that a variety of
publications, including AMS-DIMACS volumes, technical reports,
abstracts and notes on the WWW, and DIMACS modules will result from
the special focus.
Document last modified on July 19, 2005.