DIMACS 2001-2006 Special Focus on Data Analysis and Mining: Overview

Theoretical and algorithmic approaches to data analysis have played a central role in the development of modern methods for handling data. Now, however, the massive amounts of data gathered in important modern applications ranging from the Internet to credit card fraud detection to astronomy and medicine have dramatically changed the requirements for algorithms and provide ample motivation for a great deal of new theoretical development. We need methods for data analysis and mining that scale to the huge volumes of data that we are getting and can expect to get in such applications. DIMACS is planning a special focus devoted to data analysis and data mining, with emphases on the development of theoretical and algorithmic approaches to the massive data-mining problems that we face today, on the increasingly abstract formulations and models of data mining questions that are being seen in current research, and on the connections between theoretical approaches and practical applications.

The emphasis of this special focus will be on unifying promising approaches to data analysis and data mining that come from many distinct communities of researchers. The topics of interest include methodologies and algorithms for data mining, including clustering, discriminant analysis, enumerative methods, and multidimensional scaling; the increasingly abstract formulations and models of data mining questions using logical methods, conceptual clustering, learning and discovery that are critical in data mining and in particular for automatic, intelligent decision making; and the special problems that arise from applications to such important areas as fraud and intrusion detection, web mining, medical and scientific databases, marketing, and natural language data.

THEMES/MOTIVATION

The field of data analysis has a long history with roots in traditional statistical analysis and the development of artificial intelligence. Theoretical analysis of databases has allowed the precise formulation of questions about them and in turn, coupled with a strong effort in algorithms, has led to many powerful techniques for collecting, storing, consolidating, processing, correcting, and retrieving data, for learning, and for finding previously undiscovered patterns.

The emergence of new and powerful data collection technologies has led to the creation of massive amounts of data, often distributed, shared, partially unknown, or having specialized structures. Traditional data analysis tools are incapable of handling the sheer size and complexity of these gigantic data sets. There is a great need to develop new methods and algorithms that can handle these data sets. We need to develop theoretical underpinnings for managing and reasoning about data, and we need new tools for finding patterns or displaying useful summaries of the data.

Among the topics we shall emphasize in this special focus are the following:

Massive quantities of disparate data can be stored on a loosely coupled infrastructure such as the World Wide Web, and so we need search and retrieval techniques for distributed, semi-structured or unstructured "databases."
Increasingly, the types of data of interest are very different from the strings of text or numerical data toward which much of the classical work on data structures and algorithms is oriented. Thus, we need new theoretical approaches to represent and work with data items of particular, complex, application-dependent form and to integrate data from multiple, heterogeneous sources.
There is now an increased need for systems that can make intelligent, automatic decisions based on the analysis of data. It is crucial that we understand theoretically the behavior of systems that are permitted to act autonomously, in order to guarantee desired results.
Most algorithms currently used in data mining do not scale well when applied to very large data sets, often because they rely on random access to the data sets, which scales only while the data sets fit entirely in relatively small main memories.
Data mining requires dramatic new methods of pattern recognition, including clustering, classification, association, sequence discovery, and visualization.
Because of the sheer quantity of the data arising in various applications, or because of their urgency, it becomes infeasible to store the data in a central database for future access. It therefore becomes necessary to perform computations involving the data, and to make decisions about the data (such as what to keep), during an initial scan as the data "stream" by.
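
As one illustration of the last item, a single pass can maintain a bounded-size uniform random sample of a stream whose length is not known in advance, deciding for each arriving item whether to keep it. The sketch below uses reservoir sampling, a standard technique that this overview does not itself name; the function name reservoir_sample and its parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return a uniform random sample of up to k items, reading the stream once.

    Hypothetical helper for illustration: memory use is O(k) regardless of
    how many items the stream contains.
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir unconditionally.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 items from a generator that is never held in memory as a whole.
sample = reservoir_sample((x * x for x in range(100_000)), 5, random.Random(0))
```

Each item is inspected once and then either kept or discarded, so the data never need to be stored centrally or revisited, at the cost of answering later questions only approximately from the retained sample.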

Opportunities to Participate: The Special Focus will include:

Workshops: A variety of workshops and mini-workshops are being planned.
Working Groups: Interdisciplinary "working groups" will explore special focus research topics.
Seminar Series: There will be a mix of research talks and practitioner presentations.
Visitor Programs: Applications for research and graduate student visits to the center are invited. Some funds are available for travel and local support.
Postdoctoral Positions: We are hopeful that several postdoctoral positions will be offered in this area.
Publications: We anticipate that a variety of publications, including AMS-DIMACS volumes, technical reports, abstracts and notes on the WWW, and DIMACS modules will result from the special focus.

Document last modified on July 19, 2005.