DIMACS Summer School Tutorial on New Frontiers in Data Mining

August 13 - 17, 2001
Rutgers University, Piscataway, NJ

Organizers:
Dimitrios Gunopulos, University of California at Riverside, dg@cs.ucr.edu
Nikolaos Koudas, AT&T Labs - Research, koudas@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.

This "summer school" tutorial program aims to provide background, vocabulary, and theoretical methodology to non-specialists in data mining and to others who wish to explore the field, and to bring together students, postdocs, and researchers working on algorithms for data mining with those working in various application areas. More specifically, we aim to introduce attendees to the fundamental theoretical and algorithmic issues that arise in data mining and its applications.

Data mining is an exciting new field of computer science research, encompassing several diverse techniques for analyzing large datasets. The goal of data mining is to obtain new, interesting, and actionable pieces of information. Vast amounts of data are accumulated in diverse application domains, including bioinformatics, epidemiology, business, physical sciences, web applications, and networking. Data mining research is stimulated by hard real-life problems in analyzing data in all of these areas. Data mining is fundamentally an interdisciplinary field, borrowing and combining techniques from theory, statistics, databases, and machine learning, and ultimately producing new approaches.

A goal of this tutorial is to bring together students, postdocs, and researchers from the fields of data mining, bioinformatics, networking, and the web, to facilitate collaboration among these fields, and to introduce data mining to those who are not yet working in it or who have not yet approached it from an algorithmic point of view.

In the tutorial we concentrate on new research directions that are currently emerging in the field: data mining applications in bioinformatics, networking, and the web. We will explore new problems that arise in these areas, identify common threads among the various applications, and consider new paradigms, methods, and techniques that are being developed to address these problems. We will emphasize the algorithmic aspects of analyzing large datasets, which can be approached in several general ways, such as approximate algorithms and data summarization techniques. We will also look at new techniques for stream processing and online algorithms, and their applications to specific problems.

Biological research is undergoing a major revolution as new technologies, such as high-throughput DNA sequencing and DNA microarrays, are creating large amounts of data. New techniques for analyzing such data are important for understanding biological processes. Many bioinformatics problems can be formulated as generalized search problems in a large space. We will look at general lattice search techniques with different constraints, as well as new string algorithms. We will also look at applications of classification techniques in this area.

Networking and telecommunications applications produce large amounts of data that can be mined for various properties of interest. Time series data prevail in such domains, and algorithms for time series matching and sequential pattern identification are of great interest. We will concentrate on incremental and one-pass algorithms for networking problems and explore the connection between these problems and similar incremental and one-pass problems arising in the biological sciences.

The web has emerged as a vast datastore containing diverse pieces of information. We will examine recent approaches to mining information on the World Wide Web, including efficient web searching and web site personalization efforts. We will also look at data and resource management issues in the web environment, with emphasis on bioinformatics and telecommunications applications.

Document last modified on June 12, 2001.