Center for Communications Research (CCR), Princeton, NJ

**Organizers:****Bob Grossman**,chair, University of Illinois and Two Cultures Group, grossman@uic.edu**Paul Kantor**, Rutgers University, kantorp@cs.rutgers.edu**Muthu Muthukrishnan**, AT&T Labs - Research and Rutgers University, muthu@cs.rutgers.edu- Workshop Coordinator:

- Dolores Koch, DJK@idaccr.org

Stephen G. Eick, Visual InsightsTitle: Mining Clickstream Data

Gary William Flake, NEC Research InstituteTitle: Web Mining for Hyperlinked Communities

Robert Grossman, University of Illinois at Chicago and Two Cultures GroupTitle: An Introduction to High Performance Data Mining for Homeland Defense

Paul Kantor, Rutgers UniversityTitle: Data Fusion in Message Filtering

David D. Lewis, Independent ConsultantTitle: Classification and Mining of Text Data

Natural language text violates every precept of good data modeling: it is ambiguous, redundant, high-dimensional, and has complex and poorly defined structure and semantics. This tutorial will present methods for organizing and analyzing text data in the face of these problems.

You will learn about techniques for classifying natural language of all sorts and finding patterns within and among textual items. Examples will be presented from a variety of real-world text classification and mining applications. The emphasis will be on statistical and machine learning techniques, with some discussion of the role of linguistic processing.

Andrew Moore, Carnegie Mellon UniversityTitle: Introduction to Data Mining

This tutorial will survey the fundamental ideas from computer science and statistics that combine together to give data mining. We will see examples of some of the most popular data mining algorithms in action and we will briefly review the key points that make them work. We'll visit decision trees, association rules, non-linear regression, Support Vector Machines and Bayesian approaches. We will then pay careful attention to some classic pitfalls of careless data mining, including overfitting, counfounding causation with correlation, and multiple testing. We will finish with some examples of the kind of work currently taking place in academic and industrial data mining research, including a survey of approaches used to scale statistical data mining to very large data sets, and a couple of exotic models being used in current Homeland Defense data mining applications (Surveillance for bioweapon use, and terrorist network analysis).

S. Muthu Muthukrishnan, AT&T Labs - Research and Rutgers UniversityTitle: Streaming Data

Say you go fishing. There are many different types of fish and you catch a bounty. At the end of the day, how can you quickly estimate the number of fish types that are only a few in your catch or those that are the most abundant? Suppose now that your colleague goes fishing and also gets a bountiful catch: how do you quickly check if the fish types you have caught are similar? These problems are interesting when you can not remember all the fish types in your catch, and do not have the time to sort through them for answering the questions above.

Fishing puzzles above are examples of problems that arise when one mines streaming data logs. The Theoretical Computer Science community has recently produced models, algorithms and complexity results to reason about processing streaming data, be it for estimating statistical parameters, clustering or accurate summarization of the streaming signals. In this tutorial, I will present an overview of this emerging theory. Problems become hard when one throws some of the catch back into the pond!

Robert Schapire, AT&TTitle: Machine Learning Algorithms for Classification

Machine learning studies the design of computer algorithms that automatically make predictions about the unknown based on past observations. Often, the goal is to learn to categorize objects into one of a relatively small set of classes. This tutorial will introduce some of the main state-of-the-art machine learning techniques for solving such classification problems, namely, decision trees, boosting and support-vector machines. The tutorial will also discuss some of the key issues in classifier design, including avoidance of overfitting.

Manfred Warmuth, University of California, Santa CruzTitle: On-line Learning

We consider learning from examples. We start with a parameterized model class and a loss function that assigns each example and model a non-negative loss.

The on-line algorithm sees one example at a time and incurs a loss on the current example based on its current model. This model (hypothesis) is updated on-line as more examples are seen by the learner. The best fixed model is chosen off-line. It is the model in the class with the smallest (total) loss on all examples.

The loss of the on-line algorithm on a sequence of examples is typically larger than the loss of the best off-line model. However, the goal of the on-line learner is to minimize the additional loss of the on-line algorithm over the loss of the best off-line model. Thus the off-line model serves as a comparator. Bounds relating the on-line loss to the best off- line loss are called relative loss bounds. Such bounds quantify the price of hiding the future examples from the learner.

We will review several methods for deriving on-line algorithms and proving relative loss bounds for them.

Finally we discuss the case when the off-line comparator is allowed to ``shift'' over time. Now the off-line algorithm is allowed to partition the data stream into a small number section and choose the best hypothesis for each section. In contrast, the on-line algorithm does not know the best partition of the data. The algorithms is faced with a dilemma: Should it learn from the recent examples or should it hold on to its old hypothesis.

We show how to resolve this dilemma in various ways and give algorithms with good bounds on the additional loss of the on-line algorithm over the loss of the best ``shifting'' off-line model.

Various applications of these methods will be discussed.