Cluster and data stream analysis, March 2006. Tutorial at DIMACS Workshop on Data Mining and Epidemiology.

Clustering is an important tool in machine learning and data mining. It allows features and correlations in the data to be identified, while requiring few parameters and little detailed information about the data. The results can be used to generate hypotheses, aid in visualization, or reduce the data to a few prototypical points. This 'unsupervised learning' technique has many variants and can be viewed from many perspectives. I will give an algorithmic view, describing some of the most popular clustering algorithms, including hierarchical clustering, k-means, expectation maximization (EM) and k-center approximation algorithms, and identifying their pros and cons.
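
As a rough illustration of one algorithm in this family (not taken from the tutorial itself), the sketch below shows the greedy "farthest-first" heuristic for k-center, which is known to give a 2-approximation to the optimal clustering radius. The function name `farthest_first_centers`, the choice of the first point as the initial center, and the use of Euclidean distance are all assumptions made for the example.

```python
import math


def farthest_first_centers(points, k):
    """Greedy k-center sketch: repeatedly add the point farthest from the chosen centers.

    Returns the k chosen centers and the resulting clustering radius
    (the largest distance from any point to its nearest center).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: start from the first point; any starting point works.
    centers = [points[0]]
    # dist[i] holds the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # Pick the point that is currently worst served by the centers.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        # Adding a center can only shrink each point's nearest-center distance.
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]

    return centers, max(dist)
```

Each round costs one pass over the points, so the whole procedure takes O(nk) distance computations. Unlike k-means, it needs no iteration to convergence, but it is sensitive to outliers because it minimizes the maximum distance rather than an average.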

When the input data is too large to hold conveniently in memory, or is being constantly updated, it is necessary to view the data as a massive stream. In recent years the “data stream” model has become a popular way to handle massive data sources. I will outline some of the key properties of data streams, and illustrate them with some of the recent work on clustering data streams.
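
To make the stream setting concrete, here is a minimal sketch of sequential (online) k-means, a classical one-pass heuristic. It is offered only as an example of the constraints involved, namely a single pass over the data and memory proportional to k rather than to the input size, and is not a description of the specific streaming algorithms covered in the tutorial. The name `stream_kmeans` and the seeding rule (the first k items become the initial centers) are assumptions.

```python
def stream_kmeans(stream, k):
    """One-pass sequential k-means sketch.

    Keeps only k running means and k counters, so memory use is O(k)
    regardless of how many items the stream produces.
    """
    centers = []  # running mean of each cluster
    counts = []   # number of items assigned to each cluster so far

    for x in stream:
        # Assumption: seed the centers with the first k items seen.
        if len(centers) < k:
            centers.append([float(v) for v in x])
            counts.append(1)
            continue

        # Assign the item to the nearest center (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], x)),
        )

        # Incrementally update that center so it stays the mean of its items.
        counts[j] += 1
        step = 1.0 / counts[j]
        centers[j] = [a + step * (b - a) for a, b in zip(centers[j], x)]

    return centers, counts
```

The input can be any iterable, including a generator that can only be consumed once, for example `stream_kmeans((row for row in reader), k=5)`. The result depends on arrival order and carries no approximation guarantee, which is one reason streaming algorithms with provable bounds are of interest.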
