DIMACS Working Group on Streaming Data Analysis and Mining I

Working Group on Streaming Data Analysis and Mining I

Working Group Meeting: November 6 - 8, 2001

Public Workshop: Monday, November 5, 2001

Location: DIMACS Center, CoRE Building, Rutgers University

Organizers:

Adam Buchsbaum, AT&T Labs, alb@research.att.com Presented under the auspices of the Special Focus on Data Analysis and Mining.

Working Group on Streaming Data Analysis and Mining Home Page.

This material is based upon work supported by the National Science Foundation under Grant No. 0100921

The Problems of Streaming Data Analysis and Mining

The use of the computer in scientific research and as an essential ingredient in commercial systems has led to the proliferation of massive amounts of data. Researchers in a myriad of fields face daunting computational problems in organizing and extracting useful information from these massive data sets. Because of the sheer quantity of the data arising in various applications or because of their urgency, it becomes infeasible to store the data in a central database for future access and, therefore, it becomes necessary to make computations involving the data, and decisions about the data (like what to keep), during an initial scan as the data ``stream'' by. This working group will be concerned with datamining in a ``streaming'' environment.

Examples of applications requiring immediate decision making based on current information are intrusion detection and fault monitoring. Data must be analyzed as it arrives, not off-line after being stored in a central database, because the problems involved are so urgent from a time-to-react point of view. Some other applications require such quick reactions for theoretical (as well as practical) reasons because of the issues involved in processing and integrating the massive amounts of data generated by a myriad of continuously operating sources. For example, external memory algorithms are motivated by the fact that classical algorithms do not scale when data sets do not fit in main memory. At some point, data sets become so large as to preclude most computations that require more than one scan of the data, as they stream by.

Transactional and time-series applications exemplify current streaming data analysis systems. Transactional applications exploit data recording individual events correlating two or more discrete entities. Examples are phone calls between people (data also studied by the multidimensional scaling working group) and purchases over a credit-card network. One common problem is to maintain behavioral profiles of individual entities (customers, for example). Goals include flagging aberrant transactions, i.e., those not indicated by the models and thus potentially being fraudulent; and detecting paradigm shifts in prevailing trends. In these applications as well as others, analysis of data streams also engenders difficult new problems in data visualization. For example, how is time-critical information best displayed? Can automatic response systems be created to deal with common cases?

Time-series applications exploit sequences of unitary observations taken over time. Examples are reports from sensors monitoring network equipment; inventory levels of parts in a warehouse; positions of objects in successive celestial surveys; and records of prices of commodities, stocks, and bonds. Analyzing and mining time-series data presents many new challenges. What are similar time-series data? How can they be clustered, e.g., to isolate seminal events that cause many simultaneous or near-simultaneous disruptions among the observed elements? How can we find interesting trends?

We intend to exploit recent attention to streaming models of data analysis by the theoretical community as well as recent successes in real-time or near-real-time analysis by practitioners. We will bring together an interdisciplinary group of researchers to share their ideas and experiences, in the hopes of initiating new approaches to and motivating others to attack core problems in streaming data analysis.

Existing Systems and Prior Work

We will build on past DIMACS activities and present DIMACS strengths in various areas related to streaming data analysis and mining. We will make use of the DIMACS partnership with AT&T Labs to draw upon the expertise of a variety of researchers at that organization and of the systems that they have developed and we will investigate whether these systems might apply to other applications or more general approaches to streaming data analysis. Analysis of data streams is used at AT&T to detect many types of fraudulent behavior; stolen calling cards and impending bill defaults are two examples. Similarly, the JAM system developed at Columbia detects fraudulent activity in financial systems. See [A7, A8, A18]. Gecko, developed by Hume, Daniels, and MacLellan [A13], is a system that audits the largest billing process in AT&T. Using data feeds from passive taps throughout the system, Gecko tracks individual billing records from the network (originating phone call) through settlement (account collection). The main purpose is to detect abnormal processing of records which, if uncorrected, can lead to significant revenue loss. Many open problems remain in such work. For example, how can many heterogeneous data feeds be integrated in one system? How are relevant queries best expressed, giving the user sufficient power while implicitly restraining him/her from incurring unwanted computational overhead? Can domain-specific programming languages be used to provide uniform interfaces to disparate streaming datamining applications?

Critical to the proper functioning of today's massive computer and communications networks is real-time network monitoring. Network monitoring infrastructure creates a sequence of network events and alarms, the correlation of which can facilitate the location of root problems. This is an example of a time-series application. Analysis of data streams is used to detect network faults when they occur, in order to direct timely corrective action. Papers dealing with this topic are [A6, A14, A20]. Security trace audits of IP logs are used to detect and react to network attacks, e.g., denial of service attacks. Such an event might trigger some automatic monitoring system, which then prompts intense analysis of recent logs. See for example the papers [A5, A17] for a description of and approaches to this problem. Among the open problems in this area are the following. How are the results of such monitoring systems best reported and visualized? To what extent can they incur fast and safe automated responses?

The amount of data being collected through scanners at supermarkets and other retail outlets is awesome and marketing research faces the task of making use of these gigantic data sets. Of great interest in marketing is research on ``market basket'' models and in particular on what items tend to be bought concommitantly. This area of research is moving from off-line settings to more on-line scenarios. For example, performed hourly or even more frequently, such analysis can be used for ``just-in-time'' provisioning of markets. Recent on-line approaches to such market basket models are discussed in the papers by Ullman [A21] and Ganti, Gehrke, and Ramakrishnan [A11].

The DIMACS ``Special Year on Massive Data Sets'' (1997-1999) featured a workshop on Astrophysics and Algorithms motivated by the huge data sets arising from sky surveys in optical and infrared wavelengths, microwave background anisotropy satellite experiments, helioseismology data, gravitational radiation detection experiments, and results from N-body/hydrodynamical simulations. That special year led to the beginnings of collaborations between computer scientists and astronomers, dealing with the ``paradigm shift'' in astronomy toward a situation where many researchers spend their time datamining a ``digital sky'' compiled from a vast array of multi-wavelength sky surveys. Current large scale cosmological simulations generate data of order 100 GB/per simulation. With the advent of higher bandwidth and faster computers, distributed data sets in the petabyte range are being collected. The problem of obtaining information quickly from such databases requires new and improved mathematical methods. Parallel computation and scaling issues are important areas of research. Techniques such as decision trees, vector-space methods, Bayesian and neural nets, and data compression have been utilized. Some relevant references are [A1, A2, A4, A3, A10, A12, A16, A19, A22].

Financial markets provide time-series data, in particular, the prices of commodities, stocks, and bonds, as functions of time. Spotting and exploiting trends in such data is of great interest, e.g., to trading firms willing to risk capital and to companies that want to hedge against fluctuating international currencies. While DIMACS has not been heavily involved in financial modeling, we have organized a recent workshop on the topic and some of our members are actively involved in the field. Some of the general issues involving streaming data analysis for time series data, such as issues of similarity, clustering, and trend-spotting, present serious challenges in modeling of this kind of data and will provide another motivating application for our working group. See [A19, A15].

Some of the Issues

We would like to gather at DIMACS researchers and developers who work on streaming datamining systems, for the purpose of exchanging ideas, problems, methods, and conjectures, and initiating joint research activities. We intend to include theoreticians and practitioners from multiple disciplines (e.g., computer science, statistics, astrophysics). Some of the more specific issues to be addressed by the group have already been mentioned in the previous two subsections. Among the broader questions to be addressed are the following: What is the scope of streaming data analysis? What are the relevant problems in data collection, transmission, processing, and visualization? How do current theoretical models for analyzing massive data streams correlate to methods used in practice? What are the core problems faced by practitioners, and how can theoretical insights be used to attack them? Our primary goals are to build bridges between various communities working on problems in streaming data analysis and mining; to initiate collaborative research; to develop a list of core problems that will motivate future work by the communities; and to produce an infrastructure for sharing results as we go forward.

Working Group home page

DIMACS Homepage

Contacting the Center
Document last modified on April 18, 2001.