Working Group on Streaming Data Analysis and Mining Home Page.
This material is based upon work supported by the National Science Foundation under Grant No. 0100921
Examples of applications requiring immediate decision making based on current information are intrusion detection and fault monitoring. Data must be analyzed as it arrives, not off-line after being stored in a central database, because the time available to react is so short. Other applications call for streaming analysis for theoretical as well as practical reasons, which stem from the difficulty of processing and integrating the massive amounts of data generated by a myriad of continuously operating sources. For example, external memory algorithms are motivated by the fact that classical algorithms do not scale when data sets do not fit in main memory. At some point, data sets become so large that they preclude most computations requiring more than one scan of the data as it streams by.
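To make the single-scan constraint concrete, here is a minimal sketch of a one-pass computation: a running mean and variance maintained with Welford's update, so that each value is seen once and never stored. The names `RunningStats` and `summarize` are illustrative only and do not come from any of the cited work.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Mean and variance of a stream, kept in constant memory."""
    n: int = 0          # number of values seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: fold one new value into the summary.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 here when fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume the stream exactly once and return its summary."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because `summarize` accepts any iterable, it works equally well on a generator that yields values as they arrive; the memory used does not grow with the length of the stream.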
Transactional and time-series applications exemplify current streaming data analysis systems. Transactional applications exploit data that record individual events correlating two or more discrete entities. Examples are phone calls between people (data also studied by the multidimensional scaling working group) and purchases over a credit-card network. One common problem is to maintain behavioral profiles of individual entities (customers, for example). Goals include flagging aberrant transactions, i.e., those that the models do not predict and that are thus potentially fraudulent, and detecting paradigm shifts in prevailing trends. In these applications as well as others, analysis of data streams also engenders difficult new problems in data visualization. For example, how is time-critical information best displayed? Can automatic response systems be created to deal with common cases?
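As an illustration of profile maintenance and flagging, the sketch below keeps a per-entity running mean and variance of transaction amounts and flags a transaction that deviates from the entity's history by more than a chosen number of standard deviations. This is only one simple stand-in for a behavioral model; the class name `ProfileMonitor`, the threshold of 3 standard deviations, and the minimum history of 5 transactions are all assumptions made for the example.

```python
import math
from collections import defaultdict


class ProfileMonitor:
    """Per-entity running profile with a simple deviation test."""

    def __init__(self, threshold: float = 3.0, min_history: int = 5):
        self.threshold = threshold      # hypothetical cut-off, in standard deviations
        self.min_history = min_history  # observations required before flagging
        # entity -> [count, mean, sum of squared deviations]
        self._profiles = defaultdict(lambda: [0, 0.0, 0.0])

    def observe(self, entity: str, amount: float) -> bool:
        """Return True if the amount looks aberrant, then update the profile."""
        profile = self._profiles[entity]
        n, mean, m2 = profile

        # Test against the profile built from earlier transactions only.
        flagged = False
        if n >= self.min_history:
            std = math.sqrt(m2 / (n - 1))
            flagged = std > 0 and abs(amount - mean) > self.threshold * std

        # One-pass (Welford) update; no transaction history is stored.
        n += 1
        delta = amount - mean
        mean += delta / n
        m2 += delta * (amount - mean)
        profile[:] = [n, mean, m2]
        return flagged
```

Note that in this sketch a flagged transaction still updates the profile. A real system might instead hold such transactions aside until they are confirmed, which is a design decision the sketch does not address.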
Time-series applications exploit sequences of unitary observations taken over time. Examples are reports from sensors monitoring network equipment; inventory levels of parts in a warehouse; positions of objects in successive celestial surveys; and records of prices of commodities, stocks, and bonds. Analyzing and mining time-series data present many new challenges. What makes two time series similar? How can time series be clustered, e.g., to isolate seminal events that cause many simultaneous or near-simultaneous disruptions among the observed elements? How can we find interesting trends?
We intend to exploit recent attention to streaming models of data analysis by the theoretical community as well as recent successes in real-time or near-real-time analysis by practitioners. We will bring together an interdisciplinary group of researchers to share their ideas and experiences, in the hope of initiating new approaches to core problems in streaming data analysis and motivating others to attack them.
Real-time network monitoring is critical to the proper functioning of today's massive computer and communications networks. Network monitoring infrastructure creates a sequence of network events and alarms, the correlation of which can help locate root problems. This is an example of a time-series application. Analysis of data streams is used to detect network faults when they occur, in order to direct timely corrective action. Papers dealing with this topic are [A6, A14, A20]. Security trace audits of IP logs are used to detect and react to network attacks, e.g., denial of service attacks. Such an event might trigger some automatic monitoring system, which then prompts intense analysis of recent logs. See for example the papers [A5, A17] for a description of and approaches to this problem. Among the open problems in this area are the following. How are the results of such monitoring systems best reported and visualized? To what extent can they trigger fast and safe automated responses?
The amount of data being collected through scanners at supermarkets and other retail outlets is enormous, and marketing research faces the task of making use of these gigantic data sets. Of great interest in marketing is research on "market basket" models and in particular on what items tend to be bought concomitantly. This area of research is moving from off-line settings to more on-line scenarios. For example, performed hourly or even more frequently, such analysis can be used for "just-in-time" provisioning of markets. Recent on-line approaches to such market basket models are discussed in the papers by Ullman [A21] and Ganti, Gehrke, and Ramakrishnan [A11].
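The basic market basket question, which items tend to be bought together, can be sketched as an on-line pair count that is updated as each basket arrives. This is a naive exact counter written for illustration, not the method of [A21] or [A11]; the names `PairCounter`, `add_basket`, and `frequent_pairs` are assumptions, and its memory use can grow quadratically with the number of distinct items.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple


class PairCounter:
    """Counts how often each pair of items appears in the same basket."""

    def __init__(self) -> None:
        self.pair_counts: Counter = Counter()
        self.baskets_seen = 0

    def add_basket(self, items: Iterable[str]) -> None:
        # Deduplicate and sort so each pair has one canonical form.
        unique_items = sorted(set(items))
        self.baskets_seen += 1
        self.pair_counts.update(combinations(unique_items, 2))

    def frequent_pairs(self, min_support: float) -> Dict[Tuple[str, str], int]:
        """Pairs appearing in at least a min_support fraction of baskets."""
        if self.baskets_seen == 0:
            return {}
        return {
            pair: count
            for pair, count in self.pair_counts.items()
            if count / self.baskets_seen >= min_support
        }
```

Since the counts are current after every basket, `frequent_pairs` can be queried at any moment, for instance hourly, without rescanning earlier transactions.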
The DIMACS "Special Year on Massive Data Sets" (1997-1999) featured a workshop on Astrophysics and Algorithms motivated by the huge data sets arising from sky surveys in optical and infrared wavelengths, microwave background anisotropy satellite experiments, helioseismology data, gravitational radiation detection experiments, and results from N-body/hydrodynamical simulations. That special year led to the beginnings of collaborations between computer scientists and astronomers, dealing with the "paradigm shift" in astronomy toward a situation where many researchers spend their time data mining a "digital sky" compiled from a vast array of multi-wavelength sky surveys. Current large-scale cosmological simulations generate data on the order of 100 GB per simulation. With the advent of higher bandwidth and faster computers, distributed data sets in the petabyte range are being collected. The problem of obtaining information quickly from such databases requires new and improved mathematical methods. Parallel computation and scaling issues are important areas of research. Techniques such as decision trees, vector-space methods, Bayesian and neural nets, and data compression have been utilized. Some relevant references are [A1, A2, A3, A4, A10, A12, A16, A19, A22].
Financial markets provide time-series data, in particular the prices of commodities, stocks, and bonds as functions of time. Spotting and exploiting trends in such data is of great interest, e.g., to trading firms willing to risk capital and to companies that want to hedge against fluctuations in international currencies. While DIMACS has not been heavily involved in financial modeling, we recently organized a workshop on the topic and some of our members are actively involved in the field. Some of the general issues in streaming data analysis for time-series data, such as similarity, clustering, and trend-spotting, present serious challenges in modeling this kind of data and will provide another motivating application for our working group. See [A15, A19].