Data stream analysis presents many practical and theoretical challenges. Many critical applications, such as intrusion detection and fault monitoring, require decisions within seconds based on current information. Data must be analyzed as it arrives, not off-line after being stored in a central database. Processing and integrating the massive amounts of data generated by a number of continuously operating, heterogeneous sources is not straightforward. At some point, data sets become so large as to preclude most computations that require more than one scan of the data as they stream by. Analysis of data streams also engenders new problems in data visualization. How is time-critical information best displayed? Can automatic response systems be created to deal with common cases?
Speakers at the workshop discussed current work in all aspects of
data stream analysis: theoretical issues, including modeling;
practical issues, including work on existing systems; and bridges and
bottlenecks, both current and potential, between theory and
practice. The goal of the workshop and the ensuing working group was to
foster interdisciplinary collaborations among researchers studying
data streams from many disparate perspectives and application areas.
The use of the computer in scientific research and as an essential ingredient in commercial systems has led to the proliferation of massive amounts of data. Researchers in a myriad of fields face daunting computational problems in organizing and extracting useful information from these massive data sets. Because of the sheer quantity of the data arising in various applications, or because of their urgency, it becomes infeasible to store the data in a central database for future access. It therefore becomes necessary to perform computations involving the data, and to make decisions about the data (such as what to keep), during an initial scan as the data ``stream'' by. This working group has been concerned with data mining in a ``streaming'' environment.
Intrusion detection and fault monitoring are examples of applications that require immediate decision making based on current information. Because the time available to react is so short, data must be analyzed as it arrives, not off-line after being stored in a central database. Other applications require similarly quick processing for theoretical as well as practical reasons, owing to the difficulty of processing and integrating the massive amounts of data generated by a myriad of continuously operating sources. For example, external memory algorithms are motivated by the fact that classical algorithms do not scale when data sets do not fit in main memory. At some point, data sets become so large as to preclude most computations that require more than one scan of the data as they stream by.
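As an illustration of a computation that needs only one scan and a fixed amount of memory, the following Python sketch implements reservoir sampling, a standard technique for keeping a uniform random sample of a stream whose length is not known in advance. It is not taken from the working group's materials; the function name `reservoir_sample` and its parameters are chosen here purely for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read exactly once and only k items are held in memory.
    (Illustrative helper; the name and signature are assumptions.)
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i (0-based) replaces a kept item with
            # probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a stream of 100,000 without storing the stream.
    print(reservoir_sample(range(100_000), 5, random.Random()))
```

The sample can then stand in for the full stream in later analysis, at the cost of sampling error. Which items are worth keeping in a real application depends on the questions being asked.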
Transactional and time-series applications exemplify current streaming data analysis systems. Transactional applications exploit data recording individual events that relate two or more discrete entities. Examples are phone calls between people (data also studied by the multidimensional scaling working group) and purchases over a credit-card network. One common problem is to maintain behavioral profiles of individual entities (customers, for example). Goals include flagging aberrant transactions, i.e., those that the models do not predict and that are thus potentially fraudulent, and detecting paradigm shifts in prevailing trends. In these applications, as in others, analysis of data streams also engenders difficult new problems in data visualization. For example, how is time-critical information best displayed? Can automatic response systems be created to deal with common cases?
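A minimal Python sketch of the profile-maintenance idea follows. It assumes, purely for illustration, that a profile is just the running mean and standard deviation of an entity's transaction amounts (updated in one pass with Welford's method), and that a transaction is aberrant when it lies more than a fixed number of standard deviations from the mean. The class names, the `threshold` of 3 and the `warmup` of 5 are hypothetical choices, and deployed fraud-detection systems use far richer models.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Profile:
    """Running summary of one entity's transaction amounts."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        # Welford's one-pass update of the mean and squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # Sample standard deviation; zero until two values have been seen.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


class ProfileMonitor:
    """Keeps one Profile per entity and flags unusual amounts.

    threshold and warmup are illustrative assumptions, not tuned values.
    """

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold
        self.warmup = warmup
        self.profiles = defaultdict(Profile)

    def observe(self, entity: str, amount: float) -> bool:
        """Return True if the amount is aberrant, then update the profile."""
        p = self.profiles[entity]
        std = p.std()
        flagged = (
            p.count >= self.warmup
            and std > 0
            and abs(amount - p.mean) > self.threshold * std
        )
        p.update(amount)  # flagged amounts still enter the profile
        return flagged


if __name__ == "__main__":
    monitor = ProfileMonitor()
    for amount in [20, 22, 19, 21, 20, 23, 500]:
        print(amount, monitor.observe("customer-1", amount))
```

Each profile occupies constant space, so the memory needed grows with the number of entities rather than with the number of transactions. Because flagged amounts are still folded into the profile, a sustained change in behavior eventually becomes the new norm, which is one crude way to accommodate shifts in prevailing trends.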
Time-series applications exploit sequences of unitary observations taken over time. Examples are reports from sensors monitoring network equipment; inventory levels of parts in a warehouse; positions of objects in successive celestial surveys; and records of prices of commodities, stocks, and bonds. Analyzing and mining time-series data presents many new challenges. When should two time series be considered similar? How can time series be clustered, e.g., to isolate seminal events that cause many simultaneous or near-simultaneous disruptions among the observed elements? How can we find interesting trends?
To exploit recent attention to streaming models of data analysis by the theoretical community, as well as recent successes in real-time or near-real-time analysis by practitioners, we brought together an interdisciplinary group of researchers to share their ideas and experiences, in the hope of initiating new approaches to core problems in streaming data analysis and motivating others to attack them.
This material is based upon work supported by the National Science Foundation under Grant No. 0100921.