DIMACS Summer School Tutorial on New Frontiers in Data Mining

August 13 - 17, 2001
Rutgers University, Piscataway, NJ

Organizers:
Dimitrios Gunopulos, University of California at Riverside, dg@cs.ucr.edu
Nikolaos Koudas, AT&T Labs - Research, koudas@research.att.com
Presented under the auspices of the Special Foci on Data Analysis and Mining and Computational Molecular Biology.

Abstracts:


1.

Data Quality Assurance in Network Databases
Chung-Min Chen and Munir Cochinwala, Telcordia Technologies

Operation Support Systems (OSS), which support
a telecommunication carrier's network
operations, usually maintain large databases that model
physical networks and their components.
Data quality assurance aims to
ensure that the data in these databases are correct, current, and
consistent, which is vital to the
efficiency and effectiveness of the operations.
In this talk, we will discuss the issues involved and
approaches to automating and assuring network
database quality.


2.

Mining Very Large Data Streams
Pedro Domingos, Department of Computer Science and Engineering, University of Washington

In many domains, data now arrives faster than we are able to mine it. In some cases (e.g., large networks), merely storing all the data produced would be prohibitively expensive. To avoid wasting this data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this talk I will describe a general method for transforming batch data mining algorithms into data-stream ones and its application to decision tree induction, k-means clustering and the EM algorithm. I will provide analytical guarantees that these algorithms produce in finite time results equivalent to mining infinite data (to within epsilon, with probability one minus delta), and examples of their practical performance. For example, our decision-tree learner is able to incorporate one billion examples per day using off-the-shelf hardware. Its extension to non-stationary data leads to speedups of four orders of magnitude over traditional windowing methods, for similar predictive accuracy. Joint work with Geoff Hulten.
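
As an illustration of an "epsilon, delta" guarantee of the kind mentioned in the stream mining abstract, the sketch below uses the Hoeffding bound to relate the number of examples seen to the error in an estimated mean. The abstract does not name the bound used, so the choice of the Hoeffding bound, and the function names `hoeffding_epsilon` and `samples_needed`, are assumptions made for this example only.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Error bound for the mean of n independent observations.

    With probability at least 1 - delta, the observed mean of a variable
    whose values span `value_range` is within the returned epsilon of the
    true mean (one-sided Hoeffding bound).
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n for which hoeffding_epsilon(...) is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical setting: values in [0, 1], epsilon = 0.05, delta = 1e-6.
    n = samples_needed(1.0, 0.05, 1e-6)
    print(n, hoeffding_epsilon(1.0, 1e-6, n))
```

The point of the sketch is that the required number of examples depends only on epsilon, delta, and the value range, not on the length of the stream, which is what makes a finite-time guarantee over open-ended data possible.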


3.

Trajectory Sampling for Direct Traffic Observation
Matt Grossglauser, AT&T Labs - Research

The Internet is vast and difficult to model. Estimation of the state of even a single operator's domain is hampered by scale and uncertainty. We discuss some implications for traffic measurement, which is a critical component for the control and engineering of IP networks. More specifically, traffic engineering, capacity planning, and troubleshooting can benefit from knowledge of the spatial flow of traffic through an operator's domain, i.e., the paths followed by packets between any ingress and egress point. We argue that existing traffic instrumentation techniques are inadequate, because they require network state estimation to infer the spatial flow of traffic. We propose a method that allows the direct observation of traffic flows through a domain by observing the trajectories of a subset of all packets traversing the network. The main advantages of the method are that (i) it does not rely on routing state, (ii) its implementation cost is small, and (iii) the measurement overhead is modest and can be controlled precisely. Joint work with Nick Duffield, AT&T Labs - Research.
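
To make the idea of observing the trajectories of a subset of packets concrete, here is a minimal sketch of one way to select the same subset at every observation point: each point hashes the parts of a packet that do not change in transit and samples the packet if the hash falls below a threshold. The abstract does not describe the selection mechanism, so the use of SHA-256, the names `is_sampled` and `observe`, and the toy paths are all assumptions for illustration.

```python
import hashlib


def is_sampled(invariant_bytes: bytes, sampling_rate: float) -> bool:
    """Decide from packet content alone whether to sample a packet.

    Every observation point computes the same hash over the same bytes,
    so all points reach the same decision without coordination.
    """
    digest = hashlib.sha256(invariant_bytes).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return value < sampling_rate * 2 ** 64


def observe(path_by_packet: dict, sampling_rate: float) -> dict:
    """Simulate per-node sampling and collect the observed trajectories.

    `path_by_packet` maps packet bytes to the list of nodes the packet
    traverses; in a real network the path is unknown to the observer.
    """
    trajectories = {}
    for packet, path in path_by_packet.items():
        for node in path:
            if is_sampled(packet, sampling_rate):  # evaluated at each node
                trajectories.setdefault(packet, []).append(node)
    return trajectories


if __name__ == "__main__":
    paths = {f"packet-{i}".encode(): ["ingress", "core", "egress"] for i in range(20)}
    for packet, nodes in observe(paths, 0.25).items():
        print(packet, nodes)
```

Because the decision depends only on packet content, a sampled packet is reported by every node on its path, and the measurement overhead is set directly by `sampling_rate`.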


4.

Network Aware Clustering: Technique and Applications
Balachander Krishnamurthy, AT&T Labs-Research
http://www.research.att.com/~bala/papers

Being able to identify the groups of clients that are responsible for a significant portion of a Web site's requests can be helpful to both the Web site and the clients. It is beneficial to move content closer to groups of clients that are responsible for large subsets of requests to an origin server. Groupings of clients, called clusters, that are close together topologically and likely to be under common administrative control were introduced last year, using "network-aware" techniques. Experimental results show that our entirely automated approach is able to identify clusters for 99.9% of the clients in a wide variety of Web server logs. Sampled results show that the identified clusters can be validated in over 90% of the cases. We are also able to detect unusual access patterns made by spiders and (suspected) proxies. In this talk I will discuss clusters and the range of applications they have been used for in the networking research community. This is joint work (primarily) with Jia Wang.
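
One plausible way to group clients that are topologically close is to assign each client IP address to the longest matching network prefix, given a list of prefixes (for example, one extracted from routing tables). The abstract does not spell out the procedure, so the sketch below is an assumption-laden illustration: the function name `cluster_clients`, the prefix list, and the addresses (taken from the reserved documentation ranges) are all hypothetical.

```python
import ipaddress
from collections import defaultdict


def cluster_clients(client_ips, prefixes):
    """Group client IPs by their longest matching network prefix.

    Clients that match no prefix are collected under the key None.
    """
    # Try the most specific (longest) prefixes first.
    networks = sorted(
        (ipaddress.ip_network(p) for p in prefixes),
        key=lambda net: net.prefixlen,
        reverse=True,
    )
    clusters = defaultdict(list)
    for ip in client_ips:
        addr = ipaddress.ip_address(ip)
        match = next((net for net in networks if addr in net), None)
        clusters[str(match) if match is not None else None].append(ip)
    return dict(clusters)


if __name__ == "__main__":
    prefixes = ["198.51.100.0/24", "198.51.100.128/25", "203.0.113.0/24"]
    clients = ["198.51.100.7", "198.51.100.200", "203.0.113.9", "192.0.2.1"]
    for prefix, members in cluster_clients(clients, prefixes).items():
        print(prefix, members)
```

The linear scan over prefixes keeps the sketch short; a server log with many clients and a full routing table would call for a trie or similar longest-prefix-match structure.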


5.

Social Networks from Web Mining to Enterprise Portals
Prabhakar Raghavan, Verity Inc.

Social networks have been recognized for some time as key mechanisms for information sharing and dissemination. We begin by reviewing both classical and recent (web-derived) mining and knowledge discovery algorithms, viewing them in the context of social networks. A recurrent phenomenon in these settings is the presence of power-law distributions. We postulate a stochastic model for these, and present empirical results on text frequency distributions that suggest new methods for mining text associations.


6.

An Overview of Computational Knowledge Discovery and Pattern Analysis Problems in Contemporary Drug Discovery and Design
Rahul Singh, Exelixis

Exploring the relationship between the structure of a molecule and its biochemical properties constitutes the basis of drug discovery. State-of-the-art approaches to this investigation involve a combination of techniques that include physical enumeration and tests based on combinatorial chemistry and high-throughput screening, as well as rational pharmaceutical design based on geometric and chemical characteristics of molecule-molecule interaction. Furthermore, understanding and optimizing factors like the effect of the compound on the body and the effect of the body on the compound are essential in developing a drug. Given the exploratory nature of drug discovery, the data volume, and the multiple data modalities, it is not surprising that the area is rich in algorithmic problems related to knowledge discovery, pattern analysis, and efficient computability. In this talk, I will attempt to provide an overview of the drug discovery process and present salient problems that are related to the aforementioned computational domains.


Document last modified on July 24, 2001.