"Massive Data Sets and Graph Algorithms in Telecommunications Systems"


Ken Church
AT&T Laboratories

Data mining poses a number of interesting challenges for the algorithmic community that remain wide open. Imagine that we have a large steady stream of usage records indicating which customers are buying which services and when. This kind of data feed is usually developed for billing purposes, but many companies are discovering a broad range of other applications such as fraud detection, inventory control, provisioning scarce resources, forecasting demand, etc. The problems are interesting from an algorithmic point of view because the graph is very large. The telephone network, for example, has about 100 million customers (nodes), who purchase about 100 million calls (edges) per day. For forecasting purposes, it is often necessary to simplify the graph. One approach is to segment customers into types, e.g., fax vs. voice. As a first approximation, we could assume that fax lines usually talk to fax lines and voice lines usually talk to voice lines. Unfortunately, a simple connected components algorithm won't work because there are some exceptions such as wrong numbers. A clustering algorithm will probably work, but even then, one has to be careful because the better clustering algorithms are too slow for such a large graph. We will discuss a number of problems like the fax vs. voice question and draw analogies to the problem of clustering documents in Information Retrieval.