Interdisciplinary Seminar Series

Title: Detecting changes and anomalies in noisy text streams

Speaker: Jeremy Wright, Department of Innovative Services, Networking and Services Research Lab, AT&T Labs - Research

Date: Monday, February 15, 2010 12:00 - 1:00 pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ


CoCITe is a change-detection tool for analysing text streams. Frequencies of terms (or tokenized entities) change constantly for a variety of reasons. Some of these changes are inherent variation, such as cyclic effects on daily, weekly and seasonal time-scales. Others include step changes, trends and bursts in which the frequency of a term departs from its previous pattern, and it is primarily these to which we need to be alerted. A second, non-cyclic source of inherent variation is noise. Many text streams are noisy, some intensely so. Here I refer to noise in the frequency of terms, rather than within the terms themselves (such as the kind of word misspellings we see in customer care agent notes). Because change-detection is a statistical process, failure to take account of noise seriously degrades the quality of the detected changes and alerts.

The CoCITe models now take account of noise using mixture distributions (gamma-Poisson and beta-binomial mixtures). These model over-dispersion, in which the variance exceeds the mean, for some text streams by as much as three orders of magnitude. Efficient tests using these distributions have been devised and incorporated into the core procedure for optimizing the number and locations of changes in term-frequency. This version is now analysing changes and generating alerts in on-going streams of IVR call log and network alarm data.
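
As a rough illustration of over-dispersion under a gamma-Poisson mixture, the following Python sketch computes the mean and variance of a term count whose Poisson rate is itself gamma-distributed. It is not CoCITe code: the function names and the parameters mean, shape and k_max are assumptions chosen for this example, and the abstract does not say how CoCITe parameterizes its models. The sketch relies only on the standard result that a gamma-Poisson mixture is a negative binomial distribution with variance mean + mean^2/shape.

```python
import math


def gamma_poisson_pmf(k, mean, shape):
    """P(K = k) for K ~ Poisson(lam), with lam ~ Gamma(shape, scale=mean/shape).

    This mixture is a negative binomial distribution; 'mean' and 'shape'
    are illustrative parameter names, not CoCITe's.
    """
    p = shape / (shape + mean)
    log_pmf = (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
               + shape * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def gamma_poisson_variance(mean, shape):
    """Closed-form variance of the mixture: mean + mean^2 / shape."""
    return mean + mean * mean / shape


def moments(mean, shape, k_max=2000):
    """Mean and variance by direct summation of the pmf over 0..k_max-1.

    k_max must be large enough that the remaining tail mass is negligible.
    """
    probs = [gamma_poisson_pmf(k, mean, shape) for k in range(k_max)]
    m = sum(k * p for k, p in enumerate(probs))
    v = sum((k - m) ** 2 * p for k, p in enumerate(probs))
    return m, v


if __name__ == "__main__":
    m, v = moments(mean=10.0, shape=0.5)
    print("mean:", m, "variance:", v, "variance/mean:", v / m)
```

A pure Poisson model would force the variance to equal the mean. In this parameterization the variance-to-mean ratio is 1 + mean/shape, so a small shape relative to the mean gives ratios in the hundreds or thousands, the regime the abstract describes.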

Slides: Detecting changes and anomalies in noisy text streams