DIMACS Tutorial on Statistical and Other Analytic Health Surveillance Methods

Dates of Tutorial: June 17 - 20, 2003
DIMACS Center, CoRE Building, Rutgers University

David Madigan, Rutgers University, madigan@stat.rutgers.edu
Henry Rolka, CDC, hrr2@cdc.gov
Martin Kulldorff, University of Connecticut, martink@neuron.uchc.edu
Presented under the auspices of the Special Focus on Computational and Mathematical Epidemiology.


Michael Baron, University of Texas

Title: Sequential change-point analysis for the early detection of epidemics

In sequential change-point analysis, one would like to detect a change in distribution as soon as possible after it occurs while keeping the rate of false alarms to a minimum. Rich prior information and the random nature of change points often justify a Bayesian approach. These methods are used to predict epidemics at early stages by detecting a pre-epidemic trend.

A hierarchical Bayesian change-point model for influenza epidemics is proposed. Prior probabilities of a change point preceding a pre-epidemic trend depend on (random) factors that affect the spread of influenza. The theory of optimal stopping is then used to obtain Bayes stopping rules for the detection of pre-epidemic trends under loss functions that penalize delays and false alarms. The Bayes solution involves rather complicated computation of the corresponding payoff function. Alternatively, asymptotically pointwise optimal stopping rules can be computed easily and under weaker assumptions.

Both methods are applied to the 1996--2003 influenza mortality data published by CDC.

Michael Baron is an Associate Professor of Statistics at the University of Texas at Dallas. This year, he is also an Academic Visitor at IBM Research Division. His main research interests include sequential analysis, change-point problems, and Bayesian inference. Occasionally supported by funding agencies, he applies the results of his research in semiconductor manufacturing, epidemiology, developmental psychology, and energy finance. M. Baron has published about 20 articles and refereed about twice as many. He serves as an associate editor of Sequential Analysis.

Allan Clark and Andrew Lawson, University of South Carolina

Title: Bayesian Spatial Health Surveillance

The statistical spatial analysis of small area health data is a topic that has developed greatly in the last 15 years. In this development, there has been considerable interest in the analysis of retrospective clustering of disease and relative risk estimation (disease mapping) in space-time. However, two areas that have seen relatively little consideration are 1) the development of a model-based approach to space-time clustering of disease; and 2) the monitoring, or passive surveillance, of small area health data. This presentation is split into two related components. First, we discuss a model-based approach to the retrospective analysis of space-time clustering of disease and its extension to a prospective approach. Second, we develop a simple surveillance model for space-time relative risk estimation via a change-in-variance approach. We demonstrate the retrospective analysis of clustering via an analysis of birth abnormalities in Tayside (Scotland) and discuss possible extensions to a prospective analysis. We then demonstrate the simple surveillance model via a prospective analysis of the estimation of space-time lung cancer incidence rates in Ohio (USA) by the sequential fitting of two different space-time interaction models.

References: Clark, A.B. and Lawson, A.B. (2002) Spatio-temporal clustering of small area health data, Chapter 14 in Lawson, A.B. and Denison, D. (eds) Spatial Cluster Modelling. Chapman & Hall, London.

Lawson, A.B. (2003) Issues in the Spatio-Temporal Analysis of Public Health Surveillance Data, Chapter 11 in Brookmeyer, R. and Stroup, D. (eds) Monitoring the Health of Populations: Statistical Methods for Public Health Surveillance. Oxford University Press.

For the past six years, Dr. Clark has been involved in the development of new methodology for the spatial statistical analysis of small area health data (spatial epidemiology) and in particular spatially-clustered health data. He has co-authored a number of papers in this area. He was a temporary WHO advisor on Disease Mapping in 1997.

Dr. Lawson has considerable and wide-ranging experience in the development of statistical methods for spatial and environmental epidemiology. Since his PhD, his research has focussed on the analysis of focussed clustering (around putative sources of health hazard), on general disease clustering, and on issues in disease mapping. His developments in these areas have led, via European Union funding, to a World Health Organisation (WHO) workshop on 'Disease Mapping and Risk Assessment for Public Health Decision Making' (Rome 1997) and a subsequent edited volume of papers (Lawson et al (1999) Disease Mapping and Risk Assessment for Public Health, Wiley). Dr. Lawson has also written two books (in press) in these areas (Lawson (2001) Statistical Methods in Spatial Epidemiology, Wiley; Lawson and Williams (2001) An Introductory Guide to Disease Mapping, Wiley). He has also been chief editor of two special issues of Statistics in Medicine focussing on Disease Mapping (1995, 2000). He is a member of the editorial boards of the journals Statistics in Medicine and Statistical Modelling.

Gregory Cooper, University of Pittsburgh

Title: Bayesian Biosurveillance Using Causal Networks

This talk describes a method for detecting outbreaks that is based on Bayesian causal modeling of each individual in the population. The models of individuals can be linked by common causes of an outbreak (e.g., airborne anthrax) and by person-to-person spread of disease (e.g., smallpox). The linked models of individuals form a population model. Detection involves performing inference on the population model to derive the posterior probabilities of various types of outbreaks. Achieving computational tractability is a key challenge. The talk describes several approaches to address this challenge.

Gregory Cooper is an Associate Professor of Medicine and of Intelligent Systems at the University of Pittsburgh. He obtained a B.S. in Computer Science from MIT in 1977, a Ph.D. in Medical Information Sciences from Stanford in 1985, and an M.D. from Stanford in 1986. His primary research interests involve the application of decision theory, probability theory, Bayesian statistics, and artificial intelligence to biomedical informatics research tasks, including causal modeling and discovery from data, and computer-aided diagnosis and prediction. Since 2001 he has been developing detection algorithms for biosurveillance in collaboration with colleagues at the University of Pittsburgh and Carnegie Mellon University.

William DuMouchel, AT&T Labs

Title: Postmarketing Drug Adverse Event Surveillance and the Innocent Bystander Effect

The Multi-item Gamma Poisson Shrinker (MGPS) is an empirical Bayesian method for identifying unusually frequent counts in a large sparse frequency table. This presentation focuses on estimating associations among drugs and adverse event codes in databases of postmarketing reports of adverse drug reactions, as practiced by FDA and other safety researchers. Extended methods can be used to signal frequent itemsets with more than two items, such as combinations of two drugs and one AE, or syndromes of multiple AEs. Another extension allows us to focus on detecting differences between itemset frequencies in different subsets of the data, or from one time period to another. Recent research attempts to adjust drug-adverse event associations for the effects of concomitant medications, sometimes called the "innocent bystander problem."

William DuMouchel received the Ph.D. in Statistics from Yale University and has held a number of positions in academia and industry. His most recent academic appointment was as Professor of Biostatistics and Medical Informatics at Columbia University from 1994 to 1996. He currently holds the position of Technology Consultant at the AT&T Shannon Laboratory in Florham Park, New Jersey, conducting research on data mining, Bayesian modeling and other statistical methods. His methodology for detecting and measuring associations in transactional databases, called the Gamma-Poisson Shrinker (GPS), has been applied to adverse drug reaction databases by many researchers at the FDA and elsewhere.

Richard Ferris, Lincoln Technologies

Title: Detecting Multi-Item Associations and Temporal Trends Using the WebVDME/MGPS Application

MGPS* is a high-performance implementation of Dr. William DuMouchel's empirical Bayes approach for examining large databases to identify combinations of values that occur unusually frequently. For each combination occurring in the database, the method computes a single measure of "interestingness" based on a stable Bayes estimate of the ratio of the observed to the expected count. The measure has practical utility even in situations where the observed or expected count is small.
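The observed-to-expected ratio mentioned above can be illustrated with a minimal sketch. It computes only the raw ratio, with the expected count taken under an assumed independence of drug and event; it does not implement the empirical Bayes shrinkage that makes the MGPS estimate stable for small counts. The function name and the toy reports are hypothetical.

```python
from collections import Counter

def observed_expected_ratios(reports):
    """Raw observed/expected ratio for each (drug, event) pair.

    reports: list of (drug, event) tuples, one per report.
    Expected counts assume drug and event are independent.
    This is NOT the shrunken MGPS estimate, only the raw ratio.
    """
    n = len(reports)
    pair_counts = Counter(reports)
    drug_counts = Counter(drug for drug, _ in reports)
    event_counts = Counter(event for _, event in reports)
    ratios = {}
    for (drug, event), observed in pair_counts.items():
        expected = drug_counts[drug] * event_counts[event] / n
        ratios[(drug, event)] = observed / expected
    return ratios

# Hypothetical toy data: 20 reports over two drugs and two events.
reports = (
    [("drugA", "rash")] * 6
    + [("drugA", "nausea")] * 2
    + [("drugB", "rash")] * 2
    + [("drugB", "nausea")] * 10
)
ratios = observed_expected_ratios(reports)
```

With very small expected counts the raw ratio becomes unstable, which is the situation the shrinkage step in the abstract is meant to address.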

Initial applications have focused on facilitating safety signal detection in the post-marketing surveillance of therapeutic products at CDC, FDA, and several pharmaceutical manufacturers using large public and private data sources. Graphical techniques have been developed to assist with the interpretation of results. Examples will be shown that include simple rankings by signal strength, evolution of signals and safety profiles over time, and higher-order associations corresponding to multi-drug interactions and multi-symptom syndromes.

A Monte Carlo facility has been developed in collaboration with CDC to support simulation-based evaluation of the method's operating characteristics (sensitivity and specificity). The facility, which will be briefly demonstrated, works by automating the creation and analysis of synthetic databases containing signals of predetermined strength.

The development, enhancement and application of MGPS have been supported by grants, contracts, and cooperative agreements with the Centers for Disease Control and Prevention (NIP), the Food and Drug Administration (CDER and CBER), the National Institutes of Health (NCRR), and the Defense Advanced Research Projects Agency (EELD).

*MGPS: "Multi-item Gamma Poisson Shrinker"


Richard Ferris has over 20 years of experience in the pharmaceutical industry. His accomplishments include the development of the first IVR-based drug management system in 1989. Richard is a member of the CDISC ODM and lab teams and the author of the CDISC ODM viewer. He works for Lincoln Technologies as a software developer and trainer. Formerly, Richard worked at PHT, Covance and E.R. Squibb and Sons.

Marianne Frisén, Göteborg University

Title: Statistical Issues in Online Surveillance

Examples will be given of the need in medicine for continual observation of time series, with the goal of detecting an important change in the underlying process as soon as possible after it has occurred. Timeliness and the control of false alarms are important aims. An overview of methods and optimality issues will be given. A computer program will be demonstrated. Ways to handle complicated real-life problems will be illustrated.

Marianne Frisén is professor of statistics at the Statistical Research Unit at Göteborg University, Sweden. Mail: Marianne.Frisen@Statistics.GU.SE. The research on quick and safe detection of changes is described at: http://www.Statistics.GU.se/forskbquick.html

Dunrie Greiling, TerraSeer

Title: Surveillance and Pattern Recognition Using TerraSeer Software

Clusters of events can be caused by stimuli, or they can arise by chance. Monitoring spatial and temporal patterns of health events has always been important for epidemiology, though it is receiving renewed emphasis as part of bioterror surveillance. Newly developed software tools provide new means to detect and analyze spatial and temporal patterns. TerraSeer's ClusterSeer software can be used to evaluate disease clusters and/or non-disease events. You can determine whether a cluster is significant, where it is located, and when it arose, providing insight into the origin, causes, and correlates of the event. Dr. Greiling will demonstrate a surveillance analysis using ClusterSeer.

Dr. Dunrie Greiling is TerraSeer's Director of Corporate Communication. In this role, she coordinates the online help, documentation, and website communications for the company. She also leads software trainings and oversees software support. Recently, she co-authored two papers on spatial clustering in cancer cases on Long Island and its relationship with spatial patterns in air toxics. She also works in software development at TerraSeer's innovation partner, the R&D company BioMedware. At BioMedware, she is currently PI on a grant to perform spatial analysis for assessing the colocalization of proteins in confocal fluorescence microscopy images.

Rick Heffernan, MPH

Title: Syndromic Surveillance in New York City

Rick Heffernan received a Master of Public Health at Columbia University and is completing a doctoral dissertation in epidemiology at Yale University. He currently heads New York City's syndromic surveillance analysis unit, which carries out daily monitoring of emergency room visits, ambulance transports and pharmacy sales for early detection of disease outbreaks. His interests include the design of infectious disease surveillance systems, analysis of surveillance data, and automated methods for aberration and cluster detection.

Lynette Hirschman, MITRE

Title: Text Mining for Surveillance II: Extracting Epidemiological Information from Free Text

This session will examine text-mining tools for health surveillance. The free text of interest may arise as part of a medical record, a drug hot-line encounter, or an emergency room visit. Text mining and information extraction systems make it possible to access and transform information contained in free text into categories useful as input to a surveillance system. These techniques must be tailored to the source and the intended use. For example, hospital records store information about individual patient encounters in a highly structured format, while newswire can be scanned for global information on disease outbreaks. We will review some of the systems for processing these different kinds of text, the available resources (such as standardized terminology lists) and measures of performance. We will then illustrate these features using MiTAP, MITRE's system for monitoring infectious disease outbreaks. MiTAP focuses on providing 24x7 global information access for use by medical experts and individuals involved in humanitarian assistance and relief work. Multiple information sources are automatically captured, filtered, translated, summarized, and collected in a news server. Information extraction modules automatically extract and annotate the text, allowing it to be sorted into topic-specific news groups by disease, region, or information source. This information is then stored in a searchable archive. MiTAP processes 2000 to 10,000 messages daily, delivering up-to-date information to hundreds of regular users. The MiTAP system is available to registered users at http://mitap.sdsu.edu.

Lynette Hirschman is Chief Scientist for the Information Technology Center at the MITRE Corporation in Bedford, MA. She received a B.A. in Chemistry from Oberlin College in 1966, an M.A. in German Literature from the University of California, Santa Barbara, in 1968, and a Ph.D. in formal linguistics from the University of Pennsylvania in 1972, under Aravind Joshi.

As Chief Scientist for the Information Technology Center at MITRE, Dr. Hirschman is responsible for technical oversight of the Center research portfolio in human language technology. She is now also leading MITRE's activities in Biotechnology, including research in computational biology and bioinformatics. She is Principal Investigator on an internally funded effort for text mining applied to biological literature and she has been one of the organizers of the Text Mining SIG for the International Society for Computational Biology (ISCB). She is now working with an international group of collaborators to organize a Challenge Evaluation for text mining for biology, applied to automating curation for various biological databases (SWISS-PROT, BIND, FlyBase).

Dr. Hirschman has served as Principal Investigator for several research efforts funded by the Defense Advanced Research Projects Agency (DARPA). These include the Translingual Information Detection, Extraction and Summarization (TIDES) program where MITRE developed a rapid prototype system MiTAP (MITRE Text and Audio Processing) for the capture, processing and presentation of multilingual news related to disease outbreaks and humanitarian disaster relief efforts. She was also the Principal Investigator for DARPA Communicator, where MITRE served as chief engineer, working with MIT, IBM, AT&T, and CMU to develop a shared architecture for spoken dialogue systems. She is currently Principal Investigator on an internal project on Reading Comprehension, pursuing ground-breaking research on "learning to read, reading to learn, teaching to learn" - getting computers to "learn" from educational materials designed for people, in particular, reading comprehension tests.

Before joining MITRE in 1993, Dr. Hirschman held research and management positions at NYU (medical language processing and informatics), Unisys (logic grammars and information extraction), and MIT (spoken language systems). She has also taught graduate courses at the University of Pennsylvania, New York University, and Boston University. She is the author of over 100 publications in the areas of human language systems, dialogue systems, logic grammars, and recently, bioinformatics and text mining for biology.

Lori Hutwagner, G. Matthew Seeman, Tracee Treadwell, CDC

Title: Early Aberration Reporting System - EARS Empowering Local Health Departments

Public health surveillance data are highly varied, so the analyses of these different data formats must also vary. This presentation will demonstrate the application of different analytical aberration detection methods to various source-type data. These data are from both traditional and non-traditional public health data sources.

Two aberration detection methods used for traditional data will be described: historical limits and a non-traditional Cumulative Sum (CUSUM). Examples of these two aberration detection methods will be shown as they are applied to the Nationally Notifiable Diseases Surveillance System (NNDSS) and Hazardous Substances Emergency Events Surveillance (HSEES) data. Different methods, such as C1, C2, C3, are appropriate for non-traditional public health surveillance data such as emergency department syndromes, emergency call (911), and hazardous substances data. These methods are currently being applied to data collected at the city, county and state level.
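For readers unfamiliar with CUSUM, the following is a minimal sketch of a generic one-sided CUSUM on standardized counts. It is not the EARS implementation and does not reproduce the C1, C2, or C3 definitions; the baseline mean and standard deviation are assumed known, and the reference value k and decision threshold h are illustrative choices.

```python
def cusum_alarms(counts, mean, sd, k=0.5, h=4.0):
    """Generic one-sided (upper) CUSUM on standardized counts.

    mean, sd: assumed in-control baseline (hypothetical values).
    k: reference value, in standard-deviation units.
    h: decision threshold; an alarm is raised when the sum exceeds it.
    Returns the indices at which an alarm is raised.
    """
    s = 0.0
    alarms = []
    for t, x in enumerate(counts):
        z = (x - mean) / sd
        # Accumulate only excesses above the reference value.
        s = max(0.0, s + z - k)
        if s > h:
            alarms.append(t)
            s = 0.0  # restart the sum after signalling
    return alarms

# Hypothetical daily counts with a run of elevated values near the end.
daily_counts = [10, 9, 11, 10, 12, 10, 9, 16, 17, 18, 10]
alarms = cusum_alarms(daily_counts, mean=10.0, sd=2.0)
```

Here the alarm is raised on the second elevated day (index 8), after the accumulated excess passes the threshold. Lowering h gives earlier alarms at the cost of more false ones.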

The examples of the application of the aberration detection methods will demonstrate that aberration detection methods can be useful if they are properly understood. The use of simulated data sets demonstrates the strengths and weaknesses of the different methods. However, valuable information is obtained from problems that arise from application to real-time data.

Lori Hutwagner received her master's degree from the Georgia Institute of Technology in 1989. She joined the CDC in 1990 with the National Center for Infectious Diseases, where she worked on aberration detection methods for Salmonella isolates. She has recently completed work with the Epidemiology Program Office, where she applied aberration detection methods to the Nationally Notifiable Diseases Surveillance System. In 1999 she began working with the Bioterrorism Preparedness and Response Program on developing aberration detection methods for their national "drop in surveillance" system and has started implementing these methods in various local sites throughout the US.

Martin Kulldorff, University of Connecticut

Title: Scan Statistics for Disease Surveillance

Scan statistics can be used for a variety of disease surveillance problems. In this talk we give a general overview as well as detailed examples. Topics covered include one-dimensional scan statistics for temporal surveillance, spatial scan statistics for geographical surveillance, space-time scan statistics for the early detection of localized disease outbreaks, and tree-based scan statistics for database surveillance.
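As a rough illustration of the simplest case listed above, the sketch below computes a one-dimensional temporal scan statistic: the largest case count in any window of a fixed width, with a Monte Carlo p-value. The fixed window width, the uniform-in-time null model, and all names are assumptions made for illustration; the methods covered in the talk also handle variable window sizes and spatial dimensions, which are not shown.

```python
import random

def max_window_count(counts, width):
    """Largest total count over all windows of `width` consecutive days."""
    return max(sum(counts[i:i + width])
               for i in range(len(counts) - width + 1))

def temporal_scan(counts, width, n_sim=999, seed=0):
    """Fixed-width temporal scan statistic with a Monte Carlo p-value.

    Null model (an assumption): each case falls on a uniformly
    random day, independently of the others.
    """
    rng = random.Random(seed)
    observed = max_window_count(counts, width)
    total, n_days = sum(counts), len(counts)
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * n_days
        for _case in range(total):
            sim[rng.randrange(n_days)] += 1
        if max_window_count(sim, width) >= observed:
            exceed += 1
    # Rank of the observed statistic among simulated ones.
    return observed, (exceed + 1) / (n_sim + 1)

# Hypothetical 30-day series with a three-day burst on days 10 to 12.
counts = [1] * 30
counts[10:13] = [10, 10, 10]
observed, p_value = temporal_scan(counts, width=3)
```

Because the maximum is taken over all windows in both the observed and the simulated data, the p-value accounts for having scanned many candidate time periods.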

Martin Kulldorff, University of Connecticut

Title: A Space-Time Permutation Scan Statistic for Spatial Disease Surveillance

We present and illustrate a space-time permutation scan statistic for prospective disease surveillance, aimed at the early detection of localized disease outbreaks. The method is designed for situations in which no population-at-risk denominator data are available, using only the temporal and spatial information of disease cases. Adjusting for the multiple testing inherent in the many potential times, locations and geographical sizes of an outbreak, it is able to both detect clusters and evaluate their statistical significance. The method is illustrated using syndromic surveillance data from New York City. This is joint work with F Mostashari, R Heffernan and J Hartman at the New York City Department of Health.

David D. Lewis, Ornarose, Inc. & David D. Lewis Consulting

Title: Text Mining for Surveillance I

Text mining, or data mining on textual or partially textual data, has been applied to customer relationship management, marketing, business process reengineering, intelligence and law enforcement, and biomedical research among other areas. I will review the major approaches to text mining, and its connections with natural language processing, information retrieval, and statistical analysis. I will then discuss in more depth technologies from information retrieval that are relevant to text mining, in particular text classification. Several forms of text data might be mined for signals of a bioterrorism attack or emerging infectious disease, and I will speculate a bit on the challenges these forms of data pose for text mining.

David D. Lewis, Ph.D. (www.daviddlewis.com) is a technology consultant based in Chicago, IL, as well as CEO of Ornarose, Inc., a data & text mining software startup. He has previously held research positions at AT&T Labs, Bell Labs, and the University of Chicago. Lewis has published more than 40 papers and 6 patents, has created several widely used test collections, and has served on committees that designed several US government evaluations of language processing technology.

Alan R. Shapiro, NYU School of Medicine

Title: Text Normalization for Health Surveillance

The ability to obtain timely and rich information makes the direct use of free text emergency department chief complaints or electronic medical records very attractive for health surveillance programs. Using free text is problematic, however, because the same concept can be represented differently not only by synonyms and morphological forms of the same word but also by unexpectedly large numbers of misspellings, typographical errors, truncations, and inadvertent concatenations with other words. An examination of the NYC Department of Health and Mental Hygiene Emergency Department Chief Complaint database and of the Emergency Medical Associates of New Jersey Electronic Medical Record database, respectively containing 2.5 and 3.5 million chief complaints, shows the remarkably widespread extent of orthographic corruptions that can potentially invalidate analytic procedures based on textual databases. For example, tokens denoting "palpitations" or "vomiting" each occur in over 300 different ways so that up to 15% of occurrences could be missed.

An approach to normalizing large text databases based on generalized string alignment algorithms has been developed. As implemented, the TextScrub program detects and corrects for variations due to concatenations, morphological stemming, cognitive spelling errors, and typographical mistakes including double-keying and transpositions. A component which recognizes and tabulates all the letter substitution errors made in a text database allows rapid construction of rules for recognizing corruptions due to phonetically-based misspellings or typographical errors. When applied to the NYC DOHMH chief complaint database, TextScrub correctly detected 1033 (3%) more cases of DIARRHEA while simultaneously incurring 199 fewer false positives than did the current NYC DOHMH text-processing algorithms for syndromic surveillance. Similar results were found with the EMA database. Although the capability for human oversight is provided, probability bounds can be derived which offer statistical tolerance limits on errors that would be incurred running the system under fully automatic operations.
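To give a feel for alignment-based normalization, here is a minimal sketch that maps a token to the nearest term in a small vocabulary using plain Levenshtein edit distance. This is not TextScrub: it does not handle concatenations, stemming, or phonetic rules, and it counts a transposition as two edits. The vocabulary and the distance cutoff are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalize(token, vocabulary, max_dist=2):
    """Map a token to its nearest vocabulary term, if close enough.

    Tokens farther than max_dist edits from every term are
    returned unchanged (upper-cased).
    """
    token = token.upper()
    best = min(vocabulary, key=lambda term: edit_distance(token, term))
    return best if edit_distance(token, best) <= max_dist else token

# Hypothetical canonical terms for chief-complaint tokens.
vocabulary = ["DIARRHEA", "VOMITING", "PALPITATIONS"]
```

For example, `normalize("diarhea", vocabulary)` yields `"DIARRHEA"`, while an unrelated token such as `"fever"` is left as `"FEVER"`. A fixed cutoff is crude; short words in particular can be mapped to the wrong term.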

Dr. Alan R. Shapiro received his MD from New York University School of Medicine and did subsequent clinical training in internal medicine at Stanford Medical Center and in anesthesiology at the Harvard Medical School-Beth Israel Hospital and Albert Einstein Medical Center. With training in statistics, epidemiology, and computer science at the University of North Carolina, Chapel Hill and Stanford, he has held joint professorships in medicine and statistics at the University of California, San Diego and the Medical University of South Carolina. Dr. Shapiro serves as a consultant in informatics and text mining to pharmaceutical and health care organizations and was a PricewaterhouseCoopers Global Technology Center Visiting Scholar and subject matter expert in the analysis of unstructured data. He is currently a Clinical Associate Professor of Medicine at New York University.

Galit Shmueli, University of Maryland

Title: Statistical Issues and Challenges Associated with Rapid Detection of Bio-Terrorist Attacks

Traditionally, the data collected and used for detecting outbreaks of an epidemic or a bio-terrorist attack have been medical and public health data. Although such data are the most direct indicators of symptoms, they tend to be collected, delivered, and analyzed days, weeks, and even months after the outbreak. By the time this information reaches decision makers it is often too late to treat the infected population or to react in some other way. In this talk we explore different sources of data, traditional and non-traditional, that can be used for detecting a bio-terrorist attack in a timely manner.

We start by focusing on exploring the potential of monitoring grocery sales and the usefulness of utilizing grocery purchase data alone for rapid detection of massive bio-terrorist attacks. The idea is to detect early signs of the epidemic before people arrive at medical facilities, assuming that self-treatment, which occurs earlier than medical treatment, manifests itself in grocery sales. We then discuss the practical and statistical issues that arise when combining data from several sources and integrating this system with the other medical and public health bio-surveillance systems.

We discuss the advantages and disadvantages of the different data sources and address the challenge of combining data from these various sources.

Galit Shmueli is a statistician, currently on the faculty of the Smith School of Business at the University of Maryland, College Park. She has been involved in a collaboration between the University of Pittsburgh and Carnegie Mellon University (where she spent 2 years) in an effort to create a framework for early detection of bio-terrorist attacks. The group included researchers from the areas of epidemiology, public health, computer science, and statistics.

David S. Stoffer, University of Pittsburgh

Title: Spatio-Temporal Modeling for Biosurveillance

In this talk I will present a general method for the modeling of time series data collected at various locations. The general model is based on the state-space model, but can be spatially constrained to include a priori specified constraints. Also, I will discuss how to make the model robust.

The techniques will be demonstrated on the Pennsylvania portion of CDC's National Influenza Surveillance Effort data set. The CDC receives weekly mortality reports from 122 cities and metropolitan areas in the United States within 2-3 weeks from the date of death. These reports summarize the total number of deaths occurring in these cities/areas each week, as well as the number due to pneumonia and influenza.

D.S. Stoffer is Professor of Statistics and Biostatistics at the University of Pittsburgh. He has made seminal contributions to the analysis of categorical time series and won the 1989 American Statistical Association Award for Outstanding Statistical Application in a joint paper analyzing categorical time series arising in infant sleep-state cycling. He is currently a Departmental Editor for the Journal of Forecasting, an Associate Editor of the Annals of the Institute of Statistical Mathematics and coauthor (with R.H. Shumway) of the Springer text, "Time Series Analysis and Its Applications."

Steven Thompson, Pennsylvania State University

Title: Sampling

Sampling methods for health surveillance involve both spatial and network settings. A characteristic of spatial events of interest is their highly uneven distribution. For unpredictably clustered or rare spatial events, adaptive sampling strategies can be useful. An adaptive sampling design is one in which the procedure for selecting the sample can depend on values of the variable of interest observed during the survey. For example, whenever an unusually high number of events is observed in any sample unit, neighboring units may be added to the sample and observed. Network aspects of health surveillance arise in situations such as finding people who have been exposed to an environmental hazard or toxic substance as well as finding hidden populations at risk for the spread of contagious diseases. Often the only practical way to obtain a sample of people in such situations is to follow social links from people in the sample to add more people to the sample. A variety of adaptive and link-tracing sampling and inference methods are now available for such investigations.
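The spatial example in this abstract, adding neighboring units whenever a sampled unit shows a high count, can be sketched on a grid as follows. The grid, the threshold, and the four-neighbor rule are illustrative assumptions, and the sketch covers only sample selection, not the estimators needed for valid inference from an adaptively selected sample.

```python
from collections import deque

def adaptive_cluster_sample(grid, initial_units, threshold):
    """Expand an initial sample by adding neighbors of high-count units.

    grid: 2D list of event counts per unit (hypothetical data).
    initial_units: list of (row, col) units in the initial sample.
    threshold: a unit with count >= threshold triggers the addition
               of its four adjacent units, which are then observed
               and may trigger further additions.
    Returns the set of all units observed.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    sampled = set(initial_units)
    queue = deque(initial_units)
    while queue:
        r, c = queue.popleft()
        if grid[r][c] < threshold:
            continue  # unremarkable unit: do not expand around it
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < n_rows and 0 <= nc < n_cols
            if inside and (nr, nc) not in sampled:
                sampled.add((nr, nc))
                queue.append((nr, nc))
    return sampled

# Hypothetical region with one small cluster of events.
grid = [
    [0, 0, 0, 0, 0],
    [0, 5, 6, 0, 0],
    [0, 0, 7, 0, 0],
    [0, 0, 0, 0, 0],
]
sample = adaptive_cluster_sample(grid, [(1, 1), (3, 4)], threshold=3)
```

Starting from two units, the sample grows to cover the whole cluster and its immediate border, while the empty unit at (3, 4) triggers no expansion.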

Daniel Wartenberg, UMDNJ-RW Johnson Medical School

Title: Understanding Disease Clusters

Disease clusters are complex statistical, epidemiological and public health phenomena. While tragedies for those involved, clusters present several challenges for those responding to them. This talk will briefly explain the perspective of each of the parties typically involved in cluster investigations (the public, the government (e.g., health department), the media and the scientist), and provide guidance on how cluster investigations are typically conducted, a context for interpreting results of cluster investigations, and suggestions for developing more effective, long-term strategies for addressing cluster concerns.

Daniel Wartenberg is a Professor in the Department of Environmental and Community Medicine at the Robert Wood Johnson Medical School, University of Medicine and Dentistry of New Jersey (UMDNJ) and the Division of Epidemiology in the UMDNJ School of Public Health. He serves as Program Leader of the Cancer Control Program at the Cancer Institute of New Jersey, and is a member of the Environmental and Occupational Health Sciences Institute. Dr. Wartenberg's main research interest is the development and application of novel approaches to the study of environmental risk, pollution, and public health, with particular emphasis on geographic variation, disease clustering and the application of Geographic Information Systems (GIS). His research includes the study of the health of flight attendants, nuclear workers, and Persian Gulf War veterans, investigation of health effects of exposure to incinerator emissions, pesticides, power lines, solvents and toxic chemicals, as well as methodologic developments in quantitative risk assessment. He also often works with communities on understanding and addressing local health concerns and apparent disease excesses.

Weng-Keen Wong, Carnegie Mellon University

Title: What's Strange About Recent Events (WSARE)

WSARE, which stands for "What's Strange About Recent Events", is an anomaly pattern detection system. Given two data sets, one for recent events and the other for a baseline period, WSARE looks for groups whose proportions have changed significantly relative to the baseline. These groups are characterized by rules, such as "Gender = Male AND Home Location = NW", which identifies a subset of the population containing males living in the northwest region of the city. By using this rule-based approach on Emergency Department data, WSARE is able to identify anomalous patterns in time, space and demographics such as "recently there has been a significant upswing in the number of elderly patients from the southeastern region of the city with respiratory problems". WSARE reports the most anomalous rule found for the recent period, along with the p-value of this rule, which is obtained by a randomization test to guard against multiple-hypothesis testing errors.
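The idea of scoring rules and then using a randomization test on the best one can be sketched as follows. This is a simplified illustration, not the WSARE algorithm: it considers only single-attribute rules, uses the absolute difference in proportions as a stand-in score, and assumes every record carries the same attributes. All names and data are hypothetical.

```python
import random

def best_rule(recent, baseline):
    """Return the single-attribute rule whose proportion differs most.

    Each record is a dict of attribute -> value. The score is the
    absolute difference in matching proportions (a stand-in score).
    """
    rules = {(k, v) for rec in recent + baseline for k, v in rec.items()}

    def score(rule):
        k, v = rule
        p_recent = sum(r[k] == v for r in recent) / len(recent)
        p_base = sum(r[k] == v for r in baseline) / len(baseline)
        return abs(p_recent - p_base)

    top = max(sorted(rules), key=score)  # sorted for stable tie-breaking
    return top, score(top)

def randomization_p_value(recent, baseline, n_sim=199, seed=0):
    """P-value for the best rule's score under shuffled period labels.

    Re-running the whole rule search on each shuffle accounts for
    having searched over many rules.
    """
    rng = random.Random(seed)
    _, observed = best_rule(recent, baseline)
    pooled = recent + baseline
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        fake_recent = pooled[:len(recent)]
        fake_base = pooled[len(recent):]
        if best_rule(fake_recent, fake_base)[1] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)

# Hypothetical data: a balanced baseline and a recent excess in "NW".
baseline = [{"gender": g, "region": r}
            for g in ("M", "F") for r in ("NW", "NE", "SW", "SE")] * 5
recent = ([{"gender": "M", "region": "NW"}] * 14
          + [{"gender": "F", "region": "SE"}] * 6)
rule, rule_score = best_rule(recent, baseline)
p_value = randomization_p_value(recent, baseline)
```

In this toy data the top rule is `("region", "NW")`, whose share rises from 25% in the baseline to 70% in the recent period.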

Another feature of WSARE is the ability to determine the baseline distribution by taking into account the presence of different trends in health care data, such as trends caused by the day of week and by seasonal variations in temperature and weather. Creating the baseline distribution without taking these trends into account can lead to unacceptably high false positive counts and slow detection times. WSARE uses a Bayesian network which produces the baseline distribution by taking the joint distribution of the data and conditioning on attributes that are responsible for the trends.

Weng-Keen Wong is a Ph.D. candidate in Computer Science at Carnegie Mellon University. He obtained his B.S. in Computer Science from the University of British Columbia in 1997 and an M.S. in Computer Science from Carnegie Mellon University in 2001. His Ph.D. thesis work is on data mining algorithms for early detection of disease outbreaks. His research interests also include clustering, Bayesian networks and reinforcement learning.

Document last modified on June 10, 2003.