DIMACS Tutorial on Data Mining and Epidemiology

Dates: March 23 - 24, 2006
DIMACS Center, CoRE Building, Rutgers University

James Abello, DIMACS, abello@dimacs.rutgers.edu
Graham Cormode, Bell Laboratories, cormode@bell-labs.com
Presented under the auspices of the Special Focus on Computational and Mathematical Epidemiology.
James Abello, DIMACS / Rutgers University and Ask.com

Title: Graph Mining?

A variety of massive data sets exhibit an underlying structure that can be modeled as dynamic weighted multi-digraphs. Their sizes range from tens of gigabytes to petabytes. These include the World Wide Web, Internet Traffic and Telephone Call Detail. These data sets sheer volume brings with it a series of computational and visualization challenges due mainly to the I/O and Screen Bottlenecks. We present external memory algorithms for connectivity and minimum spanning trees together with heuristics for quasi-clique finding. We describe how hierarchy trees help us to cope in a unified manner with both, the I/O and screen bottlenecks. This line of research has suggested the need to look for "novel" graph representations in order to provide a user or a data-mining engine with informed navigation paths. We will discuss our results with graphs having on the order of 200 million vertices and several billion edges and we will point out some mathematical problems that have surfaced along the way. The overall goal is to extract useful information that can be brought into a user's palm top and to export these techniques to other mining domains. Related information can be obtained by accessing http://www.mgvis.com or emailing to abello@dimacs.rutgers.edu

Speaker Bio Sketch: James Abello received the PhD degree in Combinatorial Algorithms from the University of California, San Diego, and the MS degree in Operating Systems from the University of California, Santa Barbara. He is the recipient of a University of California President's Postdoctoral Fellowship in Computer Science and was awarded a UCSB Outstanding Teaching Award. James is the co-editor of External Memory Algorithms, Vol. 50 of the AMS-DIMACS series (with J. Vitter, 1999), The Kluwer Handbook of Massive Data Sets (with P. Pardalos and M. Resende, 2002) and an incoming AMS-DIMACS Volume on Discrete Methods in Epidemiology (with Graham Cormode, 2006). James research focus has been on Algorithms and Data Structures, Massive Data Sets, Algorithm Animation and Visualization, Combinatorial and Computational Geometry, Discrete Mathematics, and some applications in Petroleum Engineering and Biology. He has worked on the development of software systems like: MGV (A Massive Graph Visualizer, with J. Korn), AGE (An Animated Graph Environment, with T. Veatch), Mirage (An Interpreted Language for Algorithm Animation, with C. Smith), A Quasi-Clique Extractor (with S. Sudarsky) and Matrix Zoom (with F. Van Ham). James has held several academic positions and has been a senior member of technical staff at AT&T Shannon Laboratories and Bell Labs. He is currently a research associate at DIMACS, Rutgers University and a Senior Research Scientist at Ask.com. Information about some of James's current visualization research projects can be obtained by accessing http://www.mgvis.com

Brigham Anderson, Carnegie Mellon University

Title: Probability for Data Miners Tutorial

Probability lies at the heart of data mining. This tutorial aims to introduce these basic probabilistic foundations, with an emphasis on Bayesian concepts. After reviewing the axioms of probability, we will see how they can be used to create probabilistic models, which underlie many machine learning algorithms. We'll discuss the simplest models: the full joint density and the na´ve density, and show how they can be used to perform the common tasks of machine learning: inference, classification, anomaly detection, and clustering. Along the way, we will use simple Bayesian networks to characterize our models and our assumptions. A sample of the concepts that will be discussed are Sample spaces, Bayes rule, conditional independence, Bayes classifiers, na´ve Bayes classifiers, Bayes nets, hidden variables, generative models, and mixture models.

Michael Capalbo

Title: The Containment Problem

Consider the following problem, known as the Containment Problem. Let G be a network (e.g., social, computer network, etc), and let S_0 be any subset of the nodes of G, such that every node in S_0 is infected with a virus that spreads from each infected node to all of its nonvaccinated neighbors in one time step. Our allowed response is to vaccinate a limited humber a_l of nodes during each time step l=1,2, ... , t, for some integer t.

The Containment Problem asks us to find what a_l nodes to vaccinate each time step l to minimize the total number of nodes that eventually become infected, or in other words, devise an optimum vaccination strategy given our resources. The Containment Problem has applications in, for example, Computational Epidemiology.

Unfortunately, for general instances of the Containment Problem, there is no known tractable algorithm for returning an optimum vaccination strategy. The focus of this talk is to present a tractable algorithm that returns a vaccination strategy that is only slightly inferior to the optimum vaccination strategy (assuming that G satisfies a reasonable condition that we will state.

This work is joint work with James Abello, and is an ongoing research project.

Courtney D. Corley, University of North Texas

Title: Viruses and Computer Scientists

Recent world events such as the threat of bioterrorism, the outbreak of SARS and the spread of the H5N1 Avian Influenza have motivated computer scientists to begin looking at issues in the public health and epidemiology domains. In this talk, I will summarize current, exciting research in modeling and simulating the spread of viruses. I will also introduce basic knowledge about the structure and function of viruses, as well as discuss some of the epistemology surrounding whether viruses are "Alive" or "Dead".

Graham Cormode, Bell Laboratories

Title: Cluster and Data Stream Analysis

Clustering is an important tool in machine learning and data mining. It allows features and correlations in the data to be identified and requires few parameters and little detailed information about the data. The results can be used to generate hypotheses, aid in visualization, or reduce the data to a few prototypical points. This 'unsupervised learning' technique has many variants and many perspectives. I will give an algorithmic view, describing some of the most popular clustering algorithms and identifying their pros and cons, including hierarchical clustering, k-means, expectation maximization (EM) and k-center approximation algorithms.

When the input data is too large to conveniently hold in memory, or is being constantly updated, it is necessary to view the data as a massive stream. In recent years the "data stream" model has become a popular way to handle massive data sources. I will outline some of the key properties of data streams, and illustrate this with some of the recent work in clustering on data streams.

Nina H. Fefferman, DIMACS and Tufts University

Title: Selected Problems in Epidemiology:

Recent advances in technology have led to more accurate and more complete methods of biosurveillance. As a result, finding useful information within the sheer amount of data gathered has become difficult. Modern problems in public health and epidemiology require better techniques in data mining in order to make sense from the chaos. We'll discuss a couple of examples from U.S. public health and take a few minutes to discuss some of the trade-offs in monitoring necessitated by a lack of appropriate data mining methods.

Dmitriy Fradkin, Ask.com

Title: Using cluster analysis to determine the influence of epidemiological features on medical status of lung cancer patients

In this work we analyze lung cancer data, obtained from SEER, for 217,558 patients diagnosed in 1988-2000. Each patient is characterized by 23 epidemiological (essentially demographic) and 22 medical features. The main idea of this analysis consists in clustering the data in the space of epidemiological features only, and analyzing influence of the epidemiological classification on medical status of patients. The influence is estimated by using the T-test to determine differences in the distributions of medical features between clusters.

We partitioned the epidemiological part of data into 20 clusters. Out of 190 cluster pairs, there are 2 pairs with only 1 distinguishing medical feature and 4 pairs with 2 distinguishing features. All other pairs differ in at least 3 medical features. We also found some medical features that are not different in any pair of clusters, and some that take distinct values in many clusters.

Such analysis indicates which medical aspects are most affected by epidemiological status. On the other hand, it aids in finding epidemiological subpopulation (clusters) that are very different from others in their medical characterization.

This is a joint work with Dona Schneider and Ilya Muchnik

Tanya Kostova, LLNL

Title: The Mathematical Formulation of the Foot-and-mouth Disease Epidemic
Component of the Decision Support System Developed at LLNL

Development of a decision support system to evaluate the spread and economic impact of a possible introduction of foot and mouth disease in the US is under way. The model is an agent-based stochastic simulation. The agents are the individual animal facilities (farms, feedlots, sales-yards, etc.) and they can be thought of as the nodes of a network. The nodes can be infected via several methods and pass through the commonly accepted stages of infection (latent, subclinical, clinical, immune, etc.). I will present the basic mathematical formulation of the model, the type of input the model uses and the type of simulation output we will be obtaining. The code is under construction.

Robert Schapire, Princeton University

Title: Machine Learning Algorithms for Classification

Machine learning studies the design of computer algorithms that automatically make predictions about the unknown based on past observations. Often, the goal is to learn to categorize objects into one of a relatively small set of classes. This tutorial will introduce some of the main state-of-the-art machine learning techniques for solving such classification problems, namely, decision trees, boosting, support-vector machines and neural networks. The tutorial will also discuss some of the key issues in classifier design, including avoidance of overfitting.

Previous: Participation
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on March 21, 2006.