DIMACS Computational and Mathematical Epidemiology Seminar Series

Title: Finding and Interpreting Local Models in Analysis of Epidemiological Data

Speaker: Dmitriy Fradkin, Rutgers University

Date: November 14, 2005, 12:00 - 1:30 pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ

Abstract:

Over the years, many machine learning methods have been used to build predictive models from epidemiological data. Frequently, the goal of constructing such models is not to make predictions about individual patients but to discover relationships between the variables and the outcome. The model is then only an intermediate result (a description of some phenomenon) that domain experts must study and interpret.

We suggest a way of combining unsupervised clustering with modern classification methods that allows experts to search for local structure in the data and to detect locally significant features. The method consists of finding a region in the feature space (a cluster), obtaining its description (a model), building another model inside the region, and comparing these two models with the global model constructed from all available data. The three types of models also define three (potentially overlapping) sets of significant features. Together, these feature sets provide a view of both the whole data set and its regions.
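The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: plain k-means stands in for the clustering step, a univariate standardized mean-difference score stands in for the "modern classification methods", and the data set, the threshold of 1.0, and all function names are hypothetical.

```python
from statistics import mean, pstdev

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with evenly spaced initial centers (chosen for reproducibility)."""
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return assign

def feature_scores(rows, labels):
    """Stand-in 'model': per-feature |difference of class means| / overall std."""
    scores = []
    for col in zip(*rows):
        pos = [v for v, l in zip(col, labels) if l == 1]
        neg = [v for v, l in zip(col, labels) if l == 0]
        sd = pstdev(col)
        ok = pos and neg and sd > 0
        scores.append(abs(mean(pos) - mean(neg)) / sd if ok else 0.0)
    return scores

def significant(scores, threshold=1.0):
    """Indices of features whose score reaches the (assumed) threshold."""
    return {i for i, s in enumerate(scores) if s >= threshold}

# Hypothetical toy data: feature 0 locates the region, feature 1 drives the
# outcome only inside the region, feature 2 drives it everywhere else.
X, y = [], []
for i in range(20):                      # small region A
    f1, f2 = i % 2, (i // 2) % 2
    X.append([10 + (i % 3) * 0.1, f1, f2])
    y.append(f1)
for i in range(60):                      # remaining data B
    f1, f2 = i % 2, (i // 2) % 2
    X.append([(i % 3) * 0.1, f1, f2])
    y.append(f2)

# 1. Find a region (here: the smaller of two clusters).
assign = kmeans(X, k=2)
region_id = min(set(assign), key=assign.count)
in_region = [1 if a == region_id else 0 for a in assign]

# 2. Describe the region: which features separate it from the rest?
description_set = significant(feature_scores(X, in_region))

# 3. Local model: outcome vs. features, using only points inside the region.
local_rows = [r for r, m in zip(X, in_region) if m]
local_y = [l for l, m in zip(y, in_region) if m]
local_set = significant(feature_scores(local_rows, local_y))

# 4. Global model: outcome vs. features, using all data.
global_set = significant(feature_scores(X, y))

print("description:", description_set)
print("local:", local_set)
print("global:", global_set)
```

On this toy data the three feature sets come out disjoint: feature 0 describes the region, feature 1 matters only locally, and feature 2 dominates globally. In real data the sets may overlap, and a multivariate classifier would normally replace the univariate score used here.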

The approach is illustrated by an analysis of lung cancer survival data from the records of 200,000 patients. This work also illustrates some of the difficulties in analyzing large and complicated data.

This work was done with Dona Schneider and Ilya Muchnik.

see: DIMACS Computational and Mathematical Epidemiology Seminar Series 2005 - 2006