DIMACS Computational and Mathematical Epidemiology Seminar Series

Title: Using cluster analysis to determine the influence of epidemiological features on medical status of lung cancer patients.

Speaker: Dmitriy Fradkin, Ask.com

Date: March 13, 2006 12:00 - 1:30 pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ


In this work we analyze lung cancer data obtained from SEER for 217,558 patients diagnosed in 1988-2000. Each patient is characterized by 23 epidemiological (essentially demographic) and 22 medical features. The main idea is to cluster the data in the space of epidemiological features only, and then to analyze the influence of this epidemiological classification on the medical status of patients. The influence is estimated by using the t-test to determine differences in the distributions of medical features between clusters.
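As a rough illustration of the comparison step, the sketch below computes a t statistic for one medical feature between two clusters. It is not the speaker's implementation. The abstract does not say which t-test variant, significance level, or clustering algorithm was used, so Welch's unequal-variance form, the cutoff `T_CUTOFF`, the feature name `tumor_size`, and the toy values are all assumptions made for the example. The cluster labels are taken as given, assigned from epidemiological features only.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    standard_error = sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical toy data: one medical feature per patient, plus the cluster
# label each patient received from clustering on epidemiological features
# (the clustering step itself is not shown here).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
tumor_size = [2.1, 2.4, 1.9, 2.2, 3.8, 4.1, 3.6, 4.0]

# Group the medical feature values by cluster label.
by_cluster = {}
for label, value in zip(labels, tumor_size):
    by_cluster.setdefault(label, []).append(value)

t = welch_t(by_cluster[0], by_cluster[1])

# Illustrative cutoff only; the abstract does not state a significance level.
T_CUTOFF = 2.0
differs = abs(t) > T_CUTOFF
```

Repeating this test for every medical feature and every pair of clusters would give, for each pair, the set of medical features that distinguish it.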

We partitioned the epidemiological part of the data into 20 clusters. Out of the 190 cluster pairs, 2 pairs have only 1 distinguishing medical feature and 4 pairs have 2 distinguishing features. The remaining 184 pairs differ in at least 3 medical features. We also found some medical features that do not differ in any pair of clusters, and some that take distinct values in many clusters.
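The figure of 190 pairs follows from choosing 2 clusters out of 20. The sketch below checks that count and shows one way the per-pair results could be tallied. The `distinguishing` mapping and its feature names are hypothetical placeholders, not results from the study.

```python
from itertools import combinations
from math import comb

N_CLUSTERS = 20

# Every unordered pair of distinct clusters: C(20, 2) = 190.
pairs = list(combinations(range(N_CLUSTERS), 2))
n_pairs = len(pairs)


def tally(distinguishing):
    """Map 'number of distinguishing features' to 'number of cluster pairs'."""
    counts = {}
    for features in distinguishing.values():
        counts[len(features)] = counts.get(len(features), 0) + 1
    return counts


# Hypothetical example: for each cluster pair, the set of medical features
# whose distributions differ between the two clusters.
distinguishing = {
    (0, 1): {"stage"},
    (0, 2): {"stage", "grade"},
    (1, 2): {"stage", "grade", "histology"},
}
counts = tally(distinguishing)

# Features that never separate any pair in this toy example.
all_features = {"stage", "grade", "histology", "laterality"}
never_different = all_features - set().union(*distinguishing.values())
```

In this toy input, `counts` records one pair each with 1, 2, and 3 distinguishing features, and `never_different` holds the single feature that separates no pair.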

Such analysis indicates which medical aspects are most affected by epidemiological status. It also helps identify epidemiological subpopulations (clusters) whose medical characteristics differ markedly from those of the others.

This is joint work with Dona Schneider and Ilya Muchnik.

see: DIMACS Computational and Mathematical Epidemiology Seminar Series 2005 - 2006