DIMACS Computational and Mathematical Epidemiology Seminar Series

Title: Influences on Breast Cancer Survival via SVM Classification in the SEER Database

Speaker: Ilya Muchnik, DIMACS

Date: December 6, 2004 11:30 - 12:50

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ


Estimating the influence of epidemiological factors on the distribution of a disease over a human population is one of the central directions of statistical epidemiology. For instance, it is well known that age and obesity have a strong influence on the onset of breast cancer. In the presented work we use the length of survival time as a criterion, from an epidemiological perspective, to study the influence on breast cancer of the many factors available in SEER. We use the criterion in Boolean form, taking a threshold of 3 years. Machine learning classification methods allow one to build a classifier-predictor over a space of appropriate variables. The change in the classifier when a single variable is perturbed while the others are held fixed can be taken as an estimate of that variable's influence. We therefore propose to compare the accuracy of the classifier on different regions of the classification space that have different values of the variable under consideration. Moreover, as estimates we use "predicted" statistics determined on new data that were not used in designing the classifier.
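
The comparison described above can be sketched in a few lines of Python. This is only one possible reading of the abstract, not the authors' implementation: the function names, the use of the spread between the best and worst per-value accuracy as the influence estimate, the choice of "at least 36 months" for the 3-year threshold, and the toy data are all assumptions made for illustration. The predictions are assumed to come from a classifier (an SVM in the talk) evaluated on held-out cases.

```python
from collections import defaultdict

def survival_label(months, threshold_months=36):
    """Boolean criterion: True if survival is at least 3 years (assumed >= 36 months)."""
    return months >= threshold_months

def accuracy_by_value(y_true, y_pred, values):
    """Held-out accuracy of a classifier, grouped by the value of one variable."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, value in zip(y_true, y_pred, values):
        totals[value] += 1
        hits[value] += int(truth == pred)
    return {value: hits[value] / totals[value] for value in totals}

def influence_estimate(y_true, y_pred, values):
    """Spread of per-value accuracies (hypothetical summary; larger suggests more influence)."""
    acc = accuracy_by_value(y_true, y_pred, values)
    return max(acc.values()) - min(acc.values())

# Toy held-out data, made up for illustration (not taken from SEER).
months = [50, 12, 40, 20, 60, 10]
y_true = [survival_label(m) for m in months]
y_pred = [True, False, True, True, False, False]  # stand-in for classifier output
stage = ["I", "I", "I", "II", "II", "II"]         # the variable under consideration

per_stage = accuracy_by_value(y_true, y_pred, stage)
spread = influence_estimate(y_true, y_pred, stage)
print(per_stage, spread)
```

In this toy case the classifier is right on all three stage I cases and on one of the three stage II cases, so the accuracy differs sharply between the two regions; under the assumptions above, that difference would mark the variable as a candidate influence.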

The SEER database contains 433,272 breast cancer cases from 1973 to 2000. Of these, we could use 67,647; many of the remaining cases were excluded because they had no survival time values, and many were related to different causes of death, etc.

Out of 112 chosen variables, our method identified only 40 epidemiological factors as having a significant influence on survival time. Because we regard the method as exploratory analysis, these 40 factors should be considered only as "candidates" for the set of factors with significant influence on survival time.

This is joint work with Jixin Li and Dona Schneider.