Applications in Biology, Computer Science, Intrusion Detection, and Other areas

DIMACS Center, Rutgers University, Piscataway, NJ

**Organizers:****Mel Janowitz**, DIMACS, melj@dimacs.rutgers.edu**David Banks**, Duke University, banks@stat.duke.edu (IMS Representative)

**Program Committee:****David Banks**, Duke University, banks@stat.duke.edu**Stanley L. Sclove**, University of Illinois at Chicago, slsclove@uic.edu**William Shannon**, Washington University School of Medicine, shannon@ilya.wustl.edu

This meeting will be held partly as a joint meeting with the DIMACS workshop on
Clustering Problems in Biological Networks May 9 - 11, 2006.

The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.

Title: Dimension Reduction Problem: A Version of PARAMAP Extended to Deal With Large Data Sets

Dimensionality reduction techniques are used for representing higher dimensional data by a more meaningful lower dimensional structure. Such techniques have a wide range of applications, especially in the field of data mining. In this paper we will study Carroll's Parametric Mapping (abbreviated PARAMAP) (Shepard & Carroll, 1966). The PARAMAP approach relies on iterative minimization of a loss function measuring the smoothness or "continuity" (insofar as this can be defined, at least approximately, for a finite set of data) of the mapping from the lower dimensional representation and the original data. We will develop a measure of congruence, called "agreement rate", based on preservation of local structure between the input data and the mapped low dimensional embedding, and demonstrate the application of PARAMAP to various sets of artificial data. We will also compare a number of well-known dimension reduction approaches on various sets of real data. We will discuss an improvement in PARAMAP to deal with a larger number of data points by devising a method to split the data into a landmark and a holdout sample, and then extrapolate the holdout points from the landmark configuration. The former will be accomplished by using a procedure intended to find an initial seed for the K-means procedure (DeSarbo, Carroll, Clark, & Green, 1984) and the latter by applying PARAMAP to map the holdout points into the lower dimensional landmark configuration. Both the extrapolation technique and the concept of "agreement rate" are not limited only to dimension reduction or specifically to the PARAMAP algorithm, but can be generalized to other problems faced when dealing with large data sets of the type associated with the data mining approach to data analysis.

References

- Shepard, R. N. & Carroll, J. D. (1966). Parametric representation of nonlinear data structures. In P. R. Krishnaiah (Ed.), Multivariate Analysis (pp. 561-592). New York, NY: Academic Press.
- DeSarbo, W. S., Carroll, J. D., Clark, L. A., & Green, P. E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika, 49, 57-78.

Title: Graphical Modeling of Coded Text Data: Modeling and Measurement of Empathy

Text data is collected in many studies and experiments in the social sciences and behavioral sciences. Such data is often coded and simple analyses of the codes are performed. In this presentation, graphical models that include latent continuous variables, which imply log-multiplicative models for observed data, are proposed as a mean to analyze coded text data with the goal of measuring underlying constructs. Concomitant information is included in the models in two distinct ways. The models are generalizations and extensions of those presented by Anderson and Böckenholt (2000) and Anderson and Vermunt (2000). The method is illustrated using data from an experiment where participants viewed a computer animated vignette depicting verbal bullying in a classroom. The participants entered textual responses to an open ended question regarding what they would do if they were the teacher.

Title: Comparison of Classification Methods to Predict Complications to Liver Surgery

Often in medical practice, it is of interest to identify patients that are at high risk for complications or death. As a result, prediction models are useful tools in medical decision making. There are many statistical methods available to build prediction models. The aim of this study is to compare several prediction methods using a large institutional database of patients undergoing liver surgery. The database under study includes 2002 consecutive patients who underwent liver resection at Memorial Sloan Kettering Cancer Center from 1991 to 2002 and captures information on more than thirty preoperative risk variables. The classification models were built to predict high grade complications following surgery. The strategies employed were logistic regression, generalized additive models, and classification trees. Within each strategy several approaches were explored. Graphical representations of the models were constructed so as to make the models user friendly for clinicians. All models were developed on a training set and evaluated on the test set. The performance of the models was compared by analyzing the ROC curves from the test set. The stepwise logistic regression model had similar predictive accuracy to the generalized additive model and the classification trees. Other non-linear machine learning methods such as neural networks will be applied to this dataset to determine their predictive power.

Title: Cluster Analysis using Different Correlation Coefficients

Partitioning objects into closely related groups that have different states helps to understand the underlying structure in the data set treated. Several clustering algorithms with different kinds of similarity measure are commonly used to find either an optimal clustering or a close to original clustering. The retrieval ability of six chosen clustering algorithms of using shrinkage-based and rank-based correlation coefficients, which are known to be robust, is evaluated using Rand's C values. Rand's C values present how the resultant clustering is close to the original clustering. The recovery levels of clusters which are found by the six clustering algorithms with WLEs (weighted likelihood estimates) based correlation, are also obtained and compared to the results from using those correlation coefficients.

Title: Fitting Directed Tree Graphs to Asymmetric Proximities

A new type of graphical representation, the directed tree, is described for asymmetric proximity data. The directed tree is equivalent to a tree-structured directed graph, termed a bidirectional tree by Cunningham (1978). A methodology is described for fitting and drawing directed trees. The method, dubbed DITREES, begins by decomposing the asymmetric proximities into symmetric and asymmetric components (cf. Gower, 1977; Heiser & Busing, 2004). A mixture of two trees, with the same topology but different arc weights, is then fit to the n(n-1)/2 symmetric proximities, plus the n(n-1)/2 transformed antisymmetric residuals. The solution is represented by a directional tree, such that the length of each arc represents the corresponding parameter of the tree fit to the symmetrized proximities, and a directional component to each arc (represented visually by an arrowhead) represents the corresponding parameter of the tree fit to the transformed antisymmetric residuals.

Title: ACA Similarity Analysis Re-examined

Author cocitation analysis (ACA) employs a variety of clustering and scaling techniques for exploring patterns in bibliometric data. In a 2003 JASIST article, Per Ahlgren, Bo Jarneving, and Ronald Rousseau criticized the use of Pearson's r as a similarity measure in ACA, citing properties that they believe make r undesirable. Howard White's response to Ahlgren et al, and the subsequent exchange in the JASIST letters column suggest that behind this methodological controversy there lurk contrasting intuitions about both the reality ACA is supposed to reveal and its relationship to the bibliographic evidence on which ACA is based. A reframing of the ACA data analysis methodology puts some of these controversies in a new light.
**Weiwu Fang, Tu Sili, and Cai Xu**, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China

**Wang Yuncheng**, Shandong Agricultural University, Taian, China and

**Wu Weixing**, University of International Business and Economics, Beijing, China

Title: The Longest Ultraconserved Sequences and Evolution of Vertebrate Mitochondrial Genomes

Finally, some classification and computation skills are also discussed.

Title: Analyzing the Supreme Court Justice Decisions using Multidimensional Scaling (MDS) and Parametric Mapping (PARAMAP)

The purpose of this paper is twofold; to show how dimensionality reduction techniques can be used to interpret voting data, and to introduce the PARAMAP technique as a technique for performing dimensionality reduction. The paper was inspired by a New York Times article detailing agreement rates and swing voting patterns for the Supreme Court Justices.

We use data on the agreement rates between individual Justices and on the percentage swing votes for Justices over time. We use a multidimensional scaling (MDS) approach to create a 'voting space' representation of the Justices in one, two and three dimensions. We perform these analyses using KYST, to implement nonmetric MDS via an algorithm that minimizes a loss function called STRESS. We also use a nonlinear mapping technique called PARAMAP and compare the results with those obtained by KYST. We test the results using a metric that measures agreement between higher and lower dimensional solutions.

We find that in general PARAMAP solutions produce superior values of the agreement metric to the KYST solutions. We find in particular that a two stage optimization method gives highly interpretable and parsimonious solutions.

Title: An application of combinatorial methods introduced for verification of cluster analyses.

Assume that n independent and identically distributed k-dimensional continuous random variables have been observed. Under this assumption, any clusters obtained (if more than one such cluster is obtained) would be spurious. Various combinatorial measures have been introduced to use for purposes of testing whether clusters are "real" or spurious have been introduced. These methods are based on proximity analyses.These methods are now applied to studying urn models for boron atom contamination of semi-conductor materials. Essentially, if too many contaminants are too close, undesirable behavior of equipment occurs. This work is joint with John Zavada of the US Army Research Office.

Title: Standard Errors, Prediction Error and Model Tests in Additive Trees

Theoretical standard errors and confidence intervals are given for the estimates of branch lengths in psychometric additive trees, for a priori known tree topologies as well as for estimated tree topologies. A model test and an estimate of prediction error to compare different tree topologies are also provided. Results are derived under the assumption of uncorrelated dissimilarity values. The statistical inference theory proposed here differs from existing approaches by its use of the additive decomposition of a tree distance into binary components, which allows stating the problem in a linear model framework. Additive trees can be considered as a special case of feature network models, for which the same inference theory is valid. Therefore, results could also be extended to other special cases, such as the additive clustering or common features model.

Title: Learning and Classification in Networked Data

Networked objects such as web-pages, research papers, people and companies form a complex structure of interdependence. In this talk, I explore classifying such objects using only the network structure and ignoring intrinsic attributes of the objects. Networked objects differ from the traditional machine learning point of view where instances are independent. Networked objects clearly are not. Further, networked data has the property that often the instances learned from are connected to instances whose class is to be estimated. Therefore, known instances has dual roles, first as training examples and later as background knowledge. I will introduce NetKit, a network learning toolkit for statistical relational learning, and show its use in a univariate case study.

Title: Natural Requirements to Classification Methods

In many cases it is worthwhile to formulate some requirements to methods of solution of problems belonging to a class. Typically such requirements concern stability, convergence, domain of applicability, and so on. This report describes an attempt to formulate some natural requirements to classification methods and to analyze the most popular convenient methods from this point of view.

For simplicity let us consider the case where classification objects are points in Euclidian space and dissimilarity is defined as the distance between points. The result of classification (into two classes) consists of two sets of points. The border between these sets completely determines the result of classification. Omitting many formal details, it is possible to formulate three requirements:

- Essential changes of one or both sets far enough from the border between them must not change the classification (that means that the border must be the same after the changes).
- Arbitrary adding of a little (relatively to the cardinality of both sets) number of points to the space must not essentially change the result of classification.
- A reliable method must work correctly at least in simple unknown in advance situations. That means that the method must not use any additional external information about the considered set of points.

It is demonstrated that methods based on distances from some centers (like K-mean method) do not meet the first requirement. At the same time the method based on minimal distances between points (like single linkage and split methods) do not meet the second requirement. Both mentioned groups (and other methods) do not meet the third requirement. A new method is shortly described that appreciably meets the formulated requirements. The report is accompanied by some artificial and real examples.

Title: A Comparison of One Heuristic and One Globally Optimal Unidimensional Scaling Technique

Given the prevalence of large multivariate data sets in education and psychology research, combinatorial data analysis (CDA) techniques are useful for detecting relationships among large numbers of variables or objects. CDA has the ability to answer existing research questions as well as to provide a means to guide further research. However, finding an optimal solution to problems in CDA poses computational limitations and is most useful for data sets with less than 25 to 35 objects, depending on the method use. Alternatively, heuristic methods provide feasible solutions for larger matrices; however, these methods are well noted for issues of local optima. In this presentation, I compare two undimensional scaling techniques. The first method uses a branch-and-bound routine implemented in FORTRAN to find a globally optimal solution for up to 35 objects. The second technique uses a quadratic assignment routine implemented in MATLAB to find a heuristic solution. With the capability of achieving a globally optimal solution for up to 35 objects, I use 12 empirical proximity matrices to compare the branch-and-bound optimal solutions to those heuristic solutions produced by quadratic assignment. Furthermore, I investigate whether the quadratic assignment method does in fact produce a globally optimal solution after several random starts and how this heuristic compares in terms of CPU time.

Title: Using Cluster Analysis to Relate Subjective and Objective Pharmacovigilance Association Measures

The field of pharmacovigilance is concerned with the detection and interpretation of associations between drugs and adverse medical events that may or may not be caused or exacerbated by those drugs. In the U.S., the primary source of data for pharmacovigilance analysis is the FDA's Adverse Event Reporting System (AERS) database, which is organized by Individual Safety Reports (ISR's) listing the drugs a patient was taking, the adverse reactions they reported, and a limited amount of additional data (e.g., patient age, gender, reporting source information, etc.). Association between drugs and adverse events can be measured in a variety of ways, and this talk will consider three: two objective measures (the classical Reporting Ratio and the Statistical Unexpectedness, based on Fisher's exact test for association between drugs and adverse events), and one subjective measure (the Index of Suspicion, a numerical value between 0 and 1 computed from the AERS role code data, which classifies every drug appearing in an ISR as ``primary suspect,'' ``secondary suspect,'' ``interacting,'' or ``concomitant''). Examination of the AERS database for various combinations of drugs and adverse events suggests that these three association measures are related, but in a complicated manner. Thus, we are motivated to explore this relationship using partition-based cluster analysis. The talk will present a comprehensive assessment of clustering results obtained for representative sets of drug/adverse event pairs, examining the influence of variable transformations and scaling, the detection and treatment of outliers, and the use of different dissimilarity measures. Assessment of the number of clusters present in the data is based on a permutation reference strategy that has been discussed earlier (R.K. Pearson, T. Zylkin, J.S. Schwaber, and G.E. Gonye, ``Quantitative evaluation of clustering results using computational negative controls,'' Proc. 4th SIAM Intl. Conf. Data Mining, Lake Buena Vista, Fla., April 2004, pp. 188--199). Our ultimate objectives are first, to obtain useful insights into the reliability of spontaneous reporting systems like the AERS database and second, to construct an index that quantifies the tendency for some drugs to be subjectively blamed for adverse events even in the absence of objective evidence for an association with those events.

Title: A Likelihood Approach for Determining Cluster Number

Deciding where to cut the dendrogram produced by a hierarchical cluster analysis is known as a stopping rule. Heuristic approaches proposed for solving this problem have been based on statistics such as the proportion of variance accounted for by the clusters. The statistic is calculated on each of the sets of clusters produced by cutting the dendrogram at successive heights. The number of clusters in the set that optimizes the statistic estimates the true number of clusters. This is a reasonable ad hoc measure, but is not based on a probability model of cluster distributions. Therefore, the properties of statistical decision theory (e.g., Type I and II error) do not hold, and the associated P-values are meaningless. In this presentation we propose a novel stopping rule based on a probability model for graphical objects. The application of probability models to hierarchical trees is highly speculative, but is based on prior published work (Banks and Constantine 1998; Shannon and Banks 1999). We extend this prior work to derive a likelihood and a likelihood-ratio test (LRT) for determining the number of clusters in a dataset. We are aware that the criteria for the LRT (Lehmann 1999) are not fully met so P-values will be approximations at best, though bootstrap P-values might easily be estimated. We are beginning to contrast the likelihood-ratio test stopping rule with other existing ad hoc approaches.

Title: Using Data Mining Tools in an Urban Development Study

This paper investigates the nonlinearity in a data set for an urban development study. The data set contains some 1.4 million cases with 14 predictors and one target variable with binary outcomes. The relationships between the target and some of the predictive variables appear non-monotonic, hence the standard logistic regression, polynomial logistic regression, probit, and CLogLog may not be adequate in the predictive modeling, regardless of the use of different link functions. In this paper, we compare a number of modeling approaches, including Classification Tree, Neural Network, Genetic Algorithm, and Stochastic Gradient Boosting by the corresponding misclassification rates, variable importance, and a number of other model assessment measures. Geographic Information Systems (GIS) are used to create and process an urban growth database to estimate the possible relationship between urban growth and several growth, stimulant, and deterrent variables.

Previous: Contributed Papers

Workshop Index

DIMACS Homepage

Contacting the Center

Document last modified on April 21, 2006.