DIMACS Workshop on Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications

g DIMACS Workshop on Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications

DIMACS Workshop on Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications

May 14 - 16, 2003
DIMACS Center, Rutgers University, Piscataway, NJ

Organizers:: Regina Liu, Rutgers University, rliu@stat.rutgers.edu; Robert Serfling, University of Texas at Dallas, serfling@utdallas.edu; Diane Souvaine, Tufts University, dls@eecs.tufts.edu; Yehuda Vardi, Rutgers University, vardi@stat.rutgers.edu

Presented under the auspices of the DIMACS Special Focus on Data Analysis and Mining and the DIMACS Special Focus on Computational Geometry and Applications.

Abstracts:

Greg Aloupis, McGill University

Title: On the Computation and Robustness of Some Data Depth Notions

Given a set of bivariate data sampled from an underlying non-parametric distribution, there exist several notions of data depth which are used to measure how central a given query point is. Some of the most familiar notions include simplicial, Oja and halfspace (Tukey) depth. In each case, the deepest point is a generalization of the univariate median. Recent developments concerning the complexity of computing depth or locating the deepest point will be reviewed. Desirable robustness properties for these depth notions, such as breakdown and monotonicity, will be discussed. Some new and recent open problems will be posed.

Michael Burr , Tufts University
Eynat Rafalin , Tufts University
Diane Souvaine , Tufts University

Title: A Note on Simplicial Depth

The simplicial depth of a point p in Rd relative to a set of points S = {s1; : : : ; sn} is the number of d simplices created by points of S that contain p. Either open or closed simplices create a function that is not continuous along segments between data points. We suggest a modification to the definition that maintains statistical significance but also produces a continuous function.

In addition, we will review problems related to simplicial depth contours and discuss possible improvements.

Biman Chakraborty and Probal Chaudhuri , National University of Singapore, ISI

Title: On Some Probabilistic Algorithms for Computing Tukey's Halfspace Depth

The halfspace depth of a d-dimensional point ø relative to a data set X1; : : : ;Xn ε IRd is defined as the smallest number of observations in any closed halfspace with boundary through ø. Let H be the hyperplane passing through ø and d - 1 data points Xi1; : : : ;Xid-1 . Then the multivariate Hodges's sign test (Hodges 1955) statistic is ob tained by maximizing the absolute difference between the number of data points that fall on opposite sides of the plane H. Chaudhuri and Sengupta (1993) showed that this is equivalent to finding halfspace depth at ø and the problem of computing depth of ø re duces to a combinatorial optimization problem over d - 1 subsets of data points. For bivariate data, the halfspace depth of a point can be computed in O(n log n) time (Ma tousek 1991, Rousseeuw and Ruts 1996). For d = 3, an exact algorithm is available to compute it in O(n2 log n) time (Rousseeuw and Struyf 1998). An analogous algorithm can be constructed to compute the exact depth of a point in dimension d > 3 with time complexity O(nd-1 log n). It appears that there is virtually no hope for solving such a problem exactly for high dimensional data. In this work, we develop and study some probabilistic algorithms (Chakraborty and Chaudhuri 2003) to approximate the halfspace depth of a point for large high dimensional data using techniques based on Markov chains. Our algorithms were motivated by well known probabilistic combinatorial optimization techniques like genetic algorithms (Goldberg 1989), simulated annealing (Kirkpatrick et al. 1983, Geman and Geman 1984) and Metropolis-Hastings algorithm (Metropolis et al. 1953, Hastings 1970). We establish convergence of our algorithms to the true halfspace depth as the number of iterations in the Markov chain increases. We also demonstrate the performance of our algorithm when applied to real and simulated high dimensional data sets.

References

Chakraborty, B. and Chaudhuri, P. (2003) On the use of genetic algorithm with elitism in robust and nonparametric multivariate analysis.
Austrian Journal of Statistics, 32, 13-27.
Chaudhuri, P. and Sengupta, D. (1993) Sign tests in multidimension: Inference based on the geometry of the data cloud.
Journal of the American Statistical Association, 88,1363-1370.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Goldberg, D.E.(1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, New York.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Hodges, J.L. (1955) A bivariate sign test. The Annals of Mathematical Statistics, 26, 523-527.
Kirkpatrick, S., Gelatt, C.D. Jr. and Vecchi, M.P. (1983) Optimization by Simulated Annealing. Science, 220, 671-680.
Matousek, J.(1991) Computing the center of planar point sets. In: J.E. Goodman, R. Pollack, W. Steiger, editors,
Discrete and Computational Geometry: Papers from the DIMACS Special Year, American Mathematical Society, 221-230.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) Equation od state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.
Rousseeuw, P.J. and Ruts, I. (1996) Algorith AS 307: Bivariate location depth. Ap- plied Statistics, 45, 516-526.
Rousseeuw, P.J. and Struyf, A. (1998) Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8, 193-203.

Andreas Christmann , University of Dortmund, Dortmund, Germany

Title: On Aspects of Regression Depth and Methods Based on Convex Risk Minimization for Data Mining

The support vector machine (SVM) plays an important role in statistical learning theory and in analyzing high dimensional complex data sets for pattern recognition purposes, c.f. Vapnik (1998), Hastie, Tibshirani and Friedman (2001), and Scholkopf and Smola (2002). Consider a training data set {(xi; yi) ε IRd {-1; +1} i = 1; : : : ;n}. Major goals are the estimation of a functional relationship y = g(x) between the outcome y and pre dictors x = (x1; : : :; xd)0 ε IRd and the prediction of an unobserved outcome ynew based on an observed value xnew of the predictors. The function g is unknown. One needs the implicit assumption that the relationship between xnew and ynew is{at least almost{the same than in the training data set. Otherwise, it is useless to extract knowledge on g from the training data set. The classical assumption in machine learning is, that the points (xi; yi) of the training data were independent and identically generated from an underlying unknown distribution IP for the random variables (X; Y ). The quality of the predictor g(x) is measured by some loss function L(g(x); y). The goal is to find a predictor g(x) which minimizes the expected loss of g:

R(g) = EIPL(g(X); Y ):

The support vector machine is one important example for a consistent classification method based on convex risk minimization.

In the talk some connections between the support vector machine and regression depth for pattern recognition will be investigated. The regression depth method proposed by Rousseeuw and Hubert (1999) can successfully be used to measure the amount of overlap in binary regression models, c.f. Christmann and Rousseeuw (2001). Finally, some results of a sensitivity analysis of the support vector machine will be given.

References

A. Christmann and P.J. Roussseeuw (2001). Measuring overlap in logistic regression. Computational Statistics and Data Analysis, 37, 65{75.
A. Christmann, P. Fischer, and T. Joachims (2002). Comparison between various regression depth methods and the support vector machine to approximate the minimum number of misclassifications. Computational Statistics, 17, 273{287.
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning, Data Mining, Inference, and Prediction. Springer, New York.
P.J. Rousseeuw and M. Hubert (1999). Regression depth. J. Amer. Statist. Assoc., 94, 388-433.
B. Scholkopf and A.J. Smola (2002). Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.
V. Vapnik (1998). Statistical Learning Theory. Wiley & Sons, New York.

Jose H. Dula, University of Mississippi & the U.S. Census Bureau

Title: Fast Algorithms for Frames and Point Depth in Polyhedral Hulls

Consider a finite set A of n points in m-dimensional space. The convex hull of these multivariate points is a bounded polyhedron, P. The convex hull of fewer than the n points in A might be sufficient to describe P. If so, any point not necessary to the description of P is said to be redundant. The minimum cardinality subset of A needed to describe the polytope P is called the "frame" of A. Elements of the frame are the extreme points of the convex hull and represent geometric outliers of the set A. Typically the frame is a small subset of A especially when the number of elements in A is very large. The "frame problem" is the problem of identifying the frame elements of the set A. Traditionally this has been done by verifying the feasibility of a system of linear equations, one system for each point in A. New ingenious deterministic algorithms are able to reduce substantially this computational requirement. We can apply this concept of frame iteratively to arrive at a measure of "depth" for points in A. Points that are elements of the frame have depth 1. If the frame is removed from A and a new convex hull defined from what remains, points in this new frame have a depth of 2; and so on. By repeating this process it is possible to assign a depth to all the points in A and the maximum depth is an indication of distribution and dispersion.

Ryan T. Elmore, Thomas P. Hettmansperger, Fengjuan Xuan, Penn State University

Title: Data Depth and Mixture Models

The notion of a data depth in multivariate analysis is quite appealing as a descriptive measure in unimodal, elliptical data clouds. However, some measures of data depth can be misleading when the data arise from a non-standard distribution. For example, the interpretation of depth as a center-outward ranking of the data in some mixture or bimodal distributions. We will discuss what we mean by misleading in this context. A generalization of the simplicial depth, the relative depth in a local sense, is presented. This relative depth is shown to have a favorable interpretation in these non-standard situations. Further, we will discuss how existing data depth measures can be used as a diagnostic in detecting the presence of mixture-model data.

David Eppstein, University of California, Irvine

Title: Depth and Arrangements

We survey the connection between statistical notions of depth and robustness, and computational geometry notions of complexity in arrangements and arrangement construc tion algorithms.

Belen Fernandez-de-Castro, S. Guillas, W. Gonzalez Manteiga,Universidade de Santiago de Compostela, Spain

Title: Functional Samples and Bootstrap for Predicting SO2 Levels

In this paper, enhancements of several functional techniques are given in order to forecast SO2 levels near a power plant. The data are considered as a time series of curves. Assuming a lag one dependence, the predictions are computed using the functional kernel (with local bandwidth) and the linear autoregressive Hilbertian model. We carry out the estimation with a so-called historical matrix, which is a subsample that emphasizes uncommon shapes. A bootstrap method is introduced to evaluate the range of the forecasts, which uses Fraiman and Muniz's concept of data depth. Finally, we compare our functional techniques with neural networks and semiparametric methods, and find that the former models are often more effective.

Anil K. Ghosh and Probal Chaudhuri, Indian Statistical Institute, Calcutta, India

Title: Data Depth and Classification by Finite Dimensional Discriminating Surfaces

Classical linear and quadratic discriminant functions are two very well-known tools used in discriminant analysis. They are very useful in providing good lower dimensional views of class separability. Fisher's (Fisher, 1936) original approach in linear and quadratic discriminant analysis assumes normality of the underlying distributions and uses the sam ple moments to estimate the population parameters and the separating surfaces. These estimates, however, are highly sensitive to outliers and they are not reliable for heavy tailed distributions. Various alternatives to Fisher's original linear and quadratic discrmi nant functions have been studied in the literature (Duda, Hart and Stork, 2000; Hastie, Tibshirani and Friedman, 2001 and Ripley, 1996). Some of these techniques are based on sample moments (e.g., regularized discriminant analysis due to Friedman, 1989), some are based on specific distributional assumptions for the competing populations (e.g., logistic discriminant analysis), and several others are based on choosing discrminating hyperplanes through optimization of some aprpropriate smooth criteria (e.g. support vector machines due to Vapnik, 1998; exible discrminant analysis due to Hastie, Tibshirani and Buja, 1994 and techniques using neural networks).

In this talk we present two robust and distribution free alternatives for classification based on linear or nonlinear separating surfaces. One of these classifiers is motivated by Tukey's half-space depth (Tukey, 1975) while the other one uses the notion of regression depth due to Rousseeuw and Hubert (1999). The regression depth has recently been used for two-population classification problems by Christmann, Fischer and Joachims (2002). These depth-based methods assume some specific finite dimensional parametric form of the discriminating surface and uses the empirical distribution of the training data and its geometric features for building the classifier. These techniques are also motivated by direct minimization of the empirical misclassification rates of discriminating surfaces.

We demonstrate the performance of the proposed methodology using simulated as well as well-known bench mark data sets. Large sample properties of the poposed discriminant functions will also be discussed.

Bibliography

Christmann, A., Fischer, P. and Joachims, T. (2002) Comparison between various regression depth methods and the support vector machine to approximate the minimum number of misclassifications. Comput. Stat., 17, 273-287.
Duda, R., Hart, P. and Stork, D. G. (2000) Pattern Classification. Wiley, New York.
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Friedman, J. H. (1989) Regularized discriminant analysis. Jour. Amer. Stat. Asso., 84, 165-175.
Hastie, T., Tibshirani, R. and Friedman, J. H. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rousseeuw, P.J. and Hubert, M. (1999) Regression depth (with discussions). Jour. Amer. Stat. Asso., 94, 388-402.
Tukey, J. (1975) Mathematics and the picturing of data. Proc. Inter. Cong. Math., Vancouver, 523-531.
Vapnik. V. N. (1998) Statistical Learning Theory. Wiley, New York.

Bogdan Georgescu, Department of Computer Science, Rutgers University
Ilan Shimshoni, Department of Industrial Engineering and Management, Technion-Israel Institute of Technology, Israel
Peter Meer, Department of Computer Science, Department of Electrical and Computer Engineering, Rutgers University

Title: Nonparametric Clustering of High-Dimensional Data

Representation of visual information through feature space analysis received renewed interest in recent years in large measure triggered by content based image retrieval ap plications. The most popular technique, k-means clustering, however, has two inherent limitations: the clusters are constrained to be radially symmetric and their number has to be known a priori. In nonparametric clustering methods, these limitations are elim inated but the amount of computation becomes prohibitively large as the dimension of the space increases. Such a nonparametric method was developed a few years ago in our laboratory, based on the mean shift property of kernel density gradient estimates. The clusters are defined by extracting the modes of the data distribution, and the technique was successfully applied to a large variety of computer vision tasks. Recently, we have extended the approach to high dimensional data using recent results from probabilistic approximation theory, and reduced the computational complexity by more than one order of magnitude. As an example, we compare three state-of-the-art texture representations (having dimensions between 4 and 48) as tools for image retrieval.

Rebecka Jornsten , Department of Statistics, Rutgers University

Title: Clustering and Cluster Validation via the Relative Data Depth

We present a new clustering algorithm DDVQ - the Data Depth Vector Quantizer. The cluster representatives are the multivariate medians. The decision boundaries between clusters are determined by equi-"gravity." The gravity with respect to a cluster is a weighted function of the average distance to other members of the cluster (silhoutte width), and the average L1-data depth with respect to the cluster. The number of clusters in a data set is chosen by maximizing the average relative gravity of observations with respect to the cluster allocation and nearest competing cluster (DDsil).

We apply DDVQ and the DDsil clustering criterion to several gene expression data sets, for the task of sample clustering as well as gene clustering. A companion visualization tool, the DDsil plot, can be used to identify outliers and tight/loose clusters. We also conduct a simulation study, comparing the performance of DDVQ and DDsil to PAM and the silhouette width and the gap statistic.

Stefan Langerman, Universite Libre de Bruxelles

Title: Optimizing Depth Functions

A depth based medians is often defned as the point that maximizes or minimizes the value of some depth function implicitly defned by an arrangement of hyperplanes, lines, or points. For the planar case, we present an effcient general approach to fInd the median exactly, for a wide range of depth functions
(e.g. Tukey depth, Oja depth, hyperplane depth) that possess certain useful properties.

(This is joint work with William Steiger.)

Jean-Claude Masse, Universite Laval, Quebec, Canada

Title: Tukey depth-based multivariate trimmed means

The Tukey depth is one of several depth functions that have been proposed to measure the degree of centrality of a multidimensional point with respect to a probability distribu tion. Informally speaking, one can think of points with high depth as being close to the"center" of the distribution and points with low depth as belonging to the tails. Given a d-dimensional dataset, a depth-based trimmed mean can thus be defned by averaging in some sense those points whose depth with respect to the empirical distribution is higher than a fixed level. Two approaches to Tukey depth-based trimming will be distinguished. Focussing on one of these approaches, properties such as equivariance, robustness, strong and weak convergence will be examined. Results of a simulation comparing the accuracy and robustness of several bivariate location estimators will also be presented.

Karen McGaughey, Kansas State University

Title: A Test for Equal Variances Using Data Depth

Simplicial data depth is a method for reducing the dimension of a multivariate data set to IR. Simplicial data depth is also applicable to univariate data sets, although no data reduction takes place. A new method for testing equality of variances using the ideas of simplicial data depth in the case of two samples of univariate data is developed and studied. In addition, a method for calculating univariate data depth using a rank transformation is introduced. Type I error rates and power curves are compared for existing tests for equality of variances and the data depth test using data simulated from the normal distribution and 5 nonnormal distributions. Examples of the application of this testing procedure, as well as an extension of the test to factorial treatment structures and samples of bivariate data are also illustrated.

Pat Morin, Carleton University, Canada

Title: Computing the Center of Area of a Convex Polygon

The center of area of a convex planar set X is the point p for which the minimum area of X intersected by any halfplane containing p is maximized. We describe a simple randomized linear-time algorithm for computing the center of area of a convex n-gon.

(This is joint work with Peter Brass and Laura Heinrich-Litan.)

Ivan Mizera, University of Alberta

Title: Beyond Multivariate Location: Halfspace Depth in Various Models of Data Analysis

While most of the current depth considerations involve multivariate location model, it is possible to define halfspace depth in various other data-analytic setups. The examples include depth in linear, multivariate linear, orthogonal and auto- regression, but also nonlinear and location/scale models. I will comment on various statistical and data-analytic aspects of the resulting procedures - in particular how far the maximum halfspace depth estimators can be regarded as a kind of medians in the investigated models.

Karl Mosler, Universitat zu Koln, 50923 Koln, Germany

Title: Data Analysis by Zonoid Depth

The zonoid depth is useful in multivariate data analysis in order to describe an em pirical distribution by trimmed regions, so-called zonoid regions. The zonoid regions range from the convex hull of the data to their mean and characterize the distribution in a unique way.

The deepest point, which equals the mean, depicts the location of the data. The volumes of the regions measure dispersion and dependence; their shape reects dependence and asymmetry. Both, the zonoid regions and the depth, have nice analytic and computational properties. They lend themselves to multivariate statistical inference, including nonparametric tests for location and scale. The uniform order of zonoid depth functions (= inclusion of zonoid regions) defines an order of dispersion that is slightly weaker than the multivariate convex order. This order has a broad range of applications in comparison problems in economics and other fields.

The talk starts with the notion of zonoid depth as a general convex data depth. It covers its principal properties, in particular, uniqueness, the projection property and various continuity properties of the depth and the trimmed regions. Then computability issues are discussed. Finally, recent applications to the classification problem and to outlier identification are presented.

References:

Mosler, Karl (2002). Multivariate Dispersion, Central Regions and Depth. The Lift Zonoid Approach. Springer, New York.

David M. Mount, University of Maryland

Title: Analyzing The Number of Samples Required for an Approximate Monte-Carlo LMS Line Estimator

One common method for computing the Least Median of Squares (LMS) robust line estimator is based on a simple Monte-Carlo approach. Some number of pairs of distinct points are sampled at random, and each pair of points determines the slope of a line. In O(n log n) time, it is possible to determine the narrowest strip whose bounding lines have this slope and which encloses a certain fraction (e.g., half) of the points. The narrowest such strip is returned over all the samples as the final approximation.

Although inliers are usually assumed to lie exactly on the line, this is rarely the case in practice. Thus the broader issue considered here is how many pairs of points should be sampled to guarantee that, with sufficiently high probability, aMonte-Carlo type estimate will return a sufficiently close approximation to the LMS line of fit? We consider this problem in the context of a number of different point distributions.

(This is joint work with Nathan S. Netanyahu and Eliyahu Zuck of Bar-Ilan University.)

Hannu Oja, University of Jyvaskyla, Finland

Title: On Data Based Distances Between Data Vectors and Between Hyperplanes

The angular distance between two data vectors and the distance between two data points are defined as the proportions of certain data based hyperplanes that separate the two points. Interchanging the roles of data vectors and hyperplanes the distances between two hyperplanes can be defined in a symmetric way. As an application, several different scatter matrix estimates are constructed which utilize these distances. The connections to certain L1 objective functions are pointed out.

Steve Portnoy, University of Illinois

Title: Maximal Depth Estimators of Regression Quantiles for Censored Data

Quantile regression methods offer a powerful and natural approach to analyzing statistical variability caused by heteroscedasticity in the data or inhomogeneity in the population. Classical regression quantile estimators can be computed efficiently by linear programming methods (see Portnoy and Koenker, Stat. Science, 1997, to see an algorithm that is strictly faster than least squares for large random problems). Recent work has developed an algorithm based on recursive reweighting that provides a valuable generalization of these regression quantile estimators to the case of censored data. It thus permits a exible approach to model survival times directly. It can offer important new insights when there is heterogeneity in the population, and it gives a valuable complement to the traditional Cox Proportional Hazards approach.

However, the Koenker-Bassett regression quantiles are not robust to outlying values of the design variables. Fortunately, it is possible to define a weighted version of the maximal depth median estimator to provide highly robust estimators of population regression quantiles. By recursively computing these estimators along a grid of probabilities and using the reweighting ideas above, it is possible to extend these maximal depth regression quantile estimators to the censored case. The basic ideas behind the algorithm, and some asymptotic arguments justifying the approach will be presented.

Sara Lopez-Pintado and Juan Romo, Departamento de Estadstica Universidad Carlos III de Madrid 28903, Getafe, Madrid, Spain

Title: A Defnition of Depth for Functional Observations

We propose a new concept of depth for functional data. Given a collection of curves, this definition of depth allows to measure the centrality of an observed function and it provides a natural center-outward order for the curves in the sample. The finite-dimensional version of this depth can be seen as an alternative definition of depth for multivariate data. We show that it basically satisfies the four key properties that a depth should verify as introduced by Liu (1990) and later analyzed by Serfling and Zuo (2000). These desirable properties are: invariance, maximality at center, monotonicity with respect to the deepest point and vanishing at infinity. The Law of Large Numbers and the Central Limit Theorem for this new depth are established using results on U-processes from Arcones and Gine (1993). Simulated results show that the trimmed mean for a sample of curves gives a better performance than the mean for contaminated models. Real data sets are used to illustrate the properties of our method.

Eynat Rafalin, Tufts University

Title: A computational tool for Depth-Based Statistical Analysis

A computational tool for Depth-Based Statistical Analysis We present a software tool for statistical research, developed at Tufts University, based on the notion of data depth that is designed for scientists with no computer science background. The algorithms used for coding the tool are efficient (many times optimal) and robust. A general description of the tool, including existing and future statistical modules will be presented.

We hope that this talk will enhance collaboration with researchers and will stir discussion on future developments.

Mario Romanazzi , "Ca' Foscari" University of Venice, Italy

Title: Data Depth in Multivariate Analysis: Dependence, Discrimination and Clustering

The data depth approach offers a nonparametric alternative to classical methods of multivariate analysis. We present some recent results about interdependence analysis and suggest depth-based procedures for discrimination and clustering. Interdependence analysis is commonly focused on linear relations and uses the pairwise covariances and correlations. Most popular techniques, including regression, principal components and canonical correlation, rely on specific functions of the covariance or correlation matrices. The results are optimal when the underlying distribution is elliptically contoured but the lack of robustness is a potential drawback in the analysis of real samples. On the other hand, the depth approach has a wider coverage than just the elliptical family and the functionals of of the inner depth regions are generally robust. When studying the interdependence of the components of a distribution F, useful information is conveyed by the comparison of the depth regions of F with the corresponding regions of F0, the independence distribution associated with F. Scalar measures of the degree of interdependence are based on the volume or the diameter of the depth regions. Graphical displays of the behaviour of such measures with respect to depth order or central regions coverage are also available.

Discriminant analysis looks for optimal rules of allocation of units to a distribution Fm belonging to a family of M alternative distributions {F1; : : :; FM}. Well-known parametric procedures are the linear or quadratic discriminant functions whereas a nonparametric procedure is the nearest-neighbour majority vote. A simple depth-based criterion is to assign the data point x to the distribution giving maximum depth to x. When the depth function is halfspace or simplicial depth and F1; : : : ;FM are elliptically contoured, the previous rule assigns x to the distribution with minimum Mahalanobis' distance which suggests that the results should be comparable with those produced by the linear or quadratic discriminant functions. However, more robustness is expected.

Cluster analysis aims to identify the components of a mixture distribution F = ∑ m=1 α m; ∑ α m = 1, where M, the number of components, is usually unknown. A different data-analytic formulation requires the partition of a set of n units into M sub- sets so as to maximize the internal omogeneity. We suggest an algorithm similar to the popular K-means. Starting from an initial M-groups partition, the units are iteratively reallocated, and the groups correspondingly updated, so as to maximize the average depth of each group. A remarkable benefit is that no distance function is required.

References

Hastie, T.,Tibshirani, R., Friedman, J., The Elements of Statistical Learning, Springer, 2001, New York.

Kaufman, L., Rousseeuw, P. J., Finding Groups in Data, Wiley, 1990, New York.

Romanazzi, M. (2002), Data depth and correlation, submitted.

Vera Rosta , Department of Mathematics and Statistics, McGill University, Canada

Title: Exact, Adaptive, Parallel Algorithms for Data Depth Problems

Different notions of data depths were introduced in non-parametric statistics as multivariate generalizations of rank methods to complement classical multivariate analysis. The location depth of a point p relative to a data set in d-dimension, is the smallest number of data observations in any closed halfspace through p. Computing the location depth of a point was shown to be NP-complete in 1978 by Johnson and Preparata. Complexity theory hardness results are analysing worst case time complexity. For statistical applications on the other hand it might be more important to have exact algorithms that compute data depth measures for actual data. We introduce a new approach to the computation of data depth measures, by proposing exact adaptive algorithms, whose time complexity depends on the data at hand. We developed and implemented the first exact, adaptive, highly parallelized, algorithms to compute the location depth of a point in Rd, and for the construction of the location depths regions boundaries using enumeration algorithm for hyperplane arrangements. Additional branch and bound type technique can shorten computation time for actual data. Similar approach can be used to compute regression depth.

(Joint work with K. Fukuda)

Robert Serfling, University of Texas at Dallas

Title: An Overview on Depth Functions in Nonparametric Multivariate Analysis Outlyingness, depth and quantile functions are playing increasingly significant roles in nonparametric multivariate inference. This overview will introduce the subject via discussion of historical antecedents, some general perspectives, recent and current developments, and various open issues. Particular points of focus will include qualitative criteria for depth and outlyingness functions, connections with quantile functions, structural and smoothness properties, nonparametric multivariate descriptive measures, contours and level sets, outlier identification, convergence behavior of sample versions, depth-based statistical functionals, connections with multivariate signs and ranks, depth-based multivariate trimmed means, computational issues, and application contexts. Some comparative analysis of various depth functions along with provocative comments might be made, for example: the likelihood depth is not a depth function!

Kesar Singh, Rutgers University

Title: Comparing Multivariate Scale Using Data Depth: Testing for Expansion or Con- traction

Consider the test of scale difference for two multivariate distributions. The sample with the smaller scale would tend to cluster more tightly around the center of the com- bined sample. The center-outward ranking induced by data depth thus gives rise to a nonparametric rank sum test for scale comparisons. This can be viewed as multivariate generalizations of the Siegel-Tukey and Ansari-Bradley rank tests for testing the equality of variance between two samples in the univariate setting. We introduce such a multivariate rank test, investigate its properties, present simulation results, and provide a comparison study under normality. We also propose ways to address the issue of large number of ties. Furthermore, we generalize this rank test to the testing of scale difference in the setting of multiple multivariate samples.

(Joint work with Regina Liu.)

Suresh Venkatasubramanian , AT&T Labs - Research

Title: Real-time Computation of Data Depth Using the Graphics Pipeline

For many problems, using data depth notions to cluster and organize data is important not only as an analysis tool, but as a visualization tool. Depth contour plots can give a quick idea as to the shape of a distribution and suggest further exploratory directions. Thus, especially for time-varying or moving data, being able to visualize the results of a data depth computation is very valuable.

Modern graphics hardware is most commonly used for displaying complicated geometric scenes, but is now also being used as a fast stream coprocessor, and has been shown to provide dramatic improvements in running times for a variety of computations. In this talk, I will present a series of examples demonstrating how the graphics pipeline can be used to compute depth contours for various depth measures efficiently. In fact, this approach leads to the fastest implementation for computing the half-space depth contours of a set of points in two dimensions, and also allows us to render depth contours for moving points.

(This is joint work with Shankar Krishnan and Nabil Mustafa.)

Jim Wang, University of Texas at Dallas

Title: A Depth-Based Kurtosis Functional

A nonparametric multivariate kurtosis functional defined over multivariate distribu- tions is introduced as a particular transform of the spread functional associated with any family of depth-based central regions. The functional is defined for each probability level as the relative difference between equiprobable volumes taken toward the "center and toward the tails," starting from the "shoulders" of the underlying multivariate distribution. For an elliptically symmetric distribution, this functional characterizes the probability distribution within affine equivalence. This supports the use of its sample version for goodness-of-fit testing. Plots of the functional for various typical univariate and multivariate distributions will be shown. In the univariate case, the functional reduces to a quantile-based measure given by Groeneveld and Meeden (1984). For this case, new results on the asymptotic behavior of the sample version are presented, which support the use of the sample kurtosis functional for a competitive new test of univariate normality. Extension to the multivariate case is partially accomplished, with some open technical and computational issues remaining that involve the underlying spread functional. Some related tailweight measures and other kurtosis measures will also be discussed.

Yijun Zuo , Michigan State University

Title: Computation of Projection Depth and Related Estimators

In this talk, we review some prevailing notions of data depth and address the related computing issues. Especially, we focus on the computing problem of the projection depth and related location/scatter estimators for bivariate and multivariate data. Exact as well as approximate algorithms for the projection depth are discussed. Examples are given to illustrate the algorithms. Some open computing problems related to the projection depth are also presented.

Previous: Announcement

Next: Program

Workshop Index

DIMACS Homepage

Contacting the Center
Document last modified on May 12, 2003.