MDS is widely used in the social and behavioral sciences. Its goal roughly is to take a multivariate data set and represent it in a low dimensional Euclidean space so as to minimize any distortion of the data. Often this is a representation in 2 dimensions. At its first meeting, the working group explored nonlinear and nonmetric versions of MDS, fitting of various non-Euclidean representations in both the two- and three- way cases, and the need to develop techniques that can be applied to massive data sets. This last problem, of dealing with massive data sets, is difficult because it will require the development of entirely new techniques, since most of the existing ones are extremely computationally intense and so tend to limit the size of data arrays quite severely. One promising approach involves the random deletion of a substantial portion of the data. Preliminary results indicated that as much as 60\% could be deleted without a serious effect on the output. Other approaches involve using heuristic approaches to get close to the solution and then trying to refine the output of the heuristic. This is work done by Willem Heiser and his colleagues from Leiden University. Since one well-known approach to fitting two-way Euclidean MDS models involves a singular value decomposition (SVD) of a derived matrix of scalar products, and since methods already exist for implementing the SVD on very large matrices, one approach, taken by the (unfortunately recently deceased) Mark Rorvig and David Dubin in some collaborative work with Douglas Carroll involved applying methods for SVD of massive data sets to solving this particular version of MDS for the case of extremely large matrices of proximities. This would involve proximity data on a very large number of stimuli or other objects. Various approaches are being explored for extending such approaches to other, more complex MDS models and methods.
The main accomplishment of the first meeting of this group was the development and enhancement of cross-disciplinary research efforts. Here are the highlights of these endeavors.
Larry Hubert (Psychology, University of Illinois), Phipps Arabie and Douglas Carroll (Graduate School of Management, Rutgers) together with Michael Brusco (School of Business, Florida State University) are all exploring various mathematical programming techniques to fit MDS models, including various possible collaborative efforts.
David Dubin (Library Science, University of Illinois) Douglas Carroll (Rutgers) and Michael Trossett (Math., William and Mary) are all exploring various approaches to MDS of massive data sets including, as already alluded to, possible extensions of some already established research in this area.
There was also the start of or enhancement of collaborations among
academic participants and industrial scientists such as Anil
Chaturvedi of Kraft Foods and Andreas Buja of AT\&T Laboratories.
Talks were primarily focused on algorithmic aspects of MDS. Primary emphasis was given to use of algorithms not typically used for fitting MDS models, such as linear and mixed integer programming, nonlinear or dynamic programming, and other optimization methods that have not been traditionally applied to fitting MDS and related models-- especially when these can be used to fit models not easily amenable to more traditional optimization techniques, such as various gradient based procedures. Another class of algorithmic issues considered had to do with approaches for increasing the size of data sets MDS and related methods can deal with, either via improvements in existing algorithms aimed at speeding them up considerably, as well as enabling them to deal with larger data sets, or by use of heuristic methods that may not precisely optimize a well defined criterion of fit, but may allow dealing with much larger data sets efficiently. Papers of this type can generically be classified as papers on MDS and related techniques for Massive Data Sets (MDS for MDS), or ``Data Mining'' in the context of MDS and related methodology.
The class of MDS and related models that were dealt with included, in addition to spatial models for proximity data, multidimensional or multiattribute models for preferential choice or other multivariate data, non-spatial or discrete models, such as tree structure, (overlapping or non-overlapping) clustering models, or ``hybrid'' models combining aspects of continuous spatial and discrete non-spatial models (e.g., a model for proximity data in which proximities are related to a sum of distances from an MDS-like spatial model and an ultrametric or path length metric defined on one or more tree structures; alternatively, the discrete component could consist of distances or distance-like measures defined on pairs of objects based on, say, an overlapping clustering structure).
This material is based upon work supported by the National Science Foundation under Grant No. 0100921