Public Workshop: Friday, August 10, 2001
Location: DIMACS Center, CoRE Building, Rutgers University
This material is based upon work supported by the National Science Foundation under Grant No. 0100921
The term ``datamining'' has come to describe any type of data analysis, with an emphasis on the need to discern new, useful information from the large quantities of data stored in today's databases. Though the term ``datamining'' is relatively new, the fundamental ideas behind it are not, and the potential approaches to it are grounded in the basic theoretical and algorithmic approaches to data analysis developed over the last several decades. What is new is the challenge of modifying these approaches and developing new ones that can scale to the very large size of the problems that are faced or that apply to new types of data and in new types of applications. Most algorithms currently used in datamining do not scale well when applied to very large data sets, often because they rely on random access to the data sets, which scales only while the data sets fit entirely in relatively small main memories. Multidimensional scaling (MDS) is a case in point. The methodology has its roots in work going back to the first half of the 20th century and its modern roots in work of Shepard [B35, B36], Kruskal [B29, B30], and others. Yet, the traditional algorithms of MDS do not always work efficiently with today's large data sets and the traditional models of MDS are not always the most appropriate for the complex data types that we are seeing today. There has been in recent years a major trend toward the development of totally new and different algorithmic procedures for MDS as well as the modification of traditional procedures; an emphasis on the development of new models of MDS and on the development of new algorithmic procedures for fitting these new models; and a stress on the application of MDS to new types of data and new fields. These new algorithms, new models and new applications are being developed from different points of view and by researchers with different backgrounds and interests. It is our plan to get the developers of new models of, new algorithmic approaches to, and new applications of MDS together to share ideas and explore new approaches. We will also include experts in the applications areas and individuals who have not necessarily worked on MDS before but are expert in methodologies such as combinatorial optimization that are increasingly relevant to MDS.
Among the specific areas of emphasis for this working group are challenges to MDS research from:
In marketing, where MDS has also been widely applied, MDS of proximities is used to develop perceptual maps of products, which can be used to provide greater insight into the nature of that product class, and to devise predictive models of consumer choice. For example, an MDS map of cars will generally reveal, among others, a ``sportiness'' and a ``luxuriousness'' dimension, as well as more specific dimensions having to do with styling, fuel economy, overall size, etc. Data on preferences for cars by consumers can be used to fit either a vector or ideal point model or one can map either subject vectors or ideal points into the multidimensional perceptual space, in order to account for each individual consumer's preferences in terms of these perceptual dimensions. One application to marketing involves what is sometimes called ``gap analysis''; i.e., looking for gaps or empty spaces in the product map which, if filled, would correspond to the most preferred product for a substantial subgroup of consumers. Such gaps might represent opportunities for a product manager to develop a new product or products that would command a significant market share, and thus result in significant profits to his or her company. As noted in the discussion of our streaming data analysis working group, massive data sets from supermarkets provide important challenges for datamining. We will involve in our working group an expert on the use of MDS in marketing in the food industry.
MDS has been used for a long time in the study of social networks. Recently, there has been much interest in social networks called telephone call graphs, where the vertices are telephone numbers and the edges correspond to calls between them. Such call graphs also arise in the fraud and intrusion detection applications to be studied by the working group in streaming data analysis and mining. These graphs are extremely large. Buja, et al. [B5] use MDS to visualize a part of a call graph consisting of millions of vertices (data from [B1]). Social networks also arise from scientific collaborations ([B38]). In [B5], large scientific collaboration networks are visualized using MDS methodology, clearly identifying clusters. We will include experts on these new, large social networks in our working group.
MDS is used regularly in biology, for example in ecology to compare communities or other biological assemblages, in biological taxonomy to discriminate between populations and/or species on the basis of morphometric or genetic data, and in evolutionary theory. An early attempt to apply MDS methods to ecology is in [B6]. Ecological data of a multidimensional quantitative, semiquantitative, and qualitative nature of all kinds are described in the recent book [B32]. Recently, MDS methods have been used in [B34] in conjunction with methods developed to study role assignments in social networks. This is a fascinating new idea and challenge for MDS and an exciting application of social-scientific methods in the biological sciences. We will bring together in our working group those who have worked in the theory of social roles and those who are working on applying this theory to ecology, as well as experts in MDS and other classification methodologies useful in ecology.
MDS-like models arise in the field of biomolecular conformation in biochemistry. This work, concerned with reconstructing the geometry of molecules from neighborhood/distance information, goes back to the 1970's ([B18]), but has found modern stimulus through the growing field of computational chemistry ([B20], [B22]). While there are only 20 amino acids, the understanding of their interrelations, similarities, and differences is very complex and amino acid databases include hundreds of quantitative properties of such amino acids. To summarize properties of amino acids compactly, a readily visualizable mapping of these properties would be very helpful and could benefit from MDS-like methods. Early approaches to this problem can be found in [B21]. We will bring together biologists, experts in classification/clustering working on biomolecular and protein databases, and MDS experts interested in biochemistry in our working group. DIMACS is a natural place for such interdisciplinary work; our ``Special Focus on Mathematical Support for Molecular Biology'' has brought together mathematicians, statisticians, computer scientists, molecular biologists, and computational chemists with great success since 1994.
Broader models of MDS developed by psychometricians, mathematicians, and statisticians have been applied in recent years to chemical data by a group of researchers called TRICAP (TRIlinear models in Chemistry and Psychology). A recent special issue of the journal Chemometrics (Vol. 14, #3, 2000) dealt with the methodology for and applications of multilinear models and the challenges for development of new, more sophisticated methods for handling 3-way and multi-way generalizations of factor and/or components analysis. Applications included data about polynuclear aromatic hydrocarbons, pesticides, and fluorescence. Our working group will include individuals who are interested in the relation between MDS and chemometrics.
The field of graph layout/graph drawing has become a very important field from the point of view of information visualization. There is now an annual graph drawing conference and the drawing of large graphs arises in many contexts, including communication networks, electrical networks, wiring diagrams, etc. Graphs with thousands of vertices arise frequently in such applications. The paper [B5] describes the use of interactive graphics in MDS to handle rather large data sets in a graph layout context. (See [B33] for a predecessor paper.) DIMACS has sponsored a variety of programs on graph layout, including summer tutorial programs on network visualization, and is well connected to the research community in this field. We will involve people with interests in graph layout and MDS in the working group.
Three-way MDS (or individual differences MDS) deals with a number of proximity matrices, one for each of a number of subjects or others sources of data, which can be viewed as a three-way array (e.g., subjects by stimuli by stimuli). We think of two modes -- a mode corresponds to a set of entities (in our example, subjects and stimuli). Thus, a method of analysis of such data is referred to as three-way, two-mode MDS. The most widely used approach to three-way MDS entails fitting some variant of the INDSCAL model (for INdividual Differences Multidimensional SCALing Model) ([B8]). This model assumes weighted Euclidean distances with a different pattern of dimension weights for each subject or other source of data corresponding to the third way of the data array.
In addition to spatial and discrete geometric models, MDS also considers a family of hybrid models comprising a mixture of spatial and discrete components; e.g., a mixture of an R-dimensional Euclidean spatial representation and a multiple tree structure.
The broad definition of MDS also includes models for two-way, three-way and higher-way multivariate data. Certain models were called ``unfolding models'' by Coombs, but are perhaps better described as ``ideal point'' models. Here, scale values representing preferences are assumed to be inversely related to distances between ideal points representing subjects (say) and stimuli in the two-way case, or subjects by occasions (say) and stimuli in the three-way case (or, more generally, combinations of all ways/modes other than the stimulus mode and stimuli in the general N-way case). Similar models apply when preference is replaced by any notion of ``dominance'' and preference scales by dominance scales. Another class of non-distance-based spatial models for N-way multivariate data comprise various bilinear, trilinear or multilinear models. Perhaps the simplest case, given two-way data (say subjects by stimuli preference data), is called the vector model. The two-way factor or principal components model can be viewed as a special case of the scalar products model underlying the vector model. In the case of three-way or higher-way data one general model assumed is the one called the CANDECOMP/PARAFAC model. This generalizes the bilinear scalar product model to a general multilinear model which can be viewed as a kind of generalized scalar product model. A more general multilinear model that is also considered is Tucker's three-mode factor analysis model, which can also be generalized to the multi-way case. A recent review of the field, based on the ``broad'' viewpoint described in this subsection, and with a discussion of examples of a wide variety of specific models and methods, is provided by Carroll and Arabie [B7].
The increasingly varied types of applications of MDS and the increasingly complex types of data to which MDS methods are applied has led to this proliferation in models for MDS. In this working group, we will discuss a variety of such models and pinpoint conditions under which data sets might be most appropriately modeled by different types of spatial, discrete geometric, or hybrid models.
There have also been attempts to unify approaches to these more general MDS models. A recent paper [B9] formulated a general model and methodology called ``CANDCLUS'' that includes a large number of these models as special cases. The authors are in the process of implementing one special case involving fitting a hybrid model combining INDSCAL structure and the individual differences (overlapping) clustering model INDCLUS. Algorithms implementing a number of other special cases will be pursued by this working group. More generally, other unifying approaches to MDS models will be a major topic of discussion.
Brusco and Stahl [B3, B4], Chen [B17] and Simantiraki [B37] have applied combinations of linear programming combined with simulated annealing and mixed integer-linear programming to solving MDS problems, particularly involving either unidimensional problems or MDS problems with city block or other L p metrics, via optimizing a Least Absolute Deviation (LAD) or Minimax Deviation fit measure.
One goal of this working group is to explore use of such approaches as dynamic, linear, mixed integer-linear, and other mathematical programming techniques instead of, or in combination with, continuous optimization, to fit various MDS and MDS-related models to both proximity and more general multi-way data. Most of these techniques are sufficiently computationally intensive that, even with relatively small problems, they will generally require considerable computer time, and some problems may be beyond reach via such methods without also utilizing such supplementary techniques as simulated annealing, or heuristic methods designed to produce very good starting points. Even good local-search heuristics can be computationally inefficient given today's large data sets and need new methods such as the ``morphing'' process of Brusco [B2]. DIMACS is a natural place in which to explore the use of methods of dynamic, linear, mixed integer-linear programming. The center has an active and well-known research group in this field and just recently concluded a ``Special Year in Large Scale Discrete Optimization''. In that special year, there was an emphasis on new and more powerful algorithms developed in recent years, practical implementations involving codes and systems for such discrete optimization problems, and on practical applications leading to the fast solution of real world problems such as airline and railway scheduling, chemical process design, and telecommunications network design. Some DIMACS members and others with expertise in combinatorial optimization, but who have not previously worked on MDS, will be involved in our working group.
In addition to such techniques as these, more sophisticated continuous optimization techniques like majorization have been used extensively to deal with various MDS problems. For example, Heiser [B23] has used this approach to devise a method for fitting a weighted city block metric to three-way proximity data; in effect, this results in a city block version of INDSCAL. While we will put a lot of emphasis on non-continuous optimization, we will also explore these more sophisticated continuous optimization techniques.
In the case of very large data sets, various heuristic methods and/or approximate techniques need to be pursued. Another theme of this working group will be approximation and heuristics. These methods will enable us to fit models while not guaranteeing precise satisfaction of an OLS, LAD or other explicit fit measure. For example, a general approach called CANDELINC ([B15, B13]) offers the promise of getting good approximate solutions with large data sets, as tested empirically (with fairly small data sets) in [B11, B12]. Heuristic methods have been fairly successful in fitting tree structure and multiple tree structure models to proximity data ([B14, B16]). Combining such heuristic and approximate methods with some optimization techniques only recently applied to these problems should lead to improved algorithms both for finding globally optimal solutions for relatively small datasets, as well as for finding suboptimal, but practically acceptable, solutions for larger sets of data, and the working group will pursue this idea.
Kruskal and Hart [B31] did some very early work on generating an approximate MDS solution for a very large data set by first selecting a ``representative'' subset of objects, fitting a small dimensional MDS solution to these representative points, and then ``mapping'' the remaining points into this small dimensional space based on their similarities to those representative points. Approaches such as this, in addition to some of the other heuristic and approximate methods discussed above, might be usefully applied to fitting various MDS and MDS-related solutions to very large data sets, and our working group will discuss them.
Shepard [B35, B36] and Kruskal [B29, B230] introduced ``nonmetric MDS'' to deal with the situation when the input data is only known up to order, but the spacing between data values is not meaningful. (Thus, there is no metric on the inherently unidimensional data values.) The traditional analytical solutions of MDS are then no longer feasible and one needs to be concerned with algorithms that only use ordinal properties of the data. Thus, one must not use any sort of mean, though various generalizations of medians are appropriate. Kruskal introduced gradient-based optimization procedures to seek parameter values optimizing a measure of fit, a normalized least squares measure of fit between distances and a monotonic transformation of the data. Since this pioneering work, an enormous literature has grown up entailing use of many different variants of gradiant based methods. However, virtually all the methods developed are subject to the problem that they find only a local optimum of an overall objective function, rather than the global optimum. In general, there has so far been no good way to assure that the global optimum has been obtained. The best one can do is run the algorithm being used from a number of different random starts and assume that if the solution obtained from a number of different starting configurations is the same, then this is probably the global optimum. This working assumption in practice is challenged by increasingly large data sets and thus the issue of nonmetric MDS bears revisiting by our working group.