DIMACS Working Group: Algorithms for Multidimensional Scaling II

June 11-12, 2003
Doubletree Hotel in Tallahassee, Florida

Organizers:
J. Douglas Carroll (chairman), Rutgers University, dcarroll@rci.rutgers.edu
Phipps Arabie, Rutgers University, arabie@andromeda.rutgers.edu
Larry Hubert, University of Illinois, lhubert@s.psych.uiuc.edu
Michael Trosset, The College of William & Mary, trosset@math.wm.edu
Mike Brusco, Florida State University, mbrusco@garnet.acns.fsu.edu
Mel Janowitz, DIMACS liaison, melj@dimacs.rutgers.edu
Presented under the auspices of the Special Year on Data Analysis and Mining.

Abstracts:


Ulas Akkucuk and J. Douglas Carroll, Rutgers Business School

Title: Nonlinear Mapping: Approaches Based on Optimizing an Index of Continuity and Applying Classical Metric MDS on Revised Distances

Dimensionality reduction techniques are used for representing higher dimensional data by a more meaningful lower dimensional structure. In this paper we study two such approaches, namely Carroll's Parametric Mapping (abbreviated PARAMAP) (Shepard & Carroll, 1966) and a variation of ISOMAP (Tenenbaum, de Silva, & Langford, 2000). The former relies on iterative minimization of a cost function; the latter applies classical MDS after a preprocessing step that uses a shortest path algorithm to define approximate geodesic distances. We test the PARAMAP algorithm extensively on 62 points on the surface of a sphere, with varying error levels and degrees of irregularity of spacing, as well as on the Swiss roll data initially used to test ISOMAP, and we compare the results to ISOMAP-produced embeddings of the sphere and the Swiss roll. We develop a measure of congruence between the input data and the mapped low-dimensional embedding based on the preservation of local structure, and compare the different approaches using this measure. The measure can also be used to test the statistical significance of an obtained configuration, with the help of a randomization procedure we developed and the application of simple probabilistic concepts (e.g., Chebyshev's inequality).

PARAMAP turns out to be clearly superior to ISOMAP for both kinds of data on which we tested the programs, especially in the case of the sphere. There, ISOMAP produces mappings resembling a projection of the points into the lower dimensional space, much like what the essentially linear (or bilinear) classical MDS or PCA would do. PARAMAP, on the other hand, manages to preserve the local structure much better than ISOMAP does, even for the Swiss roll data used in the original paper to demonstrate the effectiveness of ISOMAP (Tenenbaum et al., 2000). The difference between the mappings produced by the two algorithms is much more severe for the sphere (a "closed" data set) than for the Swiss roll (an "open" data set). This suggests that ISOMAP in its current state would be limited to such "open" or "semi-closed" data sets; for "closed" data sets, PARAMAP seems to be the only option.

PARAMAP has its shortcomings, however, chiefly its extensive computer run time. Computational improvements in PARAMAP are a must if it is to be generalized to deal with a much larger number of points (say 1000 or more). Both ISOMAP and PARAMAP could also benefit from a procedure that judiciously selects a subset of points to be analyzed and then, after analysis, extrapolates or interpolates the rest. Our current research direction is to improve the computational efficiency of PARAMAP and to devise an efficient extrapolation or interpolation procedure.
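As an illustration of the ISOMAP-style preprocessing described above (not the authors' PARAMAP code, and not the reference ISOMAP implementation), the following Python sketch builds a k-nearest-neighbor graph, approximates geodesic distances by shortest paths, and embeds the points with classical (Torgerson) MDS; the function name, neighborhood size, and example data are illustrative assumptions.

# Minimal sketch of the ISOMAP-style pipeline described above: k-nearest-neighbor
# graph, shortest-path geodesic distances, then classical (Torgerson) MDS.
# Illustrative only; not the PARAMAP or reference ISOMAP implementation.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap_style_embedding(X, n_neighbors=7, n_components=2):
    n = X.shape[0]
    D = squareform(pdist(X))                      # Euclidean distances
    G = np.full((n, n), np.inf)                   # inf entries = no edge
    for i in range(n):
        idx = np.argsort(D[i])[1:n_neighbors + 1] # k nearest neighbors of i
        G[i, idx] = D[i, idx]
        G[idx, i] = D[i, idx]
    geo = shortest_path(G, method='D')            # approximate geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (geo ** 2) @ J                 # double-centered "Gram" matrix
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:n_components]    # largest eigenvalues
    return V[:, order] * np.sqrt(np.maximum(w[order], 0))

# Example: 62 points along a spiral embedded in 3-D.
t = np.linspace(0, 3 * np.pi, 62)
X = np.column_stack([t * np.cos(t), np.zeros(62), t * np.sin(t)])
Y = isomap_style_embedding(X)
print(Y.shape)   # (62, 2)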

References

Shepard, R. N., & Carroll, J. D. (1966). Parametric representation of nonlinear data structures. In P. R. Krishnaiah (Ed.), Multivariate Analysis (pp. 561-592). New York, NY: Academic Press.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.

Michael Brusco, Florida State University

Title: Applying Concordance and Partitioning Procedures to Identify Groups of Subjects with Similar MDS Structures

A common representation of data within the context of multidimensional scaling (MDS) is a collection of symmetric proximity (similarity or dissimilarity) matrices, one for each of M subjects. There are a number of possible alternatives for analyzing these data, including: (a) conducting an MDS analysis on a single matrix obtained by pooling (averaging) the M subject matrices, (b) fitting a separate MDS structure for each of the M matrices, or (c) employing an individual-differences MDS model. We discuss each of these approaches and propose a simple method for identifying groups of individual-subject matrices with concordant proximity structures. This method collapses the three-way data into a subject × subject dissimilarity matrix, which is subsequently partitioned using a branch-and-bound algorithm. Extensive Monte Carlo testing of the method reveals that a partition-diameter index is especially effective for recovering the true subject cluster memberships present in the data.
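A minimal Python sketch of the collapsing step described above follows. The concordance measure used here (one minus the Pearson correlation of the vectorized upper triangles) and the diameter definition are illustrative assumptions, not necessarily those used in the talk, and the branch-and-bound partitioning itself is not shown.

# Sketch: collapse M subject proximity matrices into a subject x subject
# dissimilarity matrix, then score a candidate partition by its diameter.
import numpy as np

def subject_dissimilarity(matrices):
    """matrices: list of M symmetric (n x n) proximity matrices."""
    n = matrices[0].shape[0]
    iu = np.triu_indices(n, k=1)
    vecs = np.array([m[iu] for m in matrices])   # M x n(n-1)/2
    corr = np.corrcoef(vecs)                     # subject x subject concordance
    return 1.0 - corr                            # small value = concordant pair

def partition_diameter(D, clusters):
    """Largest within-cluster dissimilarity over all clusters (one common definition)."""
    diam = 0.0
    for c in clusters:
        idx = np.array(c)
        if len(idx) > 1:
            diam = max(diam, D[np.ix_(idx, idx)].max())
    return diam

# Example with four random "subjects", each a symmetric 10 x 10 matrix.
rng = np.random.default_rng(0)
mats = []
for _ in range(4):
    a = rng.random((10, 10))
    mats.append((a + a.T) / 2)
D = subject_dissimilarity(mats)
print(partition_diameter(D, [[0, 1], [2, 3]]))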


Frank M.T.A. Busing, Leiden University

Title: PREFSCAL: A Program for 3-Way 3-Mode Multidimensional Unfolding

There are at least two reasons why unfolding is not commonly used as a data analysis method.

Almost since its introduction, unfolding has suffered from degeneracy. Degeneracy is characterized by perfect but uninformative solutions of the unfolding analysis, in which the distances between the two sets of objects are all identical. Various remedies, and in some cases causes, have been suggested, but these have not led to a mature analysis method. The second reason concerns availability: unfolding programs that claim to produce nondegenerate solutions are not widely or commercially available.

PREFSCAL is being developed in Leiden. The program minimizes penalized Stress, an approach that successfully avoids degenerate solutions in most circumstances. PREFSCAL aims at the same functionality as PROXSCAL, a companion program for three-way MDS distributed in the SPSS package CATEGORIES, with various three-way models, conditionalities, transformations, and restrictions. In the very near future, PREFSCAL will be available in SPSS as well.
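To illustrate the general idea of penalizing degenerate solutions, the Python sketch below multiplies a normalized Stress value by a term that grows as the coefficient of variation of the fitted distances shrinks, so that the all-distances-equal pattern described above is no longer attractive. This follows the spirit of Busing, Groenen, and Heiser (2001) but is not the exact PREFSCAL loss function; the weight omega is an illustrative assumption.

# Illustrative penalized-stress-style loss: normalized Stress times a penalty
# that blows up when the fitted distances become nearly constant.
# Not the exact PREFSCAL loss function.
import numpy as np

def penalized_stress(delta, dist, omega=0.5):
    delta, dist = np.asarray(delta, float), np.asarray(dist, float)
    stress = np.sum((delta - dist) ** 2) / np.sum(delta ** 2)  # normalized Stress
    cv = dist.std() / dist.mean()                              # coefficient of variation
    return stress * (1.0 + omega / cv ** 2)

delta = np.array([1.0, 2.0, 3.0, 4.0])                 # observed preferences
print(penalized_stress(delta, np.array([2.49, 2.5, 2.5, 2.51])))  # near-degenerate: huge
print(penalized_stress(delta, delta * 0.9))                       # informative fit: small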

References

Busing, F.M.T.A., Groenen, P.J.F., & Heiser, W.J. (2001). Avoiding degeneracy in multidimensional unfolding by penalizing on the coefficient of variation. Manuscript submitted for publication.

Heiser, W.J., & Meulman, J.J. (1999). SPSS Categories 10.0. Chicago: SPSS Inc.


James E. Corter, Columbia University

Title: Representing Asymmetric Proximities Using Trees and Directed Trees

Asymmetric proximity data is often analyzed by symmetrizing the matrix and using a standard two-way one-mode multidimensional scaling or clustering technique. This practice may conceal meaningful structure in the asymmetries. It is proposed here to use a mix of directed and undirected trees to represent asymmetric proximities. In this method, dubbed DITREES, the asymmetric proximity matrix is first symmetrized by averaging, and the residual terms are computed. A mixture of two trees, with the same topology but different arc weights, is then fit to the n(n-1)/2 symmetric proximities, plus the n(n-1)/2 transformed antisymmetric residuals. The two-tree solution is represented graphically by a hybrid undirected/directed tree, such that the length of each arc represents the corresponding parameter of the first tree fit to the symmetrized proximities, and a directional component to each arc (represented graphically by an arrowhead) represents the corresponding parameter of the second tree fit to the transformed residuals.
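The decomposition that DITREES starts from can be stated compactly: the symmetrized proximities are the elementwise average of the matrix and its transpose, and the antisymmetric residuals are what remains. A minimal Python sketch of this preprocessing step follows; the proximity values are hypothetical, and the fitting of the two trees themselves is not shown.

# Sketch of the DITREES preprocessing step described above: split an asymmetric
# proximity matrix into a symmetric part (fit by an undirected tree) and
# antisymmetric residuals (represented by arc directions).
import numpy as np

P = np.array([[0.0, 2.0, 5.0],
              [4.0, 0.0, 1.0],
              [7.0, 3.0, 0.0]])        # hypothetical asymmetric proximities

S = (P + P.T) / 2.0                    # symmetrized proximities (averaging)
R = P - S                              # antisymmetric residuals, R = -R.T
assert np.allclose(R, -R.T)
print(S)
print(R)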


David Dubin, Gabriel Ripoche, and Les Gasser, University of Illinois

Title: Organizational Dynamics of Software Development

Mistakes, errors, and problems are a common element of working life, and a part of all settings at one time or another. In software production, the work of identifying and resolving errors, bugs and mistakes plays a large part. Operating system idiosyncrasies or seemingly random "glitches" in program runs keep designers from refining programs which are "almost perfect". In the worst cases, buggy software can present major threats to security, economic health, and even lives.

As complex software artifacts proliferate and become more central to, even ubiquitous in, people's lives, the social cost of software errors and bugs may increase. Since errors and bugs reduce the effectiveness with which we can build and use software systems, we're interested in understanding how bugs come about, how they can best be managed, and how people who build and use advanced software systems can organize their work to prevent, overcome, deal with, and accommodate problems.

Most accounts of software problems focus on flaws in technical design and usability. Surely better design, prototyping, and needs analyses can help. But there is clearly much more to the issue: specifically, the reliability of a software artifact is related to the structure of the technical and organizational processes that produce it, and to the technical and organizational infrastructures and constraints under which it is built and maintained. This research is probing several aspects of this mix, examining questions such as how bugs come about, how they can best be managed, and how the people who build and use large software systems organize their work around them.

It is difficult to analyze issues such as these without comprehensive, time- and project-specific data from large projects. With several hundred thousand problem reports from a variety of open-source projects, widely accessible open-source bug repositories provide an extremely large and diverse dataset for analyzing such issues. Data have been captured over several years, allowing for analysis of processes and their evolution over time. These repositories can be analyzed quantitatively (in terms of numbers of events, event types, response types, dates, timelines, etc.). They can also be analyzed qualitatively, by comparatively examining the texts of bug reports, responses, analyses, etc.

A third approach is exploratory analysis of the data using methods such as classical metric multidimensional scaling [4], and similar visualization tools. Here the goal is to uncover patterns, regularities, or exceptions that may be worth investigating in detail using more formal qualitative or quantitative methods. In the tradition of numerous bibliometric studies [2, 6], we are investigating patterns of cocitation in Bugzilla, the repository for the Mozilla development project [1]. Our March 2002 snapshot of the Bugzilla database contains 128,823 bug reports and over 1.2 million comments representing feedback from over 20,000 registered users.

In contributing or responding to the report of a bug, Mozilla developers refer to earlier relevant reports by ID number; the extent to which a pair of bug reports are mentioned or cited together across the database forms a measure of association that can serve as the input to MDS. We are interested in whether the data will provide insights related to the structure of Mozilla, to categories of bugs, to the organizational structure of the bug reporting system, or to the dynamics of communication among the contributors.
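The following Python sketch shows one way such cocitation counts could be turned into a dissimilarity matrix suitable as MDS input. The comment data, the cosine-style normalization, and the conversion to dissimilarity are illustrative assumptions, not a description of the study's actual processing.

# Sketch: derive a dissimilarity matrix from bug-report cocitations.
import numpy as np

# Hypothetical data: each comment lists the bug IDs it mentions together.
comments = [[101, 105], [101, 105, 230], [230, 412], [101, 230]]
bugs = sorted({b for c in comments for b in c})
index = {b: i for i, b in enumerate(bugs)}

C = np.zeros((len(bugs), len(bugs)))          # cocitation counts
for c in comments:
    for i, a in enumerate(c):
        for b in c[i + 1:]:
            C[index[a], index[b]] += 1
            C[index[b], index[a]] += 1

totals = np.maximum(C.sum(axis=1), 1)         # crude normalizer per bug
sim = C / np.sqrt(np.outer(totals, totals))   # cosine-style similarity
D = 1.0 - sim                                 # dissimilarity input for MDS
np.fill_diagonal(D, 0.0)
print(np.round(D, 2))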

In applying MDS to data sets of this size, the main problems are not computational. Efficient and scalable algorithms for metric MDS are known [5], and implementations of the necessary matrix decomposition routines are available in freely distributed libraries for both single- and multi-processor computing hardware. We therefore have no difficulty computing MDS solutions for thousands or tens of thousands of points. The main challenge of this type of analysis is discovering an enlightening solution space (or spaces) that can be readily visualized in few dimensions while accounting for enough of the variance. Taken as a whole, our data have no simple underlying spatial organization (unlike MDS applications in chemistry or molecular biology, where a three-dimensional structure is known to exist). The sparseness of our proximity matrix suggests a large number of subsets of points, any of which might provide an insightful picture when scaled.

A further problem arises in the interpretation of solutions. In bibliometric studies researchers are usually familiar ahead of time with the highly-cited authors or papers that appear in solutions. Repositories like Bugzilla offer more of a challenge in accounting for why two reports may be related, and whether or not the reason is interesting. To meet this challenge we are attempting to integrate cocitation maps with visualizations of the much larger number of records that form the context of the proximity relations that produced them. Our first attempts at contextualization involve using MDS solutions as the initial reference point configurations for VIBE projections [3].

This material is based upon work supported by the National Science Foundation under Grant No. 0205346. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] The Mozilla project website. World Wide Web address: http://www.mozilla.org/.

[2] BORNER, K., CHEN, C., AND BOYACK, K. Visualizing knowledge domains. Annual Review of Information Science and Technology 37 (2003).

[3] OLSEN, K. A., KORFHAGE, R. R., SOCHATS, K. M., SPRING, M. B., AND WILLIAMS, J. G. Visualisation of a document collection: The VIBE system. Information Processing and Management 29, 1 (1993), 69-81.

[4] TORGERSON, W. S. Theory and Methods of Scaling. Wiley, New York, 1958.

[5] TROSSET, M. W. Numerical algorithms for multidimensional scaling. In Classification and Knowledge organization, R. Klar and O. Opitz, Eds., Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, Berlin, 1997, pp. 80-92.

[6] WHITE, H. D., AND MCCAIN, K. W. Bibliometrics. Annual Review of Information Science and Technology 24 (1989), 119-186.


Willem J. Heiser, Leiden University

Title: Interpretation of Between-Set Distances In Correspondence Analysis

There has been some debate about the correct interpretation of distances between row elements and column elements in a joint display of a correspondence table. The conventional view is that we can scale this joint display in such a way that either the distances between rows can be interpreted, or the distances between columns, but never directly the distances between rows and columns (Heiser and Meulman, 1983; Greenacre and Hastie, 1987). Carroll, Green and Schaffer (1986) proposed an alternative scaling of the coordinates for which they claimed that both between-set and within-set squared distances could be interpreted, but Greenacre (1989) has shown that this claim is not warranted.

Before any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to the unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. It is shown that a between-set distance interpretation is possible in any barycentric configuration or plot, by comparison with the distances to some specific supplementary points. The distance involved is not of the chi-squared type, but simply Euclidean. The result is equally valid in the full-dimensional space and in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.
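The barycentric geometry referred to above can be made concrete with a small Python sketch: row profiles of a contingency table are barycentric combinations of the unit (vertex) profiles, and ordinary Euclidean distances from a profile to those vertices are directly available. The table values are hypothetical, and the supplementary-point construction of the talk is not reproduced.

# Sketch: row profiles as barycentric coordinates with respect to the unit
# profiles (all mass in one cell), with plain Euclidean distances to them.
import numpy as np

N = np.array([[20.0, 10.0,  5.0],
              [ 5.0, 15.0, 30.0]])            # hypothetical 2 x 3 table

profiles = N / N.sum(axis=1, keepdims=True)   # row profiles (barycentric coords)
vertices = np.eye(N.shape[1])                 # unit profiles

for i, p in enumerate(profiles):
    d = np.linalg.norm(vertices - p, axis=1)  # Euclidean, not chi-squared
    print(f"row {i}: profile {np.round(p, 2)}, distances to vertices {np.round(d, 2)}")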

References

Carroll, J.D., Green, P.E., & Schaffer, C.M. (1986). Interpoint distance comparisons in correspondence analysis. Journal of Marketing Research, 23, 271-280.

Greenacre, M.J. (1989). The Carroll-Green-Schaffer scaling in correspondence analysis: A theoretical and empirical appraisal. Journal of Marketing Research, 26, 358-365.

Greenacre, M.J. & Hastie, T. (1987). The geometric interpretation of correspondence analysis. Journal of the American Statistical Association, 82, 437-447.

Heiser, W.J. & Meulman, J. (1983). Analyzing rectangular tables with joint and constrained multidimensional scaling. Journal of Econometrics, 22, 139-167.

Lawrence J. Hubert, University of Illinois

Title: The Representation of Proximity Matrices by Tree Structures: A Tree Structure Toolbox (TST) for MATLAB

We present and illustrate the capabilities of a MATLAB Toolbox for fitting various classificatory tree structures to both symmetric (one-mode) and rectangular (two-mode) proximity matrices. The emphasis is on identifying ultrametrics and additive trees that are well-fitting in the L2 norm by heuristic extensions of an iterative projection (least-squares) strategy. The (additive) fitting of multiple tree structures is also addressed.
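The ultrametric side of the fitting problem rests on the familiar three-point condition: for every triple (i, j, k), the two largest of d(i,j), d(i,k), d(j,k) must be equal. The Python sketch below pushes a dissimilarity matrix toward an ultrametric by repeatedly averaging the two largest values in each triple; it is a crude illustration of the constraint being enforced, not the toolbox's least-squares iterative projection routines.

# Crude sketch of enforcing the ultrametric three-point condition by
# repeated averaging sweeps over all triples. Illustration only.
import numpy as np
from itertools import combinations

def toward_ultrametric(D, sweeps=50):
    D = D.astype(float).copy()
    n = D.shape[0]
    for _ in range(sweeps):
        for i, j, k in combinations(range(n), 3):
            pairs = [(i, j), (i, k), (j, k)]
            vals = [D[a, b] for a, b in pairs]
            order = np.argsort(vals)
            a1, a2 = pairs[order[1]], pairs[order[2]]   # the two largest
            m = (D[a1] + D[a2]) / 2.0
            D[a1[0], a1[1]] = D[a1[1], a1[0]] = m
            D[a2[0], a2[1]] = D[a2[1], a2[0]] = m
    return D

D = np.array([[0, 2, 6, 8],
              [2, 0, 5, 9],
              [6, 5, 0, 4],
              [8, 9, 4, 0]], dtype=float)
print(np.round(toward_ultrametric(D), 2))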


L.J. Hubert, University of Illinois
P. Arabie, Rutgers University
J.J. Meulman, Leiden University

Title: Multidimensional Scaling in the City-Block Metric: L1 and L2-Norm Optimization Methods Using MATLAB

This paper is a companion to Hubert, Arabie, and Meulman (2002), published in the same journal. The latter provides a number of different least-squares (L2) optimization strategies for linear unidimensional scaling (LUS) within a MATLAB environment. The current paper develops extensions, again within a MATLAB context, to multidimensional scaling in the city-block metric using both an L2 and an L1 (least sum-of-absolute-deviations) loss function. Although L1 is an alternative to L2, it does not appear to offer any salient advantages; it is also extremely expensive computationally to implement, and sensitive to possible ill-conditioning of the initial LUS task when formulated through its linear programming subtasks. A final generalization is given in the L2 context that incorporates optimal monotonic transformations of the original proximities, thus providing a readily available and computationally reasonable strategy for nonmetric multidimensional scaling in the city-block metric (in two and three dimensions) based on the MATLAB functions provided.
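For readers unfamiliar with the ingredients, the Python sketch below computes city-block distances between the rows of a configuration and evaluates the least-squares (L2) loss against a proximity matrix; the values are hypothetical, and none of the MATLAB optimization strategies of the paper are reproduced.

# Sketch of the basic objects in city-block MDS: city-block distances between
# rows of a configuration X, compared with observed proximities under L2 loss.
import numpy as np

def cityblock_distances(X):
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def l2_loss(prox, X):
    D = cityblock_distances(X)
    iu = np.triu_indices(len(prox), k=1)
    return np.sum((prox[iu] - D[iu]) ** 2)

prox = np.array([[0, 3, 4],
                 [3, 0, 5],
                 [4, 5, 0]], dtype=float)              # hypothetical proximities
X = np.array([[0.0, 0.0], [1.5, 1.5], [4.0, 0.5]])     # trial 2-D configuration
print(l2_loss(prox, X))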


Robert Michael Lewis, College of William & Mary

Title: Sensitivity Analysis of the STRAIN Objective

We discuss the first and second derivatives of the objective in the STRAIN approach to multidimensional scaling, in which one works directly with the squared dissimilarities. A nice feature of our approach is that the action of the derivatives on a symmetric matrix of dissimilarities is described in terms of the entire matrix, rather than in terms of derivatives of individual components. We also give a complete characterization of the spectrum of the Hessian of the STRAIN objective, and comment on various features of the Hessian. We conclude with some remarks on applying this second-order information to the STRAIN minimization problem.

(This is joint work with Michael Trosset.)
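As commonly written, the STRAIN objective double-centers the elementwise square of the dissimilarity matrix, B = -1/2 J (Delta**2) J with J the centering matrix, and measures the squared Frobenius distance from B to the Gram matrix X X^T of the configuration. The Python sketch below evaluates this objective on hypothetical data; the derivative and spectral results of the talk are not shown.

# Sketch of the STRAIN objective on squared dissimilarities.
import numpy as np

def strain(Delta, X):
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (Delta ** 2) @ J          # double-centered squared dissimilarities
    return np.linalg.norm(B - X @ X.T, 'fro') ** 2

Delta = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])          # hypothetical dissimilarities
X = np.array([[0.0, 0.0], [1.0, 0.1], [1.9, -0.1]])   # trial configuration
print(strain(Delta, X))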


Alistair Morrison, University of Glasgow

Title: Hybrid Algorithms for Force-Directed Multidimensional Scaling

Force-directed algorithms for MDS have, in the past, suffered from high computational complexity and impractically long run times. Standard spring models, in particular, run in O(N^3) time, because distances between every pair of objects must be computed at each iteration, over the roughly O(N) iterations typically required. This talk will cover algorithms developed and evaluated as part of my PhD toward improving such techniques. By combining several stochastic processes, a hybrid model has been created that executes in sub-quadratic time while still creating representative layouts in low-dimensional space. This initial hybrid model will be discussed, as well as subsequent improvements made to further reduce complexity. Results of an evaluation will be presented, detailing the models' performance on both synthetic and real data sets.
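One stochastic device of the kind mentioned above is to adjust each point against only a small random sample of other points in each iteration, dropping the per-iteration cost from O(N^2) toward O(N*s) for sample size s. The Python sketch below implements a single such iteration; the sampling scheme and parameters are illustrative assumptions, and this is not the hybrid algorithm described in the talk.

# Sketch of one stochastic spring-model iteration: each point moves toward or
# away from a small random sample of other points rather than all N-1.
import numpy as np

def spring_iteration(data, Y, sample_size=10, step=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = Y.shape[0]
    for i in range(n):
        sample = rng.choice(np.delete(np.arange(n), i), sample_size, replace=False)
        for j in sample:
            target = np.linalg.norm(data[i] - data[j])    # high-dimensional distance
            diff = Y[i] - Y[j]
            current = np.linalg.norm(diff) + 1e-9         # current layout distance
            # Spring update: move apart if too close, together if too far.
            Y[i] += step * (target - current) * diff / current
    return Y

rng = np.random.default_rng(1)
X = rng.random((200, 10))          # 200 points in 10 dimensions
Y = rng.random((200, 2))           # initial 2-D layout
for _ in range(50):
    Y = spring_iteration(X, Y, rng=rng)
print(Y.shape)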


Michael W. Trosset, College of William & Mary

Title: Multidimensional Scaling and Molecular Conformation

Historically, multidimensional scaling (MDS) has been used primarily to construct configurations of points in low-dimensional Euclidean spaces for the purpose of visualizing fallible measurements of the pairwise dissimilarities between a small number of objects. The problem of inferring 3-dimensional molecular structure from information about interatomic distances invites the application of various methods developed for MDS, but various features of the molecular conformation problem necessitate modification of traditional MDS strategies. I will review several salient features of the molecular conformation problem and survey several approaches to solving it. Two approaches, the APA algorithm developed at the University of Kentucky and work that I presented at the first DIMACS Workshop on Algorithms for MDS, are closely related to nonmetric MDS, which was originally dismissed as not relevant to molecular conformation. I will conclude by sketching some computational challenges for future research.


Suzanne Winsberg, IRCAM, Paris France

Title: Two Techniques for MDS: One for Large Data Sets with Missing Values;
the Other for Huge Data Sets Encountered When Data Mining

    
The first technique deals with the weighted extended Euclidean model, EXSCAL, proposed by Winsberg and Carroll (1989). This method has been modified to deal with missing data using an EM algorithm. Missing data designs might be used to accommodate large data sets. We present an evaluation of the algorithm for different levels of missing data.

The second method can deal with data coming from an extremely large number of sources, the very number of which makes the standard forms of analysis very difficult. In such cases, encountered in data mining, it may be advantageous to aggregate the data. Normally in MDS the dissimilarity matrix has entries consisting of a single value, d_ij, for the dissimilarity between object i and object j in the two-way case, or a single value, d_ijk, for the dissimilarity between object i and object j for source k in the three-way case. Here we deal with aggregated data, so that each entry in the dissimilarity matrix is an interval, [dmin_ij, dmax_ij], where dmin_ij is the minimum value of the dissimilarity observed between object i and object j, and dmax_ij is the maximum value of the dissimilarity observed between object i and object j. This type of data, interval data, results in a symbolic data table (see Bock and Diday, 2000). Of course, classical single-valued data tables are a special case of this type of data table. We present the method, INTERSCAL, as well as analyses of artificial and real data examples.
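The aggregation step described above can be sketched in a few lines of Python: collapse the three-way dissimilarities d_ijk over the K sources into elementwise minima and maxima, giving the interval entries [dmin_ij, dmax_ij]. The data here are synthetic, and the INTERSCAL analysis itself is not shown.

# Sketch of the aggregation step: three-way dissimilarities collapsed into an
# interval-valued (symbolic) dissimilarity matrix.
import numpy as np

rng = np.random.default_rng(0)
K, n = 500, 6                         # many sources, few objects
d = rng.random((K, n, n))
d = (d + d.transpose(0, 2, 1)) / 2    # make each source's matrix symmetric
for k in range(K):
    np.fill_diagonal(d[k], 0.0)

dmin = d.min(axis=0)                  # elementwise minimum over sources
dmax = d.max(axis=0)                  # elementwise maximum over sources
print(dmin.shape, dmax.shape)         # each n x n; together an interval matrix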

References:

Bock, H.H., & Diday, E. (Eds.) (2000). Symbolic Data Analysis. Berlin: Springer.

Winsberg, S., & Carroll, J.D. (1989). A quasi-nonmetric method for multidimensional scaling via an extended Euclidean model. Psychometrika, 54, 217-229.


Zhijun Wu, Iowa State University

Title: A Novel Geometric Build-Up Algorithm for Solving the Distance
Geometry Problem and Its Application to Multi-Dimensional Scaling

A multi-dimensional scaling problem can be formulated equivalently as a distance geometry problem, which can be solved using a singular value decomposition algorithm in on the order of n^2 to n^3 floating-point operations if the exact distances between all pairs of n points are given. In this talk, I will present a novel algorithm, called the geometric build-up algorithm, for the distance geometry problem, which can find a solution in linear time. The algorithm can be extended to solve general problems with sparse and inexact distances as well. Techniques for handling build-up errors, inconsistent distances, and distance bounds will be described.
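The core build-up step with exact distances can be sketched as follows: once four non-coplanar points have been placed, a new point's coordinates follow from its distances to them by solving a 3 × 3 linear system (each sphere equation minus the first). The Python sketch below verifies this on a constructed example; handling of sparse or inexact distances, and of accumulated build-up error, is not shown.

# Sketch of one geometric build-up placement step with exact distances.
import numpy as np

def place_point(base, dists):
    """base: 4 x 3 coordinates of placed points; dists: distances from the new point."""
    p0, d0 = base[0], dists[0]
    A = 2.0 * (base[1:] - p0)
    b = (np.sum(base[1:] ** 2, axis=1) - np.sum(p0 ** 2)) - (dists[1:] ** 2 - d0 ** 2)
    return np.linalg.solve(A, b)     # coordinates of the new point

# Check: recover a known point from its exact distances to four base points.
base = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
x_true = np.array([0.3, -0.7, 1.2])
dists = np.linalg.norm(base - x_true, axis=1)
print(place_point(base, dists))      # approximately [0.3, -0.7, 1.2]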

