Ulas Akkucuk and J. Douglas Carroll, Rutgers Business School
Title: Nonlinear Mapping: Approaches Based on Optimizing an Index of Continuity and Applying Classical Metric MDS on Revised Distances
Dimensionality reduction techniques are used to represent higher dimensional data by a more meaningful lower dimensional structure. In this paper we study two such approaches, namely Carroll's Parametric Mapping (abbreviated PARAMAP) (Shepard & Carroll, 1966) and a variation of ISOMAP (Tenenbaum, de Silva, & Langford, 2000). The former relies on iterative minimization of a cost function, while the latter applies classical MDS after a preprocessing step in which a shortest-path algorithm is used to define approximate geodesic distances. We test the PARAMAP algorithm extensively on 62 points on the surface of a sphere with varying error levels and degrees of irregularity of spacing, as well as on the Swiss roll data initially used to test ISOMAP, and compare the results to ISOMAP-produced embeddings of the sphere and the Swiss roll. We develop a measure of congruence between the input data and the mapped low dimensional embedding based on the preservation of local structure, and compare the different approaches using this measure. This measure can also be used to test the statistical significance of an obtained configuration, with the help of a randomization procedure we developed and the application of simple probabilistic concepts (e.g., Chebyshev's inequality). The PARAMAP approach turns out to be clearly superior to the ISOMAP approach for both kinds of data on which we tested the programs. In the case of the sphere, ISOMAP produces mappings resembling a projection of the points into the lower dimensional space, much like what the essentially linear (or bilinear) classical MDS or PCA analysis would also do. PARAMAP, on the other hand, preserves the local structure much better than ISOMAP does, even for the Swiss roll data, which were used in the original paper to demonstrate the effectiveness of ISOMAP (Tenenbaum et al., 2000). The difference between the mappings produced by the two algorithms is much more severe for the sphere (a "closed" data set) than for the Swiss roll (an "open" data set). This suggests that ISOMAP in its current state would be useful mainly for such "open" or "semi-closed" data sets; in "closed" data sets, PARAMAP seems to be the only option. PARAMAP has its shortcomings, however, in that it requires extensive computer run time, and computational improvements are a must if it is to be generalized to deal with a much larger number of points (say 1000 or more). Both ISOMAP and PARAMAP could also benefit from a procedure that would judiciously select a subset of points to be analyzed and then, after analysis, extrapolate or interpolate the rest. Our current research direction is to improve the computational efficiency of PARAMAP and to devise an efficient extrapolation or interpolation procedure.
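The congruence measure mentioned above is based on preservation of local structure. As a minimal illustration (an assumption about the general form of such an index, not the authors' exact definition), an index of this kind can be computed as the average overlap between each point's k nearest neighbors in the input space and in the embedding:

    # Hypothetical sketch of a local-structure congruence index: the average
    # overlap between each point's k nearest neighbors in the high-dimensional
    # input and in the low-dimensional embedding.  The actual index used by
    # Akkucuk and Carroll may be defined differently.
    import numpy as np

    def knn_sets(X, k):
        """Return, for each row of X, the index set of its k nearest neighbors."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)           # exclude the point itself
        return [set(np.argsort(row)[:k]) for row in D]

    def neighborhood_agreement(X_high, X_low, k=5):
        """Mean proportion of shared k-nearest neighbors across the two spaces."""
        high, low = knn_sets(X_high, k), knn_sets(X_low, k)
        return np.mean([len(h & l) / k for h, l in zip(high, low)])

    # Example: points on a sphere versus a crude 2-D "projection" embedding.
    rng = np.random.default_rng(0)
    sphere = rng.normal(size=(62, 3))
    sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
    embedding = sphere[:, :2]
    print(neighborhood_agreement(sphere, embedding, k=5))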
Michael Brusco, Florida State University
Title: Applying Concordance and Partitioning Procedures to Identify Groups of Subjects with Similar MDS Structures
A common representation of data within the context of multidimensional scaling (MDS) is a collection of symmetric proximity (similarity or dissimilarity) matrices, one for each of M subjects. There are a number of alternatives for analyzing these data, including: (a) conducting an MDS analysis on a single matrix obtained by pooling (averaging) the M subject matrices, (b) fitting a separate MDS structure for each of the M matrices, or (c) employing an individual-differences MDS model. We discuss each of these approaches, and propose a simple method for identifying groups of individual-subject matrices with concordant proximity structures. This method collapses the three-way data into a subject × subject dissimilarity matrix, which is subsequently partitioned using a branch-and-bound algorithm. Extensive Monte Carlo testing of the method reveals that a partition-diameter index is especially effective for recovery of the true subject cluster memberships present in the data.
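One way to form the subject × subject dissimilarity matrix described above is sketched below; the concordance measure used here (one minus the Pearson correlation between the off-diagonal entries of two subject matrices) is an illustrative assumption rather than the measure actually used, and the branch-and-bound partitioning step is not shown:

    # Hypothetical sketch: collapse M subject proximity matrices into a single
    # subject-by-subject dissimilarity matrix.  The concordance measure here
    # (1 - Pearson correlation of off-diagonal entries) is an assumption.
    import numpy as np

    def subject_dissimilarity(matrices):
        """matrices: list of M symmetric n x n proximity matrices."""
        n = matrices[0].shape[0]
        iu = np.triu_indices(n, k=1)                 # off-diagonal entries
        vecs = np.array([P[iu] for P in matrices])   # M x n(n-1)/2
        corr = np.corrcoef(vecs)                     # M x M concordance
        return 1.0 - corr                            # dissimilarity between subjects

    # Example with three random "subjects"
    rng = np.random.default_rng(1)
    mats = []
    for _ in range(3):
        A = rng.random((6, 6))
        A = (A + A.T) / 2
        np.fill_diagonal(A, 0.0)
        mats.append(A)
    print(subject_dissimilarity(mats).round(2))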
Frank M.T.A. Busing, Leiden University
Title: PREFSCAL: A Program for 3-Way 3-Mode Multidimensional Unfolding
There are at least two reasons why unfolding is not commonly used as
a data analysis method.
Almost since its introduction, unfolding has suffered from degeneracy.
Degeneracy is characterized by perfect but uninformative solutions of
the unfolding analysis, in which the distances between the two sets of
objects are all identical. Different solutions, and sometimes causes,
have been suggested, but these have not led to a mature analysis
method. The second reason concerns availability: unfolding
programs that claim to produce nondegenerate solutions are not widely
or commercially available.
PREFSCAL is being developed in Leiden. The program minimizes penalized
Stress, an approach that successfully avoids degenerate solutions in
most circumstances. PREFSCAL aims at the same functionality as
PROXSCAL, a companion program for three-way MDS distributed in the
SPSS package CATEGORIES, with various three-way models,
conditionalities, transformations, and restrictions. In the very near
future, PREFSCAL will be available in SPSS as well.
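A simple way to see the degeneracy problem described above is to examine the spread of the between-set distances in a solution: when they are all (nearly) identical, the solution fits perfectly but is uninformative. The sketch below computes such a diagnostic (the coefficient of variation of the between-set distances); it illustrates the symptom only, not PREFSCAL's penalized-Stress remedy:

    # Sketch of a degeneracy diagnostic for an unfolding solution: the
    # coefficient of variation of the distances between the two point sets.
    # Values near zero indicate the "perfect but uninformative" solutions
    # described above.  This is an illustration only, not the PREFSCAL penalty.
    import numpy as np

    def between_set_cv(X_rows, X_cols):
        D = np.linalg.norm(X_rows[:, None, :] - X_cols[None, :, :], axis=-1)
        return D.std() / D.mean()

    # A degenerate layout: columns at the center, rows on a circle of radius 1.
    theta = np.linspace(0, 2 * np.pi, 10, endpoint=False)
    rows = np.column_stack([np.cos(theta), np.sin(theta)])
    cols = np.zeros((4, 2))
    print(between_set_cv(rows, cols))   # ~0: all between-set distances equal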
James E. Corter, Columbia University
Title: Representing Asymmetric Proximities Using Trees and Directed Trees
Asymmetric proximity data is often analyzed by symmetrizing the matrix and using a standard two-way one-mode multidimensional scaling or clustering technique. This practice may conceal meaningful structure in the asymmetries. It is proposed here to use a mix of directed and undirected trees to represent asymmetric proximities. In this method, dubbed DITREES, the asymmetric proximity matrix is first symmetrized by averaging, and the residual terms are computed. A mixture of two trees, with the same topology but different arc weights, is then fit to the n(n-1)/2 symmetric proximities, plus the n(n-1)/2 transformed antisymmetric residuals. The two-tree solution is represented graphically by a hybrid undirected/directed tree, such that the length of each arc represents the corresponding parameter of the first tree fit to the symmetrized proximities, and a directional component to each arc (represented graphically by an arrowhead) represents the corresponding parameter of the second tree fit to the transformed residuals.
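The first step described above, symmetrizing by averaging and computing the residuals, amounts to splitting the asymmetric matrix into a symmetric part and an antisymmetric (skew-symmetric) part; a minimal sketch of that decomposition (the residual transformation and the DITREES tree fitting are not shown) is:

    # Sketch: decompose an asymmetric proximity matrix into its symmetric part
    # (obtained by averaging) and the antisymmetric (skew-symmetric) residuals.
    # The DITREES tree-fitting stage that follows this step is not shown.
    import numpy as np

    def decompose(P):
        sym = (P + P.T) / 2.0        # symmetrized proximities
        skew = (P - P.T) / 2.0       # antisymmetric residuals
        return sym, skew

    P = np.array([[0.0, 2.0, 5.0],
                  [4.0, 0.0, 1.0],
                  [3.0, 3.0, 0.0]])
    sym, skew = decompose(P)
    assert np.allclose(sym + skew, P)   # the two parts reconstruct the data
    print(sym)
    print(skew)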
David Dubin, Gabriel Ripoche, and Les Gasser, University of Illinois
Title: Organizational Dynamics of Software Development
Mistakes, errors, and problems are a common element of working life, and a part of
all settings at one time or another. In software production, the work of identifying and
resolving errors, bugs and mistakes plays a large part. Operating system idiosyncrasies
or seemingly random "glitches" in program runs keep designers from refining programs
which are "almost perfect". In the worst cases, buggy software can present major
threats to security, economic health, and even lives.
As complex software artifacts proliferate and become more central to, even ubiquitous
in, people's lives, the social cost of software errors and bugs may increase.
Since errors and bugs reduce the effectiveness with which we can build and use software
systems, we're interested in understanding how bugs come about, how they can
best be managed, and how people who build and use advanced software systems can
organize their work to prevent, overcome, deal with, and accommodate problems.
Most accounts of software problems focus on flaws in technical design and usability.
Surely better design, prototyping, and needs analyses can help. But there's clearly
much more to the issue: specifically, the reliability of a software artifact is related
to the structure of the technical and organizational processes that produce it and to the
technical and organizational infrastructures and constraints under which it is built and
maintained. This research is probing several aspects of this mix, examining a range of
questions along these lines.
It's difficult to analyze issues such as the ones above without comprehensive, time- and
project-specific data from large projects. With several hundred thousand problem
reports from a variety of open-source projects, widely accessible open-source bug
repositories provide an extremely large and diverse dataset for analyzing issues like
those above. Data have been captured over several years, allowing for analysis of processes
and their evolution over time. These repositories can be analyzed quantitatively
(in terms of numbers of events, event types, response types, dates, timelines, etc.). They
can also be analyzed qualitatively, by comparatively examining the texts of bug reports,
responses, analyses, etc.
A third approach is exploratory analysis of the data using methods such as classical
metric multidimensional scaling [4], and similar visualization tools. Here the goal is to
uncover patterns, regularities, or exceptions that may be worth investigating in detail
using more formal qualitative or quantitative methods. In the tradition of numerous
bibliometric studies [2, 6], we are investigating patterns of cocitation in Bugzilla, the
repository for the Mozilla development project [1]. Our March 2002 snapshot of the
Bugzilla database contains 128,823 bug reports and over 1.2 million comments representing
feedback from over 20,000 registered users.
In contributing or responding to the report of a bug, Mozilla developers refer to
earlier relevant reports by ID number; the extent to which a pair of bug reports are
mentioned or cited together across the database forms a measure of association that can
serve as the input to MDS. We are interested in whether the data will provide insights
related to the structure of Mozilla, to categories of bugs, to the organizational
structure of the bug reporting system, or to the dynamics of communication among the
contributors.
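A sketch of how such a cocitation measure might be assembled is given below; the input format and the conversion from cocitation counts to dissimilarities are illustrative assumptions, not the authors' actual pipeline:

    # Hypothetical sketch: build a cocitation count matrix from lists of bug IDs
    # mentioned together in the same comment, then convert counts to
    # dissimilarities suitable as MDS input.
    from collections import defaultdict
    from itertools import combinations

    def cocitation_counts(comments):
        """comments: iterable of lists of bug IDs cited together in one comment."""
        counts = defaultdict(int)
        for ids in comments:
            for a, b in combinations(sorted(set(ids)), 2):
                counts[(a, b)] += 1
        return counts

    def to_dissimilarity(counts):
        """Simple illustrative transform: higher cocitation -> smaller dissimilarity."""
        return {pair: 1.0 / (1 + c) for pair, c in counts.items()}

    comments = [[101, 202, 303], [101, 202], [202, 303], [404, 101]]
    print(to_dissimilarity(cocitation_counts(comments)))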
In applying MDS to data sets of this size the main problems are not computational.
Efficient and scalable algorithms for metric MDS are known [5], and implementations
of the necessary matrix decomposition routines are available in freely-distributed libraries
for both single and multi-processor computing hardware. We therefore have
no difficulty computing MDS solutions for thousands or tens of thousands of points.
The main challenge of this type of analysis is discovering an enlightening solution
space (or spaces) that can be readily visualized in few dimensions while accounting
for enough of the variance. Taken as a whole, our data has no simple underlying spatial
organization (unlike MDS applications in chemistry or molecular biology, where a
three-dimensional structure is known to exist). The sparseness of our proximity matrix
would indicate a large number of subsets of points, any of which might provide an
insightful picture when scaled.
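For reference, the classical (Torgerson) metric MDS computation mentioned above can be sketched in a few lines; a scalable implementation would replace the dense eigendecomposition with the iterative or sparse routines alluded to above:

    # Sketch of classical (Torgerson) metric MDS: double-center the squared
    # dissimilarities and take the leading eigenvectors of the resulting matrix.
    import numpy as np

    def classical_mds(D, dim=2):
        """D: n x n matrix of dissimilarities; returns an n x dim configuration."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                  # double-centered matrix
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:dim]         # largest eigenvalues first
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

    # Example: recover a 2-D configuration from its own Euclidean distances.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(classical_mds(D, dim=2).round(3))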
A further problem arises in the interpretation of solutions. In bibliometric studies
researchers are usually familiar ahead of time with the highly-cited authors or papers
that appear in solutions. Repositories like Bugzilla offer more of a challenge in
accounting for why two reports may be related, and whether or not the reason is interesting.
To meet this challenge we are attempting to integrate cocitation maps with
visualizations of the much larger number of records that form the context of the proximity
relations that produced them. Our first attempts at contextualization involve using
MDS solutions as the initial reference point configurations for VIBE projections [3].
This material is based upon work supported by the National Science Foundation
under Grant No. 0205346. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
Willem J. Heiser, Leiden University
Title: Interpretation of Between-Set Distances In Correspondence Analysis
There has been some debate about the correct interpretation of distances between row elements and column elements in a joint display of a correspondence table. The conventional view is that we can scale this joint display in such a way that either the distances between rows can be interpreted, or the distances between columns, but never directly the distances between rows and columns (Heiser and Meulman, 1981; Greenacre and Hastie, 1987). Carroll, Green and Schaffer (1986) proposed an alternative scaling of the coordinates for which they claimed that both between-set and within-set squared distances could be interpreted, but Greenacre (1989) has shown that this claim is not warranted.
Before any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to the unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. It is shown that a between-set distance interpretation is possible in any barycentric configuration or plot, by comparison with the distances to some specific supplementary points. The distance involved is not of the chi-squared type, but simply Euclidean. The result is equally valid in the full-dimensional space and in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.
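The sketch below illustrates the setting described above: each row profile of a contingency table is a barycentric (convex) combination of the unit profiles, and its distance to a unit-profile vertex is plain Euclidean. The specific supplementary points used in the argument are not constructed here:

    # Sketch: row profiles of a contingency table are barycentric (convex)
    # combinations of the unit profiles, and distances from a profile to a
    # unit-profile vertex are ordinary Euclidean distances.
    import numpy as np

    N = np.array([[20.0,  5.0,  5.0],
                  [10.0, 30.0, 10.0],
                  [ 5.0,  5.0, 40.0]])

    profiles = N / N.sum(axis=1, keepdims=True)   # row profiles (rows sum to 1)
    units = np.eye(3)                             # unit profiles (simplex vertices)

    # Euclidean distance from each row profile to each unit profile
    dist = np.linalg.norm(profiles[:, None, :] - units[None, :, :], axis=-1)
    print(profiles.round(3))
    print(dist.round(3))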
Lawrence J. Hubert, University of Illinois
Title: The Representation of Proximity Matrices by Tree Structures:
A Tree Structure Toolbox (TST) for MATLAB
We present and illustrate the capabilities of a MATLAB Toolbox for
fitting various classificatory tree structures to both symmetric (one-mode)
and rectangular (two-mode) proximity matrices. The emphasis is on identifying
ultrametrics and additive trees that are well-fitting in the L_{2} norm by heuristic
extensions of an iterative projection (least-squares) strategy. The (additive)
fitting of multiple tree structures is also addressed.
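As a reminder of one target structure being fitted, a matrix is an ultrametric when, for every triple of objects, the two largest of the three dissimilarities are equal (equivalently, d(i,j) <= max(d(i,k), d(j,k))). A small checking utility, not part of the toolbox itself, might look like this:

    # Sketch: verify that a fitted dissimilarity matrix satisfies the ultrametric
    # condition d(i, j) <= max(d(i, k), d(j, k)) for every triple of objects.
    # This checker is illustrative and is not part of the TST toolbox.
    import numpy as np
    from itertools import combinations

    def is_ultrametric(D, tol=1e-8):
        n = D.shape[0]
        for i, j, k in combinations(range(n), 3):
            three = sorted([D[i, j], D[i, k], D[j, k]])
            if three[2] > three[1] + tol:     # two largest must be (nearly) equal
                return False
        return True

    D = np.array([[0.0, 2.0, 4.0],
                  [2.0, 0.0, 4.0],
                  [4.0, 4.0, 0.0]])
    print(is_ultrametric(D))   # True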
L.J. Hubert, University of Illinois
P. Arabie, Rutgers University
J.J. Meulman, Leiden University
Title: Multidimensional Scaling in the City-Block Metric: L1- and L2-Norm Optimization Methods Using MATLAB
This paper is a companion to Hubert, Arabie and Meulman (2002), published in this same journal. The latter provides a number of different least-squares (L2) optimization strategies for linear unidimensional scaling (LUS) within a MATLAB
environment. The current paper develops extensions, again within a MATLAB
context, to multidimensional scaling in the city-block metric using both an L2 and an L1 (least sum-of-absolute-deviations) loss function. Although L1 is an alternative to the use of L2, it does not appear to give any salient advantages; it is also extremely expensive computationally to implement, and sensitive to the possible ill-conditioning of the initial LUS task when formulated through its linear programming subtasks. A final generalization is given in the L2 context that incorporates optimal monotonic transformations of the original proximities, thus providing a readily available and computationally reasonable strategy for nonmetric multidimensional scaling in the city-block metric (in two and three dimensions) based on the MATLAB functions provided.
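To make the two loss functions concrete, the sketch below evaluates a given configuration against proximities under the city-block metric with both an L2 and an L1 criterion; the optimization strategies themselves are those in the MATLAB functions described in the paper and are not reproduced here:

    # Sketch: evaluate a configuration X against proximities D under the
    # city-block (L1) metric, using both an L2 and an L1 loss.  Only the loss
    # evaluation is shown, not the optimization strategies in the paper.
    import numpy as np

    def cityblock_distances(X):
        return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

    def losses(D, X):
        iu = np.triu_indices(D.shape[0], k=1)
        resid = D[iu] - cityblock_distances(X)[iu]
        return np.sum(resid ** 2), np.sum(np.abs(resid))   # (L2 loss, L1 loss)

    D = np.array([[0.0, 3.0, 5.0],
                  [3.0, 0.0, 4.0],
                  [5.0, 4.0, 0.0]])
    X = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 0.5]])
    print(losses(D, X))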
Robert Michael Lewis, College of William & Mary
Title: Sensitivity Analysis of the STRAIN Objective
We discuss the first and second derivatives of the objective
in the STRAIN approach to multidimensional scaling, in which
one works directly with the squared dissimilarities. A nice
feature of our approach is that the action of the derivatives
on a symmetric matrix of dissimilarities is described in terms of the
entire matrix, rather than in terms of derivatives of individual
components. We also give a complete characterization of
the spectrum of the Hessian of the STRAIN objective, and comment on
various features of the Hessian. We conclude with some remarks on
applying this second-order information to the STRAIN minimization
problem.
(This is joint work with Michael Trosset.)
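For concreteness, one standard way of writing the STRAIN objective (consistent with working directly with the squared dissimilarities, though the talk's exact formulation may differ) compares the double-centered matrix of squared dissimilarities with the inner-product matrix of the configuration:

    # Sketch of a standard STRAIN objective: the squared Frobenius distance
    # between the double-centered matrix of squared dissimilarities and the
    # inner-product matrix X X^T of the configuration.  The derivative and
    # Hessian analysis from the talk is not reproduced here.
    import numpy as np

    def strain(D2, X):
        """D2: n x n matrix of squared dissimilarities; X: n x p configuration."""
        n = D2.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ D2 @ J                 # double-centered squared dissimilarities
        return np.linalg.norm(B - X @ X.T, "fro") ** 2

    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5]])
    Xc = X - X.mean(axis=0)                   # center the configuration
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    print(strain(D2, Xc))                     # ~0 for a centered exact configuration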
Alistair Morrison, University of Glasgow
Title: Hybrid Algorithms for Force-Directed Multidimensional Scaling
Force-directed algorithms for MDS have, in the past, suffered from high
computational complexity and impractically long run times. Standard
spring models, in particular, run in O(N^3) time, owing to
the necessity to compute distances between every pair of objects at each
iteration. This talk will cover algorithms developed and evaluated as
part of my PhD toward the improvement of such techniques. Through the
combination of several stochastic processes, a hybrid model has been
created that executes in sub-quadratic time while still creating
representative layouts in low-dimensional space. This initial hybrid
model will be discussed, as well as the subsequent improvements made to
further reduce complexity. Results of evaluation will be presented,
detailing the models' performance on sets of both synthetic and real data.
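As background for the complexity figures above, a single iteration of a naive spring model updates every point using forces from all other points, which is the source of the quadratic per-iteration cost; a minimal sketch (step size and force law are arbitrary illustrative choices) follows.

    # Sketch of one iteration of a naive spring-model MDS layout: every pair of
    # points exerts a force proportional to the difference between the current
    # layout distance and the target dissimilarity.  The O(N^2) pairwise loop
    # per iteration is the cost that the hybrid algorithms aim to avoid.
    import numpy as np

    def spring_iteration(X, D, step=0.05, eps=1e-9):
        n = X.shape[0]
        forces = np.zeros_like(X)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                delta = X[i] - X[j]
                dist = np.linalg.norm(delta) + eps
                forces[i] += (D[i, j] - dist) * delta / dist   # push/pull along the pair
        return X + step * forces

    rng = np.random.default_rng(2)
    D = rng.random((10, 10)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
    X = rng.normal(size=(10, 2))
    for _ in range(100):
        X = spring_iteration(X, D)
    print(X[:3].round(2))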
Michael W. Trosset, College of William & Mary
Title: Multidimensional Scaling and Molecular Conformation
Historically, multidimensional scaling (MDS) has been used
primarily to construct configurations of points in
low-dimensional Euclidean spaces for the purpose of visualizing fallible
measurements of the pairwise dissimilarities between a small number of
objects. The problem of inferring 3-dimensional molecular structure from
information about interatomic distances invites the application of various
methods developed for MDS, but various features of the molecular
conformation problem necessitate modification of traditional MDS
strategies. I will review several salient features of the molecular
conformation problem and survey several approaches to solving it. Two
approaches, the APA algorithm developed at the University of Kentucky and
work that I presented at the first DIMACS Workshop on Algorithms for MDS,
are closely related to nonmetric MDS, which was originally dismissed as
not relevant to molecular conformation. I will conclude by sketching some
computational challenges for future research.
Zhijun Wu, Iowa State University
Title: A Novel Geometric Build-Up Algorithm for Solving the Distance Geometry Problem and Its Application to Multi-Dimensional Scaling
A multi-dimensional scaling problem can be formulated equivalently
as a distance geometry problem, which can be solved using
a singular value decomposition algorithm in on the order of n^2 to n^3
floating-point operations, if the exact distances between all pairs
of n points are given. In this talk, I will
present a novel algorithm called the geometric build-up algorithm
for the distance geometry problem, which can find a solution
to the problem in linear time. The algorithm can be extended to
solving general problems with sparse and inexact distances as well.
The techniques for handling the build-up errors, inconsistent
distances, and distance bounds will be described.
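Under the exact-distance assumption, the core build-up step places each new point by solving a small linear system derived from its distances to a few already-placed, affinely independent points; the sketch below shows that single step only (the full algorithm, and the handling of build-up errors and inexact distances, are not reproduced):

    # Sketch of a core build-up step with exact distances: locate a new point x
    # in 3-D from its distances to four already-placed, non-coplanar base points,
    # by subtracting the quadratic equations ||x - p_i||^2 = d_i^2 to obtain a
    # linear system.
    import numpy as np

    def place_point(P, d):
        """P: 4 x 3 base coordinates; d: distances from the new point to each base."""
        A = 2.0 * (P[1:] - P[0])                              # 3 x 3 linear system
        b = (np.sum(P[1:] ** 2, axis=1) - np.sum(P[0] ** 2)
             - (d[1:] ** 2 - d[0] ** 2))
        return np.linalg.solve(A, b)

    P = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    x_true = np.array([0.3, 0.7, 1.2])
    d = np.linalg.norm(P - x_true, axis=1)
    print(place_point(P, d))                                   # recovers x_true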
Suzanne Winsberg, IRCAM, Paris, France
Title: Two Techniques for MDS: One for Large Data Sets with Missing Values; the Other for Huge Data Sets Encountered When Data Mining
The first technique deals with the weighted extended Euclidean
model, EXSCAL, proposed by Winsberg and Carroll (1989). This method has
been modified to deal with missing data using an EM algorithm. Missing
data designs might be used to accommodate large data sets. We present
an evaluation of the algorithm for different levels of missing data.
The second method can deal with data coming from an extremely large
number of sources, the very number of which makes the standard forms of
analysis very difficult. In such cases, encountered in data mining,
it may be advantageous to aggregate the data. Normally in MDS the
dissimilarity matrix has entries consisting of a single value, d_ij, for
the dissimilarity between object i and object j in the two-way case,
or a single value, d_ijk, for the dissimilarity between object i
and object j for source k, in the three-way case. Here, we deal with
aggregated data, so that each entry in the dissimilarity matrix is
an interval, [dmin_ij, dmax_ij], where dmin_ij is the minimum value of
the dissimilarity observed between object i and object j, and dmax_ij
is the maximum value of the dissimilarity observed between object i
and object j. This type of data, interval-type data, results in a
symbolic data table (see Bock and Diday, 2000). Of course classical
single-valued data tables are a special case of this type of data table.
We present the method, INTERSCAL, as well as analyses of artificial and
real data examples.
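The aggregation into interval-valued dissimilarities described above can be sketched as a simple minimum/maximum over sources; the INTERSCAL fitting procedure itself is not shown:

    # Sketch: aggregate three-way dissimilarities d_ijk over sources k into the
    # interval-valued entries [dmin_ij, dmax_ij] described above.  The INTERSCAL
    # model-fitting step itself is not shown.
    import numpy as np

    def aggregate_to_intervals(D):
        """D: K x n x n array of dissimilarities for K sources."""
        return D.min(axis=0), D.max(axis=0)   # (dmin, dmax), each n x n

    rng = np.random.default_rng(3)
    D = rng.random((5, 4, 4))                 # 5 sources, 4 objects
    D = (D + D.transpose(0, 2, 1)) / 2        # symmetrize each source's matrix
    for k in range(5):
        np.fill_diagonal(D[k], 0.0)
    dmin, dmax = aggregate_to_intervals(D)
    print(dmin.round(2))
    print(dmax.round(2))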
References
Bock, H.H. and Diday, E. (Eds.) (2000). Symbolic Data Analysis. Springer, Berlin.
Winsberg, S. and Carroll, J.D. (1989). A quasi-nonmetric method for multidimensional scaling via an extended Euclidean model. Psychometrika, 54, 217-229.