Doubletree Hotel in Tallahassee, Florida

**Organizers:**

- **J. Douglas Carroll** (chairman), Rutgers University, dcarroll@rci.rutgers.edu
- **Phipps Arabie**, Rutgers University, arabie@andromeda.rutgers.edu
- **Larry Hubert**, University of Illinois, lhubert@s.psych.uiuc.edu
- **Michael Trosset**, The College of William & Mary, trosset@math.wm.edu
- **Mike Brusco**, Florida State University, mbrusco@garnet.acns.fsu.edu
- **Mel Janowitz**, DIMACS liaison, melj@dimacs.rutgers.edu

**Ulas Akkucuk and J. Douglas Carroll**, Rutgers Business School

**Michael Brusco**, Florida State University

**Frank M.T.A. Busing**, Leiden University

Title: PREFSCAL: A Program for 3-Way 3-Mode Multidimensional Unfolding

There are at least two reasons why unfolding is not commonly used as a data analysis method.

**James E. Corter**, Columbia University

Title: Representing Asymmetric Proximities Using Trees and Directed Trees

Asymmetric proximity data is often analyzed by symmetrizing the matrix and using a standard two-way one-mode multidimensional scaling or clustering technique. This practice may conceal meaningful structure in the asymmetries. It is proposed here to use a mix of directed and undirected trees to represent asymmetric proximities. In this method, dubbed DITREES, the asymmetric proximity matrix is first symmetrized by averaging, and the residual terms are computed. A mixture of two trees, with the same topology but different arc weights, is then fit to the n(n-1)/2 symmetric proximities, plus the n(n-1)/2 transformed antisymmetric residuals. The two-tree solution is represented graphically by a hybrid undirected/directed tree, such that the length of each arc represents the corresponding parameter of the first tree fit to the symmetrized proximities, and a directional component to each arc (represented graphically by an arrowhead) represents the corresponding parameter of the second tree fit to the transformed residuals.
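The symmetrize-and-take-residuals preprocessing step described above can be sketched as follows. This is only the decomposition of the proximity matrix into its symmetric and antisymmetric parts, not the tree-fitting itself, and the example matrix is invented for illustration.

```python
import numpy as np

def decompose(P):
    """Split an asymmetric proximity matrix into symmetric and
    antisymmetric parts: P = S + A."""
    P = np.asarray(P, dtype=float)
    S = (P + P.T) / 2.0   # symmetrized proximities (averaging)
    A = (P - P.T) / 2.0   # antisymmetric residuals; A[i, j] = -A[j, i]
    return S, A

P = np.array([[0.0, 3.0, 5.0],
              [1.0, 0.0, 4.0],
              [7.0, 2.0, 0.0]])
S, A = decompose(P)

# S + A reconstructs P exactly; S carries the n(n-1)/2 symmetric
# proximities and A the n(n-1)/2 residuals fed to the second tree.
assert np.allclose(S + A, P)
assert np.allclose(S, S.T)
assert np.allclose(A, -A.T)
```

The first tree is fit to the off-diagonal entries of `S`, the second (with the same topology) to the transformed entries of `A`.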

Title: Organizational Dynamics of Software Development

Mistakes, errors, and problems are a common element of working life, and a part of all settings at one time or another. In software production, the work of identifying and resolving errors, bugs and mistakes plays a large part. Operating system idiosyncrasies or seemingly random "glitches" in program runs keep designers from refining programs which are "almost perfect". In the worst cases, buggy software can present major threats to security, economic health, and even lives.

As complex software artifacts proliferate and become more central to, even ubiquitous in, people's lives, the social cost of software errors and bugs may increase. Since errors and bugs reduce the effectiveness with which we can build and use software systems, we're interested in understanding how bugs come about, how they can best be managed, and how people who build and use advanced software systems can organize their work to prevent, overcome, deal with, and accommodate problems.

Most accounts of software problems focus on flaws in technical design and usability. Surely better design, prototyping, and needs analyses can help. But there's clearly much more to the issue: specifically, the reliability of a software artifact is related to the structure of the technical and organizational processes that produce it and to the technical and organizational infrastructures and constraints under which it is built and maintained. This research is probing several aspects of this mix. We're examining questions such as:

- What is the detailed character of the practice of quality assurance in software teams?
- What is the 'normal' role of errors, mistakes, problems, insecurity, and unreliability?
- How can we capture, measure, visualize, assess (etc.) the relationships between multiple problems, multiple constraints, and multiple actors in a software artifact and its supporting organizational infrastructure?

It's difficult to analyze issues such as the ones above without comprehensive, time- and project-specific data from large projects. With several hundred thousand problem reports from a variety of open-source projects, widely accessible open-source bug repositories provide an extremely large and diverse dataset for analyzing issues like those above. Data have been captured over several years, allowing for analysis of processes and their evolution over time. These repositories can be analyzed quantitatively (in terms of numbers of events, event types, response types, dates, timelines, etc.). They can also be analyzed qualitatively, by comparatively examining the texts of bug reports, responses, analyses, and so on.

A third approach is exploratory analysis of the data using methods such as classical metric multidimensional scaling [4], and similar visualization tools. Here the goal is to uncover patterns, regularities, or exceptions that may be worth investigating in detail using more formal qualitative or quantitative methods. In the tradition of numerous bibliometric studies [2, 6], we are investigating patterns of cocitation in Bugzilla, the repository for the Mozilla development project [1]. Our March 2002 snapshot of the Bugzilla database contains 128,823 bug reports and over 1.2 million comments representing feedback from over 20,000 registered users.

In contributing or responding to the report of a bug, Mozilla developers refer to earlier relevant reports by ID number; the extent to which a pair of bug reports are mentioned or cited together across the database forms a measure of association that can serve as the input to MDS. We are interested in whether the data will provide insights related to the structure of Mozilla, to categories of bugs, to the organizational structure of the bug-reporting system, or to the dynamics of communication among the contributors.
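The cocitation counting itself is straightforward. A minimal sketch, with invented bug IDs and with each comment reduced to the set of report IDs it mentions:

```python
from itertools import combinations
from collections import Counter

# Each comment is represented only by the set of bug-report IDs it
# mentions; the IDs below are invented for illustration.
comments = [
    {101, 102},
    {101, 102, 205},
    {205, 300},
    {101, 205},
]

# Count, for every unordered pair of reports, how many comments
# mention both -- the cocitation measure of association.
cocitation = Counter()
for mentioned in comments:
    for a, b in combinations(sorted(mentioned), 2):
        cocitation[(a, b)] += 1

assert cocitation[(101, 102)] == 2   # cited together in two comments
assert cocitation[(205, 300)] == 1
```

The resulting counts (or a similarity derived from them) fill the proximity matrix handed to MDS.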

In applying MDS to data sets of this size, the main problems are not computational. Efficient and scalable algorithms for metric MDS are known [5], and implementations of the necessary matrix decomposition routines are available in freely distributed libraries for both single- and multi-processor computing hardware. We therefore have no difficulty computing MDS solutions for thousands or tens of thousands of points. The main challenge of this type of analysis is discovering an enlightening solution space (or spaces) that can be readily visualized in few dimensions while accounting for enough of the variance. Taken as a whole, our data have no simple underlying spatial organization (unlike MDS applications in chemistry or molecular biology, where a three-dimensional structure is known to exist). The sparseness of our proximity matrix suggests a large number of subsets of points, any of which might provide an insightful picture when scaled.
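The core computation of classical metric MDS [4] is compact. The dense sketch below shows the double-centering and eigendecomposition; for data at the scale discussed above one would substitute the iterative, scalable solvers the paragraph refers to.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) metric MDS from a dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]       # keep the top `dim` eigenvalues
    L = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * L                  # n x dim configuration

# Four points on a line: their distances are recovered exactly
# (up to reflection) in one dimension.
pts = np.array([[0.0], [1.0], [3.0], [6.0]])
D = np.abs(pts - pts.T)
X = classical_mds(D, dim=1)
assert np.allclose(np.abs(X - X.T), D)
```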

A further problem arises in the interpretation of solutions. In bibliometric studies researchers are usually familiar ahead of time with the highly-cited authors or papers that appear in solutions. Repositories like Bugzilla offer more of a challenge in accounting for why two reports may be related, and whether or not the reason is interesting. To meet this challenge we are attempting to integrate cocitation maps with visualizations of the much larger number of records that form the context of the proximity relations that produced them. Our first attempts at contextualization involve using MDS solutions as the initial reference point configurations for VIBE projections [3].

This material is based upon work supported by the National Science Foundation under Grant No. 0205346. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

- [1] The Mozilla project website. http://www.mozilla.org/.
- [2] Börner, K., Chen, C., and Boyack, K. Visualizing knowledge domains. Annual Review of Information Science and Technology 37 (2003).
- [3] Olsen, K. A., Korfhage, R. R., Sochats, K. M., Spring, M. B., and Williams, J. G. Visualisation of a document collection: The VIBE system. Information Processing and Management 29, 1 (1993), 69-81.
- [4] Torgerson, W. S. Theory and Methods of Scaling. Wiley, New York, 1958.
- [5] Trosset, M. W. Numerical algorithms for multidimensional scaling. In Classification and Knowledge Organization, R. Klar and O. Opitz, Eds., Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, Berlin, 1997, pp. 80-92.
- [6] White, H. D., and McCain, K. W. Bibliometrics. Annual Review of Information Science and Technology 24 (1989), 119-186.

**Willem J. Heiser**, Leiden University

Title: Interpretation of Between-Set Distances In Correspondence Analysis

There has been some debate about the correct interpretation of distances between row elements and column elements in a joint display of a correspondence table. The conventional view is that we can scale this joint display in such a way that either the distances between rows can be interpreted, or the distances between columns, but never directly the distances between rows and columns (Heiser and Meulman, 1983; Greenacre and Hastie, 1987). Carroll, Green and Schaffer (1986) proposed an alternative scaling of the coordinates for which they claimed that both between-set and within-set squared distances could be interpreted, but Greenacre (1989) has shown that this claim is not warranted.

Before any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to the unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. It is shown that a between-set distance interpretation is possible in any barycentric configuration or plot, in comparison with the distance to some specific supplementary points. The distance involved is not of the chi-squared type, but simply Euclidean. The result is equally valid in the full-dimensional space as in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.
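The barycentric setup described above can be illustrated numerically: each row profile of a contingency table lies in the simplex whose vertices are the unit profiles. A minimal sketch with an invented table (the projection and supplementary-point construction discussed in the abstract are not shown):

```python
import numpy as np

# A small contingency table (rows x columns); values are invented.
N = np.array([[10.0, 5.0, 5.0],
              [ 2.0, 8.0, 10.0]])

# Row profiles: each row rescaled to sum to 1.
row_profiles = N / N.sum(axis=1, keepdims=True)

# Unit profiles: hypothetical profiles with all mass in one cell,
# i.e. the vertices of the simplex.
unit_profiles = np.eye(3)

# Each profile is a barycentric (convex) combination of the unit
# profiles, with weights equal to its own entries.
for p in row_profiles:
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
    assert np.allclose(p @ unit_profiles, p)
```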

References

- Carroll, J.D., Green, P.E., & Schaffer, C.M. (1986). Interpoint distance comparisons in correspondence analysis. Journal of Marketing Research, 23, 271-280.
- Greenacre, M.J. (1989). The Carroll-Green-Schaffer scaling in correspondence analysis: A theoretical and empirical appraisal. Journal of Marketing Research, 26, 358-365.
- Greenacre, M.J. & Hastie, T. (1987). The geometric interpretation of correspondence analysis. Journal of the American Statistical Association, 82, 437-447.
- Heiser, W.J. & Meulman, J. (1983). Analyzing rectangular tables with joint and constrained multidimensional scaling. Journal of Econometrics, 22, 139-167.

**Lawrence J. Hubert**, University of Illinois

**L.J. Hubert**, University of Illinois

**P. Arabie**, Rutgers University

**J.J. Meulman**, Leiden University

**Robert Michael Lewis**, College of William & Mary

Title: Sensitivity Analysis of the STRAIN Objective

We discuss the first and second derivatives of the objective in the STRAIN approach to multidimensional scaling, in which one works directly with the squared dissimilarities. A nice feature of our approach is that the action of the derivatives on a symmetric matrix of dissimilarities is described in terms of the entire matrix, rather than in terms of derivatives of individual components. We also give a complete characterization of the spectrum of the Hessian of the STRAIN objective, and comment on various features of the Hessian. We conclude with some remarks on applying this second-order information to the STRAIN minimization problem.

(This is joint work with Michael Trosset.)
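The objective itself is easy to state numerically. A sketch of one common form of the STRAIN criterion, working directly with the squared dissimilarities as the abstract describes (the derivative and Hessian results are not reproduced here):

```python
import numpy as np

def strain(X, D2):
    """STRAIN-type objective: squared Frobenius distance between the
    double-centered squared dissimilarities and the Gram matrix of X."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J          # double-centered squared dissimilarities
    R = B - X @ X.T
    return np.sum(R ** 2)

# When D2 holds the squared Euclidean distances of a centered
# configuration X, the objective is exactly zero.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
X = X - X.mean(axis=0)
G = X @ X.T
d2 = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
assert np.isclose(strain(X, d2), 0.0)
```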

**Alistair Morrison**, University of Glasgow

Title: Hybrid Algorithms for Force-Directed Multidimensional Scaling

**Michael W. Trosset**, College of William & Mary

Title: Multidimensional Scaling and Molecular Conformation

Historically, multidimensional scaling (MDS) has been used primarily to construct configurations of points in low-dimensional Euclidean spaces for the purpose of visualizing fallible measurements of the pairwise dissimilarities between a small number of objects. The problem of inferring 3-dimensional molecular structure from information about interatomic distances invites the application of various methods developed for MDS, but various features of the molecular conformation problem necessitate modification of traditional MDS strategies. I will review several salient features of the molecular conformation problem and survey several approaches to solving it. Two approaches, the APA algorithm developed at the University of Kentucky and work that I presented at the first DIMACS Workshop on Algorithms for MDS, are closely related to nonmetric MDS, which was originally dismissed as not relevant to molecular conformation. I will conclude by sketching some computational challenges for future research.

**Suzanne Winsberg**, IRCAM, Paris France

Title: Two Techniques for MDS: One for Large Data Sets with Missing Values; the Other for Huge Data Sets Encountered When Data Mining

The first technique deals with the weighted extended Euclidean model, EXSCAL, proposed by Winsberg and Carroll (1989). This method has been modified to deal with missing data using an EM algorithm. Missing-data designs might be used to accommodate large data sets. We present an evaluation of the algorithm for different levels of missing data. The second method can deal with data coming from an extremely large number of sources, the very number of which makes the standard forms of analysis very difficult. In such cases, encountered in data mining, it may be advantageous to aggregate the data. Normally in MDS the dissimilarity matrix has entries consisting of a single value, d_ij, for the dissimilarity between object i and object j in the two-way case, or a single value, d_ijk, for the dissimilarity between object i and object j for source k, in the three-way case. Here, we deal with aggregated data, so that each entry in the dissimilarity matrix is an interval, [dmin_ij, dmax_ij], where dmin_ij is the minimum value of the dissimilarity observed between object i and object j, and dmax_ij is the maximum value of the dissimilarity observed between object i and object j. This type of data, interval-type data, results in a symbolic data table (see Bock and Diday, 2000). Of course classical single-valued data tables are a special case of this type of data table. We present the method, INTERSCAL, as well as analyses of artificial and real data examples.
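The aggregation step that produces the interval-valued input can be sketched directly: for each object pair, keep only the minimum and maximum dissimilarity observed across all sources. The three-source data below are invented for illustration.

```python
# dissim[k][i][j]: dissimilarity between objects i and j for source k.
dissim = [
    [[0.0, 2.0, 5.0], [2.0, 0.0, 3.0], [5.0, 3.0, 0.0]],
    [[0.0, 4.0, 1.0], [4.0, 0.0, 6.0], [1.0, 6.0, 0.0]],
    [[0.0, 3.0, 2.0], [3.0, 0.0, 4.0], [2.0, 4.0, 0.0]],
]

n = 3
intervals = {}
for i in range(n):
    for j in range(i + 1, n):
        values = [d[i][j] for d in dissim]
        # Interval entry [dmin_ij, dmax_ij] over all sources.
        intervals[(i, j)] = (min(values), max(values))

assert intervals[(0, 1)] == (2.0, 4.0)
assert intervals[(0, 2)] == (1.0, 5.0)
```

The resulting interval matrix is the symbolic data table on which a method such as INTERSCAL would operate.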

References:

- Bock, H.H. and Diday, E. (Eds.) (2000). Symbolic Data Analysis. Springer, Berlin.
- Winsberg, S. and Carroll, J.D. (1989). A quasi-nonmetric method for multidimensional scaling via an extended Euclidean model. Psychometrika, 54, 217-229.

**Zhijun Wu**, Iowa State University


Document last modified on December 1, 2003.