### DIMACS Workshop on Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond

#### A Conference in Memory of Yehuda Vardi

#### October 21 - 22, 2005 Lecture Hall 1/F, CoRE Bldg, Busch Campus, Piscataway, NJ

Organizers:
Regina Liu, Department of Statistics, Rutgers University, rliu@stat.rutgers.edu
Bill Strawderman, Department of Statistics, Rutgers University, straw@stat.rutgers.edu
Cun-Hui Zhang, Department of Statistics, Rutgers University, czhang@stat.rutgers.edu
This conference is co-sponsored by NSF, NISS, DIMACS and Rutgers University.

#### Abstracts:

Akshay Adhikari, Lorraine Denby, Jim Landwehr, and Jean Meloche, Data Analysis Research Dept, Avaya Labs

Title: Using Data Network Metrics with Graphics on the Topology to Explore Network Characteristics

Yehuda Vardi introduced the term "network tomography" and was the first to propose and study how statistical inverse methods could be adapted to attack important network problems (JASA, 91, pp 365-377, 1996). More recently, in one of his final papers, Vardi proposed notions of metrics on networks to define and measure distances between a network's links, its paths, and also between different networks (IEEE Signal Processing Letters, 11, pp 353-355, March 2004). In this paper, we apply Vardi's general approach for network metrics to a real data network by using data obtained from special data network tools and testing procedures that we have developed. We illustrate how the metrics notions help explicate interesting features of the traffic characteristics on the network. We also adapt the metrics in order to condition on traffic passing through a portion of the network, such as a router or pair of routers, and show further how this approach helps to discover and explain interesting network characteristics.

Peter Bickel, University of California, Berkeley

Title: Estimating Large Covariance Matrices

We discuss different notions of sparsity for covariance matrices and the different assumptions and goals underlying these. We proceed to introduce an infinite-dimensional "nonparametric" model for covariances. In this context, we generalize a result of Bickel and Levina (2004, Bernoulli) on naive Bayes rules and show that if log(dimension)/(sample size) -> 0, covariance matrices and their inverses can be estimated consistently, in a suitably uniform way, by matrices requiring on the order of np rather than np^2 operations.
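
The banding idea behind such estimators can be sketched in a few lines (a simplified illustration, not the talk's exact estimator; the bandwidth k is fixed here, whereas in practice it would be chosen data-adaptively):

```python
import numpy as np

def banded_covariance(X, k):
    """Banding estimator: sample covariance with entries |i - j| > k set to zero.

    A sketch of the banding idea; X has one observation per row, one
    variable per column, and k is a fixed (hypothetical) bandwidth.
    """
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, S, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # 200 observations of a 10-dimensional vector
S_banded = banded_covariance(X, k=2)
```

The banded matrix keeps only O(pk) entries, which is the source of the computational savings mentioned above.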

Ching-Shui Cheng, Academia Sinica, Taiwan, and University of California, Berkeley

Title: Some Recent Developments in the Theory of Fractional Factorial Design

Fractional factorial designs, in which only a fraction of all the possible factor-level combinations are observed, are an important class of designs with a rich theory. Since only a fraction is observed, it is not possible to estimate all factorial effects, which may be aliased in a complicated manner. I will give a selected review of some recent results on the selection and construction of fractional factorial designs under model uncertainty.

Jianqing Fan, Princeton University

Title: Nonparametric specification tests for diffusion models in financial econometrics

We develop a specification test for the transition density of a discretely sampled continuous-time diffusion process, based on a comparison of a nonparametric estimate of the transition density or distribution function to the corresponding parametric counterpart assumed by the null hypothesis. Using the closed-form expansions for the transition density, we are able to consider a direct comparison of the two densities for an arbitrary specification of the null parametric model. Using three different discrepancy measures between the null and alternative transition density and distribution functions, we simultaneously test the model's assumptions on the drift and diffusion functions. Our approach does not impose the assumption that the alternative model is a one-factor diffusion model and allows multi-factor stochastic volatility models or any stationary Markovian processes. In the case of many financial time series, such as interest rates or currencies, we avoid the near non-stationarity that can affect tests based on the marginal density of the process. We establish the asymptotic null distributions of the proposed test statistics and compute their power functions. The finite-sample properties are critically investigated via simulation studies and are compared with the test statistic of Hong and Li (2005). Our approaches are illustrated by applications to treasury bill data and implied volatility data.

(Based on joint work with Yacine Ait-Sahalia and Heng Peng)

Chao A. Hsiung, Chi-Chung Wen, Yuh-Jenn Wu and I-Shou Chang, National Health Research Institutes, Taiwan

Title: Shape Restricted Regression with Random Bernstein Polynomials and Applications

Shape restricted regression, including isotonic regression and concave regression as special cases, is studied using priors on Bernstein polynomials and Markov chain Monte Carlo methods. These priors have large supports, select only smooth functions, can easily incorporate geometric information into the prior, and can be generated without computational difficulty. Simulation studies and analysis of real datasets are conducted to illustrate the performance of this approach. As an application, we use the Bayesian regression to study the virus-gene expression time course data from microarray experiments.
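
One reason Bernstein polynomials suit shape-restricted regression is that nondecreasing coefficients yield a nondecreasing function, so an isotonic constraint on the curve reduces to a simple ordering of the coefficients. A minimal numerical illustration (not the paper's prior or MCMC machinery; the coefficient values below are arbitrary):

```python
import numpy as np
from math import comb

def bernstein_poly(coef, t):
    """Evaluate sum_k coef[k] * C(n, k) t^k (1 - t)^(n - k) on t in [0, 1].

    If coef is nondecreasing, the resulting polynomial is nondecreasing,
    which is how a monotone shape constraint can be encoded.
    """
    n = len(coef) - 1
    k = np.arange(n + 1)
    binom = np.array([comb(n, ki) for ki in k])
    t = np.atleast_1d(t)[:, None]
    basis = binom * t**k * (1 - t)**(n - k)
    return basis @ coef

coef = np.array([0.0, 0.1, 0.5, 0.9, 1.0])  # nondecreasing -> monotone curve
grid = np.linspace(0.0, 1.0, 50)
vals = bernstein_poly(coef, grid)
```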

Joon Sang Lee and Michael Woodroofe, The University of Michigan

Title: A Restricted Minimax Determination Of the Initial Sample

When the variance is known, a level $1-\alpha$ confidence interval of specified width $2h > 0$ for the mean of a normal distribution requires a sample of size at least $\eta = c^2\sigma^2/h^2$, where $c$ is the upper $(1-{1\over 2}\alpha)$th quantile of the standard normal distribution. If the variance is unknown, then such an interval may be constructed using Stein's double sampling procedure, in which an initial sample of size $m \ge 2$ is drawn and used to estimate $\eta$. Here it is shown that if the experimenter specifies a prior guess, say $\eta_0$, for $\eta$, then $\sqrt{{1\over 2}(1+c^2)\eta_0}$ is an approximately minimax choice for the initial sample size. The formulation is, in fact, more general and includes point estimation with equivariant loss as well as interval estimation.
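
The two quantities in the abstract are easy to compute; the sketch below (with hypothetical values for sigma, h, and the prior guess) evaluates the required sample size eta = c^2 sigma^2 / h^2 and the approximately minimax first-stage size sqrt((1 + c^2) eta_0 / 2):

```python
from math import sqrt
from statistics import NormalDist

def required_n(sigma, h, alpha=0.05):
    """Known-variance sample size eta = c^2 * sigma^2 / h^2 for a
    level 1 - alpha interval of half-width h."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return c**2 * sigma**2 / h**2

def minimax_initial_size(eta0, alpha=0.05):
    """Approximately minimax first-stage size sqrt((1 + c^2) * eta0 / 2),
    given a prior guess eta0 for eta."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return sqrt(0.5 * (1 + c**2) * eta0)
```

For example, with sigma = 2 and h = 0.5 at level 0.95, eta is about 61.5, and a prior guess near that value gives a first-stage size of roughly 12.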

Colin Mallows, Avaya Labs

Title: Deconvolution by simulation

Suppose we can measure an attribute of a link AB of a network, for example the delay, and the same attribute over the path ABC, but cannot measure (directly) the attribute over the link BC. Data suggests that all observations are independent of one another, and we postulate that ABC = AB + BC. The problem is to use samples of AB and ABC measurements to estimate the distribution of BC. We present a new way of doing this.
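
The abstract does not reveal the new method, but the additive setup can be illustrated with a simple moment-matching baseline: since AB and BC are independent and ABC = AB + BC, we have E[BC] = E[ABC] - E[AB] and Var(BC) = Var(ABC) - Var(AB). A sketch with hypothetical gamma-distributed delays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical delays; BC is never observed directly.
ab = rng.gamma(2.0, 1.0, 5000)                            # observed: link AB
abc = rng.gamma(2.0, 1.0, 5000) + rng.gamma(3.0, 0.5, 5000)  # observed: path ABC
# (true, unobserved BC is Gamma(3, 0.5): mean 1.5, variance 0.75)

# Method-of-moments baseline for the first two moments of BC.
mean_bc = abc.mean() - ab.mean()
var_bc = abc.var() - ab.var()
```

Recovering the full distribution of BC, rather than its moments, is the harder deconvolution problem the talk addresses.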

Micha Mandel, Department of Biostatistics, Harvard School of Public Health

Title: Biased Sampling Problems with Applications to Cross-Sectional Data

Cross-sectional data, that is, data obtained by sampling individuals who are present at a given place and time, are often subject to biases induced by the very presence of these subjects at the sampling time. Besides this bias, longitudinal measurements are often subject to loss of information or missing data, such as censoring, in which not all events are observable. These distortions, when not taken into account correctly, may lead to highly biased and inconsistent estimates. Two aspects or parameters are of interest when analyzing cross-sectional data: the distribution of lifetime in the screened population and the entrance process to that population. In this talk, I explore several models which arise from different assumptions on the entrance process, lifetimes and censoring. Estimators for the unknown parameters are provided, as well as algorithms to implement them. Their performance and usefulness in moderate and small sample sizes are examined by simulations.
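
A minimal instance of such a bias correction (a sketch only, far simpler than the models of the talk): under pure length-biased sampling, the sampled density is proportional to x f(x), and the underlying mean can be recovered by harmonic weighting of the biased sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Length-biased sampling from Gamma(shape=2, scale=1), which has mean 2:
# the biased density is proportional to x * f(x), i.e. Gamma(shape=3, scale=1).
biased = rng.gamma(3.0, 1.0, 20000)

naive_mean = biased.mean()                   # close to 3: biased upward
debiased_mean = 1.0 / np.mean(1.0 / biased)  # harmonic correction, close to 2
```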

Vijay Nair, University of Michigan, Ann Arbor

The term network tomography, first introduced in Vardi (1996), characterizes two classes of large-scale inverse problems that arise in the modeling and analysis of computer and communications networks. This talk will deal with active network tomography, where the goal is to recover link-level quality-of-service parameters, such as packet loss rates and delay distributions, from end-to-end path-level measurements. Internet service providers use this to characterize network performance and to monitor service quality. The talk provides a review of recent developments, including the design of probing experiments, inference for loss rates and delay distributions, and applications to network monitoring. This is joint work with George Michailidis, Earl Lawrence, Bowei Xi, and Xiaodong Yang.

Jonathon Phillips, National Institute of Standards and Technology

Title: Overview of Automatic Face Recognition

Face recognition is an interdisciplinary field with elements from pattern recognition, computer vision, computer graphics, psychology, neuroscience, evaluation methods, and statistics. I will present a brief overview of face recognition and how the above subjects influence it, emphasizing topics that have statistical components. These topics relate to pattern recognition, machine learning, and the analysis of algorithm performance.

Yosi Rinott, Hebrew University

Title: Inference on multi-phase survival processes with incomplete data

Consider a life consisting of several phases (e.g., a disease which progresses in phases, duration of service in a university where the phases are different ranks) and data obtained by intercepting the life process at a random time and following it for a limited time. The data is therefore biased and censored. Given information on the phase at which a subject is intercepted, and perhaps on the past of the process, the goal is to infer on the distribution of total life, and the joint distribution of the phases' durations. We use models (e.g., copulas) and nonparametric maximum likelihood estimation.

Based on joint works with Micha Mandel, Yehuda Vardi and Cun-Hui Zhang.

Larry Shepp, Rutgers University

Title: Statistical Thinking: From Tukey to Vardi and Beyond

In the 1960s, John Tukey and his followers brought exploratory data analysis into statistical methodology, partly as a revolt against what was perceived as an overly rigid and brittle mathematical modelling philosophy that had held sway before them. Some problems seemed to demand such a purely data-driven approach, in which data mining in the absence of mathematical modelling is the driving philosophical methodology. One did not want to be biased by preconceived ideas about the origin of the data by formulating a model, but instead allowed the data to speak for itself.

Vardi liked mathematical modelling and was very good at it. He also promoted data mining, depending on the problem, and thus straddled both philosophies. He and I often debated these issues, and were often in friendly disagreement. I will argue, through concrete examples of the work of Vardi and others in statistics, that the pendulum should again swing back a bit towards encouraging more mathematical modelling, so as to obtain maximal benefit from statistical procedures by allowing physics, biology, and other fields of science to enter the problem formulation via mathematical modelling of the specific statistical problem at hand. I would argue that the solution to a specific problem ought to depend somehow on the problem itself, which is not the case with neural nets and other data-driven approaches that live mostly or entirely within the data or training set of the problem.

Data-driven statistics has the danger of isolating statistics from the rest of the scientific and mathematical communities by not allowing valuable cross-pollination of ideas from other fields. To illustrate these ideas I will discuss among other concrete examples of statistics problems: a) emission tomography, b) pattern recognition of hand-written characters, c) sampling bias.

All these examples were frequently debated by Vardi and me. I will do my best to give Vardi's side as honestly as possible. Needless to say, I wish he were here to continue the debate.

Rick Vitale, University of Connecticut

Title: Gaussian Measure and Geometric Convexity

We survey various ways in which geometric ideas can illuminate Gaussian measure and vice versa. As time permits, we will consider inequalities, representation formulas, valuations, and extension of geometric functionals to infinite dimensional settings.

Zhiliang Ying, Columbia University

Title: Analysis of Panel Duration Models with Fixed Effects

The analysis of duration data has a long tradition in both methodological and empirical research in economics as well as in the health sciences. In keeping with the features of most duration data and the requirements of empirical analysis, the recent important developments in duration research have been inference methods for multi-spell (panel) durations, which permit individual heterogeneity, censored durations, or both. Two broad classes of semiparametric models, the Cox-type proportional hazards model and the extended linear model, will be discussed in this talk. Unbiased estimating equations will be presented for the latter, with theoretical and numerical justifications. Also to be presented is an application to a well-known study.

Bin Yu, University of California, Berkeley

Title: Network Tomography

Vardi's 1996 JASA paper started the field of network tomography, which now represents a class of ill-posed linear inverse problems that use indirect measurements to infer useful characteristics of computer networks.

Origin-destination (OD) traffic estimation based on link counts is the original network tomography problem studied by Vardi (1996) and is very important for routing. In this talk, we will first give an overview of network tomography and then concentrate on the OD estimation problem. We will review and compare our OD method (Gaussian model + iterative proportional fitting) with that of the AT&T group (gravity model + mutual information regularization). The comparisons are made through information-theoretic geometry and a real Sprint network data set with validation. Finally, we will survey the "second-generation" OD estimation techniques, which go beyond link counts, and present a new partial measurement approach called PamTram, which uses minimal direct OD information to significantly reduce the relative error rate and computation.
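
Iterative proportional fitting itself is easy to sketch: starting from a seed matrix, alternately rescale rows and columns until both sets of margins match their targets. The toy OD matrix and margins below are hypothetical, and this omits the Gaussian modeling component of the talk:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, n_iter=200):
    """Iterative proportional fitting: rescale a positive seed matrix so its
    row and column sums match the targets (targets must share the same total)."""
    T = seed.astype(float).copy()
    for _ in range(n_iter):
        T *= (row_targets / T.sum(axis=1))[:, None]  # match row sums
        T *= col_targets / T.sum(axis=0)             # match column sums
    return T

# Hypothetical 3x3 OD matrix with given total outflows and inflows.
seed = np.ones((3, 3))
rows = np.array([10.0, 20.0, 30.0])
cols = np.array([15.0, 15.0, 30.0])
T = ipf(seed, rows, cols)
```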

Cun-Hui Zhang, Rutgers University

Title: Some Network Tomography and Species Problems

We discuss three estimation problems based on network data. In the first problem (joint work with Jiangang Fang and Yehuda Vardi), we propose ITGA, an iterative tomogravity algorithm for the estimation of network flow based on link data. We present certain experimental results to demonstrate the potential of ITGA. In the second problem (joint work with Jiangang Fang), we consider nonparametric estimation of delay distribution in general multicast experiments. We study a simple estimator based on observations with small delay in certain paths, and propose an improvement of it with faster convergence rates. The problem is related to deconvolution. In the third problem (joint work with Eric Kolaczyk, Fabien Viger, Alain Barrat, and Luca Dall'Asta), we treat estimation of network size based on source-destination probe data as a network species problem and propose leave-one-out and resampling estimators from respectively empirical Bayes and network scaling points of view.
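
As a point of reference for the "network species" framing, a classical leave-one-out species-richness estimator, the first-order jackknife, adds to the observed count a term driven by the species seen exactly once. This textbook estimator is shown for orientation only and is not the estimator proposed in the talk:

```python
from collections import Counter

def jackknife1_richness(observations):
    """First-order jackknife species-richness estimate:
    S_hat = S_obs + f1 * (n - 1) / n, where f1 is the number of
    species observed exactly once in a sample of size n."""
    n = len(observations)
    counts = Counter(observations)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return s_obs + f1 * (n - 1) / n

# Hypothetical probe data: each label is a "species" (e.g., a discovered node).
sample = ["a", "a", "b", "c", "c", "c", "d"]
est = jackknife1_richness(sample)  # 4 observed species, 2 singletons
```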
