DIMACS Workshop on Analysis of Information from Diverse Sources

May 16 - 17, 2013
DIMACS Center, CoRE Building, Rutgers University

Organizers:
Min-ge Xie, Rutgers University, mxie at stat.rutgers.edu
Abel Rodriguez, University of California, Santa Cruz
Presented under the auspices of the Special Focus on Information Sharing and Dynamic Data Analysis and the Department of Statistics, Rutgers University.

Abstracts:

David Banks, Duke University

Title: Text Networks

The dynamics of Wikipedia, political blogs, and computational advertising are all situations in which the analyst can draw upon two kinds of data: information on the text in webpages, and the network connectivity structure between pages. In principle, each kind of information can inform the joint analysis; for example, latent Dirichlet allocation can identify topics in text, and the extent to which a particular node participates in a topic may be a covariate used in forecasting the formation of edges. Reciprocally, one may use connectivity patterns to sharpen inference on topic memberships. This talk describes several forays into this area and points up some of the emerging challenges in joining the recent field of network modeling with text mining.
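
As a rough illustration of this kind of pipeline (not the speaker's implementation), one might fit LDA to the page texts and use each node's topic proportions as covariates in a simple edge-formation model. The toy corpus, topic count, pairwise features, and logistic link below are all placeholder choices.

    # Sketch: LDA topic proportions as covariates for edge formation.
    # Hypothetical toy data; not the speaker's implementation.
    import numpy as np
    from itertools import combinations
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    pages = ["election senate vote policy", "vote policy debate",
             "advertising click auction", "auction bid click revenue"]
    adj = np.array([[0, 1, 0, 0],      # observed links between the four pages
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]])

    counts = CountVectorizer().fit_transform(pages)
    topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

    # One feature vector per node pair: elementwise product of topic proportions,
    # so shared topics push the predicted edge probability up.
    pairs = list(combinations(range(len(pages)), 2))
    X = np.array([topics[i] * topics[j] for i, j in pairs])
    y = np.array([adj[i, j] for i, j in pairs])

    edge_model = LogisticRegression().fit(X, y)
    print(edge_model.predict_proba(X)[:, 1])   # fitted edge probabilities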


Ming-Hui Chen, University of Connecticut

Title: Development of Power Priors for Incorporating Historical Data with Applications

The power prior has emerged as a useful informative prior for the incorporation of historical data in a Bayesian analysis. We provide an overview of the development of power priors in this presentation. The properties of power priors will be examined, and several key theoretical results in this development will be presented. The strategy for selecting a guide value for the power parameter and recent applications of power priors in the Bayesian design of clinical trials will also be discussed. Several examples are given to illustrate the use of power priors. This presentation is based on a series of joint works with Joseph G. Ibrahim and many other collaborators.
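
For readers unfamiliar with the construction, a minimal sketch of a power prior in a conjugate beta-binomial setting follows; the historical and current counts and the guide value a0 are purely illustrative and are not taken from the talk.

    # Sketch of a power prior in a beta-binomial setting (illustrative numbers only).
    # Power prior: pi(theta | D0, a0)  proportional to  L(theta | D0)^a0 * pi0(theta),
    # with a0 in [0, 1] controlling how much the historical data D0 are discounted.
    from scipy import stats

    y0, n0 = 12, 40          # historical data D0: successes / trials (hypothetical)
    y, n = 18, 50            # current data (hypothetical)
    a0 = 0.5                 # guide value for the power parameter

    # With a Beta(1, 1) initial prior, the power prior is Beta(1 + a0*y0, 1 + a0*(n0 - y0)),
    # and the posterior after the current data is again a beta distribution.
    post = stats.beta(1 + a0 * y0 + y, 1 + a0 * (n0 - y0) + (n - y))
    print(post.mean(), post.interval(0.95))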


Brian Claggett, Harvard University

Title: Nonparametric inference for meta-analysis with fixed, unknown, study-specific parameters: A resampling of confidence distributions approach

Meta-analysis is a valuable tool for combining information from independent studies. However, most common meta-analysis techniques rely on distributional assumptions that are difficult, if not impossible, to verify. For instance, in the commonly used fixed-effects and random-effects models, we take for granted that the underlying study parameters are either exactly the same across individual studies or realizations of a random sample from a population, often under a parametric distributional assumption. In this paper, we present a new framework for summarizing information obtained from multiple studies and for making inferences that do not depend on any distributional assumption for the study-level unknown, fixed parameters, {theta_1, ..., theta_K}. Specifically, we draw inferences about, for example, the quantiles of this set of parameters using study-specific summary statistics. This type of problem is quite challenging (Hall and Miller, 2010). We utilize a novel resampling method via the confidence distributions of the thetas to construct confidence intervals for the above quantiles. We justify the validity of the interval estimation procedure asymptotically and compare the new procedure with the standard bootstrapping method. We also illustrate our proposal with data from a recent meta-analysis of the effect of an antioxidant treatment on the prevention of contrast-induced nephropathy. (Joint work with Min-ge Xie and Tian Lu)
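
A naive sketch of the resampling idea, assuming normal confidence distributions for each study and purely hypothetical estimates and standard errors, is given below; it is meant only to fix ideas and is not the paper's exact procedure.

    # Rough sketch of resampling study-level confidence distributions to get an
    # interval for a quantile of {theta_1, ..., theta_K}; numbers are hypothetical
    # and this is only the naive version of the idea, not the paper's procedure.
    import numpy as np

    rng = np.random.default_rng(0)
    est = np.array([0.10, 0.25, 0.05, 0.40, 0.15])   # study point estimates
    se = np.array([0.05, 0.08, 0.04, 0.10, 0.06])    # their standard errors
    q = 0.5                                          # quantile of interest (median)

    draws = []
    for _ in range(5000):
        # one draw from each study's (normal) confidence distribution
        thetas = rng.normal(est, se)
        draws.append(np.quantile(thetas, q))

    print(np.percentile(draws, [2.5, 97.5]))         # interval for the median theta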


Siddhartha R. Dalal, RAND Corporation, Columbia University & DIMACS, Rutgers University

Title: Revolutionizing Policy Analysis Using "Big Data" Analytics

As policy analysis becomes applicable to new domains, it is being challenged by the "curse of dimensionality": the vastness of available information, the need for increasingly detailed and delicate analysis, and the speed with which new analysis is needed and old analysis must be refreshed. Moreover, with the proliferation of digital information available at one's fingertips, and the expectation that this information be quickly leveraged, policy analysis in these new domains is handicapped without scalable methods.

I will describe the results of a new initiative I started at RAND, which developed new methods for these "big data" problems to create new information and to convert enormous amounts of existing information into the knowledge needed for policy analysis. The specific methods draw on crowdsourcing models, information analytics, and web technologies that have already revolutionized research in other areas. The specific examples I discuss come from medical informatics: finding adverse effects of drugs and chemicals, and suicide prevention. I will also describe some of the statistical methods used, including nonparametric Bayes methods, multiple hypothesis testing, and machine learning applied to natural language processing (NLP).


Lee Dicker, Rutgers University

Title: One-shot learning and big data with n=2

We model a "one-shot learning" situation, where very few (scalar) observations y_1,...,y_n are available. Associated with each observation y_i is a very high-dimensional vector x_i, which provides context for y_i and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of x_i is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses. This is joint work with Dean Foster.
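
A toy illustration of principal component regression with n = 2 and a high-dimensional context vector, followed by multiplication by an expansion constant c > 1, is sketched below; the data-generating model and the value of c are placeholders, not the estimator derived in the talk.

    # Toy illustration of principal component regression with very small n and large d,
    # plus an ad hoc expansion factor c > 1; the value of c below is a placeholder,
    # not the estimator derived in the talk.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 2, 1000
    u = rng.normal(size=d); u /= np.linalg.norm(u)    # latent direction supplying "context"
    X = rng.normal(size=(n, 1)) * u + 0.1 * rng.normal(size=(n, d))
    y = X @ u + 0.05 * rng.normal(size=n)

    # classical PCR with one component: project onto the top right-singular vector of X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]
    scores = X @ v
    beta = (scores @ y) / (scores @ scores)           # least squares on the 1-d scores

    x_new = rng.normal() * u + 0.1 * rng.normal(size=d)
    pred_classical = (x_new @ v) * beta
    c = 1.2                                           # "expansion" constant (placeholder)
    pred_expanded = c * pred_classical
    print(pred_classical, pred_expanded)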


Ying Hung, Rutgers University

Title: Design and Analysis for Multifidelity Computer Experiments

Computer experiments refer to those experiments that are performed using computers with the help of physical models and numerical methods, such as finite element analysis. In this talk, experimental design and modeling techniques are discussed for multifidelity computer experiments. The methods are illustrated by an example studying the effect of warpage on the fatigue reliability of solder bumps.


Nick Jewell, UC-Berkeley

Title: Combining Single and Two Group Outcome Risks/Comparisons from Multiple Studies of Safety

Although an essential component of efficacy clinical trials, the assessment of adverse events is often not a primary outcome. As such, most individual trials have insufficient data to yield precise treatment comparisons regarding the frequency of adverse events. To achieve such precision, many 'similar' studies are often pooled. Here we consider some examples of pooling studies in both the single- and multiple-treatment scenarios with a view to illustrating common pitfalls in terms of both bias and variability.


Soumen Lahiri, North Carolina State University

Title: Combining information from different sources: A resampling approach

In this talk, we consider combining information from more than one source on an unknown functional parameter where the exact dependence structure of the data sources is not completely known. We propose a theoretical framework to estimate the functional parameter by combining the information from different sources and a bootstrap based methodology for uncertainty quantification of the combined estimator.
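
A generic sketch of the idea, assuming independent sources, inverse-variance weighting, and a simple nonparametric bootstrap (all simplifications relative to the framework in the talk), might look as follows.

    # Generic sketch: combine estimates of the same quantity from several data sources
    # and bootstrap the combined estimator. The inverse-variance weights and the
    # independent resampling below are simplifying placeholders, not the talk's method.
    import numpy as np

    rng = np.random.default_rng(2)
    sources = [rng.normal(1.0, 1.0, 200),     # hypothetical samples from three sources
               rng.normal(1.1, 2.0, 100),
               rng.normal(0.9, 0.5, 50)]

    def combined_mean(samples):
        means = np.array([s.mean() for s in samples])
        variances = np.array([s.var(ddof=1) / len(s) for s in samples])
        w = 1.0 / variances
        return np.sum(w * means) / np.sum(w)

    boot = [combined_mean([rng.choice(s, size=len(s), replace=True) for s in sources])
            for _ in range(2000)]
    print(combined_mean(sources), np.percentile(boot, [2.5, 97.5]))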


Dungang Liu, Yale School of Public Health

Title: Confidence distribution approaches to efficient meta-analysis of heterogeneous studies

Meta-analysis has been widely used to synthesize evidence from multiple studies for common hypotheses or parameters of interest. For evidence synthesis, it has been shown in recent years that the confidence distribution, a "distribution estimator" of the unknown parameter, is a useful and convenient tool. In this talk, I will present two confidence distribution approaches to meta-analysis. Both approaches are motivated by the heterogeneity among studies that is often seen in meta-analysis. We show that the proposed confidence distribution approaches can make use of all evidence, direct as well as indirect, and thus enable us to make efficient inference. Specifically, one approach focuses on the fixed-effects setting, and we show that (1) our approach is asymptotically as efficient as the maximum likelihood approach using individual participant data (IPD) from all the studies; (2) unlike the IPD approach, our approach requires only summary statistics; (3) our approach is robust against misspecification of the working correlation structure of the parameter estimates. The other approach focuses on the random-effects setting, and we show that (1) the proposed approach can efficiently integrate all the studies in the network, even when individual studies provide comparisons for only some of the treatments; (2) unlike a commonly used Bayesian hierarchical model, the proposed approach is prior-free and can always provide a proper inference regardless of the between-trial covariance structure. (Joint work with Min-ge Xie, Regina Liu, Guang Yang, and David C. Hoaglin)
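
As a point of reference, the simplest normal fixed-effects combination of confidence distributions, which reduces to inverse-variance weighting of summary statistics, can be sketched as below; the numbers are hypothetical, and the talk's approaches for heterogeneous and network settings go well beyond this.

    # Minimal example of combining normal confidence distributions from summary
    # statistics in the fixed-effects case (where it reduces to inverse-variance
    # weighting); illustrative numbers, far simpler than the approaches in the talk.
    import numpy as np
    from scipy import stats

    est = np.array([0.30, 0.10, 0.25])     # study estimates of a common parameter
    se = np.array([0.12, 0.15, 0.10])      # their standard errors

    w = 1.0 / se**2
    theta_c = np.sum(w * est) / np.sum(w)  # center of the combined confidence distribution
    se_c = 1.0 / np.sqrt(np.sum(w))

    # Combined CD is N(theta_c, se_c^2); read off an interval and a one-sided p-value.
    print(theta_c, stats.norm.interval(0.95, loc=theta_c, scale=se_c))
    print(stats.norm.cdf(0.0, loc=theta_c, scale=se_c))   # CD support for theta <= 0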


George Michailidis, University of Michigan

Title: Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles

Reconstructing transcriptional regulatory networks is an important task in functional genomics. Data obtained from experiments that perturb genes by knockouts or RNA interference contain useful information for addressing this reconstruction problem. However, such data can be limited in size and/or expensive to acquire. On the other hand, observational data of the organism in steady state (e.g. wild-type) are more readily available, but their informational content is inadequate for the task at hand. We develop a computational approach to appropriately utilize both data sources for estimating a regulatory network.

The proposed approach is based on a three-step algorithm that uses both perturbation screens and steady-state gene expression data as input to estimate the underlying directed but cyclic network. In the first step, the algorithm determines causal orderings of the genes that are consistent with the perturbation data, by combining an exhaustive search method with a fast heuristic that in turn couples a Monte Carlo technique with a fast search algorithm. In the second step, for each ordering, a regulatory network is estimated using a penalized-likelihood-based method, while in the third step a consensus network is constructed from the highest-scoring networks. Extensive computational experiments show that the algorithm performs well in uncovering the underlying network and clearly outperforms competing approaches that rely on only a single data source. Further, it is established that the algorithm produces a consistent estimate of the regulatory network.
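
A heavily simplified, schematic version of the three steps on a three-gene toy example might look as follows; the lasso regressions, BIC-style score, and majority-vote consensus are placeholders, not the method of the paper.

    # Schematic toy version: (1) enumerate gene orderings consistent with the
    # perturbation screens, (2) fit a penalized regression network per ordering,
    # (3) keep a consensus of the best-scoring networks.
    import numpy as np
    from itertools import permutations
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    n, genes = 100, 3
    # toy steady-state data where gene 0 drives gene 1, which drives gene 2
    x0 = rng.normal(size=n)
    x1 = 0.8 * x0 + 0.3 * rng.normal(size=n)
    x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)
    X = np.column_stack([x0, x1, x2])

    # perturbation screens summarized as "knocking out i changed j": i must precede j
    knockout_effects = {(0, 1), (0, 2), (1, 2)}
    orderings = [p for p in permutations(range(genes))
                 if all(p.index(i) < p.index(j) for i, j in knockout_effects)]

    def fit_network(order):
        """Regress each gene on its predecessors in the ordering with a lasso."""
        edges, rss = np.zeros((genes, genes)), 0.0
        for pos, j in enumerate(order):
            parents = list(order[:pos])
            if not parents:
                rss += np.sum((X[:, j] - X[:, j].mean()) ** 2)
                continue
            fit = Lasso(alpha=0.1).fit(X[:, parents], X[:, j])
            edges[parents, j] = fit.coef_ != 0
            rss += np.sum((X[:, j] - fit.predict(X[:, parents])) ** 2)
        score = -n * np.log(rss / n) - np.log(n) * edges.sum()   # crude BIC-like score
        return edges, score

    results = [fit_network(o) for o in orderings]
    best = sorted(results, key=lambda r: r[1], reverse=True)[:2]
    consensus = (np.mean([e for e, _ in best], axis=0) >= 0.5).astype(int)
    print(consensus)        # adjacency matrix of the consensus network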


Natesh Pillai, Harvard University

Title: Efficiency of Bayesian procedures and the frequentist-Bayes connection in some high dimensional problems

I will present my recent results on the efficiency of some widely used Bayesian procedures in high-dimensional problems. I will present three examples, each highlighting a different challenge. The first two involve constructing shrinkage priors, understanding their properties, and comparing them with their frequentist analogues. The third example concerns the validity of approximate Bayesian computation (ABC), a model selection procedure widely used by biologists. I will close with some thoughts on the design of MCMC algorithms. All of these examples will hopefully illustrate the need for more careful research on the efficiency, and even validity, of these widely used methods.
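
For readers unfamiliar with ABC, a minimal rejection-ABC sketch is given below; the toy model, summary statistic, and tolerance are placeholder choices intended only to fix ideas about the procedure whose validity is discussed.

    # Minimal rejection-ABC sketch (toy model, summary statistic, and tolerance).
    import numpy as np

    rng = np.random.default_rng(4)
    observed = rng.normal(2.0, 1.0, size=50)        # pretend these are the data
    s_obs = observed.mean()                         # summary statistic

    accepted = []
    for _ in range(20000):
        theta = rng.uniform(-5, 5)                  # draw from the prior
        sim = rng.normal(theta, 1.0, size=50)       # simulate data under theta
        if abs(sim.mean() - s_obs) < 0.1:           # keep theta if summaries are close
            accepted.append(theta)

    print(np.mean(accepted), np.std(accepted))      # approximate posterior mean / sd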


Gavino Puggioni, University of Rhode Island

Title: A Bayesian Nonparametric Approach for Spatial Point Processes

We propose a nonparametric method to estimate the intensity of a point process observed in space and time. The modeling procedure, treated as a dynamic density estimation problem, involves the specification of a prior based on a Dirichlet Process mixture of Normal distributions at each point in time. Temporal dependence is introduced through the atoms that evolve as Dynamic Linear Models. The methodology is complemented by an application to sea turtle nesting patterns observed at Juno Beach, FL from 1999 to 2001.


Abel Rodriguez, UC-Santa Cruz

Title: Bayesian Inference for General Gaussian Graphical Models With Application to Multivariate Lattice Data

We introduce efficient Markov chain Monte Carlo methods for inference and model determination in multivariate and matrix-variate Gaussian graphical models. Our framework is based on the G-Wishart prior for the precision matrix associated with graphs that can be decomposable or non-decomposable. We extend our sampling algorithms to a novel class of conditionally autoregressive models for sparse estimation in multivariate lattice data, with a special emphasis on the analysis of spatial data. These models embed a great deal of flexibility in estimating both the correlation structure across outcomes and the spatial correlation structure, thereby allowing for adaptive smoothing and spatial autocorrelation parameters. Our methods are illustrated using a simulated example and a real-world application which concerns cancer mortality surveillance. Supplementary materials with computer code and the datasets needed to replicate our numerical results together with additional tables of results are available online. This is joint work with Adrian Dobra and Alex Lenkoski.


Chris Schmid, Brown University

Title: Bayesian Network Meta-Analysis for Unordered Categorical Outcomes With Incomplete Data

We develop a Bayesian multinomial network meta-analysis model for unordered (nominal) categorical outcomes that allows for partially observed data in which exact event counts may not be known for each category. This model properly accounts for correlations of counts in mutually exclusive categories and enables proper comparison and ranking of treatment effects across multiple treatments and multiple outcome categories. We apply the model to analyze 17 trials, each of which compares two of three treatments (high- and low-dose statins and standard care/control) for some combination of the six outcomes of fatal and non-fatal stroke, fatal and non-fatal myocardial infarction, other causes of mortality, and no event. We provide software code to implement the method.


Hao Wang, University of South Carolina

Title: Scaling It Up: Stochastic Graphical Model Determination under Spike and Slab Prior Distributions

Gaussian covariance graph models and Gaussian concentration graph models are two classes of models useful for uncovering latent dependence structures among multivariate variables. In the Bayesian literature, graphs are often induced by priors over the space of positive definite matrices with fixed zeros, but these methods present daunting computational burdens in large problems. Motivated by the superior computational efficiency of continuous shrinkage priors for linear regression models, I propose a new framework for graphical model determination that is based on continuous spike and slab priors and uses latent variables to identify graphs. I discuss model specification, computation, and inference for both covariance graph models and concentration graph models. The new approach produces reliable estimates of graphs and efficiently handles problems of hundreds of variables.
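
A minimal sketch of the continuous spike-and-slab idea on a single off-diagonal precision element is given below; the hyperparameter values are placeholders, and the full sampler for the graph and the positive-definite matrix is not shown.

    # Sketch of the continuous spike-and-slab prior on an off-diagonal precision
    # element: a mixture of a tight "spike" normal and a diffuse "slab" normal,
    # with a latent indicator giving the graph. Hyperparameter values are placeholders.
    import numpy as np
    from scipy import stats

    v0, v1, pi_edge = 0.02, 1.0, 0.2      # spike sd, slab sd, prior edge probability

    def log_prior(omega_ij):
        spike = (1 - pi_edge) * stats.norm.pdf(omega_ij, scale=v0)
        slab = pi_edge * stats.norm.pdf(omega_ij, scale=v1)
        return np.log(spike + slab)

    def edge_probability(omega_ij):
        """Posterior probability of the latent edge indicator given omega_ij."""
        spike = (1 - pi_edge) * stats.norm.pdf(omega_ij, scale=v0)
        slab = pi_edge * stats.norm.pdf(omega_ij, scale=v1)
        return slab / (spike + slab)

    print(edge_probability(0.01), edge_probability(0.5))   # small vs. large element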


Ying Nian Wu, UCLA

Title: Unsupervised learning of compositional sparse code for natural image representation

We propose an unsupervised method for learning compositional sparse code for representing natural images. Our method is built upon the original sparse coding framework, where there is a dictionary of basis functions, often in the form of localized, elongated, and oriented wavelets, so that each image can be represented by a linear combination of a small number of basis functions automatically selected from the dictionary. In our compositional sparse code, the representational units are composite: they are compositional patterns formed by the basis functions. These compositional patterns can be viewed as shape templates. We propose an unsupervised method for learning a dictionary of frequently occurring templates from training images (which can come from multiple object categories), so that each training image can be represented by a small number of templates automatically selected from the learned dictionary. Experiments show that our method is capable of learning meaningful compositional sparse code, and the learned templates are useful for image classification. Based on joint work with Yi Hong, Zhangzhang Si, Wenze Hu and Song-Chun Zhu.
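
To fix ideas about the underlying sparse coding step, a minimal sketch using orthogonal matching pursuit on a random dictionary follows; the dictionary and "image" are synthetic stand-ins, and the compositional-template layer described in the talk is not shown.

    # Sketch of the underlying sparse coding step: represent a signal by a small
    # number of dictionary elements chosen by orthogonal matching pursuit. The
    # random dictionary and "image" stand in for wavelet bases and real images.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(5)
    d, k = 256, 100                          # signal dimension, dictionary size
    D = rng.normal(size=(d, k))
    D /= np.linalg.norm(D, axis=0)           # unit-norm basis functions (columns)

    true_coef = np.zeros(k)
    true_coef[[3, 17, 42]] = [1.5, -2.0, 1.0]   # the "image" uses only 3 bases
    image = D @ true_coef + 0.01 * rng.normal(size=d)

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(D, image)
    print(np.nonzero(omp.coef_)[0])          # indices of the selected basis functions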


Heping Zhang, Susan Dwight Bliss Professor of Biostatistics, Yale University

Title: Genetic Studies of Multivariate Traits

In psychiatric and behavioral research, about six out of ten people with a substance use disorder suffer from another form of mental illness as well, making it necessary to consider multiple conditions as we study the etiologies of these conditions. The occurrence of multiple disorders in the same patient is referred to as comorbidity. Identifying the risk factors for comorbidity is an important yet difficult topic in psychiatric research. Efforts to study the genetics of comorbidity can be traced back a century. It is important to consider and develop inferential tools for multivariate outcomes, particularly when the outcomes are discrete. There is extensive literature on the statistical analysis of multivariate normal variables as well as on nonparametric tests for a single variable of non-normal distribution. However, few options are available for inference when we have multiple non-normally distributed variables and potentially a hybrid of continuous and discrete variables. To overcome this challenge, we made use of statistical techniques such as rank-based U-statistics and kernel-based weighted statistics to accommodate the mix of continuous and discrete outcomes and the presence of important covariates. We conducted thorough simulation studies and analytic evaluation to assess type I error control and the power of our proposed test. Both empirical and theoretical results suggest that our proposed test increases the power for testing association between genetic variants and multivariate traits while adjusting for the covariates. Applications of our test to real data sets also revealed novel insights.

