Title: Local and Non-local Preferential Attachment Models
The last few years have seen an explosion in network models describing the evolution of real world networks. In the context of math probability, one aspect which has seen an intense focus is the interplay between randomness and limited choice in the evolution of networks. The aim of this talk is to describe the following two pieces of recent work exploring this phenomenon and display the ubiquity of Jagers-Nerman stable age distribution theory from continuous time branching processes in such models.
1. Local models or power of choice: Consider a model of a network where each new individual entering the system only has limited information about the entire network and must make a choice based on this choice. We will consider a particular case linked to the phenomenon of power of two choices.
2. Non-local preferential attachment: To model the formation of "superstars" in twitter event networks, we describe a simple variant of preferential attachment which seems to perform much better on empirical data than the standard model. We will describe the mathematical techniques required to rigorously analyze this model.
Title: Large-Scale Graph Analysis Using Giraph
Large graphs are ubiquitous in real-world problems. Online social networks such as Facebook or Twitter may contain ~10^9 vertices and ~10^12 edges. Machine learning and statistical analysis on such large graphs motivates the need both for scalable computational frameworks as well as techniques amenable to such frameworks. In this talk I will describe Giraph, an open-source platform which implements a Bulk Synchronous Parallel (BSP) model of computation. Many graph algorithms lend themselves naturally to this model of computation. I will describe some of these algorithms and show how many common tasks such as graph partitioning, clustering, and classification can be accomplished within this framework. I will also show how to build a model in this framework for network prediction at scale.
Title: Generalized Preferential Attachment Models
We talk about preferential attachment models with superlinear attraction function and further generalization where at each stage more than one edge is added to the graph. This talk is derived from a paper with C. Cotar, M. LiCalzi and work (under progress) with M. Holmes.
Title: Real Time Processing of Graphical Models in Education
The speaker is Director of Data Science at Knewton, a digital education company that processes interactions streams from over 2 million students with compact (10-50k node) content graphs in order to produce realtime recommendations. Knewton uses probabilistic graphical models to express the relational and temporal structure of each educational experience, inferring latent parameters of content, and of students, that explain observed interactions and inform our recommendations. We've encountered a number of interesting technical challenges while scaling our platform, including versioned content graphs, sparsity and temporal clustering in observed data, and wide variation in the key dimensions of our models. This talk will focus on the (sometimes) pragmatic approaches we've taken to engineering around those challenges.
Title: Measuring Tie Strength in Implicit Social Networks
Given a set of people and a set of events attended by them, we address the problem of measuring connectedness or tie strength between each pair of persons. The underlying assumption is that attendance at mutual events gives an implicit social network between people. We take an axiomatic approach to this problem. Starting from a list of axioms, which a measure of tie strength must satisfy, we characterize functions that satisfy all the axioms. We then show that there is a range of tie-strength measures that satisfy this characterization. A measure of tie strength induces a ranking on the edges of the social network (and on the set of neighbors for every person). We show that for applications where the ranking, and not the absolute value of the tie strength, is the important aspect about the measure, the axioms are equivalent to a natural partial order. To settle on a particular measure, we must make a non-obvious decision about extending this partial order to a total order. This decision is best left to particular applications. We also classify existing tie-strength measures according to the axioms that they satisfy; and observe that none of the "self-referential" tie-strength measures satisfy the axioms. In our experiments, we demonstrate the efficacy of our approach; show the completeness and soundness of our axioms, and present Kendall Tau Rank Correlation between various tie-strength measures.
Title: Connection Times in Large Ad-hoc Mobile Networks
We study connectivity properties in a probabilistic model for a large mobile ad-hoc network. We consider a large number of participants of the system moving randomly in a large domain, with a space-dependent population density of finite, positive order and with a fixed time horizon. Messages are instantly transmitted according to a relay principle, i.e., they are iteratively forwarded from participant to participant over distances ≤ 2R, with 2R the communication radius, until they reach the recipient. In mathematical terms, this is a dynamic continuum percolation model.
We consider the connection time of two sample participants, the amount of time over which these two are connected with each other. In the above thermodynamic limit, we find that the connectivity induced by the system can be described in terms of the counterplay of a local, random, and a global, deterministic mechanism, and we give a formula for the limiting behaviour.
A prime example of the movement schemes that we consider is the well-known random waypoint model (RWP). Here we describe the decay rate, in the limit of large time horizons, of the probability that the portion of the connection time is less than expected.
Title: Nonparametric Graph Estimation
The graphical model has proven to be a useful abstraction in statistics and machine learning. The starting point is the graph of a distribution. While often the graph is assumed given, we are interested in estimating the graph from data. In this talk we present new nonparametric and semiparametric methods for graph estimation. The performance of these methods is illustrated and compared on several real and simulated examples.
Title: Learning Non-Gaussian Bayesian Networks Using Siri
Siri stands for "sliced inverse-regression with interaction detection," a new method we have just developed under the semi-parametric index models for detecting non-linear and interactive effects of the predictors on the response variable. In index models, the response is influenced by predictors through an unknown function of several linear combinations of the predictors. Our finding of the Bayesian formulation of such models enables us to propose a set of new models and methods that can effectively discover second-order effects and interactions among the covariates. A two-stage stepwise procedure based on likelihood ratio test is developed to select relevant predictors and a Bayesian model with dynamic slicing scheme is derived. The performance of the proposed procedure in comparison with some existing method is demonstrated through simulation studies. We also demonstrate how it can correctly identify non-linear non-Gaussian Bayesian network structures when other methods fail.
This is based on the joint work with Yang Li and Bo Jiang.
Title: Contrasted Penalized Integrative Analysis
Single-dataset analysis of high-throughput omics data often leads to unsatisfactory results. The integrative analysis of heterogeneous raw data from multiple independent studies provides an effective way to increase sample size and improve marker selection results. In integrative analysis, the regression coefficient matrix has certain structures. In our study, we use group penalization for one- or two-dimensional marker selection and introduce contrast penalties to accommodate the subtle coefficient structures. Simulations show that the proposed methods have significantly improved marker selection properties. In the analysis of cancer genomic data, important markers missed by the existing methods are identified.
Title: Change Point Inference for Time-varying Erdos-Renyi Graphs
We investigate a model of an Erdos-Renyi graph, where the edges can be in a present/absent state. The states of each edge evolve as a Markov chain independently of the other edges, and whose parameters exhibit a change-point behavior in time. We derive the maximum likelihood estimator for the change-point and characterize its distribution. Depending on a measure of the signal-to-noise ratio present in the data, different limiting regimes emerge. Nevertheless, a unifying adaptive scheme can be used in practice that covers all cases.We illustrate the model and its flexibility on US Congress voting patterns using roll call data.
Title: Dynamics and Large-Scale Structure in Networks
Two active areas of networks research are (1) studies of large-scale structure in networks, things like hierarchy, ranking, and graph partitioning; and (2) studies of dynamical processes, like network traffic, percolation, and synchronization. This talk explores the connection between the two, focusing on two examples in particular, community structure and epidemic processes. The study of these phenomena has produced some powerful new results concerning spectral methods and belief propagation, among other things, and reveals deep and unexpected truths about network structure and its effect on the processes we care about.
Title: Characterizing Individual Behavior from Interaction History
An issue of fundamental importance in network science is characterizing the actors in a network. For example, practitioners are often interested in identifying important or influential actors, and they are interested in clustering actors into similar communities. We seek to characterize the behaviors of actors in a network in terms of their past histories. This characterization is achieved through a continuous-time Cox model which treats networks events (messages and other transactions) as instantaneous events. By using such an inferential framework, we are able to learn about the influence of an individual message on the actors in a network, along with how the influence decays in time.
Title: Graph Models and Scalable Analytics
Graphs are a natural representation of relationships between various types of objects, which arise in many applications, such as the web, social networks, business intelligence, information retrieval, and computer security. The increasing volume and complexity of data motivates the need for scalable analytics that can answer key questions such as: (i) which are the most important nodes? or (ii) what are the key communities of nodes? In this talk, we motivate and discuss various graph models and then present scalable analytics for three important classes of problems.
Title: The Dynamics of Dissemination on Graphs: Theory and Algorithms
Big graphs are prevalent and have posed many fascinating research questions. Among others, graphs are becoming a popular platform to host many interesting dynamic processes, such as the dissemination of viruses, memes, opinions, rumors, etc. What are the key (and common) graph parameters governing these dynamic processes? How can we design effective algorithms to optimize such parameters to guild a corresponding dynamic process in a desired way? In this talk, we aim to answer these questions in three related settings: (1) a dynamic process on static graphs; (2) a dynamic process on dynamic graphs; and (3) a dynamic process on static graphs with partial immunity.
Title: Estimation of Exponential Random Graph Models for Large Social Networks via Graph Limits
Analyzing and modeling network data have become increasingly important in a wide range of scientific fields. Among popular models, exponential random graph models (ERGM) have been developed to study these complex networks. For large networks, however, maximum likelihood estimation (MLE) of parameters in these models can be very difficult, due to the unknown normalizing constant. Alternative strategies based on Markov chain Monte Carlo draw samples to approximate the likelihood, which is then maximized to obtain the MLE. These strategies have poor convergence due to model degeneracy issues. Chatterjee and Diaconis (2011) proposed a new theoretical framework for estimating the parameters of ERGM by approximating the normalizing constant using the emerging tools in graph theorygraph limits. In this paper, we construct a complete computational procedure built upon their results with practical innovations. More specifically, we evaluate the likelihood via simple function approximation of the corresponding ERGM's graph limit and iteratively maximize the likelihood to obtain the MLE. We also propose a new matching method to find a starting point for our iterative algorithm. Through simulation study and real data analysis of two large social networks, we show that our new method outperforms the MCMC-based method, especially when the network size is large (more than 100 nodes).
Priya Govindan, Rutgers University
Title: Local Structural Features Threaten Privacy across Social Networks
Can only a handful of structural features and no knowledge of
link-level connectivity threaten privacy of individuals in a social
network? Specifically, can we use a small number of local structural
features defined on nodes (such as degree, clustering coefficient,
average degree of neighbors, average clustering coefficient of
neighbors, etc) to reduce the uncertainty of individuals' identities?
Our work shows that the answer is yes. We detail how without direct
access to the adjacency matrix, we can use the information in an
auxiliary graph to effectively and efficiently reduce the uncertainty
of any individual's identity. To our knowledge, our work is the first
of its kind that uses only structural features in order to reduce the
uncertainty of an individual's identity to the best possible set of
candidates. In addition to being efficient (i.e., sub-quadratic
computation), our approach automatically selects the appropriate size
of the best possible set of candidates. Previous works assume the
adjacency matrix of the anonymized graph is released; and hence rely
on the sparsity of the human behavior exhibited in the adjacency
matrix. The node × feature matrix is not sparse, making the
re-identification problem a lot harder. Our experiments involve
multiple synthetic and real graphs plus different noise models and
parameter settings. On average for 70% of the individuals in real
graphs, we are able to reduce the uncertainty by 84%. In synthetic
graphs, on average for 82% of the nodes, we are able to reduce the
uncertainty by 60%. We compare our approach to three baseline methods;
and explain our results in terms of Jaccard Similarity and the number
of Lookalikes between the original and auxiliary graphs.
Title: Big Data - Small Mammals: Analyzing Social Networks of Naked Mole Rats
Many statistical and data mining techniques have been used to analyze the deluge of data generated by computerized, sensing devices. From scanner data in retail outlets, to astronomical data, to data generated by social network sites such as Twitter and Facebook, the challenges for making sense of this body of information is complex. All of this has led to a set of methods that can target different patterns and structure within a data set.
Behavioral psychologists traditionally have relied on "low-tech" methodologies for observing animal behavior in the wild and the laboratory. These methods are time intensive and laborious. When the observed animal is a colony animal, with many individuals to observe, traditional methods fail. In previous work, our approach to this issue was to inject RFID passive transponders under the skin of our study animal, the Naked Mole-rat (NMR). RFID readers are placed throughout the housing environment, allowing us to track all movements of all animals, as they move through these areas, with sub-second resolution for long periods of time. This methodology generates huge amounts of data requiring Big Data analytical techniques. From this data we create and analyze the resultant social network structure using frequent pattern analysis, adjacency matrix sampling, and principle components analysis.1, 2 In recent work we are able to measure changes in colony wide behavior described by these social networks. To measure this, we simulate the occurrence of animals in the colony losing RFID transponders in a randomly selected sample from 1, to the entire 33 members of the colony at three different times of day (least active, moderately active, and most active), and compare the social networks indicating animal co-localization using a quadratic assignment procedure (QAP) with a Pearson Correlation Coefficient in addition to a Monte Carlo technique using the Pearson Correlation Coefficient. Our results showed that these methods can detect events that differ from baseline.
Title: A Guide to Selecting a Network Similarity Method
We consider the problem of determining how similar two networks
(without known node-correspondences) are. This problem occurs
frequently in real-world applications such as transfer learning and
change detection. Many network-similarity methods exist; and it is
unclear how one should select from amongst them. We provide the first
empirical study on the relationships between different
network-similarity methods. Specifically, we present (1) an approach
for identifying groups of comparable network-similarity methods and
(2) an approach for computing the consensus among a given set of
network-similarity methods. We compare and contrast twenty
network-similarity methods by applying our approaches to a variety of
real datasets spanning multiple domains. Our experiments demonstrate
that (1) different network-similarity methods are surprisingly well
correlated, (2) some complex network-similarity methods can be closely
approximated by a much simpler method, and (3) a few
network-similarity methods produce rankings that are very close to the
consensus ranking.
Title: Methods and Lessons Learned from Mining the Wikipedia corpus (2001-2010)
Wikipedia is a collaboratively edited, multilingual, free Internet encyclopedia and social media. There are several issues relating to the involvement of Wikipedia knowledge system. For example, how the system is formed, how it evolves, how people collaborate to contribute their knowledge, what is the motivation of their contribution, and etc. The study of Wikipedia user profile data will tell us some clues to answer previous questions. Furthermore, the identification and modeling of top Wikipedia collaborators represent the main stream involvement of Wikipedia. It is also interesting to study their likelihood to stay a stable output over time. Finally, we will be able to reproduce the system after modeling and parameter inference. It will also provide methods and lessons to study similar knowledge based social media.
Title: Towards Robust Device-Free Passive Localization Through Automatic Camera-Assisted Recalibration
Device-free passive (DfP) localization techniques can localize human subjects without wearing a radio tag. Being convenient and private, DfP can find many applications in ubiquitous/pervasive computing. Unfortunately, DfP techniques need frequent manual recalibration of the radio signal values, which can be cumbersome and costly. We present SenCam, a sensor-camera collaboration solution that conducts automatic recalibration by leveraging existing surveillance camera(s). When the camera detects a subject, it can periodically trigger recalibration and update the radio signal data accordingly. This technique requires camera access occasionally each month, minimizing computational costs and reducing privacy concerns when compared to localization techniques solely based on cameras. Through experiments in an open indoor space, we show this scheme can retain good localization results while avoiding manual recalibration.
Title: Low Expected Latency Routing in Dynamic Networks
Timely and efficient message transmission through intermittently and sparsely connected networks is a problem of significant interest to the mobile networking community. Although the long-term statistics obeyed by the time-varying connectivity in such networks can be characterized systematically and can be used for selecting good routes, it may be possible to achieve better performance if actual states of the links varying over time can be intelligently used in conjunction with these statistical dynamics models. In this paper, we explore a {\em continuum} of minimum expected latency routing principles for such dynamic networks, spanning purely model-based and state-oblivious source routing; state-based source routing; and various flavors of dynamic (or hop-by-hop) routing with increasing amounts of current link state knowledge around the source (which is available at the source).
First, we give guaranteed approximation algorithms for the model-assisted source routing problem; then we show using extensive simulations on both synthetic and MIT Reality Mining traces that although dynamically sampling link states helps to improve expected routing latency compared to source routing, there is a rapidly diminishing return in performance with respect to the knowledge of current link states beyond 2 hops from the source. To the best of our knowledge, this is the first thorough characterization of the performance of the entire spectrum of model-assisted routing algorithms ranging from little knowledge to complete knowledge of link dynamics.
Susan P. Imberman, College of Staten Island and the Graduate Center at CUNY
Sucheta Soundarajan, Rutgers University
Wei Wutao, Purdue University
Chenren Xu, Rutgers University
Feng Yu, the Graduate Center at CUNY
Previous: Program