Title: Probabilistic Topic Models and User Behavior
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents. Topic modeling ideas have been adapted to many domains, including images, music, networks, genomics, and neuroscience.
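The discovery of latent themes described above is typically done with latent Dirichlet allocation (LDA). The following is a minimal illustrative sketch of LDA via collapsed Gibbs sampling, not the speaker's code; the function name, toy corpus, and hyperparameter values are all assumptions made for the example.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns per-document topic counts (unnormalized topic proportions).
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    ndk = [[0] * K for _ in docs]                  # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]     # topic-word counts
    nk = [0] * K                                   # topic totals
    z = []                                         # topic assignment per token
    for di, d in enumerate(docs):
        zd = []
        for w in d:
            t = rng.randrange(K)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Conditional distribution over topics for this token.
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        t = k
                        break
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk

# A toy corpus with two apparent themes (biology and finance).
docs = [
    "gene dna genome gene dna".split(),
    "dna genome mutation gene".split(),
    "stock market trading finance".split(),
    "market finance economy stock".split(),
]
topic_counts = lda_gibbs(docs, K=2)
```

Normalizing each row of `topic_counts` gives the per-document topic proportions, i.e., how each document exhibits the discovered themes.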
Traditional topic modeling algorithms analyze a document collection and estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles in a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.
In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, models that simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on 80,000 scientists' libraries from Mendeley and 100,000 users' click data from the arXiv. Collaborative topic models enable interpretable recommendation systems, capturing scientists' preferences and pointing them to articles of interest. Further, these models can organize the articles according to the discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend disciplinary boundaries.
More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. Finally, I will survey some recent advances in this field. I will show how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.
David Blei is a Professor of Statistics and Computer Science at Columbia University. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data.
David earned his Bachelor's degree in Computer Science and Mathematics from Brown University (1997) and his PhD in Computer Science from the University of California, Berkeley (2004). Before arriving at Columbia, he was an Associate Professor of Computer Science at Princeton University. He has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013).
Title: Graph Isomorphism and Circuit Minimization
The complexity status of Graph Isomorphism (GI) and the Minimum Circuit Size Problem (MCSP) remains unresolved. We show that GI is efficiently reducible to MCSP. This is the first theorem relating the computational power of GI and MCSP, despite the long history these problems share as candidate NP-intermediate problems.
This is a joint work with Eric Allender.
Title: On Reconstructing Set Systems from Sizes of Small Intersections
An old result of Kahn, Linial and Samorodnitsky shows that when the size of the universe is n, a set system A_1, ..., A_m is specified uniquely up to permutations given the sizes of its t-way intersections for 1 <= t <= 1 + log n. While simple, the proof is non-constructive and does not give an efficient algorithm to reconstruct the set system.
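To make the input to this reconstruction problem concrete, the following sketch computes the sizes of all t-way intersections of a set system for t up to 1 + log n. The function name and toy set system are illustrative assumptions, not from the talk.

```python
from itertools import combinations
from math import floor, log2

def intersection_profile(sets, universe_size):
    """Sizes of all t-way intersections for 1 <= t <= 1 + floor(log2(n)).

    Returns a dict mapping a tuple of set indices to the size of the
    intersection of those sets.
    """
    t_max = 1 + floor(log2(universe_size))
    profile = {}
    for t in range(1, min(t_max, len(sets)) + 1):
        for idx in combinations(range(len(sets)), t):
            inter = set.intersection(*(sets[i] for i in idx))
            profile[idx] = len(inter)
    return profile

# A toy set system over a universe of size n = 5.
A = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
prof = intersection_profile(A, universe_size=5)
```

By the Kahn-Linial-Samorodnitsky result, this profile alone determines the set system up to permutations of the universe; the talk's contribution is an efficient algorithm, via the Ellipsoid method, to actually recover it.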
In this work, we show how to make their method constructive by using the Ellipsoid method.
Joint work with Rocco Servedio.
Title: The Polynomial Freiman-Ruzsa Conjecture in Additive Combinatorics and its Applications in Complexity Theory
Additive combinatorics is the branch of discrete mathematics which aims to quantify the amount of additive structure in subsets of additive groups. One of the central conjectures in additive combinatorics is the polynomial Freiman-Ruzsa conjecture (PFR), which attempts to classify 'approximate subgroups' of abelian groups. In a recent breakthrough, a slightly weaker quasi-polynomial version of this conjecture was proven by Sanders.
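The notion of an 'approximate subgroup' is commonly measured by the doubling constant |A+A|/|A|: a subgroup has doubling exactly 1, while an unstructured set can have doubling close to the maximum. The sketch below, an illustration rather than anything from the talk, computes this quantity in the group F_2^n, where addition is bitwise XOR.

```python
def doubling(A):
    """Doubling constant |A+A| / |A| of a subset A of F_2^n (addition = XOR)."""
    sumset = {a ^ b for a in A for b in A}
    return len(sumset) / len(A)

# A subgroup of F_2^4 (the span of 0b0001 and 0b0010) is closed under
# addition, so its sumset is itself and the doubling constant is 1.
subgroup = {0b0000, 0b0001, 0b0010, 0b0011}

# A generic 4-element set attains the maximum: 0 plus six distinct
# pairwise XORs, giving |A+A| = 7 and doubling 7/4.
generic = {0b0001, 0b0110, 0b1010, 0b1111}
```

The PFR conjecture asserts, roughly, that a set with small doubling must be efficiently covered by cosets of a genuine subgroup, with a polynomial (rather than exponential) dependence on the doubling constant.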
The PFR conjecture has been shown to have various applications in complexity theory: to testing linear and quadratic polynomials [Samorodnitsky '07, Green-Tao '10, Lovett '10], to the construction of two-source extractors [Ben-Sasson, R. '11], to relating rank to communication complexity [Ben-Sasson, Lovett, R. '12], to lower bounds on locally decodable codes [Bhowmick, Dvir, Lovett '13] and to the construction of non-malleable codes [Aggarwal, Dodis, Lovett '14]. Many of these applications are derived via the approximate duality conjecture, introduced in [Ben-Sasson, R. '11], which was shown to have tight relations with the polynomial Freiman-Ruzsa conjecture.
In the talk I will introduce the polynomial Freiman-Ruzsa and approximate duality conjectures and survey the relations between them, as well as some of the applications mentioned above.
Title: Large-scale Curvature of Real-life Networks and Its Implications for Computations
We provide evidence that real-life 'informatics' networks exhibit strong hyperbolicity in the sense of Gromov: the '4-point delta' is much smaller than the graph diameter. This property can be exploited algorithmically. For example, we show that all-pairs distances on a graph with hundreds of millions of edges may be approximated in O(E) time (e.g., in less than ten minutes on a standard server) with small average error, and the p-centers of the same graph may be computed in tens of seconds, both far faster than anticipated by theory. We speculate that this result is likely due to the uncharacteristically small 'hyperbolic cores' of these graphs.
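For reference, the 4-point delta of a graph is computed from the standard Gromov four-point condition: for every four vertices, take the three pairwise distance sums, and record half the gap between the two largest; the hyperbolicity is the maximum over all 4-tuples. A brute-force sketch on toy graphs (the function names and example graphs are illustrative assumptions, and this enumeration is far too slow for the large networks in the talk):

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def four_point_delta(adj):
    """Gromov hyperbolicity via brute force over all 4-tuples of vertices."""
    nodes = list(adj)
    d = {u: bfs_dist(adj, u) for u in nodes}
    delta = 0.0
    for w, x, y, z in combinations(nodes, 4):
        sums = sorted([d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]])
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta

# Trees are 0-hyperbolic: the two largest sums coincide for every 4-tuple.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
```

A cycle, by contrast, has nonzero delta; the talk's claim is that large real-world networks behave much more like the tree case, with delta far below the diameter.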
Title: Poisson, Dirichlet, and Redundancy in Estimation Over Large Alphabets
This talk is about probability estimation in the so-called "large-alphabet" setting, where one wants to estimate a distribution $P$ on a possibly countably infinite alphabet from i.i.d. samples. Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large-alphabet settings. Pattern probability estimation forms the basis for the information-theoretic approach to data compression in this setting. In this talk we'll look at how estimators based on Bayesian nonparametric modeling perform with respect to information-theoretic measures of compression efficiency.
This is joint work with Narayana P. Santhanam (U. Hawaii) and Jae Oh Woo (Yale U.).