DIMACS Theoretical Computer Science Seminar

Title: Random Walks on Context Spaces: Towards an explanation of the mysteries of semantic word embeddings

Speaker: Sanjeev Arora, Princeton University

Date: Wednesday, April 29, 2015 11:00am-12:00pm

Location: CoRE Bldg, Room 301, Rutgers University, Busch Campus, Piscataway, NJ

Abstract:

Semantic word embeddings represent words as vectors in R^d for, say, d=300. They are useful for many NLP tasks and are often constructed using nonlinear/nonconvex techniques such as deep nets and energy-based models. Recently, Mikolov et al. (2013) showed that such embeddings exhibit linear structure that can be used to solve "word analogy tasks" such as man: woman :: king: ??. Subsequently, Levy and Goldberg (2014) and Pennington et al. (2014) tried to explain why such linear structure should arise in embeddings derived from nonlinear methods. We point out gaps in these explanations and provide a more complete explanation in the form of a log-linear generative model for text corpora that directly models latent semantic structure in words and involves a random walk in the context space. A rigorous mathematical analysis yields explicit expressions for word co-occurrence probabilities, which leads to a surprising explanation for why word embeddings exist and why they exhibit linear structure that allows solving analogies. Our model and its analysis lead to several counterintuitive predictions, which are also empirically verified.

We think our methodology may be useful in other settings.
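
For readers unfamiliar with the word analogy task mentioned in the abstract, the following is a minimal sketch of how linear structure is typically used to answer man: woman :: king: ??. It takes the word whose vector is closest in cosine similarity to v_woman - v_man + v_king. The three-dimensional vectors, the vocabulary, and the function names are hypothetical and hand-picked for illustration; they are not taken from the talk, and real embeddings are learned from a corpus with d around 300.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Real embeddings are learned from text and have d ~ 300 dimensions.
EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Answer a : b :: c : ?? using the vector offset v_b - v_a + v_c.

    Returns the vocabulary word (other than a, b, c) whose vector has
    the highest cosine similarity to the offset vector.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


answer = solve_analogy("man", "woman", "king", EMBEDDINGS)
print(answer)
```

With these toy vectors the offset lands exactly on the vector assigned to "queen". With learned embeddings the match is only approximate, which is why the nearest neighbor by cosine similarity is used.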

Joint work with Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski (listed in alphabetical order).

See: http://www.math.rutgers.edu/~sk1233/theory-seminar/S15/