Small Summaries for Big Data
This book is expected to be published in 2020. This book is aimed at
both students and practitioners interested in algorithms and data
structures for working with very large volumes of data. These
techniques are of relevance to people working in big data, data science,
and machine learning. The algorithms described have been adopted in
systems from tech companies like Google, Apple, Microsoft, Netflix and
Twitter, and many more.
This material will be published by Cambridge
University Press as Small Summaries for Big Data by Graham Cormode and Ke Yi. This pre-publication version is free to
view and download for personal use only. Not for re-distribution, re-sale or use in derivative works. © copyright Graham
Cormode and Ke Ye 2019.
Summary
The volume of data generated in modern applications can be massive,
overwhelming our abilities to conveniently transmit, store, and index.
For many scenarios, it is desirable to instead build a compact summary
of a dataset that is vastly smaller. In exchange for some
approximation, we obtain flexible and efficient tools that can answer a
range of different types of query over the data. This book provides a
comprehensive introduction to the topic data summarization, showcasing
the algorithms, their behavior, and the mathematical underpinnings of
their operation. The coverage starts with simple sums and approximate
counts, building to more advanced probabilistic structures such as the
Bloom Filter, distinct value summaries, sketches, and quantile
summaries. Summaries are described for specific types of data, such as
geometric data, graphs, and vectors and matrices. Throughout, examples,
pseudocode and applications are given to enhance understanding.
Contents
We are making draft chapters publicly available (this snapshot from October 2019 is the pre-production draft).
We welcome comments/corrections by email to the authors.