DIMACS/Columbia Data Science Institute Workshop on Cryptography for Big Data

December 14 - 15, 2015
Davis Auditorium, Schapiro Center for Engineering and Physical Science Research (CEPSR)
Columbia University

Organizers:
Rosario Gennaro, City College of New York, rosario at ccny.cuny.edu
Tal Malkin, Columbia University, tal at cs.columbia.edu
Hoeteck Wee, École Normale Supérieure and Columbia University
Presented under the auspices of the DIMACS Special Focus on Information Sharing and Dynamic Data Analysis and the DIMACS Special Focus on Cryptography (part of the DIMACS/Simons Collaboration in Cryptography), in collaboration with the Columbia Data Science Institute.

Abstracts:

Augustin Chaintreau, Columbia University

Title: What Web Transparency Can Do for You and What Crypto Researchers Can Do for Web Transparency?

Big Data promises important societal progress but exacerbates the need for algorithmic accountability, following recent evidence of digital redlining, online predatory targeting and price discrimination. This talk will first survey the recent area of web transparency, a quest to build generic methods to address this urgent need. Questions such as "What, inside my personal data, is responsible for this ad I just received?" or "Can I trust this online provider who claims not to use data on some sensitive features for personalization?" will be shown in a new light.

In the spirit of fostering more research in this area, we then focus on two limitations of current tools: how to cope with increasing complexity in personalization, and how to conduct transparency experiments at no added risk to the users. We briefly mention two examples of work in progress bridging computational learning theory and privacy-preserving analytics systems to present promising avenues for cryptographic approaches in this domain.

(Based on joint works with R. Geambasu, D. Hsu, M. Lecuyer, R. Spahn, G. Ducoffe.)
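
As a rough illustration of the kind of transparency experiment discussed above (the shadow profiles, input names, and scoring rule below are hypothetical, not the authors' system), one can populate profiles with different subsets of personal inputs and ask which input best explains the appearance of a targeted ad:

    # Illustrative sketch (not the authors' system): score which personal input
    # best "explains" an ad by comparing its presence across shadow profiles
    # that did / did not receive the ad.

    def best_explanation(profiles, saw_ad):
        """profiles: dict profile_id -> set of inputs placed in that profile.
        saw_ad: set of profile_ids that received the ad under study."""
        inputs = set().union(*profiles.values())
        def score(inp):
            with_inp = [p for p in profiles if inp in profiles[p]]
            without = [p for p in profiles if inp not in profiles[p]]
            rate_with = sum(p in saw_ad for p in with_inp) / max(len(with_inp), 1)
            rate_without = sum(p in saw_ad for p in without) / max(len(without), 1)
            return rate_with - rate_without   # correlation-style "lift"
        return max(inputs, key=score)

    profiles = {
        "shadow1": {"email:clinic-visit", "email:newsletter"},
        "shadow2": {"email:clinic-visit"},
        "shadow3": {"email:newsletter"},
    }
    print(best_explanation(profiles, saw_ad={"shadow1", "shadow2"}))
    # -> email:clinic-visit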


David Cash, Rutgers University

Title: The Locality of Searchable Encryption

Searchable encryption (SE) allows a data owner to encrypt an index so that an untrusted server, which cannot decrypt the data, can still search it. In this talk I will describe a theoretical lower bound on the "spatial locality" of searchable encryption, showing that encrypted search is inherently non-local (and thus often slower) when compared to plaintext searching. Our bound was motivated by implementation difficulties with SE for large datasets, and appears to be the first such lower bound where poor locality is required for security.

Joint work with Stefano Tessaro.
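
To make the locality notion concrete, here is a schematic comparison under assumed layouts (not the paper's construction): a plaintext inverted index can keep all postings for a keyword contiguous, while a typical searchable-encryption index places each (keyword, document) pair at a pseudorandom address, so retrieving r results touches r scattered locations.

    # Schematic illustration of (non-)locality; assumed layouts, not the paper's scheme.
    import hashlib

    DOCS = {"w": [f"doc{i}" for i in range(5)]}   # postings for keyword "w"
    TABLE_SIZE = 2**20

    # Plaintext index: one contiguous region per keyword -> one seek for all results.
    plaintext_layout = {"w": list(range(100, 100 + len(DOCS["w"])))}

    # Typical SSE index: each (keyword, i) pair sits at a pseudorandom address
    # derived from a PRF, so the 5 results live at 5 scattered addresses.
    def prf_address(key, keyword, i):
        digest = hashlib.sha256(key + keyword.encode() + i.to_bytes(4, "big")).digest()
        return int.from_bytes(digest[:8], "big") % TABLE_SIZE

    key = b"secret-prf-key"
    sse_layout = {"w": [prf_address(key, "w", i) for i in range(len(DOCS["w"]))]}

    print("plaintext addresses:", plaintext_layout["w"])    # contiguous
    print("SSE addresses:      ", sorted(sse_layout["w"]))  # scattered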


Yevgeniy Dodis, NYU

Title: Backdoorless Cryptography

In recent years, the cryptographic community has learned of a disturbingly wide array of new security vulnerabilities. The revelations of Edward Snowden show that the United States National Security Agency successfully gained access to secret information by extraordinary means, including subverting cryptographic standards, intercepting messages and tampering with hardware on its way to users. Due to the complexity of modern cryptographic software, such vulnerabilities are extremely hard to detect in practice, and, ironically, cryptographic modules are often the easiest to attack, as attackers can often use cryptographic mechanisms to mask their activities or opportunistically hide their communications within encrypted traffic. This leads to the question of whether any meaningful security is possible in a setting where the designers of the algorithm/standard might intentionally or accidentally introduce hidden backdoors which are unknown to the unsuspecting public, but allow them to break security without detection.

Motivated by these considerations, we initiate the study of backdoorless cryptography, and present positive and negative results for building both backdoored and backdoorless pseudorandom generators.


Yaniv Erlich, Columbia

Title: Genome Hacking

We are entering an era of ubiquitous genetic information for research, clinical care and personal curiosity. Sharing these data sets is vital for progress in biomedical research. However, a growing concern is the ability to protect the genetic privacy of the data originators. Here, I report that surnames can be recovered from personal genomes. When combined with other types of metadata, this technique can be used to triangulate the identity of the target. A key feature of this technique is that it relies entirely on free, publicly accessible Internet resources. I will also talk about the inability to protect most types of genomic information using current techniques.
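
A rough sketch of the matching step behind this kind of surname inference (the marker names, database entries, and threshold below are hypothetical placeholders, not real data): extract Y-chromosome short tandem repeat (Y-STR) counts from the genome and look for the closest haplotype in a public genetic-genealogy resource that links haplotypes to surnames.

    # Hypothetical sketch of Y-STR surname matching; all data are placeholders.

    target_ystr = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

    # Stand-in for a public genealogy database mapping Y-STR haplotypes to surnames.
    surname_db = [
        ("Venter", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
        ("Smith",  {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13}),
    ]

    def distance(a, b):
        """Count marker mismatches between two Y-STR haplotypes."""
        return sum(a[m] != b.get(m) for m in a)

    best_surname, best_profile = min(surname_db, key=lambda e: distance(target_ystr, e[1]))
    if distance(target_ystr, best_profile) <= 1:    # arbitrary closeness threshold
        print("candidate surname:", best_surname)   # then triangulate with metadata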


Sanjam Garg, UC Berkeley

Title: Secure RAM Computation

Customarily, secure computation techniques are devised for circuits. Using these techniques for securing RAM programs ends up being prohibitively inefficient. To overcome this problem, several secure computation techniques that work directly for RAM programs have been developed. However, the performance of these solutions is unsatisfactory.

In this talk, I will survey some of the new developments that considerably improve the efficiency of various secure computation tasks for RAM programs along several parameters.

(Based on joint works with Steve Lu, Divya Gupta, Peihan Miao, Payman Mohassel, Omkant Pandey, Charalampos Papamanthou, Antigoni Polychroniadou, Rafail Ostrovsky, and Alessandra Scafuro.)


Sharon Goldberg, Boston University

Title: NSEC5: Provably Preventing DNSSEC Zone Enumeration

DNSSEC is designed to prevent network attackers from tampering with domain name system (DNS) messages. The cryptographic machinery used in DNSSEC, however, also creates a new vulnerability, zone enumeration, enabling an adversary to use a small number of online DNSSEC queries combined with offline dictionary attacks to learn which domain names are present or absent in a DNS zone. We start by proving that the design underlying the current DNSSEC standard, with NSEC and NSEC3 records, inherently suffers from zone enumeration: specifically, we show that security against network attackers and privacy against zone enumeration cannot be satisfied simultaneously unless the DNSSEC server performs online public-key cryptographic operations. We then propose NSEC5, a new cryptographic construction that solves the problem of DNSSEC zone enumeration while remaining faithful to the operational realities of DNSSEC. NSEC5 can be thought of as a variant of NSEC3 in which the unkeyed hash function is replaced with an RSA-based keyed hashing scheme.

Joint work with Moni Naor, Dimitrios Papadopoulos, Leonid Reyzin, Sachin Vasant, Asaf Ziv.
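
A hedged sketch of the core idea (not the full NSEC5 specification): replace NSEC3's public, unkeyed hash of names with a keyed hash that only the holder of a private key can compute, here approximated by hashing a deterministic RSA signature over the query name. Anyone can verify such a value given the signature, but no one can enumerate it offline.

    # Sketch of an RSA-based keyed hash in the spirit of NSEC5 (not the exact spec).
    import hashlib
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    def nsec3_style_hash(name: bytes) -> bytes:
        # NSEC3: unkeyed hash -- anyone can mount an offline dictionary attack.
        return hashlib.sha256(name).digest()

    def nsec5_style_hash(name: bytes) -> bytes:
        # Keyed hash: a deterministic RSA signature over the name, then hashed.
        # Only the private-key holder can evaluate it, so offline zone
        # enumeration is no longer possible.
        sig = private_key.sign(name, padding.PKCS1v15(), hashes.SHA256())
        return hashlib.sha256(sig).digest()

    print(nsec3_style_hash(b"www.example.com").hex()[:16])
    print(nsec5_style_hash(b"www.example.com").hex()[:16])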


Seny Kamara, Microsoft Research

Title: Inference Attacks on Property-Preserving Encrypted Databases

Many encrypted database (EDB) systems have been proposed in the last few years as cloud computing has grown in popularity and data breaches have increased. The state-of-the-art EDB systems for relational databases can handle SQL queries over encrypted data and are competitive with commercial database systems. These systems, most of which are based on the design of CryptDB (SOSP 2011), achieve these properties by making use of property-preserving encryption schemes such as deterministic (DTE) and order-preserving encryption (OPE).

In this paper, we study the concrete security provided by such systems. We present a series of attacks that recover the plaintext from DTE- and OPE-encrypted database columns using only the encrypted column and publicly-available auxiliary information. We consider well-known attacks, including frequency analysis and sorting, as well as new attacks based on combinatorial optimization.

We evaluate these attacks empirically in an electronic medical records (EMR) scenario using real patient data from 200 U.S. hospitals. When the encrypted database is operating in a steady-state where enough encryption layers have been peeled to permit the application to run its queries, our experimental results show that an alarming amount of sensitive information can be recovered. In particular, our attacks correctly recovered certain OPE-encrypted attributes (e.g., age and disease severity) for more than 80% of the patient records from 95% of the hospitals; and certain DTE-encrypted attributes (e.g., sex, race, and mortality risk) for more than 60% of the patient records from more than 60% of the hospitals.

This is joint work with Muhammad Naveed and Charles Wright.
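
A minimal sketch of the two classical attacks the abstract mentions (illustrative toy data, not the paper's evaluation code): frequency analysis matches the i-th most common DTE ciphertext to the i-th most common value in public auxiliary data, and the sorting attack matches sorted OPE ciphertexts to the sorted plaintext domain.

    # Illustrative frequency-analysis and sorting attacks on DTE / OPE columns.
    from collections import Counter

    # --- Frequency analysis against deterministic encryption (DTE) ---
    def frequency_attack(dte_column, aux_histogram):
        """Match ciphertexts to plaintexts by frequency rank."""
        ct_ranked = [c for c, _ in Counter(dte_column).most_common()]
        pt_ranked = [p for p, _ in sorted(aux_histogram.items(),
                                          key=lambda kv: -kv[1])]
        return dict(zip(ct_ranked, pt_ranked))

    dte_column = ["c1", "c1", "c1", "c2", "c2", "c3"]
    aux = {"female": 0.55, "male": 0.40, "other": 0.05}   # public auxiliary statistics
    print(frequency_attack(dte_column, aux))   # {'c1': 'female', 'c2': 'male', 'c3': 'other'}

    # --- Sorting attack against order-preserving encryption (OPE) ---
    def sorting_attack(ope_column, plaintext_domain):
        """If every domain value appears, order alone reveals the mapping."""
        return dict(zip(sorted(set(ope_column)), sorted(plaintext_domain)))

    ope_column = [9120, 4431, 7702, 4431, 9120]        # toy OPE ciphertexts
    print(sorting_attack(ope_column, [1, 2, 3]))       # e.g., disease severity levels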


Hugo Krawczyk, IBM Research

Title: Highly-Functional Highly-Scalable Search on Large Encrypted Data Sets

I will describe advances in practical solutions to the problem of searchable encryption in which a data owner D outsources a database to an external server E (e.g., the cloud) in encrypted form. Later, D can authorize clients to search the data at E and retrieve only those records matching their authorized queries, while hiding from E information about the plaintext database and about the values queried by clients. Privacy of queries is enforced even against the data owner, who authorizes queries while learning only minimal information about the queried values. In all cases, searches at E are performed without ever decrypting data or queries (in particular, E never gets the decryption keys).

A solution developed by IBM Research and UC Irvine presents major advances relative to prior work, which focused on single-keyword search, a single client, and implementations on small databases. This work supports search via any Boolean expression applied to sets of keywords associated with documents in the database, as well as range queries. Our implementation of the proposed protocol has been tested on databases with billions of index entries (document-keyword pairs), e.g., a US-scale census database with 100 million records, each with 1,000 associated keywords, and a very large collection of crawled web pages that includes, among others, a full snapshot of the English Wikipedia. Recently, we expanded the system to support more costly queries, including substring, wildcard and phrase searches.

The availability of such technology enables privacy solutions that secure data by separating encrypted data from the keys used to decrypt it, and by applying fine-grain access control and delegation mechanisms at the data owner end. While in the above solution data encryption is semantically secure (in particular it avoids the use of deterministic or order preserving encryption), some forms of leakage via access patterns to the encrypted data are still possible. Reducing leakage (while keeping efficiency) and understanding its significance for particular applications is one of the most challenging problems in this area.
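
As background for the Boolean, multi-keyword protocol described above, here is a hedged sketch of the classical single-keyword starting point (not the talk's protocol): the owner builds an encrypted index keyed by a PRF of each keyword, and a client's search token lets the server look up, but not decrypt, the matching record identifiers.

    # Hedged sketch of single-keyword searchable encryption (not the talk's protocol).
    import hashlib, hmac, os
    from cryptography.fernet import Fernet

    index_key = os.urandom(32)                    # PRF key held by owner / clients
    record_cipher = Fernet(Fernet.generate_key()) # encrypts record identifiers

    def search_token(keyword: str) -> bytes:
        # Deterministic token: a PRF (HMAC) of the keyword under the secret key.
        return hmac.new(index_key, keyword.encode(), hashlib.sha256).digest()

    def build_index(docs):
        """docs: dict doc_id -> set of keywords. Returns the encrypted index
        handed to the server: token -> list of encrypted doc ids."""
        index = {}
        for doc_id, keywords in docs.items():
            for w in keywords:
                index.setdefault(search_token(w), []).append(
                    record_cipher.encrypt(doc_id.encode()))
        return index

    def server_search(index, token):
        return index.get(token, [])   # server matches tokens, never sees plaintext

    docs = {"rec1": {"diabetes", "age:54"}, "rec2": {"asthma", "age:54"}}
    encrypted_index = build_index(docs)
    hits = server_search(encrypted_index, search_token("age:54"))
    print([record_cipher.decrypt(c).decode() for c in hits])   # client decrypts: ['rec1', 'rec2']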


Anna Lysyanskaya, Brown University

Title: A Whirlwind Tour of Anonymous Credentials

How does Alice convince Bob that she possesses a particular credential from Charlie? If Alice has a signature from Charlie on her public key, then all she needs to do is show this signature to Bob and also convince Bob that she is indeed Alice. In this setting, it is relatively clear how to revoke Alice's credentials; also, one can limit the frequency and context of Alice's credential use, and hold her accountable if she exceeds these limits.
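
A minimal sketch of this non-anonymous baseline (the library calls and message formats are an illustration, not a prescribed protocol): Charlie signs Alice's public key, Alice shows that signature to Bob, and Alice proves she controls the key by signing a fresh challenge from Bob.

    # Sketch of the non-anonymous credential flow described above (illustrative only).
    import os
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey, Ed25519PublicKey)
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    charlie_sk = Ed25519PrivateKey.generate()      # the credential issuer
    alice_sk = Ed25519PrivateKey.generate()
    alice_pk_bytes = alice_sk.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

    # Issuance: Charlie signs Alice's public key -> the "credential".
    credential = charlie_sk.sign(alice_pk_bytes)

    # Showing: Bob checks Charlie's signature, then challenges Alice to prove
    # she holds the secret key behind the certified public key.
    charlie_sk.public_key().verify(credential, alice_pk_bytes)   # raises if invalid

    challenge = os.urandom(32)
    response = alice_sk.sign(challenge)
    alice_pk = Ed25519PublicKey.from_public_bytes(alice_pk_bytes)
    alice_pk.verify(response, challenge)                         # Bob is convinced
    print("credential accepted; note that Bob has learned Alice's public key")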

How does Alice do this without revealing her identity to Bob, or indeed any identifying information? Research on anonymous credentials concerns itself with protocols that enable Alice to obtain and demonstrate possession of credentials without revealing unnecessary information. Similarly to the non-anonymous case, limits can be placed on the frequency and context of Alice's use of anonymous credentials as well, and she can be identified and held accountable if she exceeds them.

Anonymous delegation is a research area that concerns itself with the next question: how does Alice delegate her credential to Bob without revealing any information about herself or learning anything about Bob? In this talk, I will survey what we know so far about anonymous credentials, conditional anonymity and anonymous delegation. Specifically, I will outline generic constructions for these schemes and give examples of specific instantiations. Some of these instantiations are efficient enough for practical use and have been implemented and piloted in real systems. I will also review key open problems whose resolution will make anonymous credentials more widely deployable.


Raluca Popa, UC Berkeley

Title: Building Web Applications on Top of Encrypted Data

Web applications rely on servers to store and process confidential information. However, anyone who gains access to the server (e.g., an attacker, a curious administrator, or a government) can obtain all of the data stored there.

In this talk, I present Mylar, a platform for building web applications, which protects data confidentiality against attackers with full access to server data. Mylar stores sensitive data encrypted on the server, and decrypts that data only in users' browsers. Mylar addresses three challenges in making this approach work. First, Mylar provides a new searchable encryption scheme that enables search over data encrypted with different keys. Second, Mylar allows users to share keys and encrypted data securely in the presence of an active adversary. Finally, Mylar ensures that client-side application code is authentic, even if the server is malicious. Results with a prototype of Mylar built on top of the Meteor framework are promising: porting 6 applications required changing just 36 lines of code on average, and the performance overheads are modest, amounting to a 17% throughput loss and a 50 ms latency increase for sending a message in a chat application. Mylar has also been used to secure an endometriosis medical application at Newton-Wellesley Hospital in Boston.

Joint work with E. Stark, S. Valdez, J. Helfer, N. Zeldovich, F. Kaashoek and H. Balakrishnan
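
A hedged illustration of the basic principle only (Mylar's actual multi-key searchable encryption, key sharing, and code verification are far richer): sensitive fields are encrypted under a key the server never sees, so a compromised server stores only ciphertext.

    # Principle-only sketch: encrypt at the client, store ciphertext on the server.
    # This is NOT Mylar's scheme; it omits multi-key search, sharing, and code verification.
    from cryptography.fernet import Fernet

    class Client:
        def __init__(self):
            self.key = Fernet(Fernet.generate_key())   # lives only in the user's browser
        def send(self, server, message: str):
            server.store(self.key.encrypt(message.encode()))
        def read(self, server):
            return [self.key.decrypt(c).decode() for c in server.dump()]

    class Server:
        def __init__(self):
            self.rows = []
        def store(self, ciphertext):
            self.rows.append(ciphertext)
        def dump(self):
            return self.rows           # an attacker here sees only ciphertext

    alice, chat_server = Client(), Server()
    alice.send(chat_server, "test result: negative")
    print(chat_server.dump()[0][:20], "...")   # opaque bytes on the server
    print(alice.read(chat_server))             # plaintext only at the client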


Thomas Ristenpart, Cornell Tech

Title: Exploiting Leakage in Machine Learning and Searchable Encryption

I'll talk about how adversaries can take advantage of subtle leakage of confidential information. In the first half of the talk I'll discuss machine learning and leakage issues in ML models. ML prediction functions, which we refer to as models, are frequently derived from sensitive training data sets and are exposed either directly or indirectly (via API calls) to potential adversaries. I will present our work on model inversion, a new attack setting that we explored recently in a number of scenarios including pharmacogenetics, facial recognition, and lifestyle surveys. In the model inversion attacks we develop, an adversary can use access to a model and some partial information about a person to improve their ability to guess sensitive information about that person, such as their genotype, an image of their face, or private lifestyle behaviors.
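
A hedged sketch of the attack pattern just described (the model, feature names, and scoring rule are placeholders): given query access to a model and a victim's known non-sensitive attributes, enumerate candidate values of the sensitive attribute and return the one under which the model's behavior best matches what the adversary observes.

    # Illustrative model-inversion loop; the model and attributes are placeholders.

    def invert_sensitive_feature(model_confidence, known_attrs, observed_label, candidates):
        """Return the candidate sensitive value that best explains the model's
        observed behaviour on the victim."""
        def score(candidate):
            features = dict(known_attrs, sensitive=candidate)
            return model_confidence(features, observed_label)
        return max(candidates, key=score)

    # Stand-in "pharmacogenetic" model: confidence that a dosage label fits the features.
    def toy_model_confidence(features, label):
        risky = (features["sensitive"] == "variant_A") and (features["age"] > 60)
        return 0.9 if (label == "low_dose") == risky else 0.1

    guess = invert_sensitive_feature(
        toy_model_confidence,
        known_attrs={"age": 67, "weight": 80},   # demographics the adversary knows
        observed_label="low_dose",               # e.g., victim's published outcome
        candidates=["variant_A", "variant_B"],
    )
    print("inferred genotype:", guess)           # -> variant_A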

I will then go on to discuss our recent work on searchable symmetric encryption, in which a client uploads encrypted documents to a server in a way that allows the client to later search them using encrypted queries. These schemes cannot achieve traditional notions of encryption security, and instead are most often proven secure relative to some leakage model that characterizes the kinds of information an (adversarial) server can obtain about the confidential data. We'll look at leakage-abuse attacks that show how to abuse this leakage to learn damaging information about plaintexts.

This talk will cover joint work with Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, and David Page, and with David Cash, Paul Grubbs, and Jason Perry.
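
A small sketch of one leakage-abuse idea in the spirit of that line of work (the data and names are illustrative): if the scheme leaks how many documents each query matches, an attacker with approximate knowledge of keyword frequencies can match encrypted queries to keywords by their result counts.

    # Illustrative "count"-style leakage-abuse attack: match encrypted queries to
    # keywords using only the number of results each query returned.

    observed = {"q1": 412, "q2": 17, "q3": 121}   # query -> leaked result count
    known_counts = {"the": 412, "invoice": 121, "diagnosis": 17}   # attacker's estimates

    def match_queries(observed, known_counts):
        recovered = {}
        for query, count in observed.items():
            # A unique count yields an unambiguous keyword recovery.
            hits = [w for w, c in known_counts.items() if c == count]
            if len(hits) == 1:
                recovered[query] = hits[0]
        return recovered

    print(match_queries(observed, known_counts))
    # {'q1': 'the', 'q2': 'diagnosis', 'q3': 'invoice'}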


Mike Rosulek, Oregon State

Title: Secure Computation with Sublinear Cost

Secure computation refers to a protocol that allows parties to perform a computation on private inputs, revealing only the result of that computation and nothing more. In order to guarantee security, the cost of the protocol must be at least linear in the size of the data. This limitation is particularly problematic when the data is large; secure computation and sublinear-time algorithms appear to be incompatible. In this talk I will survey recent approaches that attempt to avoid this limitation and allow for sublinear-cost secure computation in certain settings.


Abhi Shelat, University of Virginia

Title: The Next 100,000x

In the last 5 years, there have been roughly 10^5-factor improvements in two-party secure computation, in both scale and speed. Where will the next 10^5 factor come from? I will speculate and present a few promising directions. I will also discuss an application of secure computation that is “big” for secure computation and currently out of reach.


Elaine Shi, Cornell

Title: Trusted Hardware: Life, the Composable Universe, and Everything

Trusted hardware offers a promising and practically viable solution for enabling privacy-preserving analytics over big data. Although numerous secure processors have been built in academia and industry, difficult challenges stand in the way of practical, ubiquitous adoption of trusted hardware.

In this talk, I will give a tutorial on trusted hardware such as Intel's SGX and seek to answer frequently asked questions about it.

Part of this talk seeks to build a bridge between architecture, cryptography, and programming languages, and to promote synergy between these communities such that we can work together towards enabling the ubiquitous adoption of trusted hardware.


Adam Smith, Penn State

Title: Privacy of Secured Computations

Modern cryptography offers a mind-blowing range of tools for "securing computation", from zero-knowledge proofs and secure function evaluation to homomorphic encryption and program obfuscation. These tools offer sophisticated control over the information leaked by intermediate computations, essentially allowing one to hide all but the desired outputs.

The increasing maturity of these tools raises profound questions: What computations can, and should, we secure? How can we reason about the information leaked by the outputs of a computation?

This tutorial will present some of the technical ideas that address these problems. We will give a partial taxonomy of the attacks on data privacy based on computation outputs. We will then define differential privacy, a framework that offers provable resistance to a large class of attacks. We will briefly survey the techniques used to achieve differential privacy and, time permitting, recent connections to economics and statistics.
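
To make the definition concrete, here is a standard Laplace-mechanism sketch (a textbook example, not tied to any specific system in the talk): to release a count with epsilon-differential privacy, add Laplace noise scaled to the query's sensitivity.

    # Textbook Laplace mechanism for an epsilon-differentially private count.
    import random

    def laplace_noise(scale: float) -> float:
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def dp_count(records, predicate, epsilon: float) -> float:
        """Release |{r : predicate(r)}| with epsilon-differential privacy.
        A counting query has sensitivity 1: one person changes it by at most 1."""
        true_count = sum(1 for r in records if predicate(r))
        return true_count + laplace_noise(scale=1.0 / epsilon)

    records = [{"age": 34, "smoker": True}, {"age": 61, "smoker": False},
               {"age": 47, "smoker": True}]
    print(dp_count(records, lambda r: r["smoker"], epsilon=0.5))   # noisy answer near 2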


Vinod Vaikuntanathan, MIT

Title: Computing on Encrypted Data

The basic nature of encryption has always been all-or-nothing: anyone who knows the secret key can decrypt and recover the entire data, but without the key, nothing is revealed.

Modern technologies such as cloud computing raise fundamentally new challenges: Can we compute on encrypted data without decrypting it, and without knowledge of the secret key? Which functions can be computed this way? Who can learn the results of such computations?

In this talk, I will survey recent works on constructing fully homomorphic encryption and functional encryption schemes, two powerful methods of computing on encrypted data that answer these challenges.
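
A toy illustration of the underlying idea (textbook ElGamal, which is only multiplicatively homomorphic; fully homomorphic encryption supports arbitrary circuits and is far more involved): the server multiplies two ciphertexts without the secret key, and the result decrypts to the product of the plaintexts.

    # Toy multiplicatively homomorphic encryption (textbook ElGamal, tiny parameters).
    # Illustration only: the parameters are insecure and real FHE schemes are lattice-based.
    import random

    p, g = 10007, 5                   # toy prime modulus and base
    x = random.randrange(2, p - 1)    # secret key
    h = pow(g, x, p)                  # public key

    def encrypt(m):
        r = random.randrange(2, p - 1)
        return (pow(g, r, p), m * pow(h, r, p) % p)

    def decrypt(c):
        c1, c2 = c
        return c2 * pow(c1, p - 1 - x, p) % p   # c2 * c1^(-x) mod p

    def multiply_ciphertexts(ca, cb):
        # The server computes this without ever seeing x, ma, or mb.
        return (ca[0] * cb[0] % p, ca[1] * cb[1] % p)

    ca, cb = encrypt(12), encrypt(34)
    print(decrypt(multiply_ciphertexts(ca, cb)))   # 408 == 12 * 34, computed under encryption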


Panel

In this panel, we will discuss security and privacy challenges in the journalism and policy domain. In particular, we will address the following question: in this post-Snowden era of Big Data and potential mass surveillance, how can cryptography and cryptographers potentially help to ensure our civil liberties, from privacy of the individual to free speech and freedom of the press?

