DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness

October 23 - 24, 2017
DIMACS Center, CoRE Building, Rutgers University

Organizing Committee:
John Abowd, Cornell University
René Bastón, Columbia University
Tal Rabin, IBM
Adam Smith, Boston University
Salil Vadhan, Harvard University
Rebecca Wright, Rutgers University
Workshop email: Barriers_workshop at email.rutgers.edu
Presented under the auspices of the DIMACS Big Data Initiative on Privacy and Security and in collaboration with the Northeast Big Data Innovation Hub.


Jeremiah Blocki, Purdue University

Title: Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords

This talk will describe the design and implementation of a differentially private mechanism for releasing perturbed integer partitions. This mechanism was used to release a password frequency corpus of 70 Million Yahoo! passwords. Given a dataset of user-chosen passwords, the frequency list reveals the frequency of each unique password. The talk will especially focus on some of the design decisions we encountered when deploying our differentially private mechanism.

Our mechanism is based on a novel algorithm for sampling that enables an efficient implementation of the exponential mechanism for differential privacy (naive sampling is exponential time). While we proved that our mechanism introduces L1 error at most $n^{1/2}/\epsilon$, the best lower bounds on L1 error are $n^{1/2}/\epsilon^{1/2}$. We will present strong empirical evidence for the conjecture that the L1 error actually scales with $n^{1/2}/\epsilon^{1/2}$. These empirical results suggest that the algorithm might be usefully applied to reduce the L1 error of differentially private algorithms for releasing degree distributions in a social network with node-privacy.

Henry Corrigan-Gibbs, Stanford University

Title: Private Collection of Aggregate Statistics at Scale

This talk will present Prio, a privacy-preserving system for the collection of aggregate statistics. Each Prio client holds a private data value (e.g., its current location), and a small set of servers compute statistical functions over the values of all clients (e.g., the most popular location). As long as at least one server is honest, the Prio servers learn nearly nothing about the clients' private data, except what they can infer from the aggregate statistics that the system computes. To protect the system's robustness in the face of malicious clients, Prio uses a new lightweight cryptographic tool we call a secret-shared non-interactive proof (SNIP).

Towards the end of the talk, I will discuss practical barriers to deploying private aggregation systems, such as Prio, based on our discussions with potential industry partners.

This talk is based on joint work with Dan Boneh.

Michael Hay, Colgate University

Title: Facilitating the Design and Evaluation of Differentially Private Algorithms

The existence of real-world deployments of differential privacy demonstrates that this technology can be used in practice, but the relatively small number of deployments suggests that for a given task, it is non-trivial to design an algorithm that achieves the competing goals of accuracy, efficiency, and privacy.

In this talk, I will explore some of the challenges of algorithm design and deployment and our attempts to demystify this process. I begin at the end with empirical evaluation, where I will describe a set of evaluation principles for assessing the accuracy of differentially private algorithms and the results of a benchmark study. The findings of this study motivate the need for algorithms and algorithm design processes that take deployment context into account.

I will describe our work on algorithm selection: choosing the best algorithm "at runtime" for a given task. I will conclude with a look at the process of algorithm design itself and describe ongoing work on developing easy-to-use software engineering tools that allow privacy non-experts to write accurate and secure differentially private algorithms.

Brett Hemenway, University of Pennsylvania

Title: Secure Multiparty Computation for Scientific Research

This talk will describe recent work on the design and implementation of practical secure Multiparty Computation (MPC) protocols. We will focus on a recent initiative funded by the Laura and John Arnold Foundation to develop an open-source MPC framework that would allow researchers to compute statistics across multiple, private data sets. This is a joint-effort between cryptographers, statisticians, social-scientists, policy-makers and data-owners, and this talk will outline some of the design decisions and use cases that have arisen from this collaboration.

James Honaker, Harvard University

Title: Differential Privacy in the Scientific Data Repository

We describe a roll for differential privacy in open data repositories handling sensitive data. Archival repositories in the human sciences balance discoverability and replicability with their legal liabilities and ethical constraints to protect sensitive information. The ability to explore differentially private releases of archived data allows a curve-bending change in this trade-off.

We further describe PSI, an implementation of a curator system for differentially private queries and statistical models, and its integration with the Dataverse repository. We describe some of the pragmatics of implementing a general purpose curator that works across a wide variety of types of data and types of uses, and of presenting differential privacy to an applied audience new to these concepts.

Seny Kamara, Brown University

Title: How to Query Encrypted Data with Security against Persistent and Snapshot Adversaries

We revisit the problem of designing secure and efficient dynamic searchable symmetric encryption (SSE) schemes in several respects. First, we consider the more general setting of structured encryption (STE). In particular, we focus on the design of dynamic encrypted multi-maps which give single-keyword SSE schemes and can be used as building blocks in boolean SSE and graph encryption schemes. Second, motivated by the problem of data breaches, we formalize a notion of security for dynamic STE schemes that guarantees security against a snapshot adversary; that is, an adversary that receives a copy of the encrypted structure at various times. Moreover, we initiate the study of dual-secure STE constructions; constructions that are secure against both a persistent and snapshot adversary while also being forward private.

As a concrete instantiation, we propose a new dual-secure dynamic multi-map encryption scheme. It outperforms all (non-dual-secure) existing constructions and has query complexity that grows with the selectivity of the query and the number of deletes since the client executed a (linear-time) rebuild protocol (which we show can be de-amortized).

We implemented our scheme and evaluated its concrete efficiency empirically. Our experiment shows that our construction is very efficient with queries taking less than 1 microsecond per query/label pair (keyword/id pair in the setting of SSE).

Ben Kreuter, Google

Title: Secure Multiparty Computation at Google

For over thirty years, cryptographers have studied Secure Multiparty Computation, but only in the past few years have real applications been deployed. Google is one of the early adopters of MPC and has been using MPC to support business and consumer applications while protecting user data. This talk will describe specific applications of MPC at Google, and how Google's engineers have solved some of the practical challenges of deploying MPC in large-scale applications.

This talk will also discuss the important role of theoretical results and proof techniques in the design process of those MPC protocols.

Philip Leclerc, Stephen Clark, William Sexton, Center for Disclosure Avoidance Research, U.S. Census Bureau

Title: 2020 Decennial Census: Formal Privacy Implementation Update

Work continues at the U.S. Census Bureau regarding the development of algorithms and systems to provide data releases with formal privacy guarantees for the 2020 Census of Population and Housing. In this talk we will present the differential private algorithms under development, the impact of identified structural zeroes on these algorithms, and the design of the Disclosure Avoidance Sub-system (DAS). We will also present technical challenges and stakeholder requirements that we have encountered.

Ilya Mironov, Google

Title: Rényi Differential Privacy

We discuss a natural relaxation of differential privacy based on the Renyi divergence. Closely related notions have appeared in several recent papers that analyzed composition of differentially private mechanisms. We argue that the useful analytical tool can be used as a privacy definition, compactly and accurately representing guarantees on the tails of the privacy loss distribution, and discuss its recent applications in privacy-preserving machine learning.

Joseph Near, UC Berkeley

Title: Towards Practical Differential Privacy for SQL Queries

Differential privacy promises to enable general data analytics while protecting individual privacy. However, applications of differential privacy in industry (e.g. at Google and Apple) have so far focused on collecting specific and limited kinds of information from users.

In this talk, I will discuss the development of a differentially private interface for general-purpose database queries. In collaboration with Uber, we developed a set of requirements for deploying such an interface in a practical setting--among them, support for joins and the ability to work with an existing database. I will describe our approach, elastic sensitivity, for satisfying these requirements, and detail our experience implementing the approach in Uber's setting.

Aleksandra Slavkovic, Penn State University

Title: Structure and Sensitivity in Differential privacy: Optimal K-norm mechanisms

Differential Privacy (DP) provides an elegant mathematical framework for defining a provable disclosure risk in the presence of arbitrary adversaries. However, numerous technical and practical subtleties exist that limit its usability in statistical applications. We introduce the concept of the adjacent output space, where the structure of this space is directly connected to the sensitivity analysis. We provide extensions to the previously proposed K-norm mechanisms and show that the optimal K-Norm mechanism for any problem is determined solely by the adjacent output space. We focus on L1, L2, and L-infinity in particular as well as other K-norms. We implement these mechanisms on linear and logistics regressions, and demonstrate the improvements on data utility. We show that the choice of norm can result in a significant reduction of noise. By choosing one mechanism over another, the same statistical utility can be achieved using half the original privacy budget. These improvements result in higher data usability, more accurate results, and consequently better inference under DP.

(Jordan Awan and Aleksandra Slavkovic)

Previous: Program
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on October 18, 2017.