### BSF/DIMACS/DyDAn Workshop on Data Privacy

#### February 4 - 7, 2008 DIMACS/DyDAn Center, CoRE Building, Rutgers University

Organizers:
Kobbi Nissim, Ben Gurion University, kobbi at cs.bgu.ac.il
Benny Pinkas, University of Haifa, benny at cs.haifa.ac.il
Rebecca Wright, Rutgers University, rebecca.wright at rutgers.edu
Presented under the auspices of the DIMACS Special Focus on Communication Security and Information Privacy and the Center for Dynamic Data Analysis (DyDAn).

This workshop is jointly sponsored by:

• United States-Israel Binational Science Foundation
• DIMACS
• US Department of Homeland Security

#### Abstracts:

John M Abowd, Cornell University

Tutorial: Synthetic Data

Synthetic data is a statistical disclosure limitation (SDL) method that either fully or partially replaces the underlying confidential data with data that have been sampled from an appropriate joint distribution. In the computer science data privacy literature, synthetic data is a random sanitizer that takes the underlying data into the release data by means of an appropriate, random, transition matrix. The tutorial discusses the origins of synthetic data in the SDL literature with examples from real projects that have been conducted by statistical agencies of the U.S. government. The formal relation between synthetic data and certain differential privacy measures is also discussed.

John M Abowd, Cornell University

Title: Privacy: Theory Meets Practice On The Map

In this paper, we propose the first formal privacy analysis of a data anonymization process known as the synthetic data generation, a technique becoming popular in the statistics community. The target application for this work is a mapping program that shows the commuting patterns of the population of the United States. The source data for this application were collected by the U.S. Census Bureau, but due to privacy constraints, they cannot be used directly by the mapping program. Instead, we generate synthetic data that statistically mimic the original data while providing privacy guarantees. We use these synthetic data as a surrogate for the original data. We find that while some existing definitions of privacy are inapplicable to our target application, others are too conservative and render the synthetic data useless since they guard against privacy breaches that are very unlikely. Moreover, the data in our target application is sparse, and none of the existing solutions are tailored to anonymize sparse data. In this paper, we propose solutions to address the above issues.

Amos Beimel, Ben-Gurion University

Title: How Should We Solve Search Problems Privately

Secure multiparty computation allows a group of distrusting parties to jointly compute a (possibly randomized) function of their inputs. However, it is often the case that the parties executing a computation try to solve a search problem, where one input may have a multitude of correct answers ? such as when the parties compute a shortest path in a graph or find a solution to a set of linear equations. The choice of the algorithm that picks one output from the solution set has significant implications on the privacy of the computation. In this talk, I'll first discuss a minimal definition for private computation of search problems and describe impossibility results of private approximation of search problems. Then,  I'll discuss stronger definitions of privacy for search problems that provide reasonable privacy and describe algorithmic machinery for designing such protocols for a broad selection of search problems.

This talk is based on joint works with Paz Carmi, Renen Hallak, Tal Malkin, Kobbi Nissim, and Enav Weinreb

Avrim Blum, Katrina Ligett, Aaron Roth, Carnegie Mellon University

Title: A Learning Theory Perspective on Data Privacy: New Hope for Releasing Useful Databases Privately

We demonstrate that, ignoring computational constraints, it is possible to release privacy-preserving databases that are useful for all queries over a discretized domain from any given concept class with polynomial VC-dimension. We show a new lower bound for releasing databases that are useful for halfspace queries over a continuous domain. Despite this, we give a privacy-preserving polynomial-time algorithm that releases information useful for all halfspace queries, for a slightly relaxed definition of usefulness.

Melissa Chase

Title: Delegatable Anonymous Credentials

Anonymous credential systems allow users to obtain credentials from organizations, and to anonymously and unlinkably prove possesion of valid credentials. Often in practice, however, a user authenticates using some credential chain: a root organization gives a credential to an intermediate party who can in turn use this to issue credentials to user. We address this problem by presenting the first efficient delegatable anonymous credential scheme. In such a scheme, a party can receive credentials from an organization, and then anonymously and unlinkably issue delegated credentials to other users, who can in turn delegate these credentials to another level. Then a user can prove possession of a valid chain of credentials of a given length without revealing any other identifying information. This talk will describe our result and outline some of the main techniques used to achieve it.

This is joint work with Mira Belenkiy, Jan Camenisch, Markulf Kohlweiss, Anna Lysyanskaya, and Hovav Shacham

Yevgeniy Dodis, NYU and Harvard University

Title: Deniable Authentication on the Internet

We revisit the question of deniable cryptographic primitives, where, intuitively, a malicious party observing the interaction, cannot later prove to a third party that the interaction took place. Example include deniable message authentication, key exchange and identification. While these question was heavily studied in the literature, we argue that none of the existing definitions (and solutions) to this question is satisfactory.

We propose the strongest and, arguably, the most natural and intuitive definition of deniability for these tasks: the proposed protocol should be secure in the recently proposed Generalized Universal Composability (GUC) Framework of Canetti, Dodis, Pass and Walfish. Among other things, our definition guarantees on-line deniability and concurrent composition. Quite remarkably, our main result shows that none of the above mentioned tasks (identification, key exchange, authentication) is realizable against adaptive attackers, even in the REGISTERED PUBLIC KEY MODEL (registered PKI), and even if data erasures are allowed. Registered PKI is the strongest setup assumption which is assumed to be "reasonable". Thus, our result explains why all the previous attempts to solve deniability felt short of achieving the strongest form of deniability.

We also show that various slight relaxation of our notion ARE achievable, and discuss implications of our results to the general task of designing deniable protocols.

Arik Friedman, Technion, Israel

Title: k-Anonymous Data Mining

In this talk we explore an approach to privacy preserving data mining that relies on the k-anonymity model. The k-anonymity model guarantees that no private information in a table can be linked to a group of less than k individuals. We suggest modified definitions of k-anonymity that guarantee that no private information in a data mining model can be linked to a group of less than k individuals from the private database. Using these definitions, we present data mining algorithms that are guaranteed to maintain k-anonymity of the learning examples. Experiments show that embedding anonymization within the process of Data mining provides better accuracy than anonymizing the data first and mining the data later.

Carmit Hazay, Bar-Ilan University

Title: Efficient Protocols for Set Intersection and Pattern Matching with Security Against Malicious and Covert Adversaries

In this talk we present efficient secure protocols for set intersection and pattern matching. Our protocols for securely computing the set intersection functionality are based on secure pseudorandom function evaluations, in contrast to previous protocols that used secure polynomial evaluation. In addition to the above, we also use secure pseudorandom function evaluation in order to achieve secure pattern matching. In this case, we utilize specific properties of the Naor-Reingold pseudorandom function in order to achieve high efficiency.

Our results are presented in two adversary models. Our protocol for secure pattern matching and one of our protocols for set intersection achieve security against malicious adversaries under a relaxed definition where one corruption case is simulatable and for the other only privacy (formalized through indistinguishability) is guaranteed. We also present a protocol for set intersection that is fully simulatable in the model of covert adversaries. Loosely speaking, this means that a malicious adversary can cheat, but will then be caught with good probability.

Shiva Kasiviswanathan

Title: What Can We Learn Privately?

Learning problems form an important category of computational tasks that generalizes many of the computations researchers apply to large real-life datasets. We ask (and answer): what concept classes can be learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific training example? Our goal is a complexity-theoretic classification of learning problems efficiently solvable with the privacy restriction.

In this talk, we present four private learning models: local interactive (LI), learning non-interactive (LNI), centralized interactive (CI), and centralized non-interactive (CNI). We give a characterization of these models with respect to standard learning models, PAC and SQ. We show that for learning problems LNI is a strict subset of LI, and LI is a strict subset of CNI=CI. This characterization takes into account the number of samples required for learning, but not computational efficiency. We also present a partial characterization of these models when efficiency is taken into account.

Joint work with Homin Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith.

Aggelos Kiayias, University of Connecticut

Title: How to Collect a Function Securely?

Consider a set of private values x1, ..., xn dispersed over a set of computing units. An analyst wishes to evaluate f(x1,...,xn) for some possibly private function f. The analyst has no other access to the units other than through the dispatching of agents that can perform a pass over the units. We call the above scenario "mobile data collection & processing" and we are interested in privacy preserving solutions where the relevant measures of efficiency are

1. the number of agents the analyst needs to dispatch.
2. the memory required by an agent.
3. the computation required by the units.
In this talk we present a number of different protocols for general functions as well as for specific functions relevant to private statistical calculations.
Yehuda Lindell, Bar Ilan University

Tutorial: Secure Multiparty Computation and Privacy-Preserving Data Mining

In this tutorial, we survey the basic paradigms and notions of secure multiparty computation and discuss their relevance to the field of privacy-preserving data mining. A significant portion of the tutorial will focus on how security is defined in this field, why it is defined in this way, and what its limitations are. In addition to the definitional aspects of secure multiparty computation, we discuss the issue of efficiency and demonstrate the difficulties involved in constructing highly efficient protocols. This leads to what we believe to be one of the biggest challenges of this field; the development of new definitions that provide a sufficient security guarantee and for which more efficient protocols can be constructed. Finally, we discuss the relationship between secure multiparty computation and privacy-preserving data mining, and show which problems it solves and which problems it does not.

Yehuda Lindell, Bar Ilan University

Title: Constructions of Truly Practical Secure Protocols using Standard Smartcards

In this talk we show that using standard smartcards it is possible to construct truly practical secure protocols for a variety of tasks. Our protocols achieve full simulation-based security in the presence of malicious adversaries, and can be run on very large inputs. We present protocols for secure set intersection, oblivious database search and more. Our protocols are simple and achieve almost optimal efficiency. We have also implemented our set intersection protocol in order to demonstrate that it truly is practical. For example, secure set intersection on sets of size 30,000 elements takes 20 seconds for one party and 30 minutes for the other. This demonstrates that in settings where physical smartcards can be sent between parties (as in the case of private data mining tasks between security and governmental agencies), it is possible to achieve the highest level of security: proven simulation-based security in the presence of malicious adversaries.

This talk is based on joint work with Carmit Hazay.

Paul B. Massell, Statistical Research Division, U.S. Census Bureau

Title: Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata

When companies respond to an economic survey conducted by the U.S. Census Bureau, the companies assume their data will be protected in some way and will be used only to produce statistical data products. In such products, data from 2 or more companies is combined in a way that does not allow any table user to derive an accurate estimate of any data value contributed by any company. The variables protected are magnitude ones, such as receipts or payroll. Thus, in this context, "protecting company contributed data" means ensuring a level of fuzziness by table users (including users from contributing companies). Providing this "fuzziness constraint" is satisfied, the Census Bureau does wish to produce tables that are as complete and accurate as possible. Since there are a large number of large tables produced in certain economic programs, developing methods that can protect related tables in a consistent way is important Easy implementation is also important; specifically, the Census Bureau would like to use fast algorithms that are easy to code and modify.

We will discuss briefly the method of cell suppression, a method that has been used for many years for protecting such tables. The main focus of our talk however, is an alternative method that was developed several years ago by researchers at the Census Bureau that meets the goals described above while allowing for the release of more data than is possible under cell suppression. This new method uses a carefully calibrated noise distribution to generate noise which is added to the microdata values of a magnitude variable requiring protection. These noisy microdata values are then tabulated to form the cell values for all the tables in a statistical program that describe that variable (e.g., receipts). This method is conceptually simple and easy to implement; in particular, it is much simpler algorithmically than cell suppression. The main concerns are whether noise protected tables are fully protected and whether the noisy cell values are as or more useful to users than the combination of exact and suppressed values provided by cell suppression. The seminal paper by Evans-Zayatz-Slanta (J. Official Statistics, 1998) showed that this was clearly true for the survey analyzed in that paper. The work presented in this talk provides analysis for additional surveys with different features than the survey described in the earlier paper. We present general protection arguments that involve ways of relating the uncertainty provided by noisy values to the required amount of protection. We present graphs which show the different distributions of net noise on the set of sensitive cells versus that for the non-sensitive cells. We also discuss some ways to fine-tune the algorithm to a particular table, taking advantage of its special characteristics. We call this new variation 'balanced EZS noise'. Our conclusion is that when EZS noise is appropriately applied, it fully protects tables while usually releasing more useful data than cell suppression. The possible application of EZS noise to a variety of statistical programs within the economic directorate is currently being researched.

This talk is an expanded version of an invited talk presented at the Third International Conference on Establishment Surveys (ICES 2007 in Montreal) session called "Advances in Disclosure Protection: Releasing More Business and Farm Data to the Public". That paper was co-authored with Jeremy Funk. Paul and Jeremy are members of the Disclosure Avoidance Research Group in the Statistical Research Division of the U.S. Census Bureau. Laura Zayatz is the head of that group and provided guidance on this project.

Frank McSherry

Title: Privacy Integrated Queries: Programming Privately using LINQ

We report on a simple data access API that manages differential privacy, built around the SQL-like Language Integrated Query (LINQ) API in C#. The programmer manipulates "protected" data sources much like they would using LINQ or SQL: transformations, restrictions, groupings, joins, etc; and is permitted access to aggregate quantities: counts, averages, medians, only through careful methods that ensure the appropriate level of differential privacy for the source data sets from which the current data are constructed. We will review several example programs that use PINQ in novel ways, resulting in programs that could not themselves be easily analyzed for privacy properties, but for which privacy is structurally guaranteed through PINQ. Time permitting, we may discuss several of the implementation challenges involving securing the execution against numerous privacy attack vectors.

Frank McSherry

Title: Mechanism Design via Differential Privacy

In this talk we explore the connection between differential privacy and mechanism design, the latter interested in the design of algorithms that encourage honest participation of players. Casting privacy as one of many strategic concerns, we see that the guarantees of differential privacy extend more broadly to other forms of strategic play, and the robust privacy guarantees have game-theoretic implications for issues such as truthfulness, coalition-resilience, and repeatability of games. As examples, we present some recent results from the theory of auctions incorporating these results, giving differential privacy guarantees, but also with an increase expected revenue resulting as well

Tal Moran, The Weizmann Institute of Science.

Title: Everlasting Privacy in Cryptographic Voting Protocols

In recent years, there has been increasing interest in "universally-verifiable" voting protocols: protocols that provide every voter with the means to verify that all the votes were counted correctly (even if the computers running the election are not trusted). The protocols must take into account the fact that voters are humans (rather than computers), and so cannot be expected to perform complex operations (such as raising integers to large powers).

One of the biggest challenges in designing such protocols is achieving verifiability without sacrificing ballot secrecy. Because vote-buying and coercion is a very real threat for voting systems, ballot secrecy must be maintained even if the voter colludes with an attacker to reveal the vote. Most of the proposed cryptographic voting schemes assure privacy in a computational sense (e.g., by using encrypted ballots). This means that ballot privacy may be breached some time in the future (e.g., if the encryption algorithm is broken).

I will present a universally-verifiable voting protocol that has "everlasting privacy" for voters: the published data used to verify the tally contains no information at all about the voters' choices; thus, once the election is over, even a computationally unbounded adversary cannot reveal the ballots.

Shubha U. Nabar, Stanford University and Nina Mishra, University of Virginia

Title: Cell Suppressions Leak Information

Statistical agencies routinely publish aggregate data in the form of a contingency table. To preserve the privacy of respondents, cells in these tables are sometimes suppressed. Our work uncovers fundamental problems with existing cell suppression algorithms, namely that the act of suppressing a cell can itself leak information. We illustrate via a simple example how privacy can be breached by simply knowing the suppression algorithm. We propose a solution, a simulatable cell suppressor, that makes suppression decisions irrespective of information not available to an attacker --- so that suppressions provably do not leak information. A critical technical subroutine needed for a simulatable cell suppressor is sampling a data set according to an underlying probability distribution that is consistent with a partially filled contingency table. We give algorithms to solve this technical subroutine for the special case of boolean private attributes based on dynamic programming and also sampling perfect matchings. In addition, we articulate differences between the prior distribution that generates the data, the attacker's prior and the suppressor's prior. For the first time, we study how these differences affect privacy. Finally, to address utility, we illustrate the kinds of tables our algorithm will publish unsuppressed.

Moni Naor, Weizmann Institute of Science

Title: On the Difficulties of Achieving Disclosure Prevention

In 1977 the Swedish statistician Tore Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a formalization of Dalenius' goal along the lines of semantic security cannot be achieved.

The key issue is the side information an adversary may have regarding the database. We essentially show that under very general conditions regarding the database, the notion of privacy violation and the amount of noise tolerable it is possible to construct such auxiliary information that will be useless without the database, yet if there is access to an even a noisy version of the database, then one can obtain sensitive information.

This state of affairs suggest exploring new measures of privacy, in particular differential privacy, which, intuitively, captures the increased risk to one's privacy incurred by participating in a database.

Joint work with Cynthia Dwork

Yuval Nardi, Carnegie Mellon University

Title: Secure Logistic Regression

We address the problem of performing a logistic regression in a secure way without directly sharing data among parties. We suppose that data are collected separately and intimately by several separate parties (agencies). These parties wish to analyze the pooled (combined) data without actually combining the parts that they possess, i.e., they want to fit a model and make inferences using the pooled data in a way that no one party's data are disclosed to another party. When the parties have exactly the same variables, but for different records, we say that the data have been horizontally partitioned. If all parties have the same records, but different parties have different predictor variables, we say that the data have been vertically partitioned. In the general case, involving vertically partitioned, partially overlapping data, some parties don't share the same variables, and some of the records are jointly shared by several parties. In this paper we build on earlier results by Fienberg, Fulp, Slavkovic and Wrobel (2006) on the horizontally partitioned case and describe methods for both the vertically partitioned and the general cases.

Joint with Stephen Fienberg, Carnegie Mellon University and Aleksandra Slavkovic, Pennsylvania state University.

Eran Omri, Ben Gurion University

Title: Distributed Private Data Analysis: Simultaneously Solving How and What

In this work we combine two directions in the field of privacy. While in both the goal is to privately evaluate some function $f$ on $n$ inputs, the two approaches differ in their privacy goals and requirements. The first direction is secure function evaluation (SFE), where $n$ parties, sharing a common interest in distributively and securely computing $f$ applied to their private inputs, are given a protocol to do so. An SFE protocol for evaluating $f$ is private if no subset of the parties may deduce information, other than what can be learned from the result of $f$ and the initial inputs of the members of the subset. The second direction is private data analysis where computation of $f$ is considered to be private if the privacy of each record is preserved individually. One possible way to obtain useful information on a collection of individual information, while protecting individual data, is by adding carefully chosen noise'', i.e., by evaluating a random approximation $\widehat{f}$ of $f$. This choice of $\widehat{f}$ is usually done while abstracting away implementation issues, assuming that the computation is performed by a trusted server holding the data.

A natural paradigm for implementing the computation without a trusted party is by constructing an SFE protocol for $\widehat{f}$. This conceptual separation, between the decision on which $\widehat{f}$ to compute and how to compute it, is valuable. However, this approach may result in non-optimal protocols, as SFE protocols need to comply with strong privacy requirements. For instance, publishing outputs of intermediate computations, as long as they do not compromise individual privacy, may be useful. However, publishing these outputs may be impossible under SFE definitions (e.g., if for some party these outputs cannot be simulated from the input of the party and the output of the function).

We initiate an examination whether there are advantages to choosing $\widehat{f}$ and the protocol for computing it simultaneously. In particular, we investigate the case of sum queries and specify under which accuracy requirements, it is beneficial to adapt this paradigm.

Joint work with Amos Beimel and Kobbi Nissim.

Title: Smooth Sensitivity and Sampling

In the output perturbation model, to preserve privacy the value of a function f evaluated at the database is released with a small amount of additive noise. We introduce the Smooth Sensitivity framework for determining the amount of noise necessary to satisfy differential privacy. This framework generalizes previous approaches by allowing the noise magnitude to depend not only on the function we want to release, but also on the database itself. One of the challenges we address is to guarantee that the noise magnitude itself should not leak information about the database. For that we calibrate the noise magnitude to the *smooth sensitivity* of f on the database x - a measure of how variable f is in the neighborhood of the instance x.

To apply the framework, one must compute or approximate the smooth sensitivity of f on x. We show how to do this efficiently for several functions including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on many databases x even when no efficient algorithm for approximating smooth sensitivity of f is known or when f is given as a black box.

Based on "Smooth Sensitivity and Sampling in Private Data Analysis", joint work with Kobbi Nissim and Adam Smith

Vibhor Rastogi, University of Washington

Title: The Boundary between Privacy and Utility in Data Publishing

In this talk, I will discuss the privacy problem in data publishing: given a database instance containing sensitive information "anonymize" it to obtain a view such that, on one hand attackers cannot learn any sensitive information from the view, and on the other hand legitimate users can use it to compute useful statistics. These are conflicting goals.

I will present results that provide an almost crisp separation of the case when a useful anonymization algorithm is possible from when it is not, based on the attacker's prior knowledge. Our definition of privacy is derived from existing literature and relates the attacker's prior belief for a given tuple t, with the posterior belief for the same tuple. Our definition of utility is based on the error bound on the estimates of counting queries.

The main result has two parts. First we show that if the prior beliefs for some tuples are large then there exists no useful anonymization algorithm. Second, we show that when the prior is bounded for all tuples then there exists an anonymization algorithm that is both private and useful. The anonymization algorithm that forms our positive result is novel, and improves the privacy/utility tradeoff of previously known algorithms with privacy/utility guarantees such as FRAPP

Rivka Ribak, University of Haifa

Title: The meanings of privacy: On the cultural inflections of surveillance discourse

North American analyses of privacy construct it as a human right that is inevitably violated by advanced communication technologies. Growing exposure is expected to become the fate of more and more people as globalization expands. However, studies of surveillance practices elsewhere suggest alternative interpretations of information disclosure, and developments in globalization theories highlight complicated relationships between the global, the local, and their mediators. This talk focuses on the latter - the agents of globalization - asking how Israeli journalists introduced the new medium to the locals and, more specifically, how they mediated North American concerns over web privacy to their imagined Israeli readers. The analysis identifies three arenas in journalists' discussion about web privacy: The US, Israel, and the global community of surfers, noting both the predominance of the US in this discourse, and the absence of local structural accounts. The discussion proposes that by Americanizing the concern over privacy violation and by constructing it as paranoia of the technologically inept, the text neglects to contextualize privacy violation, and privatizes the struggle against it.

Rathindra Sarathy, Oklahoma State University and Krish Muralidhar, University of Kentucky

Title: A Hybrid Perturbation/Swapping Approach for Masking Numerical Data

Many authors have suggested the use of hybrid approaches for masking numerical data. In this paper, we describe a hybrid masking approach that is a hybrid of perturbation and swapping. We provide a theoretical basis for the new procedure. We compare the performance of the new procedure to both perturbation and swapping. The results indicate that the new hybrid procedure has advantages that are not available with either of the two individual approaches.

Gil Segev, Weizmann Institute of Science

Title: Securing Vote Storage Mechanisms: Deterministic History-Independent Strategies for Storing Information on Write-Once Memories

Motivated by the challenging task of designing "secure" vote storage mechanisms, we deal with information storage mechanisms that operate in extremely hostile environments. In such environments, the majority of existing techniques for information storage and for security are susceptible to powerful adversarial attacks. In this setting, we propose a mechanism for storing a set of at most $K$ elements from a large universe of size $N$ on write-once memories in a manner that does not reveal the insertion order of the elements. We consider a standard model for write-once memories, in which the memory is initialized to the all $0$'s state, and the only operation allowed is flipping bits from $0$ to $1$. Whereas previously known constructions were either inefficient (required $\Theta(K^2)$ memory), randomized, or employed cryptographic techniques which are unlikely to be available in hostile environments, we eliminate each of these undesirable properties. The total amount of memory used by the mechanism is linear in the number of stored elements and poly-logarithmic in the size of the universe of elements.

In addition, we consider one of the classical distributed computing problems: conflict resolution in multiple-access channels. By establishing a tight connection with the basic building block of our mechanism, we construct the first deterministic and non-adaptive conflict resolution algorithm whose running time is optimal up to poly-logarithmic factors.

Joint work with Tal Moran and Moni Naor.

Vitaly Shmatikov, The University of Texas at Austin

Title: Robust De-anonymization of Multi-dimensional Datasets

We present a set of de-anonymization techniques for highly multi-dimensional datasets containing anonymized records of purchases, transactions, preferences, and similar individual data. We show that an attacker who has only a small amount of auxiliary information about an individual can identify this individual's record if it appears in the anonymized dataset. Our methods are robust to perturbation of attribute values in the published records, tolerate errors and fuzziness in the attacker's auxiliary information, and work even in the situations where only a subset of the original dataset has been published.

We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an attacker who knows (possibly imprecisely) only a little bit about an individual subscriber can easily identify this subscriber's record if it is present in the released dataset, or, at the very least, identify a small set of records which include the subscriber's record.

Joint work with Arvind Narayanan.

Aleksandra Slavkovic, Penn State University

Tutorial: Statistical Disclosure Limitation Methods

Statistical disclosure limitation (SDL) applies statistical tools to the problem of limiting releases of sensitive information about individuals and groups that are part of statistical databases while allowing for proper statistical inference. The SDL methods tackle the fundamental tradeoff between data utility and data risk, in both microdata and tabular data releases. In this tutorial, we give an overview of a number of SDL methods, focusing on the statististicans' perspective and definitions of utility and risk given various contexts. A special emphasis will be given to tabular data releases.

Title: Differential Privacy (a tutorial)

The notion of "differential privacy" emerged from a recent line of work in the theoretical computer science community that sought to formulate and achieve rigorous definitons of privacy in statistical databases. Differential privacy states, roughly, that an adversary learns the same things about an individual regardless of whether that individual's data is contained in the database. This definition has several important properties: it makes assumptions neither about what kind of attack might be perpetrated based on publicly released information, nor about what additional, external information the attacker might possess.

I will define differential privacy and related notions of indistinguishability. I will then explain a simple framework for designing "output perturbation" protocols that satisfy differential privacy. In output perturbation, a function f (say, a particular data mining procedure or statistical summary) is computed on the database, and its value is released with a small amount of additive noise.

Finally, I will briefly survey some of the directions that research on differential privacy has taken.

Kunal Talwar

Tutorial: The Exponential Mechanism

Noise addition approaches to differential privacy focus on real valued functions whose values are relatively insensitive to the change in the data of a single individual and whose usefulness is relatively unaffected by additive perturbations. In many natural settings such as pricing and learning, these assumptions do not hold. This talk will describe the exponential mechanism that generalizes the previous approaches to such domains.

Ying Xu, Stanford University

Title: Efficient Algorithms for Masking and Finding Quasi-Identifiers

A quasi-identifier refers to a subset of attributes that can uniquely identify most tuples in a table. Incautious publication of quasi-identifiers will lead to privacy leakage. In this paper we consider the problems of finding and masking quasi-identifiers. Both problems are provably hard with severe time and space requirements, so we focus on designing efficient approximation algorithms for large data sets.

We first propose two natural measures for quantifying quasi-identifiers: distinct ratio and separation ratio. We develop efficient algorithms that find small quasi-identifiers with provable size and separation/distinct ratio guarantees, with space and time requirements sublinear in the number of tuples. We also propose efficient algorithms for masking quasi-identifiers, where we use a random sampling technique to greatly reduce the space and time requirements, without much sacrifice in the quality of the results. Our algorithms for masking and finding quasi-identifiers naturally apply to stream databases. Extensive experimental results on real world data sets confirm efficiency and accuracy of our algorithms.

Joint work with Rajeev Motwani

Danfeng (Daphne) Yao, Rutgers University

Title: Efficient Signature Schemes Supporting Sedaction, Pseudonymization, and Data Deidentification

We give a new signature algorithm that allows for controlled changes to the signed data. The change operations we study are removal of subdocuments (redaction), pseudonymization, and gradual deidentification of hierarchically structured data. These operations are applicable in a number of practically relevant application scenarios, including the release of previously classified government documents, privacy-aware management of audit-log data, and the release of tables of health records. When applied directly to redaction, our algorithm significantly reduces the overhead of cryptographic information that has to be stored with the original data.

To appear in the Proceedings of AsiaCCS 2008. Joint work with Stuart Haber, Yasuo Hatano, Yoshinori Honda, William Horne, Kunihiko Miyazaki, Tomas Sander, and Satoru Tezuka.

Sergey Yekhanin, Institute for Advanced Study

Title: New efficient attacks on statistical disclosure control mechanisms

The goal of a statistical database is to provide statistics about a population while simultaneously protecting the privacy of the individual records in the database. The tradeoff between privacy and usability of statistical databases has attracted a lot of attention in statistics, theoretical computer science, security, and database communities in the recent years.

Following Dinur and Nissim we model a database by an n-bit string x_1,...,x_n with a query being a subset q\in [n] to be answered by a sum_{i\in q} x_i. A database statistical disclosure control mechanism (curator) is an algorithm that sits between the database and the user. The user interacts with the curator, which may modify the actual query outcome by adding noise in order to protect privacy.

In this work we present several novel attacks on database curators highlighting the limits of usability of privacy preserving statistical disclosure control mechanisms. Our attacks are more efficient and apply under broader assumptions than those from the earlier works. Specifically we show that an adversary asking just n queries against an n-sized database and running in O(n log n) time, can achieve a blatant privacy violation provided that every curator's response carries o(sqrt(n)) noise. The best previously known attack (by Dinur and Nissim) required O(n log^2 n) queries and O(n^5 log^4 n) running time. We also treat the setup initially considered by Dwork et al. where the curator is permitted to add arbitrary (unbounded) noise on a certain fraction of of queries, and low noise on the rest. We present ultra-efficient attacks running in poly(1/epsilon) time that can tolerate the optimal (1/2 - epsilon) fraction of unbounded noise provided the noise on the rest of the queries is sufficiently low. (Joint work with Cynthia Dwork.)

William Yurcik, University of Texas at Dallas (UT-D)

Title: Toward Safe Sharing of Network Data with Anonymization Tools: Multi-Field, Multi-Level Protection and Privacy/Utility Tradeoffs"

Since network attacks typically cross network boundaries and attackers frequently change targets to different security domains, effective protection requires defenders to look beyond their own organizational perimeter toward cooperation and data sharing with other organizations in order to defeat attackers. However, to date little or no data sharing has occurred between organizations with the consequences being lagging defense capabilities and ad hoc protection. Unfortunately, the same cannot be said about attackers who are quite efficient at sharing vulnerability and exploit information amongst themselves!

The practical concerns which have prevented organizations from sharing network data include a valid fear that private and/or security information in shared data may be misused to cause harm. Examples of private/security information in network data that we may not want to reveal include personally-identifiable user information, user activities, system configurations, network topologies, network services, organizational defenses, and attack impacts ? all valuable information to attackers. In order to facilitate safe data sharing between organizations, we are developing a basic infrastructure of privacy-preserving tools based on anonymization (SCRUB*) that are compatible with existing techniques - no changes necessary. Private and/or security information within shared data can be obscured to provide practical levels of assurance that shared data cannot be used to cause harm. We incorporate state-of-the-art network data anonymization algorithms developed by others as well as developed in-house. To characterize the tension between privacy and utility of the data with the use of anonymization, we report empirical measurements of how anonymization effects the utility of data in the context of network security -- to our knowledge this is the first work to attempt quantifying this tradeoff. Lastly, since different organizations have security policies with different privacy requirements, we present multi-level anonymization options that more closely match real sharing environments between parties since no one-size-fits-all anonymization scheme will work. While this work has focused on anonymization of network data for safe sharing, the concepts are certainly transferable to other domains.

(joint work with Clay Woolam, Latifur Khan, and Bhavani Thuraisingham from UT-D)

Tal Z. Zarsky, University of Haifa

Title: E Gov, Online Citizen Scrutiny and Participation - The Joint Challenges for Cryptologists and Policy Makers

The people's ability to play a direct role in the broad democratic discourse was always considered limited in view of their lack of time, expertise and ability. The limited role for individuals as direct participants is about to change, if indeed recent technological and social developments closely related to the emergence of the internet would be applied and used correctly and thoughtfully. Applying the internet's technological and social potential to meet the objectives mentioned is a tedious task. In this paper, I generally address this task, while focusing on privacy-related issues. I explain that the challenges at hand could be met only if scholars and policy makers from the fields of information law and cryptology work closely together

I approach the privacy concerns arising in this context by focusing on two crucial points. First, I examine the point of intake of information by citizens so that they can engage in the various civic tasks mentioned. Here, we must balance among two important, yet apparently conflicting interests. To generate an effective discourse, we must provide the public with as much information as possible. On the other hand, we must keep in mind the privacy interests of data subjects whose information is stored by government and made public. Various strategies could be applied so to try and meet both objectives.

Second, I address the point of output, when individuals convey information and participate in the ongoing discourse ? a process that might require they disclose personal information about themselves prior to participating. The process would only be fair and democratic if every participant would be provided an equal voice to his or her peers. However, throughout this process, parties with various interests will strive to promote their own ideas, agendas and objectives by attempting to magnify their voice in the online realm. An easy response to this concern is requiring all participants to make use of a constant and personal identifier. Yet this somewhat radical response would be criticized as undermining the ability of the citizen to voice her opinion while exercising anonymity. Therefore, scholars and policy makers from both the fields of law and computer science must construct mid-way solutions that will balance among these somewhat conflicting objectives of both fairness and privacy.

Previous: Program
Workshop Index
DIMACS Homepage
Contacting the Center