DIMACS/PORTIA Workshop on Privacy-Preserving Data Mining

March 15 - 16, 2004
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:: Cynthia Dwork, Microsoft, dwork at microsoft.com; Benny Pinkas, HP Labs, benny.pinkas at hp.com; Rebecca Wright, Stevens Institute of Technology, rwright at cs.stevens-tech.edu

Presented under the auspices of the Special Focus on Communication Security and Information Privacy, and the PORTIA project.

The workshop is followed by a related working group on March 17, 2004.

Abstracts:

Steve Bellovin, AT & T Research

Title: Privacy-Enhanced Searches Using Encrypted Bloom Filters

It is often necessary for two or more or more parties that do not fully trust each other to selectively share data. We propose a search scheme based on Bloom filters and Pohlig-Hellman encryption. A semi-trusted third party can transform one party's search queries to a form suitable for querying the other party's database, in such a way that neither the third party nor the database owner can see the original query. Furthermore, the encryption keys used to construct the Bloom filters are not shared with this third party. Provision can be made for third-party ``warrant servers'', as well as ``censorship sets'' that limit the data to be shared.

Yan-Cheng Chang, Harvard University

Title: Privacy Preserving Keyword Searches on Remote Encrypted Data

We consider the following problem: a user U wants to store his files in an encrypted form on a remote file server S. Later the user U wants to efficiently retrieve some of the encrypted files containing (or indexed by) specific keywords, keeping the keywords themselves secret and not jeopardizing the security of the remotely stored files. For example, a user may want to store old e-mail messages encrypted on a server managed by Yahoo or another large vendor, and later retrieve certain messages while traveling with a mobile device.

In this talk, we offer solutions for this problem under well-defined security requirements. Our schemes are efficient in the sense that no public-key cryptosystem is involved. Indeed, our approach is independent of the encryption method chosen for the remote files. They are also incremental, in that U can submit new files which are totally secure against previous queries but still searchable against future queries.

Shuchi Chawla, CMU

Title: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

In this talk, I will discuss some ongoing work with Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer and Hoeteck Wee. We consider the problem of modifying (sanitizing) high-dimensional data, such as census data, in such a way that the sanitized data can be published without violating the privacy of the original data, and yet retains its utility. Our definition of privacy is based on the intuition that the privacy of an individual is preserved to the extent that the individual blends in with the crowd. Based on this definition, using a geometric approach, we give some (preliminary) results on the preservation of privacy as well as utility.

Lawrence Cox, National Center for Health Statistics

Title: Confidentiality in Tables Viewed from an Algebraic Perspective

National statistical offices and other organizations that release statistical data products derived from confidential information collected from individual respondents (persons, businesses, organizations) have a responsibility to preserve the confidentiality of individually identifiable data. Methods for confidentiality protection in statistical tabulations (tables) of nonnegative counts include cell suppression, rounding, perturbation and tabular adjustment. Complementary cell suppression is an NP-hard problem, and the latter three methods are similarly complicated by the need to preserve tabular structure (additivity of item detail to lowest-level totals, of lowest totals to next-level totals, etc.), resulting in challenging mathematical and computational problems. In this presentation, I will briefly summarize these confidentiality protection methods and relate them to a particular mathematical problem, namely, that of computing moves between neighboring solutions to a corresponding integer linear program.

Giovanni DiCrescenzo, Telcordia

Title: Public-Key Encryption with Keyword Search

We study the problem of searching on data that is encrypted using a public key system. Consider user Bob who sends email to user Alice encrypted under Alice's public key. An email gateway wants to test whether the email contains the keyword ``urgent'' so that it could route the email accordingly. Alice, on the other hand does not wish to give the gateway the ability to decrypt all her messages. We define and construct a mechanism that enables Alice to provide a key to the gateway that enables the gateway to test whether the word ``urgent'' is a keyword in the email without learning anything else about the email. We refer to this mechanism as {\em Public Key Encryption with keyword Search}. As another example, consider a mail server that stores various messages publicly encrypted for Alice by others. Using our mechanism Alice can send the mail server a key that will enable the server to identify all messages containing some specific keyword, but learn nothing else. We define the concept of public key encryption with keyword search and give several constructions.

Michael Freedman, NYU

Title: Efficient Private Matching and Set Intersection

We consider the problem of computing the intersection of private datasets of two parties, where the datasets contain lists of elements taken from a large domain. This problem has many applications for online collaboration. We present protocols, based on the use of homomorphic encryption and balanced hashing, for both semi-honest and malicious environments. For lists of length $k$, we obtain $O(k)$ communication overhead and $O(k\ln\ln k)$ computation. The protocol for the semi-honest environment is secure in the standard model, while the protocol for the malicious environment is secure in the random oracle model. We also consider the problem of approximating the size of the intersection, show a linear lower-bound for the communication overhead of solving this problem, and provide a suitable secure protocol. Lastly, we investigate other variants of the matching problem, including extending the protocol to the multi-party setting as well as considering the problem of approximate matching.

Yuval Ishai, Technion

Title: Extending Oblivious Transfers Efficiently

Oblivious Transfer (OT) is a common building block in secure computation protocols, which typically forms their efficiency bottleneck. We consider the problem of extending oblivious transfers: Given a small number of OTs "for free," can one implement a large number of OTs? Beaver has shown that a one-way function is sufficient to extend OTs. However, his extension method is very inefficient in practice, in part due to its non-black-box use of the underlying one-way function.

We give efficient protocols for extending OTs in the random oracle model, and put forward a new cryptographic primitive which can be used as a black box to instantiate the random oracle in our protocols. Our methods substantially reduce the amortized cost of OT, allowing each additional OT to be generated using just a few applications of a cheap "symmetric" primitive.

Joint work with Joe Kilian, Kobbi Nissim, and Erez Petrank.

Markus Jakobsson, RSA Laboratories

Title: Cryptographic Randomized Response Techniques

I will describe cryptographically secure techniques to guarantee unconditional privacy for respondents to polls. The constructions are efficient and practical, and are shown not to allow cheating respondents to affect the ``tally'' by more than their own vote - which will be given the exact same weight as that of other respondents. This is joint work with Helger Lipmaa and Andris Ambainis. The paper appeared at PKC 04, and is available on www.markus-jakobsson.com

Hillol Kargupta, University of Maryland

Title: Random Encodings, Privacy Loss, and Some Possible Solutions: A Coding Theory Perspective

This talk will offer a perspective of privacy preserving data mining using randomized perturbations in the light of coding theory. It will discuss the interplay of additive noise and randomized linear encodings. It will also raise the following question: When the noise is additive, can linear encodings be used for estimating the original data from the perturbed version (i.e. breaching the privacy) with arbitrary high accuracy by increasing the dimension of the encoded space? This issue appears to be related to the idea of increasing the accuracy of the data transmission quality through a noisy channel by slowing down the data transmission rate---a direct outcome of Shannon's Second Theorem. The talk will also discuss the possibility of a new approach toward designing privacy preserving data mining algorithms by constructing two separate ``transmission channels''---one for the original data and another for the type of patterns in the data that we intend to preserve. The idea is to introduce perturbations or transformations that reduce the ``transmission capacity'' of the data-channel for minimizing the privacy-loss for the data but increase the ``transmission capacity'' of the pattern-channel for enhancing the quality of the patterns detected by the data mining algorithm.

Alan Karr, National Institute of Statistical Sciences

Title: Secure Regression on Distributed Databases

Several methods will be presented for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others.

Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis then can be performed on the integrated data.

Secure multi-party computation based on shared local statistics effects computations necessary to compute least squares estimators of regression coefficients and error variances by means of analogous local computations that are combined additively using the secure summation protocol. Two approaches are described to model diagnostics in this setting, one using shared residual statistics and the other using secure integration of synthetic residuals.

Eyal Kushilevitz, Technion

Title: Amortized PIR via Batch Codes

Private Information Retrieval (PIR) protocols allow a user to retrieve data items from a database, while keeping secret the identity of the retrieved items. Research on PIR protocols concentrated so far mostly on the communication complexity of these protocols and, in particular, achieved sub-linear communication. On the other hand, the time complexity of such protocols can be proven to be at least linear which make time the most problematic resource for PIR protocols. In the talk I will discuss ways to go around this obstacle by amortizing the time complexity. I will mostly concentrate on a method that uses a new concept termed ``Batch Codes'' which is interesting for its own.

Based on joint work with Y. Ishai, A. Sahai, and R. Ostrovsky.

Yehuda Lindell, IBM

Title: On the Difficulty of Defining Ideal Functionalities for Privacy
Preserving Data Mining: Why naive Secure Multiparty Computation Fails

The techniques and paradigms of secure multi-party computation can be used to model the security requirements of essentially any distributed task. Thus, they can also be used in the setting of privacy preserving data mining. We present the ideal/real model paradigm of defining security and show how it can be used to model problems of secure distributed computing, and in particular, the problem of privacy preserving data mining. We then present two examples of definitions of security for specific data mining applications that have the following paradoxical and surprising property. On the one hand, the definitions clearly capture the intuition that the parties' privacy is fully preserved. However, on the other hand, we show that the *dangers* of non-private data mining are still present when using these definitions of privacy-preserving data mining. Beyond the immediate conclusion that great care must be taken when defining security (even via the ideal/real model paradigm), we conclude that a deep understanding of the dangers of non-private data mining must be obtained before attempting to solve the cryptographic privacy-preserving data mining problem.

Joint work with Rosario Gennaro, Tal Rabin and Tal Zarsky.

Kobbi Nissim, Microsoft

Title: Privacy-Preserving Datamining on Vertically Partitioned Databases

In a recent paper Dinur and Nissim considered a statistical database in which a trusted database administrator monitors queries and introduces noise to the responses with the goal of maintaining data privacy. They proved that a noise of magnitude at least sqrt(n) is required to avoid privacy breaches by a poly-time adversary making, roughly, a linear number of queries. As databases grow increasingly large, the possibility of being able to issue only a sub-linear number of queries becomes realistic. The possibility of exploiting a sub-linear bound on the adversary was first explored by Dinur, Dwork and Nissim who demonstrated a database mechanism that guarantees privacy with noise magnitude less than sqrt(n).

We further investigate this situation, generalizing the previous work in two directions. (i) We define and analyze privacy for multi-attribute databases. In particular, we deal with a richer set of queries, and our privacy requirements deal not only with single attributes, but with functions of attributes. (ii) We demonstrate the usability of our database mechanism for datamining vertically partitioned databases, in which different subsets of attributes are stored in different databases. We also show how to use our datamining techniques on published noisy statistics, allowing publication of statistical data in a census-bureau like publication of aggregate statistics.

Joint work with Cynthia Dwork.

Emmanuel Pontikakis, University of Patras

Title: An Experimental Study of Association Rule Hiding Techniques

Data mining provides the opportunity to extract useful information from large databases. Various techniques have been proposed in this context in order to extract this information in the most efficient way. However, efficiency is not our only concern in this study. The security and privacy issues over the extracted knowledge must be seriously considered as well. By taking this into consideration, we study the procedure of hiding sensitive association rules in binary data sets by using distortion and blocking techniques and we present some state of the art algorithms in this field. We also provide a fuzzification of the support and the confidence of an association rule in order to accommodate for the existence of blocked/unknown values. In addition, we quantitatively compare the proposed algorithms by running experiments on binary data sets, and we also qualitatively compare the efficiency of the proposed algorithms in hiding association rules. We utilize the notion of border rules, by putting weights in each rule, and we use effective data structures for the representation of the rules so as (a) to minimize the side effects created by the hiding process and (b) to speed up the selection of the victim transactions. Finally, we study the overall security of the modified database, using the C4.5 decision tree algorithm of the WEKA data mining tool, and we discuss the advantages and the limitations of blocking.

Aleksandra B. Slavkovic, Carnegie Mellon University

Title: Tabular Data: Releases of Conditionals and Marginals

Statistical disclosure limitation applies statistical tools to the problem of limiting releases of sensitive information about individuals and groups that are part of statistical databases while allowing for proper statistical inference. Within this context, Dobra and Fienberg (2000, 2002) have employed Markov bases, in connection with decomposable and graphical log-linear models given a set of margins, to establish bounds and distributions for the cell entries in contingency tables. In this talk, we present a framework for finding the bounds and distribution when given an arbitrary collection of marginals and conditionals. We extend the results of Arnold et al.(1999) on the uniqueness of discrete distributions and describe new results on bounds for cell entries in k-way tables estimated via optimization methods such as linear and integer programming. We give a complete characterization of the two-way table problem and discuss extensions to multi-way tables including relationships to directed acyclic graphical models. We use tools from algebraic geometry to represent the tables of counts and describe the locus T of all possible tables under the given constraints. Markov bases needed to construct a connected Markov chain over T are described. These bases can be used to induce probability distributions over the space of possible tables via Markov Chain Monte Carlo sampling. This research presents new theoretical links between disclosure limitation, statistical theory and computational algebraic geometry and practical implications for confidentiality and statistical disclosure limitation.

Martin Strauss, AT&T Research

Title: Private Data Mining Based on Randomized Linear Projections

Consider three (or more) parties, Alice, Bob, and Charlotte, who each have a dataset of items with attribute value in the range [0,N). They represent their datasets by vectors A, B, and C of N frequency counts. They want to perform datamining on the merged dataset, represented by the vector sum M = A + B + C of the individual frequency count vectors. There are three main issues confronting the parties: privacy, efficiency (of both time and communciation), and (approximate) correctness. We argue for a particular approach, which differs from some previous work in its notion of privacy. We then give a number of examples.

Specifically, we require that neither the messages nor the output of a suitable protocol give more information than the ideal exact answer, M. By contrast, traditional cryptographic techniques applied to an approximation protocol will preserve time efficiency and hide information leakage in the *messages*; unfortunately, the blowup in communication may be exponential and the approximate *output* may leak private information, giving a result that is neither efficient nor private. Another class of techniques calls for adding noise to the inputs and/or outputs of a computation; generally, the privacy guarantee is weak unless so much noise is added that the useful information is also obliterated.

We show that several approximations to the vector sum fit our notion of privacy, after straightforward modificiation. These approximations, including histograms, Fourier decompositions and quantile summaries, were developed without privacy in mind, and are based on random linear projections. We get a statement of the following sort: An approximation of size B (e.g., a B-bucket histogram) to the vector sum of vectors of length N, which has sum square error at most (1+epsilon) times optimal, can be constructed in time poly(BN/epsilon) and communication poly(B*log(N)/epsilon)---similar to the non-private setting. Nothing is learned by any party beyond the true exact value of the vector sum.

This represents joint work (in several papers) with Feigenbaum, Gilbert, Guha, Indyk, Ishai, Malkin, Muthukrishnan, Nissim, and Wright, and work of others, e.g., Hershberger, Johnson, Kargupta, Kushilevitz, Mansour, Ostrovsky, Park, and Rabani.

David Woodruff, MIT

Title: Private Inference Control

Access control can be used to ensure that database queries pertaining to sensitive information are not answered. This is not enough to prevent users from learning sensitive information though, because users can combine non-sensitive information to discover something sensitive. Inference control prevents users from obtaining sensitive information via such ``inference channels'', however, existing inference control techniques are not private - that is, they require the server to learn what queries the user is making in order to deny inference-enabling queries.

We propose a new primitive - private inference control (PIC) - which is a means for the server to provide inference control without learning what information is being retrieved. PIC is a generalization of private information retrieval (PIR) and symmetrically-private information retrieval (SPIR). While it is straightforward to implement access control using PIR (simply omit sensitive information from the database), it is nontrivial to implement inference control efficiently. We measure the efficiency of a PIC protocol in terms of its communication complexity, its round complexity, and the work the server performs per query. Under existing cryptographic assumptions, we present a PIC protocol that is essentially optimal in terms of communication and round complexity (w.r.t. known PIR protocols). We also present a generic reduction which shows that one can focus on designing PIC schemes for which the inference channels take a particularly simple threshold form.

Joint work with Jessica Staddon

Tal Zarsky, Yale University

Title: Data Mining and Information Privacy - New Problems and the Search For Solutions

Recent technological and social changes have lead to the emergence of information privacy concerns. Data mining, as a powerful form of data analysis, plays several roles in the information privacy context. First, data mining potentially affects the problems stemming from the collection of personal information. In this context, I will address the various concerns voiced in view of today's abilities to gather and use vast amounts of personal data, and determine how such concerns are affected by the availability of data mining applications. I emphasize concerns that personal information will be used to discriminate among individuals, prove harmful to their autonomy, and lead to unfair results in view of erroneous data and processes.

Second, the ability to use data mining applications will also affect policy decisions as to what forms of solutions are needed for solving privacy concerns. Here I will address the ways in which data mining might undermine the effectiveness of specific solutions, as well as mention the benefits of this form of analysis, which might be lost with the introduction of other privacy enhancing proposals. Finally, I will briefly address the new forms of solutions data mining technology enables and how they are viewed in the privacy law context.

Previous: Participation

Next: Registration

Workshop Index

DIMACS Homepage

Contacting the Center
Document last modified on May 5, 2004.