DIMACS Tutorial: Statistical De-identification of Confidential Health Data with Application to the HIPAA Privacy Regulations

April 30 - May 1, 2009
DIMACS Center, CoRE Building, Rutgers University

Daniel Barth-Jones, Columbia University, db2431 at columbia.edu
Alina Campan, Northern Kentucky University, campana1 at nku.edu
Traian Marius Truta, Northern Kentucky University, trutat1 at nku.edu
Presented under the auspices of the Special Focus on Computational and Mathematical Epidemiology and the Special Focus on Communication Security and Information Privacy.
Abstracts:

Alina Campan, Northern Kentucky University

Title: Emerging Privacy Threats for Health Data

Most of the work in statistical disclosure control (SDC) and privacy-preserving data mining (PPDM) has been done with respect to microdata or tabular data. Confidential information is now also stored in non-traditional data models, and new privacy-preserving methods are needed. Privacy concerns arise in the context of location-based services, cloud computing, and social networks. The focus of this talk is on privacy threats to healthcare data due to the growth of social networks. The growth of social network sites over the last few years is a trend that will likely continue in the years to come. Online social interaction has become very popular around the globe, and most sociologists agree that it will not fade away. Social network sites gather confidential information from their users (for instance, the social network site PatientsLikeMe, http://www.patientslikeme.com/, collects confidential health information) and, as a result, social network data has begun to be analyzed from a different, specific privacy perspective. Because the individual entities in a social network have relationships with other entities in addition to the attribute values that characterize them, the risk of disclosure increases. We present solutions to anonymize a social network, and we introduce a structural information loss measure that quantifies the amount of information lost due to edge generalization in the anonymization process.

Traian Marius Truta, Northern Kentucky University

Title: Overview of Statistical Disclosure Control and Privacy-Preserving Data Mining

Protecting data in such a way that it can be publicly released without giving away confidential information that can be linked to specific entities (individuals, businesses, etc.) is an important and complex problem in today's information age. Researchers and practitioners from several fields have tried to define this problem in their particular context. The most common names for this problem were introduced in the statistical (Statistical Disclosure Control - SDC) and computer science (Privacy-Preserving Data Mining - PPDM) fields. Both the SDC and PPDM communities have introduced similar, yet different, approaches that modify data so that the disclosure of individuals is prevented and the information loss is small. In this talk we will introduce the terminology associated with each field, present methods to de-identify data, define disclosure risk and information loss, and discuss SDC and PPDM in the context of healthcare data.

Traian Marius Truta, Northern Kentucky University

Title: De-identifying Health Data: Measuring and Controlling Disclosure Risk

Guaranteeing a minimum level of protection for released data is required not only by common sense but also by various privacy regulations. Accurately measuring and limiting the risk of disclosure is therefore an important problem. Various approaches to assessing disclosure risk are presented. We will focus on three disclosure risk measures, namely minimal, maximal, and weighted disclosure risk. The minimal disclosure risk measure represents the percentage of records that can be correctly identified by an intruder based on prior knowledge of key attribute values. The maximal disclosure risk measure also considers the risk associated with probabilistic record linkage for records that are not unique in the masked microdata. The weighted disclosure risk measure allows the data owner to compute the risk of disclosure based on weights associated with different clusters of records. In the second part of the talk we will discuss how to control the disclosure risk while preserving the utility of the data. In this context, we will present several properties (k-anonymity, p-sensitive k-anonymity, l-diversity, etc.) that a released dataset may have in order to prevent the disclosure of confidential information.
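
As a rough illustration of two ideas named in this abstract, the Python sketch below groups records by their key attribute values and computes (a) the fraction of records that are unique on those attributes, which is one plausible reading of the minimal disclosure risk described above, and (b) whether the data satisfies k-anonymity and p-sensitive k-anonymity. The function names, the attribute names (zip, age, diagnosis), and the sample records are hypothetical and are not taken from the tutorial; the speaker's exact definitions may differ.

```python
from collections import Counter, defaultdict


def group_sizes(records, key_attrs):
    """Count how many records share each combination of key attribute values."""
    return Counter(tuple(r[a] for a in key_attrs) for r in records)


def minimal_disclosure_risk(records, key_attrs):
    """Fraction of records that are unique on the key attributes.

    Assumption: a record is 'correctly identified' when no other record
    shares its key attribute values.
    """
    if not records:
        raise ValueError("records must not be empty")
    sizes = group_sizes(records, key_attrs)
    unique = sum(1 for count in sizes.values() if count == 1)
    return unique / len(records)


def is_k_anonymous(records, key_attrs, k):
    """True if every key-attribute combination occurs at least k times."""
    return min(group_sizes(records, key_attrs).values()) >= k


def is_p_sensitive_k_anonymous(records, key_attrs, sensitive_attr, p, k):
    """True if the data is k-anonymous and each group of records sharing
    key attribute values has at least p distinct sensitive values."""
    if not is_k_anonymous(records, key_attrs, k):
        return False
    values = defaultdict(set)
    for r in records:
        values[tuple(r[a] for a in key_attrs)].add(r[sensitive_attr])
    return all(len(v) >= p for v in values.values())


# Hypothetical masked microdata: zip and age are already generalized.
KEYS = ("zip", "age")
SAMPLE = [
    {"zip": "410**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "410**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "410**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "410**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "450**", "age": "40-49", "diagnosis": "diabetes"},
]

if __name__ == "__main__":
    print(minimal_disclosure_risk(SAMPLE, KEYS))
    print(is_k_anonymous(SAMPLE, KEYS, 2))
    print(is_p_sensitive_k_anonymous(SAMPLE[:4], KEYS, "diagnosis", 2, 2))
```

In the sample, one of the five records is unique on (zip, age), so the unique-record fraction is 0.2 and the data is not 2-anonymous. Dropping that record makes the remaining four records 2-anonymous, but not 2-sensitive 2-anonymous, because one group contains a single diagnosis value.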

Document last modified on April 28, 2009.