DIMACS Working Group on Privacy / Confidentiality of Health Data

December 10 - 12, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Rakesh Agrawal, IBM Almaden, ragrawal@acm.org
Larry Cox, CDC, lcox@cdc.gov
Joe Fred Gonzalez, CDC, jfg2@cdc.gov, chair
Harry Guess, University of North Carolina, harry_guess@unc.edu
Tomas Sander, HP Labs, tosander@exch.hpl.hp.com
Presented under the auspices of the Special Focus on Communication Security and Information Privacy and
the Special Focus on Computational and Mathematical Epidemiology.

Subgroup meeting: DIMACS Working Group on Data De-Identification, Combinatorial Optimization, Graph Theory, and the Stat/OR Interface.

Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance, and to epidemiological research. Moreover, the Health Insurance Portability and Accountability Act (HIPAA) of 1996 imposes very strict standards for rendering health information not individually identifiable. One approach involves removing a number of potential identifiers, including all dates of health events. Because causes must precede effects, removing temporal relationships from a dataset makes it all but impossible to use the data for etiologic research or for studies of medical care outcomes. Another approach to de-identification under HIPAA is to obtain an expert statistical opinion that the risk of identifying an individual is very small, but accepted standards for making such a determination have not been developed. A serious challenge is how to use large health care databases to detect medical or terrorist risks and improve health care quality while maintaining the privacy and confidentiality of the data. The problem is of interest to government agencies at all levels, to industrial and academic researchers, and to a growing commercial sector that collects, maintains, and markets such data sets. Few computer scientists knowledgeable about methods of cryptography, security, and privacy have become involved in this area (though some have), and the area is ripe for new partnerships among the public health/epidemiology community, the health data industry, and the computer science community. This working group was motivated by the work of the DIMACS Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis, which is part of the DIMACS Special Focus on Computational and Mathematical Epidemiology. It will meet either separately before the first meeting of the Working Group on On-line Privacy: Threats and Tools or in conjunction with it.
The group will explore computational techniques for ensuring that an individual whose record appears in a released data set cannot be identified. The challenge is to produce anonymous data that is specific enough to be useful for research and analysis. The group will consider ways to remove direct identifiers (social security number, name, address, telephone number) and ways to aggregate, substitute, and remove information from data sets. Also of interest will be questions about using electronic data matching to link data elements from various sources/data sets in order to identify individuals, while maintaining the privacy of others. The group will investigate methods for privacy protection in field-structured data and ways to extend existing methods to large data sets, as well as systems to render textual data sufficiently anonymous. Finally, the group will explore formal frameworks for disclosure control and formal protection models.
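
As a rough illustration of the operations named above (removing direct identifiers, aggregating, and substituting values), the following minimal Python sketch may help. It is not a method proposed by the working group. The field names, the 10-year age bands, the 3-digit ZIP prefix, and the group-size check are all assumptions chosen for illustration, and the records are made up.

```python
from collections import Counter

# Hypothetical field names; the announcement names only the categories
# (social security number, name, address, telephone number).
DIRECT_IDENTIFIERS = {"ssn", "name", "address", "phone"}


def deidentify(record):
    """Drop direct identifiers and coarsen two other fields (illustrative only)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Aggregate: replace an exact age with a 10-year band (assumed granularity).
    if "age" in out:
        low = (out["age"] // 10) * 10
        out["age"] = f"{low}-{low + 9}"
    # Substitute: keep only the first three digits of the ZIP code (assumed rule).
    if "zip" in out:
        out["zip"] = out["zip"][:3] + "**"
    return out


def smallest_group(records, keys):
    """Size of the smallest set of records sharing the same values on `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Made-up records for demonstration.
records = [
    {"ssn": "000-00-0001", "name": "A", "address": "1 Main St", "phone": "555-0100",
     "age": 34, "zip": "08854", "diagnosis": "asthma"},
    {"ssn": "000-00-0002", "name": "B", "address": "2 Main St", "phone": "555-0101",
     "age": 37, "zip": "08855", "diagnosis": "flu"},
    {"ssn": "000-00-0003", "name": "C", "address": "3 Main St", "phone": "555-0102",
     "age": 52, "zip": "08901", "diagnosis": "asthma"},
    {"ssn": "000-00-0004", "name": "D", "address": "4 Main St", "phone": "555-0103",
     "age": 58, "zip": "08904", "diagnosis": "diabetes"},
]

released = [deidentify(r) for r in records]
print(smallest_group(records, ["age", "zip"]))   # group size before coarsening
print(smallest_group(released, ["age", "zip"]))  # group size after coarsening
```

The group-size check is one simple way to see the trade-off described above: coarser values put more records in each group, which makes individuals harder to single out but leaves the data less specific for research.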

Document last modified on November 18, 2003.