DIMACS Working Group on Privacy / Confidentiality of Health Data

December 10 - 12, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:
Rakesh Agrawal, IBM Almaden, ragrawal@acm.org
Larry Cox, CDC, lcox@cdc.gov
Joe Fred Gonzalez, CDC, jfg2@cdc.gov, chair
Harry Guess, University of North Carolina, harry_guess@unc.edu
Tomas Sander, HP Labs, tosander@exch.hpl.hp.com
Presented under the auspices of the Special Focus on Communication Security and Information Privacy and
Special Focus on Computational and Mathematical Epidemiology.

Subgroup meeting DIMACS Working Group on Data De-Identification, Combinatorial Optimization, Graph Theory, and the Stat/OR Interface.


Abstracts:


Nabil Adam and Aabhas Paliwal, Rutgers University CIMIC/MERI

Title: Semantic Web Services for Privacy/Confidentiality of Health Care Data

Information systems technology allows instant retrieval of medical information, widening access to a greater number of people. Computerization of medical records has also threatened patient privacy and, in particular, has increased the potential for misuse, especially in the form of non-consensual secondary use of personally identifiable records. The most fundamental principle of fair use of information is that no secondary use of medical information should take place unless authorized by the patient.

This presents a challenge for ensuring privacy and confidentiality protection while providing authorized users with the convenience of e-Healthcare. We investigate the markup of web services with a semantic policy language as an alternative to traditional authentication and access control methods.

This augmentation provides a more efficient and flexible management capability for privacy and confidentiality issues in e-Healthcare. Semantically rich policy representations can simplify policy specification, reduce policy conflicts, and facilitate interoperability. This aids authentication and authorization by providing support for complex problem solving, knowledge modeling, and reuse. We present a semantically rich, web-services-motivated approach to e-Healthcare and discuss the merits of this integrated approach.


Daniel C. Barth-Jones, Center for Healthcare Effectiveness Research, Wayne State University School of Medicine, Detroit, MI

Title: Protecting the Privacy of Healthcare Data While Preserving the Utility of Geographic Location Information for Epidemiologic Research

Epidemiologic and healthcare systems research conducted with administrative healthcare data has demonstrated considerable utility and value for the healthcare system in the U.S., which has resulted in a well-developed healthcare information industry utilizing such data. The recent implementation of the HIPAA Privacy Standards, however, has necessitated dramatic changes in the process of conducting research with administrative data. Under the privacy standards, conducting research with statistically de-identified administrative data is an attractive option because such data can be used without restrictions. Demographic and geographic characteristics in administrative data sets are particularly important determinants of disclosure risks for confidential medical information as well as essential data for many types of epidemiologic analyses. A framework is presented for conducting disclosure risk analyses for administrative data that considers the real-world complications involved in data intrusion attempts through record linkage methods. Disclosure risk analyses are reported focusing on three variables commonly found in administrative data: 1) date of birth/age categorization, 2) gender, and 3) geographic location detail. Disclosure risks are examined as a result of population density and the cross-classification structure of the demographic variables. Results of these analyses indicate that considerable disclosure control can be achieved with simple modifications of administrative data sets while preserving important geographic location detail.
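The kind of cross-classification analysis described above can be illustrated with a minimal sketch. It is not the framework presented in the talk: it only measures sample uniqueness (the share of records that are alone in their demographic/geographic cell), which is a crude proxy for disclosure risk, and the records, field layout, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (birth_year, gender, zip_code). Not real data.
RECORDS = [
    (1950, "F", "48201"),
    (1951, "F", "48201"),
    (1952, "F", "48202"),
    (1950, "M", "48201"),
    (1953, "M", "48202"),
    (1954, "M", "48202"),
]


def share_unique(records, generalize):
    """Fraction of records that are the only member of their cell
    after `generalize` maps each record to its quasi-identifier key."""
    keys = [generalize(r) for r in records]
    cell_sizes = Counter(keys)
    unique = sum(1 for k in keys if cell_sizes[k] == 1)
    return unique / len(records)


def exact(record):
    # Full detail: birth year, gender, five-digit ZIP code.
    return record


def coarse_age(record):
    # Five-year birth cohort; gender and full ZIP code are kept unchanged.
    birth_year, gender, zip_code = record
    return (birth_year // 5 * 5, gender, zip_code)


risk_exact = share_unique(RECORDS, exact)        # every record is unique
risk_coarse = share_unique(RECORDS, coarse_age)  # two of six remain unique
```

In this toy data, coarsening only the age variable lowers the share of unique records from 1.0 to 1/3 while leaving the geographic detail untouched. A realistic analysis would compare against population counts rather than the sample alone.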

Biosketch:
Daniel Barth-Jones earned his M.P.H. degree in general epidemiology and his Ph.D. in Epidemiologic Science at the University of Michigan. He is currently an assistant professor and epidemiologist for the Center for Healthcare Effectiveness Research at Wayne State University. His interest in statistical disclosure issues began in the early 1990s while he was employed as a senior statistician at HCIA (now Solucient), a large healthcare information organization. His current focus is on practical and applied problems in disclosure risk assessment and disclosure control faced by healthcare information vendors. He also maintains an active research agenda in the computer simulation of potential HIV vaccine impacts on epidemic control.


Judith Beach, Associate General Counsel Regulatory Affairs, Chief Privacy Officer, Quintiles Transnational

Title: Health Care Databases under HIPAA: Statistical Approaches to De-identification of Protected Health Information

Dr. Beach's talk will address:

  1. Evolution of De-identification Standards - HIPAA Privacy Regulation
  2. De-identification Standards for Health Information in Research
      1. Safe Harbor
      2. Statistician Method
        (1) HIPAA Provisions
        (2) Quintiles Experience and Methodology
      3. Limited Data Set
  3. Preemption of State Laws on De-identification Standards for Health Information
  4. Health Information Privacy - Cases and Controversies

Biosketch:
Dr. Judith E. Beach is the Associate General Counsel for Regulatory Affairs, Chief Privacy Officer, and Coordinator of Government Relations with Quintiles Transnational Corp., a global private company headquartered near Research Triangle Park, North Carolina. Quintiles helps improve healthcare worldwide by providing a broad range of professional services, information and partnering solutions to the pharmaceutical, biotechnology and healthcare industries. Dr. Beach's responsibilities include providing legal and regulatory advice and guidance to Quintiles' personnel on various international and domestic regulatory issues concerning the pharmaceutical, medical device, and biotechnology industries. In this capacity, she has been involved in providing counsel with respect to good clinical practices in the conduct of clinical trials and the protection of human participants in research with respect to investigators, institutional review boards, sponsors, and clinical research monitors, and good manufacturing practices regarding drug product ownership. She is the Chair of the Company's Council on Research Ethics (CORE), which monitors ethical issues related to all stages of research. As the Chief Privacy Officer and Chair of the Council on Data Protection, Quintiles' internal privacy board, she coordinates the monitoring of the company's policies and procedures for protection of individually identifiable information, including the protection of research subjects' confidential health information. In 2002, she served as Assistant Secretary for the new trade association, the Association of Clinical Research Organizations, and currently participates in ACRO's Policies and Practices and Ethics and Clinical Practice Committees.

Dr. Beach graduated cum laude from Georgetown Law Center and then served as a judicial clerk for the District of Columbia Court of Appeals. Thereafter, she was an associate attorney with two Washington, D.C., law firms: Akin, Gump, Strauss, Hauer & Feld and Hyman, Phelps & McNamara, P.C., where she specialized in civil litigation and food, drug, and medical device law, respectively. She is admitted to the Bars of the District of Columbia, Virginia, Maryland, and North Carolina and is admitted to practice before the United States Supreme Court. Prior to law school, Judith received her B.S. degree summa cum laude from Clemson University and her Ph.D. in Physiology and Pharmacology from Duke University. She was a Fellow in Reproductive Endocrinology at the University of California San Francisco, and then a clinical investigator at Walter Reed in Washington, D.C. Dr. Beach has numerous publications in the fields of both science and law and is an honorary editor of the journal Pharmaceutical Development and Regulation. Dr. Beach has been elected as a member of the prestigious scientific societies, Sigma Xi and the Endocrine Society.


K. Arnold Chan, Department of Epidemiology, Harvard School of Public Health

Title: The Health Insurance Portability and Accountability Act (HIPAA) and its Implications on Epidemiological Research Using Large Databases.

With advances in information technology and the third-party insurance scheme in medical care delivery systems, large administrative databases in health care have been used around the world to address important public health questions. Unlike clinical trials and prospective observational studies, it is not feasible to obtain individual consent or authorization for studies in which these health care data are utilized. Under HIPAA regulations in the U.S., investigators can access this information without individual authorization if the Institutional Review Board or the Privacy Board grants a waiver of patient authorization. In order to obtain such waivers, investigators need to follow the "Minimal Necessary Principle" during data development, implement data transformation strategies to de-identify selected data elements, and maintain robust data systems to safeguard Protected Health Information. Examples will be presented to illustrate how certain data development steps have been used within the HMO Research Network for various studies to meet HIPAA standards.


Lawrence H. Cox, Associate Director for the Office of Research and Methodology, NCHS, CDC

Title: Overview of Statistical Disclosure Limitation

I will provide a brief overview of statistical disclosure limitation (SDL), including: identifying statistical disclosure (viz., what is statistical disclosure?); quantifying disclosure (viz., how do I know I have a disclosure problem; how much of a problem; when am I done solving the problem?); and limiting disclosure (viz., how do I solve the problem; how well can I solve the problem?). Both tabular data and (unit record) microdata will be covered. Time permitting, I will comment on disclosure in statistical maps, models and data base query systems.

Biosketch:
Lawrence H. Cox, Ph.D. is Associate Director for Research and Methodology, National Center for Health Statistics, Centers for Disease Control and Prevention. Prior to joining NCHS, Dr. Cox served as the Senior Mathematical Statistician for the U.S. Environmental Protection Agency. Other previous positions include Senior Mathematical Statistician for the U.S. Census Bureau and Director, Board on Mathematical Sciences, U.S. National Academy of Sciences. He has taught for local universities, the Joint Program in Survey Methodology, and other organizations. Dr. Cox holds a Ph.D. in Mathematics from Brown University.

Dr. Cox has over 100 publications in the scientific literature. He is an elected Fellow of the American Statistical Association, served on the Board of Directors of the ASA and the National Computer Graphics Association, and is an Elected Member of the International Statistical Institute. He is the recipient of a Department of Commerce Medal and EPA Scientific and Technology Achievement Awards. His technical speciality is data confidentiality and statistical disclosure limitation, reflecting a broader interest in application of mathematical optimization methods to statistical problems. He has lectured and consulted in the United States and many foreign countries.


Lawrence H. Cox, Associate Director for the Office of Research and Methodology, NCHS, CDC

Title: Statistical Disclosure Limitation in Tabular Data and Related Mathematical and Computational Problems

I will discuss the three traditional SDL methods for tabular data--rounding, perturbing, and suppressing cell data--and a new imputation-based method called controlled tabular adjustment. Mathematical models will be presented and computational complexity and optimality issues discussed. The issue of limiting disclosure in tabular data while preserving distributional properties of data for analytical purposes will be addressed.
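Two of the traditional methods named above, rounding and suppression, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the mathematical models from the talk: the table, the rounding base of 5, and the suppression threshold of 3 are hypothetical, and controlled tabular adjustment is not attempted here.

```python
def round_to_base(count, base=5):
    """Conventional rounding of a cell count to the nearest multiple of base."""
    return base * ((count + base // 2) // base)


def suppress_small_cells(table, threshold=3):
    """Primary suppression: replace counts between 1 and threshold - 1 with None.

    Zero cells are left unchanged here; that choice is an assumption.
    """
    return [
        [None if 0 < cell < threshold else cell for cell in row]
        for row in table
    ]


# Hypothetical 2 x 3 table of counts (illustrative values only).
TABLE = [
    [12, 1, 7],
    [0, 4, 2],
]

rounded = [[round_to_base(c) for c in row] for row in TABLE]
suppressed = suppress_small_cells(TABLE)
```

Neither step is sufficient on its own. Rounding each cell independently can break additivity with published totals, and a single suppressed cell can be recovered by subtraction when row and column totals are released, which is why complementary suppression leads to the optimization problems the abstract refers to.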


Richard D. De Veaux, Williams College, and Rafe Donahue and Robert D. Small, GlaxoSmithKline

Title: Using data mining techniques to harvest information from clinical trials

Objective: Data from 692 depressed patients in two eight-week clinical trials were mined to determine early predictors of study dropout. Patients were randomized to one of three treatment groups after meeting entrance criteria and then treated for eight weeks. Clinical visits took place at baseline and days 7, 14, 21, 28, 42, and 56. Depression was measured via the investigator-rated Hamilton Rating Scale for Depression (HAM-D). Other clinical measures included the Hamilton Anxiety Scale (HAM-A), indicators of sexual dysfunction, and other adverse events.

Overall study dropout rate was 31%. Data available up to and including day 14 were mined in an attempt to determine early (within the first two weeks) predictors of eventual study dropout. Knowledge of such early warning signs could possibly improve patient retention and study quality.

A number of data mining techniques were applied to the data. The single greatest predictor of eventual dropout was the presence or absence of readings at day 14. Patient age also was relevant. We conclude that signs of study dropout may be evident very early in clinical trials. Every effort should be made to maintain enrollment of those patients who show early signs of eventual dropout.

Biosketch:
Dick De Veaux holds degrees in Civil Engineering (B.S.E., Princeton), Mathematics (A.B., Princeton), Dance Education (M.A., Stanford) and Statistics (Ph.D., Stanford). He has taught at the Wharton School, the Princeton University School of Engineering, and, since 1994, has been a professor of Statistics in the Math and Stat Department of Williams College. He has won numerous teaching awards including a "Lifetime Award for Dedication and Excellence in Teaching" from the Engineering Council at Princeton. He has won both the Wilcoxon and Shewell awards from the American Society for Quality and was elected fellow of the ASA in 1998. He was the Program Chair for the 2001 Joint Statistical Meetings in Atlanta.

Dick has been a consultant for over 20 years for such Fortune 500 companies as Hewlett-Packard, Alcoa, Bank One, GlaxoSmithKline, Dupont, Pillsbury, Rohm and Haas, Ernst and Young, and General Electric. He holds two U.S. patents and is the author of over 25 refereed journal articles. His hobbies include cycling, swimming, singing (barbershop, doo wop and classical -- he is the head of the Diminished Faculty, a local doo wop group) -- and dancing (he was once a professional dancer and teaches Modern Dance during Winter Study). He is the father of four children ages 8, 10, 12, and 14. He is the co-author, with Paul Velleman, of an introductory textbook titled "Intro Stats" published by Addison-Wesley in 2003.


Giovanni Di Crescenzo, Telcordia

Title: Cryptographic Techniques for Confidentiality of Aggregate Statistics on Health Data

In discussing relationships between cryptography and health care, we argue that health care is finally mature enough to benefit from cryptographic results. Moreover, we argue that cryptography has already produced secure systems that can be applied quickly to health care. We illustrate this state of affairs by showing that our previous result on privacy for stock market operations, published in Financial Cryptography 2001, after minor further analysis and design modifications, applies naturally to the following privacy problem in health care statistics: how to allow collection and statistical analysis of data from medical records while keeping such records private both from other record holders and from the data collector itself.
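To make the privacy problem concrete, here is a minimal sketch of one standard textbook technique for it, additive secret sharing of a sum. It is not the protocol from the Financial Cryptography 2001 paper. It assumes parties that follow the protocol and do not collude, and it simulates all parties in one process; the modulus, the counts, and the function names are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # assumed bound; must exceed the largest possible true sum


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the exchange: party i sends its j-th share to party j, and
    each party publishes only the sum of the shares it received."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


# Hypothetical per-hospital case counts (illustrative values only).
COUNTS = [17, 42, 8]
total = secure_sum(COUNTS)
```

Each individual share is uniformly random, so no single party, and no collector reading the published partial sums, learns another party's count. Only the aggregate is revealed.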


Tyrone W A Grandison, Senior Software Engineer, QUEST Group, IBM Almaden Research Center.

Title: Software Demonstration of the use of Hippocratic database technology in supporting a health care provider

This presentation will give a broad overview of the Hippocratic Database project, highlighting the founding tenets, describing the prototype, and showing how this technology can be used in a healthcare setting. The Hippocratic Database project is an initiative by IBM to develop a new comprehensive privacy management solution that supports automatic enforcement of privacy policies. Our architecture involves three main components. First, we allow a company to specify its privacy policy using a privacy language such as EPAL or P3P. Second, we allow users to define their specific preferences for information access and usage. The information collection module checks the company's privacy policy against the user preferences. Finally, we provide secure querying capabilities that enforce corporate privacy policies and users' preferences.
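The three components described above can be sketched roughly as follows. This is a toy model, not the IBM prototype: policies and preferences are reduced to hypothetical (data item, purpose) pairs rather than EPAL or P3P documents, and all names and records are invented for illustration.

```python
# Company policy: hypothetical (data_item, purpose) pairs the company wants to use.
COMPANY_POLICY = {
    ("diagnosis", "treatment"),
    ("diagnosis", "research"),
    ("address", "billing"),
}

# Hypothetical per-patient opt-ins, also as (data_item, purpose) pairs.
PATIENT_PREFERENCES = {
    "p1": {("diagnosis", "treatment"), ("diagnosis", "research"), ("address", "billing")},
    "p2": {("diagnosis", "treatment"), ("address", "billing")},
}

# Invented records (illustrative values only).
RECORDS = {
    "p1": {"diagnosis": "asthma", "address": "12 Elm St"},
    "p2": {"diagnosis": "diabetes", "address": "9 Oak Ave"},
}


def policy_conflicts(policy, preferences):
    """Collection-time check: policy rules the patient has not agreed to."""
    return policy - preferences


def query(items, purpose):
    """Return only cells allowed by both the policy and the patient's preferences."""
    result = {}
    for patient, record in RECORDS.items():
        allowed = COMPANY_POLICY & PATIENT_PREFERENCES[patient]
        result[patient] = {
            item: record[item] if (item, purpose) in allowed else None
            for item in items
        }
    return result


research_view = query(["diagnosis"], "research")
```

Here the research query sees the diagnosis for patient p1 but a null for p2, who did not opt in to research use. Enforcing the check at the cell level inside the query path, rather than in each application, is the design idea the abstract describes.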


Harry Guess, UNC

Title: "Context Setting" of Session II

A brief overview of the following topics: historical context of medical data privacy (Hippocratic Oath); growing body of worldwide laws and regulations on data privacy; clinical & epidemiologic research; differences in how regulations affect these; IRB approval in the post-HIPAA world; and regulatory options for de-identification of health care data.


Oliver Johnson, Merck & Co., Inc.

Title: Legal and Regulatory Framework in the United States and the European Union

This presentation will provide an overview of the primary legal and regulatory privacy regimes impacting human-subject biomedical research. It will begin with a short discussion of the historical basis for privacy regulation in the U.S. and Europe, offer a comparison of these approaches, and end with a more detailed discussion of how the Health Insurance Portability and Accountability Act (HIPAA) impacts U.S. human-subject research. Special focus will be given to identifying key challenges to, and practical strategies for, conducting records-based research under HIPAA.

Biosketch:
Oliver is Chief Privacy Officer of Merck & Co., Inc., a role to which he was appointed by Merck's Management Committee in March 2001. Before his appointment, Oliver, a lawyer, represented Merck's manufacturing, research, intellectual property licensing and European businesses. He also spent several years negotiating business deals as a member of Merck's Corporate Licensing group. In addition, from 1990 to 1998 Oliver served the Commonwealth of Pennsylvania under gubernatorial appointments to Pennsylvania's Real Estate Commission and Board of Medicine, both state regulatory bodies.

Before joining Merck in 1992, Oliver practiced law with a Philadelphia law firm. He received his B.A. from Williams College and his J.D. from the Georgetown University Law Center.


Jay J. Kim, ORM, NCHS, CDC

Title: Overview of Masking Schemes for Microdata

The U.S. Department of Health and Human Services (DHHS) has issued new national health information privacy standards. This is in response to the mandate of the Health Insurance Portability and Accountability Act (HIPAA) of 1996. The new standards provide protection for the privacy of certain individually identifiable health data.

This talk reviews the existing procedures for masking discrete and continuous variables in microdata files. The masking procedures for discrete variables cover both dichotomous and polychotomous variables. Those for continuous variables include (1) additive noise, (2) multiplicative noise, (3) rounding, (4) microaggregation, (5) interval data, (6) data swapping, and (7) suppression and generalization. The statistical properties of the masked data and the recoverability of the original data will be discussed. None of these procedures is restricted to health data; all can be used for any type of data.
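Four of the continuous-variable procedures listed above can be sketched briefly. These are simplified textbook versions, not necessarily the variants reviewed in the talk: the noise distributions, parameters, and income values are hypothetical.

```python
import random


def additive_noise(values, sd, rng):
    """Add independent zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0.0, sd) for v in values]


def multiplicative_noise(values, spread, rng):
    """Multiply each value by a factor drawn uniformly from 1 +/- spread."""
    return [v * rng.uniform(1.0 - spread, 1.0 + spread) for v in values]


def round_to(values, base):
    """Round each value to the nearest multiple of base."""
    return [base * round(v / base) for v in values]


def microaggregate(values, k):
    """Sort the values, group them k at a time, and replace each value by its
    group mean. The last group may be smaller than k in this simplified version."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


# Hypothetical incomes (illustrative values only).
INCOMES = [31000, 58000, 29500, 120000, 61000, 33000]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

noisy = additive_noise(INCOMES, sd=2000.0, rng=rng)
scaled = multiplicative_noise(INCOMES, spread=0.05, rng=rng)
rounded = round_to(INCOMES, base=5000)
aggregated = microaggregate(INCOMES, k=3)
```

Microaggregation preserves the overall total (and hence the mean) exactly, while the noise methods preserve it only in expectation. That difference is one example of the statistical properties of masked data mentioned above.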

Biosketch:
Jay has been with the National Center for Health Statistics, CDC, as a mathematical statistician since May 2003. Before then he was with the Bureau of the Census for over 20 years. He also taught statistics and law at universities such as the George Washington University and Temple University. His specialty areas are confidentiality for microdata, sample design and sample estimation.


David Madigan, Department of Statistics, Rutgers University

Title: Data Mining Tutorial

Data Mining is a dynamic and fast-growing field at the interface of Statistics and Computer Science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet commerce. The analysis of data on this scale presents exciting new computational and statistical challenges. This tutorial will provide an overview of current research in data mining with detailed descriptions of a couple of specific algorithms.


Benny Pinkas, HP Labs

Title: Private Analysis of Data Sets

Consider a scenario of two or more parties holding large private data sets, whose goal is to perform some simple analysis of the data while preserving privacy. In other words, given data sets X and Y, the parties' goal is to compute F(X,Y), for some function F, while hiding any other information about X and Y. It is well known that generic constructions can perform this secure computation with polynomial overhead for any polynomial-time F(), but our goal is to design privacy-preserving constructions with linear or sublinear overhead that can be applied to very large data sets. We describe such constructions, secure against both semi-honest and malicious adversaries, for two types of functions: (1) computing the intersection of two sets, and (2) computing the k-th ranked item (e.g. the median) of the union of the sets.
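For reference, the two target functions F can be written out in the clear. This sketch contains no cryptography and is not the construction from the talk: it only specifies the outputs that a secure protocol would reveal, with nothing else about X and Y disclosed. Treating the union as a multiset, the 1-indexed rank, and the sample data are assumptions made for illustration.

```python
def intersection(x, y):
    """F1: the elements that appear in both private sets."""
    return set(x) & set(y)


def k_ranked(x, y, k):
    """F2: the k-th smallest element (1-indexed) of the combined data.

    Assumption: the union is treated as a multiset, so a value held by
    both parties is counted twice.
    """
    merged = sorted(list(x) + list(y))
    if not 1 <= k <= len(merged):
        raise ValueError("k out of range")
    return merged[k - 1]


# Hypothetical private inputs of the two parties (illustrative values only).
X = [3, 8, 15, 21]
Y = [5, 8, 30]

common = intersection(X, Y)
median = k_ranked(X, Y, (len(X) + len(Y) + 1) // 2)
```

Computing these outputs directly costs time roughly linear in the data size (plus sorting), which is why a generic secure computation with large polynomial overhead is unattractive for very large data sets.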


Tomas Sander, Hewlett Packard Laboratories

Title: Privacy Technologies and Challenges in their Deployment

The research community has developed a variety of privacy-enhancing technologies over the last two decades. Unfortunately, only very few of these technologies have been successfully deployed. In this talk we will review several of these technological approaches and see what they accomplish. We analyze difficulties in deploying them and draw some lessons.


Gary Smith, School of Veterinary Medicine, University of Pennsylvania.

Title: Privacy/Confidentiality Issues when Collecting Agricultural Data.

Modern spatial models of infectious disease epidemics in domestic animals are becoming increasingly influential in informing policy decisions about disease control. Such models depend upon having accurate information concerning the location of farms, what species of animals are raised on each farm, and how many of each species are present. The exemplars for this kind of modeling are the foot and mouth disease models that were so influential during the 2001 foot and mouth disease epidemic in Britain. It seems unlikely that we shall ever be able to apply similar models in the United States. The reason for this is that farmers are often very reluctant to provide information that may eventually find its way into the hands of local, state or federal government and thus be rendered accessible to the public at large.


Document last modified on December 11, 2003.