Balancing Data Confidentiality and Data Quality: A two-day tutorial sponsored by DIMACS and DyDAn

November 8 - 9, 2007
DIMACS Center, CoRE Building, Rutgers University

Larry Cox, CDC, ljtcox at
Presented under the auspices of the Special Focus on Computational and Mathematical Epidemiology, the Special Focus on Communication Security and Information Privacy, and the Center for Dynamic Data Analysis (DyDAn).

Tutorial Objectives

Statistical summary data such as tabulations are built from data pertaining to individual entities (persons, households, businesses, organizations or groups). Statistical microdata are unit-record data containing multiple item responses pertaining to individual entities. Statistical data base query systems, once only a possibility, are becoming a reality. The need for data products that combine information across data bases and organizations is increasing and such data products arise in applications ranging over homeland security, health care, financial transactions, etc. Typically the data from which these statistical data products are built is reported at the individual entity level and is confidential.

Ethical survey practice demands that confidential data pertaining to individual persons or entities not be revealed through released data products. Ethical concerns are often reinforced by legislation or regulation, such as the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Confidentiality concerns have been addressed by researchers and government statisticians over several decades, resulting in a suite of increasingly sophisticated and effective methods for statistical disclosure limitation (SDL), several of which have been implemented in software and incorporated in the survey practices of government statistical agencies in the U.S. and abroad. Until very recently, however, the effects of disclosure limitation methods on data quality, completeness and usability have been largely ignored. The interplay between data confidentiality and data quality is a central subject of this tutorial.

This tutorial has three objectives: (1) to familiarize the student with statistical disclosure limitation and SDL methods; (2) to examine potential effects of SDL methods on data completeness, quality and usability; and (3) to present SDL methods that, in addition to protecting confidentiality effectively, limit the loss or deterioration of usability, quality and completeness in the released data product(s). Practical data quality questions include: What effect does the SDL method have on key statistics? What effect does the SDL method have on the distribution of the original data? How easy are disclosure-limited data to analyze compared to original data? Is analysis based on disclosure-limited data an acceptable substitute for analysis based on original data?
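
The first of these questions can be made concrete with a small numeric sketch. The example below is an illustration only, not material from the tutorial: it assumes top-coding (replacing values above a cap with the cap) as the SDL method, and the income values, the cap of 75, and the function names `top_code` and `compare_key_statistics` are all hypothetical.

```python
from statistics import mean, median


def top_code(values, cap):
    """Replace every value above `cap` with `cap` (a simple top-coding rule)."""
    return [min(v, cap) for v in values]


def compare_key_statistics(original, protected):
    """Return {statistic: (original value, protected value)} for a few statistics."""
    stats = {"mean": mean, "median": median, "max": max}
    return {name: (f(original), f(protected)) for name, f in stats.items()}


# Hypothetical annual incomes in thousands of dollars (made-up illustration data).
incomes = [21, 25, 30, 32, 38, 41, 47, 55, 90, 250]

# The cap of 75 is an arbitrary assumption chosen for the example.
protected = top_code(incomes, cap=75)

report = compare_key_statistics(incomes, protected)
for name, (before, after) in report.items():
    print(f"{name}: original={before}, protected={after}")
```

With these made-up values the median (39.5) is unchanged, while the mean falls from 62.9 to 43.9 and the maximum from 250 to 75, so whether the protected data are an acceptable substitute depends on which statistic the analyst needs.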

This tutorial will cover the following topics: reasons for confidentiality protection; legal and regulatory requirements, including CIPSEA and HIPAA; legal and administrative solutions for restricting unauthorized access to confidential data; survey methods for restricting released data and for quantifying and limiting disclosure in tabulations, microdata and public use statistical data base query systems; using research data centers and controlled remote access to increase authorized access to confidential data; and balancing the confidentiality protection provided by SDL methods with their effects on the usability, quality and completeness of released data products. Emphasis will be placed on recognizing disclosure and evaluating the effectiveness of disclosure limitation strategies and their effects on data quality by means of lecture, discussion and simple numeric examples. Classroom notes, mathematical preliminaries, URLs, and references on disclosure limitation will be provided. The tutorial is organized around types of data release (tabulations, microdata, data base query systems), but much of the material, particularly from the first day, is of general relevance.
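
As a rough illustration of what recognizing disclosure in a tabulation can look like, the sketch below suppresses small cells in a table of counts and then checks whether a suppressed cell can be recovered from published row totals. It is not taken from the tutorial materials: the minimum-count threshold of 3, the table values, and the function names are assumptions made for the example.

```python
def sensitive_cells(table, threshold=3):
    """Return (row, col) positions whose count is positive but below `threshold`."""
    return [(i, j) for i, row in enumerate(table)
            for j, count in enumerate(row) if 0 < count < threshold]


def suppress(table, cells):
    """Return a copy of `table` with the given cells replaced by None."""
    hidden = set(cells)
    return [[None if (i, j) in hidden else c for j, c in enumerate(row)]
            for i, row in enumerate(table)]


def recover_from_row_totals(released, row_totals):
    """Recover any suppressed cell that is the only missing entry in its row."""
    recovered = {}
    for i, row in enumerate(released):
        missing = [j for j, c in enumerate(row) if c is None]
        if len(missing) == 1:
            known = sum(c for c in row if c is not None)
            recovered[(i, missing[0])] = row_totals[i] - known
    return recovered


# Hypothetical 2 x 3 table of counts (made-up illustration data).
counts = [[12, 2, 9],
          [15, 8, 11]]
row_totals = [sum(r) for r in counts]  # assumed to be published with the table

primary = sensitive_cells(counts)      # cells failing the assumed threshold rule
released = suppress(counts, primary)
print(recover_from_row_totals(released, row_totals))  # prints {(0, 1): 2}
```

Here the single suppressed cell is recovered exactly by subtraction from its row total, which suggests why suppressing only the sensitive cells is generally not enough when marginal totals are also released.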

The Instructor

Larry Cox has extensive experience in statistical disclosure and the development and implementation of disclosure limitation methods. He has published numerous papers, delivered many lectures, organized conferences and meetings, and taught several courses in the U.S. and abroad on privacy, confidentiality and statistical disclosure limitation. His research on SDL methods has led to adoption and automation of several of these methods by international statistical organizations for large-scale use. His recent research on quality-preserving SDL methods is at the forefront of this topic. Dr. Cox's professional experience includes consulting, teaching, and research. He has served as senior research statistician for three government agencies and as Director of the Board on Mathematical Sciences, National Academy of Sciences. He is an elected member of the International Statistical Institute and a Fellow of the American Statistical Association. He has served as Chair of the ASA Committee on Privacy and Confidentiality and on the ASA Board of Directors and the ISI Council.

Document last modified on September 4, 2007.