DIMACS/Northeast Big Data Hub Workshop on Privacy and Security for Big Data

April 24 - 25, 2017
DIMACS Center, CoRE Building, Rutgers University

Organizing Committee:
René Bastón, Columbia University
Joseph Lorenzo Hall, The Center for Democracy and Technology
Adam Smith, Pennsylvania State University
Sean Smith, Dartmouth College
Rebecca Wright, Rutgers University
Moti Yung, Snapchat
Workshop email: ps-for-bd-organizers at dimacs.rutgers.edu
Presented under the auspices of the DIMACS Big Data Initiative on Privacy and Security and the DIMACS Special Focus on Cybersecurity, in collaboration with the Northeast Big Data Innovation Hub.


Ryan Baker, University of Pennsylvania

Title: Some Thoughts on Privacy and Security for Educational Data

As in many areas of big data, education has become an area of focus in academia, government, and industry. Privacy and security concerns have emerged in each of these sectors; however, unlike in other sectors, there has so far been relatively more concern than actual cases of misuse of data. In this talk I will discuss some of the developing trends in each of these sectors and ways that research may be able to support the goal of better educational outcomes while protecting students from misuse of their data.

Alexandra Boldyreva, Georgia Institute of Technology

Title: Keyless Fuzzy Search on Masked Data

We seek to enhance the privacy of remotely stored data created in decentralized environments where key management is problematic, for example in IoT applications. While strong data security is impossible in such a keyless setting, we propose to mask the data so that accessing it is costly for the attacker; the defense would therefore hamper mass harvesting or mining of the data. Towards this goal, we propose a notion of keyless fuzzy search (KlFS), where a user can query an encrypted database, retrieve its parts, and decrypt the content only if it possesses some data "close to" the encrypted data. In particular, an attacker who wants to retrieve the information has to come up with the right queries, which is costly if the data has some entropy. As a result, easy mass harvesting or data mining of the encrypted data becomes prohibitive. And without such queries the data is protected.

We propose a formal security model that asks a scheme not to leak any information about the data and the queries except for some well-defined leakage function (as no efficient search functionality is possible if no information is leaked), as long as the attacker cannot guess the right query to make. We propose two general KlFS schemes: both use locality-sensitive hashes (LSH), cryptographic hashes and symmetric encryption as building blocks, and the second scheme also employs reusable fuzzy extractors. We then propose a KlFS scheme specifically suited for image search. For all schemes we prove correctness and security, and state and discuss the required assumptions and leakage functions. To demonstrate the feasibility of our KlFS for images, we implemented and evaluated a prototype system that supports image search by object similarity on an encrypted database.
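As a rough illustration of the bucketing idea behind such schemes (not the authors' actual construction), the following sketch stores a payload under a key derived from a locality-sensitive hash of its feature vector, so that only a query landing in the same LSH bucket recovers it. The feature vectors, hyperplane LSH, and toy SHA-256 keystream are all hypothetical stand-ins, not secure primitives:

```python
import hashlib
import random

def lsh_signature(vec, planes):
    # Random-hyperplane LSH: the sign pattern of projections onto each plane.
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def derive_key(sig):
    # Key the ciphertext to the LSH bucket, not to a user-held secret.
    return hashlib.sha256(repr(sig).encode()).digest()

def xor_stream(key, data):
    # Toy keystream from SHA-256 counters (illustrative only, not secure).
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

random.seed(1)
planes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]

record = [0.9, -0.2, 0.4, 0.1]            # stored feature vector (hypothetical)
payload = b"patient record #17"
ct = xor_stream(derive_key(lsh_signature(record, planes)), payload)

# A positively scaled copy of the record projects to the same signs, so it
# provably lands in the same bucket and derives the same key for this demo.
query = [1.05 * x for x in record]
pt = xor_stream(derive_key(lsh_signature(query, planes)), ct)
print(pt)                                  # b'patient record #17'
```

A real KlFS instantiation would use multiple LSH tables to tolerate near-misses and a proper authenticated cipher; the point here is only that an attacker without "close" data cannot name the right bucket.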

David Cash, Rutgers University

Title: Side-Channel Attacks on Shared Search Indexes

Many popular cloud services provide search interfaces for users to navigate their private data quickly. These interfaces typically find ranked results in response to keyword queries. On the backend, the standard approach to fast searching for large numbers of users employs “shared search indexes”, which pay little per-user overhead.

In this work we show that shared indexes fail to sufficiently isolate users. We exhibit a novel side channel in relevance scores and develop techniques showing that an attacker can exploit it in practice. The attack is effective against multiple cloud service providers, including a major service with millions of users. Finally, we discuss countermeasures.
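The attack targets real services' scoring functions; as background on why shared indexes leak at all, this hypothetical TF-IDF sketch shows how a corpus-wide statistic (document frequency) lets an attacker who uploads a probe document learn whether any other user's document contains a term:

```python
import math

def idf(num_docs, doc_freq):
    # Corpus-wide IDF in a shared index depends on ALL users' documents.
    return math.log((num_docs + 1) / (doc_freq + 1))

# Other users' private documents in the shared index (hypothetical).
others = ["alice salary 90000", "bob meeting notes"]
probe = "salary probe"                     # attacker uploads a matching document
corpus = others + [probe]

term = "salary"
df = sum(term in doc.split() for doc in corpus)
observed = idf(len(corpus), df)
# Score the attacker would expect if only his own probe matched:
expected_if_alone = idf(len(corpus), 1)

# A lower observed IDF reveals that some other user's document contains the term.
print(observed < expected_if_alone)        # True
```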

Gabriela Ciocarlie, SRI International

Title: Security, Reliability and Accountability of Large Infrastructure Systems

In this talk, I will focus on the security, reliability and accountability of large infrastructure systems, combining security by design and data analysis for behavior modeling. Specialized environments, such as Industrial Control Systems (ICS) and radio networks for mobile broadband, cannot use most commercial solutions, as these rely only on pre-defined signatures, do not handle zero-day attacks, and fail to recognize degradation and outages. For ICS, I will present a content-based analysis that characterizes normal command and data sequences applied at the network level, and introduce mechanisms for achieving a low false positive rate. For mobile broadband networks, anomaly detection can identify partial and complete degradations in cell-service performance, modeling cell and network behavior based on key performance indicators. Finally, for cloud computing environments, I will introduce the concept of Accountable Clouds, an approach that collects the provenance of data and commits it to long-term storage for subsequent auditing, aiming to make cloud computation accountable without sacrificing data owners' privacy.

Jane Greenberg, Drexel

Title: Beyond Open Access: Addressing Data Sharing Agreements and Challenges

Data sharing between different stakeholders in industry and research/academia often involves excessive use restrictions due to legal documentation, policies, and privacy concerns. Well-intentioned data sharing plans frequently fail due to prohibitive investment requirements and protracted negotiations. We are addressing these challenges through "A Licensing Model and Ecosystem for Data Sharing," an NSF Northeast Big Data Innovation Hub (NEBDIH) Spoke initiative. In this presentation, I will report on early-stage work 1) creating a licensing model for sharing closed and not-necessarily-free data, 2) developing a prototype software platform that enforces data sharing conditions and restrictions, and 3) generating relevant metadata documenting licenses and ensuring data access and interpretation. I will also report on outcomes of the Drexel workshop, "Enabling Seamless Data Sharing in Industry and Academia."

Metadata Research Center, College of Computing and Informatics, Drexel University:
A Licensing Model and Ecosystem for Data Sharing: http://cci.drexel.edu/mrc/projects/a-licensing-model-and-ecosystem-for-data-sharing
Northeast Big Data Innovation Hub: http://nebigdatahub.org/

Trinabh Gupta, University of Texas at Austin

Title: Scalable and Private Media Consumption with Popcorn

The convenience of on-demand video delivery services (Netflix, YouTube, etc.) comes at a high risk to user privacy. Indeed, there have been incidents (accidental disclosures, etc.) where users' private information (knowledge of media diet, movie ratings, etc.) has leaked. With a vision of reducing these risks, I will describe Popcorn, a Netflix-like media delivery system that provably hides (even from the content distributor) which movie a user is watching, is otherwise consistent with the prevailing commercial regime (copyrights, etc.), and achieves plausibly deployable performance (the per-request dollar cost is 3.87 times that of a non-private system). Popcorn leverages the properties of media streaming to tailor private information retrieval (PIR) cryptographic protocols to the media domain.
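Popcorn's actual protocols are tailored to streaming workloads; as background, here is the classic two-server information-theoretic PIR trick that such systems build on, with a hypothetical four-movie library. Each server XORs together the records in a requested subset, and the two subsets differ only in the desired index, so neither server alone learns which movie was requested:

```python
import secrets

def server_answer(library, subset):
    # Each server XORs together the movies whose indices are in the subset.
    ans = bytes(len(library[0]))
    for idx in subset:
        ans = bytes(a ^ b for a, b in zip(ans, library[idx]))
    return ans

# Hypothetical library of equal-length records.
library = [b"movie-A!", b"movie-B!", b"movie-C!", b"movie-D!"]
want = 2                                   # client privately wants library[2]

# Client: random subset for server 1; the same subset with `want` toggled
# for server 2. Each subset alone is uniformly random, hence reveals nothing.
subset1 = {i for i in range(len(library)) if secrets.randbits(1)}
subset2 = subset1 ^ {want}

a1 = server_answer(library, subset1)
a2 = server_answer(library, subset2)
recovered = bytes(x ^ y for x, y in zip(a1, a2))
print(recovered)                           # b'movie-C!'
```

Every record other than `want` appears in both subsets (or neither) and cancels under XOR, which is why the combined answers yield exactly the requested movie.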

Shai Halevi, IBM Research

Title: Privacy-Preserving Search of Similar Patients in Genomic Data

The growing availability of genomic data holds great promise for advancing medicine and research, but unlocking its full potential requires adequate methods for protecting the privacy of individuals whose genome data we use. One example of this tension is running a Similar Patient Query on remote genomic data: In this setting, a doctor who holds the genome of his/her patient may try to find other individuals with "close" genomic data (in edit distance), and use the data of these individuals to help diagnose and find effective treatment for that patient's conditions. This is clearly a desirable mode of operation; however, the privacy exposure implications are considerable, so we would like to carry out the above "closeness" computation in a privacy-preserving manner.

Secure-computation techniques offer a way out of this dilemma, but the high cost of computing edit distance privately poses a great challenge. Wang et al. [ACM-CCS'15] recently proposed an efficient solution for situations where the genome sequences are so close that the edit distance between two genomes can be well approximated just by looking at the indexes in which they differ from the reference genome. However, this solution does not extend well to cases with high divergence among individual genomes, and different techniques are needed there.

In this work we put forward a new approach for highly efficient edit-distance approximation that works well even in settings with much higher divergence. We present contributions both in the design of the approximation method itself and in the protocol for computing it privately. Our tests indicate that our approximation method works well even in regions of the genome where the distance between individuals is 5% or more, with many insertions and deletions (compared to 99.5% similarity with mostly substitutions, as considered by Wang et al.). As for speed, our protocol implementation takes just a few seconds to run on databases with thousands of records, each of length thousands of alleles, and it scales almost linearly with both the database size and the length of the sequences in it. As an example, on the datasets of the recent iDASH competition, it takes less than two seconds to find the five nearest records to the query in a dataset of 500 sequences, each of length 3500. This is 2-3 orders of magnitude faster than straightforward approaches.
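For reference, the quantity being approximated is standard (Levenshtein) edit distance; a minimal dynamic-programming implementation, run here on hypothetical genome fragments:

```python
def edit_distance(a, b):
    # Standard Levenshtein DP: the quantity the protocol approximates privately.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

# Two substitutions apart (positions 3 and 6), so distance 2.
print(edit_distance("GATTACA", "GACTATA"))  # 2
```

The quadratic cost of this DP, multiplied by the overhead of evaluating it inside a secure-computation protocol, is precisely why efficient approximation matters at genome scale.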

(Joint work with Gilad Asharov, Yehuda Lindell, and Tal Rabin.)

Vladimir Kolesnikov, Nokia Bell Labs

Title: Enabling Data Sharing with Secure Computation

Today, information is one of the most precious assets. It is particularly valuable when it is structured, searchable, and shared. At the same time, the collection, storage, correlation, and sharing of information, especially across ownership domains, may break business models, pose significant privacy, security, and compliance risks, and even be illegal.

Secure multi-party computation (MPC) reconciles the fundamental conflict between data utility and privacy. Often described as "computation under encryption", MPC allows evaluating arbitrary functions on private inputs, while guaranteeing that each party learns nothing beyond its intended output.

I will sketch the basic Garbled Circuit approach to MPC and briefly discuss several recent practical improvements: Publicly Verifiable Covert (PVC) secure computation, and Free Hash. With PVC, we achieve the guarantee that while the adversary may cheat and win with constant probability (say, p = 1/2), he is also caught with constant probability. If caught, an irrefutable publicly verifiable proof of cheating is obtained, a strong cheating deterrent. In Free Hash, we design a new way to hash Garbled Circuits (GC) simply by XORing all gate table entries. Clearly, the adversary can find a GC G' hashing to the same value. We show that w.h.p. such G' will be invalid (fail GC evaluation). When integrated into GC MPC, Free Hash allows significant computational savings.
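The Free Hash accumulation step can be sketched as follows, with random byte strings standing in for real garbled gate tables; the actual security argument (that a colliding G' fails GC evaluation with high probability) depends on garbled-circuit structure not modeled here:

```python
import os

def free_hash(garbled_circuit):
    # XOR all gate-table entries together; entries are equal-length byte strings.
    acc = bytes(len(garbled_circuit[0][0]))
    for gate_table in garbled_circuit:
        for entry in gate_table:
            acc = bytes(a ^ b for a, b in zip(acc, entry))
    return acc

# Stand-in garbled circuit: 3 gates, each with 4 random 16-byte table entries.
gc = [[os.urandom(16) for _ in range(4)] for _ in range(3)]
h = free_hash(gc)

# Flipping any bit of any entry changes the hash.
tampered = [row[:] for row in gc]
tampered[1][2] = bytes([tampered[1][2][0] ^ 1]) + tampered[1][2][1:]
print(free_hash(tampered) != h)            # True
```

Because the hash is a single pass of XORs the garbler would compute anyway, it adds essentially no cost, which is the source of the computational savings the talk mentions.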

This talk is based on joint works with Alex Malozemoff ("PVC", Asiacrypt 2015), and with Leo Fan and Chaya Ganesh ("Free Hash", Eurocrypt 2017).

Andrei Lapets, Boston University

Title: Design and Deployment of Usable Web-based MPC for Data Aggregation: Lessons and Trade-offs

Secure multi-party computation (MPC) is a cryptographic primitive that can have substantial social value by allowing companies, government agencies, and other organizations to benefit from collective data analysis when the raw data are encumbered by legal and corporate policy restrictions on data sharing. However, these benefits cannot be realized unless we design MPC systems with security properties that can be comprehended by decision-makers and with implementations that can be utilized by resource-constrained non-expert end-users. We describe our own experiences designing, implementing, and deploying in a real-world scenario a usable web application for secure data aggregation. The application allows groups of cooperating parties with minimal expertise and no specialized software or hardware to compute aggregations across their collective data sets without holding any individual party responsible for housing the data securely, and also without revealing the contributions of individual participants. The application was deployed twice (in 2015 and 2016) to collect payroll data across genders and demographics in support of a study of wage disparities undertaken by the City of Boston and the Boston Women's Workforce Council. The usability of the application drove participation by 40-70 of the largest employer organizations in the Greater Boston Area. Our experience provides concrete examples of (and further insights into) ways to negotiate the trade-offs and relationships between usability, security, participation, and correctness that future MPC deployments will encounter as the technology transitions into practice.
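The deployed system's internals are not shown here, but additive secret sharing, a standard building block for web-based aggregation of this kind, conveys the idea; the payroll figures and two-aggregator split below are hypothetical:

```python
import random

MODULUS = 2**32

def share(value, num_parties):
    # Additive secret sharing: random shares summing to `value` mod MODULUS.
    shares = [random.randrange(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Each employer splits its payroll total into two shares (hypothetical figures).
payrolls = [1_200_000, 950_000, 2_300_000]
all_shares = [share(p, 2) for p in payrolls]

# Each aggregator sums one share per employer; a single share is uniformly
# random, so neither aggregator learns any individual employer's input.
partial1 = sum(s[0] for s in all_shares) % MODULUS
partial2 = sum(s[1] for s in all_shares) % MODULUS
total = (partial1 + partial2) % MODULUS
print(total)                               # 4450000
```

Only the combined partial sums reveal the aggregate, which matches the deployment goal of computing wage statistics without any party housing the raw data.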

Janne Lindqvist, Rutgers University

Title: Transforming Speed Sequences into Road Rays on the Map with Elastic Pathing

Advances in technology have provided ways to monitor and measure driving behavior. Recently, this technology has been applied to usage-based automotive insurance policies that offer reduced insurance premiums to policy holders who opt-in to automotive monitoring. Several companies claim to measure only speed data, which they further claim preserves privacy. However, we have developed an algorithm - elastic pathing - that successfully tracks drivers' locations from speed data. The algorithm tracks drivers by assuming a start position, such as the driver's home address (which is typically known to insurance companies), and then estimates the possible routes by fitting the speed data to map data. To demonstrate the algorithm's real-world applicability, we evaluated its performance with driving datasets from central New Jersey and Seattle, Washington, representing suburban and urban areas. We are able to estimate destinations with error within 250 meters for 17% of the traces and within 500 meters for 24% of the traces in the New Jersey dataset, and with error within 250 and 500 meters for 15.5% and 27.5% of the traces, respectively, in the Seattle dataset. Our work shows that these insurance schemes enable a substantial breach of privacy. More details including source code and datasets are available at http://elasticpathing.org/
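A toy version of the core matching step (not the released implementation): integrate speed over time to get distances between stops, then search a road graph from the assumed start for routes whose segment lengths fit those distances. The road graph and leg lengths below are hypothetical:

```python
# Toy road graph: intersections and segment lengths in meters (hypothetical).
roads = {
    "home": [("A", 120), ("B", 300)],
    "A":    [("C", 250), ("D", 400)],
    "B":    [("D", 150)],
    "C":    [],
    "D":    [],
}

def match_paths(node, leg_lengths, tol=20):
    # Depth-first search for routes whose segment lengths fit the
    # speed-derived distances between stops, within a tolerance.
    if not leg_lengths:
        return [[node]]
    paths = []
    for nxt, length in roads.get(node, []):
        if abs(length - leg_lengths[0]) <= tol:
            for rest in match_paths(nxt, leg_lengths[1:], tol):
                paths.append([node] + rest)
    return paths

# Distances between full stops, integrated from the speed trace (hypothetical).
legs = [118, 255]
print(match_paths("home", legs))           # [['home', 'A', 'C']]
```

Even this toy search prunes to a single candidate route, which is the intuition behind why speed data alone, anchored at a known start, can reveal destinations.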

Ananth Raghunathan, Google

Title: Utilizing Large-Scale Randomized Response at Google: RAPPOR and its lessons

At Google, we are constantly trying to improve the techniques we use to protect our users' security and privacy. Over the past three years, we have enabled Chrome to learn statistics about the behavior of users in a manner that protects client privacy in novel ways. We have collected billions of reports through RAPPOR, a state-of-the-art randomized response algorithm developed here that guarantees differential privacy, widely accepted as the strongest form of privacy.

In this talk, we walk through the techniques used in RAPPOR and describe stories, challenges, and lessons learned in its deployment across Chrome. We briefly touch upon new directions of research at Google that protect user privacy in many different settings. We also give principled ideas and insights into designing privacy-preserving data collection systems at scale.
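RAPPOR elaborates on classical randomized response (adding Bloom-filter encoding and both permanent and instantaneous randomization); its one-bit core, run here on a hypothetical population where 30% of users have the property, looks like this:

```python
import random

def randomized_response(true_bit, p=0.75):
    # With probability p report truthfully, else report the opposite bit.
    # Any single report is deniable; only aggregates carry signal.
    return true_bit if random.random() < p else 1 - true_bit

def estimate_rate(reports, p=0.75):
    # Invert the known noise: E[report] = (2p - 1) * r + (1 - p) for true rate r.
    mean = sum(reports) / len(reports)
    return (mean - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 300 + [0] * 700          # hypothetical 30% population rate
reports = [randomized_response(b) for b in true_bits]
print(round(estimate_rate(reports), 2))    # close to 0.30
```

The aggregate estimate converges to the true rate as the number of reports grows, while each individual report remains plausibly deniable, which is what makes billions of Chrome reports collectible under a differential-privacy guarantee.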

Anand Sarwate, Rutgers University

Title: Privacy Protections as an Incentive for Collaborative Research on Human Health

There are complex legal and ethical issues around human health data, particularly data from research studies. At the same time, there is an increasing push towards making data more available; publicly-funded research results should be shared to improve reproducibility. A middle ground is to share access to the data. This can take the form of research consortia around a particular condition such as Alzheimer's or autism. Researchers within the consortium can perform joint analyses on their data, allowing for a much larger sample size. I will describe a platform currently under development for collaborative neuroimaging research and some of the algorithmic and privacy challenges that arise.

Sean Smith, Dartmouth

Title: User Circumvention of Cybersecurity: A Cross-Disciplinary Approach

Security practitioners often design, build, and maintain systems with a mental model of how users behave. Unfortunately, this model often conflicts with what users actually do. This talk discusses the costly security problems that result, and presents my team's work — drawing on computer security, AI, and sociology — on identifying, understanding, predicting, and addressing human-centric security problems.

Qiang Tang, New Jersey Institute of Technology

Title: Cliptography: Post-Snowden Cryptography

Despite the laudatory history of development of modern cryptography, applying cryptographic tools to reliably provide security and privacy in practice is notoriously difficult. This talk will focus on a fundamental practical challenge: guaranteeing security and privacy without explicit trust in the algorithms and implementations that underlie basic security infrastructure. While the dangers of entertaining adversarial implementation of cryptographic primitives seem obvious, the ramifications of such attacks are surprisingly dire: it turns out that, in wide generality, adversarial implementations of cryptographic (randomized) algorithms may leak private information while producing output that is statistically indistinguishable from that of a faithful implementation. The Snowden revelations have shown us how privacy can be lost at a very large scale even when traditional cryptography seems to be used to protect Internet communication.
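A minimal sketch of such a subliminal channel (hypothetical, using rejection sampling rather than any construction from the talk): a subverted nonce generator emits output that looks uniform to everyone except the subverter, who holds an embedded key and can recover a leaked secret bit from each nonce:

```python
import hashlib
import os

ATTACKER_KEY = b"big-brother-key"          # embedded by the subverter (hypothetical)
SECRET = 1                                 # one bit of user secret to exfiltrate

def subverted_nonce(bit):
    # Rejection-sample nonces until a keyed hash of the nonce encodes `bit`.
    # Without ATTACKER_KEY, the output distribution is indistinguishable
    # from the honest generator's uniform nonces.
    while True:
        n = os.urandom(16)
        if hashlib.sha256(ATTACKER_KEY + n).digest()[0] & 1 == bit:
            return n

def attacker_decode(nonce):
    return hashlib.sha256(ATTACKER_KEY + nonce).digest()[0] & 1

n = subverted_nonce(SECRET)
print(attacker_decode(n))                  # 1: the leaked secret bit
```

Rejection sampling costs the subverter only two expected tries per bit, yet no black-box testing of the outputs can detect the leak, which is what makes subverted randomized algorithms so dangerous.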

I will first briefly explain the intuition behind the above-mentioned "subliminal channel" attacks. Inspired by folklore practical wisdom, I will also introduce several simple but rigorous strategies for generically immunizing subverted randomized algorithms, which serve as key building blocks for most cryptographic schemes. Our new design principles may suggest new standardization designs that help reduce the threats of subverted implementations. We also hope our work stimulates a community-wide effort to tackle this fundamental challenge.

Based on joint work with Alex Russell, Moti Yung, and Hong-Sheng Zhou.

Desheng Zhang, Rutgers University

Title: Data-Driven Cyber-Physical Systems for Smart Cities: Addressing Mobility Challenges by Urban Systems with Urban Data

For the first time ever, more people live in urban areas than rural areas. Given this inevitable urbanization, the research in our group aims to address sustainability challenges related to urban mobility (e.g., energy consumption and traffic congestion) through data-driven applications with a Cyber-Physical Systems approach (CPS, a broader term for the Internet of Things). In the context of the smart cities initiative proposed by the White House, this talk will focus on data-driven modeling and applications for large-scale cross-domain urban systems, e.g., taxi, bus, subway, private vehicle, truck, cellphone, and smart payment systems. I will first show how cross-domain data from these systems can be collaboratively utilized to capture urban mobility in real time by a new technique called multi-view bounding, which addresses overfitting issues of existing mobility models driven by single-domain data. Then I will show how the captured real-time mobility can be used to design a practical service, i.e., mobility-driven ridesharing, to provide positive feedback to urban systems themselves, e.g., reducing energy consumption and traffic congestion. Finally, I will present some research challenges related to privacy and security in the context of smart cities research. Details: https://www.cs.rutgers.edu/~dz220/

Elena Zheleva, University of Illinois at Chicago

Title: Data Science in Social Spaces: Personalization vs. Privacy

The abundance of personal data that people share online provides an opportunity to develop a new category of information technology products, ones that are powered by data and sophisticated user models. Data science products range from personalized recommendations to fostering healthy online communities. At the same time, these products raise many privacy concerns due to the sometimes sensitive nature of the data they use. In this talk, I will present my work on machine learning algorithms that utilize information from people's social environments, in order to infer their personal traits. I will discuss the privacy implications of such algorithms for social media users and businesses in the light of a specific Facebook case study.

Document last modified on April 17, 2017.