Differential Privacy for Output Privacy
There are many reasons why an organization might want to achieve output privacy so that they can safely share or publish an "anonymized" dataset. Some historical cases include:
- AOL's release of users' search logs (2006, 20 million searches from over 650,000 users over 3 months)
- Netflix Prize dataset (2006, training data: 100,480,507 ratings from 480,189 users for 17,770 movies)
- Australian Medicare release (2016, medical billing records for 2.9 million people)
- 2010 US Census release (statistics with statistical disclosure protections applied for over 300 million people)
What do all of these have in common? They were all attacked and demonstrated to be at least partially reidentifiable. Infamously, the New York Times reidentified AOL Searcher No. 4417749 based on her search queries, ranging from "numb fingers" to "60 single men" to "dog that urinates on everything."

As Damien Desfontaines explains in the reading below, just about any (unmodified) data point can be linked back to an individual if you have access to other data sources, even if it doesn't intuitively seem like personal information. There are other risks too, such as reconstruction attacks, where statistics released about a dataset can be used to set up a system of equations. Depending on how constrained these equations are, an attacker can either solve them to reconstruct the dataset with certainty, or produce multiple possible solutions and identify records that are likely to be in the original dataset.

Damien's blog post introduces the techniques that have been tried - and have failed - to protect individual privacy in published datasets:
- pseudonymization (also known as tokenization);
- deidentification (also known as masking);
- suppressing outliers;
- adding random noise to individual values;
- reducing the granularity of the data, or aggregating it;
- statistical properties supposedly providing privacy guarantees, such as k-anonymity.

All of them can be attacked! Differential privacy was developed in response to these attacks.
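To see how little it takes for a reconstruction attack to succeed, here is a toy sketch (the dataset and the published statistics are invented for illustration). Three "safe-looking" aggregate sums over three people form a solvable linear system, so an attacker recovers every individual value exactly:

```python
import numpy as np

# Hypothetical toy dataset: the ages of three people (meant to stay secret).
secret_ages = [34, 51, 27]

# Aggregate statistics an organization might consider safe to publish:
#   1. the sum of all three ages
#   2. the sum of ages for persons 1 and 2
#   3. the sum of ages for persons 2 and 3
released = np.array([
    sum(secret_ages),
    secret_ages[0] + secret_ages[1],
    secret_ages[1] + secret_ages[2],
])

# The attacker writes the releases as a linear system A @ x = released.
# Each row encodes which individuals contributed to one statistic.
A = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
])

# The matrix is invertible, so the "anonymous" aggregates pin down
# every individual's age with certainty.
reconstructed = np.linalg.solve(A, released)
print(reconstructed)  # [34. 51. 27.]
```

Real attacks, such as the one the Census Bureau simulated against the 2010 Census, work on the same principle, just with millions of variables and published statistics instead of three.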
These failures of legacy techniques show that we need something better. So, when does an anonymization method deserve our trust? It should at least address these four points:
- it should avoid making assumptions on what is identifiable or secret in the data;
- it should be resistant to auxiliary data — its guarantee should hold no matter what an attacker might already know;
- it should provide a mathematical guarantee that doesn't rely on subjective intuition;
- and it should protect against possible future attacks, not just ones known today.
It turns out that this is exactly what differential privacy provides.
- It makes no assumptions on what a potential attacker might use in the data.
- Its guarantees do not depend on what auxiliary data the attacker has access to.
- It provides a quantifiable, provable guarantee about the worst-case privacy risk.
- And this guarantee holds for all possible attacks, so the guarantee is future-proof.
It has a host of other benefits, too. For example, it can quantify the total privacy cost of multiple data releases. It also offers much more flexibility: many kinds of data transformation and analyses can be performed with differential privacy.
Differential privacy (DP) is a mathematical guarantee provided by a class of techniques that add noise to the outputs of data processing. The goal is to achieve plausible deniability for individuals: the output should be "essentially" the same regardless of whether or not any one individual's data was included in the input.
That...sounds like it might be mathematically impossible, right? The whole point of providing inputs to some data processing is that we want them to contribute to the output. If the dataset is large, each individual's contribution (for example, their influence on the mean value) may be very small, but it is still present. That's where the "essentially" comes in. DP works with a privacy budget, also called the privacy loss parameter (denoted epsilon, ε). This bounds each individual's contribution to the output. When this parameter is chosen well - and when DP is applied in data processing scenarios where it's a good fit - DP provides good privacy protection.
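As a concrete sketch of how ε bounds an individual's contribution, here is the classic Laplace mechanism applied to a counting query (the dataset and the `dp_count` helper are invented for illustration, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(data, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    Adding or removing any one person changes a count by at most 1
    (its "sensitivity" is 1), so adding noise drawn from a Laplace
    distribution with scale 1/epsilon makes this query satisfy
    epsilon-differential privacy. Smaller epsilon = more noise =
    stronger plausible deniability for each individual.
    """
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Hypothetical query: how many patients are over 60?
ages = [34, 51, 27, 72, 65, 44, 61]
noisy_answer = dp_count(ages, lambda a: a > 60, epsilon=1.0)
```

Any single answer is noisy, but the noise has mean zero, so repeated or aggregated queries still yield useful statistics; the privacy budget exists precisely because each additional query leaks a little more.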
There are multiple variants of DP. The privacy budget can either be enforced interactively as each query occurs - for example, in the scenario of a data analyst submitting queries - or non-interactively, which is useful for publishing a set of anonymous statistics, as with the data releases we considered above. And the addition of noise for DP can either occur globally or locally. In the case of global DP, a trusted actor has access to the original dataset - with no plausible deniability for individuals at all - and applies DP to the data centrally. This means everyone contributing to the dataset needs to trust them not to abuse the data. In contrast, with local DP each user (or, more realistically, each user's software) applies DP to their individual input before it is submitted to the dataset, removing the need for a trusted central actor. This enables, for example, privacy-preserving telemetry, where DP is applied before the data is ever sent from the user's computer.
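Local DP predates the formal definition in one well-known form: randomized response, where each user perturbs their own answer before it ever leaves their device. The survey scenario and helper names below are invented for illustration:

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """Classic randomized response: a simple local-DP mechanism.

    Before answering a sensitive yes/no question, the user flips a
    coin. Heads: answer truthfully. Tails: flip again and report
    that second coin instead. Any individual answer is plausibly
    deniable (this scheme satisfies epsilon = ln(3) local DP),
    yet the true rate is recoverable in aggregate.
    """
    if random.random() < 0.5:
        return truthful_answer        # first coin heads: tell the truth
    return random.random() < 0.5      # first coin tails: random answer

def estimate_true_rate(responses):
    # If p is the true "yes" rate, the expected reported rate is
    # 0.5 * p + 0.25, so invert that to de-bias the aggregate.
    reported = sum(responses) / len(responses)
    return 2 * reported - 0.5

# Hypothetical survey: 10% of users actually did the sensitive thing.
truth = [random.random() < 0.1 for _ in range(100_000)]
responses = [randomized_response(t) for t in truth]
print(estimate_true_rate(responses))  # typically lands near 0.10
```

No central party ever sees a trustworthy individual answer, which is exactly the property local DP trades noise for; production systems (such as DP telemetry) use more elaborate encodings, but the same de-biasing idea.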
Further Reading
- The Census Bureau's Simulated Reconstruction-Abetted Re-identification Attack on the 2010 Census - US Census Bureau, YouTube (2021). This webinar examines in detail how the Census Bureau proved that their 2010 Census release was reidentifiable, motivating their use of DP for the 2020 Census. Their adoption of DP has not been uncontroversial; some have argued that adding noise to census statistics is risky because policy decisions are made on the basis of this data.
- Differential Privacy: A Primer for a Non-Technical Audience - Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David R. O’Brien, and Salil Vadhan, Vanderbilt Journal of Entertainment & Technology Law 21, no. 1 (2018)
- Damien Desfontaines' introduction to differential privacy series, featuring gentle introductions to core DP concepts, a guide to identifying whether DP is a good fit for your use case, and examples of DP in practice in industry.
- Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry - Gonzalo M. Garrido, Xiaoyuan Liu, Florian Matthes, and Dawn Song, Proceedings on Privacy Enhancing Technologies (2023).
- Publishing Wikipedia usage data with strong privacy guarantees - Damien Desfontaines, Tumult Labs (2023)
- Google's Differential Privacy libraries on GitHub