Differential Privacy for Output Privacy


There are many reasons why an organization might want to achieve output privacy so that it can safely share or publish an "anonymized" dataset. Some historical cases include:

What do all of these have in common? They were all attacked and shown to be at least partially reidentifiable. Infamously, the New York Times reidentified AOL Searcher No. 4417749 based on her search queries, which ranged from "numb fingers" to "60 single men" to "dog that urinates on everything." As Damien Desfontaines explains in the reading below, just about any (unmodified) data point can be linked back to an individual if you have access to other data sources, even if it doesn't intuitively seem like personal information. There are other risks too, such as reconstruction attacks, in which statistics released about a dataset are used to set up a system of equations. Depending on how complicated these equations are, you can either solve them to reconstruct the dataset with certainty, or produce multiple candidate solutions and identify records that are likely to be in the original dataset.

Damien's blog post introduces the techniques that have been tried - and have failed - to protect individual privacy in published datasets: pseudonymization (also known as tokenization), deidentification (also known as masking), suppressing outliers, adding random noise to individual values, and reducing the granularity of the data or aggregating it, along with statistical properties that supposedly provide privacy guarantees, such as k-anonymity. All of them can be attacked! Differential privacy was developed in response to these attacks.
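To see how a reconstruction attack works in miniature, consider a toy example (the numbers here are invented for illustration): suppose an employer publishes that the mean salary of its 10 employees is $60,000, and also that the mean salary of the 9 employees other than the CEO is $55,000. Those two statistics form a system of equations with only one unknown, so the CEO's exact salary falls out immediately:

$$10 \times 60{,}000 \;-\; 9 \times 55{,}000 \;=\; 105{,}000.$$

Real reconstruction attacks follow the same logic, just with many more equations and a solver doing the work.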

📚 Reading Assignment 1: What anonymization techniques can you trust? - Damien Desfontaines (2023)

These failures of legacy techniques prove that we need something better. So, when does an anonymization method deserve our trust? At a minimum, it should address the four points raised in the previous section:

  • it should avoid making assumptions on what is identifiable or secret in the data;
  • it should be resistant to auxiliary data — its guarantee should hold no matter what an attacker might already know;
  • it should provide a mathematical guarantee that doesn't rely on subjective intuition;
  • and it should protect against possible future attacks, not just ones known today.

It turns out that this is exactly what differential privacy provides.

  • It makes no assumptions about what a potential attacker might find identifying or secret in the data.
  • Its guarantees do not depend on what auxiliary data the attacker has access to.
  • It provides a quantifiable, provable guarantee about the worst-case privacy risk.
  • And this guarantee holds for all possible attacks, so the guarantee is future-proof.

It has a host of other benefits, too. For example, it can quantify the total privacy cost of multiple data releases. It also offers much more flexibility: many kinds of data transformations and analyses can be performed with differential privacy.


Differential privacy (DP) is a mathematical guarantee provided by a class of techniques that add noise to the outputs of data processing. The goal is to achieve plausible deniability for individuals: the output should be "essentially" the same whether or not any one individual's data was included in the input.
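In precise terms - this is the standard ε-differential privacy definition from the literature, with ε being the privacy parameter discussed below - a randomized mechanism M satisfies ε-DP if, for every pair of datasets D and D′ that differ in one individual's data and every set S of possible outputs:

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

The closer ε is to zero, the closer the two output distributions must be, and the stronger the plausible deniability.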

That... sounds like it might be mathematically impossible, right? The whole point of providing inputs to some data processing is that we want them to contribute to the output. If the dataset is large, each individual's contribution (for example, their influence on the mean value) may be very small, but it is still present. That's where the "essentially" comes in. DP works with a privacy budget, also called the privacy loss parameter (denoted epsilon, ε), which bounds how much any one individual's data can influence the output. When this parameter is chosen well - and when DP is applied in data processing scenarios where it's a good fit - DP provides good privacy protection.
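To make this concrete, here is a minimal sketch of the Laplace mechanism, one of the standard building blocks for satisfying DP on numeric queries (the `dp_count` helper and the data below are illustrative, not taken from any particular library). The key detail is that the noise scale is the query's sensitivity divided by ε, so a smaller ε means more noise and stronger privacy.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing any one person's
    record changes the true count by at most 1. Adding Laplace noise with
    scale sensitivity / epsilon masks that one-record difference.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: a noisy answer to "how many people in this dataset are over 60?"
ages = [23, 45, 67, 34, 71, 52, 68, 40]
print(dp_count(ages, lambda age: age > 60, epsilon=0.5))
```

Each time a query like this is answered, the ε it uses is deducted from the overall privacy budget.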

There are multiple variants of DP. The privacy budget can be enforced either interactively, as each query occurs - for example, in the scenario of a data analyst submitting queries - or non-interactively, which is useful for publishing a set of anonymous statistics, as with the data releases we considered above.

The addition of noise can also occur either globally or locally. In global DP, a trusted actor has access to the original dataset - with no plausible deniability for individuals at all - and applies DP to the data centrally. This means everyone contributing to the dataset needs to trust that actor not to abuse the data. In contrast, with local DP each user (or, more realistically, each user's software) applies DP to their individual input before it is submitted to the dataset, removing the need for a trusted central actor. This enables, for example, privacy-preserving telemetry, where DP is applied before the data ever leaves the user's computer.
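As an illustration of the local model in its simplest form, here is a sketch of randomized response for a single yes/no question (the helper names are illustrative). Each user's device flips their answer with a probability calibrated to ε before anything is sent, and the data collector debiases the noisy aggregate.

```python
import math
import random

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Report the true answer with probability e^eps / (e^eps + 1), else flip it.

    The ratio between the two reporting probabilities is exactly e^eps, which
    is what epsilon-local-DP requires for a single yes/no value.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_answer if random.random() < p_truth else not true_answer

def estimate_yes_rate(reports: list[bool], epsilon: float) -> float:
    """Debias the noisy reports to estimate the true fraction of 'yes' answers."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    # E[observed] = p * true_rate + (1 - p) * (1 - true_rate); solve for true_rate.
    return (observed - (1 - p)) / (2 * p - 1)

# Simulate 10,000 users, 30% of whom would truthfully answer "yes".
truthful = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(answer, epsilon=1.0) for answer in truthful]
print(estimate_yes_rate(reports, epsilon=1.0))  # close to 0.3 on average
```

Because every report is randomized on the user's own device, the collector never sees a raw answer; the trade-off is that the local model needs far more noise per user than the global model to reach the same accuracy.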


Further Reading