PETs for Input Privacy

📚 Reading Assignment: UN Guide on Privacy-Enhancing Technologies for Official Statistics - United Nations Committee of Experts on Big Data and Data Science for Official Statistics (2023). This is a long guide, so you are welcome to just skim through it, but the case studies are fascinating. We will continue to quote the guide throughout this section.

Input privacy aims to let two or more parties contribute data to a computation without any of the other parties seeing that data in the clear. This is trickier to achieve in practice than it may first appear...

Broadly speaking, there are three popular approaches to input privacy:

  1. Finding a trusted third party or using a trusted legal entity, such as a national court system, to enforce pre-agreed contractual terms of use.
  2. Using pure cryptographic-based approaches.
  3. Leveraging trusted execution environments.


Public-Key Cryptography

You're using public-key cryptography to visit this page: it was sent to your browser encrypted using TLS, with session keys your browser and the server negotiated in their initial handshake. This protects both confidentiality and integrity: while others on the network path can see which server you're connecting to, they can't read the page contents or the full URL, and they can't tamper with the response in transit without detection. Encryption of data at rest and in transit is so widely supported now in databases, cloud infrastructure, and software libraries that it almost goes without saying that you should use it. Always. Deploy it everywhere you can and maintain close control of your private keys. Prefer keys stored in HSMs, so that your cloud provider can't access them to read your data or be legally compelled to hand them to intelligence agencies. And while using HTTPS for websites is standard practice, ensuring end-to-end encryption may require careful design for specific product features, for example user messaging.
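
As a minimal sketch of the underlying primitive, the snippet below uses the third-party Python cryptography package (not part of the standard library) to generate an RSA key pair and encrypt a message that only the private-key holder can decrypt. Real TLS combines this kind of asymmetric cryptography with a negotiated symmetric session key, so treat this as the core idea rather than a production protocol.

```python
# Toy illustration of public-key encryption using the third-party
# `cryptography` package (pip install cryptography). Anyone with the
# public key can encrypt; only the private-key holder can decrypt.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The recipient generates a key pair and publishes the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# A sender encrypts with the public key...
ciphertext = public_key.encrypt(b"meet me at noon", oaep)

# ...and only the private key can recover the plaintext.
assert private_key.decrypt(ciphertext, oaep) == b"meet me at noon"
```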



Homomorphic Encryption

But what about when you're actually using the data - reading it, updating it, feeding it into a machine learning model, or analyzing it? It needs to be decrypted first, right? Not necessarily. Homomorphic encryption (HE) enables some computation to be carried out on encrypted data. Microsoft Edge's Password Monitor, for example, uses HE to check whether a user's saved passwords have ever been leaked in a password breach. The computation applied to the encrypted data is private set intersection. Using HE keeps passwords private both from malicious outsiders while the data is in transit and from Microsoft itself: the Edge servers never learn the unencrypted password values.

Fully Homomorphic Encryption (FHE) enables arbitrary computations to be carried out on encrypted data, i.e. it is Turing complete. At a minimum, this means it must be possible to both add and multiply encrypted values; more complex operations are then built by combining these.

Homomorphic encryption (HE) is a cryptographic technology that allows for direct computation (addition and multiplication) on encrypted data. It enables a party that provides data to outsource a computation. That is, no party aside from the party providing the data learns anything about the data in homomorphic encryption computations. Furthermore, only the party providing the data has the key with which to decrypt the output...

Homomorphic encryption refers to a family of encryption schemes with a special algebraic structure that allows computations to be performed directly on encrypted data. Homomorphic encryption offers post-quantum security, but can result in a high computational overhead and large expansion of data representation. Thus, ideal applications have a relatively small but critical encrypted computation component, include a persistent storage aspect, and are hard or impossible to implement using other methods. The most commonly used (fully) homomorphic encryption schemes at this time are the Brakerski-Gentry-Vaikuntanathan (BGV) and the Brakerski-Fan-Vercauteren (BFV) schemes.

- UN Guide on Privacy-Enhancing Technologies for Official Statistics
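
The BGV and BFV schemes named in the quote are complex, so here is a toy, deliberately insecure sketch of the older Paillier scheme instead, just to make "computing on ciphertexts" concrete. Paillier is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The primes below are tiny and hard-coded purely for illustration; real deployments should use an audited library such as those in Further Reading.

```python
# Toy Paillier encryption: additively homomorphic, so multiplying two
# ciphertexts yields an encryption of the SUM of the plaintexts.
# WARNING: tiny hard-coded primes, purely illustrative, not secure.
import math
import secrets

p, q = 2357, 2551                      # toy primes (completely insecure)
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 1) + 1   # random blinding factor
    while math.gcd(r, n) != 1:         # r must be coprime with n
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(20), encrypt(22)
c_sum = (c1 * c2) % n2                 # "addition" performed on ciphertexts
assert decrypt(c_sum) == 42            # decrypts to 20 + 22
```

Note that whoever runs the last two lines never needs the secret values 20 and 22, only their ciphertexts; decryption requires the private parameters lam and mu, which stay with the data owner.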


Secure Multi-Party Computation

Secure Multi-Party Computation (sMPC) enables multiple data processors to jointly perform a computation without having to trust each other. Like FHE, it relies on cryptography. Each party provides an input that remains hidden from the other parties, yet the computation is carried out over all the inputs combined. Imagine, for example, that a set of competing healthcare providers would like to pool their patient data for medical research, but patient privacy is paramount to them. By using sMPC, they can all benefit from research outputs computed across their combined data without compromising their patients' privacy.

sMPC has been applied to many use cases. An illustrative use case is that of sharing individually identifiable data among a group of several government agencies to compute statistics and make policy decisions based on those statistics. For example, a recent use case allowed five distinct agencies in County Government in the USA to share their unique data and compute the answers to queries such as, "How many persons that were incarcerated during a certain period had previously taken advantage of publicly provided mental health services or public housing?" The data provided by each agency included personal identifiers (for example, Social Security numbers), along with personal data such as mental health visit records and criminal records. sMPC was used to allow queries to be answered while keeping the input data strictly confidential to each party that provided it...

sMPC computation is based on one of several technologies. The most common technology choices are circuit garbling and linear secret sharing. The former is typically used in the case of two parties, while the latter may be used for groups of two to many parties. In both technologies, parties first agree on a function to be computed, and express that function as a logic circuit. While many functions can be described as circuits, some cannot, so not all functions are practically computable in sMPC. While a given set of sMPC primitives can technically be Turing complete, typical sMPC protocols are non-branching, fixed-length programs in order to be reasonably efficient...

sMPC technology performance depends heavily on the functions to be securely computed. A typical metric for sMPC performance is computational slowdown – the ratio of the latency of computation in sMPC to the latency of the same computation done without sMPC security. For general computation such as the calculations needed to process typical relational database query operators, recent results show a slowdown up to 10,000 times.

- UN Guide on Privacy-Enhancing Technologies for Official Statistics
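
To make the linear (additive) secret-sharing approach mentioned in the quote a little more concrete, here is a toy Python sketch of a joint sum, with hypothetical hospital inputs. A real sMPC deployment would also need secure channels between the parties, protection against malicious behaviour, and support for richer operations than addition.

```python
# Toy additive secret sharing over a prime modulus: each party splits its
# private value into random shares, so no strict subset of the shares
# reveals anything about the value, yet the shares combine to the total.
import secrets

MOD = 2**61 - 1  # a prime modulus; all arithmetic is done mod this value

def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n random shares that sum to it modulo MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Three hospitals each hold a private patient count (hypothetical data).
inputs = {"hospital_a": 1200, "hospital_b": 950, "hospital_c": 431}

# Each hospital splits its input and sends one share to every party.
all_shares = [share(v, 3) for v in inputs.values()]

# Party i locally adds up the i-th share of every input...
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]

# ...and only the combination of all partial sums reveals the total.
total = sum(partial_sums) % MOD
assert total == sum(inputs.values())   # 2581, with no party seeing the others' counts
```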


Trusted Execution Environments

Trusted execution environments (TEEs), also known as secure enclaves, provide hardware-based protection against an untrusted host. A TEE is a segregated processing environment within the CPU (or a dedicated secure coprocessor) that aims to provide data confidentiality, data integrity, and code integrity. Even if the host's operating system is compromised by malware or the owner of the machine is malicious, data should remain secure within the TEE and it should not be possible to alter the code the TEE is running. In a privacy context, the classic use case is protecting your data from an untrusted SaaS vendor. For example, Signal uses a TEE (Intel SGX) to provide private contact discovery. Their goal was to enable users to find their phone contacts on Signal without Signal (or any government with legal power over Signal) ever having access to users' contacts. As described in the article, this is very challenging to implement in practice due to the technical limitations of SGX. Another limitation to keep in mind is that TEEs are not secure against all kinds of attacks, for example side-channel attacks.
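
As a deliberately simplified, hypothetical sketch of the trust flow, the snippet below shows a client that only releases its data after checking a stubbed attestation report against a code measurement it trusts. None of the names here come from a real TEE SDK: actual remote attestation involves hardware-signed quotes, certificate chains, and freshness checks.

```python
# Hypothetical sketch of attestation-gated data release. In a real TEE
# deployment the report would be produced and signed by the hardware
# (e.g. an SGX quote) and verified against the vendor's certificate chain;
# here both sides are stubbed with plain dictionaries for illustration.

# Measurement (hash of the enclave code) the client has decided to trust,
# e.g. obtained by reproducibly building the open-source enclave binary.
TRUSTED_MEASUREMENT = "expected-enclave-hash"   # placeholder value

def verify_attestation(report: dict) -> bool:
    """Stub check: accept only a report whose code measurement we trust.

    A real verifier would also validate the hardware signature over the
    report and check that the platform is patched and up to date.
    """
    return report.get("measurement") == TRUSTED_MEASUREMENT

# Report the (stubbed) enclave presents to the client.
report = {"measurement": "expected-enclave-hash", "platform": "example-tee"}

if verify_attestation(report):
    # Only now would the client establish an encrypted channel that
    # terminates *inside* the enclave and send its sensitive data.
    print("attestation ok: release data to the enclave")
else:
    print("attestation failed: refuse to send data")
```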



Further Reading

  • The UK ICO has a great Guide to PETs with accessible examples and case studies. It includes some PETs not covered in this course, such as federated learning and synthetic data.
  • If you're unfamiliar with cryptography basics, it can be quite intimidating getting started with PETs such as homomorphic encryption. Dan Boneh's free online Cryptography course is a great starting point, as is Jean-Philippe Aumasson's Serious Cryptography book.
  • Check out OpenMined's Foundations of Private Computation course, which includes practical coding exercises working with HE, MPC, and other PETs.
  • Zama provides a range of open-source libraries for FHE, including TFHE-rs, which implements the TFHE scheme in Rust, and Concrete, which provides a high-level API in Python.