Privacy as Code

A range of open-source and commercial privacy tools is available, and new tools are continually being created, so we encourage you to do your own research to identify the best tools for your organization and to consider whether you might need to build your own tooling. However, the examples below are good starting points to help you understand what is currently possible.

Static Code Scanning

Static Application Security Testing (SAST) tools play a key role in the Secure Software Development Lifecycle. They can automatically detect some basic security concerns, such as vulnerable libraries or code vulnerable to SQL injection. By integrating them into build pipelines - for example, as a build stage that must pass before a pull request can be merged into the main branch - you can enforce security policies within your codebase. Many tools provide results in SARIF (the Static Analysis Results Interchange Format). As its name suggests, this shared format enables results from different scans to be combined and interchanged across your build tooling. For example, Jenkins and GitHub both have SARIF integrations.
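Because SARIF is plain JSON, it is straightforward to build your own pipeline gates on top of it. As a minimal sketch (the file layout follows the SARIF specification, but the gating policy of "fail on any result" is an assumption you would tune to your own rules), a script could fail the build whenever any run reports findings:

```python
import json
import sys

def count_findings(sarif_path):
    """Count results across all runs in a SARIF file."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    # A SARIF file contains one or more runs, each with a list of results.
    return sum(len(run.get("results", [])) for run in sarif.get("runs", []))

if __name__ == "__main__" and len(sys.argv) > 1:
    findings = count_findings(sys.argv[1])
    if findings:
        print(f"{findings} finding(s) reported - failing the build")
        sys.exit(1)
    print("No findings - build may proceed")
```

In practice you would likely filter by severity or rule ID rather than failing on every result.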

A screenshot from GitHub showing how SAST results can be integrated into a pull request and block merging if a scan fails.

Similarly, static code analysis tools exist for privacy which you can integrate into your build pipelines. For example, Privado (not to be confused with the VPN of the same name!) is an open-source tool that identifies personal data fields in code using regexes and maps data flows of these fields from data sources to data sinks. The scan results include recommended fixes based on pre-defined or custom rules and can be used to automatically generate reports (such as records of processing activities or a Data Safety Report for the Android Play Store). The tool currently supports Java and Python, while the enterprise offering supports additional languages.
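To see how regex-based detection works in principle, here is a deliberately simplified sketch. The patterns below are invented for illustration and are far cruder than the curated rule sets that tools like Privado ship with:

```python
import re

# Illustrative patterns only: real tools maintain large, curated rule sets
# and will match far more precisely than these naive substrings.
PERSONAL_DATA_PATTERNS = {
    "email": re.compile(r"email", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social_?security", re.IGNORECASE),
}

def scan_source(source_code):
    """Return the categories of personal data fields referenced in the code."""
    found = set()
    for line in source_code.splitlines():
        for category, pattern in PERSONAL_DATA_PATTERNS.items():
            if pattern.search(line):
                found.add(category)
    return found
```

A real scanner would also track where each match flows (e.g. into logging or third-party API calls), which is where the data source/sink mapping comes in.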

Privacy Annotations

A complementary approach to scanning is to ask developers to manually annotate their code to indicate where personal data is being processed. This is clearly very time-intensive to apply to legacy codebases but provides improved accuracy over static code scanning if your product processes personal data in a custom way that tools fail to detect. It also encourages developers to be more privacy-aware as they write their code, and enables you to build automation around these annotations, for example to automatically generate data flow maps for all your organization's codebases.

However, be careful not to overwhelm your developers with additional concerns beyond their engineering work. Are you already practicing DevSecOps? How comfortable are your developers with that? Do they perhaps need additional training and support before it's feasible to extend this into DevSecPrivOps? Privacy annotations impose additional cognitive overhead and require devs to have basic privacy knowledge. Provide privacy training and introduce privacy as code gradually, incorporating developers' feedback into your tooling and processes.
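To make the idea concrete, here is a hedged sketch (not based on any real library) of how privacy annotations could look in Python: a decorator records which functions handle personal data in a registry, from which a simple data map can be generated automatically:

```python
import functools

# Simple in-process registry of personal-data handlers (illustrative only).
PERSONAL_DATA_REGISTRY = []

def personal_data_handler(categories):
    """Mark a function as processing the given personal data categories."""
    def decorator(func):
        PERSONAL_DATA_REGISTRY.append({
            "function": func.__qualname__,
            "categories": list(categories),
        })
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    return decorator

@personal_data_handler(categories=["contact.email"])
def send_welcome_email(address):
    ...

def data_map():
    """Generate a minimal data map from the registry."""
    return {entry["function"]: entry["categories"] for entry in PERSONAL_DATA_REGISTRY}
```

In a real codebase you would likely emit this registry at build time and merge the output across services to produce an organization-wide data flow map.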

  • This paper by Hjerppe et al. describes how you could use three Java annotations as a basis for introducing code-level policy enforcement (e.g. personal data fields should never be logged):
          @PersonalData
          @PersonalDataHandler
          @PersonalDataEndpoint
          
  • Alternatively, Fides Lang is an open-source privacy taxonomy and description language (designed for compatibility with GDPR, CCPA, LGPD and ISO 19944) that you can use to annotate systems and datasets with privacy properties in dedicated YAML files. You can build your own automation on top of this, or use its companion tool Fides, an open-source privacy engineering platform for privacy rights and data mapping automation.
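For illustration, a minimal Fides Lang dataset annotation might look like the following. The dataset and field names here are invented; the data categories (such as user.contact.email) come from the fideslang taxonomy, and you should check the current specification for the exact syntax:

```yaml
dataset:
  - fides_key: customer_db
    name: Customer Database
    description: Primary application database
    collections:
      - name: users
        fields:
          - name: email
            data_categories: [user.contact.email]
          - name: created_at
            data_categories: [system.operations]
```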

💻 Exercise: check out the Fideslang Visual Explorer. Does this ontology make sense for your context? Could you map your data categories, data uses, and data subjects to it, or would you need to adapt or extend it? How would you do so?


Personal Data Detection

You might also want to detect personal data at runtime - for example, because your code analysis doesn't yet provide complete coverage, or because you process a lot of data input by your users. You might, for instance, want to scan user input for personal data and then redact it - or display a warning to the user - if any is detected. Tools for doing so are also referred to as PII detection or Data Loss Prevention (DLP) tools.
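As a minimal illustration of the redaction use case (the patterns below are simplistic, cover only emails and US-style phone numbers, and assume well-formed input; production DLP tools use ML models and far more robust detection):

```python
import re

# Illustrative patterns only; production DLP tools detect many more
# field types and handle obfuscated or malformed input.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
US_PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text, placeholder="[REDACTED]"):
    """Replace detected personal data in user input with a placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    text = US_PHONE_RE.sub(placeholder, text)
    return text
```

A warning-based variant would return the matched categories to the caller instead of rewriting the text.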

  • Nightfall provides APIs for ML-powered personal data detection. Currently their APIs are free for 3GB per month. They also have an enterprise DLP offering which integrates with third-party systems such as GitHub and Slack to detect (and redact) personal data.
  • AWS's Comprehend service can be used to detect and redact various personal data fields. However, the number of supported data fields is fairly limited. For S3 buckets, AWS Macie can be used to detect sensitive data. Similarly, Microsoft Azure's Cognitive Search offers PII detection functionality, and both Azure and GCP offer dedicated DLP services.
  • If you already use observability software, it may support personal data detection. See for example Datadog's Sensitive Data Scanner.