Data Masking for GDPR Policy Alignment

This project demonstrates practical data masking techniques in Python that support alignment with GDPR (General Data Protection Regulation) policies. It is a learning resource for data engineers and developers who want to understand how to protect Personally Identifiable Information (PII) in datasets.

🛡️ GDPR Context

Article 32 of the GDPR requires organizations to implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, and it names pseudonymization as one such measure. Key concepts, drawn from Articles 4, 5, and 32, include:

Pseudonymization: Processing personal data so that it can no longer be attributed to a specific data subject without additional information, which is kept separately. Pseudonymized data is still personal data under GDPR.
Data Minimization: Collecting and processing only the data that is necessary for the stated purpose.
Integrity and Confidentiality: Protecting data against unauthorized or unlawful processing and against accidental loss, destruction, or damage.

🚀 Project Overview

The included Python script (data_masking_demo.py) generates dummy PII data and applies several masking techniques so that the data can be analyzed (e.g., in a data warehouse or analytics environment) with less exposure of personal information.

Techniques Demonstrated

Pseudonymization (Hashing):
- Applied to: User ID
- Method: SHA-256 Hashing with salt.
- Why: Allows tracking unique users without exposing their real IDs.
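
The salted hashing step can be sketched roughly as follows. This is a minimal illustration, not the code from data_masking_demo.py: the function name `pseudonymize_id` and the `SALT` value are hypothetical, and prepending the salt to the ID is only one of several reasonable ways to combine them.

```python
import hashlib

# Hypothetical salt for illustration only; in practice it would be loaded
# from a secret store and kept apart from the masked dataset.
SALT = "demo-salt-change-me"


def pseudonymize_id(user_id: str, salt: str = SALT) -> str:
    """Return the hex SHA-256 digest of the salt followed by the user ID."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


token_a = pseudonymize_id("user-123")
token_b = pseudonymize_id("user-123")  # same input, same token
token_c = pseudonymize_id("user-456")  # different input, different token
```

Because the same ID and salt always yield the same token, unique users can still be counted and joined across tables. Anyone who holds the salt can re-link tokens by hashing candidate IDs, so the salt plays the role of the "additional information" that must be stored separately.
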
Redaction:
- Applied to: Email, Phone Number
- Method: Partial masking (e.g., j*****e@example.com).
- Why: Hides direct contact info while preserving domain or format for validation.
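
A partial-masking helper along these lines might look like the sketch below. The names `mask_email` and `mask_phone` are hypothetical, and the rules (keep the first and last character of the local part, keep the last four phone digits) are assumptions inferred from the example format above; the script's exact rules, including how many asterisks it emits, may differ.

```python
def mask_email(email: str) -> str:
    """Keep the first and last character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        # Not a recognizable address: mask everything rather than leak it.
        return "*" * len(email)
    if len(local) <= 2:
        masked_local = "*" * len(local)
    else:
        masked_local = local[0] + "*" * (len(local) - 2) + local[-1]
    return f"{masked_local}@{domain}"


def mask_phone(phone: str, visible: int = 4) -> str:
    """Keep only the last `visible` digits; replace the other digits with '*'."""
    digits = [ch for ch in phone if ch.isdigit()]
    if len(digits) <= visible:
        return "*" * len(digits)
    return "*" * (len(digits) - visible) + "".join(digits[-visible:])


masked_email = mask_email("jane.doe@example.com")
masked_phone = mask_phone("555-010-1234")
```

Short local parts and unparseable addresses are masked completely, since showing one of only two characters would reveal most of the value.
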
Generalization (Bucketing):
- Applied to: Date of Birth -> Age Group
- Method: Converting specific dates into ranges (e.g., 30-39).
- Why: Reduces precision to prevent re-identification while keeping the data useful for demographic analysis.
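
The bucketing step could be sketched as below. The function name `age_group`, the ten-year bucket width, and the explicit `today` argument are assumptions made for illustration; passing the reference date in keeps the result reproducible.

```python
from datetime import date


def age_group(date_of_birth: date, today: date, width: int = 10) -> str:
    """Map a date of birth to an age range such as '30-39'."""
    # Completed years: subtract one if the birthday has not occurred yet this year.
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    age = today.year - date_of_birth.year - (0 if had_birthday else 1)
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


group = age_group(date(1982, 3, 12), today=date(2025, 1, 1))  # "40-49"
```

Wider buckets lower the re-identification risk further, at the cost of coarser demographic analysis.
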
Suppression:
- Applied to: IP Address
- Method: Dropping the column entirely.
- Why: If the data isn't needed for the specific analysis, remove it (Data Minimization).
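
Suppression is the simplest technique to sketch. The example below uses plain dictionaries and a hypothetical helper named `suppress_columns`; if the data lives in a pandas DataFrame, `DataFrame.drop(columns=[...])` achieves the same effect.

```python
def suppress_columns(records, columns):
    """Return copies of the records without the named columns."""
    drop = set(columns)
    return [{k: v for k, v in row.items() if k not in drop} for row in records]


records = [{"User ID": "u1", "IP Address": "192.168.1.1", "Salary": 38420}]
masked = suppress_columns(records, ["IP Address"])
```

The original records are left unchanged, so the unmasked data can stay in a restricted environment while only the reduced copy is shared.
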
Perturbation:
- Applied to: Salary
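- Method: Rounding values (e.g., 38420 becomes 38000 in the example output).
- Why: Reduces the precision of sensitive financial data so that an exact salary cannot be matched to an individual, while totals and averages stay approximately correct.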
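
A rounding step of this kind could be written as in the sketch below. The function name `round_salary` and the base of 1000 are assumptions inferred from the example output (38420 becomes 38000). Integer arithmetic is used instead of the built-in `round()` so that halves always round up rather than to the nearest even value.

```python
def round_salary(salary: int, base: int = 1000) -> int:
    """Round a non-negative integer salary to the nearest multiple of base (halves round up)."""
    return (salary + base // 2) // base * base


rounded = round_salary(38420)  # 38000
```
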

🛠️ Setup and Usage

Prerequisites

Python 3.x
pip (Python package manager)

Installation

Clone this repository (or download the files).

Create a virtual environment (recommended):

# Windows
python -m venv venv
.\venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

Install the required dependencies:
```
pip install -r requirements.txt
```

Running the Demo

Execute the Python script:

python data_masking_demo.py

📊 Example Output

When you run the script, it prints a comparison of the original and masked data.

Original Data (Snippet):

User ID           Full Name     Email                     Phone Number  Date of Birth  IP Address   Salary
bdd640fb-0667...  Daniel Doyle  garzaanthony@example.org  538.990.8386  1982-03-12     192.168.1.1  38420

Masked Data (Snippet):

User ID           Full Name     Email                     Phone Number  Age Group  Salary
5ce9bbbe5a61...   Daniel Doyle  g**********y@example.org  *******8386   40-49      38000

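Note: The IP Address column is removed, Date of Birth is replaced by Age Group, User ID is hashed, Email and Phone Number are partially masked, and Salary is rounded. Full Name appears unchanged in this snippet.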

⚠️ Disclaimer

This code is for educational purposes. In a production environment, manage secrets such as hashing salts and encryption keys securely (e.g., using a Key Management Service) and follow your organization's specific compliance requirements.
