Data Masking for GDPR Policy Alignment

This project demonstrates practical data masking techniques in Python that support alignment with GDPR (General Data Protection Regulation) policies. It is a learning resource for Data Engineers and Developers who want to understand how to protect Personally Identifiable Information (PII) in datasets.

🛡️ GDPR Context

Under GDPR (specifically Article 32), organizations must implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk. Key concepts include:

  • Pseudonymization (Article 4(5)): Processing personal data so that it can no longer be attributed to a specific data subject without additional information, which is kept separately. Pseudonymized data still counts as personal data.
  • Data Minimization (Article 5(1)(c)): Collecting and processing only the data that is necessary for the purpose.
  • Integrity and Confidentiality (Article 5(1)(f)): Protecting data against unauthorized or unlawful processing and against accidental loss, destruction, or damage.

🚀 Project Overview

The included Python script (data_masking_demo.py) generates dummy PII data and applies several masking techniques to produce a lower-risk version for analysis (e.g., in a Data Warehouse or Analytics environment). Masking alone does not make a dataset GDPR-compliant; it is one technical measure among several.

Techniques Demonstrated

  1. Pseudonymization (Hashing):

    • Applied to: User ID
    • Method: SHA-256 Hashing with salt.
    • Why: Allows tracking unique users without exposing their real IDs.
  2. Redaction:

    • Applied to: Email, Phone Number
    • Method: Partial masking (e.g., j*****e@example.com).
    • Why: Hides direct contact info while preserving domain or format for validation.
  3. Generalization (Bucketing):

    • Applied to: Date of Birth -> Age Group
    • Method: Converting specific dates into ranges (e.g., 30-39).
    • Why: Reduces precision to prevent re-identification while keeping the data useful for demographic analysis.
  4. Suppression:

    • Applied to: IP Address
    • Method: Dropping the column entirely.
    • Why: If the data isn't needed for the specific analysis, remove it (Data Minimization).
  5. Perturbation:

    • Applied to: Salary
    • Method: Rounding values (e.g., 38420 becomes 38000).
    • Why: Reduces the precision of sensitive financial data so exact figures cannot be matched back to an individual.
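
The five techniques above can be sketched with the standard library alone. This is a minimal illustration and not the contents of data_masking_demo.py: the function names, the SALT value, the ten-year bucket width, and the rounding step of 1000 are all assumptions chosen for the example.

```python
import hashlib
from datetime import date

# Hypothetical salt for illustration only; a real salt is a secret stored apart from the data.
SALT = "example-salt"


def pseudonymize(user_id: str, salt: str = SALT) -> str:
    """Pseudonymization: hex SHA-256 digest of salt + user_id."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def mask_email(email: str) -> str:
    """Redaction: keep the first and last character of the local part and the domain."""
    local, _, domain = email.partition("@")
    if len(local) <= 2:
        masked = "*" * len(local)
    else:
        masked = local[0] + "*" * (len(local) - 2) + local[-1]
    return f"{masked}@{domain}"


def mask_phone(phone: str, visible: int = 4) -> str:
    """Redaction: keep the last `visible` digits and replace the other digits with '*'."""
    digits = [c for c in phone if c.isdigit()]
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + "".join(digits[hidden:])


def age_group(dob: date, today: date) -> str:
    """Generalization: map a date of birth to a ten-year bucket such as '40-49'."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    age = today.year - dob.year - (0 if had_birthday else 1)
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def round_salary(salary: float, step: int = 1000) -> int:
    """Perturbation by rounding: nearest multiple of `step` (halves round up)."""
    return int((salary + step / 2) // step) * step


def mask_record(record: dict, today: date) -> dict:
    """Apply all five techniques to one record."""
    return {
        "User ID": pseudonymize(record["User ID"]),
        "Full Name": record["Full Name"],  # left as-is, mirroring the sample output
        "Email": mask_email(record["Email"]),
        "Phone Number": mask_phone(record["Phone Number"]),
        "Age Group": age_group(record["Date of Birth"], today),
        "Salary": round_salary(record["Salary"]),
        # Suppression: "IP Address" is simply not copied into the result.
    }
```

With these definitions, mask_email("garzaanthony@example.org") returns g**********y@example.org and round_salary(38420) returns 38000. Passing `today` explicitly keeps the age bucket reproducible in tests; the phone format this sketch produces may differ from what the demo script prints.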

🛠️ Setup and Usage

Prerequisites

  • Python 3.x
  • pip (Python package manager)

Installation

  1. Clone this repository (or download the files).

  2. Create a virtual environment (recommended):

    # Windows
    python -m venv venv
    .\venv\Scripts\activate
    
    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

Running the Demo

Execute the Python script:

python data_masking_demo.py

📊 Example Output

When you run the script, you will see a comparison of the Original and Masked data.

Original Data (Snippet):

User ID           Full Name     Email                     Phone Number  Date of Birth  IP Address   Salary
bdd640fb-0667...  Daniel Doyle  garzaanthony@example.org  538.990.8386  1982-03-12     192.168.1.1  38420

Masked Data (Snippet):

User ID           Full Name     Email                     Phone Number  Age Group  Salary
5ce9bbbe5a61...   Daniel Doyle  g**********y@example.org  *******8386   40-49      38000

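Note: The IP Address column is removed, Date of Birth is replaced by Age Group, User ID is hashed, Email and Phone Number are partially masked, and Salary is rounded. Full Name is left unchanged in this snippet, so the masked row is still directly identifying.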

⚠️ Disclaimer

This code is for educational purposes only. In a production environment, manage secrets such as the hashing salt and any encryption keys securely (e.g., using a Key Management Service) and follow your organization's specific compliance requirements.
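
As one way to keep the secret out of source code, the sketch below reads it from an environment variable and uses HMAC-SHA-256 in place of plain salted hashing. The variable name MASKING_SECRET and both function names are hypothetical, and the demo script may handle its salt differently; in practice the value would come from a Key Management Service or secret store.

```python
import hashlib
import hmac
import os


def load_secret(env_var: str = "MASKING_SECRET") -> bytes:
    """Read the secret from the environment; fail loudly if it is missing."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set")
    return value.encode("utf-8")


def keyed_pseudonym(user_id: str, secret: bytes) -> str:
    """HMAC-SHA-256 of the user ID, keyed with the secret (hex digest)."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

Anyone holding the secret can recompute the mapping from real IDs to pseudonyms, so it should be stored separately from the masked dataset.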