naveenkanaparthi-git/aws-cdc-data-platform

AWS CDC Data Platform

A production-grade data engineering platform implementing Change Data Capture (CDC) patterns for incremental data processing on AWS.

Problem Statement

Modern data platforms require real-time or near-real-time synchronization of transactional data into analytical systems. This project demonstrates an end-to-end CDC pipeline that:

  • Captures change events from a transactional database (simulated PostgreSQL)
  • Ingests changes incrementally to minimize processing overhead
  • Transforms data using modern ELT patterns
  • Loads into a data warehouse with SCD Type 2 support for historical tracking
  • Implements data quality checks and observability

Business Context: A retail e-commerce platform needs to sync customer, order, and product data from transactional systems to analytics for real-time reporting, inventory management, and customer insights.

Architecture

┌─────────────────┐
│  PostgreSQL     │
│  (Simulated)    │──┐
│  Change Logs    │  │
└─────────────────┘  │
                     │ CDC Events
                     ▼
┌─────────────────────────────────┐
│  CDC Ingestion Service          │
│  (Python - Debezium-style)      │
└─────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  S3 Raw Zone          │
         │  s3://raw/cdc/        │
         │  Partitioned by date  │
         └───────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Apache Airflow       │
         │  Orchestration        │
         └───────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
┌──────────────┐      ┌──────────────────┐
│  dbt Models  │      │  Data Quality    │
│  (ELT)       │      │  Checks          │
└──────────────┘      └──────────────────┘
         │                       │
         └───────────┬───────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  S3 Curated Zone      │
         │  s3://curated/        │
         └───────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Amazon Redshift      │
         │  Analytics Warehouse  │
         │  (SCD Type 2)         │
         └───────────────────────┘

See architecture/diagram.md for detailed diagrams.

Technology Stack

  • Ingestion: Python 3.11+, boto3, psycopg2
  • Storage: Amazon S3 (raw/curated zones)
  • Orchestration: Apache Airflow 2.8+
  • Transformation: dbt-core 1.7+
  • Warehouse: Amazon Redshift
  • Infrastructure: Terraform, Docker
  • CI/CD: GitHub Actions
  • Data Quality: Custom validation framework

What I Built

1. CDC Source Simulation

  • Simulated PostgreSQL database with change log generation
  • Realistic transaction patterns (INSERT, UPDATE, DELETE)
  • Configurable event frequency and volume
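
To illustrate the kind of change log such a simulator could emit, here is a minimal sketch. The event layout (`op`, `before`, `after`, `ts_ms`) loosely follows the Debezium envelope, with `c`, `u`, and `d` standing for INSERT, UPDATE, and DELETE. The function name `generate_events`, the `value` column, and the seeding scheme are assumptions made for illustration; they are not taken from source_simulator.py.

```python
import random
import time
from typing import Dict, Iterator, Optional


def generate_events(table: str, count: int, seed: int = 0) -> Iterator[dict]:
    """Yield `count` synthetic change events for one table (hypothetical helper)."""
    rng = random.Random(seed)      # seeded so the sequence of operations is repeatable
    rows: Dict[int, dict] = {}     # current state of the simulated table
    next_id = 1
    for _ in range(count):
        # Only an insert is possible while the table is empty.
        op = rng.choice(["c", "u", "d"]) if rows else "c"
        before: Optional[dict] = None
        after: Optional[dict] = None
        if op == "c":
            after = {"id": next_id, "value": rng.randint(1, 100)}
            rows[next_id] = after
            next_id += 1
        elif op == "u":
            key = rng.choice(sorted(rows))
            before = rows[key]
            after = {**before, "value": rng.randint(1, 100)}
            rows[key] = after
        else:  # "d": the row leaves the table, so there is no `after` image
            key = rng.choice(sorted(rows))
            before = rows.pop(key)
        yield {
            "table": table,
            "op": op,
            "before": before,
            "after": after,
            "ts_ms": int(time.time() * 1000),
        }
```

Carrying both the `before` and `after` images lets downstream steps detect what changed without querying the source again.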

2. CDC Ingestion Service

  • Python-based CDC capture service
  • Idempotent writes to S3 raw zone
  • Partitioning by date and table
  • Handles schema evolution
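
One common way to make raw-zone writes idempotent is to derive the object key from the batch content, so that retrying the same batch overwrites the same object instead of creating a duplicate. The sketch below shows that idea together with date and table partitioning. The helper name `raw_zone_key`, the Hive-style `table=` and `dt=` path segments, and the 16-character digest are assumptions; only the `cdc/` prefix comes from the architecture diagram.

```python
import hashlib
import json
from datetime import date
from typing import List


def raw_zone_key(table: str, event_date: date, events: List[dict]) -> str:
    """Build a deterministic object key for one batch of change events."""
    # Canonical JSON (sorted keys, fixed separators) so an identical batch
    # always produces an identical digest, regardless of dict ordering.
    payload = json.dumps(events, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"cdc/table={table}/dt={event_date.isoformat()}/batch-{digest}.json"
```

With boto3, the resulting key could be passed to `put_object(Bucket=..., Key=key, Body=payload)`; because S3 replaces an existing object at the same key, a retried upload would leave exactly one copy of the batch.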

3. Airflow Orchestration

  • DAGs for incremental CDC processing
  • Backfill capabilities
  • Error handling and retries
  • SLA monitoring
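
A backfill usually comes down to enumerating the historical partition dates and reprocessing them in manageable groups. The following sketch shows that enumeration in plain Python; it deliberately avoids the Airflow API, and the names `backfill_dates` and `chunked` are hypothetical rather than taken from backfill_dag.py.

```python
from datetime import date, timedelta
from typing import Iterator, List


def backfill_dates(start: date, end: date) -> Iterator[date]:
    """Yield every partition date from `start` to `end`, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def chunked(dates: List[date], size: int) -> List[List[date]]:
    """Split dates into runs of at most `size` so each backfill task stays small."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [dates[i:i + size] for i in range(0, len(dates), size)]
```

Processing dates in small chunks keeps a failed run cheap to retry, since only the affected chunk needs to be repeated.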

4. dbt Transformation Layer

  • Incremental models for efficiency
  • SCD Type 2 implementation for historical tracking
  • Data quality tests
  • Documentation and lineage
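
For readers unfamiliar with SCD Type 2, the core rule is: when an attribute changes, close the current version of the row and open a new one, rather than updating in place. The sketch below expresses that rule in plain Python. The column names (`valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are illustrative assumptions. In dbt, this logic is typically expressed in SQL, for example through snapshots, and this section does not say which approach the repository's models take.

```python
from typing import List, Optional


def apply_scd2(history: List[dict], key: str, attrs: Optional[dict], ts: str) -> None:
    """Apply one change to an SCD Type 2 history list, in place.

    `attrs=None` represents a delete: the current version is closed
    and no new version is opened.
    """
    current = next(
        (row for row in history if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if attrs is not None and current["attrs"] == attrs:
            return  # nothing changed, so the current version stays open
        # Close the current version at the change timestamp.
        current["valid_to"] = ts
        current["is_current"] = False
    if attrs is not None:
        history.append({
            "key": key,
            "attrs": attrs,
            "valid_from": ts,
            "valid_to": None,   # open-ended until the next change
            "is_current": True,
        })
```

Because old versions are closed instead of overwritten, a query can reconstruct the state of a customer or product as of any past timestamp.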

5. Data Quality Framework

  • Schema validation
  • Freshness checks
  • Record count validation
  • Null checks
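
The four checks listed above can each be written as a small function that returns a list of failure messages, with an empty list meaning the check passed. This is a minimal sketch of that shape, not the contents of quality_checks.py; all function names and message formats are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def check_schema(rows: List[dict], expected: Dict[str, type]) -> List[str]:
    """Report rows whose columns or value types deviate from `expected`."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != set(expected):
            errors.append(f"row {i}: columns {sorted(row)} != {sorted(expected)}")
            continue
        for col, typ in expected.items():
            # None is left to the null check; only non-null values are type-checked.
            if row[col] is not None and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors


def check_not_null(rows: List[dict], columns: List[str]) -> List[str]:
    """Report every required column that is missing or null."""
    return [
        f"row {i}: {col} is null"
        for i, row in enumerate(rows)
        for col in columns
        if row.get(col) is None
    ]


def check_row_count(rows: List[dict], minimum: int) -> List[str]:
    """Fail when fewer than `minimum` rows arrived."""
    if len(rows) >= minimum:
        return []
    return [f"expected at least {minimum} rows, got {len(rows)}"]


def check_freshness(latest: datetime, max_age: timedelta, now: datetime) -> List[str]:
    """Fail when the newest record is older than `max_age`."""
    age = now - latest
    return [] if age <= max_age else [f"data is {age} old, limit is {max_age}"]
```

Returning messages instead of raising lets a caller run every check and report all failures at once.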

6. Infrastructure as Code

  • Terraform modules for AWS resources
  • S3 buckets with lifecycle policies
  • IAM roles and policies
  • Redshift cluster configuration

Key Skills Demonstrated

  • CDC Patterns: Change Data Capture implementation
  • Incremental Processing: Efficient delta loads
  • SCD Type 2: Slowly Changing Dimension handling
  • Data Lake Architecture: Raw/curated zone patterns
  • ELT vs ETL: Modern transformation approach
  • Orchestration: Airflow DAG design and best practices
  • IaC: Terraform for cloud infrastructure
  • Data Quality: Validation and monitoring
  • Backfill Strategies: Historical data loading
  • Partitioning: S3 and Redshift optimization

Repository Structure

aws-cdc-data-platform/
├── README.md
├── architecture/
│   └── diagram.md
├── ingestion/
│   ├── cdc_capture.py
│   ├── source_simulator.py
│   └── requirements.txt
├── transformation/
│   ├── dbt_project.yml
│   ├── models/
│   │   ├── staging/
│   │   ├── intermediate/
│   │   └── marts/
│   └── profiles.yml.example
├── orchestration/
│   ├── dags/
│   │   ├── cdc_pipeline.py
│   │   └── backfill_dag.py
│   └── requirements.txt
├── quality/
│   ├── quality_checks.py
│   └── requirements.txt
├── infra/
│   └── terraform/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── ci/
│   └── .github/
│       └── workflows/
│           └── pipeline.yml
├── config/
│   └── config.yaml.example
└── docs/
    ├── setup.md
    └── architecture.md

Getting Started

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • AWS CLI configured
  • Terraform >= 1.5
  • dbt-core and dbt-redshift
  • PostgreSQL (for source simulation)

Quick Start

  1. Set up environment:

    python -m venv venv
    source venv/bin/activate
    pip install -r ingestion/requirements.txt

  2. Configure AWS credentials:

    export AWS_ACCESS_KEY_ID=your_key
    export AWS_SECRET_ACCESS_KEY=your_secret
    export AWS_DEFAULT_REGION=us-east-1

  3. Deploy infrastructure, then return to the repository root:

    cd infra/terraform
    terraform init
    terraform plan
    terraform apply
    cd ../..

  4. Run CDC simulation (from the repository root):

    python ingestion/source_simulator.py

  5. Start Airflow, then return to the repository root:

    cd orchestration
    docker-compose up -d
    cd ..

  6. Run dbt transformations:

    cd transformation
    dbt run
    dbt test

For detailed setup instructions, see docs/setup.md.

Configuration

Copy config/config.yaml.example to config/config.yaml and update with your AWS resources:

aws:
  region: us-east-1
  s3:
    raw_bucket: your-raw-bucket
    curated_bucket: your-curated-bucket
  redshift:
    cluster: your-cluster
    database: analytics
    schema: public

cdc:
  batch_size: 1000
  poll_interval: 60
  tables:
    - customers
    - orders
    - products
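
As a rough sketch of how the `cdc` settings might be consumed, the code below validates the section once it has been parsed into a dictionary (for example with a YAML parser) and uses `batch_size` to group events. The class `CdcConfig`, the helper `batches`, and the reading of `poll_interval` as seconds are assumptions; the example config does not state a unit.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class CdcConfig:
    batch_size: int
    poll_interval: int  # assumed to be seconds; the config file does not say
    tables: List[str]

    @classmethod
    def from_dict(cls, raw: dict) -> "CdcConfig":
        """Build a config from the parsed `cdc` section, rejecting bad values early."""
        cfg = cls(int(raw["batch_size"]), int(raw["poll_interval"]), list(raw["tables"]))
        if cfg.batch_size < 1 or cfg.poll_interval < 1:
            raise ValueError("batch_size and poll_interval must be positive")
        if not cfg.tables:
            raise ValueError("at least one table is required")
        return cfg


def batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream of events into lists of at most `batch_size` items."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

With the values shown above, 2,500 pending events would be written as two batches of 1,000 and one of 500.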

Data Flow

  1. Source: Simulated PostgreSQL generates change events
  2. Capture: CDC service reads change logs and writes to S3 raw zone
  3. Orchestration: Airflow triggers processing on schedule
  4. Transformation: dbt models process raw data incrementally
  5. Quality: Validation checks ensure data integrity
  6. Warehouse: Processed data is loaded into Redshift with SCD Type 2 history
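
Incremental processing in a flow like this is often driven by a high-water mark: each run handles only events newer than the last timestamp it saw, then advances the mark. The sketch below shows that pattern under the assumption that every event carries a `ts_ms` field; the function `select_new_events` is hypothetical and is not how the dbt models are necessarily written.

```python
from typing import Iterable, List, Optional, Tuple


def select_new_events(
    events: Iterable[dict], watermark: Optional[int]
) -> Tuple[List[dict], Optional[int]]:
    """Return events newer than `watermark`, oldest first, plus the advanced watermark."""
    fresh = [e for e in events if watermark is None or e["ts_ms"] > watermark]
    fresh.sort(key=lambda e: e["ts_ms"])
    # Keep the old watermark when nothing new arrived.
    new_watermark = fresh[-1]["ts_ms"] if fresh else watermark
    return fresh, new_watermark
```

A strict `>` comparison can skip late events that share the watermark's timestamp, so a real pipeline may prefer a monotonically increasing sequence number or a small overlap window with deduplication.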

Monitoring & Observability

  • Airflow UI for pipeline monitoring
  • CloudWatch logs for ingestion service
  • dbt run results and test outcomes
  • Data quality metrics dashboard (optional)

Portfolio Summary

Problem: Build a production CDC pipeline for incremental data synchronization from transactional to analytical systems.

Architecture: Simulated PostgreSQL → CDC Capture → S3 Raw → Airflow → dbt → S3 Curated → Redshift

Stack: Python, Apache Airflow, dbt, AWS (S3, Redshift), Terraform

What I Built: End-to-end CDC pipeline with incremental processing, SCD Type 2, data quality checks, and infrastructure automation.

Key Skills: CDC patterns, incremental ELT, data lake architecture, Airflow orchestration, dbt transformations, Terraform IaC, data quality engineering.

Documentation

License

MIT

Author

Built as part of a data engineering portfolio demonstrating production-grade patterns and best practices.
