AWS CDC Data Platform

A production-grade data engineering platform implementing Change Data Capture (CDC) patterns for incremental data processing on AWS.

Problem Statement

Modern data platforms require real-time or near-real-time synchronization of transactional data into analytical systems. This project demonstrates an end-to-end CDC pipeline that:

Captures change events from a transactional database (simulated PostgreSQL)
Ingests changes incrementally to minimize processing overhead
Transforms data using modern ELT patterns
Loads into a data warehouse with SCD Type 2 support for historical tracking
Implements data quality checks and observability

Business Context: A retail e-commerce platform needs to sync customer, order, and product data from transactional systems to analytics for real-time reporting, inventory management, and customer insights.

Architecture

┌─────────────────┐
│  PostgreSQL     │
│  (Simulated)    │──┐
│  Change Logs    │  │
└─────────────────┘  │
                     │ CDC Events
                     ▼
┌─────────────────────────────────┐
│  CDC Ingestion Service          │
│  (Python - Debezium-style)      │
└─────────────────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  S3 Raw Zone          │
         │  s3://raw/cdc/        │
         │  Partitioned by date  │
         └───────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Apache Airflow       │
         │  Orchestration        │
         └───────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
┌──────────────┐      ┌──────────────────┐
│  dbt Models  │      │  Data Quality   │
│  (ELT)       │      │  Checks         │
└──────────────┘      └──────────────────┘
         │                       │
         └───────────┬───────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  S3 Curated Zone       │
         │  s3://curated/         │
         └───────────────────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │  Amazon Redshift      │
         │  Analytics Warehouse  │
         │  (SCD Type 2)         │
         └───────────────────────┘

See architecture/diagram.md for detailed diagrams.

Technology Stack

Ingestion: Python 3.11+, boto3, psycopg2
Storage: Amazon S3 (raw/curated zones)
Orchestration: Apache Airflow 2.8+
Transformation: dbt-core 1.7+
Warehouse: Amazon Redshift
Infrastructure: Terraform, Docker
CI/CD: GitHub Actions
Data Quality: Custom validation framework

What I Built

1. CDC Source Simulation

Simulated PostgreSQL database with change log generation
Realistic transaction patterns (INSERT, UPDATE, DELETE)
Configurable event frequency and volume

2. CDC Ingestion Service

Python-based CDC capture service
Idempotent writes to S3 raw zone
Partitioning by date and table
Handles schema evolution

3. Airflow Orchestration

DAGs for incremental CDC processing
Backfill capabilities
Error handling and retries
SLA monitoring

4. dbt Transformation Layer

Incremental models for efficiency
SCD Type 2 implementation for historical tracking
Data quality tests
Documentation and lineage

5. Data Quality Framework

Schema validation
Freshness checks
Record count validation
Null checks

6. Infrastructure as Code

Terraform modules for AWS resources
S3 buckets with lifecycle policies
IAM roles and policies
Redshift cluster configuration

Key Skills Demonstrated

CDC Patterns: Change Data Capture implementation
Incremental Processing: Efficient delta loads
SCD Type 2: Slowly Changing Dimension handling
Data Lake Architecture: Raw/curated zone patterns
ELT vs ETL: Modern transformation approach
Orchestration: Airflow DAG design and best practices
IaC: Terraform for cloud infrastructure
Data Quality: Validation and monitoring
Backfill Strategies: Historical data loading
Partitioning: S3 and Redshift optimization

Repository Structure

aws-cdc-data-platform/
├── README.md
├── architecture/
│   └── diagram.md
├── ingestion/
│   ├── cdc_capture.py
│   ├── source_simulator.py
│   └── requirements.txt
├── transformation/
│   ├── dbt_project.yml
│   ├── models/
│   │   ├── staging/
│   │   ├── intermediate/
│   │   └── marts/
│   └── profiles.yml.example
├── orchestration/
│   ├── dags/
│   │   ├── cdc_pipeline.py
│   │   └── backfill_dag.py
│   └── requirements.txt
├── quality/
│   ├── quality_checks.py
│   └── requirements.txt
├── infra/
│   └── terraform/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── ci/
│   └── .github/
│       └── workflows/
│           └── pipeline.yml
├── config/
│   └── config.yaml.example
└── docs/
    ├── setup.md
    └── architecture.md

Getting Started

Prerequisites

Python 3.11+
Docker & Docker Compose
AWS CLI configured
Terraform >= 1.5
dbt-core and dbt-redshift
PostgreSQL (for source simulation)

Quick Start

Set up environment:

python -m venv venv
source venv/bin/activate
pip install -r ingestion/requirements.txt

Configure AWS credentials:

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Deploy infrastructure:

cd infra/terraform
terraform init
terraform plan
terraform apply

Run CDC simulation:
```
python ingestion/source_simulator.py
```
Start Airflow:
```
cd orchestration
docker-compose up -d
```
Run dbt transformations:
```
cd transformation
dbt run
dbt test
```

For detailed setup instructions, see docs/setup.md.

Configuration

Copy config/config.yaml.example to config/config.yaml and update with your AWS resources:

aws:
  region: us-east-1
  s3:
    raw_bucket: your-raw-bucket
    curated_bucket: your-curated-bucket
  redshift:
    cluster: your-cluster
    database: analytics
    schema: public

cdc:
  batch_size: 1000
  poll_interval: 60
  tables:
    - customers
    - orders
    - products

Data Flow

Source: Simulated PostgreSQL generates change events
Capture: CDC service reads change logs and writes to S3 raw zone
Orchestration: Airflow triggers processing on schedule
Transformation: dbt models process raw data incrementally
Quality: Validation checks ensure data integrity
Warehouse: Processed data loaded to Redshift with SCD Type 2

Monitoring & Observability

Airflow UI for pipeline monitoring
CloudWatch logs for ingestion service
dbt run results and test outcomes
Data quality metrics dashboard (optional)

Portfolio Summary

Problem: Build a production CDC pipeline for incremental data synchronization from transactional to analytical systems.

Architecture: Simulated PostgreSQL → CDC Capture → S3 Raw → Airflow → dbt → S3 Curated → Redshift

Stack: Python, Apache Airflow, dbt, AWS (S3, Redshift), Terraform

What I Built: End-to-end CDC pipeline with incremental processing, SCD Type 2, data quality checks, and infrastructure automation.

Key Skills: CDC patterns, incremental ETL, data lake architecture, Airflow orchestration, dbt transformations, Terraform IaC, data quality engineering.

Documentation

License

MIT

Author

Built as part of a data engineering portfolio demonstrating production-grade patterns and best practices.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AWS CDC Data Platform

Problem Statement

Architecture

Technology Stack

What I Built

1. CDC Source Simulation

2. CDC Ingestion Service

3. Airflow Orchestration

4. dbt Transformation Layer

5. Data Quality Framework

6. Infrastructure as Code

Key Skills Demonstrated

Repository Structure

Getting Started

Prerequisites

Quick Start

Configuration

Data Flow

Monitoring & Observability

Portfolio Summary

Documentation

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
architecture		architecture
ci/.github/workflows		ci/.github/workflows
config		config
data_samples		data_samples
docs		docs
infra/terraform		infra/terraform
ingestion		ingestion
orchestration		orchestration
quality		quality
transformation		transformation
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

AWS CDC Data Platform

Problem Statement

Architecture

Technology Stack

What I Built

1. CDC Source Simulation

2. CDC Ingestion Service

3. Airflow Orchestration

4. dbt Transformation Layer

5. Data Quality Framework

6. Infrastructure as Code

Key Skills Demonstrated

Repository Structure

Getting Started

Prerequisites

Quick Start

Configuration

Data Flow

Monitoring & Observability

Portfolio Summary

Documentation

License

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages