A production-grade data engineering platform implementing Change Data Capture (CDC) patterns for incremental data processing on AWS.
Modern data platforms require real-time or near-real-time synchronization of transactional data into analytical systems. This project demonstrates an end-to-end CDC pipeline that:
- Captures change events from a transactional database (simulated PostgreSQL)
- Ingests changes incrementally to minimize processing overhead
- Transforms data using modern ELT patterns
- Loads into a data warehouse with SCD Type 2 support for historical tracking
- Implements data quality checks and observability
Business Context: A retail e-commerce platform needs to sync customer, order, and product data from transactional systems to analytics for real-time reporting, inventory management, and customer insights.
┌─────────────────┐
│ PostgreSQL │
│ (Simulated) │──┐
│ Change Logs │ │
└─────────────────┘ │
│ CDC Events
▼
┌─────────────────────────────────┐
│ CDC Ingestion Service │
│ (Python - Debezium-style) │
└─────────────────────────────────┘
│
▼
┌───────────────────────┐
│ S3 Raw Zone │
│ s3://raw/cdc/ │
│ Partitioned by date │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Apache Airflow │
│ Orchestration │
└───────────────────────┘
│
┌───────────┴───────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ dbt Models │ │ Data Quality │
│ (ELT) │ │ Checks │
└──────────────┘ └──────────────────┘
│ │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ S3 Curated Zone │
│ s3://curated/ │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Amazon Redshift │
│ Analytics Warehouse │
│ (SCD Type 2) │
└───────────────────────┘
See architecture/diagram.md for detailed diagrams.
- Ingestion: Python 3.11+, boto3, psycopg2
- Storage: Amazon S3 (raw/curated zones)
- Orchestration: Apache Airflow 2.8+
- Transformation: dbt-core 1.7+
- Warehouse: Amazon Redshift
- Infrastructure: Terraform, Docker
- CI/CD: GitHub Actions
- Data Quality: Custom validation framework
- Simulated PostgreSQL database with change log generation
- Realistic transaction patterns (INSERT, UPDATE, DELETE)
- Configurable event frequency and volume
- Python-based CDC capture service
- Idempotent writes to S3 raw zone
- Partitioning by date and table
- Handles schema evolution
- DAGs for incremental CDC processing
- Backfill capabilities
- Error handling and retries
- SLA monitoring
- Incremental models for efficiency
- SCD Type 2 implementation for historical tracking
- Data quality tests
- Documentation and lineage
- Schema validation
- Freshness checks
- Record count validation
- Null checks
- Terraform modules for AWS resources
- S3 buckets with lifecycle policies
- IAM roles and policies
- Redshift cluster configuration
- CDC Patterns: Change Data Capture implementation
- Incremental Processing: Efficient delta loads
- SCD Type 2: Slowly Changing Dimension handling
- Data Lake Architecture: Raw/curated zone patterns
- ELT vs ETL: Modern transformation approach
- Orchestration: Airflow DAG design and best practices
- IaC: Terraform for cloud infrastructure
- Data Quality: Validation and monitoring
- Backfill Strategies: Historical data loading
- Partitioning: S3 and Redshift optimization
aws-cdc-data-platform/
├── README.md
├── architecture/
│ └── diagram.md
├── ingestion/
│ ├── cdc_capture.py
│ ├── source_simulator.py
│ └── requirements.txt
├── transformation/
│ ├── dbt_project.yml
│ ├── models/
│ │ ├── staging/
│ │ ├── intermediate/
│ │ └── marts/
│ └── profiles.yml.example
├── orchestration/
│ ├── dags/
│ │ ├── cdc_pipeline.py
│ │ └── backfill_dag.py
│ └── requirements.txt
├── quality/
│ ├── quality_checks.py
│ └── requirements.txt
├── infra/
│ └── terraform/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── ci/
│ └── .github/
│ └── workflows/
│ └── pipeline.yml
├── config/
│ └── config.yaml.example
└── docs/
├── setup.md
└── architecture.md
- Python 3.11+
- Docker & Docker Compose
- AWS CLI configured
- Terraform >= 1.5
- dbt-core and dbt-redshift
- PostgreSQL (for source simulation)
-
Set up environment:
python -m venv venv source venv/bin/activate pip install -r ingestion/requirements.txt -
Configure AWS credentials:
export AWS_ACCESS_KEY_ID=your_key export AWS_SECRET_ACCESS_KEY=your_secret export AWS_DEFAULT_REGION=us-east-1
-
Deploy infrastructure:
cd infra/terraform terraform init terraform plan terraform apply -
Run CDC simulation:
python ingestion/source_simulator.py
-
Start Airflow:
cd orchestration docker-compose up -d -
Run dbt transformations:
cd transformation dbt run dbt test
For detailed setup instructions, see docs/setup.md.
Copy config/config.yaml.example to config/config.yaml and update with your AWS resources:
aws:
region: us-east-1
s3:
raw_bucket: your-raw-bucket
curated_bucket: your-curated-bucket
redshift:
cluster: your-cluster
database: analytics
schema: public
cdc:
batch_size: 1000
poll_interval: 60
tables:
- customers
- orders
- products- Source: Simulated PostgreSQL generates change events
- Capture: CDC service reads change logs and writes to S3 raw zone
- Orchestration: Airflow triggers processing on schedule
- Transformation: dbt models process raw data incrementally
- Quality: Validation checks ensure data integrity
- Warehouse: Processed data loaded to Redshift with SCD Type 2
- Airflow UI for pipeline monitoring
- CloudWatch logs for ingestion service
- dbt run results and test outcomes
- Data quality metrics dashboard (optional)
Problem: Build a production CDC pipeline for incremental data synchronization from transactional to analytical systems.
Architecture: Simulated PostgreSQL → CDC Capture → S3 Raw → Airflow → dbt → S3 Curated → Redshift
Stack: Python, Apache Airflow, dbt, AWS (S3, Redshift), Terraform
What I Built: End-to-end CDC pipeline with incremental processing, SCD Type 2, data quality checks, and infrastructure automation.
Key Skills: CDC patterns, incremental ETL, data lake architecture, Airflow orchestration, dbt transformations, Terraform IaC, data quality engineering.
MIT
Built as part of a data engineering portfolio demonstrating production-grade patterns and best practices.