A production-style data warehouse pipeline built with Airflow, PostgreSQL, CDC architecture, and SCD Type 2 modeling.
This project simulates how modern analytics engineering and data platform teams build reliable, incremental, and idempotent ETL systems.
Raw Data ↓ ODS Layer (Append-only) ↓ CDC Event Layer (Idempotent Event Store) ↓ DWD Layer (SCD Type 2) ↓ DWS Layer (Aggregated Analytics)
Orchestrated by Apache Airflow
- Python
- PostgreSQL
- Apache Airflow
- Pandas
- Docker
- Append-only raw ingestion
- Minimal transformation
- Immutable ingestion history
- Event-driven CDC architecture
- Unique event_id for idempotency
- Replay-safe pipeline design
- Exactly-once style processing
Implements:
- SCD Type 2
- Historical version tracking
- Current-state snapshot
- Logical delete support
Key columns:
- effective_start
- effective_end
- is_current
- is_deleted
Checkpoint table:
meta.cdc_checkpoint
Supports:
- Incremental loads
- Backfill
- Replay
- Restart recovery
Business-facing aggregated metrics:
- employee_count
- avg_salary
- max_salary
Pipeline DAG:
employee_warehouse_final_consistent
Task Flow:
ODS → CDC → DWD → DWS
Raw operational data.
Immutable event store.
Historical dimensional model using SCD2.
Analytics-ready summary tables.
This project demonstrates:
- Enterprise ETL architecture
- CDC pipeline design
- Idempotent processing
- Incremental computation
- Historical data modeling
- Airflow orchestration
- Data warehouse layering
- Kafka streaming ingestion
- dbt transformation layer
- Great Expectations data quality
- Spark distributed processing
- Real-time CDC with Debezium
- Iceberg / Delta Lake support
Cindy Tan Data / Analytics Engineer~~~~