Cindy-txr/Employee-data-platform

Employee Data Warehouse Pipeline

A production-style data warehouse pipeline built with Airflow, PostgreSQL, CDC architecture, and SCD Type 2 modeling.

This project simulates how modern analytics engineering and data platform teams build reliable, incremental, and idempotent ETL systems.


Architecture

Raw Data
↓
ODS Layer (Append-only)
↓
CDC Event Layer (Idempotent Event Store)
↓
DWD Layer (SCD Type 2)
↓
DWS Layer (Aggregated Analytics)

Orchestrated by Apache Airflow


Tech Stack

  • Python
  • PostgreSQL
  • Apache Airflow
  • Pandas
  • Docker

Key Features

1. ODS Layer

  • Append-only raw ingestion
  • Minimal transformation
  • Immutable ingestion history
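
The append-only idea above can be sketched as follows. This is a minimal illustration, not the project's actual loader: it uses the standard-library sqlite3 module in place of PostgreSQL so it runs anywhere, and the table name ods_employee_raw, its columns, and the function names are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone


def create_ods_table(conn):
    # Hypothetical ODS table: every ingested record becomes a new row.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ods_employee_raw ("
        "ingest_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ingested_at TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def ingest_batch(conn, records):
    """Append raw records as-is; existing rows are never updated or deleted."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [(now, json.dumps(rec, sort_keys=True)) for rec in records]
    conn.executemany(
        "INSERT INTO ods_employee_raw (ingested_at, payload) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_ods_table(conn)
ingest_batch(conn, [{"employee_id": 1, "salary": 5000}])
# A later change to the same employee is stored as a second row, not an update.
ingest_batch(conn, [{"employee_id": 1, "salary": 5500}])
ods_row_count = conn.execute("SELECT COUNT(*) FROM ods_employee_raw").fetchone()[0]
```

Because nothing is overwritten, the ODS table keeps the full ingestion history that the downstream CDC layer can be rebuilt from.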

2. CDC Layer

  • Event-driven CDC architecture
  • Unique event_id for idempotency
  • Replay-safe pipeline design
  • Exactly-once-style processing (replayed events with an already stored event_id are skipped)
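
One way the event_id idempotency could work is sketched below. It is an assumption-laden illustration: the event_id is derived by hashing a hypothetical source_row_id together with the employee ID and operation, the table cdc_employee_event is invented for the example, and sqlite3's INSERT OR IGNORE stands in for PostgreSQL's INSERT ... ON CONFLICT (event_id) DO NOTHING.

```python
import hashlib
import json
import sqlite3


def make_event_id(source_row_id, employee_id, op):
    """Derive a deterministic event_id so a replay regenerates the same key."""
    raw = f"{source_row_id}|{employee_id}|{op}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def write_events(conn, events):
    """Insert events, skipping any whose event_id is already stored."""
    inserted = 0
    for ev in events:
        event_id = make_event_id(ev["source_row_id"], ev["employee_id"], ev["op"])
        cur = conn.execute(
            "INSERT OR IGNORE INTO cdc_employee_event "
            "(event_id, employee_id, op, payload) VALUES (?, ?, ?, ?)",
            (event_id, ev["employee_id"], ev["op"],
             json.dumps(ev["payload"], sort_keys=True)),
        )
        inserted += cur.rowcount  # 1 if the row was new, 0 if it was ignored
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cdc_employee_event ("
    "event_id TEXT PRIMARY KEY, employee_id INTEGER NOT NULL, "
    "op TEXT NOT NULL, payload TEXT NOT NULL)"
)
batch = [
    {"source_row_id": 1, "employee_id": 1, "op": "upsert", "payload": {"salary": 5000}},
    {"source_row_id": 2, "employee_id": 1, "op": "upsert", "payload": {"salary": 5500}},
]
first_run = write_events(conn, batch)   # both events are new
replay_run = write_events(conn, batch)  # same batch again: nothing is inserted
stored_events = conn.execute("SELECT COUNT(*) FROM cdc_employee_event").fetchone()[0]
```

Including the source row identifier in the hash, rather than only the payload, keeps two separate changes that happen to carry the same values from colliding.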

3. DWD Layer

Implements:

  • SCD Type 2
  • Historical version tracking
  • Current-state snapshot
  • Logical delete support

Key columns:

  • effective_start
  • effective_end
  • is_current
  • is_deleted
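
The four key columns can be illustrated with a small in-memory sketch of SCD Type 2 versioning. The event shape (employee_id, op, attrs, event_time) and the choice to record a logical delete as a new current row with is_deleted set are assumptions made for the example, not necessarily how the project's DWD job is written.

```python
def apply_event(history, event):
    """Apply one CDC event to a list of SCD Type 2 version rows (dicts)."""
    emp_id, ts = event["employee_id"], event["event_time"]
    is_delete = event["op"] == "delete"
    current = next(
        (row for row in history
         if row["employee_id"] == emp_id and row["is_current"]),
        None,
    )
    if current is None and is_delete:
        return history  # nothing to delete
    if current is not None:
        unchanged = current["is_deleted"] == is_delete and (
            is_delete or current["attrs"] == event["attrs"]
        )
        if unchanged:
            return history  # no new version needed
        # Close the open version at the moment the change takes effect.
        current["effective_end"] = ts
        current["is_current"] = False
    history.append({
        "employee_id": emp_id,
        # A delete keeps the last known attributes and only flips the flag.
        "attrs": current["attrs"] if is_delete else event["attrs"],
        "effective_start": ts,
        "effective_end": None,
        "is_current": True,
        "is_deleted": is_delete,
    })
    return history


history = []
events = [
    {"employee_id": 1, "op": "upsert", "attrs": {"salary": 5000}, "event_time": "2024-01-01"},
    {"employee_id": 1, "op": "upsert", "attrs": {"salary": 5500}, "event_time": "2024-06-01"},
    {"employee_id": 1, "op": "delete", "attrs": None, "event_time": "2024-12-31"},
]
for ev in events:
    apply_event(history, ev)
current_rows = [row for row in history if row["is_current"]]
```

After the three events, history holds three versions for the employee: two closed rows with an effective_end, and one current row flagged is_deleted.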

4. Incremental Processing

Checkpoint table:

meta.cdc_checkpoint

Supports:

  • Incremental loads
  • Backfill
  • Replay
  • Restart recovery
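
A checkpoint-driven incremental load might look roughly like the sketch below. The column layout (pipeline, last_seq), the cdc_event table, and the function names are hypothetical, and sqlite3 is used in place of PostgreSQL, so the checkpoint table is named cdc_checkpoint here rather than meta.cdc_checkpoint.

```python
import sqlite3


def read_checkpoint(conn, pipeline):
    row = conn.execute(
        "SELECT last_seq FROM cdc_checkpoint WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else 0


def process_increment(conn, pipeline, handler):
    """Process only events newer than the checkpoint, then advance it."""
    last_seq = read_checkpoint(conn, pipeline)
    rows = conn.execute(
        "SELECT seq, payload FROM cdc_event WHERE seq > ? ORDER BY seq", (last_seq,)
    ).fetchall()
    if not rows:
        return 0
    with conn:  # commits on success, rolls back the checkpoint write on error
        for _seq, payload in rows:
            handler(payload)
        conn.execute(
            "INSERT OR REPLACE INTO cdc_checkpoint (pipeline, last_seq) VALUES (?, ?)",
            (pipeline, rows[-1][0]),
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdc_event (seq INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
conn.execute("CREATE TABLE cdc_checkpoint (pipeline TEXT PRIMARY KEY, last_seq INTEGER NOT NULL)")
conn.executemany("INSERT INTO cdc_event (payload) VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()

seen = []
first = process_increment(conn, "dwd_employee", seen.append)   # processes a, b, c
second = process_increment(conn, "dwd_employee", seen.append)  # nothing new

conn.execute("INSERT INTO cdc_event (payload) VALUES (?)", ("d",))
conn.commit()
third = process_increment(conn, "dwd_employee", seen.append)   # processes only d

# Replay or backfill: move the checkpoint back and run again.
conn.execute("UPDATE cdc_checkpoint SET last_seq = 0 WHERE pipeline = ?", ("dwd_employee",))
conn.commit()
replayed = process_increment(conn, "dwd_employee", seen.append)
```

Restart recovery follows from the same mechanism: if a run fails before the checkpoint is advanced, the next run picks up from the last committed position.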

5. DWS Layer

Business-facing aggregated metrics:

  • employee_count
  • avg_salary
  • max_salary
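
The three metrics could be derived from the current, non-deleted DWD rows with a single aggregate query, sketched here. The grouping column department, the table name dwd_employee, and the sample rows are assumptions for illustration; sqlite3 stands in for PostgreSQL, where the same SELECT would apply with boolean columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dwd_employee ("
    "employee_id INTEGER, department TEXT, salary REAL, "
    "is_current INTEGER, is_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO dwd_employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "eng", 5000.0, 0, 0),  # closed historical version, excluded
        (1, "eng", 5500.0, 1, 0),
        (2, "eng", 6500.0, 1, 0),
        (3, "ops", 4000.0, 1, 1),  # logically deleted, excluded
    ],
)

# Aggregate only the current snapshot so history does not inflate the counts.
dws_rows = conn.execute(
    "SELECT department, "
    "COUNT(*) AS employee_count, "
    "AVG(salary) AS avg_salary, "
    "MAX(salary) AS max_salary "
    "FROM dwd_employee "
    "WHERE is_current = 1 AND is_deleted = 0 "
    "GROUP BY department ORDER BY department"
).fetchall()
```

Filtering on is_current and is_deleted is what keeps the summary consistent with the SCD Type 2 history underneath it.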

Airflow DAG

Pipeline DAG:

employee_warehouse_final_consistent

Task Flow:

ODS → CDC → DWD → DWS
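
A DAG with this linear task flow could be declared roughly as follows. This is a sketch assuming Airflow 2.4 or later in the 2.x line (for the schedule argument); the task IDs and the placeholder callables are hypothetical, and only the dag_id comes from this README.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables; the real tasks would run the layer-specific loads.
def load_ods():
    print("append raw data to ODS")


def build_cdc():
    print("write idempotent CDC events")


def build_dwd():
    print("apply SCD Type 2 changes to DWD")


def build_dws():
    print("refresh DWS aggregates")


with DAG(
    dag_id="employee_warehouse_final_consistent",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually; set a cron expression to schedule runs
    catchup=False,
) as dag:
    ods = PythonOperator(task_id="ods", python_callable=load_ods)
    cdc = PythonOperator(task_id="cdc", python_callable=build_cdc)
    dwd = PythonOperator(task_id="dwd", python_callable=build_dwd)
    dws = PythonOperator(task_id="dws", python_callable=build_dws)

    # Each layer runs only after the previous one succeeds.
    ods >> cdc >> dwd >> dws
```

With this dependency chain, a failure in any layer stops the downstream layers from running on stale input.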


Data Modeling

ODS

Raw operational data.

CDC

Immutable event store.

DWD

Historical dimensional model using SCD2.

DWS

Analytics-ready summary tables.


Project Highlights

This project demonstrates:

  • Enterprise ETL architecture
  • CDC pipeline design
  • Idempotent processing
  • Incremental computation
  • Historical data modeling
  • Airflow orchestration
  • Data warehouse layering

Future Improvements

  • Kafka streaming ingestion
  • dbt transformation layer
  • Great Expectations data quality
  • Spark distributed processing
  • Real-time CDC with Debezium
  • Iceberg / Delta Lake support

Author

Cindy Tan, Data / Analytics Engineer

About

Production-style Data Warehouse project using Airflow + PostgreSQL with CDC event layer, SCD2 modeling, checkpoint-based incremental loading, and idempotent pipelines.
