Employee Data Warehouse Pipeline

A production-style data warehouse pipeline built with Airflow, PostgreSQL, CDC architecture, and SCD Type 2 modeling.

This project simulates how modern analytics engineering and data platform teams build reliable, incremental, and idempotent ETL systems.

Architecture

Raw Data
↓
ODS Layer (Append-only)
↓
CDC Event Layer (Idempotent Event Store)
↓
DWD Layer (SCD Type 2)
↓
DWS Layer (Aggregated Analytics)

Orchestrated by Apache Airflow

Tech Stack

Python
PostgreSQL
Apache Airflow
Pandas
Docker

Key Features

1. ODS Layer

Append-only raw ingestion
Minimal transformation
Immutable ingestion history

2. CDC Layer

Event-driven CDC architecture
Unique event_id for idempotency
Replay-safe pipeline design
Exactly-once style processing
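
The idempotency idea above can be illustrated with a minimal sketch. It assumes event_id is a deterministic hash of the event content and uses an in-memory dictionary as a stand-in for the CDC event table; the names make_event_id and EventStore are hypothetical and not taken from this repository.

```python
import hashlib
import json


def make_event_id(employee_id: int, op: str, payload: dict) -> str:
    """Derive a deterministic event_id from the event content (assumed scheme)."""
    canonical = json.dumps(
        {"employee_id": employee_id, "op": op, "payload": payload},
        sort_keys=True,  # stable key order so equal content hashes equally
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EventStore:
    """In-memory stand-in for the CDC event table, keyed by event_id."""

    def __init__(self) -> None:
        self._events: dict[str, dict] = {}

    def append(self, employee_id: int, op: str, payload: dict) -> bool:
        """Store the event unless its event_id exists. Returns True if stored."""
        event_id = make_event_id(employee_id, op, payload)
        if event_id in self._events:
            return False  # duplicate delivery or replay: ignore it
        self._events[event_id] = {
            "employee_id": employee_id,
            "op": op,
            "payload": payload,
        }
        return True

    def __len__(self) -> int:
        return len(self._events)


store = EventStore()
first = store.append(1, "update", {"salary": 5000})
replayed = store.append(1, "update", {"salary": 5000})  # same event again
```

In PostgreSQL the same effect is usually achieved with a unique constraint on event_id and INSERT ... ON CONFLICT DO NOTHING. Note that a pure content hash would also collapse two genuinely separate but identical changes, so a real design would likely include a source version or timestamp in the hashed fields.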

3. DWD Layer

Implements:

SCD Type 2
Historical version tracking
Current-state snapshot
Logical delete support

Key columns:

effective_start
effective_end
is_current
is_deleted
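
How these four columns interact can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: rows are plain dictionaries, the tracked attributes live under a hypothetical attrs key, and an open-ended effective_end is represented by the sentinel date 9999-12-31.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def apply_scd2(history, employee_id, attrs, as_of, deleted=False):
    """Apply one change to an SCD Type 2 history (a list of dict rows), in place."""
    current = next(
        (r for r in history if r["employee_id"] == employee_id and r["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs and current["is_deleted"] == deleted:
            return history  # nothing changed: re-applying the event is a no-op
        # Close the previous version instead of overwriting it.
        current["effective_end"] = as_of
        current["is_current"] = False
    # Open a new version; a logical delete is just a new version with is_deleted set.
    history.append({
        "employee_id": employee_id,
        "attrs": dict(attrs),
        "effective_start": as_of,
        "effective_end": OPEN_END,
        "is_current": True,
        "is_deleted": deleted,
    })
    return history


history = []
apply_scd2(history, 1, {"salary": 5000}, date(2024, 1, 1))
apply_scd2(history, 1, {"salary": 6000}, date(2024, 6, 1))
apply_scd2(history, 1, {"salary": 6000}, date(2024, 6, 1))  # replay, ignored
apply_scd2(history, 1, {"salary": 6000}, date(2024, 9, 1), deleted=True)
```

After these calls the history holds three versions for employee 1, and only the last one (the logical delete) has is_current set.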

4. Incremental Processing

Checkpoint table:

meta.cdc_checkpoint

Supports:

Incremental loads
Backfill
Replay
Restart recovery
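
A rough sketch of checkpoint-driven processing is shown below. It assumes each CDC event carries a monotonically increasing seq number and that the checkpoint stores the last processed value; the real layout of meta.cdc_checkpoint is not documented here, so the dictionary and the name process_increment are hypothetical stand-ins.

```python
def process_increment(events, checkpoint, handler, batch_size=100):
    """Process events with seq above the checkpoint and advance the checkpoint."""
    pending = sorted(
        (e for e in events if e["seq"] > checkpoint["last_seq"]),
        key=lambda e: e["seq"],
    )[:batch_size]
    for event in pending:
        handler(event)
        # Advance only after the handler succeeds, so a crash resumes here.
        checkpoint["last_seq"] = event["seq"]
    return len(pending)


events = [{"seq": i, "employee_id": i} for i in range(1, 6)]
checkpoint = {"last_seq": 0}  # stand-in for a row in meta.cdc_checkpoint
seen = []

n1 = process_increment(events, checkpoint, seen.append, batch_size=3)
n2 = process_increment(events, checkpoint, seen.append, batch_size=3)
n3 = process_increment(events, checkpoint, seen.append, batch_size=3)
```

Under this model, a replay or backfill amounts to resetting last_seq to an earlier value and running again. In a database-backed version, the handler's writes and the checkpoint update would normally share one transaction so that a restart cannot skip or half-apply an event.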

5. DWS Layer

Business-facing aggregated metrics:

employee_count
avg_salary
max_salary
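
The three metrics can be derived from the DWD layer roughly as sketched below. The grouping key is an assumption: this README does not say how the metrics are grouped, so the example groups by a hypothetical department column and counts only current, non-deleted rows.

```python
from collections import defaultdict


def build_dws(dwd_rows):
    """Aggregate current, non-deleted DWD rows into per-department metrics."""
    salaries = defaultdict(list)
    for row in dwd_rows:
        if row["is_current"] and not row["is_deleted"]:
            salaries[row["department"]].append(row["salary"])
    return {
        dept: {
            "employee_count": len(values),
            "avg_salary": sum(values) / len(values),
            "max_salary": max(values),
        }
        for dept, values in salaries.items()
    }


rows = [
    {"department": "eng", "salary": 100, "is_current": True, "is_deleted": False},
    {"department": "eng", "salary": 200, "is_current": True, "is_deleted": False},
    {"department": "eng", "salary": 90, "is_current": False, "is_deleted": False},
    {"department": "ops", "salary": 80, "is_current": True, "is_deleted": True},
]
summary = build_dws(rows)
```

Historical versions and logically deleted employees are excluded, so only the two current eng rows contribute to the summary.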

Airflow DAG

Pipeline DAG:

employee_warehouse_final_consistent

Task Flow:

ODS → CDC → DWD → DWS

Data Modeling

ODS

Raw operational data.

CDC

Immutable event store.

DWD

Historical dimensional model using SCD2.

DWS

Analytics-ready summary tables.

Project Highlights

This project demonstrates:

Enterprise ETL architecture
CDC pipeline design
Idempotent processing
Incremental computation
Historical data modeling
Airflow orchestration
Data warehouse layering

Future Improvements

Kafka streaming ingestion
dbt transformation layer
Great Expectations data quality
Spark distributed processing
Real-time CDC with Debezium
Iceberg / Delta Lake support

Author

Cindy Tan
Data / Analytics Engineer
