This project implements Slowly Changing Dimension (SCD) Type 2 on customer data using AWS S3 + AWS Glue (PySpark) + Athena.
It preserves full history of changes instead of overwriting customer attributes.
When customer attributes (like city/state) change, SCD Type 2:
- expires the old record (is_current=false, sets end_date)
- inserts a new record (is_current=true, end_date=NULL)
- keeps historical versions for audit and analytics
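The expire-and-insert steps above can be sketched in plain Python. This is only an illustration of the logic, not the Glue job's actual PySpark code. It assumes `customer_id` is the business key and that only `customer_city` and `customer_state` are tracked for changes. The function name `apply_scd2` and the sample rows are hypothetical.

```python
from datetime import date

KEY = "customer_id"  # assumption: business key used to match versions
TRACKED = ("customer_city", "customer_state")  # assumption: attributes that trigger a new version


def apply_scd2(dimension, incoming, load_date):
    """Return a new dimension (list of dicts) after applying one snapshot."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[KEY]: row for row in result if row["is_current"]}

    for new in incoming:
        old = current.get(new[KEY])
        if old is not None and all(old[c] == new[c] for c in TRACKED):
            continue  # unchanged customer: keep the existing current row
        if old is not None:
            # Changed customer: expire the old version.
            old["is_current"] = False
            old["end_date"] = load_date
        # New or changed customer: insert an open-ended current version.
        result.append({**new, "start_date": load_date, "end_date": None, "is_current": True})
    return result


# Hypothetical sample data, not taken from the project's CSV files.
day1 = [
    {"customer_id": "c1", "customer_unique_id": "u1", "customer_city": "austin", "customer_state": "TX"},
    {"customer_id": "c2", "customer_unique_id": "u2", "customer_city": "miami", "customer_state": "FL"},
]
day2 = [
    {"customer_id": "c1", "customer_unique_id": "u1", "customer_city": "dallas", "customer_state": "TX"},
    {"customer_id": "c2", "customer_unique_id": "u2", "customer_city": "miami", "customer_state": "FL"},
]

dim = apply_scd2([], day1, date(2024, 1, 1))
dim = apply_scd2(dim, day2, date(2024, 1, 2))
for row in dim:
    print(row["customer_id"], row["customer_city"], row["start_date"], row["end_date"], row["is_current"])
```

In this sample, `c1` ends up with two rows (the expired `austin` version and the current `dallas` version), while `c2` keeps its single current row.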
S3 Raw (customers_day1.csv, customers_day2.csv)
→ AWS Glue PySpark (SCD Type 2 logic)
→ S3 Silver (Parquet dimension table)
→ Athena validation queries
Raw:
- s3://surya-project3-scd/raw/customers/customers_day1.csv
- s3://surya-project3-scd/raw/customers/customers_day2.csv
Silver:
s3://surya-project3-scd/silver/customers_scd2/
- customer_id
- customer_unique_id
- customer_city
- customer_state
- start_date
- end_date
- is_current
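One way to picture a single row of this dimension is as a typed record. The sketch below is illustrative only: the column types (string IDs, date-typed `start_date`/`end_date`) are assumptions, since the source lists column names but not types. The class name `CustomerVersion` is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerVersion:
    customer_id: str
    customer_unique_id: str
    customer_city: str
    customer_state: str
    start_date: date           # assumed type: date the version became effective
    end_date: Optional[date]   # None (NULL) while the version is current
    is_current: bool

    def is_consistent(self) -> bool:
        # A current row has no end_date; an expired row must have one.
        return self.is_current == (self.end_date is None)
```

The `is_consistent` check encodes the SCD Type 2 convention that `is_current=true` and `end_date=NULL` always go together.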
Result after running SCD:
- is_current = true → 99441 rows
- is_current = false → 10 rows
The 10 expired rows show that 10 customer records had attribute changes between the two loads, and that their previous versions were preserved rather than overwritten.
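Counts like these can be produced with a simple `GROUP BY` on `is_current`. The sketch below runs two such validation queries against an in-memory SQLite table so it can be tried locally. The table name `customers_scd2`, the reduced column set, and the sample rows are assumptions for illustration, and the project's actual Athena queries may differ. SQLite stores booleans as 0/1, whereas Athena would use a true boolean column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers_scd2 ("
    "customer_id TEXT, customer_city TEXT, start_date TEXT, end_date TEXT, is_current INTEGER)"
)
# Hypothetical sample rows: c1 changed city once, c2 never changed.
conn.executemany(
    "INSERT INTO customers_scd2 VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "austin", "2024-01-01", "2024-01-02", 0),
        ("c1", "dallas", "2024-01-02", None, 1),
        ("c2", "miami", "2024-01-01", None, 1),
    ],
)

# Validation 1: row counts per is_current flag.
COUNT_BY_FLAG = (
    "SELECT is_current, COUNT(*) AS row_count "
    "FROM customers_scd2 GROUP BY is_current ORDER BY is_current"
)
counts = dict(conn.execute(COUNT_BY_FLAG).fetchall())
print(counts)  # {0: 1, 1: 2}

# Validation 2: no customer should have more than one current row.
DUPLICATE_CURRENT = (
    "SELECT customer_id FROM customers_scd2 "
    "WHERE is_current GROUP BY customer_id HAVING COUNT(*) > 1"
)
duplicates = conn.execute(DUPLICATE_CURRENT).fetchall()
print(duplicates)  # []
```

The second query is a useful companion to the counts: an empty result indicates that every customer has at most one open version.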
- AWS S3
- AWS Glue
- PySpark
- Athena
- Parquet
Built a real-world historical dimension pipeline using SCD Type 2 logic with AWS Glue, and validated results using Athena queries.