This project implements a SQL-based data modeling pipeline to track actor performance across time using cumulative history, Type 2 Slowly Changing Dimensions (SCD), and incremental data processing.
It reflects core data engineering patterns and supports powerful analytics β all modeled in PostgreSQL.
- PostgreSQL
- SQL (CTEs, window functions, enums, arrays)
- Data Engineering Concepts:
- Dimensional Modeling
- Cumulative Table concept
- Slowly Changing Dimensions (SCD Type 2)
- Analytical Reporting
The dataset was loaded using a PostgreSQL .dump file from a real server (not included here).
- sql
- 01_DDL_actors.sql -- Base tables and custom types (films, quality_class)
- 02_generate_cumulative_actor_table.sql -- Cumulative actor generation (1969β2021)
- 03_DDL_SCD_table.sql -- Slowly changing dimensions Type-2 table
- 04_backfill_scd_history.sql -- SCD Type 2 logic to track quality/status changes
- 05_incremental_scd_update.sql -- logic for Incremental update of SCD.
- 06_actor_analytics.sql -- SQL queries for actor performance and trends.
-
actorstable stores actor information by year, including:- List of films acted in (with rating, votes, ID)
- Quality class (
star,good,average,bad) - Active/inactive status
- year (slowly evolving dimension)
-
Custom Types:
films: composite type to store each film's metadataquality_class: enum to classify actor performance
In 02_generate_cumulative_actor_table.sql, a cumulative data that is generated for each actor:
- Combines data from the current and previous year
- Aggregates all films an actor has done up to that point
- Calculates average rating to classify actors:
>8βstar>7βgood>6βaverage- else β
bad
- Tracks activity status (
is_active) based on presence in the current year - Merges newly released films with historical records
This process helps simulate a time-traveling data warehouse view of actor careers.
In 04_backfill_scd_history.sql, implemented SCD Type 2 logic:
- Detects changes in:
is_activestatus (active β inactive)quality_class(e.g., average β star)
- Assigns each unique streak a
streak_identifier - Aggregates start and end years for each state
- Backfills an
actors_scd_tablewith historical segments for each actor's journey
This allows detailed tracking of actor evolution over time without losing history.
In actor_analytics.sql, performed rich analytics using the modeled data:
- Yearly Active Actor Counts
- Quality Class Distribution Over Time
- Actors Who Improved from Average to Star
- Longest Continuous Active Streaks
- First Year Becoming Star
- Comeback Actors (Inactive β Active Again)
- Find top 10 actors who sustained star quality the longest across their career.
- Identifies actors whose final known career segment was as a star
These insights help understand:
- Talent evolution
- Industry dynamics
- Career peaks and declines
- Talent analytics for casting agencies
- Career trajectory dashboards
- Data warehouse modeling practice
- Teaching dimensional modeling and SCD concepts
β
How to model historical data
β
Use of custom types and arrays in SQL
β
Implementation of SCD Type 2 manually
β
Aggregations with window functions
β
Building analytics on top of data models