Open-source data engineering demo project using dbt, DuckDB, dlt, Dagster, and Metabase. Two storage modes for the Delta tables are supported: local and Microsoft Fabric OneLake.
Updated Jun 2, 2026 - Python
SCD2 implementation using PySpark
A modern banking data pipeline built with Dagster and dbt.
End-to-end data pipeline system built as part of the Coursera open-source Data Engineering program. It unifies diverse data sources, implements SCD2 historical tracking, and orchestrates workflows using industry-standard tools.
P&C insurance claims lakehouse: Azure ADLS + Databricks (PySpark/Delta) + Snowflake + dbt, real-time FNOL fraud signals via Kafka, Airflow-orchestrated, Terraform-provisioned, OIDC-secured, with data contracts, lineage, and ADRs throughout.
Advanced Healthcare Claims Pipeline using Snowflake, Snowpipe, Streams, Tasks, SCD Type 2, and AWS S3. Automates ingestion, CDC, dimensional modeling, and data quality checks for healthcare patient and claims data.
Fortune-500-grade banking analytics platform: OLTP -> medallion lakehouse -> Kimball star schema -> semantic layer -> 9-tab executive dashboard + 5 ML models (churn, fraud, segmentation, forecasting). Production-ready, governed, fully tested.
Production-grade parameterized ETL pipeline implementing SCD Type 2 for travel booking data using Databricks, Delta Lake, and ADLS — includes data quality checks, incremental fact table build, Z-Order optimization, and SQL reporting.
Batch retail data lakehouse on Databricks: Delta Live Tables (bronze → silver → gold), Unity Catalog, synthetic data generator, and an executive analytics dashboard.
End-to-end Medicare data engineering pipeline: API ingestion, PostgreSQL 17, dbt, dimensional modeling (Kimball/SCD2), Apache Airflow orchestration, and Evidence.dev dashboard. Built on a QEMU/KVM Rocky Linux VM.
Production-grade CDC pipeline: MySQL → Debezium → Kinesis → S3 → AWS Glue (PySpark) → Redshift + Postgres + OpenSearch. Multi-sink fanout with SCD2, idempotency tracking, and 13 modular Terraform modules.
Modern data stack reference: dbt + BigQuery + Airflow (Cloud Composer) with medallion layering, SCD2 snapshots, exposures, freshness SLAs, and 45× cost reduction via partition + cluster + incremental tuning.
This repo contains details about a travel booking project executed on Databricks.
End-to-End Data Warehouse & ETL Pipeline using SQL Server, Airflow, and Medallion Architecture
Multi-tenant IoT telemetry Lakehouse on Databricks + Delta Lake. PySpark, Auto Loader, DLT, medallion architecture, Terraform IaC.