Production-grade data pipeline using the Write-Audit-Publish (WAP) pattern, the same approach used at Netflix, Airbnb, and Spotify.
Most pipelines write directly to production. If bad data slips through, stakeholders see it before you can fix it.
Never write to production directly. Instead:
Sensor → Staging → DQ Checks → Production → Cleanup
| Task | What It Does | Why It Matters |
|---|---|---|
| wait_for_polygon_tickers | Waits for upstream data | Don't run on empty tables |
| fetch_to_staging | Loads to staging table | Isolate unvalidated data |
| run_dq_checks | 4 quality validations | Catch issues early |
| exchange_to_production | Atomic partition swap | Zero-downtime publish |
| cleanup_staging | Drops temp table | No orphan data |
- ✅ Row count > 0 (table not empty)
- ✅ No NULL close prices (required field present)
- ✅ All prices > 0 (business logic validation)
- ✅ All volumes >= 0 (no negative volumes)

If ANY check fails → the pipeline stops → bad data never reaches production.
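A minimal sketch of those four gates, assuming the staging table is read into a PyArrow table via PyIceberg. The catalog name, table identifier, and column names (close, volume, date) are assumptions for illustration, not the repo's actual schema.

```python
import pyarrow.compute as pc
from pyiceberg.catalog import load_catalog


def run_dq_checks(partition_date: str) -> None:
    catalog = load_catalog("glue")                      # assumed catalog name
    table = catalog.load_table("staging.stock_prices")  # assumed identifier
    batch = table.scan(row_filter=f"date = '{partition_date}'").to_arrow()

    # Each failed assert raises, which fails the Airflow task and halts the DAG.
    assert batch.num_rows > 0, "DQ fail: staging partition is empty"
    assert batch["close"].null_count == 0, "DQ fail: NULL close prices"
    assert pc.min(batch["close"]).as_py() > 0, "DQ fail: non-positive price"
    assert pc.min(batch["volume"]).as_py() >= 0, "DQ fail: negative volume"
```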
- Orchestration: Apache Airflow
- Data Lake: PyIceberg + AWS Glue Catalog
- Storage: S3 (Iceberg format)
- Pattern: Write-Audit-Publish (idempotent, backfillable)
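For reference, a minimal sketch of pointing PyIceberg at the AWS Glue catalog; the warehouse path below is a placeholder, and credentials are assumed to come from the usual AWS environment variables.

```python
from pyiceberg.catalog import load_catalog

# "type": "glue" selects the AWS Glue catalog implementation;
# the S3 warehouse location is a placeholder bucket.
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "warehouse": "s3://my-data-lake/warehouse",
    },
)
print(catalog.list_namespaces())
```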
- ✅ Idempotency (same input = same output)
- ✅ Partition-scoped overwrites (preserve history; see the sketch below)
- ✅ Sensor-based dependencies
- ✅ Staging table isolation
- ✅ Automated data quality gates
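A minimal sketch of the idempotent, partition-scoped publish step, assuming PyIceberg's `overwrite` with a row filter (available in recent releases). Table and column names are again placeholders.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo


def publish_partition(validated: pa.Table, partition_date: str) -> None:
    catalog = load_catalog("glue")
    table = catalog.load_table("prod.stock_prices")  # assumed identifier
    # Replace only the target partition: rerunning with the same input
    # yields the same output, so backfills are safe and history is preserved.
    table.overwrite(validated, overwrite_filter=EqualTo("date", partition_date))
```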
Pattern learned from Zach Wilson (ex-Netflix, ex-Airbnb Data Engineer) via the DataExpert.io bootcamp.
- airflow-soda-integration - Soda CLI + Airflow for DQ checks
- LuBot.ai - AI analytics platform with 17 nightly DQ workers
⭐ If this helped you, star the repo!
