# 📐 Schema Harmonizer

**Reference-Driven Dataset Schema Alignment Engine**

*Align. Audit. Harmonize.*
Schema Harmonizer enforces structural and metadata consistency between datasets using a reference-driven approach.
It was built to solve a common but under-engineered problem:
Longitudinal datasets drift silently; variable names, labels, value categories, and missing-value definitions change across waves.
Instead of manually rewriting metadata or generating thousands of lines of syntax, this utility performs deterministic schema harmonization in seconds.
- 📐 Schema Harmonizer
- 🚀 What It Does
- 🧠 Why It Matters
- ⚙ Architecture Overview
- 📂 Project Structure
- 🛠 Configuration Example
- 📊 Example Scale Test
- 🧾 Drift Reporting
- 🔍 Rename Audit
- 🧩 Why SPSS?
- 🏗 Design Principles
- 📎 Installation
- ▶ Run
- 📜 License
## 🚀 What It Does

🔹 Applies configurable variable renaming

🔹 Detects schema drift (reference-only / target-only variables)

🔹 Aligns metadata from the reference dataset:

- Variable labels
- Value labels
- Measure levels
- Missing value definitions

🔹 Produces audit artifacts:

- `rename_report.json`
- `drift_report.json`

🔹 Writes the harmonized output dataset
All in a reproducible, config-driven pipeline.
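The metadata-alignment step above can be sketched at the level the tool operates on: plain metadata dictionaries, where the reference dataset's definitions win for every shared variable. The function and field names below are illustrative, not the project's actual API:

```python
# Sketch of reference-driven metadata alignment (hypothetical helper).
# For each variable present in both datasets, the reference metadata wins;
# target-only variables keep their existing metadata untouched.

def align_metadata(reference_meta: dict, target_meta: dict, common_vars: list) -> dict:
    """Copy labels, value labels, measure levels, and missing definitions
    from the reference onto the target for every shared variable."""
    aligned = {field: dict(values) for field, values in target_meta.items()}
    for field in ("variable_labels", "value_labels",
                  "measure_levels", "missing_values"):
        aligned.setdefault(field, {})
        for var in common_vars:
            if var in reference_meta.get(field, {}):
                aligned[field][var] = reference_meta[field][var]
    return aligned
```

Because the alignment is a pure dictionary transform, the same logic applies regardless of which file format supplied the metadata.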
## 🧠 Why It Matters

Schema drift is one of the biggest risks in:

- Longitudinal analytics
- Survey research
- Batch data processing
- Legacy-to-modern migrations
- Governance-heavy environments

Manual metadata correction:

- Is error-prone
- Is slow
- Does not scale

Schema Harmonizer replaces that with:

✔ Deterministic alignment
✔ Clear drift visibility
✔ Execution traceability
✔ Config-based identity mapping

Consistency across time becomes engineered, not manual.
## ⚙ Architecture Overview

```text
Reference Dataset (.sav)
        │
        ▼
Target Dataset (.sav)
        │
        ▼
Rename Layer (YAML-driven)
        │
        ▼
Drift Detection
        │
        ▼
Metadata Alignment Engine
        │
        ▼
Harmonized Output
        │
        ├── outputs/
        └── logs/
```
Separation of concerns:

- `data/` → input datasets
- `outputs/` → harmonized result
- `logs/` → audit artifacts
## 📂 Project Structure

```text
schema-harmonizer/
│
├── src/
│   ├── main.py
│   ├── rename.py
│   ├── metadata.py
│   ├── aligner.py
│   └── file_io.py
│
├── configs/
│   └── rename_mapping.yaml
│
├── data/
│   ├── reference.sav
│   └── target.sav
│
├── outputs/
│
├── logs/
│
└── README.md
```
## 🛠 Configuration Example

```yaml
rename_mapping:
  id: caseid
  question_1: q1
  question_2: q2
  question_3: q3
strict_mode: false
```
The rename mapping aligns variable identity before metadata harmonization.
Strict mode optionally halts execution when schema drift is detected.
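The rename-then-halt behavior described above can be sketched in a few lines. `apply_renames` is a hypothetical helper, and the inline mapping mirrors `rename_mapping.yaml` (in the real pipeline it would be loaded from that file):

```python
# Illustrative sketch of the rename layer with strict mode; function and
# parameter names are assumptions, not the project's actual module API.

rename_mapping = {"id": "caseid", "question_1": "q1",
                  "question_2": "q2", "question_3": "q3"}

def apply_renames(columns, mapping, reference_columns, strict_mode=False):
    """Rename target columns, then optionally halt if the renamed schema
    still disagrees with the reference schema."""
    renamed = [mapping.get(col, col) for col in columns]
    drift = set(reference_columns) ^ set(renamed)  # symmetric difference
    if strict_mode and drift:
        raise RuntimeError(f"Schema drift detected: {sorted(drift)}")
    return renamed
```

With `strict_mode: false`, drift is merely recorded in the audit artifacts instead of stopping the run.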
## 📊 Example Scale Test

Tested on:

- 500 rows
- 90+ variables
- Drift on both the reference and target sides

Execution time: < 1 second

Produces:

- Harmonized `.sav` output
- Rename audit report
- Drift audit report

Deterministic. Reproducible. Fast.
## 🧾 Drift Reporting

Example `drift_report.json`:

```json
{
  "total_reference_variables": 104,
  "total_target_variables": 99,
  "total_common": 91,
  "total_reference_only": 13,
  "total_target_only": 8
}
```
This makes schema evolution visible instead of silent.
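The counts in the report above are plain set arithmetic over the two column lists. A minimal sketch, reusing the field names from the example JSON (the function name is illustrative):

```python
# Compute drift counts between a reference and a target schema.
# Field names match the drift_report.json example above.

def drift_report(reference_columns, target_columns):
    ref, tgt = set(reference_columns), set(target_columns)
    return {
        "total_reference_variables": len(ref),
        "total_target_variables": len(tgt),
        "total_common": len(ref & tgt),          # shared variables
        "total_reference_only": len(ref - tgt),  # dropped in target
        "total_target_only": len(tgt - ref),     # added in target
    }
```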
## 🔍 Rename Audit

`rename_report.json` captures:

- Requested renames
- Successful renames
- Missing rename sources
- Final column count
This adds governance transparency to identity transformations.
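Defensive renaming means a mapping entry whose source column is absent is reported rather than failing the run. A sketch of how the four audit fields above could be derived (the exact JSON keys here are illustrative, not the tool's verified output format):

```python
# Build a rename audit for a defensive (non-failing) rename pass.
# Missing sources are recorded instead of raising an error.

def rename_report(columns, mapping):
    present = [src for src in mapping if src in columns]
    missing = [src for src in mapping if src not in columns]
    final = [mapping.get(col, col) for col in columns]
    return {
        "requested": sorted(mapping),                        # all mapping keys
        "successful": {src: mapping[src] for src in present},
        "missing_sources": missing,
        "final_column_count": len(final),
    }
```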
## 🧩 Why SPSS?

The current implementation uses `.sav` as the I/O layer via `pyreadstat`.
However, the architecture is not SPSS-dependent.

The harmonization logic operates at the:

- DataFrame level
- Metadata abstraction layer

This design allows adaptation to:

- CSV-based pipelines
- Parquet datasets
- Database schema alignment
- Feature store synchronization
- Any structured tabular format

SPSS is the reference layer, not the limitation.
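Because the core only needs rows plus a metadata dictionary, swapping the `.sav` reader for another format is a matter of writing one small adapter. A hypothetical CSV adapter, using only the standard library, might look like this:

```python
import csv
import io

# Hypothetical format adapter sketch: returns (rows, metadata) in the same
# shape a .sav adapter built on pyreadstat would. Names are illustrative.

def read_csv_dataset(text: str):
    """Parse CSV text into rows plus a minimal metadata dict."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # CSV carries no labels, so label metadata starts empty and is
    # filled in from the reference dataset during alignment.
    metadata = {"columns": columns, "variable_labels": {}}
    return rows, metadata
```

Any reader with this `(rows, metadata)` contract can feed the same drift detection and alignment stages.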
## 🏗 Design Principles

✔ Config-driven
✔ Deterministic execution
✔ Defensive renaming
✔ Drift awareness
✔ Scalable structure
✔ Clean separation of inputs, outputs, and logs

No unnecessary abstraction.
No overengineering.
Focused utility.
Planned extensions:

- CLI interface
- Format-agnostic adapters
- Metadata change diff reporting
- Automated longitudinal wave alignment
- Integration with ETL orchestration tools
## 📎 Installation

```shell
pip install -r requirements.txt
```

## ▶ Run

```shell
python src/main.py
```
Schema Harmonizer was built to enforce longitudinal consistency through engineering rather than manual correction.

It demonstrates:

- Analytics engineering discipline
- Metadata governance awareness
- Systems thinking applied to real-world data workflows

Consistency across datasets should not rely on memory.
It should be enforced by design.
## 📜 License

This project is licensed under the MIT License.