# 📐 Schema Harmonizer

**Reference-Driven Dataset Schema Alignment Engine**

*Align. Audit. Harmonize.*
Schema Harmonizer enforces structural and metadata consistency between datasets using a reference-driven approach.
It was built to solve a common but under-engineered problem:
Longitudinal datasets drift silently; variable names, labels, value categories, and missing-value definitions change across waves.
Instead of manually rewriting metadata or generating thousands of lines of syntax, this utility performs deterministic schema harmonization in seconds.
- 📐 Schema Harmonizer
- 🚀 What It Does
- 🧠 Why It Matters
- ⚙ Architecture Overview
- 📂 Project Structure
- 🛠 Configuration Example
- 📊 Example Scale Test
- 🧾 Drift Reporting
- 🔍 Rename Audit
- 🧩 Why SPSS?
- 🏗 Design Principles
- 📎 Installation
- ▶ Run
- 📜 License
## 🚀 What It Does

🔹 Applies configurable variable renaming

🔹 Detects schema drift (reference-only / target-only variables)

🔹 Aligns metadata from the reference dataset:

- Variable labels
- Value labels
- Measure levels
- Missing value definitions

🔹 Produces audit artifacts:

- `rename_report.json`
- `drift_report.json`

🔹 Writes the harmonized output dataset
All in a reproducible, config-driven pipeline.
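The metadata-alignment step above can be sketched at the level the tool operates on: plain metadata dictionaries, where the reference dataset's definitions win for every shared variable. The function and field names below are illustrative, not the project's actual API:

```python
# Sketch of reference-driven metadata alignment (hypothetical helper).
# For each variable present in both datasets, the reference metadata wins;
# target-only variables keep their existing metadata untouched.

def align_metadata(reference_meta: dict, target_meta: dict, common_vars: list) -> dict:
    """Copy labels, value labels, measure levels, and missing definitions
    from the reference onto the target for every shared variable."""
    aligned = {field: dict(values) for field, values in target_meta.items()}
    for field in ("variable_labels", "value_labels",
                  "measure_levels", "missing_values"):
        aligned.setdefault(field, {})
        for var in common_vars:
            if var in reference_meta.get(field, {}):
                aligned[field][var] = reference_meta[field][var]
    return aligned
```

Because the alignment is a pure dictionary transform, the same logic applies regardless of which file format supplied the metadata.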
## 🧠 Why It Matters

Schema drift is one of the biggest risks in:

- Longitudinal analytics
- Survey research
- Batch data processing
- Legacy-to-modern migrations
- Governance-heavy environments

Manual metadata correction:

- Is error-prone
- Is slow
- Does not scale

Schema Harmonizer replaces that with:

✔ Deterministic alignment
✔ Clear drift visibility
✔ Execution traceability
✔ Config-based identity mapping

Consistency across time becomes engineered, not manual.
## ⚙ Architecture Overview

```text
Reference Dataset (.sav)
        │
        ▼
Target Dataset (.sav)
        │
        ▼
Rename Layer (YAML-driven)
        │
        ▼
Drift Detection
        │
        ▼
Metadata Alignment Engine
        │
        ▼
Harmonized Output
        │
        ├── outputs/
        └── logs/
```
Separation of concerns:

- `data/` → input datasets
- `outputs/` → harmonized result
- `logs/` → audit artifacts
## 📂 Project Structure

```text
schema-harmonizer/
│
├── src/
│   ├── main.py
│   ├── rename.py
│   ├── metadata.py
│   ├── aligner.py
│   └── file_io.py
│
├── configs/
│   └── rename_mapping.yaml
│
├── data/
│   ├── reference.sav
│   └── target.sav
│
├── outputs/
│
├── logs/
│
└── README.md
```
## 🛠 Configuration Example

```yaml
rename_mapping:
  id: caseid
  question_1: q1
  question_2: q2
  question_3: q3
strict_mode: false
```
The rename mapping aligns variable identity before metadata harmonization.
Strict mode optionally halts execution when schema drift is detected.
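The rename-then-halt behavior described above can be sketched in a few lines. `apply_renames` is a hypothetical helper, and the inline mapping mirrors `rename_mapping.yaml` (in the real pipeline it would be loaded from that file):

```python
# Illustrative sketch of the rename layer with strict mode; function and
# parameter names are assumptions, not the project's actual module API.

rename_mapping = {"id": "caseid", "question_1": "q1",
                  "question_2": "q2", "question_3": "q3"}

def apply_renames(columns, mapping, reference_columns, strict_mode=False):
    """Rename target columns, then optionally halt if the renamed schema
    still disagrees with the reference schema."""
    renamed = [mapping.get(col, col) for col in columns]
    drift = set(reference_columns) ^ set(renamed)  # symmetric difference
    if strict_mode and drift:
        raise RuntimeError(f"Schema drift detected: {sorted(drift)}")
    return renamed
```

With `strict_mode: false`, drift is merely recorded in the audit artifacts instead of stopping the run.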
## 📊 Example Scale Test

Tested on:

- 500 rows
- 90+ variables
- Drift on both the reference and target sides

Execution time: < 1 second

Produces:

- Harmonized `.sav` output
- Rename audit report
- Drift audit report

Deterministic. Reproducible. Fast.
## 🧾 Drift Reporting

Example `drift_report.json`:

```json
{
  "total_reference_variables": 104,
  "total_target_variables": 99,
  "total_common": 91,
  "total_reference_only": 13,
  "total_target_only": 8
}
```
This makes schema evolution visible instead of silent.
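The counts in the report above are plain set arithmetic over the two column lists. A minimal sketch, reusing the field names from the example JSON (the function name is illustrative):

```python
# Compute drift counts between a reference and a target schema.
# Field names match the drift_report.json example above.

def drift_report(reference_columns, target_columns):
    ref, tgt = set(reference_columns), set(target_columns)
    return {
        "total_reference_variables": len(ref),
        "total_target_variables": len(tgt),
        "total_common": len(ref & tgt),          # shared variables
        "total_reference_only": len(ref - tgt),  # dropped in target
        "total_target_only": len(tgt - ref),     # added in target
    }
```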
## 🔍 Rename Audit

`rename_report.json` captures:

- Requested renames
- Successful renames
- Missing rename sources
- Final column count
This adds governance transparency to identity transformations.
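Defensive renaming means a mapping entry whose source column is absent is reported rather than failing the run. A sketch of how the four audit fields above could be derived (the exact JSON keys here are illustrative, not the tool's verified output format):

```python
# Build a rename audit for a defensive (non-failing) rename pass.
# Missing sources are recorded instead of raising an error.

def rename_report(columns, mapping):
    present = [src for src in mapping if src in columns]
    missing = [src for src in mapping if src not in columns]
    final = [mapping.get(col, col) for col in columns]
    return {
        "requested": sorted(mapping),                        # all mapping keys
        "successful": {src: mapping[src] for src in present},
        "missing_sources": missing,
        "final_column_count": len(final),
    }
```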
## 🧩 Why SPSS?

The current implementation uses `.sav` as the I/O layer via `pyreadstat`.
However, the architecture is not SPSS-dependent.

The harmonization logic operates at the:

- DataFrame level
- Metadata abstraction layer

This design allows adaptation to:

- CSV-based pipelines
- Parquet datasets
- Database schema alignment
- Feature store synchronization
- Any structured tabular format

SPSS is the reference layer, not the limitation.
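Because the core only needs rows plus a metadata dictionary, swapping the `.sav` reader for another format is a matter of writing one small adapter. A hypothetical CSV adapter, using only the standard library, might look like this:

```python
import csv
import io

# Hypothetical format adapter sketch: returns (rows, metadata) in the same
# shape a .sav adapter built on pyreadstat would. Names are illustrative.

def read_csv_dataset(text: str):
    """Parse CSV text into rows plus a minimal metadata dict."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # CSV carries no labels, so label metadata starts empty and is
    # filled in from the reference dataset during alignment.
    metadata = {"columns": columns, "variable_labels": {}}
    return rows, metadata
```

Any reader with this `(rows, metadata)` contract can feed the same drift detection and alignment stages.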
## 🏗 Design Principles

✔ Config-driven
✔ Deterministic execution
✔ Defensive renaming
✔ Drift awareness
✔ Scalable structure
✔ Clean separation of inputs, outputs, and logs

No unnecessary abstraction.
No overengineering.
Focused utility.
Planned extensions:

- CLI interface
- Format-agnostic adapters
- Metadata change diff reporting
- Automated longitudinal wave alignment
- Integration with ETL orchestration tools
## 📎 Installation

```shell
pip install -r requirements.txt
```

## ▶ Run

```shell
python src/main.py
```
Schema Harmonizer was built to enforce longitudinal consistency through engineering rather than manual correction.

It demonstrates:

- Analytics engineering discipline
- Metadata governance awareness
- Systems thinking applied to real-world data workflows

Consistency across datasets should not rely on memory.
It should be enforced by design.
## 📜 License

This project is licensed under the MIT License.