📐 Schema Harmonizer

====================

Reference-Driven Dataset Schema Alignment Engine

Align. Audit. Harmonize.

Schema Harmonizer enforces structural and metadata consistency between datasets using a reference-driven approach.

It was built to solve a common but under-engineered problem:

Longitudinal datasets drift silently: variable names, labels, value categories, and missing-value definitions change from wave to wave.

Instead of manually rewriting metadata or generating thousands of lines of syntax, this utility performs deterministic schema harmonization in seconds.


📑 Table of Contents


  • 🚀 What It Does

  • 🧠 Why It Matters

  • ⚙ Architecture Overview

  • 📂 Project Structure

  • 🛠 Configuration Example

  • 📊 Example Scale Test

  • 🧾 Drift Reporting

  • 🔍 Rename Audit

  • 🧩 Why SPSS?

  • 🏗 Design Principles

  • 📌 Potential Extensions

  • 📎 Installation

  • 💡 Final Note

  • 📜 License


🚀 What It Does


🔹 Applies configurable variable renaming
🔹 Detects schema drift (reference-only / target-only variables)
🔹 Aligns metadata from reference dataset:

  • Variable labels

  • Value labels

  • Measure levels

  • Missing value definitions

🔹 Produces audit artifacts:

  • rename_report.json

  • drift_report.json

🔹 Writes harmonized output dataset

All in a reproducible, config-driven pipeline.
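
The metadata-alignment step above can be sketched as a small pure-Python function that copies reference-side metadata onto the variables both datasets share. This is an illustrative sketch only: `align_metadata` and its dict-based arguments are hypothetical, not the project's actual API, which works on pyreadstat metadata objects.

```python
def align_metadata(common_vars, ref_labels, ref_value_labels, ref_missing):
    """Copy reference-side metadata onto the variables shared by both datasets.

    Hypothetical sketch: the real engine reads these fields from .sav
    metadata rather than plain dicts.
    """
    return {
        "labels": {v: ref_labels[v] for v in common_vars if v in ref_labels},
        "value_labels": {v: ref_value_labels[v]
                         for v in common_vars if v in ref_value_labels},
        "missing": {v: ref_missing[v] for v in common_vars if v in ref_missing},
    }

aligned = align_metadata(
    common_vars=["caseid", "q1"],
    ref_labels={"caseid": "Case identifier", "q1": "Question 1", "q9": "Dropped"},
    ref_value_labels={"q1": {1.0: "Yes", 2.0: "No"}},
    ref_missing={"q1": [-9.0]},
)
# Reference-only variables like q9 are ignored; only common variables
# receive reference metadata.
```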


🧠 Why It Matters


Schema drift is one of the biggest risks in:

  • Longitudinal analytics

  • Survey research

  • Batch data processing

  • Legacy-to-modern migrations

  • Governance-heavy environments

Manual metadata correction:

  • Is error-prone

  • Is slow

  • Does not scale

Schema Harmonizer replaces that with:

✔ Deterministic alignment
✔ Clear drift visibility
✔ Execution traceability
✔ Config-based identity mapping

Consistency across time becomes engineered, not manual.


⚙ Architecture Overview


Reference Dataset (.sav)
        │
        ▼
Target Dataset (.sav)
        │
        ▼
Rename Layer (YAML-driven)
        │
        ▼
Drift Detection
        │
        ▼
Metadata Alignment Engine
        │
        ▼
Harmonized Output
        │
        ├── outputs/
        └── logs/

Separation of concerns:

  • data/ → Input datasets

  • outputs/ → Harmonized result

  • logs/ → Audit artifacts


📂 Project Structure


schema-harmonizer/
│
├── src/
│   ├── main.py
│   ├── rename.py
│   ├── metadata.py
│   ├── aligner.py
│   └── file_io.py
│
├── configs/
│   └── rename_mapping.yaml
│
├── data/
│   ├── reference.sav
│   └── target.sav
│
├── outputs/
│
├── logs/
│
└── README.md


🛠 Configuration Example


rename_mapping:
  id: caseid
  question_1: q1
  question_2: q2
  question_3: q3

strict_mode: false

Rename mapping aligns identity before metadata harmonization.

Strict mode optionally halts execution if schema drift is detected.
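
A defensive rename layer consuming this config might look like the sketch below. The function name and return shape are hypothetical; the key point is that missing rename sources are reported rather than silently ignored, and strict mode escalates them to an error.

```python
def apply_renames(columns, mapping, strict=False):
    """Apply a rename mapping defensively (illustrative sketch).

    Missing sources are collected for the audit report; strict mode
    halts instead of proceeding on drift.
    """
    missing = sorted(src for src in mapping if src not in columns)
    if strict and missing:
        raise ValueError(f"schema drift: rename sources not found: {missing}")
    renamed = [mapping.get(col, col) for col in columns]
    return renamed, missing

mapping = {"id": "caseid", "question_1": "q1",
           "question_2": "q2", "question_3": "q3"}
cols, missing = apply_renames(["id", "question_1", "question_3", "extra"],
                              mapping)
# cols    -> ["caseid", "q1", "q3", "extra"]
# missing -> ["question_2"]
```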


📊 Example Scale Test


Tested on:

  • 500 rows

  • 90+ variables

  • Drift on both reference and target sides

Execution time: < 1 second

Produces:

  • Harmonized .sav output

  • Rename audit report

  • Drift audit report

Deterministic. Reproducible. Fast.


🧾 Drift Reporting


Example drift_report.json:

{
  "total_reference_variables": 104,
  "total_target_variables": 99,
  "total_common": 91,
  "total_reference_only": 13,
  "total_target_only": 8
}

This makes schema evolution visible instead of silent.
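
The drift summary reduces to set arithmetic over the two column sets. A minimal sketch (function and argument names are illustrative, but the output keys mirror the report above):

```python
def drift_report(reference_cols, target_cols):
    """Summarize schema drift between two column sets (illustrative sketch)."""
    ref, tgt = set(reference_cols), set(target_cols)
    return {
        "total_reference_variables": len(ref),
        "total_target_variables": len(tgt),
        "total_common": len(ref & tgt),
        "total_reference_only": len(ref - tgt),
        "total_target_only": len(tgt - ref),
    }

report = drift_report(["caseid", "q1", "q2", "q9"], ["caseid", "q1", "q3"])
# -> {"total_reference_variables": 4, "total_target_variables": 3,
#     "total_common": 2, "total_reference_only": 2, "total_target_only": 1}
```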


🔍 Rename Audit


rename_report.json captures:

  • Requested renames

  • Successful renames

  • Missing rename sources

  • Final column count

This adds governance transparency to identity transformations.
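
A sketch of assembling that payload (the exact field names in the real rename_report.json may differ; these are hypothetical):

```python
import json

def build_rename_report(mapping, columns_before, columns_after):
    """Assemble a rename audit payload (field names are hypothetical)."""
    missing = [src for src in mapping if src not in columns_before]
    successful = [src for src in mapping if src in columns_before]
    return {
        "requested_renames": dict(mapping),
        "successful_renames": successful,
        "missing_sources": missing,
        "final_column_count": len(columns_after),
    }

report = build_rename_report(
    {"id": "caseid", "question_2": "q2"},
    columns_before=["id", "q_other"],
    columns_after=["caseid", "q_other"],
)
print(json.dumps(report, indent=2))  # serialize for the logs/ directory
```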


🧩 Why SPSS?


The current implementation uses .sav as the I/O layer via pyreadstat.

However, the architecture is not SPSS-dependent.

The harmonization logic operates at the:

  • DataFrame level

  • Metadata abstraction layer

This design allows adaptation to:

  • CSV-based pipelines

  • Parquet datasets

  • Database schema alignment

  • Feature store synchronization

  • Any structured tabular format

SPSS is the reference implementation, not a limitation.
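
One way to picture the metadata abstraction layer is a format-neutral container that any adapter can populate. The `TableMeta` class and the CSV-sidecar adapter below are hypothetical illustrations of the design, not code from the repository; the current implementation reads the equivalent fields from pyreadstat's .sav metadata.

```python
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    """Format-neutral metadata container (hypothetical abstraction)."""
    column_labels: dict = field(default_factory=dict)
    value_labels: dict = field(default_factory=dict)
    measure_levels: dict = field(default_factory=dict)
    missing_defs: dict = field(default_factory=dict)

def from_csv_sidecar(columns, sidecar):
    """Example adapter: build TableMeta for a CSV from a sidecar dict,
    falling back to the column name when no label is provided."""
    return TableMeta(
        column_labels={c: sidecar.get(c, {}).get("label", c) for c in columns},
    )

meta = from_csv_sidecar(["caseid", "q1"], {"q1": {"label": "Question 1"}})
```

With such a container, the harmonization logic never needs to know whether the columns came from .sav, CSV, or Parquet.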


🏗 Design Principles


✔ Config-driven
✔ Deterministic execution
✔ Defensive renaming
✔ Drift awareness
✔ Scalable structure
✔ Clean separation of inputs, outputs, and logs

No unnecessary abstraction.
No overengineering.
Focused utility.


📌 Potential Extensions


  • CLI interface

  • Format-agnostic adapters

  • Metadata change diff reporting

  • Automated longitudinal wave alignment

  • Integration with ETL orchestration tools


📎 Installation


pip install -r requirements.txt

▶ Run


python src/main.py

💡 Final Note


Schema Harmonizer was built to enforce longitudinal consistency through engineering rather than manual correction.

It demonstrates:

  • Analytics engineering discipline

  • Metadata governance awareness

  • System thinking applied to real-world data workflows

Consistency across datasets should not rely on memory.

It should be enforced by design.

📜 License

This project is licensed under the MIT License.
