Proposal: Input Matrix Validation & Data Diagnostics Module (validate_panel) #24

@fedemolina

Summary

Add an Input Matrix Validation & Data Diagnostics module (ct.validate_panel) that diagnoses data-quality issues before modeling. Recent experiments showed that methods with non-negativity constraints can fail on poorly scaled or near-singular matrices; this module surfaces such issues early and suggests actionable fixes.


Relation

Complements the upcoming PanelDataset proposal, which will use ct.validate_panel internally for preprocessing and data health checks.

Rationale

  • Prevent failures: Detect ill-conditioned matrices, extreme scale differences, and rank deficiency that cause solver instability.
  • Standardize preprocessing: Provide consistent scaling/imputation advice across SC, IFE, and matrix completion methods.
  • Reproducibility: Gate CI pipelines on critical thresholds (e.g., κ > 10⁶).

Scope & Metrics

1. Shape & Panel Integrity

  • Balanced vs. unbalanced panel; % missing per unit/time; block-missing detection.
  • Pre/post-treatment window lengths; enforce method-specific minimums.
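A minimal sketch of the integrity checks above, assuming the outcome matrix is N×T with NaN marking missing cells and Z is a 0/1 treatment matrix without NaNs. The function name `panel_integrity` and the returned keys are hypothetical, not part of the proposed API.

```python
import numpy as np

def panel_integrity(O, Z=None):
    """Summarize missingness and pre-period length for an N x T outcome matrix.

    Assumption: NaN marks a missing cell; Z (optional) is an N x T 0/1 matrix.
    """
    O = np.asarray(O, dtype=float)
    missing = np.isnan(O)
    out = {
        "balanced": not missing.any(),
        "missing_by_unit": missing.mean(axis=1),  # share of missing periods per unit
        "missing_by_time": missing.mean(axis=0),  # share of missing units per period
    }
    if Z is not None:
        treated_any = np.asarray(Z).astype(bool).any(axis=0)
        # Pre-period length = number of periods before the first treated period.
        if treated_any.any():
            out["pre_period_length"] = int(np.argmax(treated_any))
        else:
            out["pre_period_length"] = O.shape[1]
    return out
```

A method-specific minimum could then be enforced by comparing `pre_period_length` against `min_pre_period`.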

2. Missingness & Sparsity

  • Overall missing rate; correlation of missingness with observables (MAR hints).
  • Density = nnz / (N·T); block structure visualization; imputation readiness.
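As a rough illustration of the density metric and a simple block-missing proxy, the sketch below treats nnz as the number of observed (non-NaN) cells, which is an assumption about the intended definition. `sparsity_summary` and `longest_missing_run` are hypothetical helper names.

```python
import numpy as np

def longest_missing_run(row_mask):
    """Length of the longest consecutive run of True values in a 1-D boolean mask."""
    best = current = 0
    for flag in row_mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def sparsity_summary(O):
    """Density = observed cells / (N * T), plus the longest gap per unit."""
    O = np.asarray(O, dtype=float)
    observed = ~np.isnan(O)
    n_units, n_periods = O.shape
    return {
        "density": float(observed.sum() / (n_units * n_periods)),
        "missing_rate": float(1.0 - observed.mean()),
        # A long consecutive gap hints at block missingness rather than scattered holes.
        "longest_gap_by_unit": [longest_missing_run(~row) for row in observed],
    }
```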

3. Scaling & Outliers

  • Per-column range, IQR, robust scale (MAD), z-score distributions.
  • Scale ratio max(range_j) / min(range_j); flag if > 10²–10³.
  • Outlier detection via robust z-scores (flag points more than 3.5 MAD-scaled units from the median) or leverage proxies.
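The scale-ratio and robust z-score checks could look roughly like the sketch below. The 0.6745 constant is the usual modified z-score scaling that pairs with the 3.5 cutoff; using it here is an assumption about what the proposal intends, and the function names are hypothetical. Columns with zero range or zero MAD would need separate handling, which this sketch omits.

```python
import numpy as np

def scale_ratio(X):
    """max(range_j) / min(range_j) over columns, ignoring NaNs."""
    X = np.asarray(X, dtype=float)
    ranges = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    return float(ranges.max() / ranges.min())

def robust_z(x):
    """Modified z-score: 0.6745 * (x - median) / MAD, ignoring NaNs."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    return 0.6745 * (x - med) / mad

def outlier_mask(x, threshold=3.5):
    """True where the absolute robust z-score exceeds the threshold."""
    return np.abs(robust_z(x)) > threshold
```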

4. Multicollinearity, Rank & Conditioning (likely the most critical group)

  • Numerical rank; condition number κ(X) = σ_max / σ_min (warn > 10³, critical > 10⁶).
  • VIF-like diagnostics; correlation heatmaps; near-duplicate column detection.
  • Coherence μ(U) = (n/r)·max_i ‖U_i‖² for matrix completion, where U_i is the i-th row of the n×r left singular basis U (warn when singular vectors are spiky).
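A sketch of how the conditioning and coherence metrics might be computed with NumPy's SVD, using the warn/critical thresholds stated above. The function names and the returned dictionary are illustrative assumptions, and the inputs are assumed to contain no NaNs.

```python
import numpy as np

def conditioning_report(X, warn=1e3, critical=1e6):
    """Condition number, numerical rank, and a severity label for a matrix."""
    X = np.asarray(X, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    kappa = float(s[0] / s[-1]) if s[-1] > 0 else float("inf")
    rank = int(np.linalg.matrix_rank(X))
    if kappa > critical:
        severity = "critical"
    elif kappa > warn:
        severity = "warn"
    else:
        severity = "info"
    return {"kappa": kappa, "rank": rank, "severity": severity}

def coherence(X, r):
    """mu(U) = (n / r) * max_i ||U_i||^2 for the top-r left singular vectors of X."""
    U, _, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    U = U[:, :r]
    n = U.shape[0]
    return float((n / r) * np.max(np.sum(U**2, axis=1)))
```

Coherence ranges from 1 (mass spread evenly across rows) to n/r (all mass on a few rows), so values near n/r indicate spiky singular vectors.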

5. Pre-treatment Stability (Causal-Aware)

  • Change-point detection in last 20% of pre-period (confounding risk).
  • Pre-trend smoothness (2nd difference variance); guides factor rank selection.
  • Treated vs. donor divergence (acceleration test for anticipation effects).
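For the pre-trend smoothness metric, one possible reading is the variance of the second differences of a unit's pre-period series, which is zero for a perfectly linear trend. The helper name below is hypothetical, and the series is assumed to be complete (no NaNs).

```python
import numpy as np

def second_difference_variance(y_pre):
    """Variance of the second differences of a pre-treatment series."""
    d2 = np.diff(np.asarray(y_pre, dtype=float), n=2)
    return float(np.var(d2))

def last_window(y_pre, share=0.2):
    """Final `share` of the pre-period, e.g. the last 20% used for change-point checks."""
    y_pre = np.asarray(y_pre, dtype=float)
    k = max(1, int(round(share * len(y_pre))))
    return y_pre[-k:]
```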

6. Treatment & Design Checks

  • Overlap diagnostics; pre-trend similarity between treated and donors.
  • Donor weight entropy (SC degeneracy: 1 donor > 50% weight).
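The donor-weight checks can be sketched in plain Python as Shannon entropy plus a maximum-share test. The 50% cutoff comes from the bullet above; the function names are hypothetical, and weights are assumed non-negative with a positive sum.

```python
import math

def weight_entropy(weights):
    """Shannon entropy (natural log) of normalized, non-negative donor weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def is_degenerate(weights, max_share=0.5):
    """True if a single donor carries more than `max_share` of the total weight."""
    return max(weights) / sum(weights) > max_share
```

Entropy near zero means the synthetic control leans on very few donors; the maximum over J donors is log(J), reached with equal weights.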

7. Constraint Sensitivity

  • Simulate re-scaling impact on κ(X) for non-negative constraint methods.
  • "What-if" probe: estimate stability gain from column standardization.
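The "what-if" probe could be approximated by comparing κ(X) before and after column standardization, as in this sketch. The function names are hypothetical, and the code assumes no NaNs and no constant columns (a zero standard deviation would divide by zero).

```python
import numpy as np

def kappa(X):
    """Condition number sigma_max / sigma_min via SVD."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(s[0] / s[-1])

def standardization_gain(X):
    """Return (kappa before, kappa after) centering and scaling each column."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return kappa(X), kappa(Xs)

# Synthetic example: three independent columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3)) * np.array([1.0, 1e3, 1e6])
before, after = standardization_gain(X_demo)
```

A large drop from `before` to `after` suggests the conditioning problem is mostly a scaling issue rather than genuine collinearity.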

Outputs

ValidationReport object with:

  • .summary(): Dict of key metrics (programmatic use).
  • .to_markdown(): Formatted report for logs/issues.
  • .plot(): Missingness heatmap, scree plot, scale comparison bars, ACF.
  • .suggestions: Prioritized fixes with severity (info/warn/critical):
    • "Standardize columns A, B (scale ratio 1.2e3)"
    • "κ=4.7e6: add ridge λ=0.01 or drop column D"
    • "Unit 7: 80% pre-period missing → exclude or impute with low-rank k=2"
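To make the output contract concrete, here is a deliberately small sketch of what a ValidationReport might hold. It is an assumption, not the proposed implementation: `Suggestion`, `metrics`, and `SEVERITY_ORDER` are hypothetical names, `.plot()` is omitted, and `suggestions` is a plain list here, whereas the proposal implies a richer object.

```python
from dataclasses import dataclass, field

SEVERITY_ORDER = {"info": 0, "warn": 1, "critical": 2}

@dataclass
class Suggestion:
    severity: str  # one of "info", "warn", "critical"
    message: str

@dataclass
class ValidationReport:
    metrics: dict = field(default_factory=dict)
    suggestions: list = field(default_factory=list)

    def summary(self):
        """Key metrics as a plain dict for programmatic use."""
        return dict(self.metrics)

    def has_critical(self):
        return any(s.severity == "critical" for s in self.suggestions)

    def to_markdown(self):
        """Markdown table of suggestions, most severe first."""
        ranked = sorted(self.suggestions, key=lambda s: -SEVERITY_ORDER[s.severity])
        lines = ["| severity | suggestion |", "| --- | --- |"]
        lines += [f"| {s.severity} | {s.message} |" for s in ranked]
        return "\n".join(lines)
```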

Minimal API

report = ct.validate_panel(
    O,                # N×T outcome matrix or long-form DataFrame
    Z=None,           # N×T treatment matrix
    X=None,           # Optional covariates
    unit_ids=None,
    time_ids=None,
    options=ct.ValidationOptions(
        min_pre_period=10,
        compute_condition_number=True,
        coherence_check=True,
        outlier_method="robust_z",
        imputation_advice=True
    )
)

# Alternatively, pass a long-form DataFrame such as:

| unit_id | time_id | y    | z | x1  | x2  |
| ------- | ------- | ---- | - | --- | --- |
| A       | 2020    | 12.1 | 0 | 3.1 | 0.7 |
| A       | 2021    | 13.2 | 1 | 3.3 | 0.8 |
| B       | 2020    | 9.7  | 0 | 1.8 | 0.5 |
| B       | 2021    | 11.0 | 0 | 2.0 | 0.6 |

report = ct.validate_panel(
    O=df,
    unit_ids="unit_id",
    time_ids="time_id",
    outcome_col="y",
    treat_col="z",
    covar_cols=["x1", "x2"],
    options=ct.ValidationOptions(
        compute_condition_number=True,
        imputation_advice=True
    )
)

# Inside validate_panel, we could do something like this (or, better, implement a class that handles the dataset, as in the other issue):

import numpy as np
import pandas as pd

def validate_panel(O, Z=None, X=None, unit_ids=None, time_ids=None,
                   outcome_col=None, treat_col=None, covar_cols=None, options=None):
    if isinstance(O, pd.DataFrame):
        if unit_ids is None or time_ids is None:
            raise ValueError("Please specify unit_ids and time_ids when passing a long DataFrame.")

        df = O  # pivot does not mutate its input, so no copy is needed
        y_col = outcome_col or "y"
        z_col = treat_col or "z"
        x_cols = list(covar_cols or [])

        def to_wide(col):
            # Pivot one long column to an N×T array (rows: units, columns: periods).
            # DataFrame.pivot raises ValueError on duplicate (unit, time) pairs.
            return df.pivot(index=unit_ids, columns=time_ids, values=col).to_numpy()

        O = to_wide(y_col)
        Z = to_wide(z_col) if z_col in df.columns else None
        # Covariates are stacked into a K×N×T array.
        X = np.stack([to_wide(c) for c in x_cols]) if x_cols else None

    # _validate_matrices is the internal matrix-level checker (not shown here).
    return _validate_matrices(O, Z, X, options)



# Check severity
if report.has_critical():
    print(report.to_markdown())
    raise ValueError("Critical data issues detected")

# Apply suggested fixes (Phase 3 feature)
report.suggestions.apply_to(O, X)

Implementation Phases

Phase 1 (MVP): Points 1–4 (shape, missingness, scaling, rank/conditioning).
Phase 2: Points 5–6 (pre-treatment + treatment design checks).
Phase 3: Auto-transformer generation (.get_transformers(), .apply_to()).


Integration

  • Run automatically in example notebooks before fitting.
  • Unit tests verify thresholds on synthetic ill-conditioned data.
  • CI fails on critical severity (e.g., κ > 10⁶, pre-period < 5).
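A threshold test on synthetic ill-conditioned data might look like the pytest-style sketch below. It checks the κ > 10⁶ rule directly with `np.linalg.cond` rather than through `ct.validate_panel`, since that API does not exist yet; `make_ill_conditioned` is a hypothetical fixture helper.

```python
import numpy as np

def make_ill_conditioned(n=100, eps=1e-8, seed=0):
    """Three columns where the third nearly duplicates the first."""
    rng = np.random.default_rng(seed)
    a, b, noise = rng.normal(size=(3, n))
    return np.column_stack([a, b, a + eps * noise])

def test_near_duplicate_column_is_critical():
    X = make_ill_conditioned()
    assert np.linalg.cond(X) > 1e6  # critical threshold from the proposal

def test_independent_columns_are_fine():
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    assert np.linalg.cond(X) < 1e3  # below the warn threshold
```

Once the module exists, the same fixtures could assert on `report.has_critical()` instead, and a CI step could fail on any critical finding.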

Next Steps

  • Approve module name and API signature
  • Implement Phase 1 (Shape / Missingness / Scaling / Condition)
  • Add synthetic tests and example notebook
  • Integrate with PanelDataset
