Summary
Add an Input Matrix Validation & Data Diagnostics module (ct.validate_panel) to diagnose data quality issues before modeling. Recent experiments showed non-negative constraints can fail with poorly scaled or near-singular matrices; this module surfaces such issues early with actionable fixes.
Relation
Complements the upcoming PanelDataset proposal, which will use ct.validate_panel internally for preprocessing and data health checks.
Rationale
- Prevent failures: Detect ill-conditioned matrices, extreme scale differences, and rank deficiency that cause solver instability.
- Standardize preprocessing: Provide consistent scaling/imputation advice across SC, IFE, and matrix completion methods.
- Reproducibility: Gate CI pipelines on critical thresholds (e.g., κ > 10⁶).
Scope & Metrics
1. Shape & Panel Integrity
- Balanced vs. unbalanced panel; % missing per unit/time; block-missing detection.
- Pre/post-treatment window lengths; enforce method-specific minimums.
2. Missingness & Sparsity
- Overall missing rate; correlation of missingness with observables (MAR hints).
- Density = nnz / (N·T); block structure visualization; imputation readiness.
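The integrity and missingness metrics above can be sketched in a few lines of NumPy. This is only an illustration: the function name `missingness_summary` is hypothetical, and it assumes missing cells are encoded as NaN and that "nnz" means the number of observed (non-missing) cells.

```python
import numpy as np

def missingness_summary(O):
    """Basic missingness metrics for an N x T outcome matrix (NaN = missing)."""
    O = np.asarray(O, dtype=float)
    observed = ~np.isnan(O)
    return {
        "balanced": bool(observed.all()),                 # no missing cells at all
        "density": float(observed.mean()),                # observed cells / (N*T)
        "missing_rate": float(1.0 - observed.mean()),
        "missing_by_unit": 1.0 - observed.mean(axis=1),   # length N
        "missing_by_time": 1.0 - observed.mean(axis=0),   # length T
    }

# Toy panel: 3 units, 4 periods, 3 missing cells
O = np.array([[1.0, 2.0, np.nan, 4.0],
              [1.5, np.nan, np.nan, 3.5],
              [0.9, 1.1, 1.3, 1.5]])
summary = missingness_summary(O)
```

Block-missing detection and the MAR hints would need more than this (for example, run-length checks on `observed`), so they are left out of the sketch.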
3. Scaling & Outliers
- Per-column range, IQR, robust scale (MAD), z-score distributions.
- Scale ratio max(range_j) / min(range_j); flag if > 10²–10³.
- Outlier detection via robust z-scores (> 3.5 MAD) or leverage proxies.
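A rough sketch of the scale-ratio and robust z-score checks follows. The helper names are hypothetical, and the 0.6745 constant assumes the common "modified z-score" convention (median/MAD based, with a 3.5 cutoff), which may or may not be what the final implementation uses.

```python
import numpy as np

def scale_ratio(X):
    """max(range_j) / min(range_j) over columns with a non-zero range."""
    X = np.asarray(X, dtype=float)
    ranges = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    positive = ranges[ranges > 0]
    return float(positive.max() / positive.min()) if positive.size else float("nan")

def robust_z(x):
    """Modified z-score: 0.6745 * (x - median) / MAD (assumed convention)."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    if mad == 0:
        return np.zeros_like(x)  # constant-like series: nothing to flag
    return 0.6745 * (x - med) / mad

X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
ratio = scale_ratio(X)                      # column ranges are 2 and 2000

x = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 10.0])
outliers = np.abs(robust_z(x)) > 3.5        # boolean mask of flagged points
```

Using the median and MAD rather than the mean and standard deviation keeps a single extreme value from masking itself by inflating the scale estimate.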
4. Multicollinearity, Rank & Conditioning (I consider this one potentially critical)
- Numerical rank; condition number κ(X) = σ_max / σ_min (warn > 10³, critical > 10⁶).
- VIF-like diagnostics; correlation heatmaps; near-duplicate column detection.
- Coherence μ(U) = (n/r)·max_i ||U_i||² for matrix completion, where U_i is the i-th row of the n×r singular-vector basis U (high coherence signals spiky singular vectors).
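As a sketch of how the rank, conditioning, and coherence checks might be computed, the snippet below uses only `numpy.linalg.svd`. The function names and the returned dict layout are hypothetical; the warn/critical thresholds are the ones proposed above, and the rank tolerance is an assumption (the usual `s_max * max(shape) * eps` rule).

```python
import numpy as np

def conditioning_report(X, warn=1e3, critical=1e6):
    """Numerical rank and condition number kappa = sigma_max / sigma_min."""
    X = np.asarray(X, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)          # singular values, descending
    tol = s.max() * max(X.shape) * np.finfo(float).eps
    rank = int((s > tol).sum())
    kappa = float(s[0] / s[-1]) if s[-1] > 0 else float("inf")
    severity = "critical" if kappa > critical else "warn" if kappa > warn else "info"
    return {"rank": rank, "kappa": kappa, "severity": severity}

def coherence(X, r):
    """mu(U) = (n/r) * max_i ||U_i||^2 over the top-r left singular vectors."""
    U = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)[0][:, :r]
    n = U.shape[0]
    return float(n / r * (U ** 2).sum(axis=1).max())

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)
X_dup = np.column_stack([a, a, b])                  # exact duplicate column
report = conditioning_report(X_dup)

spiky = np.zeros((4, 4)); spiky[0, 0] = 1.0         # all mass in one row
flat = np.ones((4, 4))                              # mass spread evenly
mu_spiky, mu_flat = coherence(spiky, 1), coherence(flat, 1)
```

Coherence ranges from 1 (evenly spread rows) to n/r (a single row carries everything), so the two toy matrices sit at the two extremes.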
5. Pre-treatment Stability (Causal-Aware)
- Change-point detection in last 20% of pre-period (confounding risk).
- Pre-trend smoothness (2nd difference variance); guides factor rank selection.
- Treated vs. donor divergence (acceleration test for anticipation effects).
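The pre-trend smoothness metric can be illustrated as the variance of second differences of a unit's pre-treatment series. The helper name below is hypothetical and the sketch covers only this one metric, not change-point detection or the acceleration test.

```python
import numpy as np

def second_diff_variance(y_pre):
    """Variance of the second differences of a pre-treatment series."""
    y = np.asarray(y_pre, dtype=float)
    if y.size < 3:
        raise ValueError("Need at least 3 pre-period observations.")
    return float(np.var(np.diff(y, n=2)))

rng = np.random.default_rng(0)
t = np.arange(20.0)
smooth = 2.0 + 0.5 * t                                   # pure linear trend
noisy = smooth + rng.normal(scale=1.0, size=t.size)      # same trend plus noise

v_smooth = second_diff_variance(smooth)
v_noisy = second_diff_variance(noisy)
```

A linear trend has zero second differences, so a value near zero suggests a smooth pre-trend, while larger values point to curvature or noise that a low factor rank may not capture.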
6. Treatment & Design Checks
- Overlap diagnostics; pre-trend similarity between treated and donors.
- Donor weight entropy (SC degeneracy: 1 donor > 50% weight).
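One possible way to compute the donor weight entropy check is sketched below. The function name and output keys are hypothetical, the 50% dominance cutoff comes from the bullet above, and normalizing the entropy by log(number of donors) is an added assumption to make values comparable across donor pools.

```python
import numpy as np

def donor_weight_diagnostics(w, dominance=0.5):
    """Shannon entropy of synthetic-control donor weights plus a dominance flag."""
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("Weights must be non-negative with a positive sum.")
    w = w / w.sum()                                   # normalize to sum to 1
    nz = w[w > 0]                                     # 0 * log(0) is treated as 0
    entropy = float(-(nz * np.log(nz)).sum())
    max_entropy = np.log(w.size) if w.size > 1 else 1.0
    return {
        "entropy": entropy,
        "normalized_entropy": float(entropy / max_entropy),
        "max_weight": float(w.max()),
        "degenerate": bool(w.max() > dominance),      # one donor dominates
    }

uniform = donor_weight_diagnostics([0.25, 0.25, 0.25, 0.25])
concentrated = donor_weight_diagnostics([0.9, 0.05, 0.05, 0.0])
```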
7. Constraint Sensitivity
- Simulate re-scaling impact on κ(X) for non-negative constraint methods.
- "What-if" probe: estimate stability gain from column standardization.
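The "what-if" probe could be as simple as comparing κ(X) before and after column rescaling. In the sketch below the names are hypothetical, and scaling without centering is the default on the assumption that non-negative constraint methods need the sign of the data preserved; pass `center=True` for full standardization.

```python
import numpy as np

def condition_number(X):
    """kappa(X) = sigma_max / sigma_min from the singular values."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(s[0] / s[-1]) if s[-1] > 0 else float("inf")

def rescaling_probe(X, center=False):
    """Estimate how much column rescaling would reduce kappa(X)."""
    X = np.asarray(X, dtype=float)
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0                      # leave constant columns alone
    Xs = (X - X.mean(axis=0)) / scale if center else X / scale
    before, after = condition_number(X), condition_number(Xs)
    return {"kappa_before": before, "kappa_after": after, "gain": before / after}

# Two non-negative columns whose scales differ by four orders of magnitude
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 2))) * np.array([1.0, 1e4])
probe = rescaling_probe(X)
```

A large `gain` suggests the ill-conditioning is mostly a units problem that rescaling can fix, whereas a gain near 1 points to genuine collinearity.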
Outputs
ValidationReport object with:
.summary(): Dict of key metrics (programmatic use).
.to_markdown(): Formatted report for logs/issues.
.plot(): Missingness heatmap, scree plot, scale comparison bars, ACF.
.suggestions: Prioritized fixes with severity (info/warn/critical):
- "Standardize columns A, B (scale ratio 1.2e3)"
- "κ=4.7e6: add ridge λ=0.01 or drop column D"
- "Unit 7: 80% pre-period missing → exclude or impute with low-rank k=2"
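To make the report shape concrete, here is a minimal sketch of what a `ValidationReport` could look like. Everything beyond the method names listed above is an assumption: the `Suggestion` class, the `add` method, and the plain-list representation of `.suggestions` are illustrative only, and `.plot()` is omitted.

```python
from dataclasses import dataclass, field

SEVERITY_ORDER = {"critical": 0, "warn": 1, "info": 2}

@dataclass
class Suggestion:
    severity: str   # "info", "warn", or "critical"
    message: str

@dataclass
class ValidationReport:
    metrics: dict = field(default_factory=dict)
    suggestions: list = field(default_factory=list)

    def add(self, severity, message):
        """Record a suggestion and keep the list ordered by severity."""
        if severity not in SEVERITY_ORDER:
            raise ValueError(f"Unknown severity: {severity!r}")
        self.suggestions.append(Suggestion(severity, message))
        self.suggestions.sort(key=lambda s: SEVERITY_ORDER[s.severity])

    def has_critical(self):
        return any(s.severity == "critical" for s in self.suggestions)

    def summary(self):
        """Key metrics as a plain dict for programmatic use."""
        return dict(self.metrics)

    def to_markdown(self):
        """Render suggestions as a markdown table for logs or issues."""
        lines = ["| severity | suggestion |", "| --- | --- |"]
        lines += [f"| {s.severity} | {s.message} |" for s in self.suggestions]
        return "\n".join(lines)

report = ValidationReport(metrics={"kappa": 4.7e6, "scale_ratio": 1.2e3})
report.add("warn", "Standardize columns A, B (scale ratio 1.2e3)")
report.add("critical", "κ=4.7e6: add ridge λ=0.01 or drop column D")
```

Sorting on insert keeps critical items first, which matches the "prioritized fixes" wording; a richer suggestions container would be needed to support something like `.apply_to()` later.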
Minimal API
report = ct.validate_panel(
    O,               # N×T outcome matrix or long-form DataFrame
    Z=None,          # N×T treatment matrix
    X=None,          # Optional covariates
    unit_ids=None,
    time_ids=None,
    options=ct.ValidationOptions(
        min_pre_period=10,
        compute_condition_number=True,
        coherence_check=True,
        outlier_method="robust_z",
        imputation_advice=True,
    ),
)
# A long-form DataFrame can be passed instead:
| unit_id | time_id | y | z | x1 | x2 |
| ------- | ------- | ---- | - | --- | --- |
| A | 2020 | 12.1 | 0 | 3.1 | 0.7 |
| A | 2021 | 13.2 | 1 | 3.3 | 0.8 |
| B | 2020 | 9.7 | 0 | 1.8 | 0.5 |
| B | 2021 | 11.0 | 0 | 2.0 | 0.6 |
report = ct.validate_panel(
    O=df,
    unit_ids="unit_id",
    time_ids="time_id",
    outcome_col="y",
    treat_col="z",
    covar_cols=["x1", "x2"],
    options=ct.ValidationOptions(
        compute_condition_number=True,
        imputation_advice=True,
    ),
)
# Inside validate_panel, we can do something like this (or, better, implement a class that handles the dataset, as in the other issue):
import numpy as np
import pandas as pd

def validate_panel(O, Z=None, X=None, unit_ids=None, time_ids=None,
                   outcome_col=None, treat_col=None, covar_cols=None, options=None):
    if isinstance(O, pd.DataFrame):
        if unit_ids is None or time_ids is None:
            raise ValueError("Please specify unit_ids and time_ids when passing a long DataFrame.")
        df = O.copy()
        y_col = outcome_col or "y"
        z_col = treat_col or "z"
        x_cols = covar_cols or []
        # Pivot to wide N×T matrices (rows = units, columns = time periods)
        O = df.pivot(index=unit_ids, columns=time_ids, values=y_col).to_numpy()
        Z = df.pivot(index=unit_ids, columns=time_ids, values=z_col).to_numpy() if z_col in df.columns else None
        # Covariates are stacked into a K×N×T array, one N×T slice per covariate
        X = np.stack([df.pivot(index=unit_ids, columns=time_ids, values=c).to_numpy() for c in x_cols]) if x_cols else None
    # Matrix inputs fall through unchanged; _validate_matrices is the internal checker (not shown)
    return _validate_matrices(O, Z, X, options)
# Check severity
if report.has_critical():
    print(report.to_markdown())
    raise ValueError("Critical data issues detected")
# Apply suggested fixes (Phase 3 feature)
report.suggestions.apply_to(O, X)
Implementation Phases
Phase 1 (MVP): Points 1–4 (shape, missingness, scaling, rank/conditioning).
Phase 2: Points 5–6 (pre-treatment + treatment design checks).
Phase 3: Auto-transformer generation (.get_transformers(), .apply_to()).
Integration
- Run automatically in example notebooks before fitting.
- Unit tests verify thresholds on synthetic ill-conditioned data.
- CI fails on critical severity (e.g., κ > 10⁶, pre-period < 5).
Next Steps