Proposal: Input Matrix Validation & Data Diagnostics Module (validate_panel) #24

## Summary
Add an Input Matrix Validation & Data Diagnostics module (`ct.validate_panel`) to diagnose data quality issues before modeling. Recent experiments showed that methods with non-negativity constraints can fail on poorly scaled or near-singular matrices; this module surfaces such issues early and pairs each one with an actionable fix.

---

## Relation
Complements the upcoming `PanelDataset` proposal, which will use `ct.validate_panel` internally for preprocessing and data health checks.


## Rationale
- **Prevent failures**: Detect ill-conditioned matrices, extreme scale differences, and rank deficiency that cause solver instability.
- **Standardize preprocessing**: Provide consistent scaling/imputation advice across SC, IFE, and matrix completion methods.
- **Reproducibility**: Gate CI pipelines on critical thresholds (e.g., κ > 10⁶).

---

## Scope & Metrics

**1. Shape & Panel Integrity**
- Balanced vs. unbalanced panel; % missing per unit/time; block-missing detection.
- Pre/post-treatment window lengths; enforce method-specific minimums.
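
As a rough illustration of the shape checks above, the sketch below works on an N×T outcome matrix with `NaN` for missing cells and a 0/1 treatment matrix. The function name `panel_integrity`, its return keys, and the assumption that treatment is absorbing (once on, it stays on) are illustrative choices, not part of the proposed API.

```python
import numpy as np

def panel_integrity(O, Z, min_pre_period=10):
    """Basic shape checks for an N x T outcome matrix O and 0/1 treatment matrix Z."""
    O = np.asarray(O, dtype=float)
    Z = np.asarray(Z)
    missing = np.isnan(O)
    treated = Z.any(axis=1)
    # Index of the first treated period per unit; never-treated units get T,
    # i.e. the whole panel counts as pre-period (assumes absorbing treatment).
    first_treated = np.where(treated, Z.argmax(axis=1), O.shape[1])
    return {
        "balanced": not missing.any(),
        "missing_by_unit": missing.mean(axis=1),
        "missing_by_time": missing.mean(axis=0),
        "pre_period_length": first_treated,
        # Treated units whose pre-period is shorter than the required minimum.
        "short_pre_period_units": np.flatnonzero(
            treated & (first_treated < min_pre_period)
        ),
    }
```

Method-specific minimums could then be enforced by calling this with a different `min_pre_period` per estimator.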

**2. Missingness & Sparsity**
- Overall missing rate; correlation of missingness with observables (MAR hints).
- Density = nnz / (N·T); block structure visualization; imputation readiness.
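
The missingness metrics could be computed along these lines. This is a sketch under assumptions: `nnz` is read here as the number of observed (non-`NaN`) cells, the longest run of consecutive missing periods stands in as a crude block-missing indicator, and the names `missingness_summary` and `unit_covariate` are hypothetical.

```python
import numpy as np

def longest_missing_run(mask_row):
    """Length of the longest run of consecutive True values in a 1-D boolean mask."""
    longest = run = 0
    for is_missing in mask_row:
        run = run + 1 if is_missing else 0
        longest = max(longest, run)
    return longest

def missingness_summary(O, unit_covariate=None):
    """Overall missing rate, density, and a per-unit block-missing proxy."""
    O = np.asarray(O, dtype=float)
    missing = np.isnan(O)
    out = {
        "missing_rate": float(missing.mean()),
        # Density = observed cells / (N * T)
        "density": float((~missing).sum() / missing.size),
        "longest_gap_by_unit": [longest_missing_run(row) for row in missing],
    }
    if unit_covariate is not None:
        # MAR hint: does a unit's missing share co-move with an observable?
        share = missing.mean(axis=1)
        out["missing_vs_covariate_corr"] = float(
            np.corrcoef(share, unit_covariate)[0, 1]
        )
    return out
```

A correlation far from zero only hints that missingness depends on observables; it is not a formal test.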

**3. Scaling & Outliers**
- Per-column range, IQR, robust scale (MAD), z-score distributions.
- Scale ratio max(range_j) / min(range_j); flag if > 10²–10³.
- Outlier detection via robust z-scores (> 3.5 MAD) or leverage proxies.
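
A minimal sketch of the scale-ratio and robust z-score checks, assuming columns are the variables being compared. The function name and return keys are hypothetical. The 0.6745 factor is the usual constant that makes a MAD-based score comparable to a standard z-score under normality, which is the convention the 3.5 cutoff is normally paired with.

```python
import numpy as np

def scale_diagnostics(X, ratio_threshold=1e2, z_threshold=3.5):
    """Column scale ratio and MAD-based outlier flags for a 2-D array X."""
    X = np.asarray(X, dtype=float)
    col_range = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    median = np.nanmedian(X, axis=0)
    mad = np.nanmedian(np.abs(X - median), axis=0)
    # Constant columns give range 0 or MAD 0; suppress the resulting warnings
    # and let inf/nan propagate (nan never exceeds the threshold).
    with np.errstate(divide="ignore", invalid="ignore"):
        robust_z = 0.6745 * (X - median) / mad
        scale_ratio = col_range.max() / col_range.min()
    return {
        "scale_ratio": scale_ratio,
        "scale_flag": bool(scale_ratio > ratio_threshold),
        # (row, column) index pairs of flagged cells
        "outliers": np.argwhere(np.abs(robust_z) > z_threshold),
    }
```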

**4. Multicollinearity, Rank & Conditioning** (probably the most critical group of checks)
- Numerical rank; condition number κ(X) = σ_max / σ_min (warn > 10³, critical > 10⁶).
- VIF-like diagnostics; correlation heatmaps; near-duplicate column detection.
- Coherence μ(U) = (n/r)·max ||U_i||² for matrix completion (spiky vector warning).
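
The rank, condition number, and coherence metrics can all be read off a single SVD. The sketch below assumes a dense matrix without missing values; `conditioning_report` and its return keys are hypothetical names, and the rank tolerance mirrors the default used by `numpy.linalg.matrix_rank`.

```python
import numpy as np

def conditioning_report(X, warn=1e3, critical=1e6):
    """Numerical rank, condition number, and coherence from one SVD of X."""
    X = np.asarray(X, dtype=float)
    U, s, _ = np.linalg.svd(X, full_matrices=False)  # s is sorted descending
    tol = s[0] * max(X.shape) * np.finfo(float).eps
    rank = int((s > tol).sum())
    kappa = np.inf if s[-1] == 0 else s[0] / s[-1]
    if kappa > critical:
        severity = "critical"
    elif kappa > warn:
        severity = "warn"
    else:
        severity = "info"
    # Coherence of the rank-r left singular subspace: (n / r) * max_i ||U_i||^2
    n = X.shape[0]
    if rank > 0:
        coherence = (n / rank) * (U[:, :rank] ** 2).sum(axis=1).max()
    else:
        coherence = np.nan
    return {"rank": rank, "kappa": kappa, "severity": severity,
            "coherence": coherence}
```

Coherence ranges from 1 (mass spread evenly across rows) to n/r (mass concentrated on a few rows), so values near the upper end would trigger the spiky-vector warning.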

**5. Pre-treatment Stability (Causal-Aware)**
- Change-point detection in last 20% of pre-period (confounding risk).
- Pre-trend smoothness (2nd difference variance); guides factor rank selection.
- Treated vs. donor divergence (acceleration test for anticipation effects).
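
Two of the stability checks above reduce to second differences along time. The sketch below is one possible reading, not a settled design: `pretrend_roughness` is the per-unit variance of second differences, and `late_gap_acceleration` averages the second difference of the treated-minus-donor-mean gap over the last 20% of the pre-period. Both function names and the `tail_share` parameter are hypothetical.

```python
import numpy as np

def pretrend_roughness(O_pre):
    """Variance of second differences along time, one value per unit (row)."""
    O_pre = np.asarray(O_pre, dtype=float)
    d2 = np.diff(O_pre, n=2, axis=1)
    return np.nanvar(d2, axis=1)

def late_gap_acceleration(treated_pre, donors_pre, tail_share=0.2):
    """Mean second difference of the treated-minus-donor gap in the late pre-period."""
    treated_pre = np.asarray(treated_pre, dtype=float)
    donors_pre = np.asarray(donors_pre, dtype=float)
    gap = treated_pre - np.nanmean(donors_pre, axis=0)
    d2 = np.diff(gap, n=2)
    k = max(1, int(np.ceil(tail_share * d2.size)))  # size of the late window
    return float(np.nanmean(d2[-k:]))
```

A perfectly linear pre-trend has roughness 0, and a gap that grows quadratically shows constant positive acceleration, which would be a hint of anticipation effects.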

**6. Treatment & Design Checks**
- Overlap diagnostics; pre-trend similarity between treated and donors.
- Donor weight entropy (SC degeneracy: 1 donor > 50% weight).
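
Donor weight entropy can be sketched as follows, assuming the weights are non-negative and are normalized to sum to one inside the function. The name `donor_weight_diagnostics` and the "effective donors" measure (the exponential of the entropy) are additions for illustration.

```python
import numpy as np

def donor_weight_diagnostics(w, dominance_threshold=0.5):
    """Shannon entropy of synthetic-control donor weights and a degeneracy flag."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    nz = w[w > 0]  # 0 * log(0) is treated as 0
    entropy = float(-(nz * np.log(nz)).sum())
    return {
        "entropy": entropy,
        # 1.0 means weights are uniform, 0.0 means a single donor carries everything
        "normalized_entropy": entropy / np.log(w.size) if w.size > 1 else 0.0,
        "effective_donors": float(np.exp(entropy)),
        "degenerate": bool(w.max() > dominance_threshold),
    }
```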

**7. Constraint Sensitivity**
- Simulate re-scaling impact on κ(X) for non-negative constraint methods.
- "What-if" probe: estimate stability gain from column standardization.
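
The "what-if" probe could be as simple as comparing κ(X) before and after rescaling. In this sketch the function name and the `center` flag are hypothetical; centering is off by default on the assumption that non-negative methods want the data to keep its sign.

```python
import numpy as np

def standardization_gain(X, center=False):
    """Compare the condition number of X before and after column rescaling."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # leave constant columns untouched
    Xs = (X - X.mean(axis=0)) / sd if center else X / sd
    before = np.linalg.cond(X)
    after = np.linalg.cond(Xs)
    return {"kappa_before": before, "kappa_after": after,
            "gain": before / after}
```

A large `gain` would back a "standardize columns" suggestion, while a gain near 1 indicates the ill-conditioning comes from collinearity rather than scale.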

---

## Outputs
`ValidationReport` object with:
- **`.summary()`**: Dict of key metrics (programmatic use).
- **`.to_markdown()`**: Formatted report for logs/issues.
- **`.plot()`**: Missingness heatmap, scree plot, scale comparison bars, ACF.
- **`.suggestions`**: Prioritized fixes with severity (info/warn/critical):
  - *"Standardize columns A, B (scale ratio 1.2e3)"*
  - *"κ=4.7e6: add ridge λ=0.01 or drop column D"*
  - *"Unit 7: 80% pre-period missing → exclude or impute with low-rank k=2"*
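
One way the report object might be laid out is sketched below. Everything here beyond the names `.summary()`, `.to_markdown()`, `.suggestions`, and the three severity levels is an assumption; in particular, `.suggestions` is a plain list in this sketch and `.plot()` is left out.

```python
from dataclasses import dataclass, field

# Lower number = shown first in the report
SEVERITY_ORDER = {"critical": 0, "warn": 1, "info": 2}

@dataclass
class Suggestion:
    severity: str  # "info", "warn", or "critical"
    message: str

@dataclass
class ValidationReport:
    metrics: dict = field(default_factory=dict)
    suggestions: list = field(default_factory=list)

    def summary(self):
        """Key metrics as a plain dict for programmatic use."""
        return dict(self.metrics)

    def has_critical(self):
        return any(s.severity == "critical" for s in self.suggestions)

    def to_markdown(self):
        """Suggestions as a Markdown table, most severe first."""
        ordered = sorted(self.suggestions,
                         key=lambda s: SEVERITY_ORDER[s.severity])
        lines = ["| severity | suggestion |", "| --- | --- |"]
        lines += [f"| {s.severity} | {s.message} |" for s in ordered]
        return "\n".join(lines)
```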

---

## Minimal API
```python
report = ct.validate_panel(
    O,                # N×T outcome matrix or long-form DataFrame
    Z=None,           # N×T treatment matrix
    X=None,           # Optional covariates
    unit_ids=None,
    time_ids=None,
    options=ct.ValidationOptions(
        min_pre_period=10,
        compute_condition_number=True,
        coherence_check=True,
        outlier_method="robust_z",
        imputation_advice=True,
    ),
)

# A long-form DataFrame can be passed instead:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit_id": ["A", "A", "B", "B"],
    "time_id": [2020, 2021, 2020, 2021],
    "y": [12.1, 13.2, 9.7, 11.0],
    "z": [0, 1, 0, 0],
    "x1": [3.1, 3.3, 1.8, 2.0],
    "x2": [0.7, 0.8, 0.5, 0.6],
})

report = ct.validate_panel(
    O=df,
    unit_ids="unit_id",
    time_ids="time_id",
    outcome_col="y",
    treat_col="z",
    covar_cols=["x1", "x2"],
    options=ct.ValidationOptions(
        compute_condition_number=True,
        imputation_advice=True
    )
)

# Inside validate_panel, something like the following could handle long-form input
# (or, better, a class that handles the dataset, as in the other issue):

def validate_panel(O, Z=None, X=None, unit_ids=None, time_ids=None,
                   outcome_col=None, treat_col=None, covar_cols=None, options=None):
    if isinstance(O, pd.DataFrame):
        if unit_ids is None or time_ids is None:
            raise ValueError("Please specify unit_ids and time_ids when passing a long DataFrame.")

        df = O  # keep a handle on the DataFrame; O is rebound to a matrix below
        y_col = outcome_col or "y"
        z_col = treat_col or "z"
        x_cols = covar_cols or []

        def to_wide(col):
            # Pivot one long-form column into an N×T matrix (units × periods).
            # pivot raises if a (unit, time) pair appears more than once.
            return df.pivot(index=unit_ids, columns=time_ids, values=col).to_numpy()

        O = to_wide(y_col)
        Z = to_wide(z_col) if z_col in df.columns else None
        # Covariates are stacked into a K×N×T array
        X = np.stack([to_wide(c) for c in x_cols]) if x_cols else None

    return _validate_matrices(O, Z, X, options)

# Check severity
if report.has_critical():
    print(report.to_markdown())
    raise ValueError("Critical data issues detected")

# Apply suggested fixes (Phase 3 feature)
report.suggestions.apply_to(O, X)
```

---

## Implementation Phases
**Phase 1 (MVP)**: Points 1–4 (shape, missingness, scaling, rank/conditioning).  
**Phase 2**: Points 5–6 (pre-treatment + treatment design checks).  
**Phase 3**: Auto-transformer generation (`.get_transformers()`, `.apply_to()`).

---

## Integration
- Run automatically in example notebooks before fitting.
- Unit tests verify thresholds on synthetic ill-conditioned data.
- CI fails on `critical` severity (e.g., κ > 10⁶, pre-period < 5).
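
A synthetic fixture for the threshold tests might look like the sketch below, which builds a matrix with a prescribed condition number from random orthonormal factors. The helper `make_ill_conditioned` and the test name are hypothetical, and the test checks the fixture against the 10⁶ threshold with plain NumPy because `ct.validate_panel` does not exist yet.

```python
import numpy as np

def make_ill_conditioned(n=50, k=4, kappa=1e8, seed=0):
    """Return an n x k matrix whose condition number is approximately kappa."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(n, k)))  # n x k, orthonormal columns
    V, _ = np.linalg.qr(rng.normal(size=(k, k)))  # k x k, orthogonal
    # Singular values decay geometrically from 1 down to 1 / kappa
    s = np.logspace(0, -np.log10(kappa), k)
    return U @ np.diag(s) @ V.T

def test_condition_number_is_flagged_critical():
    X = make_ill_conditioned(kappa=1e8)
    # Once the module exists, this would assert on the report's severity instead.
    assert np.linalg.cond(X) > 1e6
```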

## Next Steps
- [ ] Approve module name and API signature
- [ ] Implement Phase 1 (Shape / Missingness / Scaling / Condition)
- [ ] Add synthetic tests and example notebook
- [ ] Integrate with PanelDataset

