- The Cleaning Pipeline: Handling NaNs/missing values (via interpolation) and duplicate records.
- The "Fix at Source" Principle: Digital cleaning is a last resort. Physical improvements to the measurement chain (sensors, environments) are preferred.
- Normalization: Essential for numerical stability in NNs, but must be done consistently between training and inference to avoid "scaling leakage."
- Label Uncertainty: In materials science, the "ground truth" is often subjective (e.g., manual grain boundary identification).
- Inter-annotator Variance: Different experts may segment the same micrograph differently. This variance sets an upper bound on meaningful model performance.
- Spatial Leakage: High correlation between different regions of the same physical sample. Randomly splitting pixels or patches into train/test results in over-optimistic performance estimates. Solution: Split by sample or batch.
- Temporal Leakage: Using future process data to predict past outcomes in a time series. Solution: Use "sliding window" or "walk-forward" validation.
- Information Leakage: Including features that are indirectly derived from the target variable.
- "Garbage In, Garbage Out": Case studies of ML failure due to bad data.
- The cost of data acquisition vs. the cost of bad labels.
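The "scaling leakage" point above can be made concrete with a minimal plain-Python sketch: normalization statistics are computed on the training split only and reused unchanged at inference time (illustrative only; a real pipeline would use a fitted scaler object from a library).

```python
def fit_zscore(train):
    """Compute mean and std on the TRAINING data only."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((x - mean) ** 2 for x in train) / n) ** 0.5
    return mean, std

def apply_zscore(values, mean, std):
    """Reuse the stored training statistics; never re-fit on test data."""
    return [(x - mean) / std for x in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_zscore(train)
train_scaled = apply_zscore(train, mean, std)
test_scaled = apply_zscore(test, mean, std)  # same mean/std as training
```

Re-fitting the scaler on the full dataset (or on the test split) would let test-set statistics leak into the representation seen during training.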
- Systematic cleaning (Neuer Ch 3).
- Imputation strategies: Linear interpolation vs. Physics-based constraints.
- Normalization and Scaling: Min-Max, Z-score, and Robust scalers.
- The human element: Manual annotation and its pitfalls.
- Quantification of label uncertainty.
- Crowdsourcing vs. Expert consensus in materials science.
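Label uncertainty between annotators can be quantified with simple agreement statistics. A minimal sketch of Cohen's kappa on two hypothetical per-pixel binary masks (the expert labels here are toy data, not from any real micrograph):

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Expected agreement if both annotators labelled at random,
    # each with their own observed label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two experts' binary "grain boundary" masks for the same 8 pixels (toy data)
expert_1 = [1, 1, 0, 1, 0, 0, 1, 0]
expert_2 = [1, 1, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(expert_1, expert_2)
```

A kappa well below 1.0 between experts signals irreducible label noise, which in turn bounds how much model accuracy is meaningful to report.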
- Definition: Information from the test set influencing the training process.
- Spatial Correlation: The "Same Sample" trap.
- Feature Leakage: The importance of independent variables.
- How to detect leakage: The "Too good to be true" heuristic.
- Grouped K-Fold validation.
- Nested Cross-Validation.
- Summary: A checklist for a "Correct" materials ML pipeline.
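The grouped and walk-forward validation strategies in the outline above can be sketched as toy index-based splitters (plain Python for illustration; libraries such as scikit-learn provide production versions, e.g. `GroupKFold` and `TimeSeriesSplit`):

```python
def grouped_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs such that no group (e.g. a
    physical sample ID) ever appears on both sides of a split."""
    unique = sorted(set(groups))
    for i in range(n_splits):
        held_out = set(unique[i::n_splits])
        test = [j for j, g in enumerate(groups) if g in held_out]
        train = [j for j, g in enumerate(groups) if g not in held_out]
        yield train, test

def walk_forward(n, n_splits, min_train):
    """Yield expanding-window (train_idx, test_idx) pairs: the model is
    always evaluated on data strictly later than anything it trained on."""
    test_size = (n - min_train) // n_splits
    for i in range(n_splits):
        split = min_train + i * test_size
        yield list(range(split)), list(range(split, split + test_size))

# Two patches per physical sample: splitting by group, not by patch,
# prevents the "same sample" trap.
groups = ["A", "A", "B", "B", "C", "C"]
folds = list(grouped_kfold(groups, 3))
time_folds = list(walk_forward(10, n_splits=3, min_train=4))
```

In each grouped fold, the train and test group sets are disjoint; in each walk-forward fold, every test index is later than every training index, so no future process data leaks into training.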
Summary for ML-PC Week 3:
This unit focuses on the most critical and often overlooked part of the ML pipeline: data integrity. We discuss systematic data cleaning and normalization techniques while highlighting the unique challenges of labeling experimental materials data, such as inter-annotator variance. A major focus is on Data Leakage, specifically how spatial and physical correlations in materials samples can lead to deceptively high model performance. We introduce robust validation strategies to ensure models generalize to truly unseen data.