Merged
11 changes: 11 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,11 @@
---
# Set update schedule for GitHub Actions

version: 2
updates:

- package-ecosystem: "github-actions"
directory: "/"
schedule:
# Check for updates to GitHub Actions every month
interval: "monthly"
270 changes: 270 additions & 0 deletions README.md
@@ -19,6 +19,59 @@ Stereo analysis methods implemented in Eventdisplay provide direction / energies

Output is a single ROOT tree called `StereoAnalysis` with the same number of events as the input tree.

### Training Stereo Reconstruction Models

The stereo regression training pipeline uses multi-target XGBoost to predict residuals (deviations from baseline reconstructions):

**Targets:** `[Xoff_residual, Yoff_residual, E_residual]` (residuals on direction and energy as reconstructed by the BDT stereo reconstruction method)

**Key techniques:**

- **Target standardization:** Targets are mean-centered and scaled to unit variance during training
- **Energy-bin weighting:** Events are weighted inversely by energy bin density; bins with fewer than 10 events are excluded from training to prevent overfitting on low-statistics regions
- **Multiplicity weighting:** Higher-multiplicity events (more telescopes) receive higher sample weights to prioritize high-confidence reconstructions
- **Per-target SHAP importance:** Feature importance values computed during training for each target and cached for later analysis
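The standardization and energy-bin weighting steps above can be sketched as follows (a minimal illustration; the helper names and bin count are assumptions, not the package API):

```python
import numpy as np

def standardize_targets(y):
    """Mean-center and scale residual targets to unit variance (per column)."""
    mean = y.mean(axis=0)
    std = y.std(axis=0)
    return (y - mean) / std, mean, std

def energy_bin_weights(log_energy, n_bins=20, min_events=10):
    """Weight events inversely by energy-bin population; zero out sparse bins."""
    counts, edges = np.histogram(log_energy, bins=n_bins)
    bin_idx = np.clip(np.digitize(log_energy, edges) - 1, 0, n_bins - 1)
    weights = np.zeros(len(log_energy))
    populated = counts[bin_idx] >= min_events  # bins below threshold keep weight 0
    weights[populated] = 1.0 / counts[bin_idx][populated]
    return weights
```

The returned mean/std pair is what the apply pipeline later needs to invert the standardization, which is why it is persisted alongside the model.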

**Training command:**

```bash
eventdisplay-ml-train-xgb-stereo \
--input_file_list train_files.txt \
--model_prefix models/stereo_model \
--max_events 100000 \
--train_test_fraction 0.5 \
--max_cores 8
```

**Output:** Joblib model file containing:

- XGBoost trained model object
- Target standardization scalers (mean/std)
- Feature list and SHAP importance rankings
- Training metadata (random state, hyperparameters)

### Applying Stereo Reconstruction Models

The apply pipeline loads trained models and makes predictions:

**Key safeguards:**

- Invalid energy values (≤0 or NaN) produce NaN outputs but preserve all input event rows
- Missing standardization parameters raise ValueError (prevents silent data corruption)
- Output row count always equals input row count
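These safeguards can be sketched like this (illustrative helpers, not the package's actual functions):

```python
import numpy as np

def invert_standardization(pred_std, target_mean, target_std):
    """Map standardized residual predictions back to physical units."""
    if target_mean is None or target_std is None:
        # Fail loudly rather than silently emitting un-scaled residuals.
        raise ValueError("Missing standardization parameters in model file")
    return pred_std * target_std + target_mean

def safe_log10(energy):
    """log10 that yields NaN (not a RuntimeWarning) for non-positive or NaN input."""
    energy = np.asarray(energy, dtype=float)
    out = np.full(energy.shape, np.nan)  # invalid rows stay NaN but are kept
    valid = np.isfinite(energy) & (energy > 0)
    out[valid] = np.log10(energy[valid])
    return out
```

Because `safe_log10` preallocates an output of the same shape as the input, invalid energies become NaN entries rather than dropped rows, preserving the input/output row-count guarantee.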

**Apply command:**

```bash
eventdisplay-ml-apply-xgb-stereo \
--input_file_list apply_files.txt \
--output_file_list output_files.txt \
--model_prefix models/stereo_model
```


**Output:** ROOT files with `StereoAnalysis` tree containing reconstructed Xoff, Yoff, and log10(E).

## Gamma/hadron separation using XGBoost

Gamma/hadron separation is performed using XGBoost classification trees. Features are image parameters and stereo reconstruction parameters provided by Eventdisplay.
@@ -27,6 +80,223 @@ The zenith angle dependence is accounted for by including the zenith angle as a

Output is a single ROOT tree called `Classification` with the same number of events as the input tree. It contains the classification prediction (`Gamma_Prediction`) and boolean flags (e.g. `Is_Gamma_75` for 75% signal efficiency cut).

## Diagnostic Tools

The regression diagnostics committed in this branch are:

### SHAP feature-importance summary

- Load per-target SHAP importances cached in the trained model file
- Create one top-20 feature plot per residual target (`Xoff_residual`, `Yoff_residual`, `E_residual`)
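Conceptually, each plot just ranks the cached mean-|SHAP| values per target and keeps the top 20; assuming the cache is a simple feature-to-importance mapping:

```python
def top_features(importance, n=20):
    """Rank features by cached mean |SHAP| value; return the top n (name, value) pairs."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```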

Required inputs:

- `--model_file`: trained stereo model `.joblib`
- `--output_dir`: directory for generated PNGs

Run:

```bash
eventdisplay-ml-diagnostic-shap-summary \
--model_file models/stereo_model.joblib \
--output_dir diagnostics/
```

Outputs:

- `diagnostics/shap_importance_Xoff_residual.png`
- `diagnostics/shap_importance_Yoff_residual.png`
- `diagnostics/shap_importance_E_residual.png`

### Permutation importance

- Rebuild the held-out test split from the model metadata and original input files
- Shuffle one feature at a time and measure the relative RMSE increase per residual target
- Validate predictive dependence on features rather than cached model attribution
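A minimal single-target sketch of the shuffle-and-remeasure loop described above (the real tool works per residual target; `predict` stands in for the trained model):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def permutation_importance(predict, X, y, rng=None):
    """Relative RMSE increase per feature when that feature's column is shuffled."""
    rng = rng or np.random.default_rng(0)
    baseline = rmse(y, predict(X))
    scores = {}
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # break the feature-target link for column j only
        scores[j] = (rmse(y, predict(Xp)) - baseline) / baseline
    return scores
```

Features the model truly relies on produce a large relative RMSE increase; features it ignores score near zero, regardless of their cached SHAP attribution.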

Required inputs:

- `--model_file`: trained stereo model `.joblib`
- `--output_dir`: directory for generated plots
- `--top_n`: number of top features to include in the plot (optional)
- `--input_file_list`: optional override if the path stored in the model metadata is no longer valid

Run:

```bash
eventdisplay-ml-diagnostic-permutation-importance \
--model_file models/stereo_model.joblib \
--output_dir diagnostics/ \
--top_n 20
```

Optional override:

```bash
eventdisplay-ml-diagnostic-permutation-importance \
--model_file models/stereo_model.joblib \
--input_file_list files.txt \
--output_dir diagnostics/
```

Output:

- `diagnostics/permutation_importance.png`

Notes:

- This diagnostic is slower than the SHAP summary because it rebuilds the processed test split.
- It is the better choice when you want to measure actual performance sensitivity to each feature.

### Generalization gap

- Read the cached train/test RMSE summary written during training
- Compare final train and test RMSE for each residual target
- Quantify the overfitting gap after training is complete
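The gap can be summarized per target roughly as follows (the exact percentage and ratio definitions cached by the package are assumptions):

```python
def generalization_gap(train_rmse, test_rmse):
    """Overfitting summary: absolute gap, gap as % of train RMSE, and test/train ratio."""
    gap = test_rmse - train_rmse
    return {
        "gap": gap,
        "gap_percent": 100.0 * gap / train_rmse,
        "ratio": test_rmse / train_rmse,  # ~1.0 means little overfitting
    }
```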

Required inputs:

- `--model_file`: trained stereo model `.joblib`
- `--output_dir`: directory for generated plots
- `--input_file_list`: optional override if the path stored in the model metadata is no longer valid

Run:

```bash
eventdisplay-ml-diagnostic-generalization-gap \
--model_file models/stereo_model.joblib \
--output_dir diagnostics/
```

Optional override:

```bash
eventdisplay-ml-diagnostic-generalization-gap \
--model_file models/stereo_model.joblib \
--input_file_list files.txt \
--output_dir diagnostics/
```

Output:

- `diagnostics/generalization_gap.png`

Notes:

- This diagnostic measures final overfitting by comparing train and test residual RMSE.
- Older model files without cached metrics fall back to rebuilding the original train/test split.
- Unlike `plot_training_evaluation.py`, it summarizes final RMSE, not the per-iteration XGBoost training history.

### Partial Dependence Plots

- Visualize how each feature influences model predictions
- Check that the model captures the expected physics, e.g. that higher multiplicity reduces corrections and that baseline features show smooth relationships
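Partial dependence itself is simple to sketch: pin one feature to each grid value, leave the other features at their observed values, and average the predictions (`predict` stands in for the trained model):

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid_size=20):
    """Average prediction as one feature sweeps a grid over its observed range."""
    grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), grid_size)
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature_idx] = v                 # pin the feature of interest
        pd_values.append(predict(Xv).mean())   # average over all events
    return grid, np.array(pd_values)
```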

Required inputs:

- `--model_file`: trained stereo model `.joblib`
- `--output_dir`: directory for generated plots (optional; default: `diagnostics`)
- `--features`: space-separated list of features to plot (optional; default: `DispNImages Xoff_weighted_bdt Yoff_weighted_bdt ErecS`)
- `--input_file_list`: optional override if the path stored in the model metadata is no longer valid

Run:

```bash
eventdisplay-ml-diagnostic-partial-dependence \
--model_file models/stereo_model.joblib \
--output_dir diagnostics/ \
--features DispNImages Xoff_weighted_bdt ErecS
```

Optional override:

```bash
eventdisplay-ml-diagnostic-partial-dependence \
--model_file models/stereo_model.joblib \
--input_file_list files.txt \
--features Xoff_weighted_bdt Yoff_weighted_bdt
```

Output:

- `diagnostics/partial_dependence.png` (grid of feature × target subplots)

Notes:

- PDP displays predicted residual output as a function of a single feature while holding others constant
- Multiplicity effect: high-multiplicity events should show smaller corrections (negative slope)
- Baseline stability: baseline features (e.g., `weighted_bdt`) should show smooth, linear relationships
- This diagnostic rebuilds the held-out test split and is slower than SHAP summary

### Residual Normality Diagnostics

- Validate that model residuals follow a normal distribution
- Detect outlier events and check for systematic biases in reconstruction errors
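The reported statistics can be computed with standard SciPy tests; a sketch for a single residual target (the dict keys here are illustrative, not the tool's output format):

```python
import numpy as np
from scipy import stats

def residual_normality(residuals):
    """Normality and shape statistics for one residual target."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()               # standardize before testing
    ks_stat, ks_p = stats.kstest(z, "norm")    # Kolmogorov-Smirnov vs N(0,1)
    ad = stats.anderson(z, dist="norm")        # Anderson-Darling
    return {
        "mean": r.mean(),
        "std": r.std(),
        "ks_pvalue": ks_p,
        "ad_statistic": ad.statistic,
        "skewness": stats.skew(r),
        "kurtosis": stats.kurtosis(r),
        "n_outliers": int(np.sum(np.abs(z) > 3)),  # events beyond 3 sigma
    }
```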

Required inputs:

- `--model_file`: trained stereo model `.joblib`
- `--output_dir`: directory for generated plots (optional; default: `diagnostics`)
- `--input_file_list`: optional override if the path stored in the model metadata is no longer valid

Run:

```bash
eventdisplay-ml-diagnostic-residual-normality \
--model_file models/stereo_model.joblib \
--output_dir diagnostics/
```

Optional override:

```bash
eventdisplay-ml-diagnostic-residual-normality \
--model_file models/stereo_model.joblib \
--input_file_list files.txt
```

Output:

- Residual normality statistics printed to console:
- Mean and standard deviation per target
- Kolmogorov-Smirnov test p-value (normality test)
- Anderson-Darling test statistic and critical value
- Skewness and kurtosis
- Q-Q plot R² value
- Number of outliers (>3σ) per target
- `diagnostics/residual_diagnostics.png` (single 2×N grid; generated on cache miss when reconstruction is required)

Notes:

- Residual normality stats are cached during training and loaded from the model file for fast retrieval
- Diagnostic plots (histograms, Q-Q plots) are only generated when the split must be reconstructed
- Invalid KS test or Anderson-Darling results (NaN/inf) are reported as special values
- Outlier counts help identify events with unusually large reconstruction errors

### Training-evaluation curves

- Plot XGBoost training vs validation metric curves
- Useful for checking convergence and overfitting behavior
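The curves come from XGBoost's `evals_result` history; for example, the best validation iteration can be read off as below (the `validation_0`/`validation_1` key names follow the sklearn-API defaults and are an assumption here):

```python
def best_iteration(evals_result, metric="rmse"):
    """Return (iteration, value) minimizing the validation metric curve."""
    test_curve = evals_result["validation_1"][metric]
    it = min(range(len(test_curve)), key=test_curve.__getitem__)
    return it, test_curve[it]
```

A validation curve that turns upward after its minimum while the training curve keeps falling is the classic overfitting signature these plots are meant to expose.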

Required inputs:

- `--model_file`: trained model `.joblib` containing an XGBoost model
- `--output_file`: output image path (optional; if omitted, plot is shown interactively)

Run:

```bash
eventdisplay-ml-plot-training-evaluation \
--model_file models/stereo_model.joblib \
--output_file diagnostics/training_curves.png
```

Output:

- Figure with one panel per tracked metric (for example `rmse`), showing training and test curves.

## Generative AI disclosure

Generative AI tools (including Claude, ChatGPT, and Gemini) were used to assist with code development, debugging, and documentation drafting. All AI-assisted outputs were reviewed, validated, and, where necessary, modified by the authors to ensure accuracy and reliability.
45 changes: 34 additions & 11 deletions docs/changes/53.feature.md
@@ -1,14 +1,37 @@
## Stereo Regression: Training on Residuals with Standardization and Energy Weighting

### Architectural Change

- **Training targets changed from absolute to residual values**: Models now predict residuals (deviations from baseline reconstructions) rather than absolute directions/energies. This allows XGBoost to learn corrections to existing Eventdisplay reconstructions (DispBDT, intersection method) and leverage their baseline accuracy as a starting point.

### Critical Bug Fixes

- **Fixed double log10 application**: Energy residuals computed in linear space; log10 applied explicitly during evaluation
- **Fixed standardization inversion**: Apply pipeline now loads and validates target_mean/target_std scalers (prevents KeyError)
- **Fixed energy-bin weighting**: Bins with <10 events get zero weight; correct inverse weighting for balanced training
- **Fixed ErecS validation**: Safe log10 computation during apply; all input rows preserved in output
- **Fixed evaluation metrics**: Energy resolution compared in log10 space with proper baseline alignment
- **Fixed FutureWarning**: Series positional indexing converted to numpy arrays for pandas compatibility

### New Features

- **Target standardization in training**: Residuals standardized to mean=0, std=1 during training to enable multi-target learning with balanced learning signals (direction and energy equally weighted)
- **Energy-bin weighted training**: Events weighted inversely by energy bin density; bins with <10 events excluded to prevent overfitting on low-statistics regions
- **Per-target SHAP importance caching**: Feature importances computed once during training for each target (Xoff_residual, Yoff_residual, E_residual), cached for diagnostic tools
- **Diagnostic scripts**:
- `diagnostic_shap_summary.py`: Top-20 feature importance plots per residual target
- `plot_training_evaluation.py`: Energy resolution and residual distribution visualization
- **Comprehensive test suites**: 20 new tests covering residual computation, standardization, energy weighting, apply inference
- **Robust error handling**: Clear messages for missing scalers; guaranteed row-count preservation in apply pipeline

### Enhanced Diagnostic Pipeline

- **Generalization-gap metrics cached during training**: Train/test RMSE, gap %, and generalization ratio computed and cached in the model artifact, enabling fast overfitting assessment without recomputation
- **Residual normality statistics cached during training**: Normality tests (Kolmogorov-Smirnov, Anderson-Darling), distribution shape metrics (skewness, kurtosis, Q-Q R²), and outlier counts computed once during training and cached for fast retrieval
- **Diagnostic reconstruction from model metadata**: All regression diagnostics (generalization-gap, partial-dependence, residual-normality) now reconstruct the held-out test split from stored model metadata + input file list, enabling reproducibility and offline analysis without CSV exports
- **Cache-first diagnostic workflows**: Diagnostic scripts load cached metrics first (fast) with graceful fallback to reconstruction if cache unavailable (backward compatible with older models)
- **CLI entry points for all diagnostics**:
- `eventdisplay-ml-diagnostic-generalization-gap`: Quantify overfitting via train/test RMSE comparison
- `eventdisplay-ml-diagnostic-partial-dependence`: Validate model captures physics via partial dependence curves
- `eventdisplay-ml-diagnostic-residual-normality`: Validate residual normality and detect outliers
- **Fixed sklearn FutureWarning**: Partial dependence plots convert feature data to float64 to avoid integer dtype warnings in newer scikit-learn versions
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -62,6 +62,11 @@ urls."documentation" = "https://github.com/Eventdisplay/Eventdisplay-ML"
urls."repository" = "https://github.com/Eventdisplay/Eventdisplay-ML"
scripts.eventdisplay-ml-apply-xgb-classify = "eventdisplay_ml.scripts.apply_xgb_classify:main"
scripts.eventdisplay-ml-apply-xgb-stereo = "eventdisplay_ml.scripts.apply_xgb_stereo:main"
scripts.eventdisplay-ml-diagnostic-generalization-gap = "eventdisplay_ml.scripts.diagnostic_generalization_gap:main"
scripts.eventdisplay-ml-diagnostic-partial-dependence = "eventdisplay_ml.scripts.diagnostic_partial_dependence:main"
scripts.eventdisplay-ml-diagnostic-permutation-importance = "eventdisplay_ml.scripts.diagnostic_permutation_importance:main"
scripts.eventdisplay-ml-diagnostic-residual-normality = "eventdisplay_ml.scripts.diagnostic_residual_normality:main"
scripts.eventdisplay-ml-diagnostic-shap-summary = "eventdisplay_ml.scripts.diagnostic_shap_summary:main"
scripts.eventdisplay-ml-plot-classification-performance-metrics = "eventdisplay_ml.scripts.plot_classification_performance_metrics:main"
scripts.eventdisplay-ml-plot-classification-gamma-efficiency = "eventdisplay_ml.scripts.plot_classification_gamma_efficiency:main"
scripts.eventdisplay-ml-plot-training-evaluation = "eventdisplay_ml.scripts.plot_training_evaluation:main"