A comprehensive machine learning pipeline for predictive modeling using hyperspectral imaging data. While originally developed for predicting volatile compounds and sugar concentrations in papaya fruit, this framework is designed to be adaptable for any hyperspectral imaging regression problem.
This framework processes raw hyperspectral absorbance data through various spectral transformations and applies multiple machine learning models to predict target variables of interest. The pipeline is designed for high-performance computing (HPC) environments and can process multiple variables and transformations in parallel, making it suitable for large-scale hyperspectral analysis projects.
- Flexible spectral preprocessing: Multiple transformation methods that can be combined
- Multiple feature selection methods: Including PLS regression coefficients, SPA, genetic algorithms, UVE, CARS, and PCA
- Comprehensive model suite: 9 different regression algorithms with hyperparameter optimization
- Scalable architecture: Designed for HPC environments with parallel processing
- Automated workflow: Batch processing of all transformation-variable-model combinations
- Reproducible results: Seed-based reproducibility and comprehensive logging
- Multiple data types: Supports flesh, skin, and combined tissue analysis
Hyperspectral_model_training_framework/
├── Scripts/
│   ├── 00-spectral-preprocessing.R   # Spectral data preprocessing functions
│   ├── 01-data-preparation.R         # Data preparation and train-test splitting
│   ├── 02-feature-selection.R        # Feature selection algorithms (6 methods)
│   ├── 03-model-functions.R          # Machine learning models (9 algorithms)
│   └── 04-main-pipeline.R            # Main pipeline orchestrator
├── data/
│   ├── absorbance_data.RDS           # Raw hyperspectral data
│   ├── all_response_data.csv         # Target variables
│   └── response_names.txt            # List of target variables
├── input_data/                       # Generated preprocessed spectral data
│   ├── smooth_data/
│   ├── derivative1_data/
│   ├── derivative2_data/
│   └── [other transformation combinations]/
├── results/                          # Model outputs
└── style/                            # R Markdown styling files
The preprocessing pipeline applies various spectral transformations commonly used in chemometrics:
- Smoothing: Noise reduction using moving average filters
- Standard Normal Variate (SNV): Scatter correction and normalization
- Baseline correction: Polynomial detrending (1st or 2nd order)
- Derivatives: Enhanced feature extraction (1st and 2nd derivatives)
The framework automatically generates all valid combinations of these transformations, providing comprehensive data preparation for downstream modeling.
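As an illustration of the transformations listed above, the following base-R sketch chains a moving-average smooth, SNV, and a first derivative on a small spectra matrix. The function names (`moving_average`, `snv`, `first_derivative`), the window width, and the toy data are assumptions made for this example; they are not taken from `00-spectral-preprocessing.R`, which may use different implementations (for instance from the prospectr package).

```r
# Toy spectra: rows are samples, columns are wavelengths (hypothetical data)
set.seed(1)
spectra <- matrix(runif(5 * 20, min = 0.2, max = 1.0), nrow = 5,
                  dimnames = list(NULL, paste0("w", 1:20)))

# Moving-average smoothing with an odd window; edge columns that lack a
# full window are dropped
moving_average <- function(x, window = 5) {
  stopifnot(window %% 2 == 1, window <= ncol(x))
  half <- (window - 1) / 2
  centres <- (half + 1):(ncol(x) - half)
  out <- sapply(centres, function(j) {
    rowMeans(x[, (j - half):(j + half), drop = FALSE])
  })
  out <- matrix(out, nrow = nrow(x))
  colnames(out) <- colnames(x)[centres]
  out
}

# Standard Normal Variate: centre and scale each spectrum (row) separately
snv <- function(x) {
  (x - rowMeans(x)) / apply(x, 1, sd)
}

# First derivative as a finite difference between adjacent wavelengths
first_derivative <- function(x) {
  t(diff(t(x)))
}

smoothed  <- moving_average(spectra, window = 5)
processed <- first_derivative(snv(smoothed))
dim(processed)  # 5 samples x 15 columns (20 - 4 from smoothing, - 1 from differencing)
```

Because each step takes and returns a samples-by-wavelengths matrix, the steps can be composed in any order, which is what makes generating combinations of transformations straightforward.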
Six feature selection algorithms are implemented, plus a `none` option that skips selection:
- PLSR B-coefficient filtering: Selects wavelengths based on PLS regression coefficients
- Successive Projections Algorithm (SPA): Minimizes collinearity between selected variables
- Genetic Algorithm (GA): Evolutionary optimization for feature subset selection
- Uninformative Variable Elimination (UVE): Removes variables with low reliability scores
- Competitive Adaptive Re-weighted Sampling (CARS): Iterative wavelength selection
- Principal Component Analysis (PCA): Dimension reduction and variable importance
- None: Uses all available wavelengths
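To make the PCA option more concrete, here is a minimal sketch of variance-based dimension reduction using `prcomp` from base R. The helper name `pca_reduce`, the 0.99 variance threshold, and the simulated data are assumptions for illustration; how `02-feature-selection.R` actually applies PCA is not shown in this document.

```r
# Hypothetical data: 30 samples x 50 wavelengths driven by three latent factors
set.seed(42)
latent   <- matrix(rnorm(30 * 3), nrow = 30)
loadings <- matrix(rnorm(3 * 50), nrow = 3)
spectra  <- latent %*% loadings + matrix(rnorm(30 * 50, sd = 0.01), nrow = 30)

# Keep the smallest number of principal components whose cumulative
# explained variance reaches the threshold (0.99 is an assumed default)
pca_reduce <- function(x, var_threshold = 0.99) {
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  n_comp <- which(explained >= var_threshold)[1]
  list(scores    = pca$x[, seq_len(n_comp), drop = FALSE],
       n_comp    = n_comp,
       explained = explained[n_comp])
}

reduced <- pca_reduce(spectra)
reduced$n_comp       # number of components retained
dim(reduced$scores)  # samples x retained components
```

The retained scores would replace the original wavelength columns as model inputs, which is mainly useful for models that struggle with many collinear predictors.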
Nine regression algorithms with hyperparameter optimization:
- Bayesian Regression (BayesA): With convergence diagnostics and adaptive parameter tuning
- Random Forest: Grid search over mtry and ntree parameters
- XGBoost: Gradient boosting with early stopping
- Support Vector Machine: RBF kernel with C and sigma optimization
- Neural Networks: Feed-forward networks with PCA preprocessing for high-dimensional data
- Ranger: Fast random forest implementation
- Partial Least Squares Regression (PLSR): With cross-validation for component selection
- Linear Support Vector Regression: L2-regularized linear models
- Elastic Net/Ridge Regression: Regularized linear models with alpha optimization
The main pipeline orchestrates the entire workflow with command-line interface support.
# Create conda environment with required packages
CONDA_NAME="hs_modelling"
mamba create -n $CONDA_NAME r-keras r-cli r-pacman r-tidyverse r-fs r-glue r-tidymodels r-furrr r-bglr r-randomforest r-caret r-glmnet r-coda r-xgboost r-kernlab r-ranger r-nnet r-pls r-getoptlong r-liblinear r::r-mdatools r::r-prospectr r-ga r-tictoc
conda activate $CONDA_NAME
- Prepare your hyperspectral data: Format as an RDS file with:
  - Column 1: Response variable
  - Column 2: Sample ID
  - Column 3: Type (F=flesh, S=skin, or other tissue types)
  - Columns 4+: Wavelength measurements
- Prepare response variables: CSV file with an ID column and target variables
- Update variable names: Create `response_names.txt` with your target variables
- Adjust preprocessing: Customize transformations in `00-spectral-preprocessing.R` if needed
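The column layout described above can be sketched as a small data frame saved to RDS. The column names (`response`, `ID`, `Type`, `nm_400`, ...) and all values are hypothetical; only the column order follows the description, and the names your scripts expect may differ.

```r
# Hypothetical example: 6 samples, 5 wavelengths (names and values are made up)
set.seed(7)
wavelengths <- paste0("nm_", seq(400, 800, by = 100))
spectra <- matrix(runif(6 * 5), nrow = 6, dimnames = list(NULL, wavelengths))

absorbance <- data.frame(
  response = round(rnorm(6, mean = 10, sd = 2), 2),  # Column 1: response variable
  ID       = paste0("sample_", 1:6),                 # Column 2: sample ID
  Type     = rep(c("F", "S"), times = 3),            # Column 3: tissue type
  spectra                                            # Columns 4+: wavelengths
)

# Written to a temporary file here; the pipeline expects data/absorbance_data.RDS
rds_path <- file.path(tempdir(), "absorbance_data.RDS")
saveRDS(absorbance, rds_path)

# Read it back and confirm the layout of the first four columns
check <- readRDS(rds_path)
str(check[, 1:4])
```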
# Run the preprocessing script
source("Scripts/00-spectral-preprocessing.R")

This generates multiple preprocessed datasets in subdirectories of `input_data/`.
Rscript Scripts/04-main-pipeline.R \
  --input input_data/derivative1_data/derivative1.RDS \
  --output results/derivative1_target_variable.RDS \
  --seeds 5 \
  --var target_variable \
  --response data/all_response_data.csv \
  --cpus 1 \
  --models "rf,xgboost" \
  --features "spa,ga" \
  --verbose

# Generate command file for all combinations
parallel --dry-run "Rscript Scripts/04-main-pipeline.R --input {1} --output results/{1/.}_{2}.RDS --seeds 5 --var {2} --response data/all_response_data.csv --cpus \$SLURM_CPUS_PER_TASK --verbose" \
  ::: $(find input_data/ -name "*.RDS") \
  ::: $(cat data/response_names.txt) > commands.txt
# Submit job array
sbatch -a 1-$(cat commands.txt | wc -l) \
  --job-name=hyperspectral_modeling \
  --cpus-per-task=2 \
  --mem=8G \
  --time=6:00:00 \
  --export=ALL,CMDS_FILE=commands.txt,CONDA_NAME=hs_modelling \
  array.slurm

The main pipeline (`04-main-pipeline.R`) accepts the following arguments:
- `--input`: Input RDS file with preprocessed spectral data
- `--output`: Output file path for results (RDS/CSV/TSV/TXT)
- `--seeds`: Number of random seeds for reproducibility (default: 5)
- `--var`: Target variable name to predict
- `--response`: CSV file containing all response variables
- `--cpus`: Number of CPU cores to use (default: 1)
- `--models`: Models to run (comma-separated or 'all')
  - Options: `bayes`, `rf`, `xgboost`, `svm`, `nnet`, `ranger`, `plsr`, `linear_svr`, `elastic_net`
- `--features`: Feature selection methods (comma-separated, 'all', or 'none')
  - Options: `plsr_bcoef`, `spa`, `ga`, `uve`, `cars`, `pca`, `none`
- `--verbose`: Enable verbose output
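The comma-separated values accepted by `--models` and `--features` could be handled along the following lines. This is only a sketch: `parse_choices` is a hypothetical helper written for this example, and the way `04-main-pipeline.R` actually parses and validates its arguments may differ.

```r
# Valid model names as listed in the argument reference
valid_models <- c("bayes", "rf", "xgboost", "svm", "nnet", "ranger",
                  "plsr", "linear_svr", "elastic_net")

# Hypothetical helper: expand 'all', split on commas, reject unknown names
parse_choices <- function(arg, valid) {
  if (identical(tolower(arg), "all")) return(valid)
  choices <- trimws(strsplit(arg, ",", fixed = TRUE)[[1]])
  unknown <- setdiff(choices, valid)
  if (length(unknown) > 0) {
    stop("Unknown option(s): ", paste(unknown, collapse = ", "))
  }
  unique(choices)
}

parse_choices("rf,xgboost", valid_models)  # two model names
parse_choices("all", valid_models)         # all nine model names
```

Failing early on an unknown name is cheaper than discovering a typo after a job array has already been queued.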
This framework can be applied to various hyperspectral imaging problems including:
- Agriculture: Crop quality assessment, disease detection, nutrient analysis
- Food Science: Quality control, composition analysis, authenticity testing
- Environmental Monitoring: Vegetation analysis, pollution detection, soil characterization
- Materials Science: Composition analysis, quality control, defect detection
- Medical Imaging: Tissue analysis, diagnostic applications
Results include comprehensive metrics for each model:
- Performance metrics: RMSE, R², MAE, NRMSE, correlation coefficients
- Training and test metrics: Separate evaluation for both datasets
- Model-specific outputs: Variable importance, hyperparameters, convergence diagnostics
- Timing information: Feature selection and model training times
- Reproducibility info: Seeds, preprocessing methods, timestamps
The framework calculates extensive metrics including:
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- R-squared and Adjusted R-squared
- Normalized RMSE (NRMSE)
- Ratio of Performance to Deviation (RPD)
- Pearson correlation coefficients
- Residual statistics
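For reference, the listed metrics can be computed from observed and predicted values as in the sketch below. `regression_metrics` is a hypothetical helper, not the framework's own function, and two choices are assumptions: NRMSE is normalised by the observed range (normalising by the mean is also common), and RPD uses the sample standard deviation of the observed values.

```r
# Hypothetical helper computing the listed metrics from observed and
# predicted values; n_predictors is only used for adjusted R-squared
regression_metrics <- function(obs, pred, n_predictors = 1) {
  stopifnot(length(obs) == length(pred))
  n <- length(obs)
  resid <- obs - pred
  rmse <- sqrt(mean(resid^2))
  r2 <- 1 - sum(resid^2) / sum((obs - mean(obs))^2)
  c(
    MAE     = mean(abs(resid)),
    RMSE    = rmse,
    R2      = r2,
    Adj_R2  = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
    NRMSE   = rmse / (max(obs) - min(obs)),  # assumption: normalised by range
    RPD     = sd(obs) / rmse,                # ratio of performance to deviation
    Pearson = cor(obs, pred)
  )
}

# Made-up values for demonstration only
obs  <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
pred <- c(10.0, 11.9, 9.5, 12.4, 11.2, 11.0)
round(regression_metrics(obs, pred), 3)
```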
This is an early version of the framework with several areas for improvement:
- Model Training Efficiency: Some models (particularly Bayesian methods) are prone to overfitting and convergence issues. Better regularization strategies and more robust convergence diagnostics are needed.
- Classification Support: Currently only supports regression models; classification algorithms are not yet implemented for categorical predictions.
- Hyperparameter Optimization: Automated tuning is limited to grid search, which may not find optimal parameters efficiently.
- Memory Management: Large datasets with many wavelengths may require optimization for memory usage, especially with neural networks.
- Cross-validation Strategy: Currently uses simple train-test splits; nested cross-validation for hyperparameter tuning would be more robust.
Planned improvements include:
- Implementation of classification models for categorical predictions (SVM classification, Random Forest classification, etc.)
- Advanced cross-validation strategies and nested CV for hyperparameter tuning
- Integration of deep learning approaches specifically designed for spectral analysis
- Enhanced feature selection methods specific to hyperspectral data (e.g., interval-based methods)
- Improved computational efficiency and memory management for large datasets
- Comprehensive visualization and reporting tools for model comparison
- Docker containerization for improved portability and reproducibility
- Automated model selection based on problem characteristics
- Integration with cloud computing platforms
We welcome contributions in any of these areas! Please feel free to submit issues, feature requests, or pull requests.
The framework was originally developed for predicting volatile compounds and sugar concentrations in papaya fruit, demonstrating its application in agricultural quality assessment. The pipeline processes hyperspectral data from both flesh and skin tissues, applies multiple preprocessing transformations, and evaluates numerous modeling approaches to identify the best prediction strategy for each target compound.
- Computational complexity: With 6 feature selection methods, 9 models, multiple seeds, and 3 dataset types, a single preprocessing method can generate hundreds of model runs
- Memory requirements: Neural networks and some feature selection methods may require substantial RAM for high-dimensional data
- Runtime: Bayesian methods and genetic algorithms can be computationally intensive; consider limiting these for initial exploration
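The run count mentioned above can be checked with a quick enumeration. The sketch assumes 5 seeds (the default of `--seeds`) and treats flesh, skin, and combined as the three dataset types; how the pipeline actually iterates over these combinations is not shown in this document.

```r
# Enumerate every feature-selection x model x seed x dataset combination
runs <- expand.grid(
  feature = c("plsr_bcoef", "spa", "ga", "uve", "cars", "pca"),
  model   = c("bayes", "rf", "xgboost", "svm", "nnet", "ranger",
              "plsr", "linear_svr", "elastic_net"),
  seed    = 1:5,
  dataset = c("flesh", "skin", "combined"),
  stringsAsFactors = FALSE
)
nrow(runs)  # 6 * 9 * 5 * 3 = 810
```

Including the `none` feature option as a seventh choice raises the total to 945 runs per preprocessing method, which is why restricting `--models` and `--features` during initial exploration pays off.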
- Josh Lomax
- Ido Bar
If you use this framework in your research, please cite TBA.
- Memory errors with neural networks: Reduce the number of wavelengths using PCA or feature selection
- Bayesian model convergence: The framework includes adaptive parameter tuning, but very noisy data may still cause issues
- Long runtimes: Start with a subset of models (`--models "rf,plsr"`) and feature selection methods for initial testing
- Missing packages: Ensure all required packages are installed in your conda environment
- Use `--features "none,pca"` for initial exploration to reduce computation time
- Start with faster models like `rf`, `plsr`, `elastic_net` before running computationally intensive methods
- Monitor memory usage when processing high-dimensional data
- Use an appropriate number of CPU cores based on your system capabilities