This repository contains the code accompanying the 2026 pub entitled "Cross-validation reveals strong batch effects in Raman spectra of biological samples".
The code is organized as a Python package called raman_batch_effects.
Most of the code in this repo was written using Claude Code. All code was reviewed by its human author and a second independent human reviewer. This README, and the pub itself, were written by a human.
data/ # Location of the raw data.
output/ # Location of generated figures and summary tables.
src/raman_batch_effects/ # Python package source.
data.zip # Raw spectral data and metadata.
The data/ and output/ directories are not included in this repo. The data/ directory is created by unzipping data.zip with make unzip-data, and the output/ directory is created by running the scripts in this repo (see below).
The raw spectral data and metadata are stored in the data.zip file, which is included in this repo. After cloning this repo, unzip the file using make unzip-data. This will create the data/ directory.
This directory contains the raw spectral data (in CSV format) and metadata (as platemaps, also in CSV format). For details about the structure of the data, refer to the code that loads the data in src/raman_batch_effects/loaders.py.
We use uv to manage the dependencies and run the scripts.
First install the dependencies:
    uv sync

Then unzip the raw data:
    make unzip-data

This will unzip the data into the data/ directory. The size of the unzipped data is 40 MB. After this, you should be able to run the scripts to generate the figures and results shown in the pub (see below).
All of the code in this repo was run on a MacBook Pro with an Apple M1 Max chip, 64GB of RAM, and macOS Ventura 13.
All figures are generated by scripts in the src/raman_batch_effects/scripts/ module. Each script writes its output to a date-stamped subdirectory under output/. On a modern laptop, the scripts should not take more than a few minutes to run.
Only a subset of the figures generated by these scripts was included in the pub. The filenames of the included figures can be found in the copy_figures.py script.
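To illustrate what a date-stamped subdirectory under output/ can look like, here is a minimal sketch using only the standard library. The function name `dated_output_dir` and the `<YYYY-MM-DD>-<script_name>` naming scheme are assumptions made for this example; the scripts in this repo may name their output directories differently.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dated_output_dir(root: Path, script_name: str, today: Optional[date] = None) -> Path:
    """Create and return root/<YYYY-MM-DD>-<script_name>.

    Hypothetical helper: the naming scheme is an assumption, not
    necessarily the one used by the scripts in this repo.
    """
    # Use the supplied date if given (handy for testing), else today's date.
    stamp = (today or date.today()).isoformat()
    out_dir = root / f"{stamp}-{script_name}"
    # Create the directory (and any missing parents); do nothing if it exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `dated_output_dir(Path("output"), "plot_spectra")` would create a directory named with the current date under output/, so runs on different days do not write into the same directory.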
plot_spectra
This script generates plots of the raw and processed spectra, grouped by strain and species. The plot of strain-level mean spectra was used for Figure 1 of the pub.
    uv run python -m raman_batch_effects.scripts.plot_spectra

plot_cross_validation
This script plots confusion matrices for all prediction tasks and cross-validation strategies. Each plot is generated for two models (random forest and SVC) and three versions of the dataset (uncorrected, LMM-corrected, and ComBat-corrected). Per-fold performance metrics are saved as YAML files alongside the plots. A subset of these plots was used for Figures 2-6 of the pub.
    uv run python -m raman_batch_effects.scripts.plot_cross_validation

aggregate_cv_metrics
This script reads the per-run YAML metrics files produced by plot_cross_validation above and generates a summary file containing median metrics and a heatmap of the median values for one metric (the MCC). This heatmap was used for Figure 7 of the pub, and the median MCC values were quoted inline in the pub text.
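For readers unfamiliar with the MCC (Matthews correlation coefficient), the sketch below shows how it can be computed from a multiclass confusion matrix and then summarized as a median over folds. It is an illustration only: the function names are hypothetical, the confusion matrices are passed in as plain lists rather than read from the YAML files (whose layout is not described here), and the actual script may rely on a library implementation instead.

```python
import math
from statistics import median


def mcc_from_confusion(cm):
    """Multiclass MCC from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    Returns 0.0 when the denominator is zero (for example, when every
    prediction falls in a single class).
    """
    n = len(cm)
    true_totals = [sum(row) for row in cm]                          # t_k
    pred_totals = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # p_k
    correct = sum(cm[k][k] for k in range(n))                       # c
    total = sum(true_totals)                                        # s
    numerator = correct * total - sum(p * t for p, t in zip(pred_totals, true_totals))
    denominator = math.sqrt(
        (total * total - sum(p * p for p in pred_totals))
        * (total * total - sum(t * t for t in true_totals))
    )
    return numerator / denominator if denominator else 0.0


def median_mcc(fold_matrices):
    """Median of the per-fold MCC values (hypothetical summary helper)."""
    return median(mcc_from_confusion(cm) for cm in fold_matrices)
```

A perfectly diagonal matrix gives an MCC of 1, and values near 0 indicate chance-level predictions. Unlike the mean, the median is not dominated by a single unusually good or bad fold.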
    uv run python -m raman_batch_effects.scripts.aggregate_cv_metrics

All plotting scripts accept the following flags:
--overwrite: Overwrite existing output files (by default, existing files are skipped).
--reset: Remove the target output directory before running.
--clear: Clear the joblib cache before running.
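The flag semantics described above can be sketched with the standard library's argparse. This is not the repo's actual command-line code: `build_parser`, `prepare_output_dir`, and `should_write` are hypothetical names, and the handling of --clear is left as a comment because it depends on how the joblib cache is configured.

```python
import argparse
import shutil
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the three documented flags."""
    parser = argparse.ArgumentParser(description="Sketch of a plotting-script CLI.")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing output files.")
    parser.add_argument("--reset", action="store_true",
                        help="Remove the target output directory before running.")
    parser.add_argument("--clear", action="store_true",
                        help="Clear the joblib cache before running.")
    return parser


def prepare_output_dir(out_dir: Path, reset: bool) -> None:
    """Delete out_dir first when reset is set, then make sure it exists."""
    if reset and out_dir.exists():
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)


def should_write(path: Path, overwrite: bool) -> bool:
    """Existing files are skipped unless overwrite is set."""
    return overwrite or not path.exists()


# --clear would be handled here by clearing the cache object used by the
# scripts; that step is omitted because the cache setup is not shown.
```

On the command line, a flag is appended to any of the commands above, for example `uv run python -m raman_batch_effects.scripts.plot_spectra --overwrite`.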