Arcadia-Science/raman-batch-effects-yeast

Batch effects in Raman spectroscopy of yeast cultures

This repository contains the code accompanying the 2026 pub entitled "Cross-validation reveals strong batch effects in Raman spectra of biological samples".

The code is organized as a Python package called raman_batch_effects.

Authorship

Most of the code in this repo was written using Claude Code. All code was reviewed by its human author and a second independent human reviewer. This README, and the pub itself, were written by a human.

Repository structure

data/                             # Location of the raw data.
output/                           # Location of generated figures and summary tables.
src/raman_batch_effects/          # Python package source.
data.zip                          # Raw spectral data and metadata.

The data/ and output/ directories are not included in this repo. The data/ directory is created by unzipping data.zip with make unzip-data, and the output/ directory is created by running the scripts in this repo (see below).

Data

The raw spectral data and metadata are stored in data.zip, which is included in this repo. After cloning the repo, unzip the file using make unzip-data. This will create the data/ directory.

This directory contains the raw spectral data (in CSV format) and metadata (as platemaps, also in CSV format). For details about the structure of the data, refer to the code that loads the data in src/raman_batch_effects/loaders.py.

Environment setup

We use uv to manage the dependencies and run the scripts.

First install the dependencies:

uv sync

Then unzip the raw data:

make unzip-data

This will unzip the data into the data/ directory. The unzipped data takes up 40 MB. After this, you should be able to run the scripts to generate the figures and results shown in the pub (see below).

All of the code in this repo was run on a MacBook Pro with an Apple M1 Max chip, 64 GB of RAM, and macOS Ventura 13.

Generating the figures and results shown in the pub

All figures are generated by scripts in the raman_batch_effects.scripts subpackage (src/raman_batch_effects/scripts/). Each script writes its output to a date-stamped subdirectory under output/. On a modern laptop, the scripts should not take more than a few minutes to run.
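For readers unfamiliar with the date-stamped output convention, here is a minimal sketch of how such a subdirectory could be created. The helper name dated_output_dir and the "<date>-<script name>" naming pattern are assumptions for illustration; the actual layout used by the scripts may differ.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dated_output_dir(
    script_name: str,
    root: Path = Path("output"),
    today: Optional[date] = None,
) -> Path:
    """Create and return <root>/<YYYY-MM-DD>-<script_name> (hypothetical naming)."""
    today = today or date.today()
    out_dir = root / f"{today.isoformat()}-{script_name}"
    # exist_ok=True lets the same script run more than once per day.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling dated_output_dir("plot_spectra") on 15 January 2026 would create output/2026-01-15-plot_spectra under this assumed pattern.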

Only a subset of the figures generated by these scripts was included in the pub. The filenames of the included figures can be found in the copy_figures.py script.

Scripts

plot_spectra

This script generates plots of the raw and processed spectra, grouped by strain and species. The plot of strain-level mean spectra was used for Figure 1 of the pub.

uv run python -m raman_batch_effects.scripts.plot_spectra
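As a rough illustration of what a "strain-level mean spectrum" is, the sketch below averages spectra that share a label, point by point along the wavenumber axis. It uses only the standard library; the function name, the strain labels, and the intensities are made up for the example and are not taken from the package or the dataset.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Sequence, Tuple


def mean_spectrum_by_group(
    spectra: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, List[float]]:
    """Average all spectra sharing a label, one wavenumber at a time."""
    grouped = defaultdict(list)
    for label, intensities in spectra:
        grouped[label].append(intensities)
    # zip(*rows) walks the grouped spectra wavenumber by wavenumber.
    return {
        label: [fmean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


# Hypothetical toy spectra with three wavenumber bins each.
spectra = [
    ("strain_a", [1.0, 2.0, 3.0]),
    ("strain_a", [3.0, 4.0, 5.0]),
    ("strain_b", [10.0, 10.0, 10.0]),
]
means = mean_spectrum_by_group(spectra)
```

The sketch assumes all spectra share the same wavenumber axis; real spectra may need interpolation onto a common axis first.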

plot_cross_validation

This script plots confusion matrices for all prediction tasks and cross-validation strategies. Each plot is generated for two models (random forest and SVC) and three versions of the dataset (uncorrected, LMM-corrected, and ComBat-corrected). Per-fold performance metrics are saved as YAML files alongside the plots. A subset of these plots was used for Figures 2-6 of the pub.

uv run python -m raman_batch_effects.scripts.plot_cross_validation
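This README does not list the cross-validation strategies themselves. Assuming one of them holds out entire batches, the sketch below shows the general idea of a leave-one-batch-out split in plain Python. The function name and batch labels are hypothetical, and the package may implement its splits differently (for example, with scikit-learn).

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_batch_out(
    batches: Sequence[Hashable],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out one whole batch per fold."""
    by_batch: Dict[Hashable, List[int]] = defaultdict(list)
    for index, batch in enumerate(batches):
        by_batch[batch].append(index)
    for held_out, test_indices in by_batch.items():
        # Train on every sample whose batch differs from the held-out batch.
        train_indices = [
            i for batch, indices in by_batch.items() if batch != held_out for i in indices
        ]
        yield train_indices, test_indices


# Hypothetical batch labels for six spectra acquired on three days.
batches = ["day1", "day1", "day2", "day2", "day3", "day3"]
folds = list(leave_one_batch_out(batches))
```

Because no batch appears on both sides of a fold, a model cannot score well by recognizing batch-specific signal. Comparing these scores with those from a random split is one way to expose batch effects.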

aggregate_cv_metrics

This script reads the per-run YAML metrics files produced by plot_cross_validation above and generates a summary file containing median metrics and a heatmap of the median values for one metric (the MCC). This heatmap was used for Figure 7 of the pub, and the median MCC values were quoted inline in the pub text.

uv run python -m raman_batch_effects.scripts.aggregate_cv_metrics
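To make the aggregated metric concrete, the following sketch computes the Matthews correlation coefficient (MCC) for each fold and then takes the median across folds. It is simplified to the binary case and uses made-up confusion counts. The prediction tasks in the pub may be multiclass, and the package may compute the MCC differently.

```python
import math
from statistics import median


def binary_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # Common convention when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denominator


# Hypothetical per-fold counts (tp, tn, fp, fn); these are not results from the pub.
fold_counts = [(5, 5, 0, 0), (4, 4, 1, 1), (3, 3, 2, 2)]
per_fold_mcc = [binary_mcc(*counts) for counts in fold_counts]
median_mcc = median(per_fold_mcc)
```

The MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction. The median is less sensitive than the mean to a single unusually good or bad fold.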

Common CLI flags

All plotting scripts accept the following flags:

  • --overwrite — Overwrite existing output files (by default, existing files are skipped).
  • --reset — Remove the target output directory before running.
  • --clear — Clear the joblib cache before running.
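The flags above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the package's actual CLI code: build_parser and should_write are hypothetical names, and only the flag names and their described behavior come from this README.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three flags described in the README."""
    parser = argparse.ArgumentParser(description="Hypothetical plotting script.")
    parser.add_argument("--overwrite", action="store_true", help="Overwrite existing output files.")
    parser.add_argument("--reset", action="store_true", help="Remove the target output directory before running.")
    parser.add_argument("--clear", action="store_true", help="Clear the joblib cache before running.")
    return parser


def should_write(path: Path, overwrite: bool) -> bool:
    """Skip files that already exist unless --overwrite was passed."""
    return overwrite or not path.exists()


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--overwrite"])
```

With store_true, each flag defaults to False, which matches the documented default of skipping existing files.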
