MassSpecGym v1.5 (part 1/2) by roman-bushuiev · Pull Request #65 · pluskal-lab/MassSpecGym

roman-bushuiev · 2026-05-08T13:31:47Z

No description provided.

… results Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Supports comma-separated types (e.g., "morgan,maccs,map4") for multi-fingerprint concatenation. MAP4 is metabolomics-specific and captures both local substructure and long-range topology. MACCS provides substructure key-based fingerprints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…y mol dtype - Add output_dtype parameter to InMemCachedMolTransform (defaults to torch.long for backwards compat). Allows torch.float32 for fingerprints. - Fix RetrievalDataset.__getitem__: convert non-float integer tensors from cached transforms to dataset dtype for query molecules. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Override get_checkpoint_monitors in RetrievalMassSpecGymModel so retrieval runs track hit@1 (mode=max) for early stopping and save_last, while still keeping a val_loss checkpoint (no early stopping). val_loss can decouple from hit@k under heavier SMILES standardization (e.g. spectraverse), so monitoring it led to premature termination while hit@1 was still climbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…idation Generates RDKit-canonical copies of MassSpecGym data alongside the existing files, with empirical proof that the new pipeline is information-preserving: DeepSets+Fourier baseline trained on v1 and v1.5 produced bit-exact identical val_loss, val_fingerprint_cos_sim, and the first 24 train_loss steps. Outputs (under MassSpecGym/data/, written by the SLURM job, not part of this commit): MassSpecGym1.5.{tsv,mgf} and MassSpecGym1.5_retrieval_candidates_{formula,mass}.json. Changes: - scripts/fixes/rdkit_canon_massspecgym.py: write under v1.5 names; add a pre-canonicalization sanity sample (2k random keys + first-cands per JSON) documenting that the v1 candidate JSONs are NOT already RDKit-canonical (~95.4% of keys would change); capture the pre-mapping SMILES column so every formula/mass/InChIKey mismatch log line shows BOTH orig_smi and new_smi. - scripts/fixes/run_rdkit_canon_v1.5.sh: SLURM CPU wrapper (small partition, 64 GB mem, 8h walltime; the actual run completed in 51 min). - massspecgym/data/data_module.py: handle stage="validate" in _split_dataset (pre-existing bug from commit 4f6da4d that caused trainer.validate() to fail). - scripts/run_v1.5_baseline.sh + scripts/submit_v1.5_baselines.sh: small-g 1-GPU wrappers running the existing DeepSets+Fourier retrieval baseline on v1 vs v1.5 with identical config, seed, and hyperparameters. - notebooks/massspecgym_in_the_wild/massspecgym_v1.5_validation.ipynb: end-to-end validation notebook (7 sections) with 5 PNG+SVG figures under figures/v1.5_validation/. Confirms 96% of SMILES change (77% kekul-only, 23% atom-order/stereo); all non-SMILES columns 0-changed except machine- epsilon noise on parent_mass/precursor_mz/collision_energy; 58 mass and 272 InChIKey mismatches at MASS_TOL=0.1 Da are pre-existing data errors (the historical [P+](=O) phosphorus-cation series) shown side-by-side with both orig_smi and new_smi. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both DeepSets+Fourier baselines (`expmisc002_v1_baseline`, `expmisc002_v1.5_baseline`) ran to the 6h walltime and TIMEOUT'd in the middle of epoch 0 (step ~105). Notebook re-executed against the full W&B `scan_history` shows: - pre-fit `val_loss` and `val_fingerprint_cos_sim` are bit-exact identical between v1 and v1.5 (16-digit float64 equality) - 104 / 104 logged `train_loss` steps are bit-exact identical (max |Δ| = 0.0) This is a strictly stronger parity check than the planned `hit_rate@k` comparison would have been: byte-for-byte equal training trajectories through ~22% of an epoch directly demonstrate that RDKit canonicalization preserves all the information this baseline ingests (Morgan fingerprints are canonical-form-invariant, spectra are byte-identical between the two TSVs, and seed=0 is fixed). Files: just the rerun notebook and refreshed PNG/SVG figures under `figures/v1.5_validation/`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ence Two visualization fixes for the DeepSets+Fourier parity check: - W&B summary drift: r.summary captures the latest logged value per metric, but train_loss is logged every step while v1 and v1.5 may be caught at different latest-step snapshots, making it look like the values differ. Cells now build the comparison table by merging histories on _step and reporting at each metric's last COMMON logged step (val_loss / val_fingerprint_cos_sim at step 0 since they only fire at validation epoch boundaries; train_loss at step 104, the last train step both runs reached). All three metrics are bit-exact identical at their respective last common steps. - Convergence figure: v1 was hidden under v1.5 because the curves overlap exactly. Switched to dashed-blue + open-circle markers for v1 vs solid-orange + x markers for v1.5 so both stay visible even when their values coincide to machine precision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The PNG/SVG files under notebooks/massspecgym_in_the_wild/figures/v1.5_validation/ were redundant with the figure outputs already embedded in the executed notebook. Removed them and extended .gitignore to ignore future notebooks/**/figures/ outputs so re-running the notebook locally does not re-stage the same images. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MassSpecDataset, RetrievalDataset, SimulationDataset, RetrievalSimulationDataset and load_massspecgym() now pull MassSpecGym1.5.tsv and the matching MassSpecGym1.5_retrieval_candidates_{mass,formula}.json from HuggingFace when no explicit path is given. Verified: hf_hub_download succeeds for all three v1.5 files and `MassSpecDataset()` round-trips 231,104 rows. Existing user code that passes pth=... or candidates_pth=... is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The HuggingFace defaults now serve v1.5 candidate JSONs whose keys are RDKit-canonical SMILES; the demo's pth was still 'MassSpecGym.tsv' (PubChem-canonical), so RetrievalSimulationDataset's `candidates_mask.all()` assertion could fail on a 1% subsample whose SMILES no longer match the candidate keys. Switch demo.yml to 'MassSpecGym1.5.tsv' so the simulation test runs against a self-consistent v1.5 dataset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roman-bushuiev and others added 27 commits November 10, 2025 01:06

Split setup in MassSpecDataModule

4f6da4d

Allow no candidates in RetrievalDataset

e1e87a8

Fix no candidates in RetrievalDataset

2b88954

Fix RetrievalDataset.__getitem__ dtypes

c36d23c

Add dataloader args to MassSpecDataModule

10d0a01

Merge branch 'pluskal-lab:main' into main

522982b

Add pth argument to load_massspecgym

5b67c77

Implement InMemCachedMolTransform

82b97d1

Allow no cache path in InMemCachedMolTransform

854ebe0

Fix handling of torch tensors in candidates

44236a0

Extend MolTransform with __str__

2725844

Emable cache building for in-mem transform

27b7b81

Implement RDKit canonicalization fix

7620e83

Move jss_fix to scripts/fixes

8864afa

Delete scripts/jss_fix.py

4350f0a

Implement RandomRetrievalGTFormula

f644764

Implement util for confidence intervals

84f721f

Update .gitignore to exclude large data files, HPC logs, figures, and…

c1b8309

… results Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roman-bushuiev merged commit f259fe3 into pluskal-lab:main May 8, 2026
1 check passed

roman-bushuiev deleted the feat/massspecgym-v1.5 branch May 8, 2026 15:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MassSpecGym v1.5 (part 1/2)#65

MassSpecGym v1.5 (part 1/2)#65
roman-bushuiev merged 27 commits into
pluskal-lab:mainfrom
roman-bushuiev:feat/massspecgym-v1.5

roman-bushuiev commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

roman-bushuiev commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant