
MassSpecGym v1.5 (part 1/2) #65

Merged
roman-bushuiev merged 27 commits into
pluskal-lab:main from
roman-bushuiev:feat/massspecgym-v1.5
May 8, 2026

Conversation

@roman-bushuiev

Contributor

No description provided.

roman-bushuiev and others added 27 commits November 10, 2025 01:06
… results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Supports comma-separated types (e.g., "morgan,maccs,map4") for multi-fingerprint
concatenation. MAP4 is metabolomics-specific and captures both local substructure
and long-range topology. MACCS provides substructure key-based fingerprints.
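
The comma-separated type string described above could be handled roughly as follows. This is a minimal sketch, not the repository's implementation: `multi_fingerprint`, `FINGERPRINTS`, and the bit widths are hypothetical names and values, and the per-type functions are toy placeholders standing in for real Morgan/MACCS/MAP4 generators so that the example stays dependency-free.

```python
import zlib
from typing import Callable, Dict, List


def _toy_fingerprint(n_bits: int, salt: str) -> Callable[[str], List[int]]:
    """Build a placeholder bit-vector function (NOT a real chemical fingerprint)."""
    def fp(smiles: str) -> List[int]:
        bits = [0] * n_bits
        for i in range(len(smiles)):
            # Hash each character bigram into one bit position.
            token = f"{salt}:{smiles[i:i + 2]}".encode()
            bits[zlib.crc32(token) % n_bits] = 1
        return bits
    return fp


# Hypothetical registry; the widths are arbitrary illustration values.
FINGERPRINTS: Dict[str, Callable[[str], List[int]]] = {
    "morgan": _toy_fingerprint(16, "morgan"),
    "maccs": _toy_fingerprint(8, "maccs"),
    "map4": _toy_fingerprint(12, "map4"),
}


def multi_fingerprint(smiles: str, fp_types: str) -> List[float]:
    """Parse 'a,b,c' and concatenate the fingerprints in the order given."""
    names = [n.strip().lower() for n in fp_types.split(",") if n.strip()]
    if not names:
        raise ValueError("no fingerprint type given")
    unknown = [n for n in names if n not in FINGERPRINTS]
    if unknown:
        raise ValueError(f"unknown fingerprint type(s): {unknown}")
    out: List[float] = []
    for name in names:
        out.extend(float(b) for b in FINGERPRINTS[name](smiles))
    return out
```

Concatenating in the order given keeps each fingerprint in a fixed slice of the output, so a model can still be inspected per fingerprint type.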

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…y mol dtype

- Add output_dtype parameter to InMemCachedMolTransform (defaults to
  torch.long for backwards compat). Allows torch.float32 for fingerprints.
- Fix RetrievalDataset.__getitem__: convert non-float integer tensors
  from cached transforms to dataset dtype for query molecules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Override get_checkpoint_monitors in RetrievalMassSpecGymModel so retrieval
runs track hit@1 (mode=max) for early stopping and save_last, while still
keeping a val_loss checkpoint (no early stopping). val_loss can decouple
from hit@k under heavier SMILES standardization (e.g. spectraverse), so
monitoring it led to premature termination while hit@1 was still climbing.
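
The monitoring split described above can be illustrated with a small, framework-free sketch. `MonitorSpec`, `retrieval_monitors`, `stop_epoch`, and the metric key `val_hit@1` are hypothetical; the real `get_checkpoint_monitors` signature and return type are not shown in this PR, and the numbers below are invented to show the decoupling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MonitorSpec:
    metric: str           # name of the logged metric (assumed key)
    mode: str             # "max" or "min"
    early_stopping: bool  # whether this metric may end training


def retrieval_monitors() -> List[MonitorSpec]:
    """hit@1 drives early stopping; val_loss is only checkpointed."""
    return [
        MonitorSpec("val_hit@1", "max", early_stopping=True),
        MonitorSpec("val_loss", "min", early_stopping=False),
    ]


def stop_epoch(history: List[Dict[str, float]], spec: MonitorSpec,
               patience: int) -> Optional[int]:
    """Return the epoch at which patience runs out for `spec`, else None."""
    best: Optional[float] = None
    bad = 0
    for epoch, metrics in enumerate(history):
        value = metrics[spec.metric]
        if best is None:
            improved = True
        elif spec.mode == "max":
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, bad = value, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None


# Invented numbers: val_loss worsens while hit@1 keeps improving.
history = [
    {"val_loss": 1.0, "val_hit@1": 0.10},
    {"val_loss": 1.1, "val_hit@1": 0.12},
    {"val_loss": 1.2, "val_hit@1": 0.14},
    {"val_loss": 1.3, "val_hit@1": 0.16},
]
hit_spec, loss_spec = retrieval_monitors()
```

With `patience=2`, stopping on `val_loss` would end this toy run at epoch 2, whereas stopping on `val_hit@1` never triggers.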

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…idation

Generates RDKit-canonical copies of MassSpecGym data alongside the existing
files, with empirical proof that the new pipeline is information-preserving:
DeepSets+Fourier baseline trained on v1 and v1.5 produced bit-exact identical
val_loss, val_fingerprint_cos_sim, and the first 24 train_loss steps.

Outputs (under MassSpecGym/data/, written by the SLURM job, not part of this
commit): MassSpecGym1.5.{tsv,mgf} and
MassSpecGym1.5_retrieval_candidates_{formula,mass}.json.

Changes:
- scripts/fixes/rdkit_canon_massspecgym.py: write under v1.5 names; add a
  pre-canonicalization sanity sample (2k random keys + first-cands per JSON)
  documenting that the v1 candidate JSONs are NOT already RDKit-canonical
  (~95.4% of keys would change); capture the pre-mapping SMILES column so
  every formula/mass/InChIKey mismatch log line shows BOTH orig_smi and
  new_smi.
- scripts/fixes/run_rdkit_canon_v1.5.sh: SLURM CPU wrapper (small partition,
  64 GB mem, 8h walltime; the actual run completed in 51 min).
- massspecgym/data/data_module.py: handle stage="validate" in
  _split_dataset (pre-existing bug from commit 4f6da4d that caused
  trainer.validate() to fail).
- scripts/run_v1.5_baseline.sh + scripts/submit_v1.5_baselines.sh: small-g
  1-GPU wrappers running the existing DeepSets+Fourier retrieval baseline
  on v1 vs v1.5 with identical config, seed, and hyperparameters.
- notebooks/massspecgym_in_the_wild/massspecgym_v1.5_validation.ipynb:
  end-to-end validation notebook (7 sections) with 5 PNG+SVG figures under
  figures/v1.5_validation/. Confirms 96% of SMILES change (77% kekulization-only,
  23% atom-order/stereo); all non-SMILES columns 0-changed except
  machine-epsilon noise on parent_mass/precursor_mz/collision_energy; 58 mass and
  272 InChIKey mismatches at MASS_TOL=0.1 Da are pre-existing data errors
  (the historical [P+](=O) phosphorus-cation series) shown side-by-side
  with both orig_smi and new_smi.
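
The mismatch logging mentioned in the list above (every log line carrying both `orig_smi` and `new_smi`) might look like the sketch below. It is an illustration only: `mismatch_log`, the row field names, and the log wording are hypothetical, and the rows are placeholders rather than dataset entries. Only the 0.1 Da tolerance comes from the commit message.

```python
from typing import Dict, List, Union

MASS_TOL = 0.1  # Da; the tolerance quoted in the commit message

Row = Dict[str, Union[str, float]]


def mismatch_log(rows: List[Row], mass_tol: float = MASS_TOL) -> List[str]:
    """Return one log line per mass or InChIKey mismatch, showing both SMILES."""
    lines: List[str] = []
    for row in rows:
        both = f"orig_smi={row['orig_smi']} new_smi={row['new_smi']}"
        delta = abs(float(row["orig_mass"]) - float(row["new_mass"]))
        if delta > mass_tol:
            lines.append(f"MASS mismatch (|delta|={delta:.4f} Da): {both}")
        if row["orig_inchikey"] != row["new_inchikey"]:
            lines.append(f"INCHIKEY mismatch: {both}")
    return lines


# Placeholder rows (keys and the second row's values are made up).
rows: List[Row] = [
    {"orig_smi": "C1=CC=CC=C1", "new_smi": "c1ccccc1",
     "orig_mass": 78.047, "new_mass": 78.047,
     "orig_inchikey": "KEY_A", "new_inchikey": "KEY_A"},
    {"orig_smi": "SMI_OLD", "new_smi": "SMI_NEW",
     "orig_mass": 100.0, "new_mass": 100.5,
     "orig_inchikey": "KEY_B", "new_inchikey": "KEY_C"},
]
```

Keeping both SMILES on the same line makes it possible to tell at a glance whether a mismatch was introduced by canonicalization or was already present in the input.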

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both DeepSets+Fourier baselines (`expmisc002_v1_baseline`,
`expmisc002_v1.5_baseline`) ran to the 6h walltime and TIMEOUT'd in the middle
of epoch 0 (step ~105). Notebook re-executed against the full W&B
`scan_history` shows:

- pre-fit `val_loss` and `val_fingerprint_cos_sim` are bit-exact identical
  between v1 and v1.5 (16-digit float64 equality)
- 104 / 104 logged `train_loss` steps are bit-exact identical (max |Δ| = 0.0)

This is a strictly stronger parity check than the planned `hit_rate@k`
comparison would have been: byte-for-byte equal training trajectories
through ~22% of an epoch directly demonstrate that RDKit canonicalization
preserves all the information this baseline ingests (Morgan fingerprints
are canonical-form-invariant, spectra are byte-identical between the two
TSVs, and seed=0 is fixed).
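
For readers who want to reproduce this kind of parity check, here is a minimal sketch of a bit-exact float64 comparison. `float64_bits`, `bit_exact`, and `max_abs_delta` are hypothetical helpers, not functions from the notebook; comparing packed bytes is stricter than `==`, which for example treats `0.0` and `-0.0` as equal.

```python
import struct
from typing import Sequence


def float64_bits(x: float) -> bytes:
    """Return the 8-byte IEEE 754 little-endian encoding of x."""
    return struct.pack("<d", x)


def bit_exact(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if both series have equal length and identical bit patterns."""
    return len(a) == len(b) and all(
        float64_bits(x) == float64_bits(y) for x, y in zip(a, b)
    )


def max_abs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference over the paired values (0.0 if empty)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

A `max_abs_delta` of exactly `0.0` together with `bit_exact(...)` returning `True` corresponds to the "max |Δ| = 0.0" statement above.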

Files: just the rerun notebook and refreshed PNG/SVG figures under
`figures/v1.5_validation/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ence

Two visualization fixes for the DeepSets+Fourier parity check:

- W&B summary drift: r.summary captures the latest logged value per
  metric, but train_loss is logged every step while v1 and v1.5 may be
  caught at different latest-step snapshots, making it look like the
  values differ. Cells now build the comparison table by merging
  histories on _step and reporting at each metric's last COMMON logged
  step (val_loss / val_fingerprint_cos_sim at step 0 since they only
  fire at validation epoch boundaries; train_loss at step 104, the
  last train step both runs reached). All three metrics are bit-exact
  identical at their respective last common steps.
- Convergence figure: v1 was hidden under v1.5 because the curves
  overlap exactly. Switched to dashed-blue + open-circle markers for
  v1 vs solid-orange + x markers for v1.5 so both stay visible even
  when their values coincide to machine precision.
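
The "last common logged step" comparison from the first fix can be sketched as below. This assumes each run's history is a list of row dicts keyed by `_step`, where a metric that was not logged at that step is missing or `None`; `metric_by_step` and `last_common` are hypothetical helper names, and the values are invented.

```python
from typing import Dict, List, Optional, Tuple

History = List[Dict[str, Optional[float]]]


def metric_by_step(history: History, metric: str) -> Dict[int, float]:
    """Map step -> value for the rows where `metric` was actually logged."""
    return {
        int(row["_step"]): row[metric]
        for row in history
        if row.get(metric) is not None
    }


def last_common(hist_a: History, hist_b: History,
                metric: str) -> Optional[Tuple[int, float, float]]:
    """Return (step, value_a, value_b) at the last step both runs logged."""
    a = metric_by_step(hist_a, metric)
    b = metric_by_step(hist_b, metric)
    common = a.keys() & b.keys()
    if not common:
        return None
    step = max(common)
    return step, a[step], b[step]


# Invented histories: run_a was caught one train step later than run_b.
run_a: History = [
    {"_step": 0, "val_loss": 2.0, "train_loss": 1.9},
    {"_step": 1, "train_loss": 1.8},
    {"_step": 2, "train_loss": 1.7},
    {"_step": 3, "train_loss": 1.6},
]
run_b: History = [
    {"_step": 0, "val_loss": 2.0, "train_loss": 1.9},
    {"_step": 1, "train_loss": 1.8},
    {"_step": 2, "train_loss": 1.7},
]
```

Comparing run summaries here would contrast `train_loss` at step 3 with step 2, while `last_common` reports both runs at step 2.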

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PNG/SVG files under notebooks/massspecgym_in_the_wild/figures/v1.5_validation/
were redundant with the figure outputs already embedded in the executed
notebook. Removed them and extended .gitignore to ignore future
notebooks/**/figures/ outputs so re-running the notebook locally does
not re-stage the same images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MassSpecDataset, RetrievalDataset, SimulationDataset, RetrievalSimulationDataset
and load_massspecgym() now pull MassSpecGym1.5.tsv and the matching
MassSpecGym1.5_retrieval_candidates_{mass,formula}.json from HuggingFace
when no explicit path is given. Verified: hf_hub_download succeeds for
all three v1.5 files and `MassSpecDataset()` round-trips 231,104 rows.

Existing user code that passes pth=... or candidates_pth=... is unaffected.
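
The default-versus-explicit path behaviour can be sketched as follows. `resolve_path` and its `download` parameter are hypothetical; the downloader is injected so the example runs offline, and it is not a claim about how the dataset classes are actually written.

```python
from pathlib import Path
from typing import Callable, Optional, Union

DEFAULT_TSV = "MassSpecGym1.5.tsv"  # file name taken from the commit message


def resolve_path(pth: Optional[Union[str, Path]],
                 default_filename: str,
                 download: Callable[[str], str]) -> Path:
    """Use the caller's path if given; otherwise fetch the default file."""
    if pth is not None:
        return Path(pth)  # explicit paths are never overridden
    return Path(download(default_filename))


# Offline stand-in for a real downloader; it only records what was requested.
requested = []


def fake_download(filename: str) -> str:
    requested.append(filename)
    return f"/tmp/cache/{filename}"


default_path = resolve_path(None, DEFAULT_TSV, fake_download)
explicit_path = resolve_path("my_data/custom.tsv", DEFAULT_TSV, fake_download)
```

In practice the `download` argument could wrap `huggingface_hub.hf_hub_download(repo_id=..., filename=name, repo_type="dataset")`; the repository ID is not stated in this PR, so it is left elided here.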

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HuggingFace defaults now serve v1.5 candidate JSONs whose keys are
RDKit-canonical SMILES; the demo's pth was still 'MassSpecGym.tsv'
(PubChem-canonical), so RetrievalSimulationDataset's
`candidates_mask.all()` assertion could fail on a 1% subsample whose
SMILES no longer match the candidate keys. Switch demo.yml to
'MassSpecGym1.5.tsv' so the simulation test runs against a
self-consistent v1.5 dataset.
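
The failure mode fixed here can be reproduced in miniature. In this sketch `candidates_mask` is a hypothetical stand-in for the dataset's internal check, and benzene serves as an illustrative molecule whose Kekulé-form SMILES differs from its aromatic-form SMILES even though both denote the same structure.

```python
from typing import Dict, List


def candidates_mask(query_smiles: List[str],
                    candidates: Dict[str, List[str]]) -> List[bool]:
    """True for each query SMILES that appears verbatim as a candidate key."""
    return [smi in candidates for smi in query_smiles]


# Candidate keys in aromatic form (toy content, not real dataset entries).
candidates_v15 = {"c1ccccc1": ["c1ccccc1", "C=CC#CC=C"]}

queries_v1 = ["C1=CC=CC=C1"]  # Kekulé form of the same molecule
queries_v15 = ["c1ccccc1"]    # matches the candidate key exactly

mask_mixed = candidates_mask(queries_v1, candidates_v15)
mask_consistent = candidates_mask(queries_v15, candidates_v15)
```

Because the lookup is a plain string match, mixing a v1 TSV with v1.5 candidate JSONs leaves `all(mask_mixed)` false, while a fully v1.5 setup passes.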

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
roman-bushuiev merged commit f259fe3 into pluskal-lab:main on May 8, 2026
1 check passed
roman-bushuiev deleted the feat/massspecgym-v1.5 branch on May 8, 2026 at 15:06


1 participant