MassSpecGym v1.5 (part 1/2) #65
Merged
roman-bushuiev merged 27 commits on May 8, 2026
… results
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Supports comma-separated types (e.g., "morgan,maccs,map4") for multi-fingerprint concatenation. MAP4 is metabolomics-specific and captures both local substructure and long-range topology. MACCS provides substructure key-based fingerprints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
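The comma-separated fingerprint spec described in this commit can be pictured with a minimal sketch. It is not the code from this PR: `FP_REGISTRY`, `multi_fingerprint`, and the toy featurizers are hypothetical stand-ins, and the vector lengths (2048, 167, 1024) are assumed typical sizes rather than values taken from the repository. Real Morgan, MACCS, and MAP4 featurizers would replace the toy functions.

```python
# Hypothetical stand-in for a real featurizer (Morgan, MACCS, MAP4).
# It hashes characters into a fixed-length 0/1 vector; it is NOT chemically meaningful.
def _toy_fingerprint(n_bits, salt):
    def featurize(smiles):
        vec = [0.0] * n_bits
        for i, ch in enumerate(smiles):
            vec[(ord(ch) * 31 + i + salt) % n_bits] = 1.0
        return vec
    return featurize


# Assumed registry: names mirror the commit message, lengths are illustrative.
FP_REGISTRY = {
    "morgan": _toy_fingerprint(2048, salt=0),
    "maccs": _toy_fingerprint(167, salt=1),
    "map4": _toy_fingerprint(1024, salt=2),
}


def multi_fingerprint(smiles, fp_types):
    """Parse a spec such as "morgan,maccs,map4" and concatenate the vectors in order."""
    names = [t.strip().lower() for t in fp_types.split(",") if t.strip()]
    if not names:
        raise ValueError("no fingerprint type given")
    unknown = [n for n in names if n not in FP_REGISTRY]
    if unknown:
        raise ValueError(f"unknown fingerprint type(s): {unknown}")
    out = []
    for name in names:
        out.extend(FP_REGISTRY[name](smiles))
    return out


vec = multi_fingerprint("CCO", "morgan, maccs")
print(len(vec))  # length is the sum of the component lengths
```

Concatenating in spec order keeps the layout of the final vector predictable, so a downstream model can be configured from the spec string alone.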
…y mol dtype
- Add output_dtype parameter to InMemCachedMolTransform (defaults to torch.long for backwards compat). Allows torch.float32 for fingerprints.
- Fix RetrievalDataset.__getitem__: convert non-float integer tensors from cached transforms to dataset dtype for query molecules.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Override get_checkpoint_monitors in RetrievalMassSpecGymModel so retrieval runs track hit@1 (mode=max) for early stopping and save_last, while still keeping a val_loss checkpoint (no early stopping). val_loss can decouple from hit@k under heavier SMILES standardization (e.g. spectraverse), so monitoring it led to premature termination while hit@1 was still climbing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
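To illustrate why the monitored metric matters here, the following sketch applies generic patience-based early stopping to two made-up metric curves. The function `stop_epoch`, the patience value, and all numbers are illustrative assumptions; they are not taken from the PR or from any actual run.

```python
def stop_epoch(values, mode, patience):
    """Return the epoch at which patience-based early stopping fires, or None."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(values):
        if best is None:
            improved = True
        elif mode == "max":
            improved = value > best
        else:  # mode == "min"
            improved = value < best
        if improved:
            best, bad_epochs = value, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Made-up curves: val_loss drifts upward while hit@1 keeps improving.
val_loss = [1.00, 0.90, 0.92, 0.95, 0.97, 0.99]
hit_at_1 = [0.10, 0.12, 0.14, 0.15, 0.17, 0.18]

print(stop_epoch(val_loss, mode="min", patience=3))  # fires at epoch 4
print(stop_epoch(hit_at_1, mode="max", patience=3))  # None: never fires
```

With these curves, stopping on val_loss would end training while hit@1 is still rising, which is the failure mode the commit message describes.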
…idation
Generates RDKit-canonical copies of MassSpecGym data alongside the existing
files, with empirical proof that the new pipeline is information-preserving:
DeepSets+Fourier baseline trained on v1 and v1.5 produced bit-exact identical
val_loss, val_fingerprint_cos_sim, and the first 24 train_loss steps.
Outputs (under MassSpecGym/data/, written by the SLURM job, not part of this
commit): MassSpecGym1.5.{tsv,mgf} and
MassSpecGym1.5_retrieval_candidates_{formula,mass}.json.
Changes:
- scripts/fixes/rdkit_canon_massspecgym.py: write under v1.5 names; add a
pre-canonicalization sanity sample (2k random keys + first-cands per JSON)
documenting that the v1 candidate JSONs are NOT already RDKit-canonical
(~95.4% of keys would change); capture the pre-mapping SMILES column so
every formula/mass/InChIKey mismatch log line shows BOTH orig_smi and
new_smi.
- scripts/fixes/run_rdkit_canon_v1.5.sh: SLURM CPU wrapper (small partition,
64 GB mem, 8h walltime; the actual run completed in 51 min).
- massspecgym/data/data_module.py: handle stage="validate" in
_split_dataset (pre-existing bug from commit 4f6da4d that caused
trainer.validate() to fail).
- scripts/run_v1.5_baseline.sh + scripts/submit_v1.5_baselines.sh: small-g
1-GPU wrappers running the existing DeepSets+Fourier retrieval baseline
on v1 vs v1.5 with identical config, seed, and hyperparameters.
- notebooks/massspecgym_in_the_wild/massspecgym_v1.5_validation.ipynb:
end-to-end validation notebook (7 sections) with 5 PNG+SVG figures under
figures/v1.5_validation/. Confirms 96% of SMILES change (77% Kekulé-only,
23% atom-order/stereo); all non-SMILES columns 0-changed except machine-
epsilon noise on parent_mass/precursor_mz/collision_energy; 58 mass and
272 InChIKey mismatches at MASS_TOL=0.1 Da are pre-existing data errors
(the historical [P+](=O) phosphorus-cation series) shown side-by-side
with both orig_smi and new_smi.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both DeepSets+Fourier baselines (`expmisc002_v1_baseline`, `expmisc002_v1.5_baseline`) ran to the 6h walltime and TIMEOUT'd in the middle of epoch 0 (step ~105). Notebook re-executed against the full W&B `scan_history` shows:
- pre-fit `val_loss` and `val_fingerprint_cos_sim` are bit-exact identical between v1 and v1.5 (16-digit float64 equality)
- 104 / 104 logged `train_loss` steps are bit-exact identical (max |Δ| = 0.0)
This is a strictly stronger parity check than the planned `hit_rate@k` comparison would have been: byte-for-byte equal training trajectories through ~22% of an epoch directly demonstrate that RDKit canonicalization preserves all the information this baseline ingests (Morgan fingerprints are canonical-form-invariant, spectra are byte-identical between the two TSVs, and seed=0 is fixed).
Files: just the rerun notebook and refreshed PNG/SVG figures under `figures/v1.5_validation/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
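A bit-exact comparison of two logged float series, as reported above, could be checked along the following lines. This is a sketch rather than the notebook's code: `bit_exact_report` is a hypothetical helper and the example values are invented.

```python
import struct


def _float64_bits(x):
    """Pack a Python float into its 8-byte IEEE 754 representation."""
    return struct.pack("<d", x)


def bit_exact_report(series_a, series_b):
    """Compare two series pairwise over their common length.

    Returns (number of bit-identical pairs, number of pairs compared,
    maximum absolute difference).
    """
    pairs = list(zip(series_a, series_b))
    identical = sum(_float64_bits(a) == _float64_bits(b) for a, b in pairs)
    max_abs_diff = max((abs(a - b) for a, b in pairs), default=0.0)
    return identical, len(pairs), max_abs_diff


# Invented values: 0.1 + 0.2 differs from 0.3 in the last bits of the float64.
run_a = [0.1 + 0.2, 1.0]
run_b = [0.3, 1.0]
print(bit_exact_report(run_a, run_a))  # every pair identical, max |Δ| is 0.0
print(bit_exact_report(run_a, run_b))  # one pair differs by a rounding error
```

Comparing the packed bytes is stricter than comparing rounded printouts, which can hide differences in the last bits.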
…ence
Two visualization fixes for the DeepSets+Fourier parity check:
- W&B summary drift: r.summary captures the latest logged value per metric, but train_loss is logged every step while v1 and v1.5 may be caught at different latest-step snapshots, making it look like the values differ. Cells now build the comparison table by merging histories on _step and reporting at each metric's last COMMON logged step (val_loss / val_fingerprint_cos_sim at step 0 since they only fire at validation epoch boundaries; train_loss at step 104, the last train step both runs reached). All three metrics are bit-exact identical at their respective last common steps.
- Convergence figure: v1 was hidden under v1.5 because the curves overlap exactly. Switched to dashed-blue + open-circle markers for v1 vs solid-orange + x markers for v1.5 so both stay visible even when their values coincide to machine precision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
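The "last common logged step" comparison described in the first fix can be sketched with plain dictionaries. The helper `last_common_value` and the history rows below are hypothetical; the only assumption carried over from W&B is that each history row is a dict with a `_step` key and that a metric is absent or None on steps where it was not logged.

```python
def last_common_value(hist_a, hist_b, metric):
    """Return (step, value_a, value_b) at the last step where both runs logged `metric`.

    Returns None if the two runs share no logged step for that metric.
    """
    a = {row["_step"]: row[metric] for row in hist_a if row.get(metric) is not None}
    b = {row["_step"]: row[metric] for row in hist_b if row.get(metric) is not None}
    common_steps = a.keys() & b.keys()
    if not common_steps:
        return None
    step = max(common_steps)
    return step, a[step], b[step]


# Invented histories: the second run was caught one train step earlier.
hist_v1 = [
    {"_step": 0, "val_loss": 0.70},
    {"_step": 1, "train_loss": 0.65},
    {"_step": 2, "train_loss": 0.60},
    {"_step": 3, "train_loss": 0.58},
]
hist_v15 = [
    {"_step": 0, "val_loss": 0.70},
    {"_step": 1, "train_loss": 0.65},
    {"_step": 2, "train_loss": 0.60},
]

print(last_common_value(hist_v1, hist_v15, "train_loss"))  # compared at step 2, not step 3
print(last_common_value(hist_v1, hist_v15, "val_loss"))    # compared at step 0
```

Comparing summaries instead would pit step 3 of one run against step 2 of the other, which is the apparent drift the commit fixes.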
The PNG/SVG files under notebooks/massspecgym_in_the_wild/figures/v1.5_validation/ were redundant with the figure outputs already embedded in the executed notebook. Removed them and extended .gitignore to ignore future notebooks/**/figures/ outputs so re-running the notebook locally does not re-stage the same images.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MassSpecDataset, RetrievalDataset, SimulationDataset, RetrievalSimulationDataset
and load_massspecgym() now pull MassSpecGym1.5.tsv and the matching
MassSpecGym1.5_retrieval_candidates_{mass,formula}.json from HuggingFace
when no explicit path is given. Verified: hf_hub_download succeeds for
all three v1.5 files and `MassSpecDataset()` round-trips 231,104 rows.
Existing user code that passes pth=... or candidates_pth=... is unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HuggingFace defaults now serve v1.5 candidate JSONs whose keys are RDKit-canonical SMILES; the demo's pth was still 'MassSpecGym.tsv' (PubChem-canonical), so RetrievalSimulationDataset's `candidates_mask.all()` assertion could fail on a 1% subsample whose SMILES no longer match the candidate keys. Switch demo.yml to 'MassSpecGym1.5.tsv' so the simulation test runs against a self-consistent v1.5 dataset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
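The mismatch described above comes down to query SMILES not appearing as keys of the candidate dictionary. A minimal sketch of such a consistency check follows; `missing_candidate_keys` is a hypothetical helper, and the benzene strings are illustrative examples of a Kekulé form versus an aromatic canonical form, not rows from the dataset.

```python
def missing_candidate_keys(query_smiles, candidates):
    """Return the sorted query SMILES that have no entry in the candidates dict."""
    return sorted({s for s in query_smiles if s not in candidates})


# Candidate keys in aromatic canonical form (illustrative).
candidates_v15 = {"c1ccccc1": ["c1ccccc1", "C1=CC=CC=C1C"]}

# The same molecule written in Kekulé form does not match the key as a string.
queries_v1 = ["C1=CC=CC=C1"]
queries_v15 = ["c1ccccc1"]

print(missing_candidate_keys(queries_v1, candidates_v15))   # the Kekulé string is missing
print(missing_candidate_keys(queries_v15, candidates_v15))  # empty list: self-consistent
```

Because the lookup is a plain string match, the dataset file and the candidate JSON need to come from the same canonicalization, which is why the demo config is switched to the v1.5 TSV.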
No description provided.