[FEAT] Scaffold-based applicability domain and uncertainty quantification #502

@smcolby

Proposed module: Scaffold-based applicability domain and uncertainty quantification

Objective: To provide empirical, instance-specific uncertainty bounds by mapping query compounds to discrete Bemis-Murcko scaffolds. Matches to well-represented training scaffolds receive an in-domain interpolation error bound, while novel scaffolds receive a conservative global out-of-domain extrapolation error bound.

1. Data partitioning and scaffold binning

  • Scaffold extraction: Generate the exact Bemis-Murcko scaffold (represented as a canonical SMILES string) for every molecule in the master dataset.
  • Frequency binning: Count the occurrence of each distinct scaffold.
  • Primary clusters: Isolate scaffolds with a frequency above a defined threshold to serve as the N independent, well-represented clusters.
  • Miscellaneous cluster: Pool all singletons and low-frequency scaffolds into a single "miscellaneous" group.
  • Intra-scaffold split: Perform an 80/20 train/validation split strictly within each of the N primary clusters, creating active training pools and permanently held-out interpolation validation pools.
  • k-fold grouping: Randomly assign the N primary clusters into k distinct folds (e.g., k=5 or k=10) to dictate the cross-validation splits.
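
The partitioning steps above can be sketched as follows. This is a minimal illustration, not a proposed implementation: it assumes the canonical Bemis-Murcko scaffold SMILES have already been computed for every molecule (for example with a cheminformatics toolkit), and the function name `partition_by_scaffold`, the `min_count` threshold (applied here as "at least `min_count` molecules"), and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def partition_by_scaffold(scaffolds, min_count=5, k=5, val_fraction=0.2, seed=0):
    """Bin molecules by scaffold, split primary clusters, and assign folds.

    scaffolds: one canonical scaffold SMILES per molecule; the list index
    serves as the molecule id.
    """
    rng = random.Random(seed)
    counts = Counter(scaffolds)

    # Group molecule indices by their scaffold string.
    members = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        members[scaffold].append(idx)

    train, val, misc = {}, {}, []
    for scaffold, idxs in members.items():
        if counts[scaffold] < min_count:
            # Singletons and low-frequency scaffolds go to the pooled group.
            misc.extend(idxs)
            continue
        # Split strictly within the primary cluster.
        shuffled = idxs[:]
        rng.shuffle(shuffled)
        n_val = max(1, round(val_fraction * len(shuffled)))
        val[scaffold] = shuffled[:n_val]
        train[scaffold] = shuffled[n_val:]

    # Randomly assign whole primary clusters to k folds (round-robin
    # over a shuffled order keeps fold sizes balanced by cluster count).
    primary = sorted(train)
    rng.shuffle(primary)
    fold_of = {scaffold: i % k for i, scaffold in enumerate(primary)}
    return train, val, misc, fold_of
```

The round-robin assignment balances folds by number of clusters, not by number of molecules; balancing by molecule count would need a different assignment rule.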

2. Model training and error evaluation

  • k-fold scaffold cross-validation: Loop over the k folds. In each iteration, hold out one entire fold of primary scaffolds and train on the training pools of the remaining primary scaffolds. The miscellaneous cluster remains in the active training data across all folds.
  • Interpolation error (in-domain): Evaluate the model on the interpolation validation pools of the active training scaffolds. Calculate and record a high percentile (e.g., the 95th percentile) of the absolute errors for each specific primary scaffold SMILES.
  • Extrapolation error (out-of-domain): Evaluate the model on the held-out fold of primary scaffolds. Record the raw error distribution for these held-out scaffolds to contribute to the global out-of-domain profile.
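
A rough sketch of the evaluation loop described above. Several details are assumptions made for illustration only: `fit` and `predict` are hypothetical callables standing in for whatever model is used, errors for a scaffold are pooled across every fold in which it is active before taking the percentile, held-out scaffolds are evaluated on all of their molecules, extrapolation errors are stored as absolute values, and the percentile uses the nearest-rank definition.

```python
import math


def nearest_rank(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered) / 100)) - 1]


def scaffold_cross_validate(train, val, misc, fold_of, y, fit, predict, k, q=95):
    """train/val: {scaffold: [molecule ids]}; misc: [molecule ids];
    fold_of: {scaffold: fold index}; y: true value per molecule id."""
    per_scaffold_errors = {s: [] for s in train}
    ood_errors = []
    for fold in range(k):
        held_out = [s for s in train if fold_of[s] == fold]
        active = [s for s in train if fold_of[s] != fold]

        # The miscellaneous cluster is always part of the training data.
        train_idx = list(misc) + [i for s in active for i in train[s]]
        model = fit(train_idx)

        # Interpolation: validation pools of scaffolds seen in training.
        for s in active:
            per_scaffold_errors[s] += [abs(predict(model, i) - y[i]) for i in val[s]]

        # Extrapolation: every molecule of the held-out scaffolds.
        for s in held_out:
            ood_errors += [abs(predict(model, i) - y[i]) for i in train[s] + val[s]]

    interp = {s: nearest_rank(e, q) for s, e in per_scaffold_errors.items() if e}
    return interp, ood_errors
```

With k=1 every scaffold is held out and no interpolation errors are produced, so the sketch only makes sense for k of at least 2.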

3. Scaffold profiling and storage

  • Primary dictionary construction: Create a hash map where keys are the canonical SMILES strings of the N primary clusters and values are their scaffold-specific interpolation errors, using the same percentile chosen in step 2 (e.g., the 95th percentile).
  • Miscellaneous logging: Keep a record of the SMILES strings belonging to the miscellaneous cluster.
  • Global OOD definition: Aggregate the raw extrapolation errors from all k folds into a single global distribution. Calculate a high percentile (e.g., the 95th percentile) of this aggregate distribution to establish the global OOD baseline.
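
One possible shape for the stored profile, shown as a sketch. The function name `build_profile`, the dictionary keys, the JSON serialization, and the nearest-rank percentile are all assumptions chosen for illustration; the inputs are per-scaffold interpolation error lists and the pooled extrapolation errors from all folds.

```python
import json
import math


def nearest_rank(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered) / 100)) - 1]


def build_profile(interp_errors, ood_errors, misc_scaffolds, q=95):
    """interp_errors: {scaffold SMILES: [absolute errors]};
    ood_errors: pooled extrapolation errors from all k folds;
    misc_scaffolds: scaffold SMILES in the miscellaneous cluster."""
    return {
        "percentile": q,
        # Primary dictionary: scaffold -> in-domain error bound.
        "primary": {s: nearest_rank(e, q) for s, e in interp_errors.items()},
        # Miscellaneous log, sorted so the stored file is stable.
        "miscellaneous": sorted(misc_scaffolds),
        # Global OOD baseline from the aggregate distribution.
        "global_ood": nearest_rank(ood_errors, q),
    }


profile = build_profile(
    {"c1ccccc1": [0.1, 0.2, 0.3]}, [1.0, 2.0, 3.0, 4.0], {"C1CCCCC1"}
)
serialized = json.dumps(profile, indent=2)
```

Storing the percentile alongside the bounds keeps the in-domain and OOD values tied to the same definition when the profile is reloaded.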

[Optional additions for future implementation]

  • Generic framework profiling: For each primary cluster, strip the atom types and bond orders to generate a generic Murcko framework. Store these in a secondary dictionary to enable fuzzy matching.
  • Physicochemical profiling: Calculate and store the 5th and 95th percentile bounds for key physicochemical descriptors (e.g., LogP, TPSA, molecular weight) for the active training pool of each primary cluster.
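
The physicochemical bounds could be stored along these lines. This sketch assumes descriptor values (LogP, TPSA, and so on) have already been computed for each cluster's active training pool; the function name `descriptor_bounds`, the nested dictionary layout, and the nearest-rank percentile are hypothetical choices. Generic framework generation is not shown because it requires a cheminformatics toolkit.

```python
import math


def nearest_rank(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered) / 100)) - 1]


def descriptor_bounds(descriptors_by_scaffold, low=5, high=95):
    """descriptors_by_scaffold: {scaffold: {descriptor name: [values]}}.

    Returns {scaffold: {descriptor name: (lower bound, upper bound)}}.
    """
    return {
        scaffold: {
            name: (nearest_rank(values, low), nearest_rank(values, high))
            for name, values in table.items()
        }
        for scaffold, table in descriptors_by_scaffold.items()
    }
```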

4. Inference and error assignment

  • Query processing: Extract the canonical Bemis-Murcko scaffold SMILES from the query compound.
  • In-domain assignment: If the query's scaffold exists in the primary dictionary, assign its specific percentile-based interpolation error.
  • Miscellaneous/novel assignment: If the query's scaffold is found in the miscellaneous log, or is entirely absent from the training data, assign the global OOD baseline error.
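
The base lookup above reduces to a dictionary check. In this sketch the function name `assign_error` and the returned labels are hypothetical; the query scaffold is assumed to be canonicalized the same way as the stored keys, since the match is an exact string comparison.

```python
def assign_error(query_scaffold, primary_errors, misc_scaffolds, global_ood):
    """Return (error bound, label) for a query's canonical scaffold SMILES.

    primary_errors: {scaffold: interpolation error bound}
    misc_scaffolds: set of scaffolds from the miscellaneous cluster
    global_ood: global out-of-domain baseline error
    """
    if query_scaffold in primary_errors:
        return primary_errors[query_scaffold], "in-domain"
    # Miscellaneous and novel scaffolds receive the same bound; the label
    # only records which case applied.
    if query_scaffold in misc_scaffolds:
        return global_ood, "miscellaneous"
    return global_ood, "novel"
```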

[Optional additions for future implementation]

  • Two-tiered framework lookup: If the exact scaffold match fails, extract the query's generic Murcko framework. If this generic framework exists in the secondary dictionary, assign an intermediate error bound (e.g., a weighted average of the interpolation error and the global OOD baseline).
  • Physicochemical guardrail check: If a query matches a primary cluster (either exactly or via generic framework), calculate its physicochemical descriptors. If these fall outside the stored historical bounds for that specific scaffold, override the in-domain assignment and assign the global OOD baseline error.
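
The two optional checks could be combined as sketched here. Several choices are assumptions not specified in the proposal: the secondary dictionary maps each generic framework to a single primary scaffold (a rule would be needed when several primary scaffolds share a framework), the intermediate bound uses a fixed `weight`, and a missing descriptor is treated conservatively as out of bounds. The function name `assign_error_tiered` is hypothetical, and the query's scaffold, generic framework, and descriptors are assumed to be computed beforehand.

```python
def assign_error_tiered(scaffold, framework, descriptors, primary_errors,
                        framework_to_scaffold, bounds, global_ood, weight=0.5):
    """Exact match, then generic-framework match, then guardrail check.

    descriptors: {descriptor name: value} for the query compound
    bounds: {scaffold: {descriptor name: (lower, upper)}}
    """
    if scaffold in primary_errors:
        matched = scaffold
        error = primary_errors[matched]
    elif framework in framework_to_scaffold:
        matched = framework_to_scaffold[framework]
        # Intermediate bound between in-domain error and OOD baseline.
        error = weight * primary_errors[matched] + (1 - weight) * global_ood
    else:
        return global_ood

    # Guardrail: any descriptor outside the stored bounds for the matched
    # scaffold overrides the assignment with the OOD baseline.
    for name, (lower, upper) in bounds.get(matched, {}).items():
        value = descriptors.get(name)
        if value is None or not (lower <= value <= upper):
            return global_ood
    return error
```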
