String models exploit biases in MoleculeNet SMILES dialect to inflate performance #2

@cyrusmaher

Following up on a conversation with Meng Liu, I wanted to link this bug. I confirmed it for ClinTox, but it may be present for other datasets:
deepchem/moleculenet#15

One set of solutions would be:

  • Refactoring input parsing code to be shared across models
  • Adding SMILES canonicalization to input parsing, so that every string goes through the same RDKit round trip: from rdkit import Chem; Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)
  • Re-running string-based models on all benchmarks
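
To make the first two suggestions concrete, here is a minimal sketch of what a shared parsing helper could look like. The names `rdkit_canonical_smiles` and `parse_smiles_column` are hypothetical and are not taken from this repository; only the RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles` come from the suggestion above. Returning `None` for strings RDKit cannot parse is an assumption; a real loader might prefer to drop or log those rows.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Imported lazily so this module still loads where RDKit is not installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None on a parse failure
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def parse_smiles_column(
    raw_smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical_smiles,
) -> List[Optional[str]]:
    """Hypothetical shared entry point for all string-based models.

    Strips surrounding whitespace and canonicalizes each input, keeping the
    output aligned with the input rows (unparseable rows become None).
    """
    return [canonicalize(s.strip()) for s in raw_smiles]
```

With RDKit installed, `parse_smiles_column(["OCC", "C(O)C"])` should map both spellings of ethanol to the same string, which removes the dialect signal a string model could otherwise pick up on. The `canonicalize` argument is there so the same entry point can be tested, or pointed at a different toolkit, without touching the models.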
