Optional layer for non-InChI identifiers? #1

@Artoria2e5

The MInChI Demo page includes some interesting mixfiles (well, if you "copy branch" it's basically a JSON mixfile without mixfileVersion) whose components have no known InChI structure, such as:

  • No structures at all: BSA blocking buffer + PBS; bechamel sauce
  • Partial lack: Dodecacarbonyltriiron

Right now the produced MInChI is not very informative in these cases. I propose adding an optional layer /x (external identifiers) to handle this problem.

/x layer

The /x layer consists of the following parts:

  • A main part, consisting of percent-encoded strings separated by the character &. Characters that MUST be encoded are /, &, unprintable characters, and whitespace characters. (I choose this style because it originates in an environment that uses & and /.)
    • The use of + in place of %20 for encoding a space is permitted. (Purely aesthetic reasons.)
  • A mandatory /n sublayer which is very similar to the main /n layer, but with the ability to associate multiple strings with a substance as well as the ability to name a group. (This will cause some duplication of information in the nesting structure. We already do that with /g.)
  • An optional /t sublayer specifying the type of each identifier in the main part. This sublayer contains a string, each character giving the type of the identifier at the corresponding position in the &-separated main part. Acceptable types include (each of these has a Mixfile counterpart):
    • f: formula (likely used when an InChI cannot be made, either because the connectivity is unknown or because the formula has numbers in a range)
    • s: SMILES
    • n: Human-readable name
    • k: InChIKey
    • (I could specify one for Molfile here but the size would be comical. A URL-safe base64 encoding of gzipped Molfile? Nah sounds too complicated.)
    • (There are some additional database references that can be added, though these will NOT have a Mixfile counterpart. It could make sense to just write another "name" for now.)
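
As a rough illustration of the encoding rules above, the following Python sketch builds the main part and reads it back, pairing each field with its /t type code. The function names and the set of punctuation left unescaped are assumptions made for this example; the proposal only fixes which characters MUST be encoded.

```python
from urllib.parse import quote, unquote_plus

# Type codes listed in the proposal for the /t sublayer.
TYPE_CODES = {"f": "formula", "s": "SMILES", "n": "name", "k": "InChIKey"}

# Assumption: punctuation kept as-is for readability. "/", "&", "%", "+" and
# whitespace are deliberately absent, so quote() percent-encodes them.
_SAFE = "()[]:,;=@#"


def encode_x_main(identifiers):
    """Join identifiers into the main part of a /x layer."""
    fields = []
    for text in identifiers:
        encoded = quote(text, safe=_SAFE)
        # The proposal permits "+" in place of "%20"; a literal "+" was
        # already escaped to "%2B" by quote(), so this stays unambiguous.
        fields.append(encoded.replace("%20", "+"))
    return "&".join(fields)


def decode_x_main(main):
    """Split the main part on "&" and undo the percent-encoding."""
    return [unquote_plus(field) for field in main.split("&")]


def typed_identifiers(main, types):
    """Pair each decoded identifier with the type named in the /t string."""
    identifiers = decode_x_main(main)
    if len(identifiers) != len(types):
        raise ValueError("/t must have one character per identifier")
    unknown = set(types) - set(TYPE_CODES)
    if unknown:
        raise ValueError(f"unknown type codes: {sorted(unknown)}")
    return [(TYPE_CODES[t], text) for t, text in zip(types, identifiers)]
```

With these helpers, `encode_x_main(["flour dispersed in butter", "milk"])` gives `flour+dispersed+in+butter&milk`, and an identifier containing `/` or `&` survives a round trip through `decode_x_main`.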

The /x layer shall only appear on non-"standard MInChI", i.e. "MInChI=0.00.1" without the "S". There is too much variability for anything to be reproducible here. Luckily we don't have a MInChIKey...

Basic example (with whitespace added)

MInChI=0.00.1//n{{&}&}/g{{466wf-3&534wf-3}91wf-3&909wf-3}
 /xbutter&flour&flour+dispersed+in+butter&milk&bechamel+sauce
  /n{{1&2}3&4}5
  /tnnnnn

Example of three identifiers for the same component:

MInChI=0.00.1/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3/n{&1}/g{1:5pp0&}
 /xOctacarbonyldicobalt&Co2(CO)8&PubChem_CID:25049
  /n{1,2,3&}
  /tnfn

On /n

When an /n sublayer is present, it should have the same "shape-of-braces" as the main /n layer. The format is the same as the main /n layer, with the exception that

  • each structure can refer to multiple identifiers in the main part. This is resolved by allowing the use of a comma , between numbers describing the same structure.
  • each brace-grouping may have its own label. This is handled by permitting number-lists to be used after the closing brace, before the &. (This resembles Newick format.)
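
The two exceptions above can be sketched as a small recursive-descent parser. This is only one possible reading of the proposal: the tree representation, the function names, and the treatment of an empty slot as an empty number list are all assumptions made for the example.

```python
def parse_x_n(text):
    """Parse a /x /n sublayer such as "{{1&2}3&4}5" into a nested dict.

    Each node is {"labels": [...], "children": [...] or None}; a leaf has
    children None, and labels hold 1-based indices into the main part.
    """
    pos = 0

    def numbers():
        # Read a possibly empty comma-separated list of integers.
        nonlocal pos
        start = pos
        while pos < len(text) and (text[pos].isdigit() or text[pos] == ","):
            pos += 1
        return [int(n) for n in text[start:pos].split(",") if n]

    def item():
        nonlocal pos
        if pos < len(text) and text[pos] == "{":
            pos += 1
            children = [item()]
            while pos < len(text) and text[pos] == "&":
                pos += 1
                children.append(item())
            if pos >= len(text) or text[pos] != "}":
                raise ValueError(f"expected '}}' at offset {pos}")
            pos += 1
            # Newick-style: the group's own labels follow the closing brace.
            return {"labels": numbers(), "children": children}
        return {"labels": numbers(), "children": None}

    tree = item()
    if pos != len(text):
        raise ValueError(f"unexpected character at offset {pos}")
    return tree


def shape(node):
    """Return the bare brace skeleton, for comparison with the main /n layer."""
    if node["children"] is None:
        return ""
    return "{" + "&".join(shape(child) for child in node["children"]) + "}"
```

Because the main /n layer is a subset of this syntax, the same parser can read both layers, and the "shape-of-braces" rule becomes `shape(parse_x_n("{{1&2}3&4}5")) == shape(parse_x_n("{{&}&}"))`.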

About names

/x is currently unused, and the letter sounds like the "ex" in "external". I think it's an acceptable use of a letter, unless someone has some other use in mind (e.g. using /x like the x- prefix of MIME types for experimental/extensions in general).
