Hardcoded PeptideL causes pipeline failure on Protein-DNA complexes (e.g. 3K5M) #43

Hi team,

I encountered an issue when processing Protein-DNA complexes (e.g. 3K5M) and traced the root cause to the hardcoded polymer_type in the parsing stage.

Here is the detailed flow of how this causes the pipeline to fail:

1. Parsing Stage (parse_mmcif): In the loop processing chains, polymer_type is hardcoded to gemmi.PolymerType.PeptideL. Inside parse_polymer, this forces chain_type to be set to PROTEIN (around line 746).

Result: For 3K5M, the DNA chains (Chain B, C) are incorrectly labeled as PROTEIN.
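One possible direction, sketched here in plain Python, is to derive the chain type from each entity's declared polymer type rather than hardcoding it. The mapping table and the chain_type_from_entity helper are hypothetical names for illustration and are not part of the repository. The keys are the mmCIF _entity_poly.type strings listed in the evidence from 3K5M.cif in this issue.

```python
# Hypothetical mapping from mmCIF _entity_poly.type strings to chain-type labels.
# The label names mirror the "PROTEIN" key used in const.chain_type_ids; the
# "DNA" and "RNA" labels are assumptions for this sketch.
ENTITY_POLY_TYPE_TO_CHAIN_TYPE = {
    "polypeptide(L)": "PROTEIN",
    "polydeoxyribonucleotide": "DNA",
    "polyribonucleotide": "RNA",
}


def chain_type_from_entity(entity_poly_type: str) -> str:
    """Return a chain-type label for a declared polymer type (hypothetical helper)."""
    try:
        return ENTITY_POLY_TYPE_TO_CHAIN_TYPE[entity_poly_type]
    except KeyError:
        # Fail loudly instead of silently falling back to PROTEIN.
        raise ValueError(f"Unsupported polymer type: {entity_poly_type!r}")


# Entities of 3K5M as reported in this issue (chain -> declared polymer type).
entities_3k5m = {
    "A": "polypeptide(L)",
    "B": "polydeoxyribonucleotide",
    "C": "polydeoxyribonucleotide",
}

labels = {chain: chain_type_from_entity(t) for chain, t in entities_3k5m.items()}
print(labels)
```

With a lookup like this, chains B and C would be labeled DNA instead of PROTEIN, and an unknown polymer type would raise an error rather than being mislabeled.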

2. Tokenization Stage (boltz_protein.py): The code attempts to filter out non-protein chains:

if chain["mol_type"] != const.chain_type_ids["PROTEIN"]:
    continue  # Skip non-protein chains

Result: Since the DNA chains were mislabeled as PROTEIN in step 1, they are not skipped. The DNA residues (DA, DC, DG, DT) are then tokenized.
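To make the effect of the filter concrete, here is a minimal standalone sketch. CHAIN_TYPE_IDS is a hypothetical stand-in for const.chain_type_ids (the numeric ids are assumed), and the chain dictionaries are simplified illustrations, not the real data structures.

```python
# Hypothetical stand-in for const.chain_type_ids; the numeric ids are assumed.
CHAIN_TYPE_IDS = {"PROTEIN": 0, "DNA": 1, "RNA": 2}


def protein_chain_names(chains):
    """Apply the same filter as in the issue and return the chains that survive."""
    kept = []
    for chain in chains:
        if chain["mol_type"] != CHAIN_TYPE_IDS["PROTEIN"]:
            continue  # Skip non-protein chains
        kept.append(chain["name"])
    return kept


# What the parser currently produces for 3K5M: every chain labeled PROTEIN.
mislabeled = [
    {"name": "A", "mol_type": CHAIN_TYPE_IDS["PROTEIN"]},
    {"name": "B", "mol_type": CHAIN_TYPE_IDS["PROTEIN"]},
    {"name": "C", "mol_type": CHAIN_TYPE_IDS["PROTEIN"]},
]

# What the labels would look like if the DNA chains were typed correctly.
correct = [
    {"name": "A", "mol_type": CHAIN_TYPE_IDS["PROTEIN"]},
    {"name": "B", "mol_type": CHAIN_TYPE_IDS["DNA"]},
    {"name": "C", "mol_type": CHAIN_TYPE_IDS["DNA"]},
]

print(protein_chain_names(mislabeled))
print(protein_chain_names(correct))
```

The filter itself behaves as written; it only lets the DNA chains through because the labels it receives are already wrong.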

3. Training/Extraction Stage (extract_sequence_from_tokens): The code tries to map residues to standard amino acids. Since DNA residues (DA, DC...) are not in restype_3to1 (which only contains the 20 standard amino acids), they are identified as non-protein residues and filtered out.

Final Error: The pipeline then reports "No protein sequence in 3k5m. Skipping.", which appears to be a downstream consequence of the mislabeled chains.
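The filtering in step 3 can be illustrated with a small sketch. RESTYPE_3TO1 below is a local copy of the 20 standard amino acid codes, and extract_sequence is a simplified, hypothetical stand-in for extract_sequence_from_tokens, whose real signature and token format are not shown in this issue.

```python
# Local table of the 20 standard amino acids (three-letter -> one-letter code).
RESTYPE_3TO1 = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def extract_sequence(residue_names):
    """Keep only residues found in RESTYPE_3TO1 (hypothetical simplification)."""
    return "".join(
        RESTYPE_3TO1[name] for name in residue_names if name in RESTYPE_3TO1
    )


# A DNA chain mislabeled as PROTEIN yields an empty sequence,
# because none of its residue names are in the amino acid table.
dna_sequence = extract_sequence(["DA", "DC", "DG", "DT"])

# Illustrative mixed input: amino acids are kept, the DNA residue is dropped.
mixed_sequence = extract_sequence(["MET", "ALA", "DA"])

print(repr(dna_sequence), repr(mixed_sequence))
```

Any chain that reaches this step labeled PROTEIN but made up only of DNA residues therefore contributes an empty sequence.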

Evidence from 3K5M.cif:

Entity 1: polypeptide(L) -> Chain A (Protein)

Entity 2: polydeoxyribonucleotide -> Chain B (DNA)

Entity 3: polydeoxyribonucleotide -> Chain C (DNA)

Is this intended behavior? I am not entirely sure if this hardcoding is a deliberate design choice (i.e., the current model is strictly intended for pure proteins only, and I shouldn't be feeding it complexes), or if this is an oversight in the parsing logic.

Thanks!
