Hardcoded PeptideL causes pipeline failure on Protein-DNA complexes (e.g. 3K5M) #43

@xinyuren-bio

Hi team,

I encountered an issue when processing Protein-DNA complexes (e.g., 3K5M) and traced the root cause to the hardcoded polymer_type in the parsing stage.

Here is the detailed flow of how this causes the pipeline to fail:

  1. Parsing Stage (parse_mmcif): In the loop processing chains, polymer_type is hardcoded to gemmi.PolymerType.PeptideL. Inside parse_polymer, this forces chain_type to be set to PROTEIN (around line 746).

Result: For 3K5M, the DNA chains (Chain B, C) are incorrectly labeled as PROTEIN.

  2. Tokenization Stage (boltz_protein.py): The code attempts to filter out non-protein chains:

    if chain["mol_type"] != const.chain_type_ids["PROTEIN"]:
        continue  # Skip non-protein chains

Result: Since the DNA chains were mislabeled as PROTEIN in step 1, they are not skipped. The DNA residues (DA, DC, DG, DT) are then tokenized.

  3. Training/Extraction Stage (extract_sequence_from_tokens): The code tries to map residues to standard amino acids. Since DNA residues (DA, DC, ...) are not in restype_3to1 (which only contains the 20 standard amino acids), they are identified as non-protein residues and filtered out.

Final Error: The pipeline then reports "No protein sequence in 3k5m. Skipping.", apparently because the mislabeled DNA chains are processed as if they were protein chains.
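
To make the tokenization and extraction stages described above easier to reproduce, here is a minimal, self-contained sketch of how I understand them. It is not the repository's code: the numeric ids in chain_type_ids, the protein_sequences helper, and the four-residue DNA chain are hypothetical stand-ins. Only the filter condition and the idea of a restype_3to1 lookup come from the code discussed above.

```python
# Hypothetical stand-ins for the project's constants; the real ids may differ.
chain_type_ids = {"PROTEIN": 0, "DNA": 1, "RNA": 2}

# Three-letter to one-letter codes for the 20 standard amino acids.
restype_3to1 = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def protein_sequences(chains):
    """Return {chain name: one-letter sequence} for chains labeled PROTEIN."""
    sequences = {}
    for chain in chains:
        if chain["mol_type"] != chain_type_ids["PROTEIN"]:
            continue  # Skip non-protein chains
        # Residues missing from restype_3to1 (e.g. DA, DC, DG, DT) are dropped.
        sequences[chain["name"]] = "".join(
            restype_3to1[res] for res in chain["residues"] if res in restype_3to1
        )
    return sequences


# Illustrative residues only, not the real 3K5M sequence.
dna_residues = ["DA", "DC", "DG", "DT"]

# A DNA chain mislabeled as PROTEIN passes the filter but yields an empty sequence.
mislabeled = [{"name": "B", "mol_type": chain_type_ids["PROTEIN"], "residues": dna_residues}]
# The same chain with a DNA label is skipped before extraction.
relabeled = [{"name": "B", "mol_type": chain_type_ids["DNA"], "residues": dna_residues}]

print(protein_sequences(mislabeled))  # {'B': ''}
print(protein_sequences(relabeled))   # {}
```

With the wrong label the chain reaches extraction and produces an empty sequence; with a DNA label it never gets that far.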

Evidence from 3K5M.cif:

Entity 1: polypeptide(L) -> Chain A (Protein)

Entity 2: polydeoxyribonucleotide -> Chain B (DNA)

Entity 3: polydeoxyribonucleotide -> Chain C (DNA)
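
If the hardcoding turns out to be an oversight, one possible direction is to derive the chain type from the entity's polymer type instead. The sketch below only illustrates that idea on the entity types listed above. The chain_type_for helper and the "DNA" and "RNA" label names are assumptions on my side, and the real fix would presumably read the polymer type from the parsed structure rather than from a hand-written dict.

```python
# mmCIF _entity_poly.type values mapped to chain-type labels.
# "PROTEIN" appears in the code discussed above; "DNA" and "RNA" are assumed names.
ENTITY_POLY_TO_CHAIN_TYPE = {
    "polypeptide(L)": "PROTEIN",
    "polydeoxyribonucleotide": "DNA",
    "polyribonucleotide": "RNA",
}


def chain_type_for(entity_poly_type):
    """Hypothetical helper: look up the chain type, failing loudly if unknown."""
    try:
        return ENTITY_POLY_TO_CHAIN_TYPE[entity_poly_type]
    except KeyError:
        raise ValueError(f"unsupported polymer type: {entity_poly_type!r}") from None


# Entity types and chain assignments as listed for 3K5M above.
entities_3k5m = {
    "A": "polypeptide(L)",
    "B": "polydeoxyribonucleotide",
    "C": "polydeoxyribonucleotide",
}

labels = {chain: chain_type_for(poly) for chain, poly in entities_3k5m.items()}
print(labels)  # {'A': 'PROTEIN', 'B': 'DNA', 'C': 'DNA'}
```

Raising on unknown types, rather than silently defaulting to PROTEIN, would surface unsupported inputs at parse time instead of as a missing-sequence message later.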

Is this intended behavior? I am not entirely sure if this hardcoding is a deliberate design choice (i.e., the current model is strictly intended for pure proteins only, and I shouldn't be feeding it complexes), or if this is an oversight in the parsing logic.

Thanks!

Labels: question (Further information is requested)
