Hi team,
I encountered an issue when processing protein-DNA complexes (e.g., 3K5M) and traced the root cause to the hardcoded polymer_type in the parsing stage.
Here is the detailed flow of how this causes the pipeline to fail:
- Parsing Stage (parse_mmcif): In the loop processing chains, polymer_type is hardcoded to gemmi.PolymerType.PeptideL. Inside parse_polymer, this forces chain_type to be set to PROTEIN (around line 746).
Result: For 3K5M, the DNA chains (Chain B, C) are incorrectly labeled as PROTEIN.
- Tokenization Stage (boltz_protein.py): The code attempts to filter out non-protein chains:
    if chain["mol_type"] != const.chain_type_ids["PROTEIN"]:
        continue  # Skip non-protein chains
Result: Since the DNA chains were mislabeled as PROTEIN in step 1, they are not skipped. The DNA residues (DA, DC, DG, DT) are then tokenized.
- Training/Extraction Stage (extract_sequence_from_tokens): The code tries to map residues to standard amino acids. Since DNA residues (DA, DC...) are not in restype_3to1 (which only contains the 20 standard amino acids), they are identified as non-protein residues and filtered out.
Final Error: The pipeline reports "No protein sequence in 3k5m. Skipping.", apparently because the mislabeled DNA chains reach the sequence extraction step and yield no amino acid residues.
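The flow above can be reproduced in isolation with a minimal sketch. Everything here is a simplified stand-in, not the repository's code: `CHAIN_TYPE_IDS`, `RESTYPE_3TO1` (truncated to three entries), and the helper functions are hypothetical names chosen to mirror `const.chain_type_ids` and `restype_3to1`.

```python
# Hypothetical stand-ins for const.chain_type_ids and restype_3to1.
CHAIN_TYPE_IDS = {"PROTEIN": 0, "DNA": 1, "RNA": 2}
RESTYPE_3TO1 = {"ALA": "A", "GLY": "G", "LYS": "K"}  # truncated for brevity


def keep_protein_chains(chains):
    """Mirror of the tokenization filter: drop chains not labeled PROTEIN."""
    return [c for c in chains if c["mol_type"] == CHAIN_TYPE_IDS["PROTEIN"]]


def extract_sequence(chain):
    """Keep only residues that map to a one-letter amino acid code."""
    return "".join(
        RESTYPE_3TO1[res] for res in chain["residues"] if res in RESTYPE_3TO1
    )


def make_chains(dna_label):
    """Build a toy 3K5M-like complex; dna_label is the label given to DNA chains."""
    dna = ["DA", "DC", "DG", "DT"]
    return [
        {"name": "A", "mol_type": CHAIN_TYPE_IDS["PROTEIN"],
         "residues": ["ALA", "GLY", "LYS"]},
        {"name": "B", "mol_type": CHAIN_TYPE_IDS[dna_label], "residues": dna},
        {"name": "C", "mol_type": CHAIN_TYPE_IDS[dna_label], "residues": dna},
    ]


# DNA chains mislabeled as PROTEIN (hardcoded polymer type) vs. labeled as DNA.
mislabeled = keep_protein_chains(make_chains("PROTEIN"))
correct = keep_protein_chains(make_chains("DNA"))

mislabeled_names = [c["name"] for c in mislabeled]
correct_names = [c["name"] for c in correct]

# Chains that pass the filter but produce an empty amino acid sequence.
empty_sequence_chains = [c["name"] for c in mislabeled if not extract_sequence(c)]
```

With the mislabeled input, chains B and C pass the protein filter but contribute no amino acids downstream; with correct labels, only chain A reaches extraction.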
Evidence from 3K5M.cif:
Entity 1: polypeptide(L) -> Chain A (Protein)
Entity 2: polydeoxyribonucleotide -> Chain B (DNA)
Entity 3: polydeoxyribonucleotide -> Chain C (DNA)
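If the hardcoding turns out to be an oversight, one possible direction is to derive the chain type from each entity's polymer type instead of assuming PeptideL. Below is a minimal sketch that works on the `_entity_poly.type` strings listed above rather than on gemmi objects. The function name and the returned labels are hypothetical and not taken from the repository, and types outside this small table are deliberately left unmapped.

```python
# Hypothetical mapping from mmCIF _entity_poly.type strings to chain labels.
ENTITY_POLY_TO_CHAIN_TYPE = {
    "polypeptide(L)": "PROTEIN",
    "polypeptide(D)": "PROTEIN",
    "polydeoxyribonucleotide": "DNA",
    "polyribonucleotide": "RNA",
}


def chain_type_from_entity_poly(poly_type):
    """Return a chain label for an _entity_poly.type value, or None if unmapped."""
    # mmCIF values may be quoted or padded; normalize before the lookup.
    key = poly_type.strip().strip("'\"")
    return ENTITY_POLY_TO_CHAIN_TYPE.get(key)


# Entity types as listed for 3K5M above.
entities_3k5m = {
    "A": "polypeptide(L)",
    "B": "polydeoxyribonucleotide",
    "C": "polydeoxyribonucleotide",
}
chain_types = {
    chain: chain_type_from_entity_poly(poly_type)
    for chain, poly_type in entities_3k5m.items()
}
```

Returning None for unmapped types (hybrids, "other") leaves the decision to skip or reject such chains to the caller instead of silently treating them as protein.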
Is this intended behavior? I am not entirely sure if this hardcoding is a deliberate design choice (i.e., the current model is strictly intended for pure proteins only, and I shouldn't be feeding it complexes), or if this is an oversight in the parsing logic.
Thanks!