Skip to content

404 Extend CL validation and imports#423

Open
jorisfu wants to merge 28 commits into
crosslinkingfrom
404-cl-validation-with-confidence
Open

404 Extend CL validation and imports#423
jorisfu wants to merge 28 commits into
crosslinkingfrom
404-cl-validation-with-confidence

Conversation

@jorisfu
Copy link
Copy Markdown
Collaborator

@jorisfu jorisfu commented May 15, 2026

Description

fixes #404

Adds new options for CL validation against predicted models. The crosslinking validation steps now additionally support:

Max PAE

Uses the maxmimum PAE between the two CL binding sites $x$ and $y$ (note that AlphaFold PAE is asymmetrical) as a tolerance value: $t = \max ( \text{PAE}[x, y], \text{PAE}[y, x] )$. An indentified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is considered valid iff
$$\max(l_{\text{CL}} - t, 0) \le d \le l_{\text{CL}} + t$$

Min PAE

Same as Max PAE but with $t = \min ( \text{PAE}[x, y], \text{PAE}[y, x] )$.

pLDDT adjusted

This uses the local error of the model prediction as a tolerance basis. Since pLDDT is between 0 and 100, we use the error factor $p_x = 1 - (\text{pLDDT}[x] / 100)$ for binding site $x$ and $p_y$ for binding site $y$ respectively.
Since we want a pLDDT of 100 to allow for no tolerance and a pLDDT of 0 for maximum tolerance and the $p_x$ and $p_y$ are different we calculate two different tolerance ranges for each half of the CL respectively.
Let $l_{\text{CL}}$ be the length of the crosslinker. We define the maximum tolerance for each half crosslinker as $t_{\text{max}} = l_{\text{CL}}$ (Intuitively, this means that half a CL can shrink/extend to the size of an entire CL if the prediction is at its lowest possible confidence.) From this, we define the tolerances $t_x = p_x \cdot t_{\text{max}}$ and $t_y = p_y \cdot t_{\text{max}}$. An indentified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is then considered valid iff
$$\max(l_{\text{CL}} - t_x - t_y , 0) \le d \le l_{\text{CL}} + t_x + t_y$$

Changes

  • Changed input/output for PAE for AF/XL steps to numpy matrix
  • Added plddt_df output for multimer import (read from CIF, see comment in code)
  • Added pae_matrix output for multimer import (read from full_data)
  • Added validation types to forms, option types and the validation methods
  • Added tests for the new validation methods
  • Applied change from @tE3m on how we handle cif imports

Changes TODO

  • Monomer import: Change PAE to numpy matrix
  • Multimer import: Add PAE/pLDDT and shorten full_data
  • Fix current plot method failing
  • Adjust and extend Monomer tests
  • Adjust and extend Multimer tests

Testing

  • Sanity check the formulas
  • Create a standard CL workflow and validate some structures. Test out the different validation options and observe the plots and tables. Ideally check if the tables make sense and only reasonable CLs are labeled as valid.
  • Note that the plots are currently misleading for the other methods, that'll be another PR

On another note, we depend on the order of the pae_matrix equaling the order of CA atoms for each amino acid in the CIF. Adding atoms labelled CA to the CIF will break the PAE based validation right now, so it'd be great @tE3m if you could add another filter to the PAE matching so that we filter out CA's from PTMs here

PR checklist

Development

  • If necessary, I have updated the documentation (README, docstrings, etc.)
  • If necessary, I have created / updated tests.

Mergeability

  • crosslinking-branch has been merged into local branch to resolve conflicts
  • The tests and linter have passed AFTER local merge
  • The backend code has been formatted with black
  • The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

  • I have self-reviewed my code.
  • At least one other developer reviewed and approved the changes

@jorisfu jorisfu changed the title 404 cl validation with confidence 404 Extend CL validation and imports May 15, 2026
@jorisfu jorisfu marked this pull request as ready for review May 21, 2026 15:23
@jorisfu jorisfu self-assigned this May 21, 2026
Copy link
Copy Markdown
Collaborator

@3dot141592 3dot141592 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything except this one thing looks good to me. The formulas look correct. Thank you for this mathematical PR :)

Comment on lines +489 to +494
index_lookup_df = (
cif_df[["_atom_site.label_asym_id", "_atom_site.label_seq_id"]]
.drop_duplicates()
.reset_index(drop=True)
)
index_lookup_df.reset_index(inplace=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think filtering PTM CA rows from the CIF is enough here. In the P28482 PTM example, the CIF has 360 CA residue rows, but PAE in full_data is 396x396 because some PTM residues show up multiple times in token_res_ids. So the indices after a PTM get shifted...?

Please let me know if I misunderstood something :/

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that you are right, which is sad for me cause I'll have to adjust the imports again to map tokens to AAs. Alphafold3 handles PAE per token pair, not AA pair. Somehow I missed this during my work on this PR

https://www.ebi.ac.uk/training/online/courses/alphafold/alphafold-3-and-alphafold-server/how-to-assess-the-quality-of-alphafold-3-predictions/

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@3dot141592 please re-review the changes with the new translation function that reduces the size of the PAE matrix so that this should work again

Copy link
Copy Markdown
Collaborator

@NeleRiediger NeleRiediger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code generally seems fine. I made a few comments, but nothing too serious.
There are things I noticed while testing regarding the UI:

  • I find it a little confusing, that the fields to set the manual bounds aren't only visible if this option is chosen in the dropdown.
  • I was surprised, that I need to connect both the pae and the plddt regardless of the validation type I choose. Though if we decide that it is neater to require both these connections for any validation, those connections should already be done in the workflow in my opinion.

But again the thing itself seems to work fine.

CIF_DF = "cif_df"
AMINO_ACID_SEQUENCES_DF = "amino_acid_sequences_df"
PAE_DF = "pae_df" # pae = predicted aligned error
PAE_MATRIX = "pae_matrix" # pae = predicted aligned error
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me suspicious, since we just had a discussion where it was important that things stayed a dataframe. Not that I'm against changing this, it just feels like something that needs to be discussed at least briefly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss this tomorrow surely, but using the PAE as it was previously is impossible (it's not really a DF, rather just a string packed into one row of a df) and casting it to a reasonable df would lead to worse performance in all aspects

Comment thread backend/protzilla/data_analysis/crosslinking_validation.py Outdated
Comment thread backend/protzilla/data_analysis/crosslinking_validation.py Outdated
@jorisfu
Copy link
Copy Markdown
Collaborator Author

jorisfu commented May 26, 2026

I find it a little confusing, that the fields to set the manual bounds aren't only visible if this option is chosen in the dropdown.

Agree, will adjust this to avoid further confusion

I was surprised, that I need to connect both the pae and the plddt regardless of the validation type I choose. Though if we decide that it is neater to require both these connections for any validation, those connections should already be done in the workflow in my opinion.

Agree as well, I might fix this but I believe I had some kind of reason to keep them required, will have to double-check this one

@jorisfu jorisfu requested a review from 3dot141592 May 27, 2026 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants