404 Extend CL validation and imports by jorisfu · Pull Request #423 · cschlaffner/PROTzilla

jorisfu · 2026-05-15T06:51:06Z

Description

fixes #404

Adds new options for CL validation against predicted models. The crosslinking validation steps now additionally support:

Max PAE

Uses the maximum PAE between the two CL binding sites $x$ and $y$ (note that AlphaFold PAE is asymmetric) as a tolerance value: $t = \max ( \text{PAE}[x, y], \text{PAE}[y, x] )$. An identified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is considered valid iff
$$\max(l_{\text{CL}} - t, 0) \le d \le l_{\text{CL}} + t$$

Min PAE

Same as Max PAE but with $t = \min ( \text{PAE}[x, y], \text{PAE}[y, x] )$.
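
The Max PAE and Min PAE checks can be sketched as follows. This is a minimal illustration and not the code from this PR: the function names `pae_tolerance` and `is_valid_crosslink` and the toy 2x2 matrix are assumptions made for the example; only the formulas are taken from the description.

```python
import numpy as np


def pae_tolerance(pae, x, y, mode="max"):
    """Tolerance t derived from the two directed PAE entries of sites x and y.

    `pae` is a square matrix; PAE[x, y] and PAE[y, x] may differ.
    """
    pair = (pae[x, y], pae[y, x])
    if mode == "max":
        return float(max(pair))
    if mode == "min":
        return float(min(pair))
    raise ValueError(f"unknown mode: {mode!r}")


def is_valid_crosslink(distance, cl_length, tolerance):
    """Check max(l_CL - t, 0) <= d <= l_CL + t."""
    lower = max(cl_length - tolerance, 0.0)
    upper = cl_length + tolerance
    return lower <= distance <= upper


# Toy example (hypothetical values): an asymmetric PAE between two sites.
pae = np.array([[0.0, 4.0],
                [9.0, 0.0]])
t_max = pae_tolerance(pae, 0, 1, mode="max")  # 9.0
t_min = pae_tolerance(pae, 0, 1, mode="min")  # 4.0

# A 30 Å crosslinker with a predicted distance of 35 Å:
valid_with_max = is_valid_crosslink(35.0, 30.0, t_max)  # range [21, 39]
valid_with_min = is_valid_crosslink(35.0, 30.0, t_min)  # range [26, 34]
```

With these toy numbers the same crosslink passes under Max PAE and fails under Min PAE, which shows that Min PAE is the stricter of the two options.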

pLDDT adjusted

This uses the local error of the model prediction as a tolerance basis. Since pLDDT is between 0 and 100, we use the error factor $p_x = 1 - (\text{pLDDT}[x] / 100)$ for binding site $x$ and $p_y$ for binding site $y$ respectively.
Since we want a pLDDT of 100 to allow for no tolerance and a pLDDT of 0 to allow for maximum tolerance, and since $p_x$ and $p_y$ generally differ, we calculate a separate tolerance for each half of the CL.
Let $l_{\text{CL}}$ be the length of the crosslinker. We define the maximum tolerance for each half crosslinker as $t_{\text{max}} = l_{\text{CL}}$. (Intuitively, this means that half a CL can shrink/extend to the size of an entire CL if the prediction is at its lowest possible confidence.) From this, we define the tolerances $t_x = p_x \cdot t_{\text{max}}$ and $t_y = p_y \cdot t_{\text{max}}$. An identified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is then considered valid iff
$$\max(l_{\text{CL}} - t_x - t_y , 0) \le d \le l_{\text{CL}} + t_x + t_y$$
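
A minimal sketch of the pLDDT adjusted check, written directly from the formulas in this section. The function names `plddt_tolerances` and `is_valid_plddt_adjusted` are hypothetical and do not refer to the implementation in this PR.

```python
def plddt_tolerances(plddt_x, plddt_y, cl_length):
    """Return (t_x, t_y) with t = (1 - pLDDT / 100) * t_max and t_max = l_CL."""
    t_max = cl_length
    p_x = 1.0 - plddt_x / 100.0
    p_y = 1.0 - plddt_y / 100.0
    return p_x * t_max, p_y * t_max


def is_valid_plddt_adjusted(distance, cl_length, plddt_x, plddt_y):
    """Check max(l_CL - t_x - t_y, 0) <= d <= l_CL + t_x + t_y."""
    t_x, t_y = plddt_tolerances(plddt_x, plddt_y, cl_length)
    lower = max(cl_length - t_x - t_y, 0.0)
    upper = cl_length + t_x + t_y
    return lower <= distance <= upper


# Hypothetical example: 30 Å crosslinker, pLDDT of 90 and 80 at the two sites.
# t_x is about 3 Å and t_y about 6 Å, so the accepted range is roughly [21, 39].
t_x, t_y = plddt_tolerances(90.0, 80.0, 30.0)
inside = is_valid_plddt_adjusted(38.0, 30.0, 90.0, 80.0)
outside = is_valid_plddt_adjusted(40.0, 30.0, 90.0, 80.0)
```

At a pLDDT of 100 for both sites the tolerance collapses to zero and only $d = l_{\text{CL}}$ is accepted; at a pLDDT of 0 for both sites the accepted range becomes $[0, 3 \cdot l_{\text{CL}}]$.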

Changes

Changed input/output for PAE for AF/XL steps to numpy matrix
Added plddt_df output for multimer import (read from CIF, see comment in code)
Added pae_matrix output for multimer import (read from full_data)
Added validation types to forms, option types and the validation methods
Added tests for the new validation methods
Applied change from @tE3m on how we handle cif imports

Changes TODO

Monomer import: Change PAE to numpy matrix
Multimer import: Add PAE/pLDDT and shorten full_data
Fix current plot method failing
Adjust and extend Monomer tests
Adjust and extend Multimer tests

Testing

Sanity check the formulas
Create a standard CL workflow and validate some structures. Test out the different validation options and observe the plots and tables. Ideally check if the tables make sense and only reasonable CLs are labeled as valid.
Note that the plots are currently misleading for the other methods; that'll be another PR.

On another note, we depend on the order of the pae_matrix equaling the order of CA atoms for each amino acid in the CIF. Adding atoms labelled CA to the CIF will break the PAE-based validation right now, so it'd be great @tE3m if you could add another filter to the PAE matching so that we filter out CAs from PTMs here.
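
One way to catch a violation of this ordering assumption early is a shape check before validation. The sketch below is an assumption, not part of this PR: the helper name `check_pae_matches_residues` is hypothetical, and it only relies on the two `_atom_site` columns that the import code already uses for its index lookup.

```python
import numpy as np
import pandas as pd


def check_pae_matches_residues(cif_df, pae_matrix):
    """Raise if the PAE matrix is not N x N for the N residues in the CIF.

    A residue is identified by (label_asym_id, label_seq_id).
    """
    residues = cif_df[
        ["_atom_site.label_asym_id", "_atom_site.label_seq_id"]
    ].drop_duplicates()
    n = len(residues)
    if pae_matrix.shape != (n, n):
        raise ValueError(
            f"PAE matrix has shape {pae_matrix.shape}, expected ({n}, {n})"
        )
    return n


# Hypothetical toy data: three atom rows that belong to two residues.
toy_cif_df = pd.DataFrame(
    {
        "_atom_site.label_asym_id": ["A", "A", "A"],
        "_atom_site.label_seq_id": [1, 1, 2],
    }
)
n_residues = check_pae_matches_residues(toy_cif_df, np.zeros((2, 2)))
```

A matching shape does not prove that the order is correct, but a mismatch reliably signals that PAE indices and CIF residues no longer line up.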

PR checklist

Development

If necessary, I have updated the documentation (README, docstrings, etc.)
If necessary, I have created / updated tests.

Mergeability

crosslinking-branch has been merged into local branch to resolve conflicts
The tests and linter have passed AFTER local merge
The backend code has been formatted with black
The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

I have self-reviewed my code.
At least one other developer reviewed and approved the changes

3dot141592

Everything except this one thing looks good to me. The formulas look correct. Thank you for this mathematical PR :)

3dot141592 · 2026-05-26T06:47:40Z

+    index_lookup_df = (
+        cif_df[["_atom_site.label_asym_id", "_atom_site.label_seq_id"]]
+        .drop_duplicates()
+        .reset_index(drop=True)
+    )
+    index_lookup_df.reset_index(inplace=True)


I don't think filtering PTM CA rows from the CIF is enough here. In the P28482 PTM example, the CIF has 360 CA residue rows, but PAE in full_data is 396x396 because some PTM residues show up multiple times in token_res_ids. So the indices after a PTM get shifted...?

Please let me know if I misunderstood something :/

Seems that you are right, which is sad for me because I'll have to adjust the imports again to map tokens to AAs. AlphaFold 3 handles PAE per token pair, not per AA pair. Somehow I missed this during my work on this PR.

https://www.ebi.ac.uk/training/online/courses/alphafold/alphafold-3-and-alphafold-server/how-to-assess-the-quality-of-alphafold-3-predictions/

@3dot141592 please re-review the changes with the new translation function that reduces the size of the PAE matrix so that this should work again
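
For readers of this thread, the idea of reducing a token-level PAE matrix to one row and column per residue can be sketched as follows. This is not the translation function from this PR: the name `reduce_pae_to_residues`, the use of `token_chain_ids` next to `token_res_ids`, and the choice of the mean as aggregate are all assumptions made for illustration.

```python
import numpy as np


def reduce_pae_to_residues(pae, token_chain_ids, token_res_ids, aggregate=np.mean):
    """Collapse tokens that share (chain, residue id) into a single index.

    Returns the reduced matrix and the residue keys in first-appearance order.
    The aggregate (mean by default) is an assumption, not the PR's choice.
    """
    keys = list(zip(token_chain_ids, token_res_ids))
    residues = list(dict.fromkeys(keys))  # unique keys, order preserved
    index_of = {key: i for i, key in enumerate(residues)}

    # Collect the token indices that belong to each residue.
    groups = [[] for _ in residues]
    for token_idx, key in enumerate(keys):
        groups[index_of[key]].append(token_idx)

    n = len(residues)
    reduced = np.empty((n, n), dtype=float)
    for i, rows in enumerate(groups):
        for j, cols in enumerate(groups):
            # Aggregate the block of token pairs for residue pair (i, j).
            reduced[i, j] = aggregate(pae[np.ix_(rows, cols)])
    return reduced, residues


# Toy example: residue 2 of chain A is split into two tokens (as for a PTM).
token_pae = np.arange(16, dtype=float).reshape(4, 4)
reduced, residues = reduce_pae_to_residues(
    token_pae, ["A", "A", "A", "A"], [1, 2, 2, 3]
)
```

Here the 4x4 token matrix becomes a 3x3 residue matrix, so indices after the multi-token residue are no longer shifted. Whether mean, minimum, or maximum is the right aggregate depends on how conservative the validation should be.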

NeleRiediger

The code generally seems fine. I made a few comments, but nothing too serious.
There are things I noticed while testing regarding the UI:

I find it a little confusing that the fields to set the manual bounds are visible even when this option isn't chosen in the dropdown.
I was surprised that I need to connect both the pae and the plddt regardless of the validation type I choose. Though if we decide that it is neater to require both these connections for any validation, those connections should already be done in the workflow in my opinion.

But again the thing itself seems to work fine.

NeleRiediger · 2026-05-26T13:06:54Z

    CIF_DF = "cif_df"
    AMINO_ACID_SEQUENCES_DF = "amino_acid_sequences_df"
-    PAE_DF = "pae_df"  # pae = predicted aligned error
+    PAE_MATRIX = "pae_matrix"  # pae = predicted aligned error


This makes me suspicious, since we just had a discussion where it was important that things stayed a dataframe. Not that I'm against changing this, it just feels like something that needs to be discussed at least briefly.

We can discuss this tomorrow surely, but using the PAE as it was previously is impossible (it's not really a DF, rather just a string packed into one row of a df) and casting it to a reasonable df would lead to worse performance in all aspects

jorisfu · 2026-05-26T14:24:15Z

I find it a little confusing that the fields to set the manual bounds are visible even when this option isn't chosen in the dropdown.

Agree, will adjust this to avoid further confusion

I was surprised that I need to connect both the pae and the plddt regardless of the validation type I choose. Though if we decide that it is neater to require both these connections for any validation, those connections should already be done in the workflow in my opinion.

Agree as well, I might fix this but I believe I had some kind of reason to keep them required, will have to double-check this one

jorisfu added 5 commits May 11, 2026 11:27

feat: add pLDDT to CL results table

9f8b5b2

feat: add trivial PAE based validation

24eb514

feat: trivial plDDT based validation

1ddc103

fix: broken formula

77498d5

fix monomer validation test

0ab6fa0

jorisfu changed the title ~~404 cl validation with confidence~~ 404 Extend CL validation and imports May 15, 2026

jorisfu and others added 18 commits May 18, 2026 11:58

refactor: expose PAE as matrix for monomers

291bd7d

feat: PAE for multimers

23d5b60

feat: pLDDT for multimers

1e0e30e

feat: add PAE/plDDT consistently to multimer imports

eabd547

feat: proper PAE validation for multimers

bcb054f

fix: adjust existing cl validation tests

58ac815

fix: some alphafold import tests

634e712

tempfix: bridge monomer plots so method doesn't fail

2c66b35

tempfix: bridge multimer plots so method doesn't fail

7ecac2b

merge crosslinking

213e340

chore: remove obsolete todos

f2518c1

chore: adjust some tests

e13e2a3

chore: adjust some tests

0c645e5

feat: introduce parsing of _chem_comp table in cif-files

c1d55a1

chore: fix existing tests

a935874

chore: test for no pLDDT data within cif

d506787

chore: tests for PAE based CL validation

a034f65

chore: tests for pLDDT based CL validation

6ec24c0

jorisfu marked this pull request as ready for review May 21, 2026 15:23

jorisfu requested review from 3dot141592 and NeleRiediger May 21, 2026 15:24

jorisfu self-assigned this May 21, 2026

jorisfu added the new feature label May 21, 2026

jorisfu mentioned this pull request May 22, 2026

429 Plots for PAE/pLDDT based CL validation #435

Draft

8 tasks

3dot141592 reviewed May 26, 2026

NeleRiediger reviewed May 26, 2026

feat: AF3 to AF2 PAE matrix translation

ab538c7

jorisfu added 4 commits May 26, 2026 17:10

(AI) tests: PAE matrix reduction

e42066f

chore: remove unused imports

036de6b

feat: only make bounds fields visible if manual bounds is selected mode

e53e0cb

chore: black

f57bfc8

jorisfu requested a review from 3dot141592 May 27, 2026 07:30

jorisfu wants to merge 28 commits into crosslinking from 404-cl-validation-with-confidence

4 participants