404 Extend CL validation and imports #423
3dot141592 left a comment
Everything except this one thing looks good to me. The formulas look correct. Thank you for this mathematical PR :)
index_lookup_df = (
    cif_df[["_atom_site.label_asym_id", "_atom_site.label_seq_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
index_lookup_df.reset_index(inplace=True)
I don't think filtering PTM CA rows from the CIF is enough here. In the P28482 PTM example, the CIF has 360 CA residue rows, but the PAE in full_data is 396x396 because some PTM residues show up multiple times in token_res_ids. So the indices after a PTM get shifted...?
Please let me know if I misunderstood something :/
Seems that you are right, which is sad for me because I'll have to adjust the imports again to map tokens to AAs. AlphaFold 3 handles PAE per token pair, not per AA pair. Somehow I missed this during my work on this PR.
@3dot141592 please re-review the changes: the new translation function reduces the size of the PAE matrix, so this should work again.
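For readers following the thread, the token-to-residue reduction discussed here could look roughly like the sketch below. This is not the translation function from the PR. The function name `collapse_pae_to_residues`, the use of a `token_chain_ids` list alongside `token_res_ids`, and especially the choice to average over all tokens of a residue are assumptions made for illustration; taking the maximum, or a single representative token, would be equally plausible.

```python
import numpy as np


def collapse_pae_to_residues(pae, token_chain_ids, token_res_ids, reduce=np.mean):
    """Collapse a token-level PAE matrix to one row/column per residue.

    Hypothetical helper: tokens that share (chain id, residue id) are treated
    as one residue, and each residue-pair block is aggregated with `reduce`
    (mean by default, which is an assumption, not the PR's behaviour).
    """
    pae = np.asarray(pae, dtype=float)
    keys = list(zip(token_chain_ids, token_res_ids))

    # Assign each distinct residue an index in order of first appearance.
    residue_index = {}
    for key in keys:
        residue_index.setdefault(key, len(residue_index))

    # Collect the token indices that belong to each residue.
    groups = [[] for _ in residue_index]
    for token_idx, key in enumerate(keys):
        groups[residue_index[key]].append(token_idx)

    n = len(groups)
    reduced = np.empty((n, n), dtype=float)
    for i, rows in enumerate(groups):
        for j, cols in enumerate(groups):
            # Aggregate the block of token pairs for residue pair (i, j).
            reduced[i, j] = reduce(pae[np.ix_(rows, cols)])
    return reduced, list(residue_index)


# Toy example: residue 2 of chain A is a PTM that occupies two tokens.
toy_pae = [[0, 2, 4],
           [6, 1, 3],
           [8, 5, 7]]
reduced, residues = collapse_pae_to_residues(toy_pae, ["A", "A", "A"], [1, 2, 2])
```

With this kind of reduction, the residue-level matrix has one row per (chain, residue) pair, so its indices would line up with a CIF-derived lookup as long as both use the same residue order.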
The code generally seems fine. I made a few comments, but nothing too serious.
There are things I noticed while testing regarding the UI:
- I find it a little confusing that the fields for setting the manual bounds are visible even when that option isn't chosen in the dropdown.
- I was surprised that I need to connect both the PAE and the pLDDT regardless of the validation type I choose. If we decide that it is neater to require both connections for any validation, they should already be set up in the workflow, in my opinion.
But again the thing itself seems to work fine.
CIF_DF = "cif_df"
AMINO_ACID_SEQUENCES_DF = "amino_acid_sequences_df"
PAE_DF = "pae_df"  # pae = predicted aligned error
PAE_MATRIX = "pae_matrix"  # pae = predicted aligned error
This makes me suspicious, since we just had a discussion where it was important that things stayed a dataframe. Not that I'm against changing this, it just feels like something that needs to be discussed at least briefly.
We can surely discuss this tomorrow, but using the PAE as it was previously is impossible (it's not really a DF, just a string packed into one row of a DF), and casting it to a reasonable DF would lead to worse performance in all aspects.
Agree, will adjust this to avoid further confusion
Agree as well. I might fix this, but I believe I had some kind of reason to keep them required, so I will have to double-check this one.
Description
fixes #404
Adds new options for CL validation against predicted models. The crosslinking validation steps now additionally support:
Max PAE
Uses the maximum PAE between the two CL binding sites $x$ and $y$ (note that AlphaFold PAE is asymmetric) as a tolerance value: $t = \max(\text{PAE}[x, y], \text{PAE}[y, x])$. An identified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is considered valid iff
$$\max(l_{\text{CL}} - t, 0) \le d \le l_{\text{CL}} + t$$
Min PAE
Same as Max PAE but with $t = \min(\text{PAE}[x, y], \text{PAE}[y, x])$.
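As a rough illustration of the two PAE-based checks above, here is a minimal sketch. The function names (`pae_tolerance`, `is_valid_crosslink`) and the toy matrix are made up for this example and are not taken from the PR's code. It assumes the PAE is available as a square NumPy array indexed by binding-site position.

```python
import numpy as np


def pae_tolerance(pae, x, y, mode="max"):
    """Tolerance t from the two (asymmetric) PAE entries for sites x and y."""
    a, b = float(pae[x, y]), float(pae[y, x])
    if mode == "max":
        return max(a, b)
    if mode == "min":
        return min(a, b)
    raise ValueError(f"unknown mode: {mode!r}")


def is_valid_crosslink(distance, cl_length, tolerance):
    """Check max(l_CL - t, 0) <= d <= l_CL + t."""
    lower = max(cl_length - tolerance, 0.0)
    upper = cl_length + tolerance
    return lower <= distance <= upper


# Toy asymmetric PAE: PAE[0, 1] = 4, PAE[1, 0] = 9.
toy_pae = np.array([[0.0, 4.0],
                    [9.0, 0.0]])
t_max = pae_tolerance(toy_pae, 0, 1, mode="max")
t_min = pae_tolerance(toy_pae, 0, 1, mode="min")

# A 30 Å crosslinker with a predicted distance of 35 Å.
valid_with_max = is_valid_crosslink(35.0, 30.0, t_max)
valid_with_min = is_valid_crosslink(35.0, 30.0, t_min)
```

In this toy case the Max PAE variant accepts the crosslink (allowed range 21 to 39) while the Min PAE variant rejects it (allowed range 26 to 34), which shows how much the choice of variant can matter when the PAE is strongly asymmetric.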
pLDDT adjusted
This uses the local error of the model prediction as a tolerance basis. Since we want a pLDDT of 100 to allow for no tolerance and a pLDDT of 0 to allow for maximum tolerance, and since pLDDT is between 0 and 100, we use the error factor $p_x = 1 - (\text{pLDDT}[x] / 100)$ for binding site $x$ and $p_y$ for binding site $y$. Since $p_x$ and $p_y$ can differ, we calculate a separate tolerance for each half of the CL. Let $l_{\text{CL}}$ be the length of the crosslinker. We define the maximum tolerance for each half crosslinker as $t_{\text{max}} = l_{\text{CL}}$. (Intuitively, this means that half a CL can shrink/extend to the size of an entire CL if the prediction is at its lowest possible confidence.) From this, we define the tolerances $t_x = p_x \cdot t_{\text{max}}$ and $t_y = p_y \cdot t_{\text{max}}$. An identified crosslinker with length $l_{\text{CL}}$ and predicted distance $d$ between the binding sites is then considered valid iff
$$\max(l_{\text{CL}} - t_x - t_y , 0) \le d \le l_{\text{CL}} + t_x + t_y$$
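The pLDDT-adjusted rule can be sketched in a few lines. The names `plddt_tolerances` and `is_valid_plddt_adjusted` are hypothetical and chosen only for this example; the sketch assumes the pLDDT values for the two binding sites have already been looked up and are on the 0 to 100 scale.

```python
def plddt_tolerances(plddt_x, plddt_y, cl_length):
    """Per-site tolerances t_x = p_x * t_max and t_y = p_y * t_max."""
    t_max = cl_length  # maximum tolerance per half crosslinker
    p_x = 1.0 - plddt_x / 100.0
    p_y = 1.0 - plddt_y / 100.0
    return p_x * t_max, p_y * t_max


def is_valid_plddt_adjusted(distance, cl_length, plddt_x, plddt_y):
    """Check max(l_CL - t_x - t_y, 0) <= d <= l_CL + t_x + t_y."""
    t_x, t_y = plddt_tolerances(plddt_x, plddt_y, cl_length)
    lower = max(cl_length - t_x - t_y, 0.0)
    upper = cl_length + t_x + t_y
    return lower <= distance <= upper


# 30 Å crosslinker, sites with pLDDT 90 and 80: tolerances of about 3 Å and 6 Å,
# so the accepted distance range is roughly 21 Å to 39 Å.
t_x, t_y = plddt_tolerances(90.0, 80.0, 30.0)
accepted = is_valid_plddt_adjusted(38.0, 30.0, 90.0, 80.0)
rejected = is_valid_plddt_adjusted(40.0, 30.0, 90.0, 80.0)
```

At a pLDDT of 100 for both sites the tolerances vanish and only $d = l_{\text{CL}}$ passes; at a pLDDT of 0 for both sites the range widens to $0 \le d \le 3 \cdot l_{\text{CL}}$.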
Changes
- plddt_df output for multimer import (read from CIF, see comment in code)
- pae_matrix output for multimer import (read from full_data)
Changes TODO
Testing
On another note, we depend on the order of the pae_matrix matching the order of the CA atoms for each amino acid in the CIF. Adding atoms labelled CA to the CIF will break the PAE-based validation right now, so it'd be great, @tE3m, if you could add another filter to the PAE matching so that we filter out CAs from PTMs here.
PR checklist
Development
Mergeability
black / pnpm format and checked with pnpm lint
Code review