ITRAP implementation#44
Conversation
| __name = "ITRAP" | ||
| __version = "0.0.1" | ||
|
|
||
| def __init__(self, umi_cols=None, umi_count_TRA=None, umi_count_TRB=None, filters=None): |
There was a problem hiding this comment.
would suggest moving umi_params to preprocess_data as it is data set specific and won't be necessary until preprocessing. Filter you can leave as it affects algo logic.
| def __init__(self, umi_cols=None, umi_count_TRA=None, umi_count_TRB=None, filters=None): | ||
| """ | ||
| Args: | ||
| umi_cols: List of columns containing UMI counts for pMHCs (default set to ['neg_control', 'pmhc1']) |
There was a problem hiding this comment.
this variable is already covered in preprocess_data under pmhc_key. I would suggest to just overload the param to accept also an interable of pmhc_keys.
| """ | ||
| Args: | ||
| umi_cols: List of columns containing UMI counts for pMHCs (default set to ['neg_control', 'pmhc1']) | ||
| umi_count_TRA: List of columns containing UMI counts for TRA (default: None) |
There was a problem hiding this comment.
could you stick to the convention to call fields in Mudata.X, var, obsm = xx_key
| for col in self.umi_cols_mhc: | ||
| data[col] = mdata['gex'][:, col].X.toarray().reshape(-1) | ||
|
|
||
| def calc_delta(x): |
There was a problem hiding this comment.
move the internal helper function to function first line of method declaration
| self.idx_to_specificity = {i: s for i, s in enumerate(self.umi_cols_mhc)} | ||
|
|
||
| data = mdata['airr'].obs.copy() | ||
| for col in self.umi_cols_mhc: |
There was a problem hiding this comment.
why not utilze x = gex[:, pmhc_key].X.toarray().reshape((N,))
and store that X in data.X or obsm? so you don't need to loop through the list of pMHCs and properly use the AnnData structure.
| self.specificity_to_idx = {s: i for i, s in enumerate(self.umi_cols_mhc)} | ||
| self.idx_to_specificity = {i: s for i, s in enumerate(self.umi_cols_mhc)} | ||
|
|
||
| data = mdata['airr'].obs.copy() |
There was a problem hiding this comment.
This can be expensive depending on what as been done to the MuData object - e.g. could store UMAP coords, TCR Similarities, PCA embeddings and more.
Why not just create an internal empty Anndata to store your algorithm-specific infos and extract only the relevant info from the appropriate fields of the input MuData object.
| self.data['assignment_before_filtering'] = self.data['assignment'].copy() | ||
| self.data.loc[~filters, 'assignment'] = 0 | ||
|
|
||
| return self.data['assignment'].values.astype(int), self.data['assignment'].values.astype(float) |
There was a problem hiding this comment.
why do you return twice the assignment?
Def not a clean solution. I understand why you did what you did. But I'd say if the current interface does not fit, we need to abstract that interface further and perhaps create a more flexible interface or a super-interface for threshold-based models and a child interface for probabilistic models that inherits the super interface
And shouldn't it be reverse ordered (first float, then int) according to your definition in the docs?
| if 'matching_HLA' in self.filters: | ||
| raise NotImplementedError("Matching HLA filter is not implemented yet.") | ||
|
|
||
| # Filter 4: Complete TCRs |
There was a problem hiding this comment.
TCR-related QC (filter 4 and 6) is available through scirpy.tl.chain_qc()
| filters &= data[k] >= thr | ||
|
|
||
| # TODO Other filters are not implemented yet, only makes sense once we have the respective data | ||
| # Filter 2: Hashing singlets |
There was a problem hiding this comment.
would remove demultiplexing is a preprocessing step that can require additional tools so out-of-scope here?
| raise NotImplementedError("Complete TCRs filter is not implemented yet.") | ||
|
|
||
| # Filter 5: Specificity multiplets | ||
| if 'specificity_multiplets' in self.filters: |
There was a problem hiding this comment.
Should be implementable. of course, makes only sense if multiple dextramer were tested. But a vectorized implementation should take care of that edge case as well.
|
Ready for review.
|
Resolves #3
Re-implementation of ITRAP, adapted to our data format and scenario. Did not (yet) implement filters based on other information beyond UMI count, as data availability and nomenclature can vary heavily between datasets.
Short summary:
To adapt to our case, I used the UMI count between epitope and negative control, and assigned filtered out cells as negative.
Alternatively, we can also assign specificity on a clonotype level, using the Wilcoxon test, though this is not the original ITRAP framework.