206 dimensionality reduction on proteins#390
Conversation
hendraet
left a comment
There was a problem hiding this comment.
Works as intended, but needs some minor refinements.
However, in its current state, the PR might not be as useful as imagined because I underspecified the issue. What would be needed to increase usefulness:
- Hover annotations for each data point in the scatter plot. Currently, we lose all information about Samples/Protein IDs in the scatter plot, which would be especially helpful in the protein case. (Could be an easy fix, currently we just deliberately exclude this information from the plot, but it is passed to the function)
- Having no metadata available to color proteins in a scatter plot is also less than ideal. However, we currently cannot import any metadata that does not include a
Samplecolumn, so one could probably only do it in a hacky way.
| ), | ||
| DropdownField( | ||
| name="sample_name", | ||
| label="Choose the column that contains the sample information", |
There was a problem hiding this comment.
Calling it sample and the variable sample_name might confuse the user because it is too close to the actual "Sample" column and it might not be clear that "Protein ID" could also be a valid "sample column" in this case.
|
|
||
| :return: returns a dictionary containing a list with a plotly figure and/or a list of messages | ||
| """ | ||
| if isinstance(metadata_df, pd.DataFrame): |
There was a problem hiding this comment.
If you want to plot proteins instead of samples, currently, you cannot connect a metadata dataframe - otherwise, this if and its error are triggered. I feel like we should make it more transparent to the users that they shouldn't connect metadata in this case
| return pd.pivot( | ||
| intensity_df, index="Sample", columns="Protein ID", values=values_name | ||
| ) | ||
| return pd.pivot(intensity_df, index=index, columns=columns, values=values_name) |
There was a problem hiding this comment.
Maybe we should add some guardrails to make sure that index and columns are not the same, because this will lead to a cryptic error message. (This should ideally never happen, but if I had a penny for every time I didn't check assumptions because they could never possibly happen...)
| input_df: pd.DataFrame, | ||
| metadata_df: pd.DataFrame | None = None, | ||
| metadata_column: str | None = None, | ||
| sample_name: str = "Sample", |
There was a problem hiding this comment.
See comment above, I feel like sample_name is misleading.
However, it seems like it is only for metadata processing. Since metadata is enforced to include a "Sample" column (different problem), and using "Protein ID" together with metadata will lead to errors anyway, one could probably also revert the whole sample_name completely.
| "df_name,n_components,method", | ||
| [ | ||
| ("dimension_reduction_df", 2, TSNEMethod.exact.value), | ||
| ("dimension_reduction_four_proteins_df", 2, TSNEMethod.exact.value), |
There was a problem hiding this comment.
Would like to have all of these tests also test 3 components
| "The column selected for annotation is not present in the corresponding metadata dataframe.", | ||
| ) | ||
|
|
||
| if sample_name not in input_df.columns: |
There was a problem hiding this comment.
Would like to see a test that checks the raising of this ValueError
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("n_components", [2]) |
There was a problem hiding this comment.
Either use more than one value for n_compontents or don't parametrize at all
Description
fixes #206
Added dimensionality reduction on protein level.
Changes
Before: dimensionality reduction only done on sample level (per default, not customisable)
Now: added drop down to choose which level the dimensionality reduction should be applied on.
Options:
Adapted scatter plot requirements to allow plotting of the results based on Protein ID.
Testing
Former tests are updated to pass with the new structure. New tests added for all functionalities.
PR checklist
Development
Mergeability
blackpnpm formatand checked withpnpm lintCode review