
206 dimensionality reduction on proteins #390

Open
ferbsx wants to merge 11 commits into dev from 206-dimensionality-reduction-on-proteins

Conversation

@ferbsx
Collaborator

@ferbsx ferbsx commented Apr 29, 2026

Description

fixes #206
Added dimensionality reduction on protein level.

Changes

Before: dimensionality reduction was only performed at the sample level (fixed default, not customisable).
Now: a dropdown lets the user choose the level at which the dimensionality reduction is applied.

Options:

  • Sample [default]
  • Protein ID

Adapted scatter plot requirements to allow plotting of the results based on Protein ID.

Testing

Existing tests were updated to pass with the new structure. New tests were added for all functionalities.

PR checklist

Development

  • If necessary, I have updated the documentation (README, docstrings, etc.)
  • If necessary, I have created / updated tests.

Mergeability

  • main-branch has been merged into local branch to resolve conflicts
  • The tests and linter have passed AFTER local merge [only the GSEA tests are failing, which has been discussed]
  • The backend code has been formatted with black
  • The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

  • I have self-reviewed my code.
  • At least one other developer reviewed and approved the changes

@jorisfu jorisfu added the hackathon Viable issue for the April 2026 PROTzilla hackathon label Apr 29, 2026
@Yanjo96 Yanjo96 marked this pull request as ready for review April 30, 2026 06:37
@ferbsx ferbsx requested a review from hendraet April 30, 2026 06:38
@ferbsx ferbsx linked an issue Apr 30, 2026 that may be closed by this pull request
3 tasks
@Elena-kal Elena-kal requested review from tE3m May 6, 2026 09:34
Collaborator

@hendraet hendraet left a comment

Works as intended, but needs some minor refinements.

However, in its current state, the PR might not be as useful as imagined because I underspecified the issue. What would be needed to increase usefulness:

  • Hover annotations for each data point in the scatter plot. Currently, we lose all information about Samples/Protein IDs in the scatter plot, and having it would be especially helpful in the protein case. (This could be an easy fix: the information is already passed to the function, we just deliberately exclude it from the plot.)
  • Having no metadata available to color proteins in a scatter plot is also less than ideal. However, we currently cannot import any metadata that does not include a Sample column, so one could probably only do it in a hacky way.

),
DropdownField(
name="sample_name",
label="Choose the column that contains the sample information",

Calling the field "sample" and the variable sample_name might confuse users: the name is too close to the actual "Sample" column, and it might not be clear that "Protein ID" could also be a valid "sample column" in this case.


:return: returns a dictionary containing a list with a plotly figure and/or a list of messages
"""
if isinstance(metadata_df, pd.DataFrame):

If you want to plot proteins instead of samples, you currently cannot connect a metadata dataframe; otherwise, this if branch and its error are triggered. I feel like we should make it more transparent to users that they shouldn't connect metadata in this case.
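
One way to make this more transparent could be an early, explicit check that names the conflict. The sketch below is only an illustration: the function name check_metadata_for_level, the level strings, and the message format are assumptions, not the existing PROTzilla API.

```python
def check_metadata_for_level(level: str, metadata_connected: bool) -> list[dict]:
    """Return user-facing messages; an empty list means the combination is fine.

    Hypothetical helper: metadata is assumed to be keyed by "Sample", so it can
    only be combined with a reduction on the sample level.
    """
    if level != "Sample" and metadata_connected:
        return [
            {
                "level": "error",
                "msg": (
                    "Metadata can only be used when the reduction is applied on "
                    f"'Sample', but '{level}' was selected. Disconnect the "
                    "metadata input or switch the level to 'Sample'."
                ),
            }
        ]
    return []


messages = check_metadata_for_level("Protein ID", metadata_connected=True)
```

Returning a message instead of raising would fit a function that already returns a list of messages, but raising a ValueError with the same text would work equally well.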

-    return pd.pivot(
-        intensity_df, index="Sample", columns="Protein ID", values=values_name
-    )
+    return pd.pivot(intensity_df, index=index, columns=columns, values=values_name)

Maybe we should add some guardrails to make sure that index and columns are not the same, because this will lead to a cryptic error message. (This should ideally never happen, but if I had a penny for every time I didn't check assumptions because they could never possibly happen...)
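
A possible guardrail, sketched under assumptions: the wrapper name long_to_wide and its defaults are hypothetical; only the pd.pivot call mirrors the line under review.

```python
import pandas as pd


def long_to_wide(
    intensity_df: pd.DataFrame,
    values_name: str,
    index: str = "Sample",
    columns: str = "Protein ID",
) -> pd.DataFrame:
    """Pivot a long-format intensity dataframe, failing early with a clear message."""
    if index == columns:
        raise ValueError(
            f"index and columns must differ, but both are '{index}'."
        )
    missing = [
        name
        for name in (index, columns, values_name)
        if name not in intensity_df.columns
    ]
    if missing:
        raise ValueError(f"Column(s) missing from the intensity dataframe: {missing}")
    return pd.pivot(intensity_df, index=index, columns=columns, values=values_name)


# Hypothetical example data: two samples, two proteins.
long_df = pd.DataFrame(
    {
        "Sample": ["S1", "S1", "S2", "S2"],
        "Protein ID": ["P1", "P2", "P1", "P2"],
        "Intensity": [1.0, 2.0, 3.0, 4.0],
    }
)
by_sample = long_to_wide(long_df, "Intensity")
by_protein = long_to_wide(long_df, "Intensity", index="Protein ID", columns="Sample")
```

Swapping index and columns yields the protein-level layout, so the same check protects both use cases.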

input_df: pd.DataFrame,
metadata_df: pd.DataFrame | None = None,
metadata_column: str | None = None,
sample_name: str = "Sample",

See the comment above: I feel like sample_name is misleading.
However, it seems to be used only for metadata processing. Since metadata is required to include a "Sample" column (a different problem), and using "Protein ID" together with metadata will lead to errors anyway, one could probably revert the sample_name parameter entirely.

"df_name,n_components,method",
[
("dimension_reduction_df", 2, TSNEMethod.exact.value),
("dimension_reduction_four_proteins_df", 2, TSNEMethod.exact.value),

I would like all of these tests to also cover 3 components.

"The column selected for annotation is not present in the corresponding metadata dataframe.",
)

if sample_name not in input_df.columns:

I would like to see a test that checks that this ValueError is raised.
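
Such a test could look roughly like the sketch below. The function require_column is a hypothetical stand-in for the check under review, and its error message is an assumption; in the real test, the actual plotting function would be called instead.

```python
import pandas as pd
import pytest


def require_column(input_df: pd.DataFrame, sample_name: str) -> None:
    """Hypothetical stand-in for the check in the reviewed function."""
    if sample_name not in input_df.columns:
        raise ValueError(f"Column '{sample_name}' not found in the input dataframe.")


def test_missing_sample_column_raises_value_error():
    # The dataframe deliberately lacks the requested identifier column.
    df = pd.DataFrame({"Component 1": [0.1], "Component 2": [0.2]})
    with pytest.raises(ValueError, match="Protein ID"):
        require_column(df, "Protein ID")
```

Matching on part of the message keeps the test from passing on an unrelated ValueError raised elsewhere in the function.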

)


@pytest.mark.parametrize("n_components", [2])

Either use more than one value for n_components or don't parametrize at all.
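
For illustration, a parametrization that covers both 2 and 3 components might look like this. The function pca_stand_in is a hypothetical placeholder (a plain PCA via SVD) used only to keep the sketch self-contained; it is not the method tested in this PR.

```python
import numpy as np
import pytest


def pca_stand_in(data: np.ndarray, n_components: int) -> np.ndarray:
    """Hypothetical stand-in for the method under test: PCA via SVD."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the first n_components principal directions.
    return centered @ vt[:n_components].T


@pytest.mark.parametrize("n_components", [2, 3])
def test_output_has_requested_number_of_components(n_components):
    rng = np.random.default_rng(0)
    data = rng.normal(size=(6, 5))
    result = pca_stand_in(data, n_components)
    assert result.shape == (6, n_components)
```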


Labels

hackathon Viable issue for the April 2026 PROTzilla hackathon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dimensionality Reduction on Proteins

4 participants