Skip to content

335 allow omitting of nan values in t tests#393

Open
silenus092 wants to merge 7 commits into
339-multithreading-for-t-testfrom
335-allow-omitting-of-nan-values-in-t-tests
Open

335 allow omitting of nan values in t tests#393
silenus092 wants to merge 7 commits into
339-multithreading-for-t-testfrom
335-allow-omitting-of-nan-values-in-t-tests

Conversation

@silenus092
Copy link
Copy Markdown
Collaborator

Description

fixes #335

This pull request introduces configurable handling of NaN (missing) values in the differential expression t-test analysis, allowing users to specify how NaNs are treated during statistical testing.

Changes

  • Before: NaN values are always omitted during t-tests.

  • Now: During t-tests NaN values are handled one of three ways: "Propagate", "Omit", and "Raise" see here for explanations.

  • Updated the t_test function to accept a nan_policy parameter and pass it to scipy.stats.ttest_ind, enabling the user to choose between propagating, omitting, or raising errors on NaNs. [1] [2] [3]

User interface and API changes:

  • Added a DropdownField for nan_policy selection in the data analysis form, defaulting to "Raise". [1] [2]

Testing and documentation:

  • Added and updated tests in test_differential_expression.py to cover all nan_policy behaviors (propagate, omit, raise), including edge cases where NaNs result in exclusion, valid results, or errors. Tests also clarify the distinction between p-value computation and fold change calculation when NaNs are present. [1] [2] [3] [4]

Testing

python -m pytest tests/protzilla/data_analysis/test_differential_expression.py -v

PR checklist

Development

  • If necessary, I have updated the documentation (README, docstrings, etc.)
  • If necessary, I have created / updated tests.

Mergeability

  • main-branch has been merged into local branch to resolve conflicts
  • The tests and linter have passed AFTER local merge
  • The backend code has been formatted with black
  • The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

  • I have self-reviewed my code.
  • At least one other developer reviewed and approved the changes

	modified:   backend/tests/protzilla/data_integration/test_enrichment_analysis.py
@witsyke
Copy link
Copy Markdown
Collaborator

witsyke commented Apr 29, 2026

Just as information #391 completely changes the way the t-test is calculated. Therefore the stats.ttest_ind() function is no longer used and has been replaced with a manual vectorized implementation. Therefore, these two PRs are incompatible.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 29, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  backend/protzilla/data_analysis
  differential_expression_t_test.py 270
  backend/protzilla/methods
  data_analysis.py
Project Total  

This report was generated by python-coverage-comment-action

@jorisfu jorisfu added the hackathon Viable issue for the April 2026 PROTzilla hackathon label Apr 29, 2026
silenus092 and others added 2 commits April 29, 2026 16:12
- Replace old scipy loop-based implementation with PR 391's vectorized version
- Add nan_policy parameter with conditional dropna/pivot logic:
    * 'raise'     → ValueError if any NaN found in intensity column
    * 'omit'      → dropna before pivot (NaN rows excluded from all stats + fold change)
    * 'propagate' → keep NaN; propagate into statistics so proteins are excluded
- Fix fold change NaN regression: with 'omit', median is now computed from clean
  data (post-dropna), so fold change is a valid number (not NaN as in old impl)
- Update tests accordingly: remove 'fold change is NaN under omit' assertions
…nto 335-allow-omitting-of-nan-values-in-t-tests

Co-authored-by: Copilot <copilot@github.com>
@silenus092 silenus092 changed the base branch from dev to 339-multithreading-for-t-test April 29, 2026 14:31
@silenus092
Copy link
Copy Markdown
Collaborator Author

silenus092 commented Apr 29, 2026

Ok,

branch 335-allow-omitting-of-nan-values-in-t-tests (allow omitting NaN values in t-tests) was originally based on dev,
but PR #391 introduced a new vectorized t-test implementation that replaced
the old per-protein scipy loop. This makes the original nan_policy approach incompatible with the new architecture.

This branch 335-allow-omitting-of-nan-values-in-t-tests must merge into 339-multithreading-for-t-test to adapt the nan_policy feature to work with the vectorized pipeline

@hendraet
Copy link
Copy Markdown
Collaborator

Ok,

branch 335-allow-omitting-of-nan-values-in-t-tests (allow omitting NaN values in t-tests) was originally based on dev, but PR #391 introduced a new vectorized t-test implementation that replaced the old per-protein scipy loop. This makes the original nan_policy approach incompatible with the new architecture.

This branch 335-allow-omitting-of-nan-values-in-t-tests must merge into 339-multithreading-for-t-test to adapt the nan_policy feature to work with the vectorized pipeline

I'm so sorry that I did not see this problem coming 🤦‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hackathon Viable issue for the April 2026 PROTzilla hackathon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants