Feat/imputation improvements by jorgeMFS · Pull Request #78 · jorgeMFS/PhenoQC

jorgeMFS · 2025-08-14T13:24:29Z

Summary by Sourcery

Revamp imputation configuration in the GUI to a flexible, parameter-driven interface, enhance quality metrics normalization and reporting, add CLI usability improvements, and introduce a full end-to-end clinical features script with accompanying example configs.

New Features:

Introduce a strategy-agnostic, config-driven imputation panel in the GUI with dynamic parameter widgets, per-column overrides via a data editor, and integrated tuning controls
Add a comprehensive end-to-end clinical_all_features script to generate test data, schemas, configs, and validate CLI runs

Enhancements:

Improve quality metrics widget to normalize both list and dict configurations and filter out unsupported metrics
Extend imputation bias reporting in PDF outputs to include PSI and Cramér’s V thresholds and triggers
Support a --metrics alias for the --quality-metrics CLI option
Lazy-load the GUI entrypoint to prevent heavy dependencies from loading on import
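
The list- and dict-style normalization mentioned in the Enhancements list could look roughly like the sketch below. The function name `normalize_quality_metrics`, the `SUPPORTED_METRICS` tuple (built from the metric names that appear in this PR's reviewer's guide), and the choice to treat a missing `enable` flag as enabled are all assumptions made for illustration, not the PR's actual code.

```python
from typing import Any

# Assumed set of supported metric names (hypothetical; mirrors names mentioned in this PR).
SUPPORTED_METRICS = (
    "imputation_bias", "imputation_stability", "redundancy",
    "accuracy", "traceability", "timeliness",
)


def normalize_quality_metrics(cfg: Any) -> list[str]:
    """Return the enabled, supported metric names from a list- or dict-style config."""
    if cfg is None:
        return []
    if isinstance(cfg, str):
        cfg = [cfg]
    if isinstance(cfg, dict):
        selected = []
        for name, value in cfg.items():
            if isinstance(value, dict):
                # Nested form, e.g. {"imputation_bias": {"enable": True, ...}}.
                # Assumption: a missing "enable" flag counts as enabled.
                enabled = bool(value.get("enable", True))
            else:
                # Flat form, e.g. {"redundancy": True}.
                enabled = bool(value)
            if enabled:
                selected.append(name)
    else:
        selected = list(cfg)
    # Drop unknown keys and return names in a stable order.
    return [m for m in SUPPORTED_METRICS if m in selected]
```

Both `["accuracy", "bogus"]` and `{"accuracy": {"enable": True}, "bogus": True}` reduce to `["accuracy"]` under this sketch, so the widget only ever receives options it can display.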

Documentation:

Update README with the --metrics alias, new imputation bias thresholds, and expanded example configuration sections
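
As an illustration of how an option alias such as `--metrics` can be declared, here is a minimal sketch using the standard-library `argparse`. Whether PhenoQC's CLI is built on `argparse`, and the `prog`, `dest`, `nargs`, and help text used here, are assumptions; only the two flag names come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # prog name and help text are placeholders for this sketch.
    parser = argparse.ArgumentParser(prog="phenoqc")
    parser.add_argument(
        "--quality-metrics", "--metrics",  # both spellings map to one destination
        dest="quality_metrics",
        nargs="+",
        default=[],
        help="Quality metrics to compute (alias: --metrics).",
    )
    return parser


parser = build_parser()
short = parser.parse_args(["--metrics", "accuracy", "redundancy"])
long = parser.parse_args(["--quality-metrics", "accuracy", "redundancy"])
```

Because both option strings are registered in one `add_argument` call, they share a single destination and a single help entry, so downstream code does not need to know which spelling was used.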

Chores:

Include example configuration and schema files in scripts/config for the clinical_all_features end-to-end script

sourcery-ai · 2025-08-14T13:24:35Z

Reviewer's Guide

This PR overhauls the imputation configuration UI to a strategy-agnostic, schema-driven panel with unified parameter rendering and per-column overrides, extends quality metrics normalization in both GUI and reporting, adds support for categorical bias thresholds, updates CLI aliases and documentation, and introduces a comprehensive clinical end-to-end testing script.

Sequence diagram for the new imputation configuration flow in the GUI

sequenceDiagram
    actor User
    participant StreamlitUI
    participant ConfigDict
    User->>StreamlitUI: Open imputation configuration panel
    StreamlitUI->>User: Show global strategy selectbox
    User->>StreamlitUI: Select strategy and set parameters
    StreamlitUI->>User: Show per-column overrides table
    User->>StreamlitUI: Edit per-column strategies and params
    StreamlitUI->>User: Open tuning expander and set tuning params
    User->>StreamlitUI: Save configuration
    StreamlitUI->>ConfigDict: Update imputation config with strategy, params, per_column, tuning
    StreamlitUI->>User: Show success message

Entity relationship diagram for imputation bias thresholds

erDiagram
    IMPUTATION_BIAS {
      smd_threshold float
      var_ratio_low float
      var_ratio_high float
      ks_alpha float
      psi_threshold float
      cramer_threshold float
    }
    QUALITY_METRICS {
      imputation_bias dict
      imputation_stability dict
    }
    QUALITY_METRICS ||--|{ IMPUTATION_BIAS : contains
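
To make the two categorical thresholds concrete, the sketch below computes a population stability index (PSI) and Cramér's V for a categorical column and compares them against `psi_threshold` and `cramer_threshold`. It uses the textbook formulas; how PhenoQC actually computes these values, the idea of comparing the column before and after imputation, the threshold numbers, and all function names are assumptions.

```python
import math
from collections import Counter


def psi(before: list, after: list, eps: float = 1e-6) -> float:
    """Population stability index between two categorical samples."""
    if not before or not after:
        raise ValueError("both samples must be non-empty")
    b_counts, a_counts = Counter(before), Counter(after)
    total = 0.0
    for cat in set(before) | set(after):
        # eps avoids log(0) when a category is absent from one sample.
        b = max(b_counts[cat] / len(before), eps)
        a = max(a_counts[cat] / len(after), eps)
        total += (a - b) * math.log(a / b)
    return total


def cramers_v(before: list, after: list) -> float:
    """Cramér's V for the 2 x k table (sample membership x category)."""
    if not before or not after:
        raise ValueError("both samples must be non-empty")
    rows = (Counter(before), Counter(after))
    totals = (len(before), len(after))
    cats = set(before) | set(after)
    n = sum(totals)
    chi2 = 0.0
    for row, row_total in zip(rows, totals):
        for cat in cats:
            expected = row_total * (rows[0][cat] + rows[1][cat]) / n
            chi2 += (row[cat] - expected) ** 2 / expected
    k = min(len(rows) - 1, len(cats) - 1)
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def bias_triggers(psi_value: float, v_value: float, thresholds: dict) -> list[str]:
    """Names of the categorical rules whose threshold is exceeded."""
    fired = []
    if psi_value > thresholds["psi_threshold"]:
        fired.append("psi")
    if v_value > thresholds["cramer_threshold"]:
        fired.append("cramers_v")
    return fired


# Hypothetical threshold values, chosen only for the example.
thresholds = {"psi_threshold": 0.1, "cramer_threshold": 0.1}
before = ["a"] * 50 + ["b"] * 50
after = ["a"] * 90 + ["b"] * 10
fired = bias_triggers(psi(before, after), cramers_v(before, after), thresholds)
```

With identical samples both statistics are zero and nothing fires; with the shifted sample above both exceed the example thresholds.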

Class diagram for the updated imputation configuration data structure

classDiagram
    class ImputationConfig {
      +strategy: str
      +params: dict
      +per_column: dict
      +tuning: dict
    }
    class PerColumnOverride {
      +strategy: str
      +params: dict
    }
    class TuningConfig {
      +enable: bool
      +mask_fraction: float
      +scoring: str
      +max_cells: int
      +random_state: int
      +grid: dict
    }
    ImputationConfig --> PerColumnOverride : per_column
    ImputationConfig --> TuningConfig : tuning
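
The structure in the class diagram above can be mirrored with plain dataclasses, as in this sketch. The field names follow the diagram; the default values are hypothetical placeholders, and the PR itself appears to persist a plain config dict rather than typed classes.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class PerColumnOverride:
    strategy: str
    params: dict = field(default_factory=dict)


@dataclass
class TuningConfig:
    # All defaults below are placeholders for the sketch, not PhenoQC's defaults.
    enable: bool = False
    mask_fraction: float = 0.1
    scoring: str = "MAE"
    max_cells: int = 20000
    random_state: int = 42
    grid: dict = field(default_factory=dict)


@dataclass
class ImputationConfig:
    strategy: str = "none"
    params: dict = field(default_factory=dict)
    per_column: dict = field(default_factory=dict)  # column name -> PerColumnOverride
    tuning: TuningConfig = field(default_factory=TuningConfig)


cfg = ImputationConfig(
    strategy="knn",
    params={"n_neighbors": 5, "weights": "uniform"},
    per_column={"age": PerColumnOverride(strategy="median")},
    tuning=TuningConfig(enable=True, grid={"n_neighbors": [3, 5, 7]}),
)

# asdict() recurses into nested dataclasses, including those stored in dict values,
# yielding a plain dict suitable for YAML or JSON serialization.
config_dict = asdict(cfg)
```

The column name `age` is invented for the example; the strategy names and the `n_neighbors` grid values echo those shown elsewhere in this PR.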

Class diagram for quality metrics normalization and selection

classDiagram
    class QualityMetricsConfig {
      +imputation_bias: dict
      +imputation_stability: dict
      +redundancy: dict
      +accuracy: dict
      +traceability: dict
      +timeliness: dict
    }
    class QualityMetricsSelection {
      +options: list
      +selected: list
    }
    QualityMetricsConfig <|-- QualityMetricsSelection

File-Level Changes

Change	Details	Files
Revamped imputation configuration panel to a generic schema-driven implementation	Introduced PARAM_SPECS mapping for strategy parameters; implemented _render_params helper for dynamic widget rendering; replaced manual per-column expanders with st.data_editor for overrides; consolidated global strategy, params, per-column overrides, and tuning into unified config persistence	`src/phenoqc/gui/gui.py`
Enhanced quality metrics widget to normalize dict/list inputs	Support list- and dict-style metric configs; filter out unknown keys and handle nested enable flags	`src/phenoqc/gui/views.py`
Expanded reporting bias diagnostics with categorical thresholds	Appended PSI and Cramér's V thresholds to warning rules text; added trigger logic for psi and cramers_v values in bias tables	`src/phenoqc/reporting.py`
Added CLI alias and updated usage documentation	Introduced --metrics as alias for --quality-metrics; updated README and usage.rst to reflect new alias and threshold options	`src/phenoqc/cli.py` `README.md` `docs/source/usage.rst`
Enabled lazy GUI import to avoid heavy dependencies at package import	Wrapped gui.main in a lazy import stub in `__init__.py`	`src/phenoqc/gui/__init__.py`
Added comprehensive clinical end-to-end testing script	Script generates dataset, schema, config, and custom mapping; runs CLI in online/offline modes and verifies outputs; includes sample config and mapping files under scripts/config	`scripts/clinical_all_features_e2e.py` `scripts/config/clinical_all_features_config.yaml` `scripts/config/clinical_all_features_schema.json` `scripts/config/clinical_all_features_custom_mapping.json`
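
The lazy GUI import listed in the table above can be sketched as a stub that defers the real import until the entrypoint is called. The helper name `lazy_entrypoint` is hypothetical, and the standard-library `statistics` module stands in for the heavy GUI module purely so the example runs anywhere; the PR's actual stub in `src/phenoqc/gui/__init__.py` may differ.

```python
import importlib
from typing import Any, Callable


def lazy_entrypoint(module_name: str, attr: str) -> Callable[..., Any]:
    """Return a stub that imports module_name only when it is first called."""
    def _stub(*args: Any, **kwargs: Any) -> Any:
        # The import happens here, at call time, not when the package is imported.
        module = importlib.import_module(module_name)
        return getattr(module, attr)(*args, **kwargs)
    return _stub


# Stand-in for something like lazy_entrypoint("phenoqc.gui.gui", "main").
main = lazy_entrypoint("statistics", "mean")
result = main([1, 2, 3])
```

Importing the package then costs only the definition of the stub; the heavy dependencies load the first time `main()` runs.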

codecov-commenter · 2025-08-14T13:26:05Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 3.09278% with 94 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/phenoqc/gui/gui.py	0.00%	64 Missing ⚠️
src/phenoqc/reporting.py	0.00%	16 Missing ⚠️
src/phenoqc/gui/views.py	14.28%	10 Missing and 2 partials ⚠️
src/phenoqc/gui/__init__.py	33.33%	2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Files with missing lines	Coverage Δ
src/phenoqc/cli.py	`45.25% <ø> (ø)`
src/phenoqc/gui/__init__.py	`50.00% <33.33%> (-50.00%)`	⬇️
src/phenoqc/gui/views.py	`46.15% <14.28%> (-38.47%)`	⬇️
src/phenoqc/reporting.py	`38.72% <0.00%> (-1.02%)`	⬇️
src/phenoqc/gui/gui.py	`0.00% <0.00%> (-3.76%)`	⬇️

... and 1 file with indirect coverage changes

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
- Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
- Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.

## Individual Comments

### Comment 1
<location> `src/phenoqc/gui/gui.py:733` </location>
<code_context>
+        def _render_params(spec: dict, initial: Optional[dict] = None) -> dict:
</code_context>

<issue_to_address>
Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.

Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.

Suggested implementation:

```python
        def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
            initial = initial or {}
            context = context or "default"
            values: dict = {}
            for name, meta in spec.items():
                w = meta["widget"]
                key = f"{w}_{context}_{name}"
                if w == "int":
                    values[name] = st.number_input(
                        name,
                        value=int(initial.get(name, meta.get("default", 0))),
                        min_value=int(meta.get("min", -10000)),
                        max_value=int(meta.get("max", 10000)),
                        key=key,

```

1. For other widget types (e.g., "float", "selectbox", etc.) inside `_render_params`, update their calls to also use the unique `key=key` argument.
2. Wherever `_render_params` is called elsewhere in your codebase, you must now pass a suitable `context` string (e.g., strategy name, column name, etc.) to ensure uniqueness.
</issue_to_address>

### Comment 2
<location> `src/phenoqc/gui/gui.py:709` </location>
<code_context>
+            if not col:
+                continue
+            col_strategy = row.get("strategy") or strategy
             try:
-                grid_vals = [int(x.strip()) for x in grid_n.split(',') if x.strip()]
+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception:
-                grid_vals = [3, 5, 7]
-            tuning_cfg = {
-                'enable': True,
-                'mask_fraction': float(mask_fraction),
-                'scoring': scoring,
-                'max_cells': int(max_cells),
-                'random_state': int(random_state),
-                'grid': {'n_neighbors': grid_vals}
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}
+
</code_context>

<issue_to_address>
Silent fallback to empty dict on JSON parse error may hide user mistakes.

Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.

Suggested implementation:

```python
+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception as e:
+                import logging
+                logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}

```

If the file does not already import the `logging` module at the top, you should add:
```python
import logging
```
at the top of the file.
</issue_to_address>

## Security Issues

### Issue 1
<location> `scripts/clinical_all_features_e2e.py:326` </location>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>

sourcery-ai · 2025-08-14T13:26:09Z

+        def _render_params(spec: dict, initial: Optional[dict] = None) -> dict:
+            initial = initial or {}
+            values: dict = {}
+            for name, meta in spec.items():
+                w = meta["widget"]
+                if w == "int":
+                    values[name] = st.number_input(
+                        name,
+                        value=int(initial.get(name, meta.get("default", 0))),
+                        min_value=int(meta.get("min", -10000)),


suggestion (bug_risk): Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.

Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.

Suggested implementation:

        def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
            initial = initial or {}
            context = context or "default"
            values: dict = {}
            for name, meta in spec.items():
                w = meta["widget"]
                key = f"{w}_{context}_{name}"
                if w == "int":
                    values[name] = st.number_input(
                        name,
                        value=int(initial.get(name, meta.get("default", 0))),
                        min_value=int(meta.get("min", -10000)),
                        max_value=int(meta.get("max", 10000)),
                        key=key,

For other widget types (e.g., "float", "selectbox", etc.) inside _render_params, update their calls to also use the unique key=key argument.

Wherever _render_params is called elsewhere in your codebase, you must now pass a suitable context string (e.g., strategy name, column name, etc.) to ensure uniqueness.
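
Both remedies raised in this comment, namespacing keys by context and clearing state when the strategy changes, can be exercised without Streamlit. In the sketch below a plain `dict` stands in for the session state, and the helper names, the context strings, and the spec contents are hypothetical.

```python
def widget_key(widget: str, context: str, name: str) -> str:
    """Build a widget key that is unique per (widget type, context, parameter)."""
    return f"{widget}_{context}_{name}"


def clear_context(state: dict, spec: dict, context: str) -> list[str]:
    """Delete the keys that `spec` would create under `context`; return what was removed."""
    removed = []
    for name, meta in spec.items():
        key = widget_key(meta["widget"], context, name)
        if key in state:
            del state[key]
            removed.append(key)
    return removed


# Hypothetical spec fragment, shaped like the PARAM_SPECS entries in this PR.
knn_spec = {
    "n_neighbors": {"widget": "int"},
    "weights": {"widget": "select"},
}

# A plain dict stands in for the session state here.
state = {
    widget_key("int", "global:knn", "n_neighbors"): 5,
    widget_key("int", "col:age", "n_neighbors"): 3,
}
removed = clear_context(state, knn_spec, "global:knn")
```

The same parameter name now produces distinct keys for the global panel and for a per-column override, and switching strategy can drop only the keys that belonged to the old context.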

sourcery-ai · 2025-08-14T13:26:09Z

+
+        # Parameter specs per strategy
+        PARAM_SPECS = {
+            "none": {},
+            "mean": {},
+            "median": {},
+            "mode": {},
+            "knn": {
+                "n_neighbors": {"widget": "int", "default": 5, "min": 1, "max": 100},
+                "weights": {"widget": "select", "options": ["uniform", "distance"], "default": "uniform"},


suggestion (bug_risk): Silent fallback to empty dict on JSON parse error may hide user mistakes.

Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.

Suggested implementation:

+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception as e:
+                import logging
+                logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}

If the file does not already import the logging module at the top, you should add:

import logging

at the top of the file.
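
One way to surface the failure instead of hiding it is to have the parser return the error alongside the result, so the caller decides how to show it. This is a sketch under assumptions: `parse_column_params` is a hypothetical helper, and the column names in the usage example are invented.

```python
import json
from typing import Any, Optional


def parse_column_params(raw: Any) -> tuple[dict, Optional[str]]:
    """Parse a per-column params cell; return (params, error_message_or_None)."""
    # Empty cells and non-string values (e.g. None) are treated as "no overrides".
    text = raw.strip() if isinstance(raw, str) else ""
    if not text:
        return {}, None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return {}, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return {}, "params must be a JSON object"
    return parsed, None


# Collect errors per column rather than silently falling back.
rows = [
    {"column": "age", "params": '{"n_neighbors": 3}'},
    {"column": "bmi", "params": "{bad"},
]
errors = {}
for row in rows:
    params, error = parse_column_params(row["params"])
    if error:
        errors[row["column"]] = error
```

In a Streamlit panel the collected messages could then be shown to the user, for example with `st.warning`, rather than only written to a log that GUI users may never see.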

sourcery-ai · 2025-08-14T13:26:09Z

+    env = os.environ.copy()
+    env["PYTHONPATH"] = SRC_PATH + (os.pathsep + env.get("PYTHONPATH", ""))
+    print("[INFO] Running:", " ".join(cmd))
+    proc = subprocess.run(cmd, capture_output=True, text=True, env=env)


security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

Source: opengrep
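
For context on this finding: when `subprocess.run` receives a list of arguments and `shell` is left at its default of `False`, no shell parses the command, so quoting is only needed for display. Note also that the standard library provides `shlex.quote()` and `shlex.join()`; there is no `shlex.escape()` in Python. The sketch below assumes the script builds `cmd` as a list of strings; `run_cli` is a hypothetical wrapper, not the script's actual function.

```python
import shlex
import subprocess
import sys


def run_cli(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command given as an argument list, without invoking a shell."""
    if not cmd or not all(isinstance(part, str) for part in cmd):
        raise TypeError("cmd must be a non-empty list of strings")
    # shlex.join quotes each argument, so the logged line is copy-paste safe.
    print("[INFO] Running:", shlex.join(cmd))
    # shell=False is the default: arguments are passed to the program verbatim.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


proc = run_cli([sys.executable, "-c", "print('ok')"])
```

Logging with `shlex.join(cmd)` instead of `" ".join(cmd)` keeps arguments that contain spaces unambiguous in the output, which helps when re-running a failing command by hand.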

jorgeMFS added 3 commits August 13, 2025 15:57

docs: add metrics toggles and examples in README/usage; cli: add --me…

3f6c11a

…trics alias; report: respect dict-style enable flags for additional quality tables

GUI: strategy-agnostic imputation panel; fix quality metrics defaults…

7a6dc96

…; Reporting: add categorical PSI/Cramér’s V to bias rule header and per-row triggers

Scripts: add clinical_all_features_e2e end-to-end exam (dataset gener…

56de9b5

…ation, full schema/config/custom mappings, online+offline runs, per-column imputation coverage, fallback output discovery)

sourcery-ai Bot requested changes Aug 14, 2025

jorgeMFS merged commit a901959 into main Aug 14, 2025
4 checks passed

jorgeMFS merged 3 commits into main from feat/imputation-improvements
