Feat/imputation improvements #78

Merged
jorgeMFS merged 3 commits into main from feat/imputation-improvements
Aug 14, 2025

@jorgeMFS jorgeMFS commented Aug 14, 2025

Owner

Summary by Sourcery

Revamp imputation configuration in the GUI to a flexible, parameter-driven interface, enhance quality metrics normalization and reporting, add CLI usability improvements, and introduce a full end-to-end clinical features script with accompanying example configs.

New Features:

  • Introduce a strategy-agnostic, config-driven imputation panel in the GUI with dynamic parameter widgets, per-column overrides via a data editor, and integrated tuning controls
  • Add a comprehensive end-to-end clinical_all_features script to generate test data, schemas, configs, and validate CLI runs

Enhancements:

  • Improve quality metrics widget to normalize both list and dict configurations and filter out unsupported metrics
  • Extend imputation bias reporting in PDF outputs to include PSI and Cramér’s V thresholds and triggers
  • Support a --metrics alias for the --quality-metrics CLI option
  • Lazy-load the GUI entrypoint to prevent heavy dependencies from loading on import
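
The `--metrics` alias listed above could be declared roughly as follows. This is a minimal sketch, not the project's actual parser: the `prog` name, `dest`, `nargs`, and help text are assumptions, and only the two option strings come from the summary. `argparse` accepts several option strings in one `add_argument` call, so both spellings fill the same destination.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the two option names are taken from the PR summary.
    parser = argparse.ArgumentParser(prog="phenoqc")
    parser.add_argument(
        "--quality-metrics",
        "--metrics",  # alias: both option strings write to the same destination
        dest="quality_metrics",
        nargs="+",
        default=[],
        help="Quality metrics to compute (assumed to be a space-separated list)",
    )
    return parser


long_form = build_parser().parse_args(["--quality-metrics", "accuracy", "redundancy"])
short_form = build_parser().parse_args(["--metrics", "accuracy", "redundancy"])
```

With this declaration, `long_form.quality_metrics` and `short_form.quality_metrics` hold the same list, so downstream code does not need to know which spelling the user typed.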

Documentation:

  • Update README with the --metrics alias, new imputation bias thresholds, and expanded example configuration sections

Chores:

  • Include example configuration and schema files in scripts/config for the clinical_all_features end-to-end script

…trics alias; report: respect dict-style enable flags for additional quality tables
…; Reporting: add categorical PSI/Cramér’s V to bias rule header and per-row triggers
…ation, full schema/config/custom mappings, online+offline runs, per-column imputation coverage, fallback output discovery)
@sourcery-ai

sourcery-ai Bot commented Aug 14, 2025

Contributor

Reviewer's Guide

This PR overhauls the imputation configuration UI to a strategy-agnostic, schema-driven panel with unified parameter rendering and per-column overrides, extends quality metrics normalization in both GUI and reporting, adds support for categorical bias thresholds, updates CLI aliases and documentation, and introduces a comprehensive clinical end-to-end testing script.

Sequence diagram for the new imputation configuration flow in the GUI

sequenceDiagram
    actor User
    participant StreamlitUI
    participant ConfigDict
    User->>StreamlitUI: Open imputation configuration panel
    StreamlitUI->>User: Show global strategy selectbox
    User->>StreamlitUI: Select strategy and set parameters
    StreamlitUI->>User: Show per-column overrides table
    User->>StreamlitUI: Edit per-column strategies and params
    StreamlitUI->>User: Open tuning expander and set tuning params
    User->>StreamlitUI: Save configuration
    StreamlitUI->>ConfigDict: Update imputation config with strategy, params, per_column, tuning
    StreamlitUI->>User: Show success message

Entity relationship diagram for imputation bias thresholds

erDiagram
    IMPUTATION_BIAS {
      smd_threshold float
      var_ratio_low float
      var_ratio_high float
      ks_alpha float
      psi_threshold float
      cramer_threshold float
    }
    QUALITY_METRICS {
      imputation_bias dict
      imputation_stability dict
    }
    QUALITY_METRICS ||--|{ IMPUTATION_BIAS : contains
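
To make the threshold fields in the diagram concrete, here is a rough sketch of how per-row triggers might be derived from them. The threshold values, the row keys other than `psi` and `cramers_v`, and the comparison directions are all assumptions for illustration; the actual logic lives in `src/phenoqc/reporting.py` and may differ.

```python
from typing import Dict, List, Optional

# Hypothetical threshold values, chosen only for illustration.
THRESHOLDS: Dict[str, float] = {
    "smd_threshold": 0.1,
    "var_ratio_low": 0.5,
    "var_ratio_high": 2.0,
    "ks_alpha": 0.05,
    "psi_threshold": 0.1,
    "cramer_threshold": 0.2,
}


def bias_triggers(row: Dict[str, Optional[float]], thr: Dict[str, float]) -> List[str]:
    """Return the names of the bias rules that a single table row violates."""
    triggers: List[str] = []
    smd = row.get("smd")
    if smd is not None and abs(smd) > thr["smd_threshold"]:
        triggers.append("smd")
    var_ratio = row.get("var_ratio")
    if var_ratio is not None and not (
        thr["var_ratio_low"] <= var_ratio <= thr["var_ratio_high"]
    ):
        triggers.append("var_ratio")
    ks_p = row.get("ks_p")
    if ks_p is not None and ks_p < thr["ks_alpha"]:
        triggers.append("ks")
    psi = row.get("psi")
    if psi is not None and psi > thr["psi_threshold"]:
        triggers.append("psi")
    cramers_v = row.get("cramers_v")
    if cramers_v is not None and cramers_v > thr["cramer_threshold"]:
        triggers.append("cramers_v")
    return triggers


numeric_row = {"smd": 0.3, "var_ratio": 1.0, "ks_p": 0.2}
categorical_row = {"psi": 0.25, "cramers_v": 0.1}
numeric_hits = bias_triggers(numeric_row, THRESHOLDS)
categorical_hits = bias_triggers(categorical_row, THRESHOLDS)
```

Missing metrics are skipped rather than treated as zero, so a numeric column is never flagged on the categorical rules and vice versa.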

Class diagram for the updated imputation configuration data structure

classDiagram
    class ImputationConfig {
      +strategy: str
      +params: dict
      +per_column: dict
      +tuning: dict
    }
    class PerColumnOverride {
      +strategy: str
      +params: dict
    }
    class TuningConfig {
      +enable: bool
      +mask_fraction: float
      +scoring: str
      +max_cells: int
      +random_state: int
      +grid: dict
    }
    ImputationConfig --> PerColumnOverride : per_column
    ImputationConfig --> TuningConfig : tuning
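
The class diagram maps naturally onto plain dataclasses. The sketch below mirrors the field names from the diagram; the default values are assumptions, and the project itself may store this structure as a plain dict rather than as classes.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class PerColumnOverride:
    strategy: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TuningConfig:
    # Defaults below are placeholders, not values taken from the PR.
    enable: bool = False
    mask_fraction: float = 0.1
    scoring: str = "MAE"
    max_cells: int = 20000
    random_state: int = 42
    grid: Dict[str, List[Any]] = field(default_factory=dict)


@dataclass
class ImputationConfig:
    strategy: str
    params: Dict[str, Any] = field(default_factory=dict)
    per_column: Dict[str, PerColumnOverride] = field(default_factory=dict)
    tuning: TuningConfig = field(default_factory=TuningConfig)


config = ImputationConfig(
    strategy="knn",
    params={"n_neighbors": 5, "weights": "uniform"},
    per_column={"age": PerColumnOverride(strategy="median")},
    tuning=TuningConfig(enable=True, grid={"n_neighbors": [3, 5, 7]}),
)
# asdict recurses into nested dataclasses, including those stored as dict values.
config_dict = asdict(config)
```

Converting with `asdict` yields a nested dict with the `strategy`, `params`, `per_column`, and `tuning` keys, which is the shape the sequence diagram describes being written on save.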

Class diagram for quality metrics normalization and selection

classDiagram
    class QualityMetricsConfig {
      +imputation_bias: dict
      +imputation_stability: dict
      +redundancy: dict
      +accuracy: dict
      +traceability: dict
      +timeliness: dict
    }
    class QualityMetricsSelection {
      +options: list
      +selected: list
    }
    QualityMetricsConfig <|-- QualityMetricsSelection
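
A possible shape for the normalization step is sketched below. The supported metric names are taken from the class diagram; the function name and the rule that a nested dict without an `enable` key counts as enabled are assumptions, not confirmed behavior of `src/phenoqc/gui/views.py`.

```python
from typing import Any, List

# Metric names as they appear in the QualityMetricsConfig diagram.
SUPPORTED_METRICS = (
    "imputation_bias",
    "imputation_stability",
    "redundancy",
    "accuracy",
    "traceability",
    "timeliness",
)


def normalize_quality_metrics(cfg: Any) -> List[str]:
    """Turn a list- or dict-style config into a list of supported, enabled metrics."""
    if cfg is None:
        return []
    if isinstance(cfg, str):
        cfg = [cfg]
    if isinstance(cfg, dict):
        selected = []
        for name, value in cfg.items():
            if isinstance(value, dict):
                # Assumption: a nested dict is enabled unless it says otherwise.
                enabled = bool(value.get("enable", True))
            else:
                enabled = bool(value)
            if enabled:
                selected.append(name)
    else:
        selected = list(cfg)
    # Drop anything the widget does not know how to display.
    return [name for name in selected if name in SUPPORTED_METRICS]


from_list = normalize_quality_metrics(["accuracy", "bogus"])
from_dict = normalize_quality_metrics(
    {
        "imputation_bias": {"enable": True, "psi_threshold": 0.1},
        "redundancy": {"enable": False},
        "timeliness": True,
        "unknown": True,
    }
)
```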

File-Level Changes

Change: Revamped imputation configuration panel to a generic schema-driven implementation
Details:
  • Introduced PARAM_SPECS mapping for strategy parameters
  • Implemented _render_params helper for dynamic widget rendering
  • Replaced manual per-column expanders with st.data_editor for overrides
  • Consolidated global strategy, params, per-column overrides, and tuning into unified config persistence
Files:
  • src/phenoqc/gui/gui.py

Change: Enhanced quality metrics widget to normalize dict/list inputs
Details:
  • Support list- and dict-style metric configs
  • Filter out unknown keys and handle nested enable flags
Files:
  • src/phenoqc/gui/views.py

Change: Expanded reporting bias diagnostics with categorical thresholds
Details:
  • Appended PSI and Cramér's V thresholds to warning rules text
  • Added trigger logic for psi and cramers_v values in bias tables
Files:
  • src/phenoqc/reporting.py

Change: Added CLI alias and updated usage documentation
Details:
  • Introduced --metrics as alias for --quality-metrics
  • Updated README and usage.rst to reflect new alias and threshold options
Files:
  • src/phenoqc/cli.py
  • README.md
  • docs/source/usage.rst

Change: Enabled lazy GUI import to avoid heavy dependencies at package import
Details:
  • Wrapped gui.main in lazy import stub in __init__.py
Files:
  • src/phenoqc/gui/__init__.py

Change: Added comprehensive clinical end-to-end testing script
Details:
  • Script generates dataset, schema, config, and custom mapping
  • Runs CLI in online/offline modes and verifies outputs
  • Includes sample config and mapping files under scripts/config
Files:
  • scripts/clinical_all_features_e2e.py
  • scripts/config/clinical_all_features_config.yaml
  • scripts/config/clinical_all_features_schema.json
  • scripts/config/clinical_all_features_custom_mapping.json
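
The lazy GUI import can be illustrated with a small generic stub. This is a sketch of the general technique, not the code in `src/phenoqc/gui/__init__.py`: `lazy_entrypoint` is a hypothetical helper, and the standard-library `json` module stands in for the heavy GUI module so the example runs anywhere.

```python
import importlib
from typing import Any, Callable


def lazy_entrypoint(module_name: str, attr: str) -> Callable[..., Any]:
    """Return a callable that imports `module_name` only when first invoked."""

    def _call(*args: Any, **kwargs: Any) -> Any:
        # The import happens here, at call time, not when the package is imported.
        module = importlib.import_module(module_name)
        return getattr(module, attr)(*args, **kwargs)

    return _call


# Stand-in for something like a GUI `main`; `json.dumps` is used purely for illustration.
dumps = lazy_entrypoint("json", "dumps")
result = dumps({"a": 1})
```

Because the target module is resolved inside `_call`, importing the package that defines the stub does not pull in the heavy dependency; only running the entrypoint does.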

@codecov-commenter

codecov-commenter commented Aug 14, 2025


⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 3.09278% with 94 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/phenoqc/gui/gui.py 0.00% 64 Missing ⚠️
src/phenoqc/reporting.py 0.00% 16 Missing ⚠️
src/phenoqc/gui/views.py 14.28% 10 Missing and 2 partials ⚠️
src/phenoqc/gui/__init__.py 33.33% 2 Missing ⚠️
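
The headline figure can be reproduced from the per-file rows. In the sketch below the uncovered counts (missing plus partial lines) come straight from the table, while the covered counts are inferred from the reported percentages and are therefore an assumption.

```python
# (covered, uncovered) changed lines per file.
# "uncovered" = missing + partial lines, as listed in the table.
# "covered" is inferred from each file's patch percentage, not reported directly.
patch = {
    "src/phenoqc/gui/gui.py": (0, 64),       # 0.00%
    "src/phenoqc/reporting.py": (0, 16),     # 0.00%
    "src/phenoqc/gui/views.py": (2, 12),     # 2 of 14 lines, shown as 14.28%
    "src/phenoqc/gui/__init__.py": (1, 2),   # 1 of 3 lines, shown as 33.33%
}

covered = sum(c for c, _ in patch.values())
uncovered = sum(u for _, u in patch.values())
total = covered + uncovered
patch_pct = 100 * covered / total
```

Under that inference, 3 of 97 changed lines are covered, which matches the reported 3.09278% patch coverage and the 94 lines flagged as missing coverage.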
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Files with missing lines Coverage Δ
src/phenoqc/cli.py 45.25% <ø> (ø)
src/phenoqc/gui/__init__.py 50.00% <33.33%> (-50.00%) ⬇️
src/phenoqc/gui/views.py 46.15% <14.28%> (-38.47%) ⬇️
src/phenoqc/reporting.py 38.72% <0.00%> (-1.02%) ⬇️
src/phenoqc/gui/gui.py 0.00% <0.00%> (-3.76%) ⬇️

... and 1 file with indirect coverage changes

@sourcery-ai sourcery-ai Bot left a comment

Contributor

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'. (link)

General comments:

  • Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
  • Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
  • Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
- Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
- Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.

## Individual Comments

### Comment 1
<location> `src/phenoqc/gui/gui.py:733` </location>
<code_context>
+        def _render_params(spec: dict, initial: Optional[dict] = None) -> dict:
</code_context>

<issue_to_address>
Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.

Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.

Suggested implementation:

```python
        def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
            initial = initial or {}
            context = context or "default"
            values: dict = {}
            for name, meta in spec.items():
                w = meta["widget"]
                key = f"{w}_{context}_{name}"
                if w == "int":
                    values[name] = st.number_input(
                        name,
                        value=int(initial.get(name, meta.get("default", 0))),
                        min_value=int(meta.get("min", -10000)),
                        max_value=int(meta.get("max", 10000)),
                        key=key,

```

1. For other widget types (e.g., "float", "selectbox", etc.) inside `_render_params`, update their calls to also use the unique `key=key` argument.
2. Wherever `_render_params` is called elsewhere in your codebase, you must now pass a suitable `context` string (e.g., strategy name, column name, etc.) to ensure uniqueness.
</issue_to_address>

### Comment 2
<location> `src/phenoqc/gui/gui.py:709` </location>
<code_context>
+            if not col:
+                continue
+            col_strategy = row.get("strategy") or strategy
             try:
-                grid_vals = [int(x.strip()) for x in grid_n.split(',') if x.strip()]
+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception:
-                grid_vals = [3, 5, 7]
-            tuning_cfg = {
-                'enable': True,
-                'mask_fraction': float(mask_fraction),
-                'scoring': scoring,
-                'max_cells': int(max_cells),
-                'random_state': int(random_state),
-                'grid': {'n_neighbors': grid_vals}
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}
+
</code_context>

<issue_to_address>
Silent fallback to empty dict on JSON parse error may hide user mistakes.

Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.

Suggested implementation:

```python
+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception as e:
+                import logging
+                logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}

```

If the file does not already import the `logging` module at the top, you should add:
```python
import logging
```
at the top of the file.
</issue_to_address>

## Security Issues

### Issue 1
<location> `scripts/clinical_all_features_e2e.py:326` </location>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'.

*Source: opengrep*
</issue_to_address>

Comment thread src/phenoqc/gui/gui.py
Comment on lines +733 to +742
def _render_params(spec: dict, initial: Optional[dict] = None) -> dict:
    initial = initial or {}
    values: dict = {}
    for name, meta in spec.items():
        w = meta["widget"]
        if w == "int":
            values[name] = st.number_input(
                name,
                value=int(initial.get(name, meta.get("default", 0))),
                min_value=int(meta.get("min", -10000)),

suggestion (bug_risk): Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.

Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.

Suggested implementation:

        def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
            initial = initial or {}
            context = context or "default"
            values: dict = {}
            for name, meta in spec.items():
                w = meta["widget"]
                key = f"{w}_{context}_{name}"
                if w == "int":
                    values[name] = st.number_input(
                        name,
                        value=int(initial.get(name, meta.get("default", 0))),
                        min_value=int(meta.get("min", -10000)),
                        max_value=int(meta.get("max", 10000)),
                        key=key,
  1. For other widget types (e.g., "float", "selectbox", etc.) inside _render_params, update their calls to also use the unique key=key argument.
  2. Wherever _render_params is called elsewhere in your codebase, you must now pass a suitable context string (e.g., strategy name, column name, etc.) to ensure uniqueness.
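
The two remedies mentioned in the review (namespaced keys, or clearing state when the strategy changes) can be sketched without Streamlit. Here a plain dict stands in for `st.session_state`, and `widget_key` and `clear_context` are hypothetical helpers rather than functions from the PR; the key format follows the one in the suggestion above.

```python
from typing import Any, Dict, MutableMapping


def widget_key(widget: str, context: str, name: str) -> str:
    """Build a widget key that is unique per (widget type, context, parameter)."""
    return f"{widget}_{context}_{name}"


def clear_context(state: MutableMapping[str, Any], spec: Dict[str, dict], context: str) -> int:
    """Delete the stored values for every parameter in `spec` under `context`."""
    removed = 0
    for name, meta in spec.items():
        key = widget_key(meta["widget"], context, name)
        if key in state:
            del state[key]
            removed += 1
    return removed


knn_spec = {"n_neighbors": {"widget": "int", "default": 5, "min": 1, "max": 100}}

# A plain dict stands in for st.session_state in this sketch.
state = {
    widget_key("int", "global_knn", "n_neighbors"): 5,
    widget_key("int", "column_age", "n_neighbors"): 3,
}
removed = clear_context(state, knn_spec, "global_knn")
```

The same parameter name now maps to two distinct keys, so the global setting and a per-column override no longer overwrite each other, and switching strategy can drop only the keys that belong to the old context.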

Comment thread src/phenoqc/gui/gui.py
Comment on lines +709 to +718

# Parameter specs per strategy
PARAM_SPECS = {
    "none": {},
    "mean": {},
    "median": {},
    "mode": {},
    "knn": {
        "n_neighbors": {"widget": "int", "default": 5, "min": 1, "max": 100},
        "weights": {"widget": "select", "options": ["uniform", "distance"], "default": "uniform"},

suggestion (bug_risk): Silent fallback to empty dict on JSON parse error may hide user mistakes.

Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.

Suggested implementation:

+                col_params = json.loads(row.get("params") or "{}")
+                if not isinstance(col_params, dict):
+                    raise ValueError("params must be a JSON object")
             except Exception as e:
+                import logging
+                logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+                col_params = {}
+            per_column[col] = {"strategy": col_strategy, "params": col_params}

If the file does not already import the logging module at the top, you should add:

import logging

at the top of the file.
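
An alternative to logging inside the `except` branch is to return the problem to the caller, which can then show it in the UI (for example with Streamlit's `st.warning`). The helper below is a hypothetical sketch of that idea, not code from the PR.

```python
import json
from typing import Any, Dict, Optional, Tuple


def parse_params(raw: Optional[str]) -> Tuple[Dict[str, Any], Optional[str]]:
    """Parse a per-column params cell; return (params, error_message_or_None)."""
    if raw is None or not raw.strip():
        # An empty cell means "no overrides", which is not an error.
        return {}, None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return {}, "params must be a JSON object"
    return parsed, None


ok_params, ok_error = parse_params('{"n_neighbors": 3}')
_, list_error = parse_params("[1, 2]")
_, syntax_error = parse_params("{bad")
```

Keeping the parser free of UI calls makes it easy to unit test, and the caller decides whether a non-empty error should block saving or only produce a warning next to the offending column.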

env = os.environ.copy()
env["PYTHONPATH"] = SRC_PATH + (os.pathsep + env.get("PYTHONPATH", ""))
print("[INFO] Running:", " ".join(cmd))
proc = subprocess.run(cmd, capture_output=True, text=True, env=env)

security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'.

Source: opengrep
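
For context on this finding: when `subprocess.run` receives a list and `shell` is left at its default of `False`, no shell parses the arguments, so shell metacharacters in an argument are passed through literally. The sketch below illustrates that behavior with a hypothetical `run_cli` helper and a throwaway child process; it is not the script's code, and it does not replace auditing where `cmd` comes from.

```python
import shlex
import subprocess
import sys
from typing import List


def run_cli(cmd: List[str]) -> subprocess.CompletedProcess:
    # shlex.join quotes each argument, so the logged line is copy-paste safe.
    print("[INFO] Running:", shlex.join(cmd))
    # A list with the default shell=False is executed directly, without a shell.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


# The last argument contains shell metacharacters; it reaches the child unchanged.
proc = run_cli(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "a b; echo injected"]
)
```

Logging with `shlex.join(cmd)` instead of `" ".join(cmd)` also keeps arguments that contain spaces unambiguous in the printed command.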

@jorgeMFS jorgeMFS merged commit a901959 into main Aug 14, 2025
4 checks passed
2 participants