Feat/imputation improvements#78
Conversation
…trics alias; report: respect dict-style enable flags for additional quality tables
…; Reporting: add categorical PSI/Cramér’s V to bias rule header and per-row triggers
…ation, full schema/config/custom mappings, online+offline runs, per-column imputation coverage, fallback output discovery)
Reviewer's GuideThis PR overhauls the imputation configuration UI to a strategy-agnostic, schema-driven panel with unified parameter rendering and per-column overrides, extends quality metrics normalization in both GUI and reporting, adds support for categorical bias thresholds, updates CLI aliases and documentation, and introduces a comprehensive clinical end-to-end testing script. Sequence diagram for the new imputation configuration flow in the GUIsequenceDiagram
actor User
participant StreamlitUI
participant ConfigDict
User->>StreamlitUI: Open imputation configuration panel
StreamlitUI->>User: Show global strategy selectbox
User->>StreamlitUI: Select strategy and set parameters
StreamlitUI->>User: Show per-column overrides table
User->>StreamlitUI: Edit per-column strategies and params
StreamlitUI->>User: Open tuning expander and set tuning params
User->>StreamlitUI: Save configuration
StreamlitUI->>ConfigDict: Update imputation config with strategy, params, per_column, tuning
StreamlitUI->>User: Show success message
Entity relationship diagram for imputation bias thresholdserDiagram
IMPUTATION_BIAS {
smd_threshold float
var_ratio_low float
var_ratio_high float
ks_alpha float
psi_threshold float
cramer_threshold float
}
QUALITY_METRICS {
imputation_bias dict
imputation_stability dict
}
QUALITY_METRICS ||--|{ IMPUTATION_BIAS : contains
Class diagram for the updated imputation configuration data structureclassDiagram
class ImputationConfig {
+strategy: str
+params: dict
+per_column: dict
+tuning: dict
}
class PerColumnOverride {
+strategy: str
+params: dict
}
class TuningConfig {
+enable: bool
+mask_fraction: float
+scoring: str
+max_cells: int
+random_state: int
+grid: dict
}
ImputationConfig --> PerColumnOverride : per_column
ImputationConfig --> TuningConfig : tuning
Class diagram for quality metrics normalization and selectionclassDiagram
class QualityMetricsConfig {
+imputation_bias: dict
+imputation_stability: dict
+redundancy: dict
+accuracy: dict
+traceability: dict
+timeliness: dict
}
class QualityMetricsSelection {
+options: list
+selected: list
}
QualityMetricsConfig <|-- QualityMetricsSelection
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Codecov Report❌ Patch coverage is
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
- Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
- Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the PARAM_SPECS and _render_params logic into a shared module or utility so it’s easier to test, reuse, and keep the GUI code focused on rendering.
- Namespace Streamlit widget keys (e.g. int_{name}, sel_{name}, etc.) by strategy or clear relevant session_state on strategy changes to avoid key collisions when switching contexts.
- Remove the hard‐coded absolute ontology file paths in scripts/config/clinical_all_features_config.yaml or convert them to relative/configurable paths to avoid environment‐specific breaks.
## Individual Comments
### Comment 1
<location> `src/phenoqc/gui/gui.py:733` </location>
<code_context>
+ def _render_params(spec: dict, initial: Optional[dict] = None) -> dict:
</code_context>
<issue_to_address>
Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.
Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.
Suggested implementation:
```python
def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
initial = initial or {}
context = context or "default"
values: dict = {}
for name, meta in spec.items():
w = meta["widget"]
key = f"{w}_{context}_{name}"
if w == "int":
values[name] = st.number_input(
name,
value=int(initial.get(name, meta.get("default", 0))),
min_value=int(meta.get("min", -10000)),
max_value=int(meta.get("max", 10000)),
key=key,
```
1. For other widget types (e.g., "float", "selectbox", etc.) inside `_render_params`, update their calls to also use the unique `key=key` argument.
2. Wherever `_render_params` is called elsewhere in your codebase, you must now pass a suitable `context` string (e.g., strategy name, column name, etc.) to ensure uniqueness.
</issue_to_address>
### Comment 2
<location> `src/phenoqc/gui/gui.py:709` </location>
<code_context>
+ if not col:
+ continue
+ col_strategy = row.get("strategy") or strategy
try:
- grid_vals = [int(x.strip()) for x in grid_n.split(',') if x.strip()]
+ col_params = json.loads(row.get("params") or "{}")
+ if not isinstance(col_params, dict):
+ raise ValueError("params must be a JSON object")
except Exception:
- grid_vals = [3, 5, 7]
- tuning_cfg = {
- 'enable': True,
- 'mask_fraction': float(mask_fraction),
- 'scoring': scoring,
- 'max_cells': int(max_cells),
- 'random_state': int(random_state),
- 'grid': {'n_neighbors': grid_vals}
+ col_params = {}
+ per_column[col] = {"strategy": col_strategy, "params": col_params}
+
</code_context>
<issue_to_address>
Silent fallback to empty dict on JSON parse error may hide user mistakes.
Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.
Suggested implementation:
```python
+ col_params = json.loads(row.get("params") or "{}")
+ if not isinstance(col_params, dict):
+ raise ValueError("params must be a JSON object")
except Exception as e:
+ import logging
+ logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+ col_params = {}
+ per_column[col] = {"strategy": col_strategy, "params": col_params}
```
If the file does not already import the `logging` module at the top, you should add:
```python
import logging
```
at the top of the file.
</issue_to_address>
## Security Issues
### Issue 1
<location> `scripts/clinical_all_features_e2e.py:326` </location>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def _render_params(spec: dict, initial: Optional[dict] = None) -> dict: | ||
| initial = initial or {} | ||
| values: dict = {} | ||
| for name, meta in spec.items(): | ||
| w = meta["widget"] | ||
| if w == "int": | ||
| values[name] = st.number_input( | ||
| name, | ||
| value=int(initial.get(name, meta.get("default", 0))), | ||
| min_value=int(meta.get("min", -10000)), |
There was a problem hiding this comment.
suggestion (bug_risk): Parameter rendering function uses static keys for Streamlit widgets, which may cause key collisions.
Keys like f"int_{name}" are reused across contexts, risking collisions if parameter names overlap. Please add context (e.g., strategy or column name) to each key for uniqueness.
Suggested implementation:
def _render_params(spec: dict, initial: Optional[dict] = None, context: Optional[str] = None) -> dict:
initial = initial or {}
context = context or "default"
values: dict = {}
for name, meta in spec.items():
w = meta["widget"]
key = f"{w}_{context}_{name}"
if w == "int":
values[name] = st.number_input(
name,
value=int(initial.get(name, meta.get("default", 0))),
min_value=int(meta.get("min", -10000)),
max_value=int(meta.get("max", 10000)),
key=key,- For other widget types (e.g., "float", "selectbox", etc.) inside
_render_params, update their calls to also use the uniquekey=keyargument. - Wherever
_render_paramsis called elsewhere in your codebase, you must now pass a suitablecontextstring (e.g., strategy name, column name, etc.) to ensure uniqueness.
|
|
||
| # Parameter specs per strategy | ||
| PARAM_SPECS = { | ||
| "none": {}, | ||
| "mean": {}, | ||
| "median": {}, | ||
| "mode": {}, | ||
| "knn": { | ||
| "n_neighbors": {"widget": "int", "default": 5, "min": 1, "max": 100}, | ||
| "weights": {"widget": "select", "options": ["uniform", "distance"], "default": "uniform"}, |
There was a problem hiding this comment.
suggestion (bug_risk): Silent fallback to empty dict on JSON parse error may hide user mistakes.
Instead of defaulting to an empty dict, display a warning or error when JSON parsing fails to ensure users are notified of invalid input.
Suggested implementation:
+ col_params = json.loads(row.get("params") or "{}")
+ if not isinstance(col_params, dict):
+ raise ValueError("params must be a JSON object")
except Exception as e:
+ import logging
+ logging.warning(f"Failed to parse JSON for column '{col}': {e}. Please check your input.")
+ col_params = {}
+ per_column[col] = {"strategy": col_strategy, "params": col_params}If the file does not already import the logging module at the top, you should add:
import loggingat the top of the file.
| env = os.environ.copy() | ||
| env["PYTHONPATH"] = SRC_PATH + (os.pathsep + env.get("PYTHONPATH", "")) | ||
| print("[INFO] Running:", " ".join(cmd)) | ||
| proc = subprocess.run(cmd, capture_output=True, text=True, env=env) |
There was a problem hiding this comment.
security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
Source: opengrep
Summary by Sourcery
Revamp imputation configuration in the GUI to a flexible, parameter-driven interface, enhance quality metrics normalization and reporting, add CLI usability improvements, and introduce a full end-to-end clinical features script with accompanying example configs.
New Features:
Enhancements:
--metricsalias for the--quality-metricsCLI optionDocumentation:
--metricsalias, new imputation bias thresholds, and expanded example configuration sectionsChores: