Feat/imputation improvements by jorgeMFS · Pull Request #77 · jorgeMFS/PhenoQC

jorgeMFS · 2025-08-10T23:11:11Z

Summary by Sourcery

Add comprehensive imputation improvements including repeatability diagnostics, protected-column exclusion, and configurable redundancy metrics, and integrate them into CLI, config, GUI, and reporting with accompanying documentation and tests.

New Features:

Introduce imputation stability diagnostic via repeated mask-and-impute cross-validation
Allow users to specify protected columns to exclude from imputation and tuning to prevent label leakage
Enable redundancy detection using configurable correlation threshold and method (pearson or spearman)

Enhancements:

Extend imputation bias report with Wasserstein distance, low sample flag, and unified threshold merging from CLI and config
Merge tuning and diagnostics parameters from engine, config, and CLI with sensible defaults including automatic random_state
Merge CLI overrides for diagnostics thresholds and redundancy settings with YAML config, giving priority to flags

Documentation:

Update README, config schema, and GUI instructions to document new CLI flags, YAML imputation block, diagnostics settings, protected columns, and redundancy parameters

Tests:

Add unit tests for imputation stability CV, bias report edge cases, protected columns exclusion, redundancy config, CLI parsing, GUI state, and report generation
Include an end-to-end diagnostics demo script to validate pipeline behavior and reproducibility

…eakage via protected columns; make redundancy method/threshold configurable; surface imputation settings (incl. random_state) in reports; CLI flags and docs; add complex e2e diagnostics demo; tests for bias/stability

…redundancy settings; render stability table in results

…ng, and report imputation settings

…ate integration

sourcery-ai · 2025-08-10T23:11:16Z

Reviewer's Guide

This PR extends the imputation workflow by adding a repeatable stability diagnostic, protected-column exclusions, and configurable redundancy metrics. It refactors the imputation summary builder to merge engine tuning details with user config, exposes new CLI and GUI controls for diagnostics, protected columns, and redundancy parameters, and integrates stability/bias results into both PDF and Markdown reports. Comprehensive tests and an end-to-end demo script ensure coverage and reproducibility.

Sequence diagram for CLI/GUI configuration and imputation workflow with diagnostics

sequenceDiagram
    actor User
    participant CLI
    participant GUI
    participant BatchProcessor
    participant ImputationEngine
    participant QualityMetrics
    User->>CLI: Provide CLI args (impute, diagnostics, protected columns, redundancy)
    CLI->>BatchProcessor: Pass config and CLI overrides
    User->>GUI: Set config via GUI (strategy, diagnostics, protected columns, redundancy)
    GUI->>BatchProcessor: Pass merged config
    BatchProcessor->>ImputationEngine: Build engine with protected columns excluded
    BatchProcessor->>QualityMetrics: Run bias and stability diagnostics
    QualityMetrics->>BatchProcessor: Return diagnostics results
    BatchProcessor->>Report: Integrate diagnostics into PDF/Markdown

ER diagram for protected columns and redundancy config in YAML

erDiagram
    CONFIG ||--o{ IMPUTATION : contains
    CONFIG ||--o{ QUALITY_METRICS : contains
    CONFIG ||--o{ PROTECTED_COLUMNS : contains
    CONFIG ||--o{ REDUNDANCY : contains
    IMPUTATION {
      string strategy
      dict params
      dict tuning
      dict per_column
    }
    QUALITY_METRICS {
      dict imputation_bias
      dict imputation_stability
      dict redundancy
    }
    PROTECTED_COLUMNS {
      string[] columns
    }
    REDUNDANCY {
      float threshold
      string method
    }

Class diagram for ImputationEngine and new diagnostics

classDiagram
    class ImputationEngine {
        - cfg: dict
        - exclude_columns: set
        + __init__(cfg, exclude_columns)
        + fit_transform(df)
    }
    class imputation_bias_report {
        + imputation_bias_report(original_df, imputed_df, imputation_mask, smd_threshold, var_ratio_low, var_ratio_high, ks_alpha)
    }
    class imputation_stability_cv {
        + imputation_stability_cv(df, strategy, params, repeats, mask_fraction, scoring, random_state, columns)
    }
    ImputationEngine --> imputation_bias_report : uses
    ImputationEngine --> imputation_stability_cv : uses

File-Level Changes

Change	Details	Files
Add imputation stability diagnostic	Implemented imputation_stability_cv to perform repeated mask-and-impute and compute per-column CV errors Integrated stability DataFrame into batch_processing._append_metric alongside bias diagnostics Exposed CLI flags (--impute-diagnostics, --diag-repeats, --diag-mask-fraction, --diag-scoring) and GUI controls Rendered stability sections in PDF/Markdown reports and in the Streamlit GUI quality metrics pane Created unit tests (test_stability_cv.py, test_imputation_stability.py) and included in end-to-end demo	`src/phenoqc/quality_metrics.py` `src/phenoqc/batch_processing.py` `src/phenoqc/cli.py` `src/phenoqc/gui/gui.py` `src/phenoqc/reporting.py` `tests/test_stability_cv.py` `tests/test_imputation_stability.py` `scripts/end_to_end_diagnostics_demo.sh`
Introduce protected columns exclusion	Added --protected-columns flag with normalization logic in CLI Merged CLI and config protected_columns with CLI taking precedence in batch_process Excluded protected columns from imputation feature matrix and tuning in process_file Emitted warnings in ImputationEngine when protected columns are configured for imputation Added GUI multiselect for protected columns and corresponding tests	`src/phenoqc/cli.py` `src/phenoqc/batch_processing.py` `src/phenoqc/missing_data.py` `src/phenoqc/gui/gui.py` `config.yaml` `tests/test_protected_columns.py`
Support configurable redundancy metrics	Extended detect_redundancy to accept a method (pearson or spearman) and dynamic threshold Merged redundancy_threshold/method CLI overrides into config in batch_process Applied redundancy settings in DataValidator._apply_quality_metrics using top-level or nested config Exposed redundancy controls in GUI and updated tests for both function and validation	`src/phenoqc/quality_metrics.py` `src/phenoqc/validation.py` `src/phenoqc/batch_processing.py` `src/phenoqc/gui/gui.py` `src/phenoqc/cli.py` `tests/test_redundancy_config.py`
Refactor imputation summary and tuning merge	Reworked _build_imputation_summary to merge engine.tuning_summary with selective config fields Ensured default random_state is set when tuning is enabled but not specified	`src/phenoqc/batch_processing.py`
Enhance CLI/GUI options and documentation	Added new CLI arguments for diagnostics, protected columns, and redundancy in cli.py Updated README.md and config.yaml to document new options and CLI vs. config precedence Adjusted GUI workflow steps to include stability diagnostics, protected columns, and redundancy	`src/phenoqc/cli.py` `src/phenoqc/gui/gui.py` `README.md` `config.yaml`
Integrate stability into reporting	Modified reporting.py to insert a combined "Imputation Stability & Bias" section Populated top variables by CV error in PDF and Markdown outputs	`src/phenoqc/reporting.py`
Add tests and end-to-end demo script	Expanded unit tests for bias report, protected columns, redundancy, stability, and report generation Included scripts/end_to_end_diagnostics_demo.sh to validate diagnostics pipeline and reproducibility	`tests/test_imputation_bias.py` `tests/test_cli_protected_columns.py` `tests/test_imputation_settings_report.py` `tests/test_protected_columns.py` `tests/test_redundancy_config.py` `tests/test_stability_cv.py` `tests/test_imputation_stability.py` `scripts/end_to_end_diagnostics_demo.sh`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey @jorgeMFS - I've reviewed your changes - here's some feedback:

A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
- The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
- This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.

## Individual Comments

### Comment 1
<location> `src/phenoqc/batch_processing.py:907` </location>
<code_context>
-                    'var_ratio_low': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_low', 0.5)) if isinstance(metrics_cfg, dict) else 0.5,
-                    'var_ratio_high': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_high', 2.0)) if isinstance(metrics_cfg, dict) else 2.0,
-                    'ks_alpha': float(metrics_cfg.get('imputation_bias', {}).get('ks_alpha', 0.05)) if isinstance(metrics_cfg, dict) else 0.05,
+                stability_diagnostics=stability_df,
+                bias_thresholds=merged_bias_thresholds if 'merged_bias_thresholds' in locals() else {
+                    'smd_threshold': float(bias_smd_threshold),
</code_context>

<issue_to_address>
stability_diagnostics is added to the report, but may be None if not run.

Consider returning an empty DataFrame instead of None to ensure consistent output and simplify downstream handling.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                stability_diagnostics=stability_df,
=======
                stability_diagnostics=stability_df if stability_df is not None else pd.DataFrame(),
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `src/phenoqc/quality_metrics.py:70` </location>
<code_context>


-def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
+def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:
     """Detect highly correlated or identical columns.

</code_context>

<issue_to_address>
Correlation method for redundancy detection is now configurable.

If an invalid method is provided, the code defaults to 'pearson' without notification. Please add a warning or message to inform users when this fallback occurs.
</issue_to_address>

### Comment 3
<location> `src/phenoqc/quality_metrics.py:343` </location>
<code_context>
             if len(obs) == 0 or len(imp) == 0:
                 continue

+            low_n = bool(len(obs) < 10 or len(imp) < 10)
             if len(obs) < 3 or len(imp) < 3:
                 ks_stat = None
</code_context>

<issue_to_address>
Auxiliary fields wdist and low_n are appended to the row dicts.

Modifying __dict__ directly is fragile and may cause issues if the row type changes. Instead, convert the dataclass instance to a dict before adding extra fields.

Suggested implementation:

```python
            # Optional Wasserstein distance for distributional shift (not part of warn thresholds)
            try:
                wdist = float(stats.wasserstein_distance(obs.values, imp.values)) if len(obs) and len(imp) else np.nan
            except Exception:
                wdist = np.nan

            # Convert row dataclass to dict and append auxiliary fields
            import dataclasses
            row_dict = dataclasses.asdict(row)
            row_dict["wdist"] = wdist
            row_dict["low_n"] = low_n
            # Now use row_dict instead of modifying row.__dict__

```

You will need to replace any code that appends or processes `row` (as a dataclass) with `row_dict` (as a dict) in the rest of this function or loop. For example, if you have `rows.append(row)`, change it to `rows.append(row_dict)`. If you need to preserve the dataclass elsewhere, consider keeping both representations.
</issue_to_address>

### Comment 4
<location> `src/phenoqc/missing_data.py:204` </location>
<code_context>
         self.cfg = cfg or {}
         self.exclude_columns = set(exclude_columns or [])
+        
+        # Warn if protected columns are requested as imputation targets
+        if exclude_columns:
+            per_column = cfg.get('per_column', {}) if cfg else {}
+            for col in exclude_columns:
+                if col in per_column:
+                    message = f"Protected column '{col}' is configured for imputation but will be excluded"
+                    logging.warning(message)
+                    warnings.warn(message, UserWarning)
+        
         self.chosen_params: dict = {}
</code_context>

<issue_to_address>
Warning is issued if protected columns are configured for imputation.

Consider using either logging.warning or warnings.warn, rather than both, to prevent duplicate notifications.
</issue_to_address>

### Comment 5
<location> `tests/test_imputation_stability.py:18` </location>
<code_context>
+            'g2': rng.normal(5, 2.0, n),
+        })
+
+    def test_stability_converges_with_repeats(self):
+        # With more repeats, the SD of MAE across repeats should be non-increasing on average
+        res_low = imputation_stability_cv(self.df, strategy='mean', repeats=3, mask_fraction=0.1, scoring='MAE', random_state=7)
+        res_high = imputation_stability_cv(self.df, strategy='mean', repeats=10, mask_fraction=0.1, scoring='MAE', random_state=7)
</code_context>

<issue_to_address>
Consider adding tests for edge cases such as empty DataFrames, non-numeric columns, and columns with all missing values.

Current tests only cover standard numeric data. Please add cases for empty DataFrames, non-numeric columns, and columns with all missing values to verify robust handling of these inputs.
</issue_to_address>

### Comment 6
<location> `tests/test_imputation_settings_report.py:6` </location>
<code_context>
+def test_report_includes_imputation_settings_pdf():
</code_context>

<issue_to_address>
Add assertions to verify that imputation tuning details are present in the generated PDF report.

The test should validate that the PDF contains the expected imputation tuning details, not just that the file is created. Use a PDF parser or string search to confirm these details are present.

Suggested implementation:

```python
import PyPDF2

def test_report_includes_imputation_settings_pdf():
    validation_results = {
        "Format Validation": True,
        "Duplicate Records": pd.DataFrame(),
        "Conflicting Records": pd.DataFrame(),
        "Integrity Issues": pd.DataFrame(),
        "Referential Integrity Issues": pd.DataFrame(),
        "Anomalies Detected": pd.DataFrame(),
        "Invalid Mask": pd.DataFrame(),
    }
    missing_data = pd.Series(dtype=int)

    # Example imputation settings
    imputation_settings = {
        "method": "KNN",
        "n_neighbors": 5,
        "weights": "uniform",
        "missing_values": "NaN"
    }

    # Generate PDF report in-memory
    pdf_buffer = io.BytesIO()
    generate_qc_report(
        validation_results=validation_results,
        missing_data=missing_data,
        imputation_settings=imputation_settings,
        output_format="pdf",
        output_stream=pdf_buffer
    )
    pdf_buffer.seek(0)

    # Parse PDF and extract text
    reader = PyPDF2.PdfReader(pdf_buffer)
    pdf_text = ""
    for page in reader.pages:
        pdf_text += page.extract_text()

    # Assert that imputation tuning details are present
    assert "Imputation Method: KNN" in pdf_text
    assert "Number of Neighbors: 5" in pdf_text
    assert "Weights: uniform" in pdf_text
    assert "Missing Values: NaN" in pdf_text

```

- Ensure that `generate_qc_report` supports the `output_stream` and `imputation_settings` arguments and that it writes the imputation settings in the expected format to the PDF.
- You may need to adjust the assertion strings to match the actual formatting in the generated PDF.
- Make sure `PyPDF2` is installed in your test environment.
</issue_to_address>

### Comment 7
<location> `tests/test_cli_protected_columns.py:5` </location>
<code_context>
+from phenoqc.cli import parse_arguments
+
+
+def test_cli_protected_columns_commas_and_spaces(monkeypatch):
+    argv = [
+        'phenoqc',
+        '--input', 'a.csv',
+        '--schema', 'schema.json',
+        '--unique_identifiers', 'id',
+        '--protected-columns', 'label,outcome', 'group'
+    ]
+    monkeypatch.setattr('sys.argv', argv)
+    args = parse_arguments()
+    assert args.protected_columns == ['label', 'outcome', 'group']
+
+
</code_context>

<issue_to_address>
Add tests for edge cases such as empty protected columns, duplicate entries, and mixed whitespace.

Please include tests for empty input, duplicate column names, and entries with extra whitespace to cover additional edge cases.
</issue_to_address>

### Comment 8
<location> `tests/test_imputation_bias.py:4` </location>
<code_context>
+- Imputation-bias thresholds: `--bias-smd-threshold`, `--bias-var-low`, `--bias-var-high`, `--bias-ks-alpha`
+- Imputation stability diagnostics: `--impute-diagnostics on|off`, `--diag-repeats`, `--diag-mask-fraction`, `--diag-scoring`
+- Redundancy tuning: `--redundancy-threshold`, `--redundancy-method {pearson,spearman}`

 ### Config (new `imputation:` block)

</code_context>

<issue_to_address>
No new tests added for imputation_bias_report edge cases; consider adding tests for low sample size, all missing, and non-numeric columns.

Please ensure tests cover columns with <10 observations, all missing values, and non-numeric data, verifying that expected fields and warnings are present in the output.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

…ample, protected columns & redundancy flags, and GUI highlights

…GUI sections

…o avoid __dict__ mutation; minor GUI nesting tweak; add edge-case tests for stability CV and bias; expand CLI protected columns tests

jorgeMFS · 2025-08-10T23:31:54Z

@sourcery-ai resolve

jorgeMFS · 2025-08-10T23:32:03Z

@sourcery-ai dismiss

jorgeMFS · 2025-08-10T23:32:27Z

@sourcery-ai review

sourcery-ai

Hey @jorgeMFS - I've reviewed your changes - here's some feedback:

Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
- Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
- The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.

## Individual Comments

### Comment 1
<location> `src/phenoqc/quality_metrics.py:345` </location>
<code_context>
             if len(obs) == 0 or len(imp) == 0:
                 continue

+            low_n = (len(obs) < 10 or len(imp) < 10)
             if len(obs) < 3 or len(imp) < 3:
                 ks_stat = None
</code_context>

<issue_to_address>
The 'low_n' flag is added to bias diagnostics but not documented.

Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.
</issue_to_address>

### Comment 2
<location> `src/phenoqc/quality_metrics.py:397` </location>
<code_context>
+def imputation_stability_cv(
</code_context>

<issue_to_address>
Imputation stability CV uses a fixed random_state for all repeats.

Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.

Suggested implementation:

```python
def imputation_stability_cv(
    df: pd.DataFrame,
    strategy: str,
    params: Optional[Dict[str, Any]] = None,
    repeats: int = 5,
    mask_fraction: float = 0.10,
    scoring: str = 'MAE',
    random_state: int = 42,
    columns: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Estimate imputation repeatability via repeated mask-and-impute.

```

```python
    for repeat in range(repeats):
        # Vary random_state for each repeat to ensure independent masking
        repeat_random_state = random_state + repeat if random_state is not None else None
        mask = create_mask(df, mask_fraction, random_state=repeat_random_state)
        imputed = impute(df, mask, strategy, params, columns)
        score = score_imputation(df, imputed, mask, scoring)
        scores.append(score)

```
</issue_to_address>

### Comment 3
<location> `src/phenoqc/batch_processing.py:1055` </location>
<code_context>
             tun = cfg_imp.setdefault('tuning', {})
             tun['enable'] = True

+    # Merge redundancy metric overrides (CLI > YAML)
+    try:
+        if redundancy_threshold is not None or redundancy_method is not None:
+            qm = config.setdefault('quality_metrics', [])
+            # Ensure redundancy metric is enabled in config if CLI tweaks are given
+            if isinstance(qm, list):
+                if 'redundancy' not in qm:
+                    qm.append('redundancy')
+            elif isinstance(qm, dict):
+                qm.setdefault('redundancy', {})
+            # Attach parameters in dict style for validator lookup
+            config.setdefault('redundancy', {})
+            if redundancy_threshold is not None:
+                config['redundancy']['threshold'] = float(redundancy_threshold)
+            if redundancy_method is not None:
+                config['redundancy']['method'] = str(redundancy_method)
+    except Exception:
+        pass
+
</code_context>

<issue_to_address>
Redundancy metric override logic may silently fail.

Instead of suppressing all exceptions, log any errors during the merge to improve traceability and debugging.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    try:
        if redundancy_threshold is not None or redundancy_method is not None:
            qm = config.setdefault('quality_metrics', [])
            # Ensure redundancy metric is enabled in config if CLI tweaks are given
            if isinstance(qm, list):
                if 'redundancy' not in qm:
                    qm.append('redundancy')
            elif isinstance(qm, dict):
                qm.setdefault('redundancy', {})
            # Attach parameters in dict style for validator lookup
            config.setdefault('redundancy', {})
            if redundancy_threshold is not None:
                config['redundancy']['threshold'] = float(redundancy_threshold)
            if redundancy_method is not None:
                config['redundancy']['method'] = str(redundancy_method)
    except Exception:
        pass
=======
    try:
        if redundancy_threshold is not None or redundancy_method is not None:
            qm = config.setdefault('quality_metrics', [])
            # Ensure redundancy metric is enabled in config if CLI tweaks are given
            if isinstance(qm, list):
                if 'redundancy' not in qm:
                    qm.append('redundancy')
            elif isinstance(qm, dict):
                qm.setdefault('redundancy', {})
            # Attach parameters in dict style for validator lookup
            config.setdefault('redundancy', {})
            if redundancy_threshold is not None:
                config['redundancy']['threshold'] = float(redundancy_threshold)
            if redundancy_method is not None:
                config['redundancy']['method'] = str(redundancy_method)
    except Exception as e:
        log_activity(f"[RedundancyOverride] Error merging redundancy metric overrides: {e}", level="error")
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `tests/test_protected_columns.py:102` </location>
<code_context>
+    def test_protected_columns_warning_when_configured(self):
</code_context>

<issue_to_address>
Consider adding a test for multiple protected columns configured for imputation.

Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.

Suggested implementation:

```python
    def test_protected_columns_warning_when_configured(self):
        """Test that warning is issued when protected columns are configured for imputation."""
        protected_cols = ['outcome']

        # Configure the protected column for imputation
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'}  # This should trigger a warning
            }
        }

    def test_multiple_protected_columns_warning_when_configured(self):
        """Test that warnings are issued for multiple protected columns configured for imputation."""
        protected_cols = ['outcome', 'gender', 'race']
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'},
                'gender': {'strategy': 'median'},
                'race': {'strategy': 'most_frequent'}
            }
        }
        # Assuming the imputation function is called `impute_data` and emits warnings
        import warnings
        from your_module import impute_data  # Replace with actual import

        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            impute_data(data=None, config=config, protected_cols=protected_cols)  # Replace data=None with actual test data

            # Check that a warning was issued for each protected column
            warning_msgs = [str(warning.message) for warning in w]
            for col in protected_cols:
                assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"

```

- Replace `from your_module import impute_data` with the actual import path for your imputation function.
- Replace `data=None` with appropriate test data as required by your imputation function.
- If your warning messages have a specific format, adjust the assertion to match the expected message.
</issue_to_address>

### Comment 5
<location> `tests/test_stability_cv.py:46` </location>
<code_context>
+    assert out.empty
+
+
+def test_stability_cv_all_missing_numeric():
+    df = pd.DataFrame({'x': [np.nan, np.nan, np.nan]})
+    out = imputation_stability_cv(df, strategy='mean')
+    assert isinstance(out, pd.DataFrame)
+    # No observed cells to mask => empty result
+    assert out.empty
+
+
</code_context>

<issue_to_address>
Missing test for non-numeric columns in imputation_stability_cv.

Add a test with a DataFrame containing only non-numeric columns to ensure imputation_stability_cv returns an empty DataFrame without raising errors.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH

def test_stability_cv_basic_and_columns():
=======
def test_stability_cv_non_numeric_only():
    # DataFrame with only non-numeric columns
    df = pd.DataFrame({'a': ['foo', 'bar', 'baz'], 'b': ['x', 'y', 'z']})
    out = imputation_stability_cv(df, strategy='mean')
    assert isinstance(out, pd.DataFrame)
    assert out.empty

def test_stability_cv_basic_and_columns():
>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/test_bias_report_edges.py:19` </location>
<code_context>
+        assert row.get("ks_p", None) in (None, np.nan)
+
+
+def test_bias_report_warn_logic_and_columns():
+    # Construct data where imputed mean differs markedly
+    rng = np.random.RandomState(0)
+    orig = pd.DataFrame({"a": rng.normal(0, 1, 200)})
+    imp = orig.copy()
+    # Simulate imputed values at masked positions shifted by +1.0
+    mask_positions = pd.Series([False] * 200)
+    mask_positions.iloc[:100] = True
+    imp.loc[mask_positions, "a"] = imp.loc[mask_positions, "a"] + 1.0
+    out = imputation_bias_report(original_df=orig, imputed_df=imp, imputation_mask={"a": mask_positions}, columns=["a"], smd_threshold=0.2)
+    assert not out.empty
+    assert set(["column","n_obs","n_imp","smd","var_ratio","ks_stat","ks_p","warn"]).issubset(out.columns)
+    assert bool(out["warn"].iloc[0]) is True
+
+
</code_context>

<issue_to_address>
Missing test for bias report with all columns having no missing values.

Add a test case with no missing values to verify that imputation_bias_report returns an empty DataFrame without errors.
</issue_to_address>

### Comment 7
<location> `tests/test_cli_protected_columns.py:18` </location>
<code_context>
+    assert args.protected_columns == ['label', 'outcome', 'group']
+
+
+def test_cli_protected_columns_duplicates_whitespace(monkeypatch):
+    argv = [
+        'phenoqc',
+        '--input', 'a.csv',
+        '--schema', 'schema.json',
+        '--unique_identifiers', 'id',
+        '--protected-columns', ' label , outcome ', 'label', '  outcome  ', ' '
+    ]
+    monkeypatch.setattr('sys.argv', argv)
+    args = parse_arguments()
+    # Normalization keeps order of appearance but strips empties
+    assert args.protected_columns == ['label', 'outcome', 'label', 'outcome']
+
+
</code_context>

<issue_to_address>
Consider adding a test for empty protected columns argument.

Please add a test case where --protected-columns is omitted or set to an empty string, verifying that the parser returns an empty list without raising an error.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-08-10T23:34:10Z

                (ks_p is not None and ks_p < ks_alpha)
            )

-            rows.append(
-                BiasRow(
-                    column=col,
-                    n_obs=len(obs),
-                    n_imp=len(imp),
-                    mean_diff=mean_diff,
-                    smd=float(smd) if not np.isnan(smd) else np.nan,
-                    var_ratio=(
-                        float(var_ratio) if not np.isnan(var_ratio) else np.nan
-                    ),
-                    ks_stat=float(ks_stat) if ks_stat is not None else None,
-                    ks_p=float(ks_p) if ks_p is not None else None,
-                    warn=warn,
-                )
+            row_obj = BiasRow(
+                column=col,
+                n_obs=len(obs),
+                n_imp=len(imp),
+                mean_diff=mean_diff,
+                smd=float(smd) if not np.isnan(smd) else np.nan,
+                var_ratio=(


question: The 'low_n' flag is added to bias diagnostics but not documented.

Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.

sourcery-ai · 2025-08-10T23:34:10Z

-    df_out = pd.DataFrame([r.__dict__ for r in rows])
+    # rows is a list of dictionaries after refactor above
+    df_out = pd.DataFrame(rows)
    if not df_out.empty:


suggestion: Imputation stability CV uses a fixed random_state for all repeats.

Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.

Suggested implementation:

def imputation_stability_cv( df: pd.DataFrame, strategy: str, params: Optional[Dict[str, Any]] = None, repeats: int = 5, mask_fraction: float = 0.10, scoring: str = 'MAE', random_state: int = 42, columns: Optional[List[str]] = None, ) -> pd.DataFrame: """Estimate imputation repeatability via repeated mask-and-impute.

for repeat in range(repeats): # Vary random_state for each repeat to ensure independent masking repeat_random_state = random_state + repeat if random_state is not None else None mask = create_mask(df, mask_fraction, random_state=repeat_random_state) imputed = impute(df, mask, strategy, params, columns) score = score_imputation(df, imputed, mask, scoring) scores.append(score)

sourcery-ai · 2025-08-10T23:34:10Z

+    def test_protected_columns_warning_when_configured(self):
+        """Test that warning is issued when protected columns are configured for imputation."""
+        protected_cols = ['outcome']
+
+        # Configure the protected column for imputation
+        config = {
+            'strategy': 'mean',
+            'per_column': {
+                'outcome': {'strategy': 'mode'}  # This should trigger a warning
+            }


suggestion (testing): Consider adding a test for multiple protected columns configured for imputation.

Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.

Suggested implementation:

def test_protected_columns_warning_when_configured(self): """Test that warning is issued when protected columns are configured for imputation.""" protected_cols = ['outcome'] # Configure the protected column for imputation config = { 'strategy': 'mean', 'per_column': { 'outcome': {'strategy': 'mode'} # This should trigger a warning } } def test_multiple_protected_columns_warning_when_configured(self): """Test that warnings are issued for multiple protected columns configured for imputation.""" protected_cols = ['outcome', 'gender', 'race'] config = { 'strategy': 'mean', 'per_column': { 'outcome': {'strategy': 'mode'}, 'gender': {'strategy': 'median'}, 'race': {'strategy': 'most_frequent'} } } # Assuming the imputation function is called `impute_data` and emits warnings import warnings from your_module import impute_data # Replace with actual import with warnings.catch_warnings(record=True) as w: warnings.simplefilter("always") impute_data(data=None, config=config, protected_cols=protected_cols) # Replace data=None with actual test data # Check that a warning was issued for each protected column warning_msgs = [str(warning.message) for warning in w] for col in protected_cols: assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"

Replace from your_module import impute_data with the actual import path for your imputation function.

Replace data=None with appropriate test data as required by your imputation function.

If your warning messages have a specific format, adjust the assertion to match the expected message.

sourcery-ai · 2025-08-10T23:34:11Z

+    for col in df.columns:
+        idx = rng.choice(n, size=int(0.2*n), replace=False)
+        df.loc[idx, col] = np.nan


issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation
Avoid complex code, like loops, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-08-10T23:34:11Z



-def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
+def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:


issue (code-quality): Low code quality found in detect_redundancy - 21% (low-code-quality)

Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.

Reduce nesting, perhaps by introducing guard clauses to return early.

Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.

sourcery-ai · 2025-08-10T23:34:11Z

    return df_out
+
+
+def imputation_stability_cv(


issue (code-quality): We've found these issues:

Remove unnecessary casts to int, str, float or bool [×3] (remove-unnecessary-cast)

Replace unused for index with underscore (for-index-underscore)

Replace if statement with if expression (assign-if-exp)

Low code quality found in imputation_stability_cv - 19% (low-code-quality)

Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.

Reduce nesting, perhaps by introducing guard clauses to return early.

Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

…olumns; minor style/clarity tweaks per PR review; all tests passing

… tests; minor positive-if style tweaks in bias rows; align with PR feedback

…d offline note in config; update README and usage

codecov-commenter · 2025-08-11T15:23:02Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 37.44395% with 279 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/phenoqc/reporting.py	0.00%	90 Missing and 1 partial ⚠️
src/phenoqc/batch_processing.py	6.25%	72 Missing and 3 partials ⚠️
src/phenoqc/gui/gui.py	0.00%	57 Missing ⚠️
src/phenoqc/quality_metrics.py	74.83%	24 Missing and 14 partials ⚠️
src/phenoqc/validation.py	46.15%	6 Missing and 1 partial ⚠️
src/phenoqc/mapping.py	76.19%	3 Missing and 2 partials ⚠️
src/phenoqc/cli.py	81.81%	2 Missing and 2 partials ⚠️
src/phenoqc/missing_data.py	81.81%	1 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Files with missing lines	Coverage Δ
src/phenoqc/missing_data.py	`66.12% <81.81%> (+0.58%)`	⬆️
src/phenoqc/cli.py	`45.25% <81.81%> (+45.25%)`	⬆️
src/phenoqc/mapping.py	`76.61% <76.19%> (+8.23%)`	⬆️
src/phenoqc/validation.py	`50.20% <46.15%> (+0.84%)`	⬆️
src/phenoqc/quality_metrics.py	`80.83% <74.83%> (-4.47%)`	⬇️
src/phenoqc/gui/gui.py	`3.75% <0.00%> (+3.75%)`	⬆️
src/phenoqc/batch_processing.py	`14.73% <6.25%> (-1.67%)`	⬇️
src/phenoqc/reporting.py	`39.73% <0.00%> (+1.76%)`	⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…eated MICE; stability CV fail threshold + wiring; docs/rendering tweaks; tests added; all tests passing

…inty enable/repeats/params; render MI uncertainty; stability CV fail threshold end-to-end; add CLI test; 117 tests passing

…V fail threshold; add corresponding GUI controls; tests still green

…t DataFrame guards in imputation and batch path; clearer error surfacing

… uncertainty flags to processing; include stability/MI/bias in GUI PDF; restore full diagnostics in results

jorgeMFS added 5 commits August 10, 2025 23:54

feat(gui): add controls for imputation stability, protected columns, …

35cbae6

…redundancy settings; render stability table in results

test: add coverage for redundancy config, CLI protected-columns parsi…

ddea406

…ng, and report imputation settings

test(gui): add unit tests for quality metrics widget helpers

1c7c645

test: expand coverage for bias report edges, stability CV, and GUI st…

ffa6a1b

…ate integration

sourcery-ai Bot reviewed Aug 10, 2025

View reviewed changes

jorgeMFS added 3 commits August 11, 2025 00:14

docs(README): clarify CLI vs config precedence, add stability/bias ex…

e0d3bb3

…ample, protected columns & redundancy flags, and GUI highlights

docs(README): resolve merge conflicts with main; consolidate CLI and …

9f514dc

…GUI sections

style(qm): warn on redundancy method fallback; refactor bias report t…

56fa30f

…o avoid __dict__ mutation; minor GUI nesting tweak; add edge-case tests for stability CV and bias; expand CLI protected columns tests

sourcery-ai Bot reviewed Aug 10, 2025

View reviewed changes

jorgeMFS and others added 8 commits August 11, 2025 00:36

Update src/phenoqc/batch_processing.py

1b20951

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Update tests/test_stability_cv.py

7f60eed

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Update tests/test_bias_report_edges.py

28f3288

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Update tests/test_stability_cv.py

7f1749c

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

refactor: prefer warnings.warn over duplicate logging for protected c…

59f2b12

…olumns; minor style/clarity tweaks per PR review; all tests passing

tests: add empty/omitted protected-columns cases; simplify low-n bias…

ebc3da9

… tests; minor positive-if style tweaks in bias rows; align with PR feedback

docs: document --offline flag, caching behavior and retry/backoff; ad…

6ea43c8

…d offline note in config; update README and usage

tests: add offline/retry tests; all passing

8c30d15

jorgeMFS added 5 commits August 11, 2025 16:48

feat(qm): categorical bias PSI and Cramér’s V; MI uncertainty via rep…

98a1a30

…eated MICE; stability CV fail threshold + wiring; docs/rendering tweaks; tests added; all tests passing

feat(cli/batch/qm/report): wire PSI/Cramér thresholds; add MI uncerta…

68ff6fb

…inty enable/repeats/params; render MI uncertainty; stability CV fail threshold end-to-end; add CLI test; 117 tests passing

docs+gui: document PSI/Cramér thresholds, MI uncertainty, stability C…

3f4e615

…V fail threshold; add corresponding GUI controls; tests still green

fix(gui+engine): unique keys for GUI widgets already added; add stric…

7de978b

…t DataFrame guards in imputation and batch path; clearer error surfacing

fix(gui): ensure bias enable flag set for dict-style configs; pass MI…

5ee41a2

… uncertainty flags to processing; include stability/MI/bias in GUI PDF; restore full diagnostics in results

jorgeMFS merged commit 1640f81 into main Aug 11, 2025
4 checks passed



		def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
		def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:

Conversation

jorgeMFS commented Aug 10, 2025 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Aug 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for CLI/GUI configuration and imputation workflow with diagnostics

ER diagram for protected columns and redundancy config in YAML

Class diagram for ImputationEngine and new diagnostics

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jorgeMFS commented Aug 10, 2025

Uh oh!

jorgeMFS commented Aug 10, 2025

Uh oh!

jorgeMFS commented Aug 10, 2025

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jorgeMFS commented Aug 10, 2025 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Aug 10, 2025 •

edited

Loading

codecov-commenter commented Aug 11, 2025 •

edited

Loading