Skip to content

Feat/imputation improvements#77

Merged
jorgeMFS merged 21 commits into
mainfrom
feat/imputation-improvements
Aug 11, 2025
Merged

Feat/imputation improvements#77
jorgeMFS merged 21 commits into
mainfrom
feat/imputation-improvements

Conversation

@jorgeMFS

@jorgeMFS jorgeMFS commented Aug 10, 2025

Copy link
Copy Markdown
Owner

Summary by Sourcery

Add comprehensive imputation improvements including repeatability diagnostics, protected-column exclusion, and configurable redundancy metrics, and integrate them into CLI, config, GUI, and reporting with accompanying documentation and tests.

New Features:

  • Introduce imputation stability diagnostic via repeated mask-and-impute cross-validation
  • Allow users to specify protected columns to exclude from imputation and tuning to prevent label leakage
  • Enable redundancy detection using configurable correlation threshold and method (pearson or spearman)

Enhancements:

  • Extend imputation bias report with Wasserstein distance, low sample flag, and unified threshold merging from CLI and config
  • Merge tuning and diagnostics parameters from engine, config, and CLI with sensible defaults including automatic random_state
  • Merge CLI overrides for diagnostics thresholds and redundancy settings with YAML config, giving priority to flags

Documentation:

  • Update README, config schema, and GUI instructions to document new CLI flags, YAML imputation block, diagnostics settings, protected columns, and redundancy parameters

Tests:

  • Add unit tests for imputation stability CV, bias report edge cases, protected columns exclusion, redundancy config, CLI parsing, GUI state, and report generation
  • Include an end-to-end diagnostics demo script to validate pipeline behavior and reproducibility

…eakage via protected columns; make redundancy method/threshold configurable; surface imputation settings (incl. random_state) in reports; CLI flags and docs; add complex e2e diagnostics demo; tests for bias/stability
…redundancy settings; render stability table in results
@sourcery-ai

sourcery-ai Bot commented Aug 10, 2025

Copy link
Copy Markdown
Contributor

Reviewer's Guide

This PR extends the imputation workflow by adding a repeatable stability diagnostic, protected-column exclusions, and configurable redundancy metrics. It refactors the imputation summary builder to merge engine tuning details with user config, exposes new CLI and GUI controls for diagnostics, protected columns, and redundancy parameters, and integrates stability/bias results into both PDF and Markdown reports. Comprehensive tests and an end-to-end demo script ensure coverage and reproducibility.

Sequence diagram for CLI/GUI configuration and imputation workflow with diagnostics

sequenceDiagram
    actor User
    participant CLI
    participant GUI
    participant BatchProcessor
    participant ImputationEngine
    participant QualityMetrics
    User->>CLI: Provide CLI args (impute, diagnostics, protected columns, redundancy)
    CLI->>BatchProcessor: Pass config and CLI overrides
    User->>GUI: Set config via GUI (strategy, diagnostics, protected columns, redundancy)
    GUI->>BatchProcessor: Pass merged config
    BatchProcessor->>ImputationEngine: Build engine with protected columns excluded
    BatchProcessor->>QualityMetrics: Run bias and stability diagnostics
    QualityMetrics->>BatchProcessor: Return diagnostics results
    BatchProcessor->>Report: Integrate diagnostics into PDF/Markdown
Loading

ER diagram for protected columns and redundancy config in YAML

erDiagram
    CONFIG ||--o{ IMPUTATION : contains
    CONFIG ||--o{ QUALITY_METRICS : contains
    CONFIG ||--o{ PROTECTED_COLUMNS : contains
    CONFIG ||--o{ REDUNDANCY : contains
    IMPUTATION {
      string strategy
      dict params
      dict tuning
      dict per_column
    }
    QUALITY_METRICS {
      dict imputation_bias
      dict imputation_stability
      dict redundancy
    }
    PROTECTED_COLUMNS {
      string[] columns
    }
    REDUNDANCY {
      float threshold
      string method
    }
Loading

Class diagram for ImputationEngine and new diagnostics

classDiagram
    class ImputationEngine {
        - cfg: dict
        - exclude_columns: set
        + __init__(cfg, exclude_columns)
        + fit_transform(df)
    }
    class imputation_bias_report {
        + imputation_bias_report(original_df, imputed_df, imputation_mask, smd_threshold, var_ratio_low, var_ratio_high, ks_alpha)
    }
    class imputation_stability_cv {
        + imputation_stability_cv(df, strategy, params, repeats, mask_fraction, scoring, random_state, columns)
    }
    ImputationEngine --> imputation_bias_report : uses
    ImputationEngine --> imputation_stability_cv : uses
Loading

File-Level Changes

Change Details Files
Add imputation stability diagnostic
  • Implemented imputation_stability_cv to perform repeated mask-and-impute and compute per-column CV errors
  • Integrated stability DataFrame into batch_processing._append_metric alongside bias diagnostics
  • Exposed CLI flags (--impute-diagnostics, --diag-repeats, --diag-mask-fraction, --diag-scoring) and GUI controls
  • Rendered stability sections in PDF/Markdown reports and in the Streamlit GUI quality metrics pane
  • Created unit tests (test_stability_cv.py, test_imputation_stability.py) and included in end-to-end demo
src/phenoqc/quality_metrics.py
src/phenoqc/batch_processing.py
src/phenoqc/cli.py
src/phenoqc/gui/gui.py
src/phenoqc/reporting.py
tests/test_stability_cv.py
tests/test_imputation_stability.py
scripts/end_to_end_diagnostics_demo.sh
Introduce protected columns exclusion
  • Added --protected-columns flag with normalization logic in CLI
  • Merged CLI and config protected_columns with CLI taking precedence in batch_process
  • Excluded protected columns from imputation feature matrix and tuning in process_file
  • Emitted warnings in ImputationEngine when protected columns are configured for imputation
  • Added GUI multiselect for protected columns and corresponding tests
src/phenoqc/cli.py
src/phenoqc/batch_processing.py
src/phenoqc/missing_data.py
src/phenoqc/gui/gui.py
config.yaml
tests/test_protected_columns.py
Support configurable redundancy metrics
  • Extended detect_redundancy to accept a method (pearson or spearman) and dynamic threshold
  • Merged redundancy_threshold/method CLI overrides into config in batch_process
  • Applied redundancy settings in DataValidator._apply_quality_metrics using top-level or nested config
  • Exposed redundancy controls in GUI and updated tests for both function and validation
src/phenoqc/quality_metrics.py
src/phenoqc/validation.py
src/phenoqc/batch_processing.py
src/phenoqc/gui/gui.py
src/phenoqc/cli.py
tests/test_redundancy_config.py
Refactor imputation summary and tuning merge
  • Reworked _build_imputation_summary to merge engine.tuning_summary with selective config fields
  • Ensured default random_state is set when tuning is enabled but not specified
src/phenoqc/batch_processing.py
Enhance CLI/GUI options and documentation
  • Added new CLI arguments for diagnostics, protected columns, and redundancy in cli.py
  • Updated README.md and config.yaml to document new options and CLI vs. config precedence
  • Adjusted GUI workflow steps to include stability diagnostics, protected columns, and redundancy
src/phenoqc/cli.py
src/phenoqc/gui/gui.py
README.md
config.yaml
Integrate stability into reporting
  • Modified reporting.py to insert a combined "Imputation Stability & Bias" section
  • Populated top variables by CV error in PDF and Markdown outputs
src/phenoqc/reporting.py
Add tests and end-to-end demo script
  • Expanded unit tests for bias report, protected columns, redundancy, stability, and report generation
  • Included scripts/end_to_end_diagnostics_demo.sh to validate diagnostics pipeline and reproducibility
tests/test_imputation_bias.py
tests/test_cli_protected_columns.py
tests/test_imputation_settings_report.py
tests/test_protected_columns.py
tests/test_redundancy_config.py
tests/test_stability_cv.py
tests/test_imputation_stability.py
scripts/end_to_end_diagnostics_demo.sh

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jorgeMFS - I've reviewed your changes - here's some feedback:

  • A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
  • The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
  • This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
- The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
- This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.

## Individual Comments

### Comment 1
<location> `src/phenoqc/batch_processing.py:907` </location>
<code_context>
-                    'var_ratio_low': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_low', 0.5)) if isinstance(metrics_cfg, dict) else 0.5,
-                    'var_ratio_high': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_high', 2.0)) if isinstance(metrics_cfg, dict) else 2.0,
-                    'ks_alpha': float(metrics_cfg.get('imputation_bias', {}).get('ks_alpha', 0.05)) if isinstance(metrics_cfg, dict) else 0.05,
+                stability_diagnostics=stability_df,
+                bias_thresholds=merged_bias_thresholds if 'merged_bias_thresholds' in locals() else {
+                    'smd_threshold': float(bias_smd_threshold),
</code_context>

<issue_to_address>
stability_diagnostics is added to the report, but may be None if not run.

Consider returning an empty DataFrame instead of None to ensure consistent output and simplify downstream handling.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                stability_diagnostics=stability_df,
=======
                stability_diagnostics=stability_df if stability_df is not None else pd.DataFrame(),
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `src/phenoqc/quality_metrics.py:70` </location>
<code_context>


-def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
+def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:
     """Detect highly correlated or identical columns.

</code_context>

<issue_to_address>
Correlation method for redundancy detection is now configurable.

If an invalid method is provided, the code defaults to 'pearson' without notification. Please add a warning or message to inform users when this fallback occurs.
</issue_to_address>

### Comment 3
<location> `src/phenoqc/quality_metrics.py:343` </location>
<code_context>
             if len(obs) == 0 or len(imp) == 0:
                 continue

+            low_n = bool(len(obs) < 10 or len(imp) < 10)
             if len(obs) < 3 or len(imp) < 3:
                 ks_stat = None
</code_context>

<issue_to_address>
Auxiliary fields wdist and low_n are appended to the row dicts.

Modifying __dict__ directly is fragile and may cause issues if the row type changes. Instead, convert the dataclass instance to a dict before adding extra fields.

Suggested implementation:

```python
            # Optional Wasserstein distance for distributional shift (not part of warn thresholds)
            try:
                wdist = float(stats.wasserstein_distance(obs.values, imp.values)) if len(obs) and len(imp) else np.nan
            except Exception:
                wdist = np.nan

            # Convert row dataclass to dict and append auxiliary fields
            import dataclasses
            row_dict = dataclasses.asdict(row)
            row_dict["wdist"] = wdist
            row_dict["low_n"] = low_n
            # Now use row_dict instead of modifying row.__dict__

```

You will need to replace any code that appends or processes `row` (as a dataclass) with `row_dict` (as a dict) in the rest of this function or loop. For example, if you have `rows.append(row)`, change it to `rows.append(row_dict)`. If you need to preserve the dataclass elsewhere, consider keeping both representations.
</issue_to_address>

### Comment 4
<location> `src/phenoqc/missing_data.py:204` </location>
<code_context>
         self.cfg = cfg or {}
         self.exclude_columns = set(exclude_columns or [])
+        
+        # Warn if protected columns are requested as imputation targets
+        if exclude_columns:
+            per_column = cfg.get('per_column', {}) if cfg else {}
+            for col in exclude_columns:
+                if col in per_column:
+                    message = f"Protected column '{col}' is configured for imputation but will be excluded"
+                    logging.warning(message)
+                    warnings.warn(message, UserWarning)
+        
         self.chosen_params: dict = {}
</code_context>

<issue_to_address>
Warning is issued if protected columns are configured for imputation.

Consider using either logging.warning or warnings.warn, rather than both, to prevent duplicate notifications.
</issue_to_address>

### Comment 5
<location> `tests/test_imputation_stability.py:18` </location>
<code_context>
+            'g2': rng.normal(5, 2.0, n),
+        })
+
+    def test_stability_converges_with_repeats(self):
+        # With more repeats, the SD of MAE across repeats should be non-increasing on average
+        res_low = imputation_stability_cv(self.df, strategy='mean', repeats=3, mask_fraction=0.1, scoring='MAE', random_state=7)
+        res_high = imputation_stability_cv(self.df, strategy='mean', repeats=10, mask_fraction=0.1, scoring='MAE', random_state=7)
</code_context>

<issue_to_address>
Consider adding tests for edge cases such as empty DataFrames, non-numeric columns, and columns with all missing values.

Current tests only cover standard numeric data. Please add cases for empty DataFrames, non-numeric columns, and columns with all missing values to verify robust handling of these inputs.
</issue_to_address>

### Comment 6
<location> `tests/test_imputation_settings_report.py:6` </location>
<code_context>
+def test_report_includes_imputation_settings_pdf():
</code_context>

<issue_to_address>
Add assertions to verify that imputation tuning details are present in the generated PDF report.

The test should validate that the PDF contains the expected imputation tuning details, not just that the file is created. Use a PDF parser or string search to confirm these details are present.

Suggested implementation:

```python
import PyPDF2

def test_report_includes_imputation_settings_pdf():
    validation_results = {
        "Format Validation": True,
        "Duplicate Records": pd.DataFrame(),
        "Conflicting Records": pd.DataFrame(),
        "Integrity Issues": pd.DataFrame(),
        "Referential Integrity Issues": pd.DataFrame(),
        "Anomalies Detected": pd.DataFrame(),
        "Invalid Mask": pd.DataFrame(),
    }
    missing_data = pd.Series(dtype=int)

    # Example imputation settings
    imputation_settings = {
        "method": "KNN",
        "n_neighbors": 5,
        "weights": "uniform",
        "missing_values": "NaN"
    }

    # Generate PDF report in-memory
    pdf_buffer = io.BytesIO()
    generate_qc_report(
        validation_results=validation_results,
        missing_data=missing_data,
        imputation_settings=imputation_settings,
        output_format="pdf",
        output_stream=pdf_buffer
    )
    pdf_buffer.seek(0)

    # Parse PDF and extract text
    reader = PyPDF2.PdfReader(pdf_buffer)
    pdf_text = ""
    for page in reader.pages:
        pdf_text += page.extract_text()

    # Assert that imputation tuning details are present
    assert "Imputation Method: KNN" in pdf_text
    assert "Number of Neighbors: 5" in pdf_text
    assert "Weights: uniform" in pdf_text
    assert "Missing Values: NaN" in pdf_text

```

- Ensure that `generate_qc_report` supports the `output_stream` and `imputation_settings` arguments and that it writes the imputation settings in the expected format to the PDF.
- You may need to adjust the assertion strings to match the actual formatting in the generated PDF.
- Make sure `PyPDF2` is installed in your test environment.
</issue_to_address>

### Comment 7
<location> `tests/test_cli_protected_columns.py:5` </location>
<code_context>
+from phenoqc.cli import parse_arguments
+
+
+def test_cli_protected_columns_commas_and_spaces(monkeypatch):
+    argv = [
+        'phenoqc',
+        '--input', 'a.csv',
+        '--schema', 'schema.json',
+        '--unique_identifiers', 'id',
+        '--protected-columns', 'label,outcome', 'group'
+    ]
+    monkeypatch.setattr('sys.argv', argv)
+    args = parse_arguments()
+    assert args.protected_columns == ['label', 'outcome', 'group']
+
+
</code_context>

<issue_to_address>
Add tests for edge cases such as empty protected columns, duplicate entries, and mixed whitespace.

Please include tests for empty input, duplicate column names, and entries with extra whitespace to cover additional edge cases.
</issue_to_address>

### Comment 8
<location> `tests/test_imputation_bias.py:4` </location>
<code_context>
+- Imputation-bias thresholds: `--bias-smd-threshold`, `--bias-var-low`, `--bias-var-high`, `--bias-ks-alpha`
+- Imputation stability diagnostics: `--impute-diagnostics on|off`, `--diag-repeats`, `--diag-mask-fraction`, `--diag-scoring`
+- Redundancy tuning: `--redundancy-threshold`, `--redundancy-method {pearson,spearman}`

 ### Config (new `imputation:` block)

</code_context>

<issue_to_address>
No new tests added for imputation_bias_report edge cases; consider adding tests for low sample size, all missing, and non-numeric columns.

Please ensure tests cover columns with <10 observations, all missing values, and non-numeric data, verifying that expected fields and warnings are present in the output.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/phenoqc/batch_processing.py
Comment thread src/phenoqc/quality_metrics.py
Comment thread src/phenoqc/quality_metrics.py Outdated
Comment thread src/phenoqc/missing_data.py
Comment thread tests/test_imputation_stability.py
Comment thread src/phenoqc/batch_processing.py Outdated
Comment thread src/phenoqc/gui/gui.py Outdated
Comment thread src/phenoqc/quality_metrics.py
Comment thread src/phenoqc/quality_metrics.py Outdated
Comment thread src/phenoqc/quality_metrics.py
…ample, protected columns & redundancy flags, and GUI highlights
…o avoid __dict__ mutation; minor GUI nesting tweak; add edge-case tests for stability CV and bias; expand CLI protected columns tests
@jorgeMFS

Copy link
Copy Markdown
Owner Author

@sourcery-ai resolve

@jorgeMFS

Copy link
Copy Markdown
Owner Author

@sourcery-ai dismiss

@jorgeMFS

Copy link
Copy Markdown
Owner Author

@sourcery-ai review

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jorgeMFS - I've reviewed your changes - here's some feedback:

  • Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
  • Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
  • The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
- Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
- The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.

## Individual Comments

### Comment 1
<location> `src/phenoqc/quality_metrics.py:345` </location>
<code_context>
             if len(obs) == 0 or len(imp) == 0:
                 continue

+            low_n = (len(obs) < 10 or len(imp) < 10)
             if len(obs) < 3 or len(imp) < 3:
                 ks_stat = None
</code_context>

<issue_to_address>
The 'low_n' flag is added to bias diagnostics but not documented.

Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.
</issue_to_address>

### Comment 2
<location> `src/phenoqc/quality_metrics.py:397` </location>
<code_context>
+def imputation_stability_cv(
</code_context>

<issue_to_address>
Imputation stability CV uses a fixed random_state for all repeats.

Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.

Suggested implementation:

```python
def imputation_stability_cv(
    df: pd.DataFrame,
    strategy: str,
    params: Optional[Dict[str, Any]] = None,
    repeats: int = 5,
    mask_fraction: float = 0.10,
    scoring: str = 'MAE',
    random_state: int = 42,
    columns: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Estimate imputation repeatability via repeated mask-and-impute.

```

```python
    for repeat in range(repeats):
        # Vary random_state for each repeat to ensure independent masking
        repeat_random_state = random_state + repeat if random_state is not None else None
        mask = create_mask(df, mask_fraction, random_state=repeat_random_state)
        imputed = impute(df, mask, strategy, params, columns)
        score = score_imputation(df, imputed, mask, scoring)
        scores.append(score)

```
</issue_to_address>

### Comment 3
<location> `src/phenoqc/batch_processing.py:1055` </location>
<code_context>
             tun = cfg_imp.setdefault('tuning', {})
             tun['enable'] = True

+    # Merge redundancy metric overrides (CLI > YAML)
+    try:
+        if redundancy_threshold is not None or redundancy_method is not None:
+            qm = config.setdefault('quality_metrics', [])
+            # Ensure redundancy metric is enabled in config if CLI tweaks are given
+            if isinstance(qm, list):
+                if 'redundancy' not in qm:
+                    qm.append('redundancy')
+            elif isinstance(qm, dict):
+                qm.setdefault('redundancy', {})
+            # Attach parameters in dict style for validator lookup
+            config.setdefault('redundancy', {})
+            if redundancy_threshold is not None:
+                config['redundancy']['threshold'] = float(redundancy_threshold)
+            if redundancy_method is not None:
+                config['redundancy']['method'] = str(redundancy_method)
+    except Exception:
+        pass
+
</code_context>

<issue_to_address>
Redundancy metric override logic may silently fail.

Instead of suppressing all exceptions, log any errors during the merge to improve traceability and debugging.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    try:
        if redundancy_threshold is not None or redundancy_method is not None:
            qm = config.setdefault('quality_metrics', [])
            # Ensure redundancy metric is enabled in config if CLI tweaks are given
            if isinstance(qm, list):
                if 'redundancy' not in qm:
                    qm.append('redundancy')
            elif isinstance(qm, dict):
                qm.setdefault('redundancy', {})
            # Attach parameters in dict style for validator lookup
            config.setdefault('redundancy', {})
            if redundancy_threshold is not None:
                config['redundancy']['threshold'] = float(redundancy_threshold)
            if redundancy_method is not None:
                config['redundancy']['method'] = str(redundancy_method)
    except Exception:
        pass
=======
    try:
        if redundancy_threshold is not None or redundancy_method is not None:
            qm = config.setdefault('quality_metrics', [])
            # Ensure redundancy metric is enabled in config if CLI tweaks are given
            if isinstance(qm, list):
                if 'redundancy' not in qm:
                    qm.append('redundancy')
            elif isinstance(qm, dict):
                qm.setdefault('redundancy', {})
            # Attach parameters in dict style for validator lookup
            config.setdefault('redundancy', {})
            if redundancy_threshold is not None:
                config['redundancy']['threshold'] = float(redundancy_threshold)
            if redundancy_method is not None:
                config['redundancy']['method'] = str(redundancy_method)
    except Exception as e:
        log_activity(f"[RedundancyOverride] Error merging redundancy metric overrides: {e}", level="error")
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `tests/test_protected_columns.py:102` </location>
<code_context>
+    def test_protected_columns_warning_when_configured(self):
</code_context>

<issue_to_address>
Consider adding a test for multiple protected columns configured for imputation.

Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.

Suggested implementation:

```python
    def test_protected_columns_warning_when_configured(self):
        """Test that warning is issued when protected columns are configured for imputation."""
        protected_cols = ['outcome']

        # Configure the protected column for imputation
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'}  # This should trigger a warning
            }
        }

    def test_multiple_protected_columns_warning_when_configured(self):
        """Test that warnings are issued for multiple protected columns configured for imputation."""
        protected_cols = ['outcome', 'gender', 'race']
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'},
                'gender': {'strategy': 'median'},
                'race': {'strategy': 'most_frequent'}
            }
        }
        # Assuming the imputation function is called `impute_data` and emits warnings
        import warnings
        from your_module import impute_data  # Replace with actual import

        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            impute_data(data=None, config=config, protected_cols=protected_cols)  # Replace data=None with actual test data

            # Check that a warning was issued for each protected column
            warning_msgs = [str(warning.message) for warning in w]
            for col in protected_cols:
                assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"

```

- Replace `from your_module import impute_data` with the actual import path for your imputation function.
- Replace `data=None` with appropriate test data as required by your imputation function.
- If your warning messages have a specific format, adjust the assertion to match the expected message.
</issue_to_address>

### Comment 5
<location> `tests/test_stability_cv.py:46` </location>
<code_context>
+    assert out.empty
+
+
+def test_stability_cv_all_missing_numeric():
+    df = pd.DataFrame({'x': [np.nan, np.nan, np.nan]})
+    out = imputation_stability_cv(df, strategy='mean')
+    assert isinstance(out, pd.DataFrame)
+    # No observed cells to mask => empty result
+    assert out.empty
+
+
</code_context>

<issue_to_address>
Missing test for non-numeric columns in imputation_stability_cv.

Add a test with a DataFrame containing only non-numeric columns to ensure imputation_stability_cv returns an empty DataFrame without raising errors.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH

def test_stability_cv_basic_and_columns():
=======
def test_stability_cv_non_numeric_only():
    # DataFrame with only non-numeric columns
    df = pd.DataFrame({'a': ['foo', 'bar', 'baz'], 'b': ['x', 'y', 'z']})
    out = imputation_stability_cv(df, strategy='mean')
    assert isinstance(out, pd.DataFrame)
    assert out.empty

def test_stability_cv_basic_and_columns():
>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/test_bias_report_edges.py:19` </location>
<code_context>
+        assert row.get("ks_p", None) in (None, np.nan)
+
+
+def test_bias_report_warn_logic_and_columns():
+    # Construct data where imputed mean differs markedly
+    rng = np.random.RandomState(0)
+    orig = pd.DataFrame({"a": rng.normal(0, 1, 200)})
+    imp = orig.copy()
+    # Simulate imputed values at masked positions shifted by +1.0
+    mask_positions = pd.Series([False] * 200)
+    mask_positions.iloc[:100] = True
+    imp.loc[mask_positions, "a"] = imp.loc[mask_positions, "a"] + 1.0
+    out = imputation_bias_report(original_df=orig, imputed_df=imp, imputation_mask={"a": mask_positions}, columns=["a"], smd_threshold=0.2)
+    assert not out.empty
+    assert set(["column","n_obs","n_imp","smd","var_ratio","ks_stat","ks_p","warn"]).issubset(out.columns)
+    assert bool(out["warn"].iloc[0]) is True
+
+
</code_context>

<issue_to_address>
Missing test for bias report with all columns having no missing values.

Add a test case with no missing values to verify that imputation_bias_report returns an empty DataFrame without errors.
</issue_to_address>

### Comment 7
<location> `tests/test_cli_protected_columns.py:18` </location>
<code_context>
+    assert args.protected_columns == ['label', 'outcome', 'group']
+
+
+def test_cli_protected_columns_duplicates_whitespace(monkeypatch):
+    argv = [
+        'phenoqc',
+        '--input', 'a.csv',
+        '--schema', 'schema.json',
+        '--unique_identifiers', 'id',
+        '--protected-columns', ' label , outcome ', 'label', '  outcome  ', ' '
+    ]
+    monkeypatch.setattr('sys.argv', argv)
+    args = parse_arguments()
+    # Normalization keeps order of appearance but strips empties
+    assert args.protected_columns == ['label', 'outcome', 'label', 'outcome']
+
+
</code_context>

<issue_to_address>
Consider adding a test for empty protected columns argument.

Please add a test case where --protected-columns is omitted or set to an empty string, verifying that the parser returns an empty list without raising an error.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 367 to +376
(ks_p is not None and ks_p < ks_alpha)
)

rows.append(
BiasRow(
column=col,
n_obs=len(obs),
n_imp=len(imp),
mean_diff=mean_diff,
smd=float(smd) if not np.isnan(smd) else np.nan,
var_ratio=(
float(var_ratio) if not np.isnan(var_ratio) else np.nan
),
ks_stat=float(ks_stat) if ks_stat is not None else None,
ks_p=float(ks_p) if ks_p is not None else None,
warn=warn,
)
row_obj = BiasRow(
column=col,
n_obs=len(obs),
n_imp=len(imp),
mean_diff=mean_diff,
smd=float(smd) if not np.isnan(smd) else np.nan,
var_ratio=(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: The 'low_n' flag is added to bias diagnostics but not documented.

Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.

df_out = pd.DataFrame([r.__dict__ for r in rows])
# rows is a list of dictionaries after refactor above
df_out = pd.DataFrame(rows)
if not df_out.empty:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Imputation stability CV uses a fixed random_state for all repeats.

Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.

Suggested implementation:

def imputation_stability_cv(
    df: pd.DataFrame,
    strategy: str,
    params: Optional[Dict[str, Any]] = None,
    repeats: int = 5,
    mask_fraction: float = 0.10,
    scoring: str = 'MAE',
    random_state: int = 42,
    columns: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Estimate imputation repeatability via repeated mask-and-impute.
    for repeat in range(repeats):
        # Vary random_state for each repeat to ensure independent masking
        repeat_random_state = random_state + repeat if random_state is not None else None
        mask = create_mask(df, mask_fraction, random_state=repeat_random_state)
        imputed = impute(df, mask, strategy, params, columns)
        score = score_imputation(df, imputed, mask, scoring)
        scores.append(score)

Comment thread src/phenoqc/batch_processing.py Outdated
Comment on lines +102 to +111
def test_protected_columns_warning_when_configured(self):
"""Test that warning is issued when protected columns are configured for imputation."""
protected_cols = ['outcome']

# Configure the protected column for imputation
config = {
'strategy': 'mean',
'per_column': {
'outcome': {'strategy': 'mode'} # This should trigger a warning
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding a test for multiple protected columns configured for imputation.

Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.

Suggested implementation:

    def test_protected_columns_warning_when_configured(self):
        """Test that warning is issued when protected columns are configured for imputation."""
        protected_cols = ['outcome']

        # Configure the protected column for imputation
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'}  # This should trigger a warning
            }
        }

    def test_multiple_protected_columns_warning_when_configured(self):
        """Test that warnings are issued for multiple protected columns configured for imputation."""
        protected_cols = ['outcome', 'gender', 'race']
        config = {
            'strategy': 'mean',
            'per_column': {
                'outcome': {'strategy': 'mode'},
                'gender': {'strategy': 'median'},
                'race': {'strategy': 'most_frequent'}
            }
        }
        # Assuming the imputation function is called `impute_data` and emits warnings
        import warnings
        from your_module import impute_data  # Replace with actual import

        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            impute_data(data=None, config=config, protected_cols=protected_cols)  # Replace data=None with actual test data

            # Check that a warning was issued for each protected column
            warning_msgs = [str(warning.message) for warning in w]
            for col in protected_cols:
                assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"
  • Replace from your_module import impute_data with the actual import path for your imputation function.
  • Replace data=None with appropriate test data as required by your imputation function.
  • If your warning messages have a specific format, adjust the assertion to match the expected message.

Comment thread tests/test_stability_cv.py
Comment on lines +15 to +17
for col in df.columns:
idx = rng.choice(n, size=int(0.2*n), replace=False)
df.loc[idx, col] = np.nan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

ExplanationAvoid complex code, like loops, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests



def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Low code quality found in detect_redundancy - 21% (low-code-quality)


ExplanationThe quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

return df_out


def imputation_stability_cv(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

Comment thread tests/test_bias_report_edges.py Outdated
Comment thread tests/test_stability_cv.py Outdated
jorgeMFS and others added 8 commits August 11, 2025 00:36
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
…olumns; minor style/clarity tweaks per PR review; all tests passing
… tests; minor positive-if style tweaks in bias rows; align with PR feedback
…d offline note in config; update README and usage
@codecov-commenter

codecov-commenter commented Aug 11, 2025

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 37.44395% with 279 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/phenoqc/reporting.py 0.00% 90 Missing and 1 partial ⚠️
src/phenoqc/batch_processing.py 6.25% 72 Missing and 3 partials ⚠️
src/phenoqc/gui/gui.py 0.00% 57 Missing ⚠️
src/phenoqc/quality_metrics.py 74.83% 24 Missing and 14 partials ⚠️
src/phenoqc/validation.py 46.15% 6 Missing and 1 partial ⚠️
src/phenoqc/mapping.py 76.19% 3 Missing and 2 partials ⚠️
src/phenoqc/cli.py 81.81% 2 Missing and 2 partials ⚠️
src/phenoqc/missing_data.py 81.81% 1 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Files with missing lines Coverage Δ
src/phenoqc/missing_data.py 66.12% <81.81%> (+0.58%) ⬆️
src/phenoqc/cli.py 45.25% <81.81%> (+45.25%) ⬆️
src/phenoqc/mapping.py 76.61% <76.19%> (+8.23%) ⬆️
src/phenoqc/validation.py 50.20% <46.15%> (+0.84%) ⬆️
src/phenoqc/quality_metrics.py 80.83% <74.83%> (-4.47%) ⬇️
src/phenoqc/gui/gui.py 3.75% <0.00%> (+3.75%) ⬆️
src/phenoqc/batch_processing.py 14.73% <6.25%> (-1.67%) ⬇️
src/phenoqc/reporting.py 39.73% <0.00%> (+1.76%) ⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…eated MICE; stability CV fail threshold + wiring; docs/rendering tweaks; tests added; all tests passing
…inty enable/repeats/params; render MI uncertainty; stability CV fail threshold end-to-end; add CLI test; 117 tests passing
…V fail threshold; add corresponding GUI controls; tests still green
…t DataFrame guards in imputation and batch path; clearer error surfacing
… uncertainty flags to processing; include stability/MI/bias in GUI PDF; restore full diagnostics in results
@jorgeMFS jorgeMFS merged commit 1640f81 into main Aug 11, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants