Feat/imputation improvements#77
Conversation
…eakage via protected columns; make redundancy method/threshold configurable; surface imputation settings (incl. random_state) in reports; CLI flags and docs; add complex e2e diagnostics demo; tests for bias/stability
…redundancy settings; render stability table in results
…ng, and report imputation settings
Reviewer's GuideThis PR extends the imputation workflow by adding a repeatable stability diagnostic, protected-column exclusions, and configurable redundancy metrics. It refactors the imputation summary builder to merge engine tuning details with user config, exposes new CLI and GUI controls for diagnostics, protected columns, and redundancy parameters, and integrates stability/bias results into both PDF and Markdown reports. Comprehensive tests and an end-to-end demo script ensure coverage and reproducibility. Sequence diagram for CLI/GUI configuration and imputation workflow with diagnosticssequenceDiagram
actor User
participant CLI
participant GUI
participant BatchProcessor
participant ImputationEngine
participant QualityMetrics
User->>CLI: Provide CLI args (impute, diagnostics, protected columns, redundancy)
CLI->>BatchProcessor: Pass config and CLI overrides
User->>GUI: Set config via GUI (strategy, diagnostics, protected columns, redundancy)
GUI->>BatchProcessor: Pass merged config
BatchProcessor->>ImputationEngine: Build engine with protected columns excluded
BatchProcessor->>QualityMetrics: Run bias and stability diagnostics
QualityMetrics->>BatchProcessor: Return diagnostics results
BatchProcessor->>Report: Integrate diagnostics into PDF/Markdown
ER diagram for protected columns and redundancy config in YAMLerDiagram
CONFIG ||--o{ IMPUTATION : contains
CONFIG ||--o{ QUALITY_METRICS : contains
CONFIG ||--o{ PROTECTED_COLUMNS : contains
CONFIG ||--o{ REDUNDANCY : contains
IMPUTATION {
string strategy
dict params
dict tuning
dict per_column
}
QUALITY_METRICS {
dict imputation_bias
dict imputation_stability
dict redundancy
}
PROTECTED_COLUMNS {
string[] columns
}
REDUNDANCY {
float threshold
string method
}
Class diagram for ImputationEngine and new diagnosticsclassDiagram
class ImputationEngine {
- cfg: dict
- exclude_columns: set
+ __init__(cfg, exclude_columns)
+ fit_transform(df)
}
class imputation_bias_report {
+ imputation_bias_report(original_df, imputed_df, imputation_mask, smd_threshold, var_ratio_low, var_ratio_high, ks_alpha)
}
class imputation_stability_cv {
+ imputation_stability_cv(df, strategy, params, repeats, mask_fraction, scoring, random_state, columns)
}
ImputationEngine --> imputation_bias_report : uses
ImputationEngine --> imputation_stability_cv : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey @jorgeMFS - I've reviewed your changes - here's some feedback:
- A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
- The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
- This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- A lot of the CLI/config override merging logic for protected columns, bias thresholds, redundancy, and tuning is duplicated across batch_process, child_process_run, and _append_metric—consider extracting this into shared helper functions to improve maintainability and reduce errors.
- The imputation stability diagnostic is deeply nested inside _append_metric with multiple try/catch blocks—extract it into a clear helper or service function to simplify error handling and improve readability.
- This PR bundles several large features (stability diagnostics, protected columns, redundancy metrics, GUI updates) in one change set—consider splitting into smaller focused PRs to make review and testing more manageable.
## Individual Comments
### Comment 1
<location> `src/phenoqc/batch_processing.py:907` </location>
<code_context>
- 'var_ratio_low': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_low', 0.5)) if isinstance(metrics_cfg, dict) else 0.5,
- 'var_ratio_high': float(metrics_cfg.get('imputation_bias', {}).get('var_ratio_high', 2.0)) if isinstance(metrics_cfg, dict) else 2.0,
- 'ks_alpha': float(metrics_cfg.get('imputation_bias', {}).get('ks_alpha', 0.05)) if isinstance(metrics_cfg, dict) else 0.05,
+ stability_diagnostics=stability_df,
+ bias_thresholds=merged_bias_thresholds if 'merged_bias_thresholds' in locals() else {
+ 'smd_threshold': float(bias_smd_threshold),
</code_context>
<issue_to_address>
stability_diagnostics is added to the report, but may be None if not run.
Consider returning an empty DataFrame instead of None to ensure consistent output and simplify downstream handling.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
stability_diagnostics=stability_df,
=======
stability_diagnostics=stability_df if stability_df is not None else pd.DataFrame(),
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `src/phenoqc/quality_metrics.py:70` </location>
<code_context>
-def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame:
+def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame:
"""Detect highly correlated or identical columns.
</code_context>
<issue_to_address>
Correlation method for redundancy detection is now configurable.
If an invalid method is provided, the code defaults to 'pearson' without notification. Please add a warning or message to inform users when this fallback occurs.
</issue_to_address>
### Comment 3
<location> `src/phenoqc/quality_metrics.py:343` </location>
<code_context>
if len(obs) == 0 or len(imp) == 0:
continue
+ low_n = bool(len(obs) < 10 or len(imp) < 10)
if len(obs) < 3 or len(imp) < 3:
ks_stat = None
</code_context>
<issue_to_address>
Auxiliary fields wdist and low_n are appended to the row dicts.
Modifying __dict__ directly is fragile and may cause issues if the row type changes. Instead, convert the dataclass instance to a dict before adding extra fields.
Suggested implementation:
```python
# Optional Wasserstein distance for distributional shift (not part of warn thresholds)
try:
wdist = float(stats.wasserstein_distance(obs.values, imp.values)) if len(obs) and len(imp) else np.nan
except Exception:
wdist = np.nan
# Convert row dataclass to dict and append auxiliary fields
import dataclasses
row_dict = dataclasses.asdict(row)
row_dict["wdist"] = wdist
row_dict["low_n"] = low_n
# Now use row_dict instead of modifying row.__dict__
```
You will need to replace any code that appends or processes `row` (as a dataclass) with `row_dict` (as a dict) in the rest of this function or loop. For example, if you have `rows.append(row)`, change it to `rows.append(row_dict)`. If you need to preserve the dataclass elsewhere, consider keeping both representations.
</issue_to_address>
### Comment 4
<location> `src/phenoqc/missing_data.py:204` </location>
<code_context>
self.cfg = cfg or {}
self.exclude_columns = set(exclude_columns or [])
+
+ # Warn if protected columns are requested as imputation targets
+ if exclude_columns:
+ per_column = cfg.get('per_column', {}) if cfg else {}
+ for col in exclude_columns:
+ if col in per_column:
+ message = f"Protected column '{col}' is configured for imputation but will be excluded"
+ logging.warning(message)
+ warnings.warn(message, UserWarning)
+
self.chosen_params: dict = {}
</code_context>
<issue_to_address>
Warning is issued if protected columns are configured for imputation.
Consider using either logging.warning or warnings.warn, rather than both, to prevent duplicate notifications.
</issue_to_address>
### Comment 5
<location> `tests/test_imputation_stability.py:18` </location>
<code_context>
+ 'g2': rng.normal(5, 2.0, n),
+ })
+
+ def test_stability_converges_with_repeats(self):
+ # With more repeats, the SD of MAE across repeats should be non-increasing on average
+ res_low = imputation_stability_cv(self.df, strategy='mean', repeats=3, mask_fraction=0.1, scoring='MAE', random_state=7)
+ res_high = imputation_stability_cv(self.df, strategy='mean', repeats=10, mask_fraction=0.1, scoring='MAE', random_state=7)
</code_context>
<issue_to_address>
Consider adding tests for edge cases such as empty DataFrames, non-numeric columns, and columns with all missing values.
Current tests only cover standard numeric data. Please add cases for empty DataFrames, non-numeric columns, and columns with all missing values to verify robust handling of these inputs.
</issue_to_address>
### Comment 6
<location> `tests/test_imputation_settings_report.py:6` </location>
<code_context>
+def test_report_includes_imputation_settings_pdf():
</code_context>
<issue_to_address>
Add assertions to verify that imputation tuning details are present in the generated PDF report.
The test should validate that the PDF contains the expected imputation tuning details, not just that the file is created. Use a PDF parser or string search to confirm these details are present.
Suggested implementation:
```python
import PyPDF2
def test_report_includes_imputation_settings_pdf():
validation_results = {
"Format Validation": True,
"Duplicate Records": pd.DataFrame(),
"Conflicting Records": pd.DataFrame(),
"Integrity Issues": pd.DataFrame(),
"Referential Integrity Issues": pd.DataFrame(),
"Anomalies Detected": pd.DataFrame(),
"Invalid Mask": pd.DataFrame(),
}
missing_data = pd.Series(dtype=int)
# Example imputation settings
imputation_settings = {
"method": "KNN",
"n_neighbors": 5,
"weights": "uniform",
"missing_values": "NaN"
}
# Generate PDF report in-memory
pdf_buffer = io.BytesIO()
generate_qc_report(
validation_results=validation_results,
missing_data=missing_data,
imputation_settings=imputation_settings,
output_format="pdf",
output_stream=pdf_buffer
)
pdf_buffer.seek(0)
# Parse PDF and extract text
reader = PyPDF2.PdfReader(pdf_buffer)
pdf_text = ""
for page in reader.pages:
pdf_text += page.extract_text()
# Assert that imputation tuning details are present
assert "Imputation Method: KNN" in pdf_text
assert "Number of Neighbors: 5" in pdf_text
assert "Weights: uniform" in pdf_text
assert "Missing Values: NaN" in pdf_text
```
- Ensure that `generate_qc_report` supports the `output_stream` and `imputation_settings` arguments and that it writes the imputation settings in the expected format to the PDF.
- You may need to adjust the assertion strings to match the actual formatting in the generated PDF.
- Make sure `PyPDF2` is installed in your test environment.
</issue_to_address>
### Comment 7
<location> `tests/test_cli_protected_columns.py:5` </location>
<code_context>
+from phenoqc.cli import parse_arguments
+
+
+def test_cli_protected_columns_commas_and_spaces(monkeypatch):
+ argv = [
+ 'phenoqc',
+ '--input', 'a.csv',
+ '--schema', 'schema.json',
+ '--unique_identifiers', 'id',
+ '--protected-columns', 'label,outcome', 'group'
+ ]
+ monkeypatch.setattr('sys.argv', argv)
+ args = parse_arguments()
+ assert args.protected_columns == ['label', 'outcome', 'group']
+
+
</code_context>
<issue_to_address>
Add tests for edge cases such as empty protected columns, duplicate entries, and mixed whitespace.
Please include tests for empty input, duplicate column names, and entries with extra whitespace to cover additional edge cases.
</issue_to_address>
### Comment 8
<location> `tests/test_imputation_bias.py:4` </location>
<code_context>
+- Imputation-bias thresholds: `--bias-smd-threshold`, `--bias-var-low`, `--bias-var-high`, `--bias-ks-alpha`
+- Imputation stability diagnostics: `--impute-diagnostics on|off`, `--diag-repeats`, `--diag-mask-fraction`, `--diag-scoring`
+- Redundancy tuning: `--redundancy-threshold`, `--redundancy-method {pearson,spearman}`
### Config (new `imputation:` block)
</code_context>
<issue_to_address>
No new tests added for imputation_bias_report edge cases; consider adding tests for low sample size, all missing, and non-numeric columns.
Please ensure tests cover columns with <10 observations, all missing values, and non-numeric data, verifying that expected fields and warnings are present in the output.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
…ample, protected columns & redundancy flags, and GUI highlights
…o avoid __dict__ mutation; minor GUI nesting tweak; add edge-case tests for stability CV and bias; expand CLI protected columns tests
|
@sourcery-ai resolve |
|
@sourcery-ai dismiss |
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey @jorgeMFS - I've reviewed your changes - here's some feedback:
- Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
- Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
- The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Please remove the leftover merge conflict markers (<<<<<<<, =======, >>>>>>>) in the README to avoid build or parsing errors.
- Multiple bare except blocks swallow errors silently—consider catching specific exceptions or at least logging the exception details to aid future debugging.
- The CLI vs config precedence logic for protected columns and redundancy settings is duplicated in several places; extracting that into a shared utility would reduce maintenance overhead.
## Individual Comments
### Comment 1
<location> `src/phenoqc/quality_metrics.py:345` </location>
<code_context>
if len(obs) == 0 or len(imp) == 0:
continue
+ low_n = (len(obs) < 10 or len(imp) < 10)
if len(obs) < 3 or len(imp) < 3:
ks_stat = None
</code_context>
<issue_to_address>
The 'low_n' flag is added to bias diagnostics but not documented.
Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.
</issue_to_address>
### Comment 2
<location> `src/phenoqc/quality_metrics.py:397` </location>
<code_context>
+def imputation_stability_cv(
</code_context>
<issue_to_address>
Imputation stability CV uses a fixed random_state for all repeats.
Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.
Suggested implementation:
```python
def imputation_stability_cv(
df: pd.DataFrame,
strategy: str,
params: Optional[Dict[str, Any]] = None,
repeats: int = 5,
mask_fraction: float = 0.10,
scoring: str = 'MAE',
random_state: int = 42,
columns: Optional[List[str]] = None,
) -> pd.DataFrame:
"""Estimate imputation repeatability via repeated mask-and-impute.
```
```python
for repeat in range(repeats):
# Vary random_state for each repeat to ensure independent masking
repeat_random_state = random_state + repeat if random_state is not None else None
mask = create_mask(df, mask_fraction, random_state=repeat_random_state)
imputed = impute(df, mask, strategy, params, columns)
score = score_imputation(df, imputed, mask, scoring)
scores.append(score)
```
</issue_to_address>
### Comment 3
<location> `src/phenoqc/batch_processing.py:1055` </location>
<code_context>
tun = cfg_imp.setdefault('tuning', {})
tun['enable'] = True
+ # Merge redundancy metric overrides (CLI > YAML)
+ try:
+ if redundancy_threshold is not None or redundancy_method is not None:
+ qm = config.setdefault('quality_metrics', [])
+ # Ensure redundancy metric is enabled in config if CLI tweaks are given
+ if isinstance(qm, list):
+ if 'redundancy' not in qm:
+ qm.append('redundancy')
+ elif isinstance(qm, dict):
+ qm.setdefault('redundancy', {})
+ # Attach parameters in dict style for validator lookup
+ config.setdefault('redundancy', {})
+ if redundancy_threshold is not None:
+ config['redundancy']['threshold'] = float(redundancy_threshold)
+ if redundancy_method is not None:
+ config['redundancy']['method'] = str(redundancy_method)
+ except Exception:
+ pass
+
</code_context>
<issue_to_address>
Redundancy metric override logic may silently fail.
Instead of suppressing all exceptions, log any errors during the merge to improve traceability and debugging.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
try:
if redundancy_threshold is not None or redundancy_method is not None:
qm = config.setdefault('quality_metrics', [])
# Ensure redundancy metric is enabled in config if CLI tweaks are given
if isinstance(qm, list):
if 'redundancy' not in qm:
qm.append('redundancy')
elif isinstance(qm, dict):
qm.setdefault('redundancy', {})
# Attach parameters in dict style for validator lookup
config.setdefault('redundancy', {})
if redundancy_threshold is not None:
config['redundancy']['threshold'] = float(redundancy_threshold)
if redundancy_method is not None:
config['redundancy']['method'] = str(redundancy_method)
except Exception:
pass
=======
try:
if redundancy_threshold is not None or redundancy_method is not None:
qm = config.setdefault('quality_metrics', [])
# Ensure redundancy metric is enabled in config if CLI tweaks are given
if isinstance(qm, list):
if 'redundancy' not in qm:
qm.append('redundancy')
elif isinstance(qm, dict):
qm.setdefault('redundancy', {})
# Attach parameters in dict style for validator lookup
config.setdefault('redundancy', {})
if redundancy_threshold is not None:
config['redundancy']['threshold'] = float(redundancy_threshold)
if redundancy_method is not None:
config['redundancy']['method'] = str(redundancy_method)
except Exception as e:
log_activity(f"[RedundancyOverride] Error merging redundancy metric overrides: {e}", level="error")
>>>>>>> REPLACE
</suggested_fix>
### Comment 4
<location> `tests/test_protected_columns.py:102` </location>
<code_context>
+ def test_protected_columns_warning_when_configured(self):
</code_context>
<issue_to_address>
Consider adding a test for multiple protected columns configured for imputation.
Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.
Suggested implementation:
```python
def test_protected_columns_warning_when_configured(self):
"""Test that warning is issued when protected columns are configured for imputation."""
protected_cols = ['outcome']
# Configure the protected column for imputation
config = {
'strategy': 'mean',
'per_column': {
'outcome': {'strategy': 'mode'} # This should trigger a warning
}
}
def test_multiple_protected_columns_warning_when_configured(self):
"""Test that warnings are issued for multiple protected columns configured for imputation."""
protected_cols = ['outcome', 'gender', 'race']
config = {
'strategy': 'mean',
'per_column': {
'outcome': {'strategy': 'mode'},
'gender': {'strategy': 'median'},
'race': {'strategy': 'most_frequent'}
}
}
# Assuming the imputation function is called `impute_data` and emits warnings
import warnings
from your_module import impute_data # Replace with actual import
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
impute_data(data=None, config=config, protected_cols=protected_cols) # Replace data=None with actual test data
# Check that a warning was issued for each protected column
warning_msgs = [str(warning.message) for warning in w]
for col in protected_cols:
assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"
```
- Replace `from your_module import impute_data` with the actual import path for your imputation function.
- Replace `data=None` with appropriate test data as required by your imputation function.
- If your warning messages have a specific format, adjust the assertion to match the expected message.
</issue_to_address>
### Comment 5
<location> `tests/test_stability_cv.py:46` </location>
<code_context>
+ assert out.empty
+
+
+def test_stability_cv_all_missing_numeric():
+ df = pd.DataFrame({'x': [np.nan, np.nan, np.nan]})
+ out = imputation_stability_cv(df, strategy='mean')
+ assert isinstance(out, pd.DataFrame)
+ # No observed cells to mask => empty result
+ assert out.empty
+
+
</code_context>
<issue_to_address>
Missing test for non-numeric columns in imputation_stability_cv.
Add a test with a DataFrame containing only non-numeric columns to ensure imputation_stability_cv returns an empty DataFrame without raising errors.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_stability_cv_basic_and_columns():
=======
def test_stability_cv_non_numeric_only():
# DataFrame with only non-numeric columns
df = pd.DataFrame({'a': ['foo', 'bar', 'baz'], 'b': ['x', 'y', 'z']})
out = imputation_stability_cv(df, strategy='mean')
assert isinstance(out, pd.DataFrame)
assert out.empty
def test_stability_cv_basic_and_columns():
>>>>>>> REPLACE
</suggested_fix>
### Comment 6
<location> `tests/test_bias_report_edges.py:19` </location>
<code_context>
+ assert row.get("ks_p", None) in (None, np.nan)
+
+
+def test_bias_report_warn_logic_and_columns():
+ # Construct data where imputed mean differs markedly
+ rng = np.random.RandomState(0)
+ orig = pd.DataFrame({"a": rng.normal(0, 1, 200)})
+ imp = orig.copy()
+ # Simulate imputed values at masked positions shifted by +1.0
+ mask_positions = pd.Series([False] * 200)
+ mask_positions.iloc[:100] = True
+ imp.loc[mask_positions, "a"] = imp.loc[mask_positions, "a"] + 1.0
+ out = imputation_bias_report(original_df=orig, imputed_df=imp, imputation_mask={"a": mask_positions}, columns=["a"], smd_threshold=0.2)
+ assert not out.empty
+ assert set(["column","n_obs","n_imp","smd","var_ratio","ks_stat","ks_p","warn"]).issubset(out.columns)
+ assert bool(out["warn"].iloc[0]) is True
+
+
</code_context>
<issue_to_address>
Missing test for bias report with all columns having no missing values.
Add a test case with no missing values to verify that imputation_bias_report returns an empty DataFrame without errors.
</issue_to_address>
### Comment 7
<location> `tests/test_cli_protected_columns.py:18` </location>
<code_context>
+ assert args.protected_columns == ['label', 'outcome', 'group']
+
+
+def test_cli_protected_columns_duplicates_whitespace(monkeypatch):
+ argv = [
+ 'phenoqc',
+ '--input', 'a.csv',
+ '--schema', 'schema.json',
+ '--unique_identifiers', 'id',
+ '--protected-columns', ' label , outcome ', 'label', ' outcome ', ' '
+ ]
+ monkeypatch.setattr('sys.argv', argv)
+ args = parse_arguments()
+ # Normalization keeps order of appearance but strips empties
+ assert args.protected_columns == ['label', 'outcome', 'label', 'outcome']
+
+
</code_context>
<issue_to_address>
Consider adding a test for empty protected columns argument.
Please add a test case where --protected-columns is omitted or set to an empty string, verifying that the parser returns an empty list without raising an error.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| (ks_p is not None and ks_p < ks_alpha) | ||
| ) | ||
|
|
||
| rows.append( | ||
| BiasRow( | ||
| column=col, | ||
| n_obs=len(obs), | ||
| n_imp=len(imp), | ||
| mean_diff=mean_diff, | ||
| smd=float(smd) if not np.isnan(smd) else np.nan, | ||
| var_ratio=( | ||
| float(var_ratio) if not np.isnan(var_ratio) else np.nan | ||
| ), | ||
| ks_stat=float(ks_stat) if ks_stat is not None else None, | ||
| ks_p=float(ks_p) if ks_p is not None else None, | ||
| warn=warn, | ||
| ) | ||
| row_obj = BiasRow( | ||
| column=col, | ||
| n_obs=len(obs), | ||
| n_imp=len(imp), | ||
| mean_diff=mean_diff, | ||
| smd=float(smd) if not np.isnan(smd) else np.nan, | ||
| var_ratio=( |
There was a problem hiding this comment.
question: The 'low_n' flag is added to bias diagnostics but not documented.
Please clarify the purpose of 'low_n' in the output schema and verify that downstream consumers interpret it correctly.
| df_out = pd.DataFrame([r.__dict__ for r in rows]) | ||
| # rows is a list of dictionaries after refactor above | ||
| df_out = pd.DataFrame(rows) | ||
| if not df_out.empty: |
There was a problem hiding this comment.
suggestion: Imputation stability CV uses a fixed random_state for all repeats.
Consider varying the random_state for each repeat to increase mask diversity and ensure independent masking.
Suggested implementation:
def imputation_stability_cv(
df: pd.DataFrame,
strategy: str,
params: Optional[Dict[str, Any]] = None,
repeats: int = 5,
mask_fraction: float = 0.10,
scoring: str = 'MAE',
random_state: int = 42,
columns: Optional[List[str]] = None,
) -> pd.DataFrame:
"""Estimate imputation repeatability via repeated mask-and-impute. for repeat in range(repeats):
# Vary random_state for each repeat to ensure independent masking
repeat_random_state = random_state + repeat if random_state is not None else None
mask = create_mask(df, mask_fraction, random_state=repeat_random_state)
imputed = impute(df, mask, strategy, params, columns)
score = score_imputation(df, imputed, mask, scoring)
scores.append(score)| def test_protected_columns_warning_when_configured(self): | ||
| """Test that warning is issued when protected columns are configured for imputation.""" | ||
| protected_cols = ['outcome'] | ||
|
|
||
| # Configure the protected column for imputation | ||
| config = { | ||
| 'strategy': 'mean', | ||
| 'per_column': { | ||
| 'outcome': {'strategy': 'mode'} # This should trigger a warning | ||
| } |
There was a problem hiding this comment.
suggestion (testing): Consider adding a test for multiple protected columns configured for imputation.
Adding a test with multiple protected columns will help verify that warnings are correctly emitted for each column.
Suggested implementation:
def test_protected_columns_warning_when_configured(self):
"""Test that warning is issued when protected columns are configured for imputation."""
protected_cols = ['outcome']
# Configure the protected column for imputation
config = {
'strategy': 'mean',
'per_column': {
'outcome': {'strategy': 'mode'} # This should trigger a warning
}
}
def test_multiple_protected_columns_warning_when_configured(self):
"""Test that warnings are issued for multiple protected columns configured for imputation."""
protected_cols = ['outcome', 'gender', 'race']
config = {
'strategy': 'mean',
'per_column': {
'outcome': {'strategy': 'mode'},
'gender': {'strategy': 'median'},
'race': {'strategy': 'most_frequent'}
}
}
# Assuming the imputation function is called `impute_data` and emits warnings
import warnings
from your_module import impute_data # Replace with actual import
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
impute_data(data=None, config=config, protected_cols=protected_cols) # Replace data=None with actual test data
# Check that a warning was issued for each protected column
warning_msgs = [str(warning.message) for warning in w]
for col in protected_cols:
assert any(col in msg for msg in warning_msgs), f"No warning for protected column: {col}"- Replace
from your_module import impute_datawith the actual import path for your imputation function. - Replace
data=Nonewith appropriate test data as required by your imputation function. - If your warning messages have a specific format, adjust the assertion to match the expected message.
| for col in df.columns: | ||
| idx = rng.choice(n, size=int(0.2*n), replace=False) | ||
| df.loc[idx, col] = np.nan |
There was a problem hiding this comment.
issue (code-quality): Avoid loops in tests. (no-loop-in-tests)
Explanation
Avoid complex code, like loops, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
|
|
||
|
|
||
| def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98) -> pd.DataFrame: | ||
| def detect_redundancy(df: pd.DataFrame, threshold: float = 0.98, method: str = 'pearson') -> pd.DataFrame: |
There was a problem hiding this comment.
issue (code-quality): Low code quality found in detect_redundancy - 21% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
| return df_out | ||
|
|
||
|
|
||
| def imputation_stability_cv( |
There was a problem hiding this comment.
issue (code-quality): We've found these issues:
- Remove unnecessary casts to int, str, float or bool [×3] (
remove-unnecessary-cast) - Replace unused for index with underscore (
for-index-underscore) - Replace if statement with if expression (
assign-if-exp) - Low code quality found in imputation_stability_cv - 19% (
low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
…olumns; minor style/clarity tweaks per PR review; all tests passing
… tests; minor positive-if style tweaks in bias rows; align with PR feedback
…d offline note in config; update README and usage
…eated MICE; stability CV fail threshold + wiring; docs/rendering tweaks; tests added; all tests passing
…inty enable/repeats/params; render MI uncertainty; stability CV fail threshold end-to-end; add CLI test; 117 tests passing
…V fail threshold; add corresponding GUI controls; tests still green
…t DataFrame guards in imputation and batch path; clearer error surfacing
… uncertainty flags to processing; include stability/MI/bias in GUI PDF; restore full diagnostics in results
Summary by Sourcery
Add comprehensive imputation improvements including repeatability diagnostics, protected-column exclusion, and configurable redundancy metrics, and integrate them into CLI, config, GUI, and reporting with accompanying documentation and tests.
New Features:
Enhancements:
Documentation:
Tests: