ClawBio Remediation Plan

Response to the clawbio_bench external audit by Sergey Kornilov (Biostochastics, LLC).


Audit date	2026-04-05
Audit commit	`1481fb4`
Benchmark	biostochastics/clawbio_bench v0.1.0
Result	80/140 tests passing (57.1%)
This plan	13 tasks, 3 priority tiers

Executive Summary

An independent benchmark suite tested 7 ClawBio skills across three dimensions: safety (does it reject unsafe inputs?), correctness (does it produce the right numerical answer?), and honesty (does it report what it actually did?). Two skills failed critically: equity-scorer (20% pass rate) and fine-mapping (25% pass rate). PharmGx is at 42%. The remaining skills are above 75%.

The most serious finding is a scientific honesty failure: equity-scorer computes Nei's GST but labels the output as Hudson FST. This is not a bug; it is a mislabeling of the statistical method used.

Scorecard

Skill	Pass	Fail	Rate	Worst Finding
bio-orchestrator	41	13	75.9%	stub_silent, routed_wrong
equity-scorer	3	12	20.0%	fst_mislabeled, heim_unbounded, edge_crash
nutrigx-advisor	8	2	80.0%	snp_invalid, score_incorrect
pharmgx-reporter	14	19	42.4%	correct_determinate, disclosure_failure
claw-metagenomics	6	1	85.7%	exit_suppressed
fine-mapping	4	12	25.0%	pathology_flagged
clinical-variant	4	1	80.0%	report_structure_complete

P0: Fix Before Next Release

These findings undermine scientific credibility. No release should ship until they are resolved.

1. Equity-scorer: FST mislabeling (honesty failure)

Finding: C-06 fst_mislabeled. Output computes Nei's GST (1973) but the report header and JSON field say "Hudson FST".

File: skills/equity-scorer/equity_scorer.py (1,125 lines)

Fix options:

(a) Rename output to "Nei's GST" throughout (report header, JSON key, CSV column). One-liner.
(b) Implement actual Hudson FST (1992). Requires rewriting the FST calculation to use the Hudson, Slatkin, Maddison formula. More work but more correct.

Recommended: Option (a) first (fixes the honesty failure immediately), then option (b) as a follow-up (adds the correct statistic).

References:

Nei, M. (1973). Analysis of Gene Diversity in Subdivided Populations. PNAS, 70(12), 3321-3323.
Hudson, R.R., Slatkin, M. & Maddison, W.P. (1992). Estimation of levels of gene flow from DNA sequence data. Genetics, 132(2), 583-589.

Tests: Update skills/equity-scorer/tests/test_equity_scorer.py (254 lines). Add test that asserts output label matches the formula used.

Verification: clawbio-bench --smoke --harness equity_scorer --repo .

2. Equity-scorer: HEIM unbounded with custom weights

Finding: U-2/F-27 heim_unbounded. Passing --weights 1,1,1,1 produces HEIM scores exceeding 100 (observed: ~330). The HEIM score is defined on [0, 100] but there is no validation or normalization of user-supplied weights.

File: skills/equity-scorer/equity_scorer.py

Fix:

Validate that weights sum to 1.0 (or normalize them if they do not)
Clamp final HEIM score to [0, 100]
Raise ValueError if any weight is negative

Tests: Add test_heim_custom_weights_bounded, test_heim_negative_weight_rejected, test_heim_weights_normalized.

3. Equity-scorer: 9 edge case crashes

Finding: edge_crash on 9 of 15 tests. The tool crashes with traceback (exit code 1) instead of producing output or a clean error on edge case inputs.

File: skills/equity-scorer/equity_scorer.py

Failing inputs and required fixes:

Test Case	Input	Required Behaviour
eq_01_fst_known_af	2 pops, known allele frequencies	Compute FST, do not crash
eq_02_fst_identical	Identical populations	Return FST = 0
eq_03_fst_monomorphic	Monomorphic sites	Return NaN FST without crash
eq_04_fst_single_sample	n=1 per population	Warn unreliable, do not crash
eq_05_het_all_het	All heterozygous genotypes	Produce bounded HEIM
eq_06_het_all_hom	All homozygous genotypes	Produce bounded HEIM
eq_07_heim_balanced	5-superpop balanced dataset	Produce high HEIM
eq_08_heim_single_pop	Single population	Produce low HEIM, stay bounded
eq_11_multiallelic	Multiallelic sites (AF > 1.0 naive)	Handle without crash

Approach: Add input validation at the top of the computation pipeline. Each edge case needs a guard: check for division by zero (monomorphic), check sample size (n=1), check allele count (multiallelic), check population count (single pop). Return NaN or a warning instead of crashing.

Tests: One test per edge case. All 9 must pass.

4. Fine-mapping: SuSiE null component (architectural)

Finding: pathology_flagged. The SuSiE implementation has no null component in each single-effect regression. Null loci (z=0 everywhere) return phantom PIPs (~0.70 per variant). Single-causal loci produce spurious secondary signals. The reference implementation (susieR) mitigates this via null_weight and susie_get_cs purity pruning.

File: skills/fine-mapping/core/susie.py (4,834 lines)

Fix: This is the most complex fix in the plan. Two changes needed:

Add a null_weight parameter to the single-effect regression. When fitting each effect, include a null hypothesis (no effect at any variant). This prevents the model from forcing all posterior mass onto variants when the true signal is zero.
Add susie_get_cs-style purity pruning to the credible set extraction. After fitting, compute purity for each credible set and discard sets below a minimum purity threshold (default 0.5 in susieR).

Reference implementation: susieR R package, specifically susie() function's null_weight parameter and susie_get_cs() function.

Tests: Add test_null_locus_no_phantom_pip (z=0 locus should produce PIPs near 0), test_single_causal_no_spurious_secondary (one causal variant should produce one credible set).

5. Fine-mapping: PIP calculation (alpha vs true PIP)

Finding: The credible set pip field stores single-effect alpha rather than true PIP. True PIP formula: PIP_j = 1 - product(1 - alpha_lj) across all L effects.

File: skills/fine-mapping/core/susie.py or skills/fine-mapping/core/credible_sets.py

Fix: Replace the PIP assignment with the correct aggregation formula. This is a one-liner once you find where PIPs are extracted from the fitted model.

Tests: Add test_pip_aggregation_formula with a known two-effect example where true PIP differs from single-effect alpha.

6. Fine-mapping: purity mean vs min

Finding: Purity computed as mean pairwise |r| instead of minimum pairwise |r| per Wang et al. 2020 section 3.2. This promotes credible sets that span independent LD blocks (false positives).

File: skills/fine-mapping/core/credible_sets.py (or wherever purity is computed)

Fix: Change mean() to min() in the purity calculation. One-liner.

Also fix: mu stores alpha-weighted mixture contributions rather than conditional posterior means (contrary to docstring and Wang eq. 4).

Reference: Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. JRSS-B.

Tests: Add test_purity_is_minimum_r with a credible set spanning two LD blocks where mean > 0.5 but min < 0.5 (should be pruned).

7. Fine-mapping: input validation

Finding: Zero or negative SE, zero or negative n, and coverage=0 all produce numeric output without raising. Reference implementations reject these at validation time.

File: skills/fine-mapping/fine_mapping.py (394 lines) and/or skills/fine-mapping/core/io.py

Fix: Add ValueError contracts at input parsing:

if (se <= 0).any():
    raise ValueError("Standard errors must be positive")
if n <= 0:
    raise ValueError("Sample size must be positive")
if coverage <= 0 or coverage > 1:
    raise ValueError("Coverage must be in (0, 1]")

Tests: test_negative_se_raises, test_zero_n_raises, test_zero_coverage_raises.

P1: Fix Within 2 Weeks

These findings degrade trust but do not block the release.

8. PharmGx-reporter: CPIC compliance gaps (19 failures)

Finding: 42.4% pass rate (14/33). Finding categories: correct_determinate and disclosure_failure. The specific per-test failures have been requested from Sergey.

File: skills/pharmgx-reporter/pharmgx_reporter.py (2,002 lines)

Action:

Wait for Sergey's per-test breakdown (requested in reply email)
Triage each of the 19 failures against CPIC guidelines
Fix classification logic for each failing drug-gene pair
Update tests in skills/pharmgx-reporter/tests/test_pharmgx.py (529 lines)

Blocked by: Sergey's detailed PGx findings.

9. Bio-orchestrator: stub_silent + routing collision

Finding: 8 stub_silent findings (routes correctly to stub skills but emits no warning that the skill has no executable). 1 routed_wrong (kw_14: "variant+diversity" routes to equity-scorer instead of vcf-annotator). 1 unroutable_crash (inj_03: flock routing hijack crashes).

File: skills/bio-orchestrator/orchestrator.py (662 lines)

Fixes:

stub_silent: When the orchestrator routes to a skill that has no Python script (SKILL.md only), emit a warning: "Note: [skill-name] is a SKILL.md-only stub. No executable is available. Applying methodology from SKILL.md."
routed_wrong: Fix keyword priority in the KEYWORD_MAP. "variant" should prioritise vcf-annotator over equity-scorer when "diversity" is not also present. Add disambiguation logic.
unroutable_crash: The flock routing hijack test (inj_03) crashes instead of gracefully handling the injection attempt. Add input sanitization.

Tests: Update skills/bio-orchestrator/tests/test_orchestrator.py (237 lines). Add test_stub_routing_warns, test_variant_keyword_routes_to_vcf_annotator, test_injection_handled_gracefully.

10. NutriGx-advisor: empty input crash + allele mismatch

Finding: (1) snp_invalid (ng_07): Empty input crashes instead of clean error. (2) score_incorrect (ng_09): Hom-ref genotypes (all reference alleles) are treated as allele_mismatch and return "Unknown" instead of score 0.

Files:

skills/nutrigx_advisor/nutrigx_advisor.py (161 lines)
skills/nutrigx_advisor/parse_input.py
skills/nutrigx_advisor/score_variants.py

Fixes:

Add empty file check in parse_input.py: if no variants parsed, raise a clean error with message "No variants found in input file"
Fix genotype classification in score_variants.py: hom-ref (e.g. AA when ref=A) should score 0, not be classified as allele_mismatch

Tests: Update skills/nutrigx_advisor/tests/test_nutrigx.py (123 lines). Add test_empty_input_error, test_homref_scores_zero.

P2: Infrastructure

11. Metagenomics: exit code suppression

Finding: exit_suppressed. Shell exit codes from metagenomics tool calls are suppressed to warnings (critical=False). Non-zero exit codes should propagate as errors.

File: skills/claw-metagenomics/ (identify the subprocess call that swallows exit codes)

Fix: Change critical=False to critical=True or propagate subprocess.CalledProcessError.

12. Integrate clawbio_bench into CI

Action: Add Sergey's audit.yml GitHub Actions workflow to ClawBio's CI pipeline.

Workflow:

On every PR, run clawbio-bench --smoke --repo . against the PR branch
Exit 1 (findings exist): post advisory comment on PR with new vs resolved findings
Exit >= 2 (infrastructure failure): block merge until audit infrastructure is repaired
Upload verdict tree as artifact (30-day retention)

Implementation:

Add .github/workflows/audit.yml referencing biostochastics/clawbio_bench
Pin to a specific bench version tag for reproducibility
Run after existing CI tests pass (to avoid wasting bench time on broken code)

13. Contribute to biostochastics/clawbio_bench

Action: Manuel to join as collaborator on the benchmark repo.

Contributions planned:

Add test cases for skills not yet covered (gwas-prs, gwas-lookup, genome-compare, profile-report)
Verify fixes pass the harness as we ship them
Add SKILL.md conformance checks to the bench (complement the pr-audit)

GitHub handle: manuelcorpas (collaborator access requested from Sergey)

Verification Protocol

For every fix in this plan:

Write the clawbio_bench test first (or confirm Sergey's test covers the fix)
Run the specific harness to confirm the test fails (red)
Implement the fix following red/green TDD
Run the harness again to confirm the test passes (green)
Run the full smoke suite to confirm no regressions: clawbio-bench --smoke --repo .
Update the skill's SKILL.md Gotchas section with the failure mode that was fixed
Commit with reference to the bench finding ID (e.g. "fix: C-06 FST mislabeling")

Target: raise the overall pass rate from 57.1% to >90% within 2 weeks.

Timeline

Week	Target	Expected Pass Rate
Week 1	P0 complete (equity-scorer + fine-mapping)	~75%
Week 2	P1 complete (pharmgx + orchestrator + nutrigx)	~90%
Week 3	P2 complete (CI integration, metagenomics, bench contributions)	>90%

Contact

Benchmark maintainer: Sergey Kornilov, sergey.kornilov@biostochastics.com
ClawBio lead: Manuel Corpas, m.corpas@westminster.ac.uk
Benchmark repo: https://github.com/biostochastics/clawbio_bench

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ClawBio Remediation Plan

Executive Summary

Scorecard

P0: Fix Before Next Release

1. Equity-scorer: FST mislabeling (honesty failure)

2. Equity-scorer: HEIM unbounded with custom weights

3. Equity-scorer: 9 edge case crashes

4. Fine-mapping: SuSiE null component (architectural)

5. Fine-mapping: PIP calculation (alpha vs true PIP)

6. Fine-mapping: purity mean vs min

7. Fine-mapping: input validation

P1: Fix Within 2 Weeks

8. PharmGx-reporter: CPIC compliance gaps (19 failures)

9. Bio-orchestrator: stub_silent + routing collision

10. NutriGx-advisor: empty input crash + allele mismatch

P2: Infrastructure

11. Metagenomics: exit code suppression

12. Integrate clawbio_bench into CI

13. Contribute to biostochastics/clawbio_bench

Verification Protocol

Timeline

Contact

FilesExpand file tree

REMEDIATION-PLAN.md

Latest commit

History

REMEDIATION-PLAN.md

File metadata and controls

ClawBio Remediation Plan

Executive Summary

Scorecard

P0: Fix Before Next Release

1. Equity-scorer: FST mislabeling (honesty failure)

2. Equity-scorer: HEIM unbounded with custom weights

3. Equity-scorer: 9 edge case crashes

4. Fine-mapping: SuSiE null component (architectural)

5. Fine-mapping: PIP calculation (alpha vs true PIP)

6. Fine-mapping: purity mean vs min

7. Fine-mapping: input validation

P1: Fix Within 2 Weeks

8. PharmGx-reporter: CPIC compliance gaps (19 failures)

9. Bio-orchestrator: stub_silent + routing collision

10. NutriGx-advisor: empty input crash + allele mismatch

P2: Infrastructure

11. Metagenomics: exit code suppression

12. Integrate clawbio_bench into CI

13. Contribute to biostochastics/clawbio_bench

Verification Protocol

Timeline

Contact