feat(pu): dPULearn.fit split input (X_pos/X_unlabeled + mask_neg_) + aa.get_labels + sample colors #321

Merged
breimanntools merged 10 commits into master from feat/dpulearn-mined-mask
Jul 3, 2026

@breimanntools breimanntools commented Jun 30, 2026

Owner

Part of #308. Three additive pieces for the positive/unlabeled workflow.

1. dPULearn.fit — positives/unlabeled split input + mask_neg_

For the common positives-vs-unlabeled setup, fit now accepts X_pos + X_unlabeled as an alternative to X + labels: the two matrices are stacked and marked (1/2) internally, so you don't hand-build the label vector. After fitting, the new dPULearn.mask_neg_ attribute is the boolean mask of reliable negatives — over the rows of X_unlabeled in the split mode, over X otherwise (equals the manual labels_[len(X_pos):] == 0 exactly).

fit still returns self (scikit-learn contract preserved — no output-flag anti-pattern), and the existing fit(X, labels=...) path is byte-identical.

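# Split mode: pass the positives and the unlabeled pool directly (no hand-built labels)
dpul.fit(X_pos=X_pos, X_unlabeled=X_pool, n_neg=49)
# mask_neg_ is a boolean mask over the rows of X_pool (the reliable negatives)
X_neg = X_pool[dpul.mask_neg_]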
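
For reviewers who want to see what the split mode replaces, here is a minimal NumPy-only sketch of the manual path: stack the two matrices, mark them 1/2, and slice the fitted labels past the positives. It does not call dPULearn; `stack_split_input`, `mask_over_unlabeled`, and the hand-edited `labels_fitted` vector are hypothetical stand-ins, used only to illustrate the `labels_[len(X_pos):] == 0` equivalence described above.

```python
import numpy as np

# Marker values as described in the PR text: positives are marked 1,
# unlabeled samples 2, and reliable negatives show up as 0 in labels_.
LABEL_POS, LABEL_UNL, LABEL_NEG = 1, 2, 0


def stack_split_input(X_pos, X_unlabeled):
    """Stack positives over unlabeled rows and build the matching label vector."""
    X_pos = np.asarray(X_pos)
    X_unlabeled = np.asarray(X_unlabeled)
    X = np.vstack([X_pos, X_unlabeled])
    labels = np.concatenate([
        np.full(len(X_pos), LABEL_POS),
        np.full(len(X_unlabeled), LABEL_UNL),
    ])
    return X, labels


def mask_over_unlabeled(labels_fitted, n_pos):
    """Boolean mask of reliable negatives over the rows of X_unlabeled."""
    return np.asarray(labels_fitted)[n_pos:] == LABEL_NEG


X_pos = np.ones((3, 2))
X_pool = np.zeros((4, 2))
X, labels = stack_split_input(X_pos, X_pool)

# Hypothetical fit result: pretend pool rows 1 and 3 were relabeled as negatives.
labels_fitted = labels.copy()
labels_fitted[[4, 6]] = LABEL_NEG

mask = mask_over_unlabeled(labels_fitted, n_pos=len(X_pos))
X_neg = X_pool[mask]
```

The offset by `len(X_pos)` is the part that is easy to get wrong by hand, which is presumably why the mask is exposed over `X_unlabeled` in split mode.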

Supersedes the earlier standalone dPULearn.mine_negatives method (removed): the convenience now lives on fit itself rather than adding a second entry point.

2. aa.get_labels(df, positive_label=1, col_label="label")

One-call binary label vector for the recurring (df[col]==x).astype(int) pattern.
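
As a rough illustration of the pattern being wrapped, the sketch below reimplements it with pandas. `get_labels_sketch` is a hypothetical stand-in, not the actual `aa.get_labels` implementation, and its error message and handling of missing values are assumptions.

```python
import numpy as np
import pandas as pd


def get_labels_sketch(df, positive_label=1, col_label="label"):
    """Return 1 where df[col_label] equals positive_label, else 0 (illustrative only)."""
    if col_label not in df.columns:
        raise ValueError(f"'col_label' ({col_label}) is not a column of 'df'")
    return (df[col_label] == positive_label).astype(int).to_numpy()


df = pd.DataFrame({"label": [1, 0, 2, 1, np.nan]})
labels = get_labels_sketch(df)                       # positives are the rows equal to 1
labels_two = get_labels_sketch(df, positive_label=2)  # treat class 2 as positive instead
```

In this sketch a missing value compares unequal to any label and therefore maps to 0; whether `aa.get_labels` behaves the same way is not stated in this description.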

3. COLOR_SAMPLES_* constants

Canonical aa.COLOR_SAMPLES_POS/NEG/UNL/REL_NEG colors so notebooks reference a named constant.

Tests cover the split-mode ↔ manual equivalence, mask_neg_ semantics in both modes, the returns-self + both-modes-rejected guards, and get_labels edge cases. 271 dpulearn + api-meta tests green.
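
The both-modes guard mentioned above could look roughly like the following. This is a sketch under assumptions: `resolve_input_mode` is a hypothetical helper, and the exception type and messages are illustrative rather than what dPULearn.fit actually raises.

```python
def resolve_input_mode(X=None, labels=None, X_pos=None, X_unlabeled=None):
    """Return 'labels' or 'split'; reject mixed or incomplete input (illustrative)."""
    has_labels_mode = X is not None or labels is not None
    has_split_mode = X_pos is not None or X_unlabeled is not None
    if has_labels_mode and has_split_mode:
        raise ValueError("Pass either 'X' + 'labels' or 'X_pos' + 'X_unlabeled', not both")
    if has_split_mode:
        if X_pos is None or X_unlabeled is None:
            raise ValueError("'X_pos' and 'X_unlabeled' must be given together")
        return "split"
    if X is None or labels is None:
        raise ValueError("'X' and 'labels' must be given together")
    return "labels"
```

Checking the mixed case first means a caller who passes all four arguments is told about the conflict rather than about a missing partner argument.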

🤖 Generated with Claude Code

…totype #308)

Three additive conveniences for the positive/unlabelled -> mined-negatives flow,
removing the recurring vstack/label-vector/color-lookup plumbing in the
gamma-secretase use case. All existing APIs stay byte-identical.

- dPULearn.mine_negatives(X_pos, X_unlabelled, ...): one-call sugar over fit that
  returns the reliable-negative boolean mask over X_unlabelled. Equals the manual
  labels_[len(X_pos):]==0 result exactly (regression-tested). fit(X, labels=...)
  unchanged.
- get_labels(df, positive_label=1, col_label="label"): binary int label vector,
  the single-call form of (df[col]==x).astype(int).to_numpy().
- COLOR_SAMPLES_POS/NEG/UNL/REL_NEG: public named aliases for the canonical sample
  colors, equal to plot_get_cdict("DICT_COLOR")["SAMPLES_*"] (golden-tested).

Wired get_labels + the 4 color constants into __init__/__all__ (on the #308
wire-to-public-API list). Ripple: numpydoc + 2 executed example notebooks
(get_labels, dpul_mine_negatives), 39 unit tests (positive+negative+regression),
release-notes Unreleased entries, cheat-sheet rows.

Part of #305 / prototype for #308

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.95%. Comparing base (1a152de) to head (830879a).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   94.93%   94.95%   +0.01%     
==========================================
  Files         185      186       +1     
  Lines       17883    17910      +27     
  Branches     3038     3034       -4     
==========================================
+ Hits        16978    17007      +29     
+ Misses        598      597       -1     
+ Partials      307      306       -1     
Files with missing lines Coverage Δ
aaanalysis/__init__.py 97.29% <100.00%> (+0.03%) ⬆️
aaanalysis/_constants.py 100.00% <100.00%> (ø)
aaanalysis/data_handling/__init__.py 100.00% <100.00%> (ø)
aaanalysis/data_handling/_get_labels.py 100.00% <100.00%> (ø)
aaanalysis/pu_learning/_dpulearn.py 98.06% <100.00%> (+0.32%) ⬆️

... and 2 files with indirect coverage changes

Components Coverage Δ
cpp_core 94.95% <ø> (ø)

breimanntools and others added 7 commits July 1, 2026 04:50
… in get_labels

Reorder the get_labels Validate block so check_str(col_label) runs before
check_df(cols_required=col_label). A non-str col_label now surfaces a clear
'col_label' error instead of an internal 'cols_required' one. No behaviour
change on valid input.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
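
The reordering in the get_labels commit above can be pictured as follows. The validator here is a hypothetical stand-in written with plain `isinstance` checks; it is not the package's `check_str` / `check_df`, and the messages are assumptions.

```python
import pandas as pd


def validate_get_labels_args(df, col_label="label"):
    """Check the column name's type before using it to inspect the frame."""
    # Step 1: the type check runs first, so the error names the user-facing parameter.
    if not isinstance(col_label, str):
        raise ValueError(f"'col_label' ({col_label!r}) should be a string")
    # Step 2: only now is col_label used to look up a required column.
    if not isinstance(df, pd.DataFrame):
        raise ValueError("'df' should be a pandas DataFrame")
    if col_label not in df.columns:
        raise ValueError(f"'col_label' ('{col_label}') is missing from 'df'")


df = pd.DataFrame({"label": [1, 0, 1]})
validate_get_labels_args(df)  # valid input passes silently

try:
    validate_get_labels_args(df, col_label=5)
    message = ""
except ValueError as err:
    message = str(err)
```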
mine_negatives validated X_pos and X_unlabelled separately with the default
check_X min_n_samples=3, so it rejected n_pos<3 inputs that the manual stacking
path accepts (the >=3 floor belongs to the stacked matrix, which fit enforces).
Relax the per-matrix check to min_n_samples=1 to restore exact equivalence; add
tests for the small-positive-set equivalence and get_labels single-class/NaN
mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
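
A sketch of the relaxed check from the min_n_samples commit above, assuming the floor of 3 applies only to the stacked matrix. `check_n_samples` and `stack_with_floor` are hypothetical helpers, not the package's `check_X`.

```python
import numpy as np


def check_n_samples(X, name, min_n_samples):
    """Return X as a 2D array, or raise if it has too few rows (illustrative)."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"'{name}' should be 2-dimensional")
    if X.shape[0] < min_n_samples:
        raise ValueError(
            f"'{name}' should have at least {min_n_samples} samples, got {X.shape[0]}"
        )
    return X


def stack_with_floor(X_pos, X_unlabeled, min_total=3):
    """Require one row per input matrix, and min_total rows only after stacking."""
    X_pos = check_n_samples(X_pos, "X_pos", min_n_samples=1)
    X_unlabeled = check_n_samples(X_unlabeled, "X_unlabeled", min_n_samples=1)
    X = np.vstack([X_pos, X_unlabeled])
    return check_n_samples(X, "X", min_n_samples=min_total)


# A single positive is accepted because the stacked matrix has 6 rows.
X_small_pos = stack_with_floor(np.ones((1, 2)), np.zeros((5, 2)))
```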
…sistency

The rest of the package spells it 'unlabeled' (American, 85 uses) and abbreviates
the marker as label_unl / n_unl_to_neg; the new public mine_negatives parameter
used the British two-L 'X_unlabelled'. Rename the new/unreleased parameter, its
match helper, docstrings, tests, cheat-sheet and release-notes entries, and
re-execute the example notebook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With no pre-labeled negatives, n_neg (total) and n_unl_to_neg (from the pool)
are always equivalent in mine_negatives, so exposing both was redundant. Replace
them with a single required n_neg (the method is new/unreleased, so non-breaking);
it calls fit(n_unl_to_neg=n_neg) internally. Update docstring, tests, and the
re-executed example notebook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d error

mine_negatives delegated n_neg validation to fit (which sees it as n_unl_to_neg),
so an invalid n_neg raised an error naming the internal parameter. Validate n_neg
explicitly in the frontend so the message names n_neg, and assert the name in the
negative test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-mask

# Conflicts:
#	docs/source/index/release_notes.rst
@breimanntools breimanntools marked this pull request as ready for review July 2, 2026 12:48
…d + mask_neg_); remove the method

Instead of a separate mine_negatives method, extend fit to accept the positives/unlabeled
split directly: pass X_pos + X_unlabeled (an alternative to X + labels), and read the new
dPULearn.mask_neg_ attribute for the boolean mask of reliable negatives (over X_unlabeled in
the split mode, over X otherwise). fit still returns self (sklearn contract preserved), so no
output-parameter anti-pattern. Removes the mine_negatives method + its example notebook; folds
the demo into dpul_fit; updates get_labels docstring, cheat sheet, and release notes. Tests
rewritten to the split-mode + mask_neg_ (incl. manual-equivalence, both-modes guard, returns-self);
271 dpulearn + api-meta tests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@breimanntools breimanntools changed the title from "feat: dPULearn.mine_negatives + get_labels + named sample colors (prototype #308)" to "feat(pu): dPULearn.fit split input (X_pos/X_unlabeled + mask_neg_) + aa.get_labels + sample colors" Jul 3, 2026

@breimanntools breimanntools left a comment

Owner Author

Looks good. Please make sure that the alternative is clearly described in the example notebook and that both options are clearly introduced as two options (perhaps as bullet points at the beginning of the example notebook, before each parameter itself is introduced).

…p front (review)

Per PR review: the dpul_fit example notebook now opens by presenting the two input
options (Option 1: X + labels; Option 2: X_pos + X_unlabeled -> mask_neg_) as bullet
points before any parameter is demonstrated, and labels the split-input section as
Option 2 so the alternative is clearly described.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@breimanntools breimanntools merged commit 0877b3d into master Jul 3, 2026
13 checks passed
@breimanntools breimanntools deleted the feat/dpulearn-mined-mask branch July 3, 2026 18:05