feat: dPULearn pos/unlabelled mined-mask convenience + get_labels + named sample colors #308

Description

@breimanntools

Part of #305.

Problem

Using dPULearn to mine reliable negatives currently forces the user to pack a
combined matrix, build a 1/2 label vector, fit, then slice the mined rows
back out by index. The pattern is repeated in notebook cells 18 and 24:

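import numpy as np
import aaanalysis as aa

X = np.vstack([X_cpp_pos, X_cpp_oth])  # positives first, then unlabelled
y = np.array([1]*len(X_cpp_pos) + [2]*len(X_cpp_oth))  # 1 = positive, 2 = unlabelled
dpul = aa.dPULearn(...).fit(X, labels=y)
mined = np.asarray(dpul.labels_)[len(X_cpp_pos):] == 0  # 0 = mined reliable negative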

The same notebook also writes (df["label"] == x).astype(int).to_numpy() in
4+ places and pulls canonical sample colors out of a string-keyed dict
(aa.plot_get_cdict(...)["SAMPLES_POS"]) in cells 6 and 22.

Goal

Make the positive/unlabelled → mined-negatives flow a single call, and remove
the recurring label-vector and color-lookup plumbing. All changes are additive
and leave the current dPULearn.fit(X, labels=...) API unchanged.

Requirements

  • A dPULearn convenience accepting X_pos + X_unlabelled separately and
    returning the mined reliable-negative boolean mask (and/or subframe),
    so users don't vstack + label 1/2 + slice by hand.
  • aa.get_labels(df, positive_label=1) returning the binary int label
    vector for the (df["label"]==x).astype(int).to_numpy() pattern.
  • Expose the canonical sample colors as named constants (e.g.
    aa.COLOR_SAMPLES_POS/NEG/UNL/REL_NEG) or plot_get_clist(name="samples"),
    so users stop indexing plot_get_cdict(...) by string key.
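
A minimal sketch of how the first requirement could look, to make the intended
behaviour concrete. The helper name mine_reliable_negatives is hypothetical, and
StubPULearner is a stand-in that only mimics the fit(X, labels=...) / labels_
contract so the example runs without aaanalysis; it is not the dPULearn
algorithm. With the real estimator, aa.dPULearn(...) would be passed instead.

```python
import numpy as np


class StubPULearner:
    """Stand-in for aa.dPULearn (assumption, for illustration only).

    NOT the dPULearn algorithm: it relabels the n_neg unlabelled rows
    farthest from the positive centroid as 0.
    """

    def __init__(self, n_neg=2):
        self.n_neg = n_neg
        self.labels_ = None

    def fit(self, X, labels=None):
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        centroid = X[labels == 1].mean(axis=0)
        dist = np.linalg.norm(X - centroid, axis=1)
        dist[labels != 2] = -np.inf  # only unlabelled rows are candidates
        new_labels = labels.copy()
        new_labels[np.argsort(dist)[-self.n_neg:]] = 0  # 0 = mined negative
        self.labels_ = new_labels
        return self


def mine_reliable_negatives(model, X_pos, X_unl):
    """Hypothetical helper: fit on positives + unlabelled, return the
    boolean mask over X_unl marking mined reliable negatives."""
    X_pos, X_unl = np.asarray(X_pos), np.asarray(X_unl)
    X = np.vstack([X_pos, X_unl])                      # positives first
    y = np.array([1] * len(X_pos) + [2] * len(X_unl))  # 1 = pos, 2 = unl
    model.fit(X, labels=y)
    return np.asarray(model.labels_)[len(X_pos):] == 0


X_pos = np.array([[0.0, 0.0], [0.1, 0.0]])
X_unl = np.array([[0.2, 0.1], [5.0, 5.0], [0.0, 0.3], [-4.0, 6.0]])
mask = mine_reliable_negatives(StubPULearner(n_neg=2), X_pos, X_unl)
X_rel_neg = X_unl[mask]  # the mined rows, without manual index slicing
```

The helper wraps the existing fit call rather than replacing it, so the mask it
returns is by construction the same as the manual labels_[len(X_pos):] == 0
slice.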

KPIs / Acceptance criteria

  • The mined mask equals the current labels_[len(X_pos):]==0 result exactly
    on the canonical fixture (regression).
  • get_labels matches the manual expression on ≥2 label encodings.
  • Named color constants equal today's plot_get_cdict values (golden test).
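
The get_labels criterion could be checked along these lines. This is a sketch
under assumptions: the function body and the col_label parameter are
hypothetical (only df and positive_label appear in the requirements), and the
two encodings shown (integer and string labels) are illustrative choices.

```python
import pandas as pd


def get_labels(df, positive_label=1, col_label="label"):
    """Hypothetical sketch of the proposed aa.get_labels.

    col_label is an assumed parameter, not part of the stated requirement.
    """
    return (df[col_label] == positive_label).astype(int).to_numpy()


# Encoding 1: integer labels
df_int = pd.DataFrame({"label": [1, 0, 2, 1]})
# Encoding 2: string labels
df_str = pd.DataFrame({"label": ["pos", "neg", "unl", "pos"]})

labels_int = get_labels(df_int, positive_label=1)
labels_str = get_labels(df_str, positive_label="pos")

# Manual expressions the helper is meant to replace
manual_int = (df_int["label"] == 1).astype(int).to_numpy()
manual_str = (df_str["label"] == "pos").astype(int).to_numpy()

assert (labels_int == manual_int).all()
assert (labels_str == manual_str).all()
```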

Scope / non-goals

  • No change to the dPULearn algorithm or its existing fit signature.

Dependencies

Standards checklist

  • frontend/backend · validation block · CONFIRM-FIRST (new public symbols →
    __init__.py/__all__) · numpydoc · tests · no-print
