insurance-cure

Mixture cure models for insurance non-claimer scoring.

The problem

Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:

Structural non-claimers: policyholders who would never claim, however long you observed them. Example: a 60-year-old with 9 years of NCB (no-claims bonus) driving 5,000 miles a year.
Lucky susceptibles: policyholders who are genuinely at risk but happened not to claim this year.

These two groups behave differently over multi-year retention horizons. The structural immune cohort will never generate claim cost regardless of tenure. The low-hazard susceptible will eventually claim.

A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.

What this library does

insurance-cure fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.

The population survival function:

S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]

pi(z) = P(susceptible), logistic regression incidence sub-model
S_u(t | x) = survival for susceptibles, Weibull/log-normal/Cox latency
[1 - pi(z)] = cure fraction: P(never experiences event)
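
As a quick numerical check of the formula above, the sketch below evaluates S_pop for one policyholder with a logistic incidence and a Weibull latency. It is plain Python and does not use the library; the function names, the linear predictor value and the Weibull parameters are assumptions made up for illustration.

```python
import math

def logistic(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def weibull_survival(t, shape, scale):
    """S_u(t) for a Weibull latency: exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Assumed values: incidence linear predictor 0.4 for this policyholder,
# Weibull shape 1.2 and scale 36 months for the latency.
pi = logistic(0.4)
curve = {t: population_survival(t, pi, shape=1.2, scale=36.0)
         for t in (0, 12, 24, 36, 60, 600)}

for t, s in curve.items():
    print(f"t={t:>3} months  S_pop={s:.4f}")
print(f"cure fraction 1 - pi = {1.0 - pi:.4f}")
```

The curve starts at 1 and flattens out at the cure fraction 1 - pi rather than at zero, which is the defining feature of a cure model.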

Estimation via EM algorithm (Peng & Dear 2000; Sy & Taylor 2000). Multiple restarts to handle multimodality. Bootstrap standard errors available.

To the author's knowledge, no other pip-installable Python package provides covariate-aware MCMs with actuarial output. R has smcure, flexsurvcure and cuRe; insurance-cure fills that gap for Python.

Installation

pip install insurance-cure

Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.

Quick start

import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel

# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)

# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())

# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())

# Outputs
cure_scores = model.predict_cure_fraction(df)      # P(immune) per policy
suscept = model.predict_susceptibility(df)          # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])

# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())

Models

WeibullMixtureCure (recommended)

Weibull AFT latency. Clean parametric extrapolation. Best default choice.

from insurance_cure import WeibullMixtureCure

model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,        # EM restarts — use >=5 for production
    bootstrap_se=True,    # Bootstrap SEs — slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")

LogNormalMixtureCure

Log-normal latency. Better when the conditional hazard peaks then falls — sometimes fits pet or travel data better than Weibull.

from insurance_cure import LogNormalMixtureCure

model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)
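
To see what "peaks then falls" means, the sketch below compares a log-normal hazard with a Weibull hazard on a monthly grid. It is standalone Python, not library code; the parameter values (mu, sigma, shape, scale) are assumptions chosen only to make the shapes visible.

```python
import math

def lognormal_hazard(t, mu, sigma):
    """Hazard f(t) / S(t) of a log-normal latency distribution."""
    z = (math.log(t) - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    surv = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)
    return pdf / surv

def weibull_hazard(t, shape, scale):
    """Hazard of a Weibull latency: monotone in t for any fixed shape."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

times = list(range(1, 121))  # months 1..120
ln_haz = [lognormal_hazard(t, mu=3.0, sigma=0.8) for t in times]
wb_haz = [weibull_hazard(t, shape=1.2, scale=36.0) for t in times]

peak_month = times[ln_haz.index(max(ln_haz))]
print(f"log-normal hazard peaks at month {peak_month}")
print(f"log-normal hazard at months 1, peak, 120: "
      f"{ln_haz[0]:.4f}, {max(ln_haz):.4f}, {ln_haz[-1]:.4f}")
print(f"Weibull hazard at months 1, 60, 120: "
      f"{wb_haz[0]:.4f}, {wb_haz[59]:.4f}, {wb_haz[119]:.4f}")
```

The log-normal hazard has an interior maximum, while the Weibull hazard with shape above 1 only rises. If the empirical hazard among claimants shows a hump, that is a reason to try the log-normal latency.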

CoxMixtureCure

Semiparametric Cox PH latency. Nonparametric baseline hazard — most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.

from insurance_cure import CoxMixtureCure

model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)

PromotionTimeCure

Non-mixture (promotion time) cure model. Population-level proportional hazards structure. Included as a comparison model. The cure fraction emerges from the asymptote; there is no explicit incidence sub-model.

from insurance_cure import PromotionTimeCure

model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)

Diagnostics

Sufficient follow-up test

The Maller-Zhou Qn test is mandatory. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers. The cure fraction estimate will be upwardly biased.

from insurance_cure.diagnostics import sufficient_followup_test

result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
#   Qn statistic      : 3.2194
#   p-value           : 0.0006
#   ...
#   Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
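
The bias from a short window can be illustrated without the Qn test itself. The sketch below is not the Maller-Zhou test and not library code: it simulates a population with a known 40% cure fraction, censors everyone at the end of the observation window, and reads off the Kaplan-Meier plateau as a naive cure fraction estimate. The helper names and the Weibull parameters are assumptions.

```python
import math
import random

def kaplan_meier_tail(durations, events):
    """Kaplan-Meier survival estimate at the largest observed time."""
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group tied times; events are counted before censorings leave the risk set.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
        n_at_risk -= removed
    return surv

def simulate_panel(n, cure_fraction, window, rng):
    """Immune policies never claim; susceptibles claim at a Weibull(shape 1.2, scale 36) time."""
    durations, events = [], []
    for _ in range(n):
        immune = rng.random() < cure_fraction
        t = math.inf if immune else rng.weibullvariate(36.0, 1.2)  # (scale, shape)
        durations.append(min(t, window))
        events.append(1 if t <= window else 0)
    return durations, events

rng = random.Random(7)
tails = {}
for window in (12.0, 36.0, 240.0):
    durations, events = simulate_panel(5000, 0.40, window, rng)
    tails[window] = kaplan_meier_tail(durations, events)
    print(f"window={window:>5.0f} months  naive cure estimate={tails[window]:.3f}")
```

With a 12-month window most censored policies are susceptibles who have not claimed yet, so the plateau sits far above the true 0.40. Only the long window recovers it.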

Cure scorecard

from insurance_cure.diagnostics import CureScorecard

scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
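
The decile check described in the comments above can be sketched in a few lines. This is an illustration of the idea, not the CureScorecard implementation: `decile_event_rates` is a hypothetical helper, and the scores and events are synthetic.

```python
import random

def decile_event_rates(cure_scores, events, bins=10):
    """Observed event rate per bin, ordered from lowest to highest cure score."""
    ranked = sorted(zip(cure_scores, events))
    n = len(ranked)
    rates = []
    for b in range(bins):
        lo, hi = b * n // bins, (b + 1) * n // bins
        chunk = ranked[lo:hi]  # assumes n >= bins, so no chunk is empty
        rates.append(sum(event for _, event in chunk) / len(chunk))
    return rates

# Synthetic example: claim probability falls as the cure score rises.
rng = random.Random(0)
scores = [rng.random() for _ in range(5000)]
events = [1 if rng.random() < 0.6 * (1.0 - s) else 0 for s in scores]

rates = decile_event_rates(scores, events)
for decile, rate in enumerate(rates, start=1):
    print(f"decile {decile:>2}: event rate {rate:.3f}")
```

A well-ranked model shows event rates falling from decile 1 to decile 10. A flat or non-monotone profile suggests the incidence sub-model is not separating the groups.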

Insurance applications

UK motor: First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.

Pet insurance: First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.

Travel insurance: Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.

Where MCM does NOT apply: Buildings (flood, subsidence). Return periods exceed practical follow-up windows. The Qn test will reject sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.

Synthetic data

from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel

# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,    # Weibull scale in months (about 63% of susceptibles have claimed by then)
    censoring_rate=0.15,   # annual lapse rate
    seed=42,
)

# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)

The true latent immune status is included as is_immune for validation. This column is not available in real data.

EM algorithm details

The EM algorithm decouples into two standard sub-problems at each iteration:

E-step: For censored observation i:

w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]

For observed events: w_i = 1 (certainly susceptible).

M-step:

Logistic regression for the incidence coefficients gamma, using w_i as soft labels
Weighted Weibull/log-normal MLE for latency parameters, using w_i as case weights

The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.
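
The E-step and M-step above can be sketched end to end in a deliberately simplified setting: no covariates (a single pi) and an exponential latency instead of Weibull or log-normal, so both M-step updates are closed-form. This is an assumption-laden illustration in plain Python, not the library's fitting code; all function names are hypothetical.

```python
import math
import random

def simulate(n, pi_true, rate_true, window, rng):
    """(duration, event) pairs from a cure population, censored at `window`."""
    data = []
    for _ in range(n):
        susceptible = rng.random() < pi_true
        t = rng.expovariate(rate_true) if susceptible else math.inf
        data.append((min(t, window), 1 if t <= window else 0))
    return data

def log_likelihood(data, pi, rate):
    """Observed-data log-likelihood of the simplified mixture cure model."""
    ll = 0.0
    for t, event in data:
        if event:
            ll += math.log(pi) + math.log(rate) - rate * t
        else:
            ll += math.log(pi * math.exp(-rate * t) + 1.0 - pi)
    return ll

def em_fit(data, pi=0.5, rate=0.1, n_iter=100):
    """EM iterations; returns the estimates and the log-likelihood path."""
    n_events = sum(event for _, event in data)
    path = [log_likelihood(data, pi, rate)]
    for _ in range(n_iter):
        # E-step: w_i = 1 for events, posterior susceptibility for censored.
        weights = []
        for t, event in data:
            if event:
                weights.append(1.0)
            else:
                s_u = math.exp(-rate * t)
                weights.append(pi * s_u / (pi * s_u + 1.0 - pi))
        # M-step: mean weight for pi, weighted exponential MLE for the rate.
        pi = sum(weights) / len(data)
        rate = n_events / sum(w * t for w, (t, _) in zip(weights, data))
        path.append(log_likelihood(data, pi, rate))
    return pi, rate, path

rng = random.Random(42)
data = simulate(n=3000, pi_true=0.6, rate_true=1 / 36, window=120.0, rng=rng)
pi_hat, rate_hat, path = em_fit(data)
print(f"pi_hat={pi_hat:.3f}  mean months to claim={1 / rate_hat:.1f}")
print(f"log-likelihood: start={path[0]:.1f}  end={path[-1]:.1f}")
```

With covariates, the update for pi becomes a weighted logistic regression and the rate update becomes a weighted parametric survival fit, but the loop structure is the same. The log-likelihood path never decreases, which makes it a useful convergence diagnostic.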

Design choices

EM over direct MLE. Direct MLE of the full MCM log-likelihood is fragile near the boundaries (cure fraction near 0 or 1), where the Hessian can become singular or fail to be negative definite. EM increases the observed-data log-likelihood at every iteration. The M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. This is the approach taken by smcure in R.

Separate incidence and latency formulae. Following smcure's cureform / formula convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency.

Multiple restarts. The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.

Bootstrap SEs. EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood, which is numerically involved. Bootstrap is the approach smcure takes for standard errors; it is implemented here via joblib parallelism, and the WeibullMixtureCure example above uses n_bootstrap=200.
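
The bootstrap idea can be sketched as follows. This is a sequential, standalone illustration, not the library's joblib-based code: `bootstrap_se` is a hypothetical helper, and a cheap stand-in estimator replaces the full EM refit that a real run would perform on each resample.

```python
import random
import statistics

def bootstrap_se(rows, estimator, n_boot=200, seed=0):
    """Standard error of estimator(rows) from policy-level resampling with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    replicates = []
    for _ in range(n_boot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

def no_claim_fraction(rows):
    """Stand-in estimator: share of policies with no observed claim."""
    return sum(1 for _, event in rows if event == 0) / len(rows)

# Toy data (assumed): 1,000 policies as (duration, event) pairs, 40% never claiming.
rows = [(60.0, 0)] * 400 + [(18.0, 1)] * 600
se = bootstrap_se(rows, no_claim_fraction, n_boot=200, seed=1)
print(f"bootstrap SE of the no-claim fraction: {se:.4f}")
```

Each replicate is independent of the others, which is why the refits parallelise well across cores.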

References

Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model

Burning Cost — actuarial Python for UK pricing teams.

Performance

No formal benchmark yet. Mixture cure models are slower than standard survival models because the EM algorithm iterates until convergence, and multiple restarts (n_em_starts=5) are needed to reduce the risk of stopping at a local maximum.

On 3,000-policy synthetic motor datasets, WeibullMixtureCure with 5 EM restarts and no bootstrap SEs converges in under 30 seconds. With bootstrap SEs (B=200, n_jobs=-1), expect 5–15 minutes on a 4-core machine. The EM typically converges in 30–80 iterations; convergence is faster when the cure fraction is near 0.3–0.5 and slower near the boundaries.

The library adds value over a standard Poisson GLM when: (1) the observation window is long enough for the Maller-Zhou Qn test to confirm sufficient follow-up, and (2) the cure fraction is substantial (> 15%). For UK motor with 5+ year panels, cure fractions of 35–50% are common for NCB band 5+. A Poisson GLM applies a single claim rate to the whole segment, so it overstates cumulative claim risk by a margin that grows with the cure fraction and widens over multi-year retention horizons.
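
A small worked comparison shows how the gap opens up over the retention horizon. The sketch assumes a 40% cure fraction and an exponential latency with a mean of 36 months (both illustrative, not estimates from real data), and calibrates a single constant hazard to reproduce the same 12-month claim probability.

```python
import math

pi, rate = 0.6, 1 / 36  # assumed: 60% susceptible, mean 36 months to first claim

def mcm_claim_prob(t):
    """P(first claim by month t) under the mixture cure model."""
    return pi * (1.0 - math.exp(-rate * t))

# Constant hazard calibrated to match the cure model at 12 months.
flat_rate = -math.log(1.0 - mcm_claim_prob(12)) / 12

def flat_claim_prob(t):
    """P(first claim by month t) under a single constant hazard."""
    return 1.0 - math.exp(-flat_rate * t)

horizons = (12, 24, 36, 60)
table = [(t, mcm_claim_prob(t), flat_claim_prob(t)) for t in horizons]
for t, mcm, flat in table:
    print(f"{t:>2} months  cure model={mcm:.3f}  constant hazard={flat:.3f}  "
          f"ratio={flat / mcm:.2f}")
```

The two models agree at 12 months by construction. Beyond that the constant-hazard projection keeps accumulating claims from policies the cure model treats as immune, so the ratio rises with the horizon.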
