Mixture cure models for insurance non-claimer scoring.
Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:
- Structural non-claimers — policyholders who would never claim regardless of how long you observed them. A 60-year-old with 9 years NCB driving 5,000 miles a year.
- Lucky susceptibles — policyholders who are genuinely at risk but happened not to claim this year.
These two groups behave differently over multi-year retention horizons. The structural immune cohort will never generate claim cost regardless of tenure. The low-hazard susceptible will eventually claim.
A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.
insurance-cure fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.
The population survival function:
S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]
- pi(z) = P(susceptible), from the logistic regression incidence sub-model
- S_u(t | x) = survival function for susceptibles (Weibull, log-normal or Cox latency)
- 1 - pi(z) = cure fraction: P(never experiences the event)
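As a quick illustration of the formula, here is a minimal sketch for a single policyholder with a Weibull latency. The function names and parameter values are hypothetical and are not part of the insurance-cure API. In a fitted model pi would depend on z and the Weibull scale on x; both are fixed here to keep the arithmetic visible.

```python
import math

def weibull_survival(t, shape, scale):
    """Latency survival S_u(t) for susceptibles under a Weibull model."""
    return math.exp(-((t / scale) ** shape))

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi), where pi = P(susceptible)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Assumed values for illustration: 60% susceptible, shape 1.2, scale 36 months.
pi, shape, scale = 0.60, 1.2, 36.0
curve = {t: population_survival(t, pi, shape, scale) for t in (0, 12, 36, 120, 600)}
for t, s in curve.items():
    print(f"t = {t:>3} months  S_pop = {s:.4f}")
```

The curve starts at 1 and flattens out at the cure fraction 1 - pi = 0.40 rather than falling to zero, which is the defining feature of a cure model.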
Estimation via EM algorithm (Peng & Dear 2000; Sy & Taylor 2000). Multiple restarts to handle multimodality. Bootstrap standard errors available.
No other pip-installable Python package provides covariate-aware MCMs with actuarial output. R has smcure, flexsurvcure and cuRe; Python has no equivalent. insurance-cure fills that gap.
pip install insurance-cure
Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.
import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel
# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)
# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())
# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())
# Outputs
cure_scores = model.predict_cure_fraction(df) # P(immune) per policy
suscept = model.predict_susceptibility(df) # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])
# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())
Weibull AFT latency. Clean parametric extrapolation. Best default choice.
from insurance_cure import WeibullMixtureCure
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,      # EM restarts: use >=5 for production
    bootstrap_se=True,  # Bootstrap SEs: slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
Log-normal latency. Better when the conditional hazard peaks then falls; sometimes fits pet or travel data better than Weibull.
from insurance_cure import LogNormalMixtureCure
model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)
Semiparametric Cox PH latency. Nonparametric baseline hazard: most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.
from insurance_cure import CoxMixtureCure
model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)
Non-mixture (promotion time) cure model. Population-level proportional hazards structure. Include as comparison model. The cure fraction emerges from the asymptote; there is no explicit incidence sub-model.
from insurance_cure import PromotionTimeCure
model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)
The Maller-Zhou Qn test is mandatory. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers. The cure fraction estimate will be upwardly biased.
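The size of that bias can be seen with a little arithmetic. The sketch below uses assumed parameters (true cure fraction 40%, Weibull shape 1.2, scale 36 months) and a hypothetical helper function. It ignores lapses and simply computes the expected share of policies that are still claim-free at the end of windows of different lengths.

```python
import math

def share_without_claim(window_months, cure_fraction, shape, scale):
    """Expected share of policies with no claim by the end of the window.

    Counts the truly immune plus susceptibles whose Weibull claim time
    has not yet arrived. Lapses are ignored to keep the arithmetic simple.
    """
    still_waiting = math.exp(-((window_months / scale) ** shape))
    return cure_fraction + (1.0 - cure_fraction) * still_waiting

TRUE_CURE = 0.40  # assumed for illustration
apparent = {
    w: share_without_claim(w, TRUE_CURE, shape=1.2, scale=36.0)
    for w in (12, 36, 60, 120)
}
for w, share in apparent.items():
    print(f"{w:>3}-month window: {share:.1%} claim-free (true cure fraction {TRUE_CURE:.0%})")
```

Under these assumptions a 12-month window leaves about 86% of policies claim-free against a true cure fraction of 40%, and the gap only closes as the window extends well beyond the latency scale.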
from insurance_cure.diagnostics import sufficient_followup_test
result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
# Qn statistic : 3.2194
# p-value : 0.0006
# ...
# Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
from insurance_cure.diagnostics import CureScorecard
scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
UK motor: First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.
Pet insurance: First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.
Travel insurance: Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.
Where MCM does NOT apply: Buildings (flood, subsidence). Return periods exceed practical follow-up windows. The Qn test will reject sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.
from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel
# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,   # months to first claim for susceptibles
    censoring_rate=0.15,  # annual lapse rate
    seed=42,
)
# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)
The true latent immune status is included as is_immune for validation. This column is not available in real data.
The EM algorithm decouples into two standard sub-problems at each iteration:
E-step: For censored observation i:
w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]
For observed events: w_i = 1 (certainly susceptible).
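The E-step weight is simple enough to write out directly. This is a minimal sketch of the formula above for one observation; the function name is hypothetical and the library's internal implementation may differ (it would normally be vectorised).

```python
def posterior_susceptibility(event, pi, surv_u):
    """E-step weight w_i: posterior probability that policy i is susceptible.

    event  : 1 if a claim was observed, 0 if censored
    pi     : current incidence estimate pi(z_i) = P(susceptible)
    surv_u : current latency survival S_u(t_i | x_i) at the observed time
    """
    if event == 1:
        return 1.0  # a claimant is susceptible by definition
    numerator = pi * surv_u
    return numerator / (numerator + (1.0 - pi))

# Assumed inputs: prior susceptibility 0.6 in all three cases.
w_short = posterior_susceptibility(event=0, pi=0.6, surv_u=0.90)  # short claim-free tenure
w_long = posterior_susceptibility(event=0, pi=0.6, surv_u=0.05)   # long claim-free tenure
w_claim = posterior_susceptibility(event=1, pi=0.6, surv_u=0.50)  # observed claim
print(round(w_short, 4), round(w_long, 4), w_claim)
```

A censored policy with short tenure keeps a weight close to its prior pi, because little has been learned yet. A long claim-free tenure pulls the weight towards zero.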
M-step:
- Logistic regression for gamma using w_i as soft labels
- Weighted Weibull/log-normal MLE for latency parameters, using w_i as case weights
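The incidence half of the M-step can be sketched as a logistic regression fitted to fractional targets. The code below is a standalone numpy illustration using Newton-Raphson, not the library's own code (which, per the design notes, delegates to scipy/sklearn solvers). The function name and toy data are assumptions, and the weighted latency fit is omitted.

```python
import numpy as np

def fit_soft_label_logistic(Z, w, n_iter=50, tol=1e-10):
    """Fit gamma by maximising sum_i [w_i log p_i + (1 - w_i) log(1 - p_i)].

    Z : (n, k) incidence covariates, without an intercept column
    w : (n,) E-step weights in [0, 1], used as soft labels
    """
    Z1 = np.column_stack([np.ones(len(Z)), Z])  # prepend an intercept column
    gamma = np.zeros(Z1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z1 @ gamma))
        grad = Z1.T @ (w - p)                          # score vector
        hess = (Z1 * (p * (1.0 - p))[:, None]).T @ Z1  # negative Hessian
        step = np.linalg.solve(hess, grad)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

# Toy data: one binary covariate, soft labels averaging 0.2 and 0.7 by group.
Z = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float).reshape(-1, 1)
w = np.array([0.1, 0.3, 0.2, 0.2, 1.0, 0.5, 0.6, 0.7])
gamma = fit_soft_label_logistic(Z, w)
p_group0 = 1.0 / (1.0 + np.exp(-gamma[0]))
p_group1 = 1.0 / (1.0 + np.exp(-(gamma[0] + gamma[1])))
print(gamma, p_group0, p_group1)
```

Because the toy model is saturated, the fitted probabilities reproduce the group means of the soft labels, which makes the result easy to check by hand.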
The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.
EM over direct MLE. Direct MLE of the full MCM log-likelihood tends to run into singular or indefinite Hessians near the boundaries (cure fraction near 0 or 1). EM increases the observed-data likelihood monotonically at each iteration. The M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. This is the approach taken by smcure in R.
Separate incidence and latency formulae. Following smcure's cureform / formula convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency.
Multiple restarts. The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.
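The restart logic itself is generic: run a local optimiser from several starting points and keep the fit with the highest log-likelihood. The sketch below shows the idea on a toy bimodal objective. The crude hill climber stands in for a single EM run, and all names and starting values are hypothetical rather than taken from the library.

```python
import math

def toy_loglik(theta):
    """Toy bimodal 'log-likelihood': local peak near -2, global peak near 3."""
    return math.log(0.3 * math.exp(-0.5 * (theta + 2.0) ** 2)
                    + 0.7 * math.exp(-0.5 * (theta - 3.0) ** 2))

def hill_climb(loglik, start, step=0.5, min_step=1e-6):
    """Crude local optimiser standing in for one EM run from one start."""
    theta, best = start, loglik(start)
    while step > min_step:
        moved = False
        for candidate in (theta - step, theta + step):
            value = loglik(candidate)
            if value > best:
                theta, best, moved = candidate, value, True
        if not moved:
            step /= 2.0  # no improvement at this scale: refine the search
    return theta, best

def best_of_restarts(loglik, starts):
    """Run the local optimiser from every start and keep the best fit."""
    fits = [hill_climb(loglik, s) for s in starts]
    return max(fits, key=lambda fit: fit[1])

starts = [-4.0, -1.5, 0.5, 2.5, 5.0]       # five assumed starting values
single = hill_climb(toy_loglik, -4.0)      # one run: trapped at the local peak
multi = best_of_restarts(toy_loglik, starts)
print(f"single start: theta = {single[0]:.3f}, loglik = {single[1]:.3f}")
print(f"five starts : theta = {multi[0]:.3f}, loglik = {multi[1]:.3f}")
```

The single run from -4 settles on the lower peak, while the best of five runs finds the higher one. The same selection rule applies when each run is a full EM fit.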
Bootstrap SEs. EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood — numerically involved. Bootstrap (B=200) is the smcure default and is implemented here via joblib parallel.
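For readers unfamiliar with the technique, a nonparametric bootstrap SE can be sketched in a few lines. This illustration uses only the standard library and a deliberately simple estimator (the claim-free share). In the library each replicate would instead be a full EM refit, run in parallel via joblib. The function name and data are assumptions.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_bootstrap=200, seed=0):
    """Standard error of estimator(data) from resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_bootstrap):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

def claim_free_share(claimed_flags):
    """Share of policies with no observed claim."""
    return 1.0 - statistics.fmean(claimed_flags)

# Assumed toy portfolio: 400 policies, 240 with a claim and 160 without.
claimed = [1] * 240 + [0] * 160
se = bootstrap_se(claimed, claim_free_share, n_bootstrap=200, seed=0)
print(f"estimate = {claim_free_share(claimed):.3f}, bootstrap SE = {se:.4f}")
```

For this estimator the analytic SE is sqrt(0.4 * 0.6 / 400), roughly 0.0245, so the bootstrap value can be sanity-checked against it. No such closed form exists for the MCM parameters, which is why resampling is used.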
- Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
- Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
- Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
- Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
- Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model
Burning Cost — actuarial Python for UK pricing teams.
No formal benchmark yet. Mixture cure models are slower than standard survival models because the EM algorithm iterates until convergence, and multiple restarts (n_em_starts=5) are required to avoid local maxima.
On 3,000-policy synthetic motor datasets, WeibullMixtureCure with 5 EM restarts and no bootstrap SEs converges in under 30 seconds. With bootstrap SEs (B=200, n_jobs=-1), expect 5–15 minutes on a 4-core machine. The EM typically converges in 30–80 iterations; convergence is faster when the cure fraction is near 0.3–0.5 and slower near the boundaries.
The library adds value over a standard Poisson GLM when: (1) the observation window is long enough for the Maller-Zhou Qn test to confirm sufficient follow-up, and (2) the cure fraction is substantial (> 15%). For UK motor with 5+ year panels, cure fractions of 35–50% are common for NCB band 5+. A Poisson GLM will overstate claim risk for this segment by a factor proportional to the cure fraction, which compounds over multi-year retention horizon projections.