This repository tests and compares several imputation methods for missing data before performing Linear Discriminant Analysis (LDA) classification.
The experiment follows three stages: data preparation, imputation, and classification.
Two datasets are used:
- Banknote authentication — a real-world dataset with 4 numerical features and a binary class.
- Synthetic Gaussian — generated by sampling from 2 multivariate Gaussian distributions. The covariance structure, number of features, missingness probability, and missingness mechanism (MCAR, MAR, MNAR) are configurable via command line.
Missing values are then introduced into the data according to the chosen mechanism and probability.
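As a rough illustration of the data stage, the sketch below samples two Gaussian classes and then removes entries completely at random (MCAR). It is not the repository's code: the function names `make_two_gaussians` and `mask_mcar`, the identity covariance, and the unit shift between class means are assumptions chosen only for the example. MAR and MNAR masking would make the mask depend on observed or unobserved values respectively and are not shown.

```python
import numpy as np

def make_two_gaussians(n_per_class, n_features, rng):
    """Sample two classes from multivariate Gaussians sharing one covariance."""
    mean0 = np.zeros(n_features)
    mean1 = np.ones(n_features)   # assumed class separation, for illustration only
    cov = np.eye(n_features)      # assumed identity covariance
    X0 = rng.multivariate_normal(mean0, cov, size=n_per_class)
    X1 = rng.multivariate_normal(mean1, cov, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def mask_mcar(X, prob, rng):
    """Return a copy of X where each entry is NaN independently with probability prob."""
    X_missing = X.copy()
    mask = rng.random(X.shape) < prob
    X_missing[mask] = np.nan
    return X_missing

rng = np.random.default_rng(0)
X, y = make_two_gaussians(n_per_class=100, n_features=5, rng=rng)
X_missing = mask_mcar(X, prob=0.1, rng=rng)
```

Because the mask is drawn independently of the data, the observed share of missing entries should be close to `prob`, though not exactly equal to it.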
Each missing value is filled in before classification. The methods compared are:
- Grand Mean — replace missing values with the overall feature mean.
- Conditional Mean — replace missing values with the mean of the feature within the sample's class.
- Nearest Neighbour — replace with the value from the closest observed sample.
- Regression — predict missing values from observed features using linear regression.
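The two mean-based methods in the list above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the repository's implementation: the function names are hypothetical, missing values are assumed to be encoded as NaN, and the conditional version assumes class labels are known (as they are for training data).

```python
import numpy as np

def grand_mean_impute(X):
    """Fill NaNs in each column with that column's mean over observed values."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def conditional_mean_impute(X, y):
    """Fill NaNs with the column mean computed within each sample's class."""
    X_filled = X.copy()
    for label in np.unique(y):
        rows = y == label
        class_means = np.nanmean(X[rows], axis=0)
        X_filled[rows] = np.where(np.isnan(X[rows]), class_means, X[rows])
    return X_filled

X = np.array([[1.0, 10.0],
              [3.0, np.nan],
              [np.nan, 30.0],
              [7.0, 50.0]])
y = np.array([0, 0, 1, 1])

X_grand = grand_mean_impute(X)          # missing entries become 30.0 and 11/3
X_cond = conditional_mean_impute(X, y)  # missing entries become 10.0 and 7.0
```

In this toy matrix the missing second feature of row 2 is filled with 30.0 under the grand mean but with 10.0 under the conditional mean, because only one class-0 sample has that feature observed.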
The multiple-imputation variants of nearest neighbour and regression repeat the imputation several times and aggregate the resulting predictions via majority vote.
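The majority-vote step might look like the sketch below, which assumes binary labels and one row of predictions per imputed copy of the data. The function name `majority_vote` and the tie-breaking rule (ties go to class 0) are assumptions for illustration; the repository may handle ties differently.

```python
import numpy as np

def majority_vote(predictions):
    """Aggregate binary predictions of shape (n_imputations, n_samples) into one label per sample."""
    predictions = np.asarray(predictions)
    votes_for_one = predictions.sum(axis=0)
    # Ties (possible with an even number of imputations) go to class 0 here.
    return (votes_for_one * 2 > predictions.shape[0]).astype(int)

# Three imputed copies of the data, four test samples.
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]])
labels = majority_vote(preds)  # array([0, 1, 0, 0])
```

Using an odd number of imputations avoids ties altogether in the binary case.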
After imputation, an LDA classifier is fit on the training data and evaluated on the test set.
Most imputation methods produce a single imputed matrix, so sklearn's standard LinearDiscriminantAnalysis is used. The exception is conditional mean: since each test sample is imputed twice (once assuming class 0, once assuming class 1), prediction requires a custom BinaryLDA classifier that scores each class's feature matrix separately.
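To make the two-matrix prediction concrete, here is a small NumPy sketch of one way such a classifier could work. It is not the repository's BinaryLDA: the function names are hypothetical, and the scoring rule (the standard linear discriminant with pooled covariance, evaluated for class k on the matrix imputed under the assumption of class k) is an assumption based on the description above.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, inverse pooled covariance, and priors for two classes."""
    means, priors = [], []
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for label in (0, 1):
        Xk = X[y == label]
        means.append(Xk.mean(axis=0))
        priors.append(len(Xk) / len(X))
        centered = Xk - means[-1]
        pooled += centered.T @ centered
    pooled /= len(X) - 2
    return np.array(means), np.linalg.inv(pooled), np.array(priors)

def discriminant(X, mean, precision, prior):
    """Linear discriminant score of each row of X for one class."""
    w = precision @ mean
    return X @ w - 0.5 * mean @ w + np.log(prior)

def predict_two_matrices(X_as_0, X_as_1, means, precision, priors):
    """Score class k on the matrix imputed assuming class k, then pick the larger score."""
    s0 = discriminant(X_as_0, means[0], precision, priors[0])
    s1 = discriminant(X_as_1, means[1], precision, priors[1])
    return (s1 > s0).astype(int)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_train = np.repeat([0, 1], 50)
means, precision, priors = fit_lda(X_train, y_train)

# The second test sample has its second feature missing, so it is filled
# with the class-0 mean in one matrix and the class-1 mean in the other.
X_test_as_0 = np.array([[0.0, 0.0], [4.0, means[0][1]]])
X_test_as_1 = np.array([[0.0, 0.0], [4.0, means[1][1]]])
pred = predict_two_matrices(X_test_as_0, X_test_as_1, means, precision, priors)
```

Fully observed samples are identical in both matrices, so for them this reduces to ordinary LDA prediction.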
The concept of conditional mean and the full experimental results are described in the PDF at the root of the repository.
cd ImputationForLDA
uv sync
uv run python main.py --dimensions 5 --cov_matrice normal --probs_missingness 0.1 --type_missingness MCAR
uv run pytest

Mathieu Charbonnel
Robert J. Durrant, who supervised this scientific work.
Lohweg, Volker. (2013). Banknote authentication. UCI Machine Learning Repository.