Imputation before LDA classification

This repository compares several methods for imputing missing data before Linear Discriminant Analysis (LDA) classification.

Pipeline

The experiment follows three stages:

1. Data

Two datasets are used:

Banknote authentication: a real-world dataset with 4 numerical features and a binary class.
Synthetic Gaussian: generated by sampling from two multivariate Gaussian distributions. The covariance structure, number of features, missingness probability, and missingness mechanism (MCAR, MAR, or MNAR) are configurable from the command line.
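
The synthetic dataset can be pictured with the minimal sketch below. It assumes one Gaussian per class and a covariance matrix shared by both classes; the function name `sample_two_gaussians` and its parameters are hypothetical and are not taken from the repository.

```python
import numpy as np


def sample_two_gaussians(n_per_class, mean0, mean1, cov, seed=0):
    """Draw n_per_class points from each of two Gaussians sharing one covariance."""
    rng = np.random.default_rng(seed)
    X0 = rng.multivariate_normal(mean0, cov, size=n_per_class)
    X1 = rng.multivariate_normal(mean1, cov, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], n_per_class)  # class 0 rows first, then class 1 rows
    return X, y


# Toy usage: 5 features, identity covariance, class means at 0 and 1.
d = 5
X, y = sample_two_gaussians(100, np.zeros(d), np.ones(d), np.eye(d))
```

A shared covariance matches the model that LDA itself assumes, which is why it is used here; the repository may offer other covariance structures.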

Missing values are then introduced into the data according to the chosen mechanism and probability.
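
The three mechanisms differ in what the probability of a value being missing depends on. The sketch below is one possible way to generate each of them, not the repository's implementation: the function `make_missing`, the median threshold, and the choice of feature 0 as the always-observed driver for MAR are all assumptions made for illustration.

```python
import numpy as np


def make_missing(X, prob, mechanism="MCAR", seed=0):
    """Return a copy of X with NaNs inserted, plus the boolean missingness mask."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rate = min(2 * prob, 1.0)  # applied to about half the entries, so the average stays near prob

    if mechanism == "MCAR":
        # Missingness is independent of the data.
        p = np.full((n, d), prob)
    elif mechanism == "MAR":
        # Missingness in features 1..d-1 depends only on feature 0,
        # which is itself never masked (assumption of this sketch).
        high = X[:, 0] > np.median(X[:, 0])
        p = np.repeat(np.where(high, rate, 0.0)[:, None], d, axis=1)
        p[:, 0] = 0.0
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing:
        # only entries above their column median can be masked.
        high = X > np.median(X, axis=0)
        p = np.where(high, rate, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    mask = rng.random((n, d)) < p
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask
```

Under MAR and MNAR the masked entries are no longer a random subset of the data, so feature means computed from the observed entries become biased; this is the effect the imputation methods have to cope with.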

2. Imputation

Each missing value is filled in before classification. The methods compared are:

Grand Mean: replace missing values with the overall feature mean.
Conditional Mean: replace missing values with the mean of the feature within the sample's class.
Nearest Neighbour: replace missing values with the corresponding values from the closest observed sample.
Regression: predict missing values from the observed features using linear regression.
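
The four methods above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the repository's code: all function names are hypothetical, the nearest-neighbour distance is taken to be the mean squared difference over the features both rows have observed, and the regression is fit on the fully observed rows only.

```python
import numpy as np


def grand_mean_impute(X):
    """Fill each NaN with the mean of its column over all observed rows."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def conditional_mean_impute(X, y):
    """Fill each NaN with the column mean computed within the row's own class."""
    X = X.copy()
    for label in np.unique(y):
        in_class = y == label
        class_means = np.nanmean(X[in_class], axis=0)
        rows, cols = np.where(np.isnan(X) & in_class[:, None])
        X[rows, cols] = class_means[cols]
    return X


def nearest_neighbour_impute(X):
    """Copy each missing entry from the closest row that has that feature observed."""
    X_out = X.copy()
    observed = ~np.isnan(X)
    for i in np.where(~observed.all(axis=1))[0]:
        for j in np.where(~observed[i])[0]:
            best, best_dist = None, np.inf
            for k in np.where(observed[:, j])[0]:
                shared = observed[i] & observed[k]
                if not shared.any():
                    continue  # no common features to compare on
                dist = np.mean((X[i, shared] - X[k, shared]) ** 2)
                if dist < best_dist:
                    best, best_dist = k, dist
            if best is not None:
                X_out[i, j] = X[best, j]
    return X_out


def regression_impute(X):
    """Predict missing features from observed ones by least squares on complete rows."""
    X_out = X.copy()
    observed = ~np.isnan(X)
    complete = X[observed.all(axis=1)]
    for i in np.where(~observed.all(axis=1))[0]:
        obs, mis = observed[i], ~observed[i]
        A = np.column_stack([np.ones(len(complete)), complete[:, obs]])  # intercept + predictors
        coef, *_ = np.linalg.lstsq(A, complete[:, mis], rcond=None)
        X_out[i, mis] = np.concatenate([[1.0], X[i, obs]]) @ coef
    return X_out


# Toy usage: the second feature is exactly twice the first; one value is missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])
filled = {
    "grand_mean": grand_mean_impute(X)[3, 1],
    "conditional_mean": conditional_mean_impute(X, y)[3, 1],
    "nearest_neighbour": nearest_neighbour_impute(X)[3, 1],
    "regression": regression_impute(X)[3, 1],
}
```

Note that the conditional mean needs the class label of each row, which is available for training data but not for test data; this is what makes its prediction step different from the others.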

The multiple imputation variants (nearest neighbour and regression) repeat the imputation several times and aggregate the resulting predictions by majority vote.
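
The majority-vote idea can be sketched as below. How the repository randomises its nearest neighbour and regression imputations is not described here, so this example swaps in a simple random-donor imputer (each missing entry is filled with a randomly chosen observed value from the same column) purely as a stand-in. The names `random_donor_impute`, `majority_vote`, and `multiple_imputation_predict` are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def random_donor_impute(X, rng):
    """Stand-in stochastic imputer: fill NaNs with random observed values per column."""
    X = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        donors = X[~missing, j]
        X[missing, j] = rng.choice(donors, size=missing.sum())
    return X


def majority_vote(predictions):
    """predictions: (n_imputations, n_samples) array of 0/1 labels; ties go to class 0."""
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) > 0.5).astype(int)


def multiple_imputation_predict(X_train, y_train, X_test,
                                impute=random_donor_impute,
                                n_imputations=5, seed=0):
    """Impute, fit LDA, and predict several times; return the majority label."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_imputations):
        clf = LinearDiscriminantAnalysis().fit(impute(X_train, rng), y_train)
        votes.append(clf.predict(impute(X_test, rng)))
    return majority_vote(votes)


# Toy usage: two well-separated classes with 10% of entries removed at random.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 3)), rng.normal(4.0, 1.0, (150, 3))])
y = np.repeat([0, 1], 150)
X[rng.random(X.shape) < 0.1] = np.nan
order = rng.permutation(len(X))
train, test = order[:200], order[200:]
y_pred = multiple_imputation_predict(X[train], y[train], X[test])
accuracy = (y_pred == y[test]).mean()
```

An odd number of imputations avoids ties in the binary vote.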

3. Prediction

After imputation, an LDA classifier is fit on the training data and evaluated on the test set.
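
For the single-imputation methods, the prediction stage amounts to a standard scikit-learn fit and score. The sketch below uses randomly generated complete data as a placeholder for an already-imputed feature matrix; the split ratio and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Placeholder for an imputed dataset: two Gaussian classes, 4 features, no NaNs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(2.0, 1.0, (100, 4))])
y = np.repeat([0, 1], 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out test set
```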

Most imputation methods produce a single imputed matrix, so scikit-learn's standard LinearDiscriminantAnalysis is used. The exception is conditional mean: the class of a test sample is unknown, so each test sample is imputed twice (once assuming class 0, once assuming class 1). Prediction therefore requires a custom BinaryLDA classifier that scores each class against its own imputed feature matrix.
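
The two-matrix prediction can be sketched as follows. This is not the repository's BinaryLDA: the class `TwoViewLDA` and the helper `impute_assuming_class` are hypothetical, and the sketch assumes that each class is scored with the full Gaussian log-density (pooled covariance) on its own imputed copy of the sample. The full quadratic form is used because the usual linear LDA score drops a term that only cancels when both classes see the same feature vector.

```python
import numpy as np


class TwoViewLDA:
    """Binary LDA that scores class k on the test matrix imputed under class k."""

    def fit(self, X, y):
        self.means_ = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
        self.priors_ = np.array([(y == k).mean() for k in (0, 1)])
        centred = X - self.means_[y]  # subtract each row's own class mean
        pooled_cov = centred.T @ centred / (len(X) - 2)
        self.precision_ = np.linalg.inv(pooled_cov)
        return self

    def score_class(self, X, k):
        """Gaussian log-density of each row under class k, up to a shared constant."""
        diff = X - self.means_[k]
        mahalanobis = np.sum(diff @ self.precision_ * diff, axis=1)
        return -0.5 * mahalanobis + np.log(self.priors_[k])

    def predict(self, X_if_0, X_if_1):
        return (self.score_class(X_if_1, 1) > self.score_class(X_if_0, 0)).astype(int)


def impute_assuming_class(X, class_mean):
    """Fill NaNs with the feature means of one hypothesised class."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = class_mean[cols]
    return X


# Toy usage: complete training data, test samples with the second feature missing.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y_train = np.repeat([0, 1], 100)
X_test = np.array([[0.0, np.nan], [3.0, np.nan]])

clf = TwoViewLDA().fit(X_train, y_train)
X_if_0 = impute_assuming_class(X_test, clf.means_[0])
X_if_1 = impute_assuming_class(X_test, clf.means_[1])
y_pred = clf.predict(X_if_0, X_if_1)
```

With this scoring, an imputed feature sits exactly at the hypothesised class mean and contributes nothing to that class's distance, so the decision is driven by the observed features.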

The concept of conditional mean and the full experimental results are described in ImputationForLDA.pdf at the root of the repository.

Getting Started

Install

cd ImputationForLDA
uv sync

Run

uv run python main.py --dimensions 5 --cov_matrice normal --probs_missingness 0.1 --type_missingness MCAR

Test

uv run pytest

Authors

Mathieu Charbonnel

Acknowledgments

Robert J. Durrant, who supervised this scientific work.

Lohweg, Volker. (2013). Banknote authentication. UCI Machine Learning Repository.
