mathieu-charbonnel/ImputationForLDA


Imputation before LDA classification

This repository tests and compares several imputation methods for missing data before performing Linear Discriminant Analysis (LDA) classification.

Pipeline

The experiment follows three stages:

1. Data

Two datasets are used:

  • Banknote authentication — a real-world dataset with 4 numerical features and a binary class.
  • Synthetic Gaussian — generated by sampling from 2 multivariate Gaussian distributions. The covariance structure, number of features, missingness probability, and missingness mechanism (MCAR, MAR, MNAR) are configurable via command line.
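As an illustration of the synthetic dataset, the sketch below samples two classes from multivariate Gaussians with NumPy. The function name `make_two_gaussians`, the identity covariance, and the `mean_shift` parameter are assumptions made for this example; they are not taken from the repository, whose covariance options are selected on the command line.

```python
import numpy as np


def make_two_gaussians(n_per_class, dim, mean_shift=1.0, seed=0):
    """Sample a binary dataset from two Gaussians sharing one covariance.

    Hypothetical helper: the identity covariance and the mean shift are
    illustrative choices, not the repository's actual settings.
    """
    rng = np.random.default_rng(seed)
    cov = np.eye(dim)                 # shared covariance (the LDA assumption)
    mean0 = np.zeros(dim)             # class 0 centred at the origin
    mean1 = np.full(dim, mean_shift)  # class 1 shifted along every axis
    X0 = rng.multivariate_normal(mean0, cov, size=n_per_class)
    X1 = rng.multivariate_normal(mean1, cov, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.concatenate([
        np.zeros(n_per_class, dtype=int),
        np.ones(n_per_class, dtype=int),
    ])
    return X, y
```

A shared covariance matrix is used here because LDA assumes both classes have the same covariance.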

Missing values are then introduced into the data according to the chosen mechanism and probability.
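The three mechanisms differ in what the probability of a value being missing depends on: nothing (MCAR), other observed values (MAR), or the missing value itself (MNAR). The sketch below shows one simple way to build such masks. The helper names, the choice of column 0 as the always-observed driver for MAR, and the median threshold are all assumptions for illustration, not the repository's implementation.

```python
import numpy as np


def missing_mask(X, prob, mechanism="MCAR", seed=0):
    """Return a boolean mask (True = missing) under a chosen mechanism.

    Hypothetical sketch; thresholds and the driver column are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if mechanism == "MCAR":
        # Every entry is missing with the same probability.
        p = np.full((n, d), prob)
    elif mechanism == "MAR":
        # Missingness depends only on column 0, which is never masked.
        driver = X[:, [0]] > np.median(X[:, 0])
        p = np.where(driver, 2 * prob, 0.0) * np.ones((n, d))
        p[:, 0] = 0.0
    elif mechanism == "MNAR":
        # Missingness depends on the value that would be hidden.
        above = X > np.median(X, axis=0)
        p = np.where(above, 2 * prob, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    p = np.clip(p, 0.0, 1.0)
    return rng.random((n, d)) < p


def apply_mask(X, mask):
    """Copy X and replace masked entries with NaN."""
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing
```

In the MAR and MNAR branches, roughly half of the eligible entries are masked at rate `2 * prob`, so the overall rate in the affected columns stays close to `prob`.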

2. Imputation

Each missing value is filled in before classification. The methods compared are:

  • Grand Mean — replace missing values with the overall feature mean.
  • Conditional Mean — replace missing values with the mean of the feature within the sample's class.
  • Nearest Neighbour — replace with the value from the closest observed sample.
  • Regression — predict missing values from observed features using linear regression.
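The first three methods can be sketched in a few lines of NumPy, with missing entries encoded as `NaN`. The function names are hypothetical, and the nearest-neighbour version makes two simplifying assumptions that may differ from the repository: donors are restricted to fully observed rows, and distance is measured only on the features the incomplete row has observed.

```python
import numpy as np


def grand_mean_impute(X):
    """Fill each NaN with the mean of its column over all samples."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def conditional_mean_impute(X, y):
    """Fill each NaN with the column mean within the sample's own class."""
    X_filled = X.copy()
    for label in np.unique(y):
        rows = y == label
        class_means = np.nanmean(X[rows], axis=0)
        X_filled[rows] = np.where(np.isnan(X[rows]), class_means, X[rows])
    return X_filled


def nearest_neighbour_impute(X):
    """Fill NaNs with values copied from the closest fully observed row.

    Assumes at least one row has no missing values.
    """
    X_filled = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    donors = X[~incomplete]
    for i in np.where(incomplete)[0]:
        observed = ~np.isnan(X[i])
        # Squared Euclidean distance on the observed features only.
        dists = np.sum((donors[:, observed] - X[i, observed]) ** 2, axis=1)
        donor = donors[np.argmin(dists)]
        X_filled[i, ~observed] = donor[~observed]
    return X_filled
```

Conditional mean imputation needs the class label of each sample, which is available for training data but not for test data.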

The nearest neighbour and regression methods also have multiple imputation variants, which repeat the imputation several times and aggregate the resulting predictions by majority vote.
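The aggregation step can be sketched as follows for binary labels. The function name `majority_vote` and the tie-breaking rule (ties go to class 0) are assumptions for this example; each row of `predictions` is taken to be the output of a classifier trained on one imputed copy of the data.

```python
import numpy as np


def majority_vote(predictions):
    """Combine binary predictions from several imputed datasets.

    predictions: array-like of shape (n_imputations, n_samples) with
    labels in {0, 1}. Ties (possible when n_imputations is even) go to
    class 0 in this sketch.
    """
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) > 0.5).astype(int)
```

Using an odd number of imputations avoids ties altogether.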

3. Prediction

After imputation, an LDA classifier is fit on the training data and evaluated on the test set.

Most imputation methods produce a single imputed matrix, so scikit-learn's standard LinearDiscriminantAnalysis is used. The exception is conditional mean: the class of a test sample is unknown, so each test sample is imputed twice (once assuming class 0, once assuming class 1). Prediction therefore requires a custom BinaryLDA classifier that scores each class against its own imputed feature matrix.
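To make the two-hypothesis idea concrete, here is a minimal NumPy sketch. It is not the repository's BinaryLDA: the function names are hypothetical, and it assumes that each class is scored with the full Gaussian log-density (shared covariance) evaluated on the copy of the test data imputed under that class. The full quadratic form is used because the two copies of a sample differ, so the usual simplification to a linear score does not apply directly.

```python
import numpy as np


def fit_shared_gaussian(X, y):
    """Estimate class means, pooled precision matrix, and class priors.

    X is assumed complete (already imputed); y holds integer labels 0/1.
    """
    means = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    centered = X - means[y]
    cov = centered.T @ centered / (len(X) - 2)  # pooled covariance
    priors = np.array([(y == k).mean() for k in (0, 1)])
    return means, np.linalg.inv(cov), priors


def class_score(X, mean, precision, prior):
    """Gaussian log-density up to a constant shared by both classes."""
    diff = X - mean
    mahalanobis = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return -0.5 * mahalanobis + np.log(prior)


def predict_two_hypotheses(X_test, means, precision, priors):
    """Impute under each class hypothesis, score it, keep the best class."""
    scores = []
    for k in (0, 1):
        # Fill NaNs with class k's means: the "assume class k" copy.
        X_k = np.where(np.isnan(X_test), means[k], X_test)
        scores.append(class_score(X_k, means[k], precision, priors[k]))
    return np.argmax(np.stack(scores, axis=1), axis=1)
```

Under this scoring, a missing feature filled with the class mean contributes nothing to that class's distance term, so the decision rests on the observed features.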

The concept of conditional mean and the full experimental results are described in the PDF at the root of the repository.

Getting Started

Install

cd ImputationForLDA
uv sync

Run

uv run python main.py --dimensions 5 --cov_matrice normal --probs_missingness 0.1 --type_missingness MCAR

Test

uv run pytest

Authors

Mathieu Charbonnel

Acknowledgments

Robert J. Durrant, who supervised this scientific work.

Lohweg, Volker. (2013). Banknote authentication. UCI Machine Learning Repository.

About

Designed to generate graphs showing the accuracy of LDA classification on datasets imputed with different methods.
