# PCA Engine

A complete Principal Component Analysis (PCA) implementation from scratch using NumPy, with comprehensive testing and benchmarking against scikit-learn.
## Overview

This project implements PCA without relying on machine learning libraries such as scikit-learn. It includes:
- Pure NumPy implementation of PCA
- Comprehensive test suite (correctness, performance, integration)
- Performance benchmarking against scikit-learn
- Physics use case examples
## Project Structure

```
pca engine_/
├── src/
│   ├── __init__.py
│   └── pca.py                # Main PCAEngine implementation
├── tests/
│   ├── __init__.py
│   ├── conftest.py           # Shared pytest fixtures
│   ├── test_pca.py           # Basic unit tests
│   ├── test_correctness.py   # Mathematical correctness tests
│   ├── test_benchmark.py     # Performance comparison with scikit-learn
│   └── test_integration.py   # End-to-end integration tests
├── benchmarks/
│   └── run_benchmarks.py     # Standalone benchmark runner
├── data/                     # Dataset storage
├── requirements.txt
├── pytest.ini                # Pytest configuration
├── pyproject.toml            # Project metadata
└── README.md
```
## Installation

Clone the repository:

```bash
git clone <repository-url>
cd "pca engine_"
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/Mac
python -m venv .venv
source .venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

## Quick Start

```python
import numpy as np
from src.pca import PCAEngine

# Generate sample data
X = np.random.randn(100, 10)

# Initialize PCA with the desired number of components
pca = PCAEngine(n_components=3)

# Fit the model
pca.fit(X)

# Transform data to principal component space
X_transformed = pca.transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_transformed.shape}")
```

### Reconstructing Data

```python
# After fitting and transforming
X_transformed = pca.transform(X)

# Reconstruct original data from the principal components
X_reconstructed = np.dot(X_transformed, pca.components.T) + pca.mean

# Calculate the reconstruction error
reconstruction_error = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error: {reconstruction_error}")
```

## Running Tests

Run the full test suite:

```bash
python -m pytest tests/ -v
```

Run individual test files:

```bash
# Basic unit tests
python -m pytest tests/test_pca.py -v

# Mathematical correctness tests
python -m pytest tests/test_correctness.py -v

# Performance benchmarks
python -m pytest tests/test_benchmark.py -v -m benchmark

# Integration tests
python -m pytest tests/test_integration.py -v
```

Or select tests by marker:

```bash
# Run only benchmark tests
pytest -v -m benchmark

# Run only correctness tests
pytest -v -m correctness

# Run only integration tests
pytest -v -m integration

# Run physics-related tests
pytest -v -m physics
```

## Benchmarking

Run comprehensive benchmarks comparing with scikit-learn:
```bash
python benchmarks/run_benchmarks.py
```

This will test:
- Fit performance on various dataset sizes
- Transform performance
- Numerical accuracy comparison
- Scalability analysis
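The numerical-accuracy comparison can also be illustrated without scikit-learn: the covariance-eigendecomposition route used here should agree, up to floating-point error, with an SVD of the centered data, which is the approach scikit-learn uses internally. A minimal sketch (data shapes and variable names are illustrative, not part of the project API):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components_eig = eigvecs[:, order[:3]]

# Route 2: SVD of the centered data
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components_svd = Vt[:3].T

# Individual component vectors may differ in sign between the two routes,
# so compare the spanned subspaces via their projection matrices.
P_eig = components_eig @ components_eig.T
P_svd = components_svd @ components_svd.T
print("max abs difference:", np.abs(P_eig - P_svd).max())
```

For well-separated eigenvalues the projection matrices should match to near machine precision, which is the kind of agreement the benchmark suite checks for.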
## How It Works

The `PCAEngine` implements the four core steps of PCA:

```python
# 1. Center the data
self.mean = np.mean(X, axis=0)
X_centered = X - self.mean

# 2. Compute the covariance matrix
covariance_matrix = np.dot(X_centered.T, X_centered) / (n_samples - 1)

# 3. Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# 4. Sort by largest eigenvalues and keep the top components
idxs = np.argsort(eigenvalues)[::-1]
self.components = eigenvectors[:, idxs[0:self.n_components]]
```

## Physics Use Cases

PCA is valuable for physics data analysis. Potential applications:
- Multi-detector data reduction: Combine signals from multiple particle detectors
- Event classification: Distinguish signal from background in collision events
- Feature extraction: Extract relevant features from high-dimensional particle data
- Wavefunction analysis: Reduce dimensionality of quantum state representations
- Spectroscopy data: Analyze spectral lines from multiple sources
- Correlation studies: Identify correlated quantum observables
- Phase transition detection: Identify order parameters in phase transitions
- Molecular dynamics: Reduce degrees of freedom in molecular simulations
- Time series analysis: Extract dominant modes from temperature/pressure data
- Galaxy morphology: Classify galaxy shapes from multi-band images
- Stellar spectra: Analyze stellar composition from spectroscopic data
- Cosmic microwave background: Extract signals from CMB temperature maps
- Sensor data fusion: Combine data from multiple experimental sensors
- Noise reduction: Filter experimental noise while preserving signal
- Calibration: Identify systematic errors across measurement channels
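As a concrete illustration of the noise-reduction use case above, the following self-contained sketch (pure NumPy, synthetic data; all names are illustrative) projects noisy low-rank data onto its two dominant components and reconstructs it:

```python
import numpy as np

# Rank-2 "signal" observed through 30 channels, plus measurement noise
rng = np.random.default_rng(1)
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 30))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

# Fit PCA by eigendecomposition of the covariance matrix
mean = noisy.mean(axis=0)
Xc = noisy - mean
_, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(Xc) - 1))
top = eigvecs[:, ::-1][:, :2]          # two dominant components

# Project onto the dominant subspace and reconstruct
denoised = (Xc @ top) @ top.T + mean

err_noisy = np.mean((noisy - signal) ** 2)
err_denoised = np.mean((denoised - signal) ** 2)
print(f"MSE before: {err_noisy:.4f}, after: {err_denoised:.4f}")
```

Because the top components capture essentially all of the signal variance while discarding most of the noise, the reconstruction error against the true signal drops substantially.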
### Example: Multi-Detector Particle Data

```python
import numpy as np
from src.pca import PCAEngine

# Simulate multi-detector particle physics data:
# 1000 events, 50 detector channels
n_events = 1000
n_detectors = 50

# Generate correlated detector signals (simulated particle tracks)
true_track = np.random.randn(n_events, 3)            # 3 true parameters
detector_response = np.random.randn(3, n_detectors)  # Response matrix
detector_data = np.dot(true_track, detector_response)
detector_data += np.random.randn(n_events, n_detectors) * 0.5  # Add noise

# Apply PCA to reduce dimensionality
pca = PCAEngine(n_components=3)
pca.fit(detector_data)
reduced_data = pca.transform(detector_data)

# Analyze the variance captured
variances = np.var(reduced_data, axis=0)
total_variance = np.sum(np.var(detector_data - pca.mean, axis=0))
explained_ratio = np.sum(variances) / total_variance

print(f"Variance explained by 3 components: {explained_ratio*100:.2f}%")
print(f"Original dimensions: {n_detectors}")
print(f"Reduced dimensions: {pca.n_components}")
```

## Correctness Guarantees

The implementation guarantees:
- ✓ Principal components are orthonormal
- ✓ Components ordered by decreasing eigenvalues
- ✓ Total variance preserved (for full PCA)
- ✓ Transformed data centered at origin
- ✓ Numerical stability with various data scales
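These properties can be spot-checked with a few lines of NumPy. The sketch below re-implements the eigh-based fit on random data (variable names are illustrative, not the project API) and asserts each guarantee directly:

```python
import numpy as np

# Minimal eigh-based PCA fit, mirroring the four steps above
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:5]]     # d x k matrix of top components
sorted_vals = eigvals[order]

# Orthonormality: the component columns form an orthonormal basis
assert np.allclose(components.T @ components, np.eye(5))

# Ordering: eigenvalues are non-increasing
assert np.all(np.diff(sorted_vals) <= 1e-12)

# Centering: transformed data has (near-)zero mean
Z = Xc @ components
assert np.allclose(Z.mean(axis=0), 0.0, atol=1e-10)

# Full PCA preserves total variance (sum of eigenvalues == total variance)
full = eigvecs[:, order]
assert np.isclose(eigvals.sum(), np.var(Xc @ full, axis=0, ddof=1).sum())

print("all guarantees hold")
```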
## Performance

Benchmarks show (typical results):
- Small datasets (100×10): ~0.5ms fit time
- Medium datasets (1000×50): ~15ms fit time
- Large datasets (5000×100): ~200ms fit time
- Numerical accuracy: <10⁻¹⁰ difference from scikit-learn
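Timings like these are machine-dependent, but they are easy to reproduce on your own hardware. The sketch below (illustrative, not part of the benchmark suite) times only the covariance-plus-eigendecomposition fit path on random data:

```python
import time
import numpy as np

def time_fit(n_samples, n_features, repeats=3):
    """Best-of-N timing of the covariance + eigendecomposition fit path."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n_samples, n_features))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / (n_samples - 1)
        np.linalg.eigh(cov)
        best = min(best, time.perf_counter() - t0)
    return best

for shape in [(100, 10), (1000, 50), (5000, 100)]:
    print(f"{shape[0]}x{shape[1]}: {time_fit(*shape) * 1e3:.2f} ms")
```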
## Contributing

Contributions are welcome! Areas for improvement:
- Additional physics use case examples
- Sparse PCA implementation
- Kernel PCA extension
- Incremental PCA for large datasets
- GPU acceleration
## License

MIT License

## References
- Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
- Shlens, J. (2014). A Tutorial on Principal Component Analysis. arXiv:1404.1100
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
## Contact

For questions or bug reports, please open an issue on the GitHub repository.