
PCA Engine

A complete Principal Component Analysis (PCA) implementation from scratch using NumPy, with comprehensive testing and benchmarking against scikit-learn.

Overview

This project implements PCA without relying on machine learning libraries like scikit-learn. It includes:

  • Pure NumPy implementation of PCA
  • Comprehensive test suite (correctness, performance, integration)
  • Performance benchmarking against scikit-learn
  • Physics use case examples

Project Structure

pca engine_/
├── src/
│   ├── __init__.py
│   └── pca.py                  # Main PCAEngine implementation
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Shared pytest fixtures
│   ├── test_pca.py             # Basic unit tests
│   ├── test_correctness.py     # Mathematical correctness tests
│   ├── test_benchmark.py       # Performance comparison with scikit-learn
│   └── test_integration.py     # End-to-end integration tests
├── benchmarks/
│   └── run_benchmarks.py       # Standalone benchmark runner
├── data/                       # Dataset storage
├── requirements.txt
├── pytest.ini                  # Pytest configuration
├── pyproject.toml              # Project metadata
└── README.md

Installation

1. Clone the repository

git clone <repository-url>
cd "pca engine_"

2. Create and activate virtual environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/Mac
python -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

Usage

Basic Usage

import numpy as np
from src.pca import PCAEngine

# Generate sample data
X = np.random.randn(100, 10)

# Initialize PCA with desired number of components
pca = PCAEngine(n_components=3)

# Fit the model
pca.fit(X)

# Transform data to principal component space
X_transformed = pca.transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_transformed.shape}")

Advanced Usage: Data Reconstruction

# After fitting and transforming
X_transformed = pca.transform(X)

# Reconstruct original data from principal components
X_reconstructed = np.dot(X_transformed, pca.components.T) + pca.mean

# Calculate reconstruction error
reconstruction_error = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error: {reconstruction_error}")

Testing

Run All Tests

python -m pytest tests/ -v

Run Specific Test Categories

# Basic unit tests
python -m pytest tests/test_pca.py -v

# Mathematical correctness tests
python -m pytest tests/test_correctness.py -v

# Performance benchmarks
python -m pytest tests/test_benchmark.py -v -m benchmark

# Integration tests
python -m pytest tests/test_integration.py -v

Run Tests by Marker

# Run only benchmark tests
pytest -v -m benchmark

# Run only correctness tests
pytest -v -m correctness

# Run only integration tests
pytest -v -m integration

# Run physics-related tests
pytest -v -m physics
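
These markers are registered in pytest.ini; a sketch of what that registration might look like (the file shipped with the project may differ):

[pytest]
markers =
    benchmark: performance comparisons against scikit-learn
    correctness: mathematical correctness checks
    integration: end-to-end integration tests
    physics: physics use case tests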

Benchmarking

Run comprehensive benchmarks comparing the PCAEngine implementation with scikit-learn:

python benchmarks/run_benchmarks.py

This will test:

  • Fit performance on various dataset sizes
  • Transform performance
  • Numerical accuracy comparison
  • Scalability analysis
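
For reference, a minimal sketch of the kind of timing comparison the runner performs; the actual run_benchmarks.py may structure this differently:

import time
import numpy as np
from sklearn.decomposition import PCA
from src.pca import PCAEngine

X = np.random.randn(1000, 50)

# Time the NumPy implementation
start = time.perf_counter()
PCAEngine(n_components=5).fit(X)
engine_time = time.perf_counter() - start

# Time scikit-learn on the same data
start = time.perf_counter()
PCA(n_components=5).fit(X)
sklearn_time = time.perf_counter() - start

print(f"PCAEngine fit: {engine_time * 1000:.2f} ms")
print(f"scikit-learn fit: {sklearn_time * 1000:.2f} ms")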

How It Works

The PCAEngine implements the four core steps of PCA:

1. Mean Centering

self.mean = np.mean(X, axis=0)
X_centered = X - self.mean

2. Covariance Matrix Calculation

covariance_matrix = np.dot(X_centered.T, X_centered) / (n_samples - 1)

3. Eigen Decomposition

eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

4. Component Selection

# Sort by largest eigenvalues
idxs = np.argsort(eigenvalues)[::-1]
self.components = eigenvectors[:, idxs[0:self.n_components]]
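
A condensed sketch of how these steps combine into fit and transform, using the same attribute names (mean, components, n_components) relied on by the usage examples above; the actual src/pca.py may differ in detail:

import numpy as np

class PCAEngine:
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean = None
        self.components = None

    def fit(self, X):
        # 1. Mean centering
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean
        # 2. Covariance matrix
        n_samples = X.shape[0]
        covariance_matrix = np.dot(X_centered.T, X_centered) / (n_samples - 1)
        # 3. Eigen decomposition (eigh assumes a symmetric matrix and
        #    returns eigenvalues in ascending order)
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
        # 4. Keep the eigenvectors with the largest eigenvalues
        idxs = np.argsort(eigenvalues)[::-1]
        self.components = eigenvectors[:, idxs[:self.n_components]]
        return self

    def transform(self, X):
        # Project centered data onto the retained principal components
        return np.dot(X - self.mean, self.components)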

Physics Applications

PCA is valuable for physics data analysis. Potential applications include:

1. Particle Physics

  • Multi-detector data reduction: Combine signals from multiple particle detectors
  • Event classification: Distinguish signal from background in collision events
  • Feature extraction: Extract relevant features from high-dimensional particle data

2. Quantum Mechanics

  • Wavefunction analysis: Reduce dimensionality of quantum state representations
  • Spectroscopy data: Analyze spectral lines from multiple sources
  • Correlation studies: Identify correlated quantum observables

3. Thermodynamics & Statistical Physics

  • Phase transition detection: Identify order parameters in phase transitions
  • Molecular dynamics: Reduce degrees of freedom in molecular simulations
  • Time series analysis: Extract dominant modes from temperature/pressure data

4. Astrophysics

  • Galaxy morphology: Classify galaxy shapes from multi-band images
  • Stellar spectra: Analyze stellar composition from spectroscopic data
  • Cosmic microwave background: Extract signals from CMB temperature maps

5. Experimental Physics

  • Sensor data fusion: Combine data from multiple experimental sensors
  • Noise reduction: Filter experimental noise while preserving signal
  • Calibration: Identify systematic errors across measurement channels

Example: Physics Use Case

import numpy as np
from src.pca import PCAEngine

# Simulate multi-detector particle physics data
# 1000 events, 50 detector channels
n_events = 1000
n_detectors = 50

# Generate correlated detector signals (simulated particle tracks)
true_track = np.random.randn(n_events, 3)  # 3 true parameters
detector_response = np.random.randn(3, n_detectors)  # Response matrix
detector_data = np.dot(true_track, detector_response)
detector_data += np.random.randn(n_events, n_detectors) * 0.5  # Add noise

# Apply PCA to reduce dimensionality
pca = PCAEngine(n_components=3)
pca.fit(detector_data)
reduced_data = pca.transform(detector_data)

# Analyze variance captured
variances = np.var(reduced_data, axis=0)
total_variance = np.sum(np.var(detector_data - pca.mean, axis=0))
explained_ratio = np.sum(variances) / total_variance

print(f"Variance explained by 3 components: {explained_ratio*100:.2f}%")
print(f"Original dimensions: {n_detectors}")
print(f"Reduced dimensions: {pca.n_components}")

Mathematical Properties

The implementation guarantees:

  • ✓ Principal components are orthonormal
  • ✓ Components ordered by decreasing eigenvalues
  • ✓ Total variance preserved (for full PCA)
  • ✓ Transformed data centered at origin
  • ✓ Numerical stability with various data scales
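
A short sketch of how these properties can be checked directly (the actual assertions in tests/test_correctness.py may differ):

import numpy as np
from src.pca import PCAEngine

X = np.random.randn(200, 10)
pca = PCAEngine(n_components=5)
pca.fit(X)
X_t = pca.transform(X)

# Orthonormality: components^T components should equal the identity
assert np.allclose(pca.components.T @ pca.components, np.eye(5), atol=1e-10)

# Transformed data is centered at the origin
assert np.allclose(X_t.mean(axis=0), 0.0, atol=1e-10)

# Component variances appear in decreasing order
variances = np.var(X_t, axis=0)
assert np.all(np.diff(variances) <= 1e-12)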

Performance

Typical benchmark results:

  • Small datasets (100×10): ~0.5ms fit time
  • Medium datasets (1000×50): ~15ms fit time
  • Large datasets (5000×100): ~200ms fit time
  • Numerical accuracy: <10⁻¹⁰ difference from scikit-learn

Contributing

Contributions are welcome! Areas for improvement:

  • Additional physics use case examples
  • Sparse PCA implementation
  • Kernel PCA extension
  • Incremental PCA for large datasets
  • GPU acceleration

License

MIT License

References

  1. Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
  2. Shlens, J. (2014). A Tutorial on Principal Component Analysis. arXiv:1404.1100
  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Contact

For questions or issues, please open an issue on the GitHub repository.