Skip to content

jualancilor/pca-engine

Repository files navigation

PCA Engine

A complete Principal Component Analysis (PCA) implementation from scratch using NumPy, with comprehensive testing and benchmarking against scikit-learn.

Overview

This project implements PCA without relying on machine learning libraries like scikit-learn. It includes:

  • Pure NumPy implementation of PCA
  • Comprehensive test suite (correctness, performance, integration)
  • Performance benchmarking against scikit-learn
  • Physics use case examples

Project Structure

pca engine_/
├── src/
│   ├── __init__.py
│   └── pca.py                  # Main PCAEngine implementation
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Shared pytest fixtures
│   ├── test_pca.py             # Basic unit tests
│   ├── test_correctness.py     # Mathematical correctness tests
│   ├── test_benchmark.py       # Performance comparison with scikit-learn
│   └── test_integration.py     # End-to-end integration tests
├── benchmarks/
│   └── run_benchmarks.py       # Standalone benchmark runner
├── data/                       # Dataset storage
├── requirements.txt
├── pytest.ini                  # Pytest configuration
├── pyproject.toml             # Project metadata
└── README.md

Installation

1. Clone the repository

git clone <repository-url>
cd "pca engine_"

2. Create and activate virtual environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/Mac
python -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

Usage

Basic Usage

import numpy as np
from src.pca import PCAEngine

# Generate sample data
X = np.random.randn(100, 10)

# Initialize PCA with desired number of components
pca = PCAEngine(n_components=3)

# Fit the model
pca.fit(X)

# Transform data to principal component space
X_transformed = pca.transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_transformed.shape}")

Advanced Usage: Data Reconstruction

# After fitting and transforming
X_transformed = pca.transform(X)

# Reconstruct original data from principal components
X_reconstructed = np.dot(X_transformed, pca.components.T) + pca.mean

# Calculate reconstruction error
reconstruction_error = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error: {reconstruction_error}")

Testing

Run All Tests

python -m pytest tests/ -v

Run Specific Test Categories

# Basic unit tests
python -m pytest tests/test_pca.py -v

# Mathematical correctness tests
python -m pytest tests/test_correctness.py -v

# Performance benchmarks
python -m pytest tests/test_benchmark.py -v -m benchmark

# Integration tests
python -m pytest tests/test_integration.py -v

Run Tests by Marker

# Run only benchmark tests
pytest -v -m benchmark

# Run only correctness tests
pytest -v -m correctness

# Run only integration tests
pytest -v -m integration

# Run physics-related tests
pytest -v -m physics

Benchmarking

Run comprehensive benchmarks comparing with scikit-learn:

python benchmarks/run_benchmarks.py

This will test:

  • Fit performance on various dataset sizes
  • Transform performance
  • Numerical accuracy comparison
  • Scalability analysis

How It Works

The PCAEngine implements the four core steps of PCA:

1. Mean Centering

self.mean = np.mean(X, axis=0)
X_centered = X - self.mean

2. Covariance Matrix Calculation

covariance_matrix = np.dot(X_centered.T, X_centered) / (n_samples - 1)

3. Eigen Decomposition

eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

4. Component Selection

# Sort by largest eigenvalues
idxs = np.argsort(eigenvalues)[::-1]
self.components = eigenvectors[:, idxs[0:self.n_components]]

Physics Applications

PCA is valuable for physics data analysis. See potential applications:

1. Particle Physics

  • Multi-detector data reduction: Combine signals from multiple particle detectors
  • Event classification: Distinguish signal from background in collision events
  • Feature extraction: Extract relevant features from high-dimensional particle data

2. Quantum Mechanics

  • Wavefunction analysis: Reduce dimensionality of quantum state representations
  • Spectroscopy data: Analyze spectral lines from multiple sources
  • Correlation studies: Identify correlated quantum observables

3. Thermodynamics & Statistical Physics

  • Phase transition detection: Identify order parameters in phase transitions
  • Molecular dynamics: Reduce degrees of freedom in molecular simulations
  • Time series analysis: Extract dominant modes from temperature/pressure data

4. Astrophysics

  • Galaxy morphology: Classify galaxy shapes from multi-band images
  • Stellar spectra: Analyze stellar composition from spectroscopic data
  • Cosmic microwave background: Extract signals from CMB temperature maps

5. Experimental Physics

  • Sensor data fusion: Combine data from multiple experimental sensors
  • Noise reduction: Filter experimental noise while preserving signal
  • Calibration: Identify systematic errors across measurement channels

Example: Physics Use Case

import numpy as np
from src.pca import PCAEngine

# Simulate multi-detector particle physics data
# 1000 events, 50 detector channels
n_events = 1000
n_detectors = 50

# Generate correlated detector signals (simulated particle tracks)
true_track = np.random.randn(n_events, 3)  # 3 true parameters
detector_response = np.random.randn(3, n_detectors)  # Response matrix
detector_data = np.dot(true_track, detector_response)
detector_data += np.random.randn(n_events, n_detectors) * 0.5  # Add noise

# Apply PCA to reduce dimensionality
pca = PCAEngine(n_components=3)
pca.fit(detector_data)
reduced_data = pca.transform(detector_data)

# Analyze variance captured
variances = np.var(reduced_data, axis=0)
total_variance = np.sum(np.var(detector_data - pca.mean, axis=0))
explained_ratio = np.sum(variances) / total_variance

print(f"Variance explained by 3 components: {explained_ratio*100:.2f}%")
print(f"Original dimensions: {n_detectors}")
print(f"Reduced dimensions: {pca.n_components}")

Mathematical Properties

The implementation guarantees:

  • ✓ Principal components are orthonormal
  • ✓ Components ordered by decreasing eigenvalues
  • ✓ Total variance preserved (for full PCA)
  • ✓ Transformed data centered at origin
  • ✓ Numerical stability with various data scales

Performance

Benchmarks show (typical results):

  • Small datasets (100×10): ~0.5ms fit time
  • Medium datasets (1000×50): ~15ms fit time
  • Large datasets (5000×100): ~200ms fit time
  • Numerical accuracy: <10⁻¹⁰ difference from scikit-learn

Contributing

Contributions are welcome! Areas for improvement:

  • Additional physics use case examples
  • Sparse PCA implementation
  • Kernel PCA extension
  • Incremental PCA for large datasets
  • GPU acceleration

License

MIT License

References

  1. Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
  2. Shlens, J. (2014). A Tutorial on Principal Component Analysis. arXiv:1404.1100
  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Contact

For questions or issues, please open an issue on the GitHub repository.

About

olah metode reduksi dimensi to uncover hidden information that could be uncover

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages