A production-ready fraud detection system built with machine learning for real-time transaction analysis. This project implements a complete end-to-end pipeline from data analysis to model deployment via REST API and web dashboard.
This fraud detection system processes credit card transactions in real time to flag potentially fraudulent activity. It uses LightGBM as the primary classifier, which reaches 89.47% precision and 68.00% recall at a 0.01% false positive rate.
- Real-time fraud prediction via REST API
- Comprehensive feature engineering (67 derived features from 15 base fields)
- Multiple ML models trained and evaluated (LightGBM, XGBoost, Random Forest, Logistic Regression)
- Interactive web dashboard for transaction analysis
- Sub-20ms prediction latency
- Automated model training and evaluation pipeline
Backend:
- Python 3.12
- Flask (REST API)
- LightGBM (primary model)
- scikit-learn (preprocessing)
- pandas, numpy (data processing)
Frontend:
- Streamlit (web dashboard)
- Plotly (visualizations)
Data:
- IBM Synthetic Credit Card Transactions Dataset
- 24M+ transactions spanning 1999-2020
- 0.13% fraud rate (highly imbalanced dataset)
Fraud-Detection-Project/
├── app/                      # Flask API backend
│   ├── __init__.py           # API initialization
│   ├── routes.py             # REST endpoints
│   ├── model_loader.py       # Model caching and loading
│   └── feature_service.py    # Real-time feature computation
├── frontend/                 # Streamlit dashboard
│   ├── app.py                # Main dashboard application
│   ├── pages/                # Multi-page app sections
│   └── components/           # Reusable UI components
├── models/                   # Trained model files (gitignored)
├── detection_data/           # Dataset files (gitignored)
├── data_analysis.py          # Exploratory data analysis
├── feature_engineering.py    # Feature creation pipeline
├── data_preprocessing.py     # Data preparation and SMOTE
├── model_training.py         # Model training and evaluation
├── run_api.py                # API server entry point
└── test_api.py               # API integration tests
- Python 3.11 or higher
- 4GB+ RAM (for model training)
- 3-4GB free disk space
- Clone the repository
git clone https://github.com/yourusername/Fraud-Detection-Project.git
cd Fraud-Detection-Project
- Install dependencies
pip install -r requirements.txt
- Download the dataset
The project uses the IBM Credit Card Transactions dataset from Kaggle. Download these files and place them in detection_data/:
- credit_card_transactions-ibm_v2.csv
- sd254_cards.csv
- sd254_users.csv
See DATA_SETUP.md for detailed instructions.
- Train the models
# Run feature engineering
python feature_engineering.py
# Train and evaluate models
python model_training.py
- Start the API
python run_api.py
The API will start on http://localhost:5000
- Launch the dashboard
streamlit run frontend/app.py
Access the dashboard at http://localhost:8501
import requests
response = requests.post('http://localhost:5000/api/predict', json={
    'Amount': 150.75,
    'Merchant State': 'CA',
    'Merchant City': 'San Francisco',
    'MCC': 5411,
    'Use Chip': 'Chip Transaction'
})
result = response.json()
print(f"Fraud: {result['is_fraud']}")
print(f"Probability: {result['fraud_probability']:.2%}")
- Precision: 89.47%
- Recall: 68.00%
- F1-Score: 77.27%
- False Positive Rate: 0.01% (2 out of 19,975)
- AUC-ROC: 0.9984
| Model | Precision | Recall | F1-Score | FPR |
|---|---|---|---|---|
| LightGBM | 89.47% | 68.00% | 77.27% | 0.01% |
| XGBoost | 81.82% | 72.00% | 76.60% | 0.02% |
| Random Forest | 22.08% | 68.00% | 33.33% | 0.30% |
| Logistic Regression | 9.71% | 80.00% | 17.32% | 0.93% |
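The LightGBM figures above are consistent with a single confusion matrix, and recomputing them is a quick sanity check. The sketch below does that. Only the 2 false positives out of 19,975 legitimate transactions are stated in the results; the 17 true positives and 8 false negatives are inferred from the reported precision and recall and should be read as an assumption, as should the function name.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# fp=2 and fp+tn=19,975 come from the reported results.
# tp=17 and fn=8 are inferred from 89.47% precision and 68.00% recall (assumption).
lightgbm = classification_metrics(tp=17, fp=2, fn=8, tn=19_973)
for name, value in lightgbm.items():
    print(f"{name}: {value:.2%}")
```

With a fraud rate this low, accuracy is not informative, which is why the comparison table reports precision, recall, and false positive rate instead.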
The system creates 67 features from the base transaction data:
- Temporal Features (8): Hour, day of week, month, cyclical encodings
- Amount Features (6): Deviations from user/merchant norms
- Merchant Features (8): Risk scores, transaction counts
- Geographic Features (6): State/city risk indicators
- Card Features (9): Card type, chip usage patterns
- User Behavioral Features (11): Spending patterns, account age
- Velocity Features (5): Transaction frequency, time gaps
- Risk Scores (5): Composite risk indicators
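To make two of these categories concrete, here is a minimal sketch of a cyclical hour encoding and an amount-deviation feature. The function names and the choice of a z-score as the deviation measure are assumptions for illustration; the project's `feature_engineering.py` may define these features differently.

```python
import math
from statistics import mean, pstdev


def cyclical_hour(hour: int) -> tuple[float, float]:
    """Map an hour (0-23) onto the unit circle so 23:00 and 00:00 end up close together."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def amount_zscore(amount: float, history: list[float]) -> float:
    """Deviation of an amount from a user's past amounts, in standard deviations."""
    if len(history) < 2:
        return 0.0  # not enough history to estimate a norm
    spread = pstdev(history)
    if spread == 0:
        return 0.0  # all past amounts identical; avoid dividing by zero
    return (amount - mean(history)) / spread


hour_sin, hour_cos = cyclical_hour(23)
deviation = amount_zscore(150.75, [20.0, 25.0, 30.0])
print(hour_sin, hour_cos, deviation)
```

The sine/cosine pair avoids the artificial jump between hour 23 and hour 0 that a raw integer hour would introduce.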
| Endpoint | Method | Description |
|---|---|---|
| `/api/health` | GET | Health check and model status |
| `/api/predict` | POST | Single transaction prediction |
| `/api/predict/batch` | POST | Batch transaction predictions |
| `/api/model/info` | GET | Model metadata and version |
| `/api/statistics` | GET | API usage statistics |
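Before sending predictions it can help to confirm the API is up. The sketch below calls the `/api/health` endpoint using only the standard library. The helper name `check_health` is hypothetical, and the shape of the JSON response is not documented here, so the function simply returns whatever the server sends.

```python
import json
import urllib.error
import urllib.request


def check_health(base_url: str = "http://localhost:5000", timeout: float = 2.0):
    """Return the parsed JSON body of GET /api/health, or None if the API is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


if __name__ == "__main__":
    status = check_health()
    print("API unreachable" if status is None else status)
```

The same check could be written with `requests`, which the usage example already relies on; `urllib` is used here only to keep the snippet dependency-free.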
Full API documentation: API_DOCUMENTATION.md
Run the test suite:
python test_api.py
Expected output: 5/5 tests passing
Started with comprehensive EDA to understand the dataset characteristics:
- Analyzed temporal patterns (fraud peaks during specific hours)
- Identified high-risk merchant categories
- Studied geographic distribution of fraud
- Examined transaction amount distributions
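As an illustration of the temporal analysis, the following sketch computes the fraud share per hour of day. The function name is hypothetical and the sample rows are made-up toy data, not values from the dataset; the project's `data_analysis.py` presumably works on the full data with pandas.

```python
from collections import defaultdict


def fraud_rate_by_hour(transactions):
    """Return {hour: fraud share} for an iterable of (hour, is_fraud) pairs."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for hour, is_fraud in transactions:
        totals[hour] += 1
        frauds[hour] += int(is_fraud)
    return {hour: frauds[hour] / totals[hour] for hour in sorted(totals)}


# Toy data for illustration only; not drawn from the dataset.
sample = [(2, True), (2, False), (14, False), (14, False), (14, False), (23, True)]
rates = fraud_rate_by_hour(sample)
riskiest = max(rates, key=rates.get)
print(rates, riskiest)
```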
Created features that capture:
- User spending behavior deviations
- Merchant risk patterns
- Temporal anomalies
- Transaction velocity (frequency of transactions in time windows)
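Transaction velocity is the least obvious of these, so here is a minimal sketch of how it could be computed for one card. The one-hour window, the function name, and the output field names are assumptions, not the project's actual definitions. Only earlier transactions are counted, so a feature never uses information from the future.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def velocity_features(timestamps: list[datetime], window: timedelta = timedelta(hours=1)):
    """For each transaction, count earlier transactions inside the look-back
    window and measure the gap to the previous transaction in seconds."""
    ordered = sorted(timestamps)
    features = []
    for i, ts in enumerate(ordered):
        # Index of the first earlier transaction that is still inside the window.
        first_in_window = bisect_left(ordered, ts - window, 0, i)
        gap = (ts - ordered[i - 1]).total_seconds() if i > 0 else None
        features.append({
            "timestamp": ts,
            "count_in_window": i - first_in_window,
            "seconds_since_prev": gap,
        })
    return features


day = datetime(2020, 1, 1)
card_history = [day.replace(hour=10, minute=m) for m in (0, 20, 50)] + [day.replace(hour=12, minute=30)]
for row in velocity_features(card_history):
    print(row)
```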
Evaluated multiple algorithms to find the best balance between precision and recall:
- Handled severe class imbalance (793:1) using SMOTE
- Trained baseline models for comparison
- Selected LightGBM because it had the lowest false positive rate (0.01%) and the highest precision of the four models
- Saved best model for production deployment
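SMOTE balances the classes by creating synthetic minority samples along the line segments between a fraud example and its nearest fraud neighbours. The source does not say which SMOTE implementation the project uses, so the block below is a from-scratch sketch of the core idea with hypothetical names, not the project's code. Oversampling is normally applied to the training split only, so that evaluation still reflects the real class ratio.

```python
import math
import random


def smote(minority: list[list[float]], n_synthetic: int, k: int = 5, seed: int = 0):
    """Generate n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest neighbours of the base point among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to target
        synthetic.append([b + t * (n - b) for b, n in zip(base, target)])
    return synthetic


# Toy two-feature fraud samples, for illustration only.
frauds = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(frauds, n_synthetic=10, k=2)
print(len(new_points))
```

This brute-force neighbour search is quadratic in the number of minority samples, which is fine for a sketch but not for 24M+ rows; a library implementation would be the practical choice.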
Built a production-ready API with:
- Lazy model loading for faster startup
- Feature computation on-the-fly
- Caching for improved performance
- Comprehensive error handling
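Lazy loading with caching can be sketched with `functools.lru_cache`: the model is read on the first request rather than at import time, and every later call reuses the same object. Everything below is a hypothetical stand-in. The loader returns a placeholder instead of deserialising a real model, the path `models/model.pkl` is invented, and the actual `model_loader.py` may work differently.

```python
from functools import lru_cache

LOAD_COUNT = 0  # only here to show that the loader runs once


def _load_model_from_disk(path: str):
    """Stand-in for the real deserialisation step (hypothetical)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"path": path, "name": "placeholder-model"}


@lru_cache(maxsize=1)
def get_model(path: str = "models/model.pkl"):
    """Load the model on first use and return the cached object afterwards."""
    return _load_model_from_disk(path)


first = get_model()   # triggers the load
second = get_model()  # served from the cache
print(first is second, LOAD_COUNT)
```

The server starts quickly because nothing is loaded at import time; the trade-off is that the first prediction request pays the loading cost.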
- Implement model retraining pipeline with new data
- Add user authentication for API
- Deploy to cloud (AWS/Azure/GCP)
- Integrate monitoring and alerting
- Implement A/B testing framework for model versions
@dataset{credit_card_transactions_2021,
author = {Altman, E.},
title = {Credit Card Transactions},
year = {2021},
publisher = {Kaggle},
url = {https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions}
}
This project is available for educational and research purposes.
- Dataset provided by Kaggle (IBM Synthetic Financial Transactions)
- Built as part of fraud detection research and ML pipeline development