This repository contains a complete machine learning project for predicting house prices using linear regression. The project demonstrates a professional ML workflow including data exploration, preprocessing, model training, and evaluation.
Key Features:
- Clean, production-ready code structure
- Comprehensive exploratory data analysis (EDA)
- Multiple evaluation metrics (R², MAE, MSE, RMSE)
- Professional visualizations of model performance
- Modular, reusable Python code
Dataset: USA Housing Market
Source: Housing price data from various regions across the USA
Features:
| Feature | Description | Data Type |
|---|---|---|
| Avg. Area Income | Average household income in the area | Numeric |
| Avg. Area House Age | Average age of houses in the area (years) | Numeric |
| Avg. Area Number of Rooms | Average number of rooms per house | Numeric |
| Avg. Area Number of Bedrooms | Average number of bedrooms per house | Numeric |
| Area Population | Population of the area | Numeric |
| Price | House price (target variable) | Numeric |
Statistics:
- Total Samples: ~5,000 records
- Features: 5 numerical input features
- Target: House Price (continuous variable)
- Data Quality: No missing values
- Python 3.8+
- Pandas 1.x - Data manipulation and analysis
- NumPy 1.x - Numerical computations
- Scikit-Learn - Machine learning and model evaluation
- Matplotlib - Data visualization
- Seaborn - Statistical data visualization
- SciPy - Statistical analysis (Q-Q plots, etc.)
- Jupyter Notebook - Interactive analysis and experimentation
- Git - Version control
- Load dataset from CSV
- Examine data types, shape, and missing values
- Remove non-numeric features (Address)
- Calculate descriptive statistics
- Correlation Analysis: Identify relationships between features and target price
- Statistical Summary: Mean, standard deviation, percentiles for all features
- Visualization: Heatmap of feature correlations
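
The loading and exploration steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the DataFrame is synthetic (random values that only mimic the column names from the feature table), and in the real project the data would presumably come from `pd.read_csv("data/USA_Housing.csv")`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for data/USA_Housing.csv; all values are made up.
df = pd.DataFrame({
    "Avg. Area Income": rng.normal(68_000, 10_000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
    "Avg. Area Number of Rooms": rng.normal(7, 1, n),
    "Avg. Area Number of Bedrooms": rng.normal(4, 1, n),
    "Area Population": rng.normal(36_000, 10_000, n),
    "Address": [f"{i} Example St" for i in range(n)],
})
# Made-up price process, used only so the correlations are non-trivial.
df["Price"] = (
    20 * df["Avg. Area Income"]
    + 150_000 * df["Avg. Area House Age"]
    + rng.normal(0, 100_000, n)
)

# Examine shape, data types, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Remove the non-numeric Address column before computing statistics.
numeric_df = df.drop(columns=["Address"])

# Descriptive statistics and correlation of every feature with the target.
summary = numeric_df.describe()
corr = numeric_df.corr()
price_corr = corr["Price"].drop("Price").sort_values(ascending=False)
print(summary)
print(price_corr)

# A heatmap of `corr` could then be drawn with seaborn, for example:
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Sorting the `Price` column of the correlation matrix gives a quick ranking of features before any model is fitted.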
- Feature selection: Use the 5 numeric input features (the non-numeric Address column is dropped)
- No scaling required: OLS predictions are unaffected by rescaling the features (only the coefficient values change with the units)
- Train-test split: 70% training, 30% testing (using random_state=101 for reproducibility)
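
As a rough check of the two preprocessing claims above, the sketch below performs the 70/30 split with `random_state=101` and compares an OLS fit on raw features against one on standardized features. The data is synthetic and the variable names are illustrative assumptions, not taken from the project code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (stand-ins for income, house age, rooms).
X = np.column_stack([
    rng.normal(68_000, 10_000, 500),
    rng.normal(6, 1, 500),
    rng.normal(7, 1, 500),
])
y = 20 * X[:, 0] + 150_000 * X[:, 1] + 120_000 * X[:, 2] + rng.normal(0, 100_000, 500)

# 70% training, 30% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Fit once on raw features and once on standardized features.
raw_model = LinearRegression().fit(X_train, y_train)
scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)

raw_pred = raw_model.predict(X_test)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

# Predictions agree up to floating-point error; only the coefficients differ.
print(np.allclose(raw_pred, scaled_pred))
print(raw_model.coef_)
print(scaled_model.coef_)
```

Scaling still matters for regularized models such as Ridge or Lasso, where the penalty depends on the size of the coefficients.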
Algorithm: Linear Regression (Ordinary Least Squares - OLS)
Rationale:
- Simple yet interpretable model
- Linear relationship assumption fits the housing price problem well
- Provides clear feature coefficients for business insights
Model Parameters:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
- Fit linear regression model on training data (70% subset)
- Model learns optimal coefficients for each feature
- Training time: < 1 second on standard hardware
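
To make "learns optimal coefficients" concrete, here is a small sketch on synthetic data with known coefficients. It fits `LinearRegression` and cross-checks the result against a plain least-squares solve. The generating coefficients and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic training data generated from known coefficients.
true_coef = np.array([3.0, -2.0])
true_intercept = 5.0
X_train = rng.normal(size=(350, 2))
y_train = X_train @ true_coef + true_intercept + rng.normal(0, 0.1, 350)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted parameters live in coef_ (one per feature) and intercept_.
print(model.coef_, model.intercept_)

# Cross-check: ordinary least squares on a design matrix with a column of ones.
A = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print(beta)  # [intercept, coef_1, coef_2]
```

Both routes minimize the same sum of squared residuals, so they should agree up to numerical precision.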
The model's performance is assessed using:
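| Metric | Formula | Interpretation |
|---|---|---|
| R² Score | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Proportion of variance explained (at most 1, higher is better) |
| RMSE | $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ | Root mean squared error, in dollars |
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | Average absolute prediction error, in dollars |
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | Mean squared error, in squared dollars (penalizes large errors) |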
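
The four metrics can be computed with scikit-learn and verified by hand. The sketch below uses four made-up price pairs; it does not reproduce the project's `utilities.evaluate_model`, whose implementation is not shown here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_pred = np.array([320_000.0, 430_000.0, 500_000.0, 650_000.0])

# Library versions.
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Manual versions, following the standard definitions.
errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)
rmse_manual = np.sqrt(mse_manual)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:,.2f}")
print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R2:   {r2:.4f}")
```

Because RMSE squares the errors before averaging, the single 40,000 miss pulls it above the MAE.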
- Predicted vs Actual Plot: Scatter plot showing model predictions against ground truth
- Perfect Prediction Line: Reference line indicating perfect predictions
- Residual Plot: Analysis of prediction errors
- Residual Distribution: Histogram and Q-Q plot for normality assessment
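
A possible way to produce the four diagnostic plots listed above is sketched here with Matplotlib and SciPy. The `y_test` and `predictions` arrays are synthetic placeholders, and the output path in the final comment is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the real test targets and model predictions.
y_test = rng.normal(1_200_000, 350_000, 300)
predictions = y_test + rng.normal(0, 100_000, 300)
residuals = y_test - predictions

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Predicted vs actual, with the perfect-prediction reference line.
ax = axes[0, 0]
ax.scatter(y_test, predictions, alpha=0.5)
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
ax.plot(lims, lims, "r--", label="Perfect prediction")
ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.legend()

# Residuals against predictions; a flat band around zero is the ideal.
ax = axes[0, 1]
ax.scatter(predictions, residuals, alpha=0.5)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Predicted price")
ax.set_ylabel("Residual")

# Histogram of residuals.
axes[1, 0].hist(residuals, bins=30, edgecolor="black")
axes[1, 0].set_xlabel("Residual")

# Q-Q plot of residuals against a normal distribution.
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Q-Q plot of residuals")

fig.tight_layout()
# fig.savefig("results/visualizations/diagnostics.png")  # hypothetical output path
```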
Model Performance Metrics:
- R² Score: ~0.92 (explains 92% of price variance)
- MAE: ~$80,000-100,000 (average prediction error)
- RMSE: ~$100,000-130,000 (penalizes large errors)
Feature Importance (by coefficient magnitude):
- Avg. Area Income - Strongest positive correlation with price
- Area Population - Significant positive impact
- Avg. Area House Age - House age inversely affects price
- Avg. Area Number of Rooms - More rooms → higher price
- Avg. Area Number of Bedrooms - Moderate impact on price
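
One caveat when ranking features by coefficient magnitude: raw coefficients depend on each feature's units, so a per-dollar income coefficient and a per-room coefficient are not directly comparable. The sketch below, on synthetic data with made-up effects, shows how the ranking can flip once features are standardized. It illustrates the general point and says nothing about the project's actual coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 1000

# Two synthetic features on very different scales.
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68_000, 10_000, n),
    "Avg. Area Number of Rooms": rng.normal(7, 1, n),
})
# Made-up generating process: 20 per dollar of income, 120,000 per room.
y = (
    20 * X["Avg. Area Income"]
    + 120_000 * X["Avg. Area Number of Rooms"]
    + rng.normal(0, 50_000, n)
)

# Raw coefficients: price change per one unit of each feature.
raw = LinearRegression().fit(X, y)

# Standardized coefficients: price change per one standard deviation.
X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)

comparison = pd.DataFrame(
    {"raw_coef": raw.coef_, "standardized_coef": std.coef_},
    index=X.columns,
)
print(comparison)
```

In this toy setup the rooms coefficient is far larger in raw units, yet income moves the price more per standard deviation.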
- The linear model captures ~92% of price variation with just 5 features
- Residuals are relatively symmetric and centered around zero
- Model performs well for price predictions in the dataset's price range
- Model assumes linear relationships (reasonable assumption verified through EDA)
house-price-prediction/
├── README.md # Project documentation
├── house_price_prediction_clean.ipynb # Main interactive notebook
├── train.py # Training script
├── utilities.py # Helper functions
├── data/
│   └── USA_Housing.csv # Dataset file
└── results/
    ├── predictions.csv # Model predictions
    ├── metrics.json # Evaluation metrics
    └── visualizations/ # Generated plots
- Python 3.8 or higher
- pip package manager
git clone https://github.com/yourusername/house-price-prediction.git
cd house-price-prediction
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Required packages (requirements.txt):
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=0.24.0
matplotlib>=3.4.0
seaborn>=0.11.0
scipy>=1.7.0
jupyter>=1.0.0
jupyter notebook house_price_prediction_clean.ipynb
Then execute cells sequentially from top to bottom.
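from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import utilities
# Load data
df = utilities.load_data('data/USA_Housing.csv')
# Split features and target (assumes load_data returns a DataFrame with a Price column)
X = df.drop(columns=['Price', 'Address'], errors='ignore')
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Train the model and make predictions
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Evaluate
metrics = utilities.evaluate_model(y_test, predictions)
utilities.print_evaluation_metrics(metrics)
- Linear Relationship: Features have linear relationship with target price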
- Independence: Data samples are independent observations
- Homoscedasticity: Constant variance of residuals across prediction range
- Normality: Residuals approximately follow normal distribution
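
Some of the assumptions above can be probed numerically as well as visually. The following is a rough sketch on synthetic data, not a full diagnostic suite: it uses a Shapiro-Wilk test for normality, a rank correlation between absolute residuals and fitted values as a crude heteroscedasticity signal, and a hand-computed Durbin-Watson statistic, which is only meaningful when the rows have a natural order.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Synthetic data that satisfies the assumptions by construction.
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 500)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Normality: a small p-value would suggest non-normal residuals.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homoscedasticity (rough): |residuals| should not trend with fitted values.
spearman_rho, spearman_p = stats.spearmanr(np.abs(residuals), fitted)

# Independence (rough): Durbin-Watson is near 2 without serial correlation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Spearman rho (|resid| vs fitted): {spearman_rho:.3f}")
print(f"Durbin-Watson: {durbin_watson:.3f}")
```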
- Dataset Scope: Model trained on USA housing data (may not generalize to other markets)
- Feature Completeness: Missing features like location coordinates, property type, material quality
- External Factors: Does not account for market trends, interest rates, or economic conditions
- Outliers: Model performance may degrade with unusual/extreme properties
- Time Invariance: Assumes no temporal changes in market dynamics
✓ Quick price estimation for average residential properties
✓ Portfolio projects and academic demonstrations
✓ Understanding linear regression fundamentals
✗ Production deployment without additional validation
✗ Precise real estate valuation for transactions
✗ Markets significantly different from USA housing
- Add polynomial features to capture non-linear relationships
- Implement feature scaling and standardization
- Test alternative models (Ridge, Lasso, Random Forest, Gradient Boosting)
- Cross-validation for more robust performance estimates
- Handle outliers using robust regression techniques
- Add geographic features (latitude, longitude) if available
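
Several of these ideas (feature scaling, Ridge and Lasso, cross-validation) could be combined roughly as below. This is a sketch on synthetic data under assumed hyperparameters (`alpha=1.0`, `alpha=0.1`, 5 folds). None of it is part of the current project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)

# Synthetic regression problem with 5 features, one of them irrelevant.
X = rng.normal(size=(400, 5))
y = X @ np.array([4.0, 3.0, 2.0, 1.0, 0.0]) + rng.normal(0, 1.0, 400)

# Scaling sits inside each pipeline so it is re-fitted on every training fold.
candidates = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=101)
cv_scores = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    for name, estimator in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.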
This project follows professional Python standards:
- Type Hints: Function signatures include type annotations
- Documentation: Comprehensive docstrings for all functions
- Modularity: Reusable utility functions separated from main pipeline
- Error Handling: Graceful handling of missing files and invalid inputs
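
As an illustration of these conventions, a loader with type hints, a docstring, and explicit error handling might look like the sketch below. The function name `load_housing_csv` and its checks are hypothetical. This is not the project's `utilities.load_data`, whose signature is not shown in this README.

```python
import tempfile
from pathlib import Path
from typing import Union

import pandas as pd


def load_housing_csv(path: Union[str, Path]) -> pd.DataFrame:
    """Load a housing CSV file into a DataFrame (hypothetical helper).

    Args:
        path: Location of the CSV file.

    Returns:
        The loaded DataFrame.

    Raises:
        FileNotFoundError: If no file exists at ``path``.
        ValueError: If the file has no rows or lacks a ``Price`` column.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"Dataset has no rows: {csv_path}")
    if "Price" not in df.columns:
        raise ValueError("Expected a 'Price' column in the dataset")
    return df


# Usage: load a tiny temporary file, then show the error for a missing one.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("Area Population,Price\n23086.8,1059033.5\n")
    loaded = load_housing_csv(sample)
    try:
        load_housing_csv(Path(tmp) / "missing.csv")
    except FileNotFoundError as exc:
        missing_error = str(exc)

print(loaded)
print(missing_error)
```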
This project is provided for educational and portfolio purposes. Feel free to use and modify for your own learning.
- Dataset source: USA Housing Market data
- Built with scikit-learn, pandas, and matplotlib
- Inspired by real-world ML development practices
Last Updated: April 2025
Python Version: 3.8+
Scikit-Learn Version: 0.24+