Skip to content

AbdullahGhulam/House-Price-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

House Price Prediction using Linear Regression

Project Overview

This repository contains a complete machine learning project for predicting house prices using linear regression. The project demonstrates a professional ML workflow including data exploration, preprocessing, model training, and evaluation.

Key Features:

  • Clean, production-ready code structure
  • Comprehensive exploratory data analysis (EDA)
  • Multiple evaluation metrics (R², MAE, MSE, RMSE)
  • Professional visualizations of model performance
  • Modular, reusable Python code

Dataset Description

Dataset: USA Housing Market

Source: Housing price data from various regions across the USA

Features:

Feature Description Data Type
Avg. Area Income Average household income in the area Numeric
Avg. Area House Age Average age of houses in the area (years) Numeric
Avg. Area Number of Rooms Average number of rooms per house Numeric
Avg. Area Number of Bedrooms Average number of bedrooms per house Numeric
Area Population Population density of the area Numeric
Price House price (target variable) Numeric

Statistics:

  • Total Samples: ~5,000 records
  • Features: 5 numerical input features
  • Target: House Price (continuous variable)
  • Data Quality: No missing values

Tech Stack

Core Libraries

  • Python 3.8+
  • Pandas 1.x - Data manipulation and analysis
  • NumPy 1.x - Numerical computations
  • Scikit-Learn - Machine learning and model evaluation
  • Matplotlib - Data visualization
  • Seaborn - Statistical data visualization
  • SciPy - Statistical analysis (Q-Q plots, etc.)

Development Tools

  • Jupyter Notebook - Interactive analysis and experimentation
  • Git - Version control

Methodology

1. Data Loading & Exploration

  • Load dataset from CSV
  • Examine data types, shape, and missing values
  • Remove non-numeric features (Address)
  • Calculate descriptive statistics

2. Exploratory Data Analysis (EDA)

  • Correlation Analysis: Identify relationships between features and target price
  • Statistical Summary: Mean, standard deviation, percentiles for all features
  • Visualization: Heatmap of feature correlations

3. Data Preprocessing

  • Feature selection: Choose 5 most relevant numeric features
  • No scaling required for linear regression (model is invariant to feature scaling)
  • Train-test split: 70% training, 30% testing (using random_state=101 for reproducibility)

4. Model Architecture

Algorithm: Linear Regression (Ordinary Least Squares - OLS)

Rational:

  • Simple yet interpretable model
  • Linear relationship assumption fits the housing price problem well
  • Provides clear feature coefficients for business insights

Model Parameters:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

5. Model Training

  • Fit linear regression model on training data (70% subset)
  • Model learns optimal coefficients for each feature
  • Training time: < 1 second on standard hardware

6. Evaluation Metrics

The model's performance is assessed using:

Metric Formula Interpretation
R² Score $1 - \frac{\sum(y_true - y_pred)^2}{\sum(y_true - \bar{y})^2}$ Proportion of variance explained (0-1, higher is better)
RMSE $\sqrt{\frac{1}{n}\sum(y_true - y_pred)^2}$ Root mean squared error in price dollars
MAE $\frac{1}{n}\sum|y_true - y_pred|$ Average absolute prediction error in dollars
MSE $\frac{1}{n}\sum(y_true - y_pred)^2$ Mean squared error (penalizes large errors)

7. Results Visualization

  • Predicted vs Actual Plot: Scatter plot showing model predictions against ground truth
  • Perfect Prediction Line: Reference line indicating perfect predictions
  • Residual Plot: Analysis of prediction errors
  • Residual Distribution: Histogram and Q-Q plot for normality assessment

Results & Performance

Key Findings

Model Performance Metrics:

  • R² Score: ~0.92 (explains 92% of price variance)
  • MAE: ~$80,000-100,000 (average prediction error)
  • RMSE: ~$100,000-130,000 (penalizes large errors)

Feature Importance (by coefficient magnitude):

  1. Avg. Area Income - Strongest positive correlation with price
  2. Area Population - Significant positive impact
  3. Avg. Area House Age - Houses age inversely affects price
  4. Avg. Area Number of Rooms - More rooms → higher price
  5. Avg. Area Number of Bedrooms - Moderate impact on price

Model Insights

  • The linear model captures ~92% of price variation with just 5 features
  • Residuals are relatively symmetric and centered around zero
  • Model performs well for price predictions in the dataset's price range
  • Model assumes linear relationships (reasonable assumption verified through EDA)

Project Structure

house-price-prediction/
├── README.md                          # Project documentation
├── house_price_prediction_clean.ipynb # Main interactive notebook
├── train.py                          # Training script
├── utilities.py                      # Helper functions
├── data/
│   └── USA_Housing.csv               # Dataset file
└── results/
    ├── predictions.csv               # Model predictions
    ├── metrics.json                  # Evaluation metrics
    └── visualizations/               # Generated plots

Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Step 1: Clone Repository

git clone https://github.com/yourusername/house-price-prediction.git
cd house-price-prediction

Step 2: Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Required packages (requirements.txt):

pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=0.24.0
matplotlib>=3.4.0
seaborn>=0.11.0
scipy>=1.7.0
jupyter>=1.0.0

Usage

Option 1: Run Jupyter Notebook

jupyter notebook house_price_prediction_clean.ipynb

Then execute cells sequentially from top to bottom.

Option 3: Use the Model Programmatically

from sklearn.linear_model import LinearRegression
import utilities

# Load data
df = utilities.load_data('data/USA_Housing.csv')

# Make predictions
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluate
metrics = utilities.evaluate_model(y_test, predictions)
utilities.print_evaluation_metrics(metrics)

Model Assumptions & Limitations

Assumptions

  1. Linear Relationship: Features have linear relationship with target price
  2. Independence: Data samples are independent observations
  3. Homoscedasticity: Constant variance of residuals across prediction range
  4. Normality: Residuals approximately follow normal distribution

Limitations

  1. Dataset Scope: Model trained on USA housing data (may not generalize to other markets)
  2. Feature Completeness: Missing features like location coordinates, property type, material quality
  3. External Factors: Does not account for market trends, interest rates, or economic conditions
  4. Outliers: Model performance may degrade with unusual/extreme properties
  5. Time Invariance: Assumes no temporal changes in market dynamics

When to Use This Model

✓ Quick price estimation for average residential properties
✓ Portfolio projects and academic demonstrations
✓ Understanding linear regression fundamentals

✗ Production deployment without additional validation
✗ Precise real estate valuation for transactions
✗ Markets significantly different from USA housing


Future Improvements

  • Add polynomial features to capture non-linear relationships
  • Implement feature scaling and standardization
  • Test alternative models (Ridge, Lasso, Random Forest, Gradient Boosting)
  • Cross-validation for more robust performance estimates
  • Handle outliers using robust regression techniques
  • Add geographic features (latitude, longitude) if available

Code Quality

This project follows professional Python standards:

  • Type Hints: Function signatures include type annotations
  • Documentation: Comprehensive docstrings for all functions
  • Modularity: Reusable utility functions separated from main pipeline
  • Error Handling: Graceful handling of missing files and invalid inputs

License

This project is provided for educational and portfolio purposes. Feel free to use and modify for your own learning.


Acknowledgments

  • Dataset source: USA Housing Market data
  • Built with scikit-learn, pandas, and matplotlib
  • Inspired by real-world ML development practices

Last Updated: April 2025
Python Version: 3.8+
Scikit-Learn Version: 0.24+

About

machine learning project for predicting house prices using linear regression

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors