](https://github.com/natinew77-creator/Book-Recommendation-Engine-using-KNN/blob/main/README.md#results--performance)
This project is a robust Collaborative Filtering Recommendation System built to suggest books based on user rating patterns. By analyzing over 1.1 million ratings, the system identifies semantic similarities between books without needing to know the content (genre, author, etc.) of the books themselves.
Developed as part of the FreeCodeCamp Machine Learning with Python Certification, this solution demonstrates the practical application of unsupervised learning algorithms to solve real-world information retrieval problems.
Handling a dataset of this magnitude requires rigorous preprocessing to ensure model reliability.
- Statistical Filtering: Implemented logic to filter out noise by removing users with fewer than 200 ratings and books with fewer than 100 ratings. This ensures the model learns only from active users and well-established items.
- Data Integrity: Merged multiple CSV datasets (Books, Ratings, Users) while handling encoding issues (
ISO-8859-1) and missing values.
A common challenge in recommendation systems is the "sparsity" of the data (most users haven't read most books).
- Pivot Table Transformation: Converted the dataset into a 2D matrix (Index: Book Title, Columns: User ID, Values: Rating).
- CSR Implementation: Transformed the dense matrix into a SciPy Compressed Sparse Row (CSR) matrix. This drastically reduced memory usage and improved computational efficiency during model training.
- Algorithm: Utilized the K-Nearest Neighbors (KNN) algorithm (
sklearn.neighbors.NearestNeighbors). - Metric: Employed Cosine Similarity to measure the distance between book vectors. Unlike Euclidean distance, Cosine similarity focuses on the orientation of the vectors, making it ideal for rating data where the magnitude (number of ratings) might differ but the pattern is similar.
- Language: Python 3
- Machine Learning: Scikit-Learn (NearestNeighbors)
- Data Manipulation: Pandas, NumPy
- Scientific Computing: SciPy (Sparse Matrices)
- Visualization: Matplotlib
The model was validated against a strict testing suite to ensure accuracy.
Example Inference: When querying for the book "Where the Heart Is (Oprah's Book Club (Paperback))", the model successfully returns the following recommendations based on user affinity:
- I'll Be Seeing You (Distance: 0.80)
- The Weight of Water (Distance: 0.77)
- The Surgeon (Distance: 0.77)
- I Know This Much Is True (Distance: 0.77)
- The Lovely Bones: A Novel (Distance: 0.72)
Outcome: Passed all automated test cases with 100% accuracy.
You can view and execute the code directly in Google Colab, or run it locally.
- Clone the repository.
- Install dependencies:
pip install pandas numpy scikit-learn scipy matplotlib
- Run the Jupyter Notebook:
jupyter notebook Copy_of_fcc_book_recommendation_knn.ipynb
Natneal B.