
BitterOcean/MIND-Recommendation-System


📰 News Recommendation System


Content-Based, Collaborative & Hybrid Models on MIND Dataset

A comparative study of recommender system strategies for news articles using the MIND dataset, focusing on both accuracy and beyond-accuracy metrics.


📚 Table of Contents

  1. Introduction
  2. Problem Statement
  3. Approach
  4. Dataset & EDA
  5. Methods
  6. Evaluation Metrics
  7. Results
  8. Discussion
  9. Conclusion
  10. How to Run
  11. Future Work

1. Introduction

Recommender systems are a core component of modern digital platforms.

This project compares:

  • Content-Based Filtering (CBF)
  • Collaborative Filtering (CF)
  • Hybrid Models

on the MIND dataset, a widely used benchmark in news recommendation.


2. Problem Statement

Challenges

  • Data sparsity
  • Cold-start problem
  • Popularity bias
  • Trade-offs between accuracy, diversity, and novelty

Goal

Evaluate and compare recommender models across multiple metrics.


3. Approach

Models implemented:

  • Popularity Baseline
  • Content-Based (TF-IDF)
  • Collaborative Filtering (SVD)
  • Hybrid Model

Pipeline:

  1. EDA
  2. Model implementation
  3. Evaluation
  4. Comparison

4. Dataset & EDA

📊 Dataset: MIND

🔗 https://msnews.github.io/

  • ~1M users
  • ~160k articles
  • ~2.2M sessions

Key Insights

  • Long-tail distribution
  • Category imbalance
  • Sparse user interactions
  • Short textual features

👉 Implications:

  • Strong CBF performance
  • Weak CF performance on small data
  • Need for a hybrid model

5. Methods

Baseline

  • Popularity-based ranking
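
A minimal sketch of the popularity baseline, assuming it simply ranks articles by total click count (the item IDs and helper name here are illustrative, not the repo's actual API):

```python
from collections import Counter

def popularity_rank(click_histories):
    """Rank article IDs by total click count across all user histories."""
    counts = Counter()
    for history in click_histories:
        counts.update(history)
    return [item for item, _ in counts.most_common()]

ranking = popularity_rank([["N1", "N2"], ["N2", "N3"], ["N2"]])
print(ranking[0])  # "N2" leads with 3 clicks
```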

Content-Based (CBF)

  • TF-IDF on title, abstract, category
  • Cosine similarity
  • Recency-weighted user profile
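
The CBF steps above can be sketched with scikit-learn; the toy articles and the 0.5/1.0 recency weights are illustrative assumptions, not the repo's actual decay schedule:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "N1": "sports football match highlights",
    "N2": "election results and analysis",
    "N3": "football transfer news update",
}
ids = list(articles)
X = TfidfVectorizer().fit_transform(articles[i] for i in ids)

# Recency-weighted user profile: newer clicks contribute more.
clicked = ["N2", "N1"]            # oldest -> newest
weights = np.array([0.5, 1.0])    # illustrative decay
rows = [ids.index(i) for i in clicked]
profile = (weights[:, None] * X[rows].toarray()).sum(axis=0, keepdims=True)

# Score every article by cosine similarity to the user profile.
scores = cosine_similarity(profile, X.toarray())[0]
```

Unclicked candidates are then ranked by `scores` after masking the already-clicked articles.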

Collaborative Filtering (CF)

  • User-item matrix
  • Truncated SVD (32 dims)
  • Embedding-based similarity
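
A toy version of the factorization step (the project uses 32 dimensions; the 3x4 click matrix below only supports 2, and the matrix itself is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy user-item click matrix (3 users x 4 articles); 1.0 = clicked.
R = csr_matrix(np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]))

U, s, Vt = svds(R, k=2)   # truncated SVD
user_emb = U * s          # user embeddings
item_emb = Vt.T           # item embeddings

# Score all articles for user 0 by dot product in embedding space.
scores = user_emb[0] @ item_emb.T
```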

Hybrid

$$ S = \omega_{cf} \cdot \mathrm{norm}(S_{cf}) + \omega_{cbf} \cdot \mathrm{norm}(S_{cbf}) $$

Best weights:

  • CF = 0.1
  • CBF = 0.9
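
The weighted blend, with min-max scaling as the norm(·) step (an assumption; the repo may normalize differently):

```python
import numpy as np

def min_max_norm(s):
    """Scale scores into [0, 1] so CF and CBF scores are comparable."""
    s = np.asarray(s, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def hybrid_scores(s_cf, s_cbf, w_cf=0.1, w_cbf=0.9):
    """Combine normalized CF and CBF scores with the best-found weights."""
    return w_cf * min_max_norm(s_cf) + w_cbf * min_max_norm(s_cbf)

s = hybrid_scores([0.2, 0.8, 0.5], [0.9, 0.1, 0.4])
print(s.argmax())  # item 0 wins: CBF dominates at w_cbf = 0.9
```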

6. Evaluation Metrics

Accuracy

  • AUC
  • MRR
  • nDCG@5 / nDCG@10

Beyond-Accuracy

  • Novelty
  • Diversity
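
For a single ranked impression with binary click labels, MRR and nDCG@k can be sketched as below (standard definitions; not copied from the repo):

```python
import numpy as np

def mrr(ranked_labels):
    """Reciprocal rank (1-indexed) of the first clicked item."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_labels, k):
    """Normalized discounted cumulative gain over the top-k positions."""
    labels = np.asarray(ranked_labels, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, labels.size + 2))
    dcg = (labels[:k] * discounts[:k]).sum()
    idcg = (np.sort(labels)[::-1][:k] * discounts[:k]).sum()
    return dcg / idcg if idcg > 0 else 0.0
```

The reported numbers average these per-impression values over all evaluation sessions.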

7. Results

MIND Small

| Model    | AUC    | MRR    | nDCG@10 | Novelty |
|----------|--------|--------|---------|---------|
| Baseline | 0.5318 | 0.2671 | 0.3098  | 14.04   |
| CF       | 0.5429 | 0.2632 | 0.3088  | 14.58   |
| CBF      | 0.6073 | 0.3306 | 0.3718  | 16.03   |
| Hybrid   | 0.6126 | 0.3310 | 0.3731  | 15.80   |

MIND Large

| Model    | AUC    | MRR    | nDCG@10 |
|----------|--------|--------|---------|
| Baseline | 0.5385 | 0.2618 | 0.3079  |
| CF       | 0.5541 | 0.2795 | 0.3223  |
| CBF      | 0.6059 | 0.3300 | 0.3711  |
| Hybrid   | 0.6084 | 0.3272 | 0.3692  |

📌 The hybrid model achieves the best AUC on both splits; on MIND Large, CBF is slightly ahead on MRR and nDCG@10.


8. Discussion

  • CBF dominates in sparse data
  • CF improves with scale
  • Hybrid balances both

Trade-offs:

  • Accuracy vs diversity
  • Novelty vs popularity

9. Conclusion

  • CBF = strongest standalone
  • CF = data-dependent
  • Hybrid = best overall

10. How to Run

📥 1. Download Dataset

Download from:

👉 https://msnews.github.io/

Files needed:

  • MINDsmall_train.zip
  • MINDsmall_dev.zip

📂 2. Extract Dataset

```
data/
├── MINDsmall_train/
│   ├── behaviors.tsv
│   ├── news.tsv
│   ├── entity_embedding.vec
│   └── relation_embedding.vec
│
└── MINDsmall_dev/
    ├── behaviors.tsv
    ├── news.tsv
    ├── entity_embedding.vec
    └── relation_embedding.vec
```

File Descriptions

  • behaviors.tsv → user clicks & impressions
  • news.tsv → article metadata
  • entity_embedding.vec → entity embeddings
  • relation_embedding.vec → relation embeddings

⚠️ Only behaviors.tsv and news.tsv are used in this project.
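
Per the MIND documentation, behaviors.tsv is tab-separated with impression ID, user ID, time, click history, and impression columns, where each impression entry is "<news_id>-<label>". A minimal parse with pandas, using an in-memory sample line rather than the real file:

```python
import io
import pandas as pd

# One sample line in the behaviors.tsv layout.
sample = "1\tU100\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n"
behaviors = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None,
    names=["impression_id", "user_id", "time", "history", "impressions"],
)

# Split impressions into candidate IDs and click labels (1 = clicked).
pairs = [p.rsplit("-", 1) for p in behaviors.loc[0, "impressions"].split()]
candidates = [nid for nid, _ in pairs]
labels = [int(y) for _, y in pairs]
```

Swap `io.StringIO(sample)` for `"data/MINDsmall_train/behaviors.tsv"` to load the real file.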


📦 3. Install Dependencies

```shell
pip install -r requirements.txt
```

▶️ 4. Run

```shell
python run_all.py
```

⚙️ Optional: limit the number of evaluated sessions

```shell
python run_all.py --max-eval-sessions 5000
```

💾 Save Results

```shell
python run_all.py --json-output results.json
```

πŸ“ Project Structure

project-root/
β”œβ”€β”€ run_all.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ data/
β”œβ”€β”€ baseline/
β”œβ”€β”€ collaborative_filtering/
β”œβ”€β”€ content_based_filtering/
β”œβ”€β”€ hybrid_filtering/

11. Future Work

  • Neural CF
  • Transformer embeddings
  • Temporal modeling
  • Better hybrid strategies

