This project includes `fast_mf_pipeline.ipynb`, which implements the eALS method introduced in the original paper:
```bibtex
@inproceedings{he2016fast,
  title     = {Fast matrix factorization for online recommendation with implicit feedback},
  author    = {He, Xiangnan and Zhang, Hanwang and Kan, Min-Yen and Chua, Tat-Seng},
  booktitle = {SIGIR},
  pages     = {549--558},
  year      = {2016},
  publisher = {ACM},
  address   = {Pisa, Italy},
  doi       = {10.1145/2911451.2911489}
}
```

Dataset: same as the paper, `AmazonMoviesDataset.txt`.
- Format: one review record per block (10 lines), records separated by blank lines.
- Fields parsed: `product/productId`, `review/userId`, `review/time`.
- Go to the SNAP Amazon links page: https://snap.stanford.edu/data/web-Amazon-links.html
- Download the Amazon Movies dataset archive (`movies.txt.gz`) from that page.
- Place it in `data/` and extract it as `data/AmazonMoviesDataset.txt`.
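The block format above can be parsed with a short generator. This is a hypothetical sketch, not the project's actual parser (which lives in `src/data_ingest.py`); it keeps only the three fields the pipeline uses:

```python
def parse_blocks(lines):
    """Yield (product_id, user_id, timestamp) triples from the SNAP block
    format: 'key: value' lines, with a blank line separating records."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line ends the current record
            if record:
                yield (record.get("product/productId"),
                       record.get("review/userId"),
                       int(record.get("review/time", 0)))
            record = {}
            continue
        key, _, value = line.partition(": ")
        record[key] = value
    if record:  # final record may lack a trailing blank line
        yield (record.get("product/productId"),
               record.get("review/userId"),
               int(record.get("review/time", 0)))
```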
Example:

```sh
mkdir -p data
curl -L https://snap.stanford.edu/data/movies.txt.gz -o data/movies.txt.gz
gunzip -c data/movies.txt.gz > data/AmazonMoviesDataset.txt
rm -f data/movies.txt.gz
```

- eALS objective with the paper's Amazon settings: `K=128`, `lambda=0.01`, `c0=64`, `alpha=0.5`, `w_ui=1`, `r_ui=1`.
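The `c0` and `alpha` settings drive the paper's popularity-aware weighting of missing entries: item `i` gets a negative-feedback confidence proportional to its interaction frequency raised to `alpha`, normalized so the weights sum to `c0`. A minimal NumPy sketch under that reading:

```python
import numpy as np

def item_confidence(item_counts, c0=64.0, alpha=0.5):
    """Popularity-aware negative weights: c_i = c0 * f_i^alpha / sum_j f_j^alpha.
    Sketch only; settings match the paper's Amazon configuration."""
    f = np.asarray(item_counts, dtype=np.float64)
    w = f ** alpha
    return c0 * w / w.sum()  # weights sum to c0
```

Popular items thus receive larger weights as negatives, while the total amount of negative signal stays fixed at `c0`.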
- Iterative 10-core filtering (`>=10` interactions for both users and items).
- Chronological leave-one-out split (the latest item per user is held out as test).
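The filtering has to iterate because dropping a sparse user can push an item below the threshold, and vice versa. A self-contained sketch over `(user, item)` pairs (hypothetical helper, shown with the project's `k=10` default):

```python
from collections import Counter

def iterative_k_core(interactions, k=10):
    """Repeatedly drop interactions whose user or item has fewer than k
    interactions, until the surviving set is stable."""
    pairs = set(interactions)
    while True:
        u_cnt = Counter(u for u, _ in pairs)
        i_cnt = Counter(i for _, i in pairs)
        kept = {(u, i) for u, i in pairs if u_cnt[u] >= k and i_cnt[i] >= k}
        if kept == pairs:  # fixed point reached
            return pairs
        pairs = kept
```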
- Map-and-broadcast training design:
  - Interaction histories are RDDs.
  - `P` and `Q` are NumPy arrays on the driver; `S^q` and `S^p` are computed on the driver and broadcast each phase.
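As a sketch of the driver-side cache computation, assuming the usual eALS caches `S^p = Pᵀ P` and `S^q = Qᵀ diag(c) Q`, where `c` holds the per-item negative weights (the project's real computation is in `src/train_eals.py`):

```python
import numpy as np

def build_caches(P, Q, c):
    """Caches broadcast once per phase.
    P: (num_users, K) user factors; Q: (num_items, K) item factors;
    c: (num_items,) per-item negative weights."""
    Sp = P.T @ P                   # S^p = sum_u p_u p_u^T
    Sq = (Q * c[:, None]).T @ Q    # S^q = sum_i c_i q_i q_i^T
    return Sp, Sq
```

Both caches are `K x K`, so broadcasting them is cheap relative to shipping the full factor matrices every update.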
- Offline ranking evaluation: score all items, mask training items, compute `HR@100` and `NDCG@100`.
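With a single held-out item per user, `HR@k` is 1 exactly when that item lands in the top `k`, and `NDCG@k` reduces to `1 / log2(rank + 2)` for a 0-based rank. A per-user sketch (hypothetical helper; the project's version is in `src/evaluate.py`):

```python
import numpy as np

def hr_ndcg_at_k(scores, test_item, train_items, k=100):
    """Leave-one-out HR@k / NDCG@k for one user.
    scores: float array of predicted scores over all items."""
    s = scores.astype(np.float64).copy()
    s[list(train_items)] = -np.inf        # mask training items
    topk = np.argsort(-s)[:k]             # k highest-scored items
    hits = np.where(topk == test_item)[0]
    if hits.size == 0:
        return 0.0, 0.0
    rank = hits[0]                        # 0-based rank of the held-out item
    return 1.0, float(1.0 / np.log2(rank + 2.0))
```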
- `src/data_ingest.py`: parsing, filtering, indexing, split, history building, and data preparation.
- `src/train_eals.py`: Eq. 12/Eq. 13 coordinate updates and the training loop.
- `src/evaluate.py`: HR/NDCG evaluation.
- `fast_mf_pipeline.ipynb`: notebook with visible step-by-step execution.
Open and execute `fast_mf_pipeline.ipynb`.
Each run writes:
- `P.npy`, `Q.npy` (trained factor matrices)
- `metrics.json` (`hr`, `ndcg`, `evaluated_users`)
- `config.json` (eALS + runtime settings)
- `prepare_stats.json` (stage-level preprocessing counts and timings)
- `train_log.json` (iteration timing)
- `id_maps.json` (`user_ids`, `item_ids` ordered by internal index)
- `run_summary.md`
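A minimal sketch of consuming these artifacts after a run; the paths and JSON keys assume the layout listed above:

```python
import json
import os

import numpy as np

def load_run(run_dir="."):
    """Load one run's trained factors and metrics."""
    P = np.load(os.path.join(run_dir, "P.npy"))   # (num_users, K)
    Q = np.load(os.path.join(run_dir, "Q.npy"))   # (num_items, K)
    with open(os.path.join(run_dir, "metrics.json")) as f:
        metrics = json.load(f)                    # hr, ndcg, evaluated_users
    return P, Q, metrics
```

`P[u] @ Q.T` then gives the full score vector for user `u`, with internal indices resolvable back to raw IDs via `id_maps.json`.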